arXiv:1909.00749v1 [cs.IR] 2 Sep 2019

Know2Look: Commonsense Knowledge for Visual Search

Sreyasi Nag Chowdhury    Niket Tandon    Gerhard Weikum
Max Planck Institute for Informatics
Saarbrücken, Germany
sreyasi, ntandon, [email protected]

Abstract

With the rise in popularity of social media, images accompanied by contextual text form a huge section of the web. However, search and retrieval of documents still depend largely on textual cues alone. Although visual cues have started to gain focus, imperfections in object/scene detection do not lead to significantly improved results. We hypothesize that the use of background commonsense knowledge on query terms can significantly aid the retrieval of documents with associated images. To this end we deploy three different modalities - text, visual cues, and commonsense knowledge pertaining to the query - as a recipe for efficient search and retrieval.

1 Introduction

Motivation: Image retrieval by querying visual contents has been on the agenda of the database, information retrieval, multimedia, and computer vision communities for decades (Liu et al., 2007; Datta et al., 2008). Search engines like Baidu, Bing or Google perform reasonably well on this task, but crucially rely on textual cues that accompany an image: tags, caption, URL string, adjacent text, etc.

In recent years, deep learning has led to a boost in the quality of visual object recognition in images with fine-grained object labels (Simonyan and Zisserman, 2014; LeCun et al., 2015; Mordvintsev et al., 2015). Methods like LSDA (Hoffman et al., 2014) are trained on more than 15,000 classes of ImageNet (Deng et al., 2009) (which are mostly leaf-level synsets of WordNet (Miller, 1995)), and annotate newly seen images with class labels for bounding boxes of objects.

Figure 1: Example cases where visual object detection may or may not aid in search and retrieval. (a) Good object detection; detected visual objects: traffic light, car, person, bicycle, bus, car, grille, radiator grille. (b) Poor object detection; detected visual objects: tv or monitor, cargo door, piano.

For the image in Figure 1a, for example, the object labels traffic light, car, person, bicycle and bus have been recognized, making it easily retrievable for queries with these concepts. However, these labels come with uncertainty. For the image in Figure 1b, there is much higher noise in its visual object labels; so querying by visual labels would not work here.

Opportunity and Challenge: These limitations of text-based search on one hand, and visual-object search on the other, suggest combining the cues from text and vision for more effective retrieval. Although each side of this combined feature space is incomplete and noisy, the hope is that the combination can improve retrieval quality.



Figure 2: Sample queries containing abstract concepts and expected results of image retrieval. Sample queries: "environment friendly traffic", "downsides of mountaineering", "street-side soulful music".

Unfortunately, images that show more sophisticated scenes, or the emotions they evoke in the viewer, are still out of reach. Figure 2 shows three examples, along with query formulations that would likely consider these sample images as relevant results. These answers would best be retrieved by queries with abstract words (e.g. "environment friendly") or activity words (e.g. "traffic") rather than words that directly correspond to visual objects (e.g. "car" or "bike"). So there is a vocabulary gap, or even concept mismatch, between what users want and express in queries and the visual and textual cues that come directly with an image. This is the key problem addressed in this paper.

Approach and Contribution: To bridge the concepts and vocabulary between user queries and image features, we propose an approach that harnesses commonsense knowledge (CSK). Recent advances in automatic knowledge acquisition have produced large collections of CSK: physical (e.g. color or shape) as well as abstract (e.g. abilities) properties of everyday objects (e.g. bike, bird, sofa, etc.) (Tandon et al., 2014), subclass and part-whole relations between objects (Tandon et al., 2016), activities and their participants (Tandon et al., 2015), and more. This kind of knowledge allows us to establish relationships between our example queries and observable objects or activities in the image. For example, the following CSK triples establish relationships between 'backpack', 'tourist' and 'travel map': (backpacks, are carried by, tourists), (tourists, use, travel maps). This allows for retrieval of images with generic queries like "travel with backpack".
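For illustration, here is a minimal Python sketch of this kind of CSK-based query expansion, under simplifying assumptions: the triple store is a tiny hard-coded list, and overlap is tested with a crude containment check standing in for the paper's similarity function. The names (expand_query, overlaps) are ours, not from the Know2Look implementation.

```python
# Minimal sketch of CSK-based query expansion (illustrative; names and the crude
# matching rule are our assumptions, not the paper's exact procedure).

CSK_TRIPLES = [
    ("backpacks", "are carried by", "tourists"),
    ("tourists", "use", "travel maps"),
]

def overlaps(word, other):
    """Crude partial match: one string contained in the other."""
    return word in other or other in word

def expand_query(query_words, triples):
    """Add subject/object words of every triple that shares a word with the query."""
    expanded = set(query_words)
    for s, p, o in triples:
        triple_words = (s + " " + o).split()
        if any(overlaps(q, t) for q in query_words for t in triple_words):
            expanded.update(triple_words)
    return expanded

print(expand_query({"travel", "backpack"}, CSK_TRIPLES))
# e.g. {'travel', 'backpack', 'backpacks', 'tourists', 'maps'} (set order varies)
```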

This idea is worked out into a query expansion model where we leverage a CSK knowledge base for automatically generating additional query words. Our model unifies three kinds of features: textual features from the page context of an image, visual features obtained from recognizing fine-grained object classes in an image, and CSK features in the form of additional properties of the concepts referred to by query words. The weighting of the different features is crucial for query-result ranking. To this end, we have devised a method based on statistical language models (Zhai, 2008).

The paper's contribution can be characterized as follows. We present the first model for incorporating CSK into image retrieval. We develop a full-fledged system architecture for this purpose, along with a query processor and an answer-ranking component. Our system, Know2Look, uses commonsense knowledge to look for images relevant to a query by looking at the components of the images in greater detail. We further discuss experiments that compare our approach to state-of-the-art image search in various configurations. Our approach substantially improves the query result quality.

2 Related Work

Existing Commonsense Knowledge Bases: Traditionally, commonsense knowledge bases were curated manually by experts (Lenat, 1995) or through crowd-sourcing (Singh et al., 2002). Modern methods of CSK acquisition are automatic, either from text corpora (Liu and Singh, 2004) or from the web (Tandon et al., 2014).

Vision and NLP: Research at the intersection of Natural Language Processing and Computer Vision has been in the limelight in the recent past. There has been work on automatic image annotation (Wang et al., 2014), description generation (Vinyals et al., 2014; Ordonez et al., 2011; Mitchell et al., 2012), scene understanding (Farhadi et al., 2010), image retrieval through natural language queries (Malinowski and Fritz, 2014), etc.

Commonsense knowledge from text and vision: There have been attempts at learning CSK from real images (Chen et al., 2013) as well as from non-photo-realistic abstractions (Vedantam et al., 2015). Recent work has also leveraged CSK for visual verification of relational phrases (Sadeghi et al., 2015) and for non-visual tasks like fill-in-the-blanks by intelligent agents (Lin and Parikh, 2015). Learning commonsense from visual cues continues to be a challenge in itself. The CSK used in our work is motivated by research on CSK acquisition from the web (Tandon et al., 2014).

3 Multimodal document retrieval

Adjoining text of images may or may not explicitly annotate their visual contents. Search engines relying on only textual matches ignore information which may be solely available in the visual cues. Moreover, the intuition behind using CSK is that humans innately interpolate visual or textual information with associated latent knowledge for analysis and understanding. Hence we believe that leveraging CSK in addition to textual and visual information would take results closer to human users' preferences. In order to use such background knowledge, curating a CSK knowledge base is of primary importance. Since automatic acquisition of canonicalized CSK from the web can be costly, we conjecture that noisy subject-predicate-object (SPO) triples extracted through Open Information Extraction (Banko et al., 2007) may be used as CSK. We hypothesize that the combination of the noisy ingredients (CSK, object classes, and textual descriptions) would create an ensemble effect providing for efficient search and retrieval. We describe the components of our architecture in the following sections.

3.1 Data, Knowledge and Features

We consider a document $x$ from a collection $X$ with two kinds of features:

• Visual features $x_{vj}$: labels of object classes recognized in the image, including their hypernyms (e.g., king cobra, cobra, snake).

• Textual features $x_{xj}$: words that occur in the text that accompanies the image, for example the image caption.

We assume that the two kinds of features can be combined into a single feature vector $x = \langle x_1 \ldots x_M \rangle$, with hyper-parameters $\alpha_v$ and $\alpha_x$ to weigh visual vs. textual features.

CSK is denoted by a set $Y$ of triples $y_k$ $(k = 1..j)$ with components $y_{sk}$, $y_{pk}$, $y_{ok}$ ($s$: subject, $p$: predicate, $o$: object). Each component consists of one or more words. This yields a feature vector $y_{kj}$ $(j = 1..M)$ for the triple $y_k$.

3.2 Language Models for Ranking

We study a variety of query-likelihood language models (LM) for ranking documents $x$ with regard to a given query $q$. We assume that a query is simply a set of keywords $q_i$ $(i = 1..L)$. In the following we formulate equations for unigram LMs, which can simply be extended to bigram LMs by using word pairs instead of single words.

Basic LM:

$$P_{\text{basic}}[q \mid x] = \prod_i P[q_i \mid x] \qquad (1)$$

where we set the weight of word $q_i$ in $x$ as follows:

$$P[q_i \mid x] = \alpha_x\, P[q_i \mid x_{xj}]\, P[x_{xj} \mid x] + \alpha_v\, P[q_i \mid x_{vj}]\, P[x_{vj} \mid x] \qquad (2)$$

Here, $x_{xj}$ and $x_{vj}$ are unigrams in the textual and visual components of a document; $\alpha_x$ and $\alpha_v$ are hyper-parameters to weigh the textual and visual features respectively.
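A minimal sketch of Equations (1) and (2), assuming exact unigram matches (so $P[q_i \mid x_{xj}]$ is 1 on a match and 0 otherwise) and pre-computed per-feature weights $P[x_{xj} \mid x]$ and $P[x_{vj} \mid x]$ (Section 3.3 defines them via idf and detector confidence). The hyper-parameter values and function names are illustrative.

```python
import math

# Sketch of the Basic LM (Eqs. 1-2). ALPHA_X / ALPHA_V are hypothetical values;
# the dictionaries hold per-feature weights P[x_xj|x] and P[x_vj|x].
ALPHA_X, ALPHA_V = 0.6, 0.4

def p_word_given_doc(q_i, text_feats, vis_feats):
    """Eq. (2): weighted contribution of textual and visual unigrams matching q_i."""
    p_text = text_feats.get(q_i, 0.0)
    p_vis = vis_feats.get(q_i, 0.0)
    return ALPHA_X * p_text + ALPHA_V * p_vis

def p_basic(query, text_feats, vis_feats):
    """Eq. (1): conjunctive product over query words, computed in log space."""
    return math.exp(sum(math.log(p_word_given_doc(q, text_feats, vis_feats) or 1e-9)
                        for q in query))

doc_text = {"tourist": 0.5, "map": 0.3, "road": 0.2}   # P[x_xj | x]
doc_vis = {"person": 0.4, "bag": 0.3, "bus": 0.3}      # P[x_vj | x]
print(p_basic(["tourist", "bag"], doc_text, doc_vis))  # ~0.036
```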

Smoothed LM:

$$P_{\text{smoothed}}[q \mid x] = \alpha\, P_{\text{basic}}[q \mid x] + (1 - \alpha)\, P[q \mid B] \qquad (3)$$

where $B$ is a background corpus model and $P[q \mid B] = \prod_i P[q_i \mid B]$. We use Flickr tags from the YFCC100M dataset (Thomee et al., 2015), along with their frequencies of occurrence, as a background corpus.
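The smoothing step of Equation (3) can then be sketched as follows, assuming the background model is a relative frequency over Flickr-tag counts; p_basic_score stands for a score produced by a Basic LM such as the sketch above, and ALPHA is a hypothetical interpolation weight.

```python
# Sketch of the Smoothed LM (Eq. 3). ALPHA is a hypothetical interpolation weight;
# tag_counts plays the role of the Flickr-tag background corpus B.
ALPHA = 0.8

def p_background(query, tag_counts):
    """P[q|B] as a product of relative tag frequencies (tiny floor avoids zeros)."""
    total = sum(tag_counts.values())
    p = 1.0
    for q in query:
        p *= tag_counts.get(q, 0) / total or 1e-9
    return p

def p_smoothed(query, p_basic_score, tag_counts):
    return ALPHA * p_basic_score + (1 - ALPHA) * p_background(query, tag_counts)

tags = {"tourist": 120, "bag": 40, "beach": 300}       # toy background corpus
print(p_smoothed(["tourist", "bag"], 0.036, tags))
```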

Commonsense-aware LM (a translation LM):

$$P_{CS}[q \mid x] = \prod_i \left[ \frac{\sum_k P[q_i \mid y_k]\, P[y_k \mid x]}{|k|} \right] \qquad (4)$$


The summation ranges over all $y_k$ that can bridge the query vocabulary with the image-feature vocabulary; so both of the probabilities $P[q_i \mid y_k]$ and $P[y_k \mid x]$ must be non-zero. For example, when the query asks for "electric car" and an image has the features "vehicle" (visual) and "energy saving" (textual), triples such as (car, is a type of, vehicle) and (electric engine, saves, energy) would have this property. That is, we consider only commonsense triples that overlap with both the query and the image features.

The probabilities $P[q_i \mid y_k]$ and $P[y_k \mid x]$ are estimated based on the word-wise overlap between $q_i$ and $y_k$, and between $y_k$ and $x$, respectively. They also consider the confidence of the words in $y_k$ and $x$.
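A sketch of the translation LM of Equation (4), under the simplifying assumption that both bridging probabilities are plain word-overlap fractions; the paper additionally weighs them by similarity, salience, and informativeness (see Section 3.3 and the appendix). All names are illustrative.

```python
# Sketch of the commonsense-aware translation LM (Eq. 4). Bridging probabilities are
# approximated by word-overlap fractions; all names are our assumptions.

def word_overlap(words_a, words_b):
    """Fraction of words_a that also appear in words_b."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a) if a else 0.0

def p_cs(query, doc_words, triples):
    """P_CS[q|x]: per query word, average the bridge scores of overlapping triples."""
    score = 1.0
    for q_i in query:
        bridges = []
        for s, p, o in triples:
            t_words = (s + " " + p + " " + o).split()
            p_q_given_y = word_overlap([q_i], t_words)       # P[q_i | y_k]
            p_y_given_x = word_overlap(t_words, doc_words)   # P[y_k | x]
            if p_q_given_y > 0 and p_y_given_x > 0:          # keep only bridging triples
                bridges.append(p_q_given_y * p_y_given_x)
        score *= (sum(bridges) / len(bridges)) if bridges else 1e-9
    return score

triples = [("car", "is a type of", "vehicle"), ("electric engine", "saves", "energy")]
doc = ["vehicle", "energy", "saving"]                        # visual + textual features
print(p_cs(["electric", "car"], doc, triples))               # ~0.042
```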

Mixture LM (the final ranking LM):
Since a document $x$ can capture a query term or its commonsense expansion, we formulate a mixture model for the ranking of a document with respect to a query:

$$P[q \mid x] = \beta_{CS}\, P_{CS}[q \mid x] + (1 - \beta_{CS})\, P_{\text{smoothed}}[q \mid x] \qquad (5)$$

where $\beta_{CS}$ is a hyper-parameter weighing the commonsense features of the expanded query.
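The final score of Equation (5) is simply a convex combination of the two models; the $\beta_{CS}$ value and per-document scores below are hypothetical, since the paper chooses hyper-parameters manually.

```python
# Sketch of the final mixture score (Eq. 5) and the resulting ranking. beta_cs is a
# hypothetical value; the per-document scores are toy numbers.

def p_mixture(p_cs_score, p_smoothed_score, beta_cs=0.3):
    return beta_cs * p_cs_score + (1 - beta_cs) * p_smoothed_score

docs = {"doc1": (0.042, 0.033), "doc2": (0.0001, 0.050)}     # (P_CS, P_smoothed) per doc
ranking = sorted(docs, key=lambda d: p_mixture(*docs[d]), reverse=True)
print(ranking)                                               # ['doc1', 'doc2']
```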

3.3 Feature Weights

By casting all features into word-level unigrams, we have a unified feature space with hyper-parameters ($\alpha_x$, $\alpha_v$, and $\beta_{CS}$). For this submission the hyper-parameters are manually chosen.

For the weight of a visual object class $x_{vj}$ of document $x$, we consider the confidence score from LSDA (Hoffman et al., 2014). We extend these object classes with their hypernyms from WordNet, which are set to the same confidence as their detected hyponyms. Although not in common parlance, this kind of expansion can also be considered as CSK. We define the weight for a textual unigram $x_{xj}$ as its informativeness: the inverse document frequency with respect to a background corpus (Flickr tags with frequencies).

The words in a CSK triple $y_k$ have non-uniform weights proportional to their similarity with the query words, their idf with respect to a background corpus, and the salience of their position, boosting the weight of words in the $s$ and $o$ components of $y$. The function computing similarity between two unigrams favors exact matches over partial matches.
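A minimal sketch of these feature weights, assuming the textual weight is the term's idf against a background corpus of Flickr-tag frequencies and the visual weight multiplies the LSDA detection confidence by the same idf (the appendix formulation); per-document normalization is omitted and all constants are toy values.

```python
import math

# Sketch of the Section 3.3 feature weights. BACKGROUND holds toy Flickr-tag
# frequencies; N_DOCS is a toy corpus size. Per-document normalization is omitted.
BACKGROUND = {"person": 90000, "bag": 20000, "tourist": 5000}
N_DOCS = 100000

def idf(term):
    """Informativeness of a term against the background corpus."""
    return math.log(N_DOCS / (1 + BACKGROUND.get(term, 0)))

def textual_weight(term):
    return idf(term)

def visual_weight(term, detector_confidence):
    """Detected object class: LSDA confidence multiplied by informativeness."""
    return detector_confidence * idf(term)

print(textual_weight("tourist"))       # rarer tag -> more informative
print(visual_weight("person", 0.85))   # confident but very common detection
```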

3.4 Example

Query string: travel with backpack
Commonsense triples to expand the query:
t1: (tourists, use, travel maps)
t2: (tourists, carry, backpacks)
t3: (backpack, is a type of, bag)

Say we have a document x with the features:
Textual: "A tourist reading a map by the road."
Visual: person, bag, bottle, bus

The query will now successfully retrieve the above document, whereas it would have been missed by text-only systems.
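A self-contained toy check of this example: the raw query shares no terms with the document, but the CSK-expanded query matches both its textual and visual features. The crude containment test stands in for the paper's similarity function, and stopwords are dropped from the caption.

```python
# Toy check of the Section 3.4 example: the raw query misses the document, the
# CSK-expanded query hits both textual and visual features. The containment test
# is a stand-in for the paper's similarity function.

query = {"travel", "backpack"}
expansion = {"tourists", "travel", "maps", "backpacks", "bag"}   # from triples t1-t3

doc_text = {"tourist", "reading", "map", "road"}   # content words of the caption
doc_vis = {"person", "bag", "bottle", "bus"}       # detected object classes
doc_features = doc_text | doc_vis

def matches(term, features):
    return any(term in f or f in term for f in features)

print(sorted(t for t in query if matches(t, doc_features)))               # [] -> text-only miss
print(sorted(t for t in query | expansion if matches(t, doc_features)))   # ['bag', 'maps', 'tourists']
```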

4 Datasets

For the purpose of demonstration we choose a topical domain, Tourism. Our CSK knowledge base and image dataset obey this constraint.

CSK acquisition through OpenIE: We consider a slice of Wikipedia pertaining to the domain tourism as the text corpus to extract CSK from. Nouns from the Wikipedia article titled 'Tourism' (seed document) constitute our basic language model. We collect articles by traversing the Wiki Category hierarchy tree while pruning out those with substantial topic drift. The Jaccard Distance (Equation 6) of a document from the seed document is used as a metric for pruning.

$$\text{JaccardDistance} = 1 - \text{WeightedJaccardSimilarity} \qquad (6)$$

where

$$\text{WeightedJaccardSimilarity} = \frac{\sum_n \min[f(d_i, w_n),\, f(D, w_n)]}{\sum_n \max[f(d_i, w_n),\, f(D, w_n)]} \qquad (7)$$

In Equation 7, acquired Wikipedia articles $d_i$ are compared to the seed document $D$; $f(d', w)$ is the frequency of occurrence of word $w$ in document $d'$. For simplicity, only articles with a Jaccard distance of 1 from the seed document are pruned out. The corpus of domain-specific pages thus collected comprises ~5000 Wikipedia articles.
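A minimal sketch of the weighted Jaccard distance of Equations (6) and (7), computed over toy word lists; in the paper the frequencies come from full Wikipedia articles.

```python
from collections import Counter

# Sketch of the weighted Jaccard distance (Eqs. 6-7) used for pruning Wikipedia
# articles: term frequencies of a candidate are compared to the seed document,
# element-wise min over max. The inputs here are toy word lists.

def weighted_jaccard_distance(candidate_words, seed_words):
    f_d, f_D = Counter(candidate_words), Counter(seed_words)
    vocab = set(f_d) | set(f_D)
    num = sum(min(f_d[w], f_D[w]) for w in vocab)
    den = sum(max(f_d[w], f_D[w]) for w in vocab)
    similarity = num / den if den else 0.0
    return 1.0 - similarity

seed = "tourism travel hotel tourist attraction travel".split()
candidate = "travel guide tourist visa hotel".split()
print(weighted_jaccard_distance(candidate, seed))   # 0.625; a distance of 1 means no overlap -> pruned
```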


Table 1: Query Benchmark for evaluation

aircraft international    diesel transport
airport vehicle           dog park
backpack travel           fish market
ball park                 housing town
bench high                lamp home
bicycle road              old clock
bicycle trip              road signal
bird park                 table home
boat tour                 tourist bus
bridge road               van road

The OpenIE tool ReVerb (Fader et al., 2011), run against our corpus, produces around 1 million noisy SPO triples. After filtering with our basic language model we have ~22,000 moderately clean assertions.

Image Dataset: For the purpose of experiments we construct our own image dataset. ~50,000 images with descriptions are collected from the following datasets: Flickr30k (Young et al., 2014), Pascal Sentences (Rashtchian et al., 2010), the SBU Captioned Photo Dataset (Ordonez et al., 2011), and MSCOCO (Lin et al., 2014). The images are collected by comparing their textual descriptions with our basic language model for Tourism. An existing object detection algorithm, LSDA (Hoffman et al., 2014), is used for object detection in the images. The detected object classes are based on the 7000 leaf nodes of ImageNet (Deng et al., 2009). We also expand these classes by adding their super-classes or hypernyms with the same confidence score.
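A sketch of the hypernym expansion step using WordNet via NLTK (assuming nltk is installed and the WordNet corpus has been downloaded); following the first noun sense's hypernym chain is our assumption, and the detected classes and confidences would come from LSDA, which is not reproduced here.

```python
# Sketch of expanding a detected object class with its WordNet hypernyms, assigning
# them the detector's confidence score. Assumes: pip install nltk, then
# nltk.download('wordnet'). Using the first noun sense is an assumption of ours.

from nltk.corpus import wordnet as wn

def expand_with_hypernyms(object_class, confidence):
    """Return {label: confidence} for the class and its hypernym chain (first sense)."""
    labels = {object_class: confidence}
    synsets = wn.synsets(object_class.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return labels
    synset = synsets[0]
    while synset.hypernyms():
        synset = synset.hypernyms()[0]
        labels[synset.lemma_names()[0].replace("_", " ")] = confidence
    return labels

print(expand_with_hypernyms("king cobra", 0.72))
# e.g. {'king cobra': 0.72, 'cobra': 0.72, 'elapid': 0.72, 'snake': 0.72, ..., 'entity': 0.72}
```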

Query Benchmark: We construct a benchmark of 20 queries from co-occurring Flickr tags from the YFCC100M dataset (Thomee et al., 2015). This benchmark is shown in Table 1. Each query consists of two keywords that have appeared together with high frequency as user tags in Flickr images.
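A minimal sketch of how such a benchmark could be mined: count unordered tag pairs per photo and keep the most frequent pairs. The tag lists below are toy data, not YFCC100M, and the procedure is our reading of the description rather than the authors' exact script.

```python
from collections import Counter
from itertools import combinations

# Sketch: mine frequently co-occurring tag pairs as two-keyword queries (toy data).
photos_tags = [
    ["backpack", "travel", "mountain"],
    ["backpack", "travel"],
    ["boat", "tour", "travel"],
    ["boat", "tour"],
]

pair_counts = Counter()
for tags in photos_tags:
    for a, b in combinations(sorted(set(tags)), 2):   # unordered pairs per photo
        pair_counts[(a, b)] += 1

print(pair_counts.most_common(2))   # [(('backpack', 'travel'), 2), (('boat', 'tour'), 2)]
```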

5 Experiments

Baseline: Google search results on our image dataset form the baseline for the evaluation of Know2Look. We consider the results in two settings: search only on the original image caption (Vanilla Google), and on image captions along with detected object classes (Extended Google). The latter is done to aid Google in its search by providing additional visual cues.

Table 2: Comparison of Know2Look with baselines

                     Average Precision@10
Vanilla Google       0.47
Extended Google      0.64
Know2Look            0.85

We exploit the domain restriction facility of Google search (query string site:domain name) to get Google search results explicitly on our dataset.

Know2Look: In addition to the setup for Extended Google, Know2Look also performs query expansion with CSK. In most cases we win over the baselines, since CSK captures additional concepts related to the query terms, enhancing latent information that may be present in the images. We consider the top 10 retrieval results of the two baselines and Know2Look for the 20 queries in our query benchmark[1]. We compare the three systems by Precision@10. Table 2 shows the values of Precision@10 averaged over the 20 queries for each of the three systems; Know2Look performs better than the baselines.
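A minimal sketch of the evaluation metric: Precision@10 per query (fraction of the top 10 results judged relevant), averaged over the benchmark, with toy relevance judgments.

```python
# Sketch of Precision@10 averaged over queries; results and judgments are toy data.

def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

per_query = [
    precision_at_k(["d3", "d7", "d1"] + ["x"] * 7, relevant={"d3", "d1", "d9"}),
    precision_at_k(["d2"] + ["x"] * 9, relevant={"d2"}),
]
print(sum(per_query) / len(per_query))   # average Precision@10 over the toy queries
```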

6 Conclusion

In this paper we propose the incorporation of commonsense knowledge into image retrieval. Our architecture, Know2Look, expands queries with related commonsense knowledge and retrieves images based on their visual and textual contents. By utilizing the visual and commonsense modalities we make search results more appealing to human users than traditional text-only approaches. We support our claim by comparing Know2Look to Google search on our image dataset. The proposed concept can be easily extrapolated to document retrieval. Moreover, in addition to using noisy OpenIE triples as commonsense knowledge, we aim to leverage existing commonsense knowledge bases in future evaluations of Know2Look.

Acknowledgment: We would like to thank Anna Rohrbach for her assistance with visual object detection on our image dataset using LSDA. We also thank Ali Shah for his help with visualization of the evaluation results.

[1] http://mpi-inf.mpg.de/~sreyasi/queries/evaluation.html


References

[Banko et al. 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In IJCAI, volume 7, pages 2670–2676.

[Chen et al. 2013] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. 2013. NEIL: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 1409–1416.

[Datta et al. 2008] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5.

[Deng et al. 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.

[Fader et al. 2011] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics.

[Farhadi et al. 2010] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision–ECCV 2010, pages 15–29. Springer.

[Hoffman et al. 2014] Judy Hoffman, Sergio Guadarrama, Eric S. Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems, pages 3536–3544.

[LeCun et al. 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

[Lenat 1995] Douglas B. Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

[Lin and Parikh 2015] Xiao Lin and Devi Parikh. 2015. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2984–2993.

[Lin et al. 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer.

[Liu and Singh 2004] Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

[Liu et al. 2007] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. 2007. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282.

[Malinowski and Fritz 2014] Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690.

[Miller 1995] George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

[Mitchell et al. 2012] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

[Mordvintsev et al. 2015] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June.

[Ordonez et al. 2011] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).

[Rashtchian et al. 2010] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147. Association for Computational Linguistics.

[Sadeghi et al. 2015] Fereshteh Sadeghi, Santosh K. Divvala, and Ali Farhadi. 2015. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1456–1464. IEEE.

[Simonyan and Zisserman 2014] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[Singh et al. 2002] Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237. Springer.

[Tandon et al. 2014] Niket Tandon, Gerard de Melo, Fabian Suchanek, and Gerhard Weikum. 2014. WebChild: Harvesting and organizing commonsense knowledge from the web. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 523–532. ACM.

[Tandon et al. 2015] Niket Tandon, Gerard de Melo, Abir De, and Gerhard Weikum. 2015. Knowlywood: Mining activity knowledge from Hollywood narratives. In Proc. CIKM.

[Tandon et al. 2016] Niket Tandon, Charles Hariman, Jacopo Urbani, Anna Rohrbach, Marcus Rohrbach, and Gerhard Weikum. 2016. Commonsense in parts: Mining part-whole relations from the web and image tags. AAAI.

[Thomee et al. 2015] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2015. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.

[Vedantam et al. 2015] Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Learning common sense through visual abstraction. In Proceedings of the IEEE International Conference on Computer Vision, pages 2542–2550.

[Vinyals et al. 2014] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

[Wang et al. 2014] Josiah K. Wang, Fei Yan, Ahmet Aker, and Robert Gaizauskas. 2014. A poodle or a dog? Evaluating automatic image annotation using human descriptions at different levels of granularity. V&L Net 2014, page 38.

[Young et al. 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

[Zhai 2008] ChengXiang Zhai. 2008. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1):1–141.


Appendix

The mathematical formulas, function definitions, and details about the hyper-parameters used are listed in Tables 3, 4, 5, and 6.

Textual feature weight: $P[x_{xj} \mid x] = \dfrac{idf(x_{xj})}{\sum_\nu idf(x_{x\nu})}$
The informativeness or weight of a word/phrase $x_{xj}$ in a document is captured by calculating its idf in a large background corpus $\nu$.

Visual feature weight: $P[x_{vj} \mid x] = \dfrac{conf(x_{vj})}{\sum_\nu conf(x_{v\nu})} \times \dfrac{idf(x_{vj})}{\sum_\nu idf(x_{v\nu})}$
The weight of an object class $x_{vj}$ in a document is calculated as the product of its confidence (from LSDA) and its informativeness.

CSK feature weight: $P[y_k \mid x] = \dfrac{\sum_i \sum_j sim(y_{kj}, x_i)\, sal(y_{kj})\, inf(y_{kj})}{|i|\,|j|}$
The relevance of a commonsense triple $y$ to a document is decided by the similarity of its words/phrases $y_{kj}$ to the features of the document, the salience (or importance) of the match, and the informativeness of the word/phrase.

Table 3: Mathematical formulations of Feature Weights

$\alpha$: Weight of the basic document features; $(1 - \alpha)$ is the weight for smoothing.

$\alpha_x$: Weight associated with the textual features of a document.

$\alpha_v$: Weight associated with the visual features of a document.

$\beta_{CS}$: Weight pertaining to the commonsense knowledge features of an expanded query.

Table 4: Definition of Hyper-parameters


Basic LM: $P_{\text{basic}}[q \mid x] = \prod_i P[q_i \mid x]$
A unigram/bigram LM described by the probability of generating a query $q$ from a document $x$. The weight of the $i$-th word in $q$ is given by $P[q_i \mid x]$. The product over all words of the query ensures a conjunctive query.

$P[q_i \mid x] = \dfrac{\alpha_x}{|j|} \sum_j sim(q_i, x_{xj})\, P[x_{xj} \mid x] + \dfrac{\alpha_v}{|l|} \sum_l sim(q_i, x_{vl})\, P[x_{vl} \mid x]$
A word in the query may match the textual or visual features of a document, weighted by $\alpha_x$ and $\alpha_v$, and normalised by the number of matches $|j|$ and $|l|$ respectively.

Smoothed LM: $P_{\text{smoothed}}[q \mid x] = \alpha\, P_{\text{basic}}[q \mid x] + (1 - \alpha)\, P[q \mid B]$, with $P[q \mid B] = \prod_i P[q_i \mid B]$
The Basic LM after smoothing on a background corpus $B$. The relative frequency of $q_i$ in $B$ ($P[q_i \mid B]$) is used for smoothing the LM.

Commonsense-aware LM: $P_{CS}[q \mid x] = \prod_i \left[ \dfrac{\sum_k P[q_i \mid y_k]\, P[y_k \mid x]}{|k|} \right]$
A translation LM describing the probability of generating a query from the $k$ commonsense knowledge triples $y_k$. The summation over $k$ includes all triples bridging the gap between the query vocabulary and the document vocabulary; it is normalised by the total number of such triples.

$P[q_i \mid y_k] = \dfrac{\sum_j sim(q_i, y_{kj})}{|j|}$
The probability that the query word $q_i$ has been generated from the CSK triple $y_k$ is the sum of similarity scores between the two words/phrases, normalised by the number of words/phrases ($|j|$) in the CSK triple.

Mixture LM: $P[q \mid x] = \beta_{CS}\, P_{CS}[q \mid x] + (1 - \beta_{CS})\, P_{\text{smoothed}}[q \mid x]$
Combination of the weighted Commonsense-aware LM and Smoothed LM for ranking a document $x$ for a query $q$.

Table 5: Mathematical formulations of Language Models for Ranking


Confidence: $conf(w)$
A score output by LSDA depicting the confidence of detection of an object class. The hypernyms of the detected visual object classes are assigned the same confidence score.

Informativeness: $inf(w) = idf_B(w)$
We measure the informativeness of a word by its idf value in a larger corpus $B$, such that common terms are penalised.

Similarity: $sim(w_1, w_2) = \dfrac{|substring(w_1, w_2)|}{\max[length(w_1),\, length(w_2)]}$
This function calculates the amount of string overlap between $w_1$ and $w_2$.

Salience: $sal(w) = \lambda_s$ if $w \in$ subject, $\lambda_p$ if $w \in$ predicate, $\lambda_o$ if $w \in$ object, where $t_{csk} = \langle$subject, predicate, object$\rangle$
The importance of the string-match position in a commonsense knowledge triple $t_{csk}$ is captured by this function. Intuitively, the textual features in the subject and the object are more important than those in the predicate. Therefore we assign $\lambda_s = \lambda_o > \lambda_p$ and $\lambda_s + \lambda_p + \lambda_o = 1$.

Table 6: Function definitions
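As an illustration, the similarity function of Table 6 can be sketched in Python, assuming that $substring(w_1, w_2)$ denotes the longest common substring; difflib is used here for brevity.

```python
from difflib import SequenceMatcher

# Sketch of the string-overlap similarity of Table 6, assuming "substring(w1, w2)"
# means the longest common substring of the two words.

def sim(w1, w2):
    match = SequenceMatcher(None, w1, w2).find_longest_match(0, len(w1), 0, len(w2))
    return match.size / max(len(w1), len(w2))

print(sim("backpack", "backpacks"))   # high partial match (~0.89)
print(sim("car", "car"))              # exact match -> 1.0
```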

