Chat with Illustration: A Chat System with Visual Aids

Yu Jiang, Jing Liu, Zechao Li, Changsheng Xu, Hanqing Lu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
No.95, Zhongguancun East Road, Haidian District, Beijing, China
{yjiang, jliu, csxu, luhq}@nlpr.ia.ac.cn, [email protected]

ABSTRACT

Traditional instant messaging services mainly transfer textual messages, while visual messages are ignored to a great extent. In this paper, we propose a novel instant messaging scheme with visual aids named Chat with Illustration (CWI), which automatically presents users with visual messages associated with chat content. When users chat, the system first identifies meaningful keywords from the dialogue content and analyzes context relations. CWI then performs keyword-based image search in an image database with a cluster-based index. Finally, according to the context relations, CWI assembles these images properly and presents an optimal visual message for each dialogue sentence. With the combination of textual and visual messages, users can enjoy a more interesting and vivid communication experience. Especially for speakers of different native languages, CWI can help cross the language barrier to some degree. In-depth user studies demonstrate the effectiveness of our approach.

Keywords

Chat system, Text-to-Illustration, Layout

1. INTRODUCTION

Instant messaging services on the Internet, such as Tencent QQ and Windows Live Messenger, are the most convenient and fastest way to communicate with family and friends online. However, information interaction through such traditional instant messaging services (TIMS) is usually restricted to textual messages, which leads to several unavoidable problems: 1) Talk is just talk. Plain text is abstract and monotonous, so this kind of communication lacks interest. 2) Talk may be misunderstood. For example, when an American talks about "football", he means "rugby football", while in the opinion of a Chinese speaker, "football" is just "soccer". 3) Talk may be impossible. Conversation between speakers of different native languages through TIMS is difficult or even impossible.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICIMCS'12, September 9-11, 2012, Wuhan, Hubei, China
Copyright 2012 ACM 978-1-4503-1600-2/12/09 ...$15.00.

Figure 1: The interface of CWI. A tourist is talking to the help center and asking the way to the Merlion.

As the saying goes, "one image is worth a thousand words". We argue that enriching pure-text chat content with visual information is a more natural and efficient solution for human interaction. In this paper, we propose a novel instant messaging service named Chat with Illustration (CWI) to produce smooth, intelligible and vivid conversation, especially for users who speak different native languages. Different from TIMS, the CWI system provides not only textual messages but also visual messages about the chat content. Furthermore, a machine translation module is integrated into the system, and the translation result is delivered if users talk in different languages. When users start their dialogues, CWI first identifies meaningful keywords and analyzes context relations. A two-step approach is developed to discover the representative images of visualable keywords. Afterward, well-designed templates corresponding to different context relations are used to assemble these images into a visual message. Finally, the visual message is presented along with the textual message.

The work most closely related to ours is word-to-picture conversion and text-to-picture conversion. Word-to-picture conversion [7], [8], [9] tries to find representative images for given query words, while text-to-picture conversion [5], [4], [6] tries to construct pictorial representations for sentences, such as the Story Picturing Engine proposed by Joshi et al. [4], an unsupervised approach to automated story picturing. However, all of the above related work lacks a reasonable visual layout. Besides, unlike CWI, these systems are not specifically designed as instant messaging systems.

2. SYSTEM OVERVIEW

The framework of CWI is illustrated in Figure 2. CWI comprises four main components: 1) image collection and index; 2) dialogue analysis; 3) image selection; and 4) visual layout.

Figure 2: The framework of CWI. There are four components: 1) image collection and index; 2) dialogue analysis; 3) image selection and 4) visual layout.

As an off-line operation, we collect a large-scale image set from the web and build a cluster-based index for visualable keywords1 by considering visual and tag information simultaneously. Once users start their dialogue on CWI, the dialogue analysis module identifies meaningful keywords and analyzes context relations. Then CWI discovers the representative image of each visualable keyword by sub-cluster selection and image ranking. To deliver the meaning of sentences, CWI performs visual layout with well-designed templates to generate the visual message. Finally, the visual message is presented to users together with the textual message.

3. CHAT WITH ILLUSTRATION

3.1 Image Collection and Index

To meet the real-time requirement of CWI, an image database should be collected and indexed in advance. The image database is divided into two parts according to whether the query word is visualable. It is hard to automatically find proper images to represent abstract concepts, such as some verbs, adjectives, fixed phrases and interrogatives, yet they convey important information in conversation. As a result, we label a few unvisualable concepts manually, as shown in Figure 3. In this sub-section, we elaborate on the sub-database of visualable query words, which is built automatically. The biggest problem in building this sub-database is polysemy, which is very common. For example, a "pitcher" is a player who throws the baseball, but it can also mean an ewer, an open vessel with a handle. Hence, a cluster-based method is adopted that explores both semantic and visual information, producing sub-clusters of images with specific meanings. It is worth mentioning that such clustering is meaningful even for words without ambiguity, since it helps to filter out noisy images.

For each visualable query word q, we collect the top k1 images with their tags from Flickr via the API2. As a result, an image set Iq and a tag set Tq are formed. Semantic and visual feature spaces are then built for the images, and a clustering algorithm is performed. For the semantic representation, the top k2 most relevant tags in Tq are picked as the dictionary of q, according to the score of the term frequency-inverse document frequency (tf-idf)

1 A "visualable word" means that it is easy to find a proper image to present this word.
2 http://www.flickr.com/groups/api

Figure 3: Some example images of unvisualable query words. The first row shows adjectives; the second, verbs; the third, fixed phrases; and the fourth, interrogatives.

scheme. The tf-idf score is defined as follows:

tfidf(t, q) = freq_q(t) × log(N / num(t))    (1)

where freq_q(t) is the frequency of tag t in Tq, N is a constant number, and num(t) is the number of images tagged with t on Flickr. For each image i in Iq, we query with each tag in Ti, the original tag set of i, to get Flickr related tags3, and expand Ti into a larger tag set T̂i. Thus, each image can be represented in the semantic space as a k2-dimensional vector Vs(i), where each dimension of Vs(i) is the frequency of the corresponding dictionary word in T̂i.

For the visual representation, we consider both global and local features: 225-dimensional grid color moments, a 75-dimensional edge distribution histogram and a 500-dimensional bag of visual words (SIFT) are extracted to construct the visual feature space.

Having obtained the visual and semantic representations, we apply a well-known clustering algorithm (Affinity Propagation [3]) to generate query-related clusters. This yields a cluster-based index for the visualable query word q.
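A minimal sketch of the tag scoring of Eq. (1) and the semantic-vector construction described above; the tag sets, global counts and the constant N below are illustrative assumptions, not the authors' data:

```python
import math
from collections import Counter

def tfidf(tag, query_tags, global_count, N=1_000_000):
    """Eq. (1): freq_q(t) * log(N / num(t))."""
    freq_q = query_tags.count(tag)                 # frequency of t in Tq
    num_t = max(global_count.get(tag, 1), 1)       # images tagged with t on Flickr
    return freq_q * math.log(N / num_t)

def semantic_vector(image_tags, dictionary):
    """k2-dimensional V_s(i): frequency of each dictionary word in the expanded tag set."""
    counts = Counter(image_tags)
    return [float(counts[w]) for w in dictionary]

# Hypothetical example: build the top-k2 dictionary for the ambiguous query "pitcher"
T_q = ["baseball", "pitcher", "mound", "baseball", "jug", "pitcher"]
global_count = {"baseball": 50_000, "pitcher": 8_000, "mound": 2_000, "jug": 3_000}
scores = {t: tfidf(t, T_q, global_count) for t in set(T_q)}
dictionary = sorted(scores, key=scores.get, reverse=True)[:3]
v = semantic_vector(["baseball", "pitcher", "glove"], dictionary)
```

Rare tags that occur often in Tq score highest, which is the usual tf-idf trade-off the paper relies on.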

3.2 Dialogue Analysis

The dialogue analysis module is responsible for two tasks: meaningful keyword selection and context relation analysis. Meaningful keywords reflect users' intent in chat and are used as query words in image selection. For simplicity, pronouns, nouns, adjectives and verbs are considered meaningful keywords. Context relations, including grammatical relations and logical relations, are the basis for visual layout. Grammatical relations represent dependency relations between words; for instance, some words are the subject or object of a verb. Of the 52 grammatical relations proposed by [2], we pick only the 8 most important: adjectival modifier, conjunct, direct object, indirect object, negation modifier, nominal subject, possession modifier and prepositional modifier. The Stanford Parser4, a Java implementation of probabilistic natural language parsers, is adopted as the tool for meaningful keyword selection and grammatical relation analysis. Logical relations represent relationships between clauses. Here, we consider only a few logical relations, i.e. causal, assuming, tuning, progressive and parallel relationships. Usually, certain feature words in a sentence stand for these relations; for example, "but" represents a tuning relationship. We detect these feature words to judge the logical relation.

3 The Flickr API provides related tag inquiry.
4 http://nlp.stanford.edu/software/lex-parser.shtml
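As a rough sketch of this analysis step: the relation whitelist and the feature words are taken from the paper, but the triple format and the feature-word table below are simplifying assumptions, and the parse triples are assumed to come from an external dependency parser (such as the Stanford Parser), which is not invoked here:

```python
# The 8 dependency relation types kept by the paper (Stanford-style labels assumed)
GRAMMATICAL = {"amod", "conj", "dobj", "iobj", "neg", "nsubj", "poss", "prep"}

# Feature words signalling logical relations between clauses (illustrative subset)
LOGICAL_FEATURES = {
    "because": "causal", "so": "causal",
    "if": "assuming",
    "but": "tuning", "however": "tuning",
    "moreover": "progressive",
    "and": "parallel",
}

def filter_relations(triples):
    """Keep only relation types used for visual layout.
    A triple is (relation, head, dependent), e.g. ("nsubj", "saw", "He");
    collapsed labels like "prep_with" match on their base name."""
    return [t for t in triples if t[0].split("_")[0] in GRAMMATICAL]

def logical_relation(sentence):
    """Return the first logical relation signalled by a feature word, if any."""
    for word in sentence.lower().split():
        word = word.strip(",.")
        if word in LOGICAL_FEATURES:
            return LOGICAL_FEATURES[word]
    return None

triples = [("nsubj", "saw", "He"), ("dobj", "saw", "stars"),
           ("prep_with", "saw", "telescope"), ("det", "stars", "the")]
kept = filter_relations(triples)                              # drops the "det" triple
rel = logical_relation("He was tired, but he kept walking")   # "but" -> tuning
```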


Figure 4: Some of the most-used layout templates. (a): adjectival modifier and possession modifier; (b): nominal subject and direct object; (c): conjunct, where the upper one is for "and" conjunction and the lower one is for "or" conjunction; (d): negation modifier; (e): indirect object; (f): prepositional modifier; (g): causal relationship; (h): assuming relationship; (i): tuning relationship.

3.3 Image Selection

The goal of this sub-section is to find the representative image of a keyword q. A two-step approach is developed to discover representative images of visualable keywords.

In the first step, the most proper image sub-cluster is selected according to context clues, since the representative image must accord with the context. The so-called context consists of the other keywords in the same sentence or in recent sentences. Just like the semantic representation of an image, we expand the context through Flickr related tags and get the semantic representation of the context, Vs(context). The semantic similarity between the context and each sub-cluster center is calculated by cosine similarity in the semantic space, and the sub-cluster closest to Vs(context) is selected.

For a good representative image, we wish the object to be salient in the image. In the second step, we address saliency through image ranking, considering both visual and semantic aspects. In the visual aspect, we detect the image's salient region [1] and compute visual saliency as

Sal_v = Area_salient / Area_total    (2)

where Area_salient is the area of the salient region and Area_total is the total image area. In the semantic aspect, we assume that the more semantically consistent the tags in Ti are with q, the more salient i is. We calculate semantic saliency by averaging the tf-idf scores of the tags in Ti with respect to q:

Sal_s = ( Σ_{t_x ∈ T_i} tfidf(t_x, q) ) / |T_i|    (3)

where |Ti| is the number of tags in Ti. We select the most salient image ranked by Sal as follows:

Sal = γ Sal_v + (1 − γ) Sal_s    (4)
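Equations (2)-(4) can be sketched as follows; the salient-region areas and per-tag tf-idf scores are assumed to be supplied by the earlier stages, and the candidate values here are purely illustrative:

```python
def visual_saliency(area_salient, area_total):
    """Eq. (2): fraction of the image covered by the detected salient region."""
    return area_salient / area_total

def semantic_saliency(tag_scores):
    """Eq. (3): mean tf-idf score of the image's tags with respect to q."""
    return sum(tag_scores) / len(tag_scores) if tag_scores else 0.0

def combined_saliency(sal_v, sal_s, gamma=0.2):
    """Eq. (4): weighted combination; the paper sets gamma = 0.2."""
    return gamma * sal_v + (1 - gamma) * sal_s

# Rank hypothetical candidate images from the selected sub-cluster
candidates = [
    {"id": "img1", "salient_area": 0.30, "tag_scores": [2.0, 1.0, 3.0]},
    {"id": "img2", "salient_area": 0.60, "tag_scores": [1.0, 1.0]},
]
best = max(candidates,
           key=lambda c: combined_saliency(visual_saliency(c["salient_area"], 1.0),
                                           semantic_saliency(c["tag_scores"])))
```

With gamma = 0.2, the semantic term dominates, so "img1" wins here despite its smaller salient region, which matches the paper's weighting toward semantic consistency.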

3.4 Visual Layout

So far, we have obtained a proper representative image for each meaningful keyword in the dialogue. A single image, however, can only represent a single concept. In this section, we focus on a reasonable visual layout of the images, presenting a logical illustration for each sentence. A good layout scheme must be intuitive to humans and easy for computers to generate. We propose a template-based visual layout scheme. As shown in Figure 4, image layout templates have been designed based on context relations; the system simply assigns images to these templates according to the context relations. Different templates are synthesized through their co-ownership parts. If there are no co-ownership parts between

Figure 5: An example of visual layout.

Figure 6: Comparisons between CWI and TIMS from two aspects: enjoyment and comprehension.

different parts, they are synthesized in keyword order. A simple visual layout example is demonstrated in Figure 5: "nsubj(saw-2, He-1)" indicates that "He" is the nominal subject of "saw"; "dobj(saw-2, stars-3)" means the direct object of "saw" is "stars"; and "prep_with(saw-2, telescope-5)" reflects a prepositional modifier serving the meaning of the verb. The co-ownership part is "saw", by which the three parts are attached together.
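The merging of parsed relations into one illustration, as in the Figure 5 example, might look roughly like this; the triple format and the flat left-to-right composition are simplifying assumptions, since the actual system uses the 2-D templates of Figure 4:

```python
def assemble(triples):
    """Attach relation triples through their shared ("co-ownership") head word
    and emit a left-to-right slot order for the illustration.
    Each triple is (relation, head, dependent); nsubj puts the subject first."""
    slots = []
    for rel, head, dep in triples:
        order = (dep, head) if rel == "nsubj" else (head, dep)
        for word in order:
            if word not in slots:      # a shared head appears only once
                slots.append(word)
    return slots

triples = [("nsubj", "saw", "He"),
           ("dobj", "saw", "stars"),
           ("prep_with", "saw", "telescope")]
order = assemble(triples)   # "saw" is the co-ownership part joining all three
```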

4. USER STUDY

We set 1132 initial query words in total on 15 different topics, as shown in Table 1. For each q, k1 is set to 400 and k2 is set to 100. The weight factor γ is set to 0.2. For the manual image database, we collect images for 127 common unvisualable concepts. It is worth noting that the image database is built in English, because the Flickr API only supports English; as a result, image selection can only be processed in English. We invited 20 volunteers (12 male and 8 female, aged from 22 to 47) to participate in the user study of CWI. They were grouped into 10 pairs randomly. All of the volunteers are skilled in both English and Chinese. As mentioned above, images can be selected only in English, so volunteers were always required to express themselves in English. Before the study, volunteers were given some prior knowledge, such as the meaning of the different layout templates. Each pair was required to hold five dialogues under the following strategies: (1) SL+TIMS: chat in the same language on TIMS. (2) SL+CWI-L: chat in the same language on CWI without visual layout; images are arranged according to word order. (3) SL+CWI: chat in the same language on CWI. (4) DL+TIMS+M: chat in different languages on TIMS with machine translation; a Google translation module is integrated. (5) DL+CWI: chat in different languages on CWI. The last two strategies simulate international interaction; volunteers are treated as English speakers and Chinese readers.

4.1 Evaluation of Full Scheme


We evaluate the full scheme of CWI from two aspects. In the first, we compare CWI with TIMS; in the second, we evaluate the usefulness of CWI during an international interaction. The comparison with TIMS is based on SL+TIMS and SL+CWI, and the volunteers are asked to decide whether CWI performs "much better", "better", "same", "worse", or "much worse" than TIMS on the following indicators:

Enjoyment. It measures the extent to which users feel the process of interaction is enjoyable.

Comprehension. It measures the extent to which users can understand each other easily.

Figure 6 shows the results for the individual indicators. On both indicators, CWI outperforms TIMS remarkably. In particular, all of the volunteers chose "much better" or "better" on the enjoyment indicator. As for comprehension, 25% of volunteers chose "same" or "worse"; through communication with these volunteers, we found that they had not yet adapted to the new form of interaction. To evaluate the usefulness of the CWI system during international communication, volunteers were invited to answer the question "Is CWI useful for expressing your intents and understanding the other's intents?" They were asked to choose one of three options: "very useful", "somewhat useful", and "unuseful". The CWI system was regarded as very useful by 30% of users and somewhat useful by 65%. For another question, "Which form of communication would you choose during an international communication, DL+CWI or DL+TIMS+M?", 18 volunteers chose our proposed scheme, while the other two had no preference between the two.

From the above user studies, we find that the proposed CWI system outperforms TIMS. Moreover, CWI is very helpful for crossing the language barrier in international communication.

4.2 Evaluation of Components

We further evaluate two components of the proposed scheme: image selection and visual layout. We propose four criteria:

Precision of selected image. It measures whether the selected images are correct and suitable for the chat context.

Saliency of selected image. It measures whether the object mentioned in the dialogue is salient in the selected image.

Helpfulness of visual layout. It measures whether the image layout is helpful for understanding the content.

Naturalness of visual layout. It measures whether the image layout is consistent with human cognition.

The former two criteria measure the effectiveness of image selection. A selected image is considered precise if it includes the concept of the keyword, and salient if the area corresponding to the keyword is larger than a quarter of the image. We evaluate 382 images of visualable keywords in the 30 dialogues (10 pairs, each of which performed SL+CWI-L, SL+CWI and DL+CWI). The accuracy is 89.27%, and the saliency rate among the precise images is 82.6%, so image selection performs well on both precision and saliency.

The latter two criteria measure the effectiveness of the visual layout scheme. The user study is based on volunteers' experience with SL+CWI and SL+CWI-L. Volunteers are asked to choose whether the visual layout is "very helpful / natural", "helpful / natural" or "unhelpful / unnatural". From Figure 7 we find that the majority of volunteers consider the visual layout helpful and natural.

Figure 7: Comparisons between SL+CWI and SL+CWI-L from two aspects: helpfulness and naturalness.

Table 1: Dialogue Topics

ask the way        book hotel          borrow book
buy air ticket     buy car             recommend gift
buy phone          electrical repair   money change
movie              order pizza         rent house
reserve sickness   selected courses    travel

5. CONCLUSION

This paper has presented a novel instant messaging scheme with visual aids, named Chat with Illustration. Different from traditional instant messaging services, CWI provides not only textual messages but also visual messages, which are definite, vivid and intuitive. Extensive experiments have shown that CWI outperforms TIMS in online communication.

6. ACKNOWLEDGMENTS

This work was supported by the 973 Program (Project No. 2010CB327905) and the National Natural Science Foundation of China (Grant Nos. 60903146, 90920303).

7. REFERENCES

[1] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Global contrast based salient region detection. In CVPR, pages 409-416, 2011.
[2] M.-C. de Marneffe and C. D. Manning. Stanford typed dependencies manual.
[3] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, 2007.
[4] D. Joshi, J. Z. Wang, and J. Li. The Story Picturing Engine - a system for automatic text illustration. ACM TOMCCAP, 2:68-89, 2006.
[5] Z. Li, M. Wang, J. Liu, C. Xu, and H. Lu. News contextualization with geographic and visual information. In ACM MM, pages 133-142, 2011.
[6] R. Mihalcea and C. W. Leong. Toward communicating simple sentences using pictorial representations. Machine Translation, 22:153-173, 2008.
[7] M. Wang, K. Yang, X.-S. Hua, and H.-J. Zhang. Towards a relevant and diverse search of social images. IEEE TMM, 12:829-842, 2010.
[8] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang. Visual query suggestion. In ACM MM, pages 15-24, 2009.
[9] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, T.-S. Chua, and X.-S. Hua. Visual query suggestion: Towards capturing user intent in internet image search. ACM TOMCCAP, 6:1-19, 2010.


