
Detecting and Localizing Internationalization Presentation Failures in Web Applications

Abdulmajeed Alameer, Sonal Mahajan, and William G. J. Halfond
University of Southern California
Los Angeles, California, USA
{alameer, spmahaja, halfond}@usc.edu

Abstract—Web applications can be easily made available to an international audience by leveraging frameworks and tools for automatic translation and localization. However, these automated changes can distort the appearance of web applications since it is challenging for developers to design their websites to accommodate the expansion and contraction of text after it is translated to another language. Existing web testing techniques do not support developers in checking for these types of problems, and manually checking every page in every language can be a labor-intensive and error-prone task. To address this problem, we introduce an automated technique for detecting when a web page’s appearance has been distorted due to internationalization efforts and identifying the HTML elements or text responsible for the observed problem. In evaluation, our approach was able to detect internationalization problems in a set of 54 web applications with high precision and recall and was able to accurately identify the underlying elements in the web pages that led to the observed problem.

I. INTRODUCTION

Web applications enable companies to easily offer their services and products on a worldwide basis. Although this capability provides companies with many benefits, it also introduces the challenges of internationalization. Developers must design their websites to handle different languages while maintaining the appearance and aesthetics of their web pages’ user interfaces (UIs). Maintaining the attractiveness of a web page is important, as previous studies have shown that users’ impression of a website can be formed within fifty milliseconds [18] of viewing a page and that users often base their impressions of trustworthiness and quality, and ultimately, decisions to purchase, on the design and visual appearance of a web page [15], [12], [14].

Extensive infrastructure exists to support internationalization by web developers [13]. This enables developers to isolate “need-to-translate” strings in language-specific resource files, and load the correct resource files for the languages accepted by the requesting browser at runtime. This mechanism enables developers to support localization for any one of the languages specified in the ISO 639-3 standard, which currently has more than 7,700 entries [6]. Alternatively, developers can integrate translation APIs into their websites, such as the Google Website Translator [4], which allows the visitors to select their preferred language from a drop-down menu and have the API automatically translate text elements. Although these mechanisms enable the use of translated strings, they do not address the problem of maintaining the appearance of the translated version of the page. As we discuss in Section II, translated versions of text can be different lengths and heights, depending on the character set of the target language. This can lead to an Internationalization Presentation Failure (IPF), which is an undesired distortion of the page’s intended appearance as HTML elements expand, contract, or move in order to handle the translated text.

To prevent IPFs, there are best practices that developers can follow. However, most are too generic to do anything more than serve as a reminder. For example, one large software engineering company’s guidelines for internationalization state, “Provide for effective presentation of the UI after the expansion that results from translation.” [7] Although such guidelines are helpful, they do not provide concrete or specific guidance to developers on how to accommodate such expansions. Given the lack of effective guidance for avoiding the problem, developers must resort to proactive detection of IPFs. Although it is possible to test for IPFs, this effort requires developers to manually examine each page with each language option and compare its appearance to the intended appearance. Unsurprisingly, this can be a labor-intensive and error-prone process.

Automated support in the research and practitioner community is limited with respect to supporting internationalization issues. One notable exception is work that locates strings in web applications that developers need to translate before internationalization can be considered complete [29]. Although this technique helps developers to carry out the internationalization process more thoroughly, it does not help developers detect when their translations have led to an IPF. Apple’s pseudo-localization testing [1] attempts to help developers identify IPFs, but requires them to check each page of an app manually. There are other techniques that focus on other types of presentation failures. Although not intended to detect IPFs, they can detect a subset of these failures. For example, Fighting-Layout-Bugs [27] uses image processing techniques to detect overlapping text, a common symptom of IPFs. Cross-Browser Testing (XBT) techniques (e.g., [10], [11], [26]) and general presentation failure finding techniques (e.g., WebSee [21]) can also detect many IPFs, but as we show in the evaluation, they also generate many false positives since they often detect any change in the page caused by the translations of the text as a presentation failure.

In this paper we present a novel approach that specifically targets the detection of IPFs. Our technique assists developers in automatically detecting when internationalization efforts have caused a page’s appearance to become distorted and helps them in identifying the elements responsible for this problem. The basic intuition of our approach is to build rendering-based models of web pages, focusing primarily on the visual relationships between translated text and layout-related HTML elements. These models are then compared to identify elements whose appearance or relative position has changed between internationalized versions. Our approach then analyzes the identified elements to produce a ranked list of potentially faulty elements for developers. Our evaluation of the technique shows that it is very accurate: it can detect IPFs with 91% precision and 100% recall, and it identifies the faulty element with a median rank of three. Our technique is also fast and can perform this detection and localization for a given webpage in 9.75 seconds. Overall these results are very positive and indicate our approach could help developers in preventing IPFs.

The organization of our paper is as follows. In Section II we present some basic background information and definitions related to IPFs. We describe the approach in Section III and its evaluation in Section IV. A discussion of related work appears in Section V, and we conclude in Section VI.

II. BACKGROUND

An internationalized web application can be built using many approaches. However, two general techniques have become widespread and popular. The first approach is to isolate the “need-to-translate” text and images into separate language-specific resource files. A user’s web browser provides the user’s preferred languages and a server-side framework loads the correct language-specific resource files and inserts them into placeholders in the web page. This modularization of the language-specific content allows for easier management of the internationalized website. The second approach is to use online automated translation services (e.g., Google Website Translator [4]). With this approach, the developers install a plugin in their website. The visitors to the site can select their desired language from a drop-down box and the plugin will scan the web page to find all of the textual content. Then, the plugin will send that text to the online automated translation service, which will reply with the translation of the text. The plugin then replaces the original text with the translated text. This approach allows the developers to easily make their web applications available in a large number of languages without the need to manually extract and translate the content. However, neither of these approaches provides any guarantee that the translated text will not cause an IPF, and it is up to the web developer to verify that the pages’ user interfaces have not been distorted.
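
As an illustration of the first, resource-file-based approach described above, the following sketch picks a language from the browser's Accept-Language header and fills placeholders in a page template. It is a minimal sketch of ours: the file names, keys, supported languages, and fallback logic are hypothetical and are not taken from the paper or from any particular framework.

    import json

    SUPPORTED = {"en", "es", "de"}   # hypothetical set of supported languages

    def pick_language(accept_language, default="en"):
        # Walk the Accept-Language header and return the first supported language code.
        for part in accept_language.split(","):
            code = part.split(";")[0].strip().split("-")[0].lower()
            if code in SUPPORTED:
                return code
        return default

    def render(template, lang):
        # Load the language-specific resource file and substitute {{key}} placeholders.
        with open(f"strings_{lang}.json", encoding="utf-8") as f:
            strings = json.load(f)
        for key, value in strings.items():
            template = template.replace("{{" + key + "}}", value)
        return template

    # e.g. render("<button>{{checkout}}</button>", pick_language("es-MX,es;q=0.9"))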

Figure 1 shows a simplified version of a page taken from the Hotwire website. In the Mexican version of the page, the price text overflows its container. This overflow is an example of an IPF. In general, this type of problem occurs because the size of the text varies significantly based on its language. The resulting change in text size causes the text to overflow its intended containing element or, if the containing element grows with the text, the container growth can cause other elements to move, changing the UI layout. This change in the size of the text is mainly affected by three factors: the number of characters in the translated text, the width of the language’s characters, and the height of the language’s characters. Some of these changes can be quite dramatic. For example, in English the word “views” translates to “visualizzazioni” in Italian, which is almost three times longer. More generally, IBM has found that an English text that is less than ten characters could increase in size from 100% to 200% when translated to another language [7]. The potential for this kind of expansion must be anticipated by developers in the design of their web pages. Additionally, the correct rendering of a page must be verified, as the fact that it renders properly for one language does not imply that it will render correctly for other languages.

It is possible for a developer to manually check for the presence of IPFs. To do this though, the developer would have to manually compare the layout of a baseline page, which is known to be correct, against the translated versions. Depending on how many languages are supported, this could quickly grow to an unmanageable number of comparisons. For example, Google Translate allows websites to be translated into up to 90 different languages. Furthermore, once an IPF has been detected, the underlying root cause can be difficult to determine since the appearance of modern web pages is controlled by a complex rendering interaction of HTML, CSS, and JavaScript. This means that the connection between an observed failure and the underlying fault is often not straightforward. These complications make both detection and localization of IPFs a time-consuming and error-prone process.

III. APPROACH

The goal of our approach is to automatically detect IPFs and identify the translated text that is responsible for the failure. Our key insight is that IPFs are caused by changes in the size of translated text. Therefore, our approach defines and builds a model, called the Layout Graph (LG), that captures the visual relationships and relative positioning of HTML tags and text elements in a web page (Section III-A). To use our approach, a tester provides two web pages as input: the first is the Page Under Test (PUT) and the second is a baseline version of the page that shows the correct layout. Typically, the baseline would be the original version of the page, which is already known to be correct and will be translated to another language, as represented in the PUT. Our approach first builds an LG for each of these pages (Section III-B). The approach then compares these two LGs and identifies differences between them that represent potentially faulty elements (Section III-C). Finally, the approach analyzes and filters these elements to produce a ranked list of elements for the developer (Section III-D).

A. Layout Graph Definition

The LG is a model of the visual relationships of the elements of a web page. As compared to models used in related work, such as the alignment graph [11] and R-tree [21], the LG focuses on capturing the relationships of not only the HTML tags, but also the text contained within the tags. The reason for this is that the primary change to a web page after internationalization is that the text contained within the HTML tags has been translated to another language. The translated text may expand or shrink, which can cause an IPF. Therefore, the LG includes the text elements so that these changes can be more accurately modeled and compared.

The LG is a complete graph defined by the tuple 〈V, F〉, where V is the set of nodes in the graph and F is a function F : V × V → P(R) that maps each edge to a set of visual relationships defined by R. Each node in V represents an element that has a visual impact on the page. A node is represented as a tuple 〈t, c1, c2, x〉, where t is the node type and is either “Element” (i.e., an HTML tag) or “Text” (i.e., text inside of an HTML tag), c1 is the coordinate (x1, y1) representing the upper left corner of the node’s position on the page, c2 is the coordinate (x2, y2) representing the lower right corner of the node, and x is the XPath representing the node. The two coordinates represent the Minimum Bounding Rectangle (MBR) that encloses the element or text. The set R of possible visual relationships can be broken into three categories: direction (i.e., North, South, East, West), alignment (i.e., top, bottom, left, right), and containment (i.e., contains and intersects).
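
The following is a minimal data-structure sketch of the LG as defined above. It is our own simplification for illustration, not GWALI's implementation, and the field and label names are assumptions.

    from dataclasses import dataclass, field
    from typing import Dict, FrozenSet, Set, Tuple

    @dataclass(frozen=True)
    class LGNode:
        kind: str                   # "Element" (an HTML tag) or "Text"
        c1: Tuple[float, float]     # upper-left corner (x1, y1) of the MBR
        c2: Tuple[float, float]     # lower-right corner (x2, y2) of the MBR
        xpath: str                  # XPath identifying the node

    @dataclass
    class LayoutGraph:
        nodes: Set[LGNode] = field(default_factory=set)
        # F: maps each edge (an unordered pair of nodes) to its set of visual
        # relationship labels, e.g. {"North", "Contains", "LAlign"}.
        edges: Dict[FrozenSet[LGNode], Set[str]] = field(default_factory=dict)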

Fig. 1: Part of a web page and its localized version: (a) the American version showing “$45” and “4 stars Hotel”; (b) the Mexican version showing “MXN 750” and “Hotel de 4 estrellas”, where the price text overflows its container.

B. Building the Layout Graph

In the first phase, the approach analyzes the PUT and baseline page to build an LG of each. The approach first analyzes the Document Object Model (DOM) of each page to define the LG’s nodes (i.e., V) and then identifies the visual relationships between the nodes (i.e., F).

The first step of building the layout graph is to analyze the baseline page and PUT and compute the nodes in the LG. For each of these pages, this process proceeds as follows. The page is rendered in a browser, whose viewport size has been set to a predefined value. This chosen viewport size has to be the same for both pages for a given comparison, but can be varied to detect IPFs at different resolutions. After the page is rendered in a browser, the approach uses the browser’s API to traverse the page’s DOM. For each HTML tag h in the DOM, the approach collects h’s XPath ID (i.e., x), finds h’s MBR based on the browser’s rendering of h (i.e., c1 and c2), and assigns the type “Element” to the tag. If the node contains text (e.g., text between <p> tags or as the default value of an <input> text box) then the approach also creates a node for the text itself. For this type of node, the XPath is the XPath of the containing node plus the suffix “/text()”, the MBR is based on the size and shape of the text within the enclosing element, and the type is denoted as “Text.” This process is repeated for all HTML tags found in the page’s DOM with three exceptions, which we describe below.

Fig. 2: The MBRs of the two versions of the page, annotated with the XPaths of the corresponding elements and text nodes: (a) the American version; (b) the Mexican version.
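
As a sketch of this node-collection step, the snippet below renders a page at a fixed viewport with Selenium (which the prototype also uses) and gathers an XPath and MBR for each element. The JavaScript, the viewport size, and the omission of text nodes and of the exceptions discussed below are simplifications of ours, not GWALI's actual code.

    from selenium import webdriver

    JS_COLLECT = """
    var out = [];
    function xpathOf(el) {
      if (el === document.body) return '/html/body';
      var ix = 1, sib = el.previousElementSibling;
      while (sib) { if (sib.tagName === el.tagName) ix++; sib = sib.previousElementSibling; }
      return xpathOf(el.parentElement) + '/' + el.tagName.toLowerCase() + '[' + ix + ']';
    }
    document.querySelectorAll('body *').forEach(function (el) {
      var r = el.getBoundingClientRect();
      out.push({xpath: xpathOf(el), x1: r.left, y1: r.top, x2: r.right, y2: r.bottom});
    });
    return out;
    """

    def collect_mbrs(url, width=1920, height=1080):
        # Render the page at a fixed size and return one record per element:
        # its XPath and the corners of its Minimum Bounding Rectangle.
        driver = webdriver.Firefox()
        try:
            driver.set_window_size(width, height)   # same viewport for baseline and PUT
            driver.get(url)
            return driver.execute_script(JS_COLLECT)
        finally:
            driver.quit()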

The first exception is HTML tags that are not visible in the page. These tags do not affect the layout of the page and therefore do not have a visual relationship with any other tag. Officially, there are specific HTML and CSS properties, such as visibility:hidden and display:none, that can be used to cause a tag to not display. Unofficially, there are a myriad of ways that a developer can use to hide an element. These include setting the height or width CSS properties to zero; using the clip CSS property to cut an element to a zero-pixel rectangle; and setting a very high value for the text-indent property to render the element outside the boundary of its container while also setting the overflow property to hidden. Our approach detects these and other mechanisms, and then does not create a node in the LG for the HTML tag. This detection is done by querying the browser’s API after the page is rendered and getting the applicable HTML and CSS properties for each element. The second exception is for HTML tags that do not affect the layout of the page. The tags are not explicitly hidden, as described above, but are nonetheless not visible in the page’s rendering. These types of tags may be used to provide logical structure to the page. For example, a <div> may be used as a container to group other nodes. As with hidden tags, there are many ways to define these tags. Some of the heuristics we employ for this identification process are: (1) container elements that do not have a border and whose background color is similar to their parent’s background color; (2) tags that have a very small dimension; (3) tags only used for text styling, such as <font>, <strong>, and <b>; and (4) tags representing an unselected option in a select menu.

Fig. 3: The LGs of the two versions of the page, with edges labeled with visual relationships such as Contains, North, South, East, West, RAlign, LAlign, and Intersect: (a) the American version (LG); (b) the Mexican version (LG′).
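
The hidden-element checks described in the first exception above can be sketched as follows. This is a simplified illustration of ours: the computed-style values are assumed to come from the browser (e.g., via getComputedStyle), the thresholds are arbitrary, and real pages hide elements in many more ways than are covered here.

    def _px(value, default=0.0):
        # Parse a CSS pixel value such as "120px"; fall back on non-numeric values.
        try:
            return float(str(value).replace("px", ""))
        except (TypeError, ValueError):
            return default

    def is_hidden(style):
        """style: dict of computed CSS properties for one element."""
        if style.get("display") == "none" or style.get("visibility") == "hidden":
            return True
        if _px(style.get("width"), 1.0) == 0 or _px(style.get("height"), 1.0) == 0:
            return True
        if style.get("clip") == "rect(0px, 0px, 0px, 0px)":
            return True                       # clipped to a zero-pixel rectangle
        if style.get("overflow") == "hidden" and abs(_px(style.get("text-indent"))) > 9000:
            return True                       # text pushed far outside its container
        return False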

The third and final exception is for HTML tags embedded in the text of another tag. An example of this is shown in Figure 4, where the link labeled “options” has moved to the second line after the translation and therefore would have a different visual relationship with the link labeled “free trials.” Intuitively, we know that such changes are inevitable due to translated text and should not be considered as IPFs. Therefore, the approach groups such tags together and creates one node in the LG for them with an MBR that surrounds all of the grouped elements and assigns to that node the type “Text.”

Fig. 4: Example of non-failure text changing position after translation

After computing the nodes of the graph, the second step of the approach is to define the F function, which annotates each edge in the graph with a set of visual relationships. Recall that an LG is a complete graph, so this step computes the visual relationship between each pair of nodes on each edge. To compute the visual relationship between two nodes on an edge, the approach compares the coordinates of each node’s MBR. For example, for an edge (v, w), if v.y2 ≤ w.y1 then the relationship set would include North. Similarly, if v.y2 = w.y2 then the set would include Bottom-Aligned, and if (v.x1 ≤ w.x1) ∧ (v.y1 ≤ w.y1) ∧ (v.x2 ≥ w.x2) ∧ (v.y2 ≥ w.y2) then it would include the Contains relationship. The other relationships are computed in an analogous manner.
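
A sketch of this relationship computation for a single edge (v, w) is shown below, reusing the node structure from the earlier sketch. Only a subset of the labels is shown, and the small alignment tolerance is our own addition; the remaining direction, alignment, and containment checks follow the same pattern.

    def relationships(v, w, tol=0.0):
        """v, w: LGNode-like objects; screen coordinates grow downward and to the right."""
        (vx1, vy1), (vx2, vy2) = v.c1, v.c2
        (wx1, wy1), (wx2, wy2) = w.c1, w.c2
        rel = set()
        if vy2 <= wy1:
            rel.add("North")                  # v lies entirely above w
        if vy1 >= wy2:
            rel.add("South")
        if vx2 <= wx1:
            rel.add("West")
        if vx1 >= wx2:
            rel.add("East")
        if abs(vy2 - wy2) <= tol:
            rel.add("BAlign")                 # bottom edges aligned
        if abs(vx1 - wx1) <= tol:
            rel.add("LAlign")                 # left edges aligned
        if vx1 <= wx1 and vy1 <= wy1 and vx2 >= wx2 and vy2 >= wy2:
            rel.add("Contains")
        elif not (vx2 < wx1 or wx2 < vx1 or vy2 < wy1 or wy2 < vy1):
            rel.add("Intersect")              # MBRs overlap without containment
        return rel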

To illustrate the graph building process, consider the two web pages shown in Figure 1. The MBRs identified for these two web pages are shown in Figure 2. Next, Figure 3 shows the LGs produced for the American and Mexican versions of our example, which hereafter we refer to as LG and LG′, respectively. As can be seen, the graph is a complete graph with edges labeled with the spatial relationships between the nodes they represent. For example, the edge (../img, ../div) is labeled with the relationships “South, RAlign, LAlign,” which means that the element ../img is to the South of the element ../div and is also Right Aligned and Left Aligned with it.

C. Layout Graph Comparison

In the second phase, the approach compares the two LGs produced by the first phase in order to identify differences between them. The differences that result from the comparison represent potentially faulty tags or text that will be filtered and ranked in the third phase. A naive approach to this comparison would be to pair-wise compare the visual relationships annotating all edges in LG and LG′. In experiments we found that the drawback of this approach was that differences were detected for tags that were far away from each other on the page and whose relative change in position was not an IPF. Instead, our approach compares subgraphs of nodes and edges that are spatially close to a given node n in the LG. Our insight, confirmed by the experiments reported in Section IV, is that comparing these more limited subgraphs of LG and LG′, which we refer to as neighborhoods, is sufficient to accurately detect IPFs and the responsible faulty elements. In the rest of this section, we explain the details of this comparison, which requires first determining the nodes to be compared and then identifying the neighborhood for each node.

Before any comparison can take place, the approach must identify nodes in LG and LG′ that represent the same HTML element. Although each node contains an XPath, as we described in Section II, certain translation frameworks, such as the Google Translate API, may introduce additional tags. This means that the XPaths will not be an exact match. To address this problem, we adapted a matching approach defined by the WebDiff XBT technique [10]. This approach matches elements probabilistically using the nodes’ attributes, tag names, and the Levenshtein distance between XPath IDs. We adapted this approach to account for common variations introduced by the translation frameworks. The output of our adapted matching approach is a map M that matches each HTML tag or text in the baseline page with a corresponding tag or text in the PUT. In our experiments, we found that this matching was close to perfect for our subjects because the translation API introduced regularized changes for all translated elements.
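
A sketch of such a matching step is shown below. It scores candidate pairs by tag name and XPath similarity and keeps the best match per baseline node; the greedy strategy, the weights, and the use of difflib similarity as a stand-in for the Levenshtein distance are simplifications of ours, not the adapted WebDiff matching itself.

    import difflib

    def xpath_similarity(a, b):
        # A similarity ratio in [0, 1] between two XPath strings.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def tag_of(xpath):
        last = xpath.rstrip("/").split("/")[-1]
        return last.split("[")[0]

    def match_nodes(baseline_nodes, put_nodes):
        """Return a map M from each baseline node to its best PUT counterpart (or None)."""
        mapping = {}
        for v in baseline_nodes:
            best, best_score = None, 0.0
            for w in put_nodes:
                score = xpath_similarity(v.xpath, w.xpath)
                if tag_of(v.xpath) == tag_of(w.xpath):
                    score += 0.5          # prefer nodes with the same tag name
                if score > best_score:
                    best, best_score = w, score
            mapping[v] = best
        return mapping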

After computing M, the approach then identifies the neighborhood for each n ∈ LG. To do this, the approach first computes the coordinates of the four corners and center of n’s MBR. Then, for each of these five points, the approach identifies the k-Nearest Neighbor (k-NN) nodes in the LG. The neighborhood is defined as the union of the five points’ k-NNs. The closeness function in the k-NN algorithm is computed based on the spatial distance from the point to any area occupied by another node’s MBR. The calculation for this is based on the classic k-NN algorithm [25]. In our experiments, we have found that the approach works best when the value of k is set proportionally to the number of nodes in the LG.

The final step is to determine if the relationships assigned to edges in a neighborhood have changed. To do this, the approach iterates over each edge e that is part of the neighborhood of any n in LG and finds the corresponding edge e′ in LG′, using the previously generated M function. Note that the corresponding edge always exists since both LGs are complete graphs. Then the approach computes the symmetric difference δ between F(e) and F(e′), which identifies the visual relationships assigned to one edge but not the other. If the difference is non-empty, then the approach classifies the edge as a potential issue. The output of this step is I, a set of tuples of the form 〈e, e′, δ〉.
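
The comparison for one neighborhood can be sketched as follows, reusing the graph structure and node map from the earlier sketches; the neighborhood itself (the k-NN computation over the five MBR points) is assumed to be computed elsewhere, and the data shapes are our assumptions.

    def compare_neighborhood(lg, lg_prime, neighborhood_edges, mapping):
        # For each edge in the neighborhood, compute the symmetric difference
        # between its relationship sets in LG and LG'; non-empty differences
        # become potential issues of the form (e, e', delta).
        issues = []
        for (v, w) in neighborhood_edges:
            v2, w2 = mapping.get(v), mapping.get(w)
            if v2 is None or w2 is None:
                continue                      # unmatched nodes are skipped
            rel = lg.edges[frozenset((v, w))]
            rel_prime = lg_prime.edges[frozenset((v2, w2))]
            delta = rel ^ rel_prime           # symmetric difference
            if delta:
                issues.append(((v, w), (v2, w2), delta))
        return issues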

To illustrate the graph comparison, consider the LGs shown in Figure 3. Consider the node labeled “../div/div/text().” The bold edges connected to this node represent its neighborhood that will be compared against its counterpart in the other LG. In the figure, we have underlined and highlighted in red the labels on the edges for which there is a difference in the relationships. The edges with underlined red labels will be reported as potential issues and analyzed further in the third and final phase of our approach (Section III-D).

D. Ranking the Likely Faulty Elements

In the third and final phase, the approach analyzes the set of tuples, I, identified in the second phase and generates a ranked list of HTML elements and text that may be responsible for the observed IPFs. To identify the most likely faulty elements, the approach applies three heuristics to the tuples in I and then computes a “suspiciousness” score that it uses to rank, from most suspicious to least suspicious, the nodes associated with the edges in I.

The first heuristic serves to remove edges from I that were flagged as a result of to-be-expected expansion and contraction of text. An example of this is shown in Figure 4. Here a <div> container element surrounds the icon, title, and text block. The comparison of the two LGs would detect that the bottom alignment of the two text blocks has changed and the comparison would report a potential issue. However, in this case this should not be counted as an IPF because the container element has enough space to allow for the text to expand without disrupting anything outside of the containing element. To identify this situation, the approach identifies all edges where the type of the two constituent nodes is either Text/Element or Text/Text. If the δ of any of these edges contains alignment-related relationships, then these relationships are removed from δ. If δ is now empty, then the tuple is removed from I. This heuristic only allows alignment issues to be taken into account if they affect the visual relationship between nodes that represent HTML elements.

The second heuristic establishes a method for ruling out low-impact changes in the relative location of two elements. An example of such a change is shown in Figure 5. In the English version, the edge between the header element “Discover” and the search box is labeled West. After the translation to Russian, the element is shifted, and the relationship is changed, so the label becomes South. Although this is technically a distortion, it may or may not rise to the level of being considered an IPF. To calibrate for such situations, our approach allows testers to provide an α threshold that denotes the degree of allowed change. For each pair of nodes in an edge in I, if the δ of that edge contains direction-related relationships, then the approach uses the coordinates of the MBRs to calculate the change (in degrees) of the angle between the two nodes forming the edge. If the change is smaller than α, then these direction relationships are removed from δ. If δ is now empty, then the tuple is removed from I. In our experiments we found that α = 45 provided a reasonable balance in terms of flagging changes that would be characterized as disruptive and reducing false positives.
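
The angle test in this heuristic can be sketched as follows; measuring the angle between MBR centers and wrapping the difference to the shorter arc are our assumptions about the details, not the paper's exact formulation.

    import math

    def center(node):
        return ((node.c1[0] + node.c2[0]) / 2.0, (node.c1[1] + node.c2[1]) / 2.0)

    def angle_between(a, b):
        # Angle (in degrees) of the vector from a's MBR center to b's MBR center.
        (ax, ay), (bx, by) = center(a), center(b)
        return math.degrees(math.atan2(by - ay, bx - ax))

    def direction_change_is_small(v, w, v2, w2, alpha=45.0):
        # True when the angular change between the baseline edge (v, w) and the
        # PUT edge (v2, w2) stays below the alpha threshold.
        change = abs(angle_between(v, w) - angle_between(v2, w2))
        change = min(change, 360.0 - change)   # shortest angular difference
        return change < alpha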

The third and final heuristic expands the set of edges in I to include suspicious ancestor elements of nodes whose relative positions have changed. An example of such a change is also shown by Figure 5. The elements in the header are li elements. After the translation of the page, the element li[6], which represents “Troubleshooting,” has been pushed down to a new row in the Russian version of the page. Here, we cannot consider the faulty element to be only li[6]. In fact, the expansion of the text in li[1] through li[5] caused li[6] to be pushed down, so we report an XPath selector that represents all the text children of the parent ul element. To handle this situation, when an edge in I is found that has a directional visual relationship that has changed, the approach traverses the DOM of the page to find the Lowest Common Ancestor (LCA) of both nodes and adds an XPath selector that represents all of its text children to the list of nodes that will be ranked.

Fig. 5: The header menu of Twitter’s help center: (a) the English version; (b) the Russian version.

After the three heuristics have been applied to I, the approach generates a ranked list of the likely faulty nodes. To do this, the approach first creates a new set I′ that contains tuples of the form 〈n, s〉, where n is any node present in an edge in I or identified by the third heuristic and s is a suspiciousness score, initialized to 0 for all nodes. The approach then increments the suspiciousness scores as follows: (1) every time a node n appears in an edge in I, the score of n is incremented; and (2) the score of a node n is increased by the cardinality of the difference set (i.e., |δ|). For any XPath selector that was added as a result of the third heuristic, its suspiciousness score is incremented by the number of times it is added to the list. Once the suspiciousness scores have been assigned, the approach sorts I′ in order from highest score to lowest score and reports this list to the developer. This list represents a ranking of the elements determined to be the most likely to have caused the detected IPFs.
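
This scoring and ranking step can be sketched as follows, reusing the issue tuples produced by the comparison sketch; the exact data shapes and the treatment of the LCA selectors are assumptions of ours.

    from collections import defaultdict

    def rank_faulty_nodes(issues, lca_selectors=()):
        scores = defaultdict(int)
        for (v, w), _edge_prime, delta in issues:
            for node in (v, w):
                scores[node] += 1           # rule (1): node appears in a flagged edge
                scores[node] += len(delta)  # rule (2): cardinality of the difference set
        for selector in lca_selectors:      # rule (3): selectors added by the LCA heuristic,
            scores[selector] += 1           # incremented once per time they were added
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)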

IV. EVALUATION

To assess the effectiveness of our approach for detecting and localizing IPFs, we implemented it as a prototype tool to evaluate its accuracy, the quality of its localization results, and the time needed for it to run. Specifically, we addressed the following research questions:

RQ1: What is the accuracy of our technique in detecting IPFs?
RQ2: If a failure is detected, what is the quality of the localization results provided by our technique?
RQ3: How fast is our technique in detecting and localizing IPFs?

To address these questions, we carried out an empirical evaluation of our approach on a set of real-world web applications and compared our results with three well-known approaches for detecting different types of presentation failures in web applications: WebSee [21], X-PERT [11], and Fighting Layout Bugs (FLB) [27]. In the following sections, we describe the implementation of our approach and the subject applications used for the experiments. Then we discuss each of the research questions.

A. Implementation

We implemented our approach as a Java prototype tool, GWALI (Global Web Applications’ Layout Inspector). We used several third-party libraries to implement some of the functionality required by our approach, including the Java Spatial Index library to find the k-Nearest Neighbors of the LG’s nodes. Selenium Firefox Webdriver with Firefox version 39.0 was used to load the webpage and execute JavaScript to compute the MBRs of text and tag elements in the web page. We ran our experiments on a 64-bit Linux Ubuntu 14.04 machine with a screen resolution of 1920 × 1080, 8GB memory, and an Intel Core i7-4790 CPU.

We also compared our approach with three well-known approaches. The first one is WebSee [21], which is a tool designed to find presentation failures in web applications using computer vision techniques. When running WebSee, we also used WebSee’s exclusion region feature to exclude the areas in the pages that contain text from the comparison, since WebSee will report a failure once it finds the text in the PUT to be different from the text in the baseline. The second tool we compared GWALI against is FLB [27], which allows developers to automatically detect common layout bugs in the page. The tool provides multiple detectors. Some of the detectors, such as detecting images with invalid URLs or detecting text that is not readable due to low contrast with its background, were not related to IPFs. To reduce false positives, we disabled these detectors and enabled only the ones that were related to internationalization, which were detecting text that is near or overlapping with horizontal or vertical edges. The third tool we compared our approach against is X-PERT [11], which is designed to detect Cross-Browser Issues (XBIs). X-PERT works by loading the same page into two different browsers and comparing the rendering of the two pages. We made a modification to the code so that we could perform the comparison for two different pages in two windows of the same browser. X-PERT provides multiple modules to perform the comparison. To reduce false positives, we enabled only the structural detection module, since the other modules, such as the visual detection and the text detection modules, were not relevant for detecting IPFs.

B. Subject Applications

Our subject pool contained 54 different web applications. To ensure diversity in our subject pool, we selected internationalized web applications that were: (1) visited frequently; (2) used different translation technologies; (3) translated into different languages; and (4) designed with different layouts and styles. We identified potential subjects by searching three different sources. The first of these was builtwith.com, which is a website that indexes web applications built using various technologies. This site helped us to easily locate websites that were built using different kinds of translation frameworks, such as the Google Website Translator, and ensure that our subjects covered a wide range of translation technologies and frameworks. The second source was the Alexa top 100 most visited web sites list. The third and final source was high-profile websites that targeted international audiences, such as travel-related and telecom company websites, and were expected to provide their content in multiple languages. For each of these three sources, we manually inspected the identified websites to find web pages that contained an IPF. We found 30 such pages and used these as the ground truth for true positive IPFs. For true negatives, we selected 24 pages that were internationalized but that did not contain an IPF. Due to space constraints we do not list all of the subjects, but a complete list is available from the project’s website [5]. For each subject application, this process resulted in a baseline page, which we assumed was a correct rendering, and a PUT in another language that contained either zero faults (i.e., a true negative) or at least one IPF (i.e., a true positive). Overall, the faulty web sites identified in this process contained a wide range of IPFs, whose impact ranged from misaligned text to a complete distortion of the page that left it impossible to read. Figure 5 shows an example of the former and Figure 6 shows an example of the latter.

To ensure our experiments were repeatable and not affected by changes in the live websites, we used the Scrapbook Firefox plugin to download local copies of the subject webpages along with all of the corresponding images and styles that were required for the pages to be rendered correctly. We found that some pages used JavaScript to dynamically change content (e.g., rotating main news items) in the DOM, which made it difficult to compare the two versions of the page unless their changes were synchronized. To avoid this, we disabled JavaScript in the page after the pages were loaded and rendered in the browser.

C. Experiments and Results

To answer RQ1, we measured the precision and recall of GWALI’s detection for the subject applications. We ran GWALI, WebSee, X-PERT, and FLB for every pair of pages (i.e., the baseline and PUT), and compared the output they produce with the manually assigned ground truth. We considered the tools to have a successful detection if they indicated there was a failure of any type in the PUT, regardless of whether the output contained the faulty element. Then we calculated the precision and recall of the tools’ ability to correctly detect if there is a failure. Table I shows the precision and recall of GWALI compared to WebSee, X-PERT, and FLB.

TABLE I: Detection accuracy of GWALI compared to other approaches

                 GWALI   WebSee   X-PERT   FLB
Precision (%)      91      55       55      73
Recall (%)        100     100      100      55

For precision, the results in Table I show that GWALI could detect IPFs with 91% precision, which was much higher than the other approaches. We analyzed the results to understand the reasons for the false positives reported by each of the approaches. For GWALI, we found that there were two general causes of false positives. The first cause was when part of a web page’s content was displayed in multiple layers using the z-index CSS property. The LGs generated by GWALI model the layout of the page in two-dimensional space, effectively merging different layers into one. In some cases, web designers use the z-index CSS property in designing sliding content areas. In these cases, the content of every slide has a different z-index value. Our model does not take that into account and merges the slides into one layer, which caused false positives. The second cause of false positives was when the change of a direction relationship was large (i.e., greater than α as described in Section III-D), but this change did not distort the layout of the page. The α = 45 value is a heuristic, and in some cases, such as when a large amount of text shrinks, this can make the change exceed this value. For FLB, the main cause of false positives was the incorrect identification of the horizontal and vertical edges. This was due to inaccuracies in the image processing techniques used to identify the edges. For both WebSee and X-PERT, the underlying root cause of false positives was the same. Both approaches detected any difference with the oracle as a failure. This included any shifting of elements due to expanding or contracting text size.

For recall, the results in Table I show that GWALI, WebSee, and X-PERT had 100% recall. We investigated the results of FLB in order to better understand the reason for its false negatives. We found that this was due to the fact that the detectors in FLB could only detect a subset of the types of visible failures that could be associated with an IPF. In particular, the only detection rule in FLB that could detect symptoms related to IPFs was the one we had enabled, the check for overlapping text. Since not all IPFs had overlapping text, many IPFs were missed by the FLB approach.

Fig. 6: Text overlapping with a button after translation

Overall these results show that GWALI is very accurate in detecting IPFs. Although other approaches had the same level of recall, GWALI also had much higher precision. It is important to note that these results do not indicate that the other techniques are, in general, inaccurate, only that their detection mechanisms are not as effective as GWALI’s when trying to detect IPFs.

To answer RQ2, we calculated the number of elements that a developer would be expected to examine when using the output of the evaluated approaches. To determine this, we first ran all of the approaches, except for FLB, on the 30 subjects that contained one or more IPFs. We did not evaluate the localization for FLB because it provides its output as a marked-up screenshot of the page, with areas of the observed failures highlighted. It was not clear how this result could be quantified and compared with the other approaches.

Both GWALI and WebSee report a ranked list of potentially faulty elements, which we used as a proxy measure for expected effort. Although an imperfect metric, rank is widely used and allows us to quantify and compare results without the expense of a field study with real developers. For subjects with only a single fault, we simply reported the rank of the faulty element after running the tool. For subjects with multiple faults, we calculated the rank of each fault using a methodology proposed by Jones and colleagues [17]. The general idea of this methodology is to report the rank of the first faulty element that appears in the result set, simulate the fix of that fault, and then rerun the analysis to get the ranking of the next highest fault. This is repeated until all faults are “fixed.” The intuition behind this methodology is that it approximates the workflow of a developer who scans the results, fixes a fault, and then reruns the analysis to see if any more faults remain.

Calculating expected effort for X-PERT is a little more complicated because it returns an unordered set. If the faulty element is present in this set, then its location follows a uniformly random distribution. In the case of a single fault, the expected effort is therefore (n + 1)/2, where n is the size of the set, since the developer would, on average, have to examine half the elements in the set before finding the faulty element. Calculating this metric for the case when there are multiple faults generalizes to the problem of calculating the average number of comparisons for a linear search for k items in an unordered set of size n where the distribution of the k items is uniformly random. Equation (1) shows the standard equation for calculating this metric.

(n + 1) / (k + 1)    (1)

An additional complication of calculating the expected number of comparisons for X-PERT is that its output may not contain the faulty element. In this case, we need to approximate the amount of additional effort that would be required to find the element in the remaining set of elements contained in the page. The challenge in this approximation is knowing exactly how developers will search the remaining elements. Instead of directly addressing this question, we calculate an upper and lower bound on the best and worst case scenarios. The best case scenario is that the developer finds the fault as they examine the first element in the remaining set of elements. The expected effort for this would then be the size of the set returned by X-PERT plus one. In the worst case, the developer performs a linear search of the remaining elements. To calculate this, we reuse Equation (1) to determine how many checks would be needed for this scenario. Equation (2) shows the expected number of checks for the scenario where the faulty element has not been found in the output set returned by X-PERT. In this equation, n is the size of the result set returned by X-PERT, m is the size of the PUT, and k is the number of faults.

n + 1                          (best case)
n + (m − n + 1) / (k + 1)      (avg case)    (2)
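
As a quick numeric illustration of Equations (1) and (2) (the example values below are ours, not from the paper): with a result set of n = 20 elements, a page of m = 500 elements, and k = 1 fault, the expected effort is 10.5 comparisons when the fault is in the set, and otherwise 21 in the best case and 260.5 in the average case.

    def expected_effort_in_set(n, k):
        return (n + 1) / (k + 1)                    # Equation (1)

    def expected_effort_not_in_set(n, m, k):
        best = n + 1                                # Equation (2), best case
        avg = n + (m - n + 1) / (k + 1)             # Equation (2), average case
        return best, avg

    # expected_effort_in_set(20, 1)          -> 10.5
    # expected_effort_not_in_set(20, 500, 1) -> (21, 260.5)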

Our results are as follows. For GWALI, the median number of expected comparisons was three; for WebSee, the median was 198; and for X-PERT, the median calculated with the best case assumption was 38 and with the worst case assumption was 157. It is important to note that these numbers represent a lower bound on the expected number of comparisons because we only calculated these numbers for the subject applications that contained IPFs. If we were to compute this number over all subjects that were reported as having IPFs, then the numbers for all subjects would increase. If we assume a linear search for cases where there would be no faulty element, then the median numbers for WebSee and X-PERT would rise significantly because they had a lower precision for detecting IPFs.

Fig. 7: Histogram of the ranks reported by GWALI (x-axis: rank, 0 to 30; y-axis: frequency, 0 to 15).

We further analyzed the ranking results of GWALI by compiling them into a histogram, which is shown in Figure 7. This diagram shows that over 81% of the correct faulty elements were ranked in the top six returned results. We consider this to be a very strong result. Using a tool such as Firebug [3], it would take a developer only a few minutes to use the reported XPath IDs to inspect the rendering of six elements.

Overall, we consider these results to be a strong indication that GWALI can accurately localize the faulty element. As with RQ1, we want to emphasize that the results do not show that X-PERT and WebSee are not accurate techniques. Instead, we interpret our results as showing that the localization techniques defined in those approaches are not appropriate for IPFs and that the techniques proposed in our approach are both necessary and more accurate for the IPF localization problem.

To answer RQ3, we measured the running time of GWALI, WebSee, X-PERT, and FLB on the subject applications. For each tool, the total running time was measured from loading the browser until the tool shut down and produced its output. For GWALI, we also measured the time required for each part of the approach.

For each tool, Table II shows the average, minimum, and maximum measured running time. As the results show, GWALI needed 9.75 seconds on average to run, which is slightly slower than X-PERT (8.97 seconds) and FLB (9.19 seconds). WebSee was significantly slower than the other approaches because its ranking heuristics used sub-image comparison. We further analyzed the time required for the different operations carried out by GWALI. The average distribution is shown in Figure 8. As can be seen from the graph, the most expensive part of the approach is interacting with the browser to collect MBRs and to load the page in the browser. This represents the cost to load and analyze the baseline page and the PUT. In a real-world scenario, we could expect half of this cost (the part associated with the baseline page loading and analysis) to be amortized over the total number of PUTs analyzed (i.e., the number of translated pages compared against). Also, note that 36% of GWALI’s running time (3.5 seconds) was consumed by the process of loading the browser. This means that the performance of our approach can be significantly improved when the testers use it to check a large number of test cases. This improvement can be achieved by loading the browser once and running all the test cases one after another in the same browser window.
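
A sketch of that reuse pattern with Selenium is shown below; the gwali_compare callback and the list of page pairs are hypothetical placeholders of ours, not part of the tool.

    from selenium import webdriver

    def run_test_suite(page_pairs, gwali_compare):
        # Start the browser once and reuse it for every baseline/PUT pair,
        # amortizing the start-up cost over all comparisons.
        driver = webdriver.Firefox()
        try:
            return [gwali_compare(driver, baseline_url, put_url)
                    for baseline_url, put_url in page_pairs]
        finally:
            driver.quit()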

Although the results for this RQ showed that our approach was slower than two of the other approaches, we still consider this to be a positive result, namely because GWALI’s runtime was not that much slower than the other approaches (less than a second from the fastest) and the average time was under ten seconds, which in absolute terms is not a long time.

TABLE II: Average, min, and max execution time (in seconds) for GWALI compared to other well-known tools

           GWALI   WebSee    X-PERT   FLB
Average     9.75   519.66     8.97    9.19
Min         6.08   115.80     6.17    6.56
Max        36.21   1640.28   17.04   23.84

Fig. 8: Breakdown of the running time of the tool: loading browser 36.08%, collecting MBRs 60.06%, building graphs 0.19%, comparing graphs 3.65%, filtering 0.01%.

D. Threats to Validity

There are threats to validity which we have tried to address during our evaluation. To address external threats to validity, we made sure that we have a variety of different translation technologies, different languages, and different styles of websites so we have a sample that is representative of the actual web. However, one limitation of our approach is handling right-to-left languages, such as Arabic, Farsi, and Hebrew. In these languages the layout of the page is mirrored, which needs special handling when building and comparing the LGs. We have also made our subjects publicly available via the project website [5] to enable independent inspection of the subjects.

To address threats to internal validity, we checked our results manually to verify correctness. A possible threat to validity is that we fixed the screen resolution for our experiments. In the subject applications, some IPFs could appear or disappear depending on the screen resolution. However, it is important to note that the ground truth of IPFs (i.e., true positives and true negatives) would therefore also vary. Furthermore, these IPFs could be easily detected by systematically running GWALI at different screen resolutions, as is done in related techniques [28]. Another threat is our assumption of how the developers will use the result set produced by our tool. We used rank, which is a commonly used metric to measure the effectiveness of localization. When using rank, we assume that the developers will examine the first ranked element and check if it is faulty or not; if it is not faulty, they will examine the second reported element, and so on, until they find the actual faulty element. Our claim of the effectiveness of the localization in our approach relies on this assumption. Also, in case there are multiple failures in the page, our assumption is that the developers use the process described above until they find the first fault; once the first fault is found, it will be fixed first, then they will rerun the tool on the new version of the page, and they will repeat this process until all faults are fixed or the result set contains only non-faulty elements. This is a commonly used assumption in fault localization to measure the effort needed to localize faults [17].

V. RELATED WORK

There has been a lot of research focusing on testing the GUI of web applications. Although these efforts have not focused on automating the detection of IPFs, they have covered many closely related topics, which we describe below.

WEBDIFF [10], X-PERT [11], and Browserbite [26] are well-known approaches in the field of Cross-Browser Testing (XBT). These approaches work by comparing two versions of the page that are rendered in two different browsers to detect XBIs. These approaches, and other cross-browser testing approaches, suffer from over-sensitivity when used to detect IPFs. In XBT, the two versions of the page are expected to match exactly, and a small change in the page that occurs due to the translation of the text is detected as a failure by these techniques.

ReDeCheck [28] is a responsive web design testing tool. The tool uses a model that is similar to the alignment graph used in X-PERT, and thus suffers from the same problems. Although the tool is effective in detecting layout faults in responsive web design, it does not address the problem of internationalization.

Fighting-Layout-Bugs [27] is a technique that allows developers to detect general correctness properties, such as text that is overlapping with horizontal or vertical edges or text with too low contrast with the background. A drawback of this approach is that it can only detect a limited set of possible IPFs. Another drawback is that it only outputs a screenshot of the page with the failures marked on it, so the developer has to inspect the screenshot and manually search through the page to identify the faulty elements.

WebSee [21], [20], [19] is a tool that uses computer vision techniques to compare a browser-rendered test web page with an oracle image to find visual differences. Such a tool is useful in some usage scenarios where the test web page is expected to exactly match the oracle (e.g., mock-up driven development and regression debugging). However, WebSee has limitations similar to those of XBT techniques with respect to IPFs and does not perform well on detecting or localizing IPFs.

Apple offers developers who use the Xcode IDE the ability to test the readiness of the layout of their applications to adapt to international text [1]. This is done using pseudo-localization, where all the text in the application is replaced with dummy text that has some problematic characteristics. The new text is typically longer and contains non-Latin characters. This helps in revealing internationalization faults in the applications. The tool also changes the direction of the text to test right-to-left languages. The developer then has to verify that elements in the GUI reposition and resize appropriately after the pseudo-localization and the direction change. This technique suffers from two issues. First, it requires manual effort from the developers to inspect every page of the application, which is impractical for applications that have a large number of pages. Second, the technique uses an estimate for the amount of possible expansion of text, and developers will not be able to verify the behavior of their GUI if the text expands more.

A tool provided by the World Wide Web Consortium (W3C), called the i18n checker [9], checks, for a given webpage, a group of internationalization-related properties to make sure they are set properly. For example, it checks if the character encoding is specified in the HTML document and the language attribute is added in the HTML tag. The tool is effective in providing these kinds of non-layout-related checks. However, it does not perform a test on the layout of the page to make sure that it can adapt to international text.

Several techniques, including Cornipickle [16], Cucumber [2], Crawljax [22], and Selenium WebDriver [8], allow developers to specify assertions on a page's HTML and CSS data; these assertions are then checked against the web page's DOM. These approaches have several drawbacks; most notably, they require developers to manually specify the desirable properties they want to check in the page, and there can be hundreds of assertions for a single page.
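For example, a single hand-written layout assertion with Selenium WebDriver might look like the following; the page URL, the element ids, and the particular property being checked are hypothetical, and a realistic page would need many such assertions.

```python
# One hand-written layout assertion with Selenium WebDriver; the URL and
# element ids are hypothetical. Covering an entire page this way requires
# writing many such assertions, which is the drawback noted above.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com/index.html?lang=fr")

button = driver.find_element(By.ID, "checkout-button")
container = driver.find_element(By.ID, "nav-bar")

# Assert that the translated button still fits inside its container.
assert button.size["width"] <= container.size["width"], \
    "checkout button overflows the navigation bar"

driver.quit()
```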

An extensive body of work in the area of automated GUI testing [24], [23] focuses on testing the behavior of a system based on event sequences triggered from its user interface. These techniques differ from our approach in that they are not focused on testing the appearance of the GUI; instead, they use the GUI to test the behavior of applications.

VI. CONCLUSION

Many companies internationalize their websites to target customers who speak different languages. This process can introduce IPFs in the layouts of the websites. Manually inspecting every page with each language option to detect such failures is a labor intensive and error prone activity. In this paper, we presented a novel automated approach that can detect and localize IPFs. This is done by building Layout Graphs of the pages and comparing these Layout Graphs to identify the failures. To evaluate our approach, we implemented it as a prototype tool, GWALI. The results of the evaluation show that our approach can detect IPFs with 91% precision and 100% recall. Our approach can identify the faulty element with a median rank of three and with an average running time of 9.75 seconds per web page. We believe these results are positive and show that our approach could help developers address IPFs in web applications.

ACKNOWLEDGMENT

This work was supported by National Science Foundation grant CCF-1528163.

REFERENCES

[1] Apple Internationalization and Localization Guide. https://developer.apple.com/library/ios/documentation/MacOSX/Conceptual/BPInternational/BPInternational.pdf.
[2] Cucumber. http://cukes.info/.
[3] Firebug. https://addons.mozilla.org/en-US/firefox/addon/firebug/.
[4] Google Website Translator. https://translate.google.com/manager/website/.
[5] GWALI: Global Web Testing. https://sites.google.com/site/gwaliproject.
[6] IANA Language Subtag Registry. http://www.iana.org/assignments/language-subtag-registry.
[7] IBM Guidelines to Design Global Solutions. http://www-01.ibm.com/software/globalization/guidelines/a3.html.
[8] Selenium. http://docs.seleniumhq.org/.
[9] W3C Internationalization Checker. https://validator.w3.org/i18n-checker/.
[10] S. Choudhary, H. Versee, and A. Orso. WEBDIFF: Automated Identification of Cross-Browser Issues in Web Applications. In 2010 IEEE International Conference on Software Maintenance (ICSM), pages 1–10, Sept 2010.
[11] S. R. Choudhary, M. R. Prasad, and A. Orso. X-PERT: Accurate Identification of Cross-Browser Issues in Web Applications. In Proceedings of the 35th IEEE and ACM SIGSOFT International Conference on Software Engineering (ICSE 2013), pages 702–711, San Francisco, USA, May 2013.
[12] F. N. Egger. “Trust Me, I’m an Online Vendor”: Towards a Model of Trust for e-Commerce System Design. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’00, pages 101–102, New York, NY, USA, 2000. ACM.
[13] B. Esselink. A Practical Guide to Localization, volume 4. John Benjamins Publishing, 2000.
[14] A. Everard and D. F. Galletta. How Presentation Flaws Affect Perceived Site Quality, Trust, and Intention to Purchase from an Online Store. J. Manage. Inf. Syst., 22(3):56–95, Jan. 2006.
[15] B. J. Fogg, J. Marshall, O. Laraki, A. Osipovich, C. Varma, N. Fang, J. Paul, A. Rangnekar, J. Shon, P. Swani, and M. Treinen. What Makes Web Sites Credible?: A Report on a Large Quantitative Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’01, pages 61–68, New York, NY, USA, 2001. ACM.
[16] S. Halle, N. Bergeron, F. Guerin, and G. Le Breton. Testing Web Applications Through Layout Constraints. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on, pages 1–8, April 2015.
[17] J. A. Jones, J. F. Bowring, and M. J. Harrold. Debugging in Parallel. In Proceedings of the 2007 International Symposium on Software Testing and Analysis, ISSTA ’07, pages 16–26, New York, NY, USA, 2007. ACM.
[18] G. Lindgaard, G. Fernandes, C. Dudek, and J. Brown. Attention Web Designers: You Have 50 Milliseconds to Make a Good First Impression! Behaviour & Information Technology, 25(2):115–126, 2006.
[19] S. Mahajan and W. G. J. Halfond. WebSee: A Tool for Debugging HTML Presentation Failures. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST) – Tool Track, April 2015.
[20] S. Mahajan and W. G. J. Halfond. Finding HTML Presentation Failures Using Image Comparison Techniques. In Proceedings of the 29th IEEE/ACM International Conference on Automated Software Engineering (ASE) – New Ideas Track, September 2014.
[21] S. Mahajan and W. G. J. Halfond. Detection and Localization of HTML Presentation Failures Using Computer Vision-Based Techniques. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2015.
[22] A. Mesbah and A. van Deursen. Invariant-Based Automatic Testing of AJAX User Interfaces. In Proceedings of the 31st International Conference on Software Engineering, ICSE ’09, pages 210–220, Washington, DC, USA, 2009. IEEE Computer Society.
[23] R. Moreira, A. Paiva, and A. Memon. A Pattern-Based Approach for GUI Modeling and Testing. In Software Reliability Engineering (ISSRE), 2013 IEEE 24th International Symposium on, pages 288–297, Nov 2013.
[24] B. Nguyen, B. Robbins, I. Banerjee, and A. Memon. GUITAR: An Innovative Tool for Automated Testing of GUI-Driven Software. Automated Software Engineering, 21(1):65–105, 2014.
[25] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD ’95, pages 71–79, New York, NY, USA, 1995. ACM.
[26] N. Semenenko, M. Dumas, and T. Saar. Browserbite: Accurate Cross-Browser Testing via Machine Learning over Image Features. In 29th IEEE International Conference on Software Maintenance (ICSM), pages 528–531, Sept 2013.
[27] M. Tamm. Fighting Layout Bugs. https://code.google.com/p/fighting-layout-bugs/.
[28] T. A. Walsh, P. McMinn, and G. M. Kapfhammer. Automatic Detection of Potential Layout Faults Following Changes to Responsive Web Pages. In International Conference on Automated Software Engineering (ASE 2015), pages 709–714. ACM, 2015.
[29] X. Wang, L. Zhang, T. Xie, H. Mei, and J. Sun. Locating Need-to-Translate Constant Strings in Web Applications. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE ’10, pages 87–96, New York, NY, USA, 2010. ACM.

