
A Framework for Parallelizing Large-Scale, DOM-Interacting Web Experiments

Sarah Chasins, UC Berkeley

[email protected]

Phitchaya Mangpo Phothilimthana, UC Berkeley

[email protected]

Abstract

Concurrently executing arbitrary JavaScript on many webpages is hard. For free-standing JavaScript — JavaScript that does not interact with a DOM — parallelization is easy. Many of the parallelization approaches that have been applied to other mainstream languages have also been applied to JavaScript. Existing testing frameworks such as Selenium Grid [19] and Jasmine [9] allow developers to parallelize JavaScript tests that do interact with a DOM. However, these tools frequently limit the user to a very constrained programming model. Although it is possible to build general testing programs that take arbitrary inputs and yield outputs (rather than pass or fail) on top of these systems, the work required to do so is non-trivial. Modern tools offer even less support for controlling input pages, which is necessary to carry out web research on real-world pages. In fact, none of the existing tools provides reproducibility to its users. In response to this paucity of flexible systems, we build a framework designed for running large-scale JavaScript experiments on real webpages. We evaluate the scalability and robustness of our framework. Further, we present the node addressing problem, and show that testing node addressing algorithms can be done easily using our framework. We evaluate a new node addressing algorithm against preexisting approaches, leveraging our framework's clean programming model to streamline this large-scale experiment.

1. Introduction

We describe the design and implementation of a framework that efficiently runs JavaScript programs over many webpages in parallel. We envision that users testing multiple approaches to some JavaScript task can provide each candidate solution as a separate program, and then test all approaches over a large corpus of webpages in parallel. Our framework thus facilitates a type of large-scale web experiment for which researchers, until now, have had no appropriate tools.

We further introduce an application of our framework, a large-scale web experiment that relies on the framework's guarantees. Specifically, we describe the challenge of robust representation of DOM nodes in live pages. The task is to observe a webpage at time t1, describe a given node n from the webpage, then observe the webpage at a later time t2, and identify the time-t2 node that corresponds to n, based on the time-t1 description. A t2 node corresponds to a t1 node if sending the same event to both nodes has the same effect. We term this the node addressing problem. Several node addressing algorithms have been proposed, but never tested against each other. We test these algorithms side by side on a test set of thousands of nodes by structuring the experiment as a series of stages in our framework.

Finally, we created a DOM evolution simulator for creating synthetic DOM changes from existing webpages. Our simulator takes as input a webpage DOM and produces as output a second webpage DOM, whose structure is similar to the first but altered by the types of edits that web designers make over time as pages are redesigned. Thus, given a set of DOMs, our simulator can produce a DOM change that approximates the modification a user would see by pulling DOMs from the same URLs at a later point in time.

1.1 Framework

At present, there are few tools for controlled web research. While there is a preponderance of tools targeted at developers testing their own pages, these tools are not easily applied to the more general problem of running arbitrary tests over real-world pages. We identified five core desirable properties that are key to making a framework capable of running large-scale experiments on real-world webpages. To facilitate large-scale web research, a framework should:

1. Parallelize test execution
2. Allow DOM interaction
3. Run arbitrary JavaScript code
4. Run on live pages from URLs (not only local pages)
5. Guarantee the same input (page) throughout an experiment

No existing tool does all of these. In fact, many do none of the above. Thus, our framework is motivated by the need for a system that offers all the characteristics necessary to carry out controlled, real-world web research.

Our system's central abstraction is the stage. In each stage, the framework takes as input a set of JavaScript algorithms, each with n inputs, and an input table: a set of n-field records. In each record, the first field represents the webpage on which to run. Both this webpage argument and the remaining (n - 1) fields are passed as inputs to the JavaScript algorithms. The output table is a set of records. The output records can have an arbitrary number of fields, and there need not be a one-to-one mapping between input and output records.
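
As a hypothetical illustration (the field values here are invented), an input table for algorithms taking the webpage argument plus two further arguments might look like the following CSV, with one record per line and the page URL first:

    http://example.com,//div[3]/a,expectedText1
    http://example.org,//body/div[1],expectedText2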

Within a JavaScript algorithm, the user may create multiple subalgorithms. In fact, an algorithm is a sequence of subalgorithms. If a subalgorithm causes a page reload, then the next subalgorithm will be run on the new page. If no page reload occurs, the next subalgorithm will be run on the same page.

Our framework also offers the concept of a session. A session is a sequence of stages during which our framework offers a same page guarantee. In web experiments, the webpage is frequently considered an input, just as much as, for instance, any value passed to a JavaScript function. It is thus crucial that pages stay the same across data points, to ensure fairness and correctness of the experimental procedure. Our framework facilitates experiments that require stable page inputs by offering the session abstraction.

We implement the framework to take advantage of the programming model described above by parallelizing at the level of input rows. In addition, we use a custom page cache both to offer our same page guarantee and to improve framework performance.


1.2 Application

To offer robust record and replay for live web pages, a tool must be able to identify the same node in a web page over time, even if the DOM structure of the page changes between record and replay. A naive xpath, which simply traces the path through the DOM tree to the target node, is too fragile; wrapper nodes around a target node or its predecessors break the xpath, as do sibling nodes added before the target node or any of its predecessors. Testing an algorithm for this task is difficult, because the test must be repeated on the webpage many times to determine whether it works in the face of a changing page, and because the sample size must be very large in order to get a large number of (naturalistic) broken paths before the later test. To complete these tests by hand — for many nodes, many times, and many algorithms — is very tedious. To complete them programmatically would still be very time-consuming, using current state-of-the-art tools. To complete them fairly, upholding a same page guarantee where necessary, would require a substantial amount of infrastructure building.

A user can run these tests in our framework with four stages. A first stage takes URLs as input, and produces fragile xpaths for each node on those pages. A second stage, in the same session so that the xpaths do not break, takes the xpaths as input and determines whether each node is reactive, storing the xpaths and reactions of the reactive nodes. A third stage takes the xpaths of the reactive nodes and runs all node addressing algorithms over each node, producing node addresses. This stage must also be part of the same session. A final stage, not a part of the same session, takes the node addresses as input, and runs each node addressing algorithm to identify a corresponding node. It then identifies the reaction of each candidate corresponding node, comparing the reaction with the stage 2 reaction to evaluate each algorithm's success. Rerunning the final stage at a later time, with the same input, tests the algorithms' robustness over a longer period. This stage can be repeated as often as desired.

1.3 DOM Change Simulator

We offer five DOM modifiers, each of which makes a different type of edit to DOM structure, any or all of which may be useful for a given application, depending on the needs of the user. They may be combined as desired into a single simulator. For the application described above, which examines the robustness of DOM node addresses in the face of changing DOM structures, we observed the effects of all five types of modification on node addressing algorithms.

1.4 Organization

In Section 2, we detail the framework's programming model. In Section 3 we discuss the implementation of the framework, and in Section 4 its scalability. Section 5 presents the node addressing application. In Section 6, we discuss our DOM change simulator. Our evaluation of node addressing algorithms appears in Section 7. Section 8 discusses related work, and Section 9 concludes.

2. Framework Programming Model

2.1 Abstractions

Our framework is built around a core set of user-facing abstractions:

• Session: sequence of stages for which we offer the same page guarantee
• Stage: run of a single (input program, input table) pair
• Input Program: set of algorithms
• Input Table: each row corresponds to a single run of all program algorithms; the first field is the page on which to run them, while the others represent the arguments to all algorithms
• Algorithm: JavaScript code to run on all input table rows; a sequence of subalgorithms
• Subalgorithm: component of an algorithm that takes place on a single page
• Output Table: a table with zero or more rows per input table row

Figure 1: A stage, the central abstraction of our framework.

During a session, a program can run multiple program-table pairs through our system. We call each program-table pair a stage. During a single session, if the URL in the leftmost column of an input table is the same in multiple rows (even across tables), the page that is loaded for those rows will also be the same.

Each stage is defined by its input program and its input table. See Figure 1 for a visual representation of a single stage. A stage's input program may contain multiple algorithms to run for each row in the input. Each algorithm may also contain subalgorithms, if the algorithm must run over multiple pages. For instance, if the first subalgorithm finds and clicks on a link, the second subalgorithm will run on the page that is loaded by clicking on the link. If clicking on a link does not cause a new page to load, the second subalgorithm will run instead on the original page.
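
The following sketch illustrates this model; the function names, selector, and registration convention are our own invention, since the paper does not fix a concrete syntax for input programs:

    // Hypothetical input program: an algorithm is an ordered list of subalgorithms.
    // Because subalgorithm 1 triggers a page load, subalgorithm 2 runs on the
    // newly loaded page; otherwise it would run on the original page.
    function findAndClickLink(arg) {
      var link = document.querySelector("a.next");  // invented selector
      if (link) link.click();                       // may cause a page load
    }
    function recordNewUrl(arg) {
      return [[arg, window.location.href]];         // one output row, two fields
    }
    var algorithm = [findAndClickLink, recordNewUrl];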

2.2 Stage Output

Recall that any given input row may correspond to multiple output rows. This, combined with the existence of subalgorithms, complicates the design of our programming model. One option is to require that only the final subalgorithm produce any output. This simplifies the programming model, but it may be undesirable in cases where earlier subalgorithms can access data that the later subalgorithm cannot. For instance, consider a test that finds and clicks on a link, and wishes to compare the URLs before and after clicking. Under this design, the task must be split into two stages, with the first stage storing the original URL, while the second clicks the link and stores the second URL. Because this approach simplifies the programming model, and because splitting such tasks across stages is sufficient to make the approach general, this is the design we have adopted.

One plausible improvement is to allow subalgorithms to pass their output to later subalgorithms, even if they may not directly write to the output table. We believe this modification may indeed be desirable, and if we find applications for which this adjustment would substantially simplify program logic, we may implement it. For now, we find this change unnecessary.

Alternative approaches included allowing all subalgorithms to produce output rows, but requiring that the user's JavaScript tests associate each slice of the row with some sort of row id. By using the id, our system would be able to stitch the row slices together correctly after the completion of the final subalgorithm. Similarly, we could require that the number of output rows be the same across subalgorithms, and use the slices' ordering to produce the correct full rows.

Another possibility would have been to allow each algorithm to produce only one output row for each input row, but we felt this diminished the expressiveness of the model too substantially.


Figure 2: Different algorithms may return different numbers of rows for a given input row. When this occurs, our framework stitches the output rows together according to their order.

We faced a similar design challenge in determining the appropriate way to combine distinct algorithms' outputs. In fact, we considered most of the same solutions, except that all algorithms must be allowed to produce output. Ultimately, we determined that all algorithms should be allowed to produce as many rows of output as desired, even if different algorithms produce different numbers of output rows for the same input row. Output rows are stitched together based on the order in which they are returned by the algorithms, as shown in Figure 2. This approach simplifies users' code for the common cases in which there is only a single algorithm, each algorithm returns only a single row, or each algorithm returns the same number of rows. However, it restricts the model's expressiveness in cases in which algorithms do not know the order in which other algorithms will return their output rows, but still want to achieve a particular lineup with each other. Although we believe this situation is likely to be rare, it would be simple to address such cases by offering the id approach described above in the context of subalgorithm output. We would maintain the ease of programming offered by the current, simpler model by using the current approach by default and switching to the id approach only when the user code signals that it is necessary. The user could switch modes either on a stage-by-stage basis, or even within a single stage.
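
As an illustrative sketch of this stitching rule (not the framework's actual code), assuming each algorithm returns an array of output rows for a given input row:

    // Stitch output rows from several algorithms by their order (see Figure 2).
    // rowsPerAlg[i] is the array of output rows produced by algorithm i.
    function stitch(rowsPerAlg) {
      var height = Math.max.apply(null, rowsPerAlg.map(function (rows) {
        return rows.length;
      }));
      var stitched = [];
      for (var i = 0; i < height; i++) {
        var row = [];
        rowsPerAlg.forEach(function (rows) {
          // An algorithm that produced fewer rows contributes no fields here.
          row = row.concat(rows[i] || []);
        });
        stitched.push(row);
      }
      return stitched;
    }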

2.3 Interleaving Framework Processing and User Processing

A user can control framework execution from within a standard Java program, using our simple stage and session abstractions. This allows the user to interleave framework processing with other processing as necessary. Figure 3 shows how a user could use the output of early stages as the input of later stages, and how the user can interleave processing of user-created code with framework processing.

Note that in the Figure 3 example, the user uses the output of stage 1 as the input for stage 2. The user also inserts a call to the function someProcessing to combine the output of stages 2 and 3 into an input for stage 4.

3. Framework Implementation

3.1 Design

As illustrated in the system diagram in Figure 4, the basic architecture of our framework revolves around creating a set of worker threads, each of which controls a browser instance. The browser instances' web traffic goes through our custom proxy server, which controls the caching we use both for performance and to offer same-page guarantees.

3.2 Web Driver

We have implemented versions of our framework on top of three distinct web automation tools: PhantomJS [16], Ghost.py [10], and Selenium [4]. Each web automation tool offers a different type of browser instance that our framework can control programmatically.

Figure 4: Our system consists of multiple workers running in parallel. Each worker owns a web driver — essentially a browser instance that can be controlled programmatically. The HTTP traffic of the system goes through the caching proxy server.

The first two are headless web browsers, which do not display the webpages with which they interact. The third, Selenium, manipulates an instance of a standard browser. Substituting different web automation tools for each other in our system corresponds to replacing the boxes labeled "WebDriver" in our system diagram in Figure 4. The WebDriver's central role in our system is to load pages. Unfortunately, not all web automation tools load pages equally well. This led us to test the tools' suitability for use as WebDrivers in our framework.

We built implementations of our framework on top of all three of these web automation tools. For each tool, we created both a serial and at least one parallel version. Here we describe the approach of each implementation.

3.2.1 PhantomJS

PhantomJS is a headless WebKit that is controlled with JavaScript.It is termed headless because rather than start a real browser to loada page and run its JavaScript, the WebKit simulates the browser.This reduces startup time substantially, and eliminates the needto display content. Recall that JavaScript is single-threaded. Ourserial PhantomJS implementation waits until each row in the inputtable is fully processed before moving on to the next. Our asyncimplementation is still serial, still runs in the single JavaScriptthread, but uses async.js to improve performance by leveraging theasynchronous nature of the page retrieval and processing task. Ourparallel implementation wraps the asynchronous implementationin another layer. The higher-level layer splits the input table rowsbetween a fixed number of workers, and runs an asynchronousPhantomJS implementation for each slice of the rows.
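
A minimal sketch of one worker's inner loop, assuming a local copy of async.js is available to PhantomJS via require; the title extraction task stands in for arbitrary benchmark code:

    var async = require("./async");          // assumes a local copy of async.js
    var webpage = require("webpage");
    var urls = ["http://example.com", "http://example.org"];  // this worker's slice

    // Keep up to 8 page loads in flight; the single JavaScript thread interleaves
    // their callbacks, which is where the async implementation's speedup comes from.
    async.eachLimit(urls, 8, function (url, done) {
      var page = webpage.create();
      page.open(url, function (status) {
        var title = (status === "success")
            ? page.evaluate(function () { return document.title; })
            : null;
        console.log(url + "," + title);
        page.close();
        done();
      });
    }, function () { phantom.exit(); });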

3.2.2 Ghost.py

Ghost.py is a headless WebKit that is controlled with Python. Our sequential Ghost.py version uses a single worker. Our load balancing implementation uses a shared task queue and a fixed number of workers. To prevent errors from Ghost's page closing approach, we must create a new Ghost instance for each task.

3.2.3 Selenium

Selenium provides APIs for automating web interactions from within several different programming languages. Our Selenium-based implementations use the Java API. Selenium's APIs can also be used to control several different real browsers. Our Selenium-based implementations use Firefox instances. Our first implementation is the standard serial version. Our second splits the input table rows across a fixed number of workers. Our third uses a shared task queue and a fixed number of workers.
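
The two parallelization strategies can be sketched as follows (illustrative JavaScript; our Selenium implementations express the same logic in Java):

    // Static split (s-split, p-split): divide the input rows among workers up front.
    function splitRows(rows, numWorkers) {
      var slices = [];
      var size = Math.ceil(rows.length / numWorkers);
      for (var i = 0; i < rows.length; i += size) {
        slices.push(rows.slice(i, i + size));   // each worker owns one slice
      }
      return slices;
    }

    // Load balancing (s-bal, g-bal): workers repeatedly pull the next row from a
    // shared queue, so a few slow pages cannot leave other workers idle.
    function nextRow(sharedQueue) {
      return sharedQueue.length > 0 ? sharedQueue.shift() : null;
    }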


framework.startSession();
framework.startStage(js1, inputCsv1, outputCsv1);
// Below, the user strings together stages, using the results of stage 1 as the input for stage 2.
framework.startStage(js2, outputCsv1, outputCsv2);
framework.startStage(js3, inputCsv3, outputCsv3);
// Below, the user does some processing to combine results of stages 2 and 3 into input for stage 4.
inputCsv4 = someProcessing(outputCsv2, outputCsv3);
framework.startStage(js4, inputCsv4, outputCsv4);
framework.endSession();

Figure 3: Interleaving user code and framework processing.

Figure 5: Median execution time, across implementations, on a 500-row stage.

3.2.4 Performance Evaluation

To test our implementations, we ran a simple title extraction benchmark over the homepages of the first 500 sites on the Alexa top sites list [5]. This experiment was run on an 8-core Core i7-2600 @ 3.40GHz machine. We tested the following seven implementations:

1. Selenium-sequential (s-seq)
2. Selenium-split (s-split)
3. Selenium-loadbalance (s-bal)
4. Ghost-sequential (g-seq)
5. Ghost-loadbalance (g-bal)
6. PhantomJS-sequential (p-seq)
7. PhantomJS-split (p-split)

The non-sequential implementations each used 8 threads (or processes).

Figure 5 displays the median execution time in minutes. As expected, Selenium-loadbalance finishes before Selenium-split, and Selenium-split finishes before Selenium-sequential. However, we obtained only a 4.2x speedup from load balancing across 8 threads. We believe this limited speedup stems from multiple Selenium instances, or Selenium-launched Firefox instances, interfering with each other. While network bandwidth may play a role, we find this unlikely given the substantial speedup we observe for PhantomJS. Our Ghost.py implementation did not benefit from parallelization. Our PhantomJS implementation did benefit greatly, achieving a 31x speedup from parallelization. This speedup reflects the benefits of the additional workers, but also the benefits of the async implementation — recall that each worker runs not an instance of the serial PhantomJS implementation, but of the faster async PhantomJS implementation. PhantomJS implementations sometimes hung indefinitely, so we could not include those data points in our evaluation.

3.2.5 Reliability

Running the simple title benchmark over the first 500 sites in the Alexa top sites list [5], we measured three kinds of undesirable outcomes: server unreachable errors (Figure 6(a)), timeouts (Figure 6(b)), and wrong outputs (Figure 6(c)). While server unreachability is outside of the web automation tools' control, the tools are sometimes the cause of timeouts, and always the cause of wrong outputs.

Our results revealed that PhantomJS and Ghost.py did not reliably produce the correct (human-identified) outputs. While PhantomJS provided the correct titles for most pages, it handled redirects poorly, often giving the title associated with a redirect, rather than the one associated with the final destination. Further, although this issue did not appear for the simple JavaScript code that retrieves a document title, we also found that for some JavaScript tests, running the code in PhantomJS failed to produce the same effects that it produced in Selenium, Ghost.py, and normal browsers. Since our framework targets arbitrary JavaScript code, this was unacceptable.

Ghost.py handles redirects correctly, but it times out on a large number of pages, and does not support non-English output. Further, although Ghost.py provides an API for accessing new pages loaded by interacting with a page, this API does not apply when the loading interaction is completed by a JavaScript program. This problem appears to be a known issue that the Ghost.py developers have not yet addressed. Because our framework targets arbitrary JavaScript code, including DOM-interacting code, and because it targets all pages, including non-English pages, this was unacceptable.

3.2.6 Web Driver Selection

Because PhantomJS and Ghost.py failed to provide the reliability and robustness so crucial to our goals, we chose to build our system on top of Selenium.

Selenium's times, although slow in the sequential version, became competitive with a load balanced implementation. Even more importantly, it produced no incorrect outputs. Because our framework is intended to facilitate web research, it is essential that the results be trustworthy. Ultimately, we preferred seeing more rows without answers to seeing rows with wrong answers. Rows without answers are an unavoidable component of web experiments, servers typically being outside of the experimenters' control. Thus, we treated wrong outputs as the deciding factor, and consequently selected Selenium as the provider of our framework's web driver.

3.3 Caching Proxy Server

Our framework enforces the same-page guarantee through the use of a caching proxy server. The framework directs all HTTP request-response traffic through the caching proxy server, as illustrated in Figure 4. All pages for a given session are served from a single cache.


Figure 6: Instances of several types of bad outcomes, across implementations, on a 500-row benchmark: (a) server unreachability, (b) timeouts, (c) wrong outputs.

Despite the preponderance of existing caching proxy servers, none proved sufficiently controllable to meet our needs. Squid cache [18, 20] can be configured to ignore some web cache policy parameters, such as 'no-cache,' 'must-revalidate,' and 'expiration.' However, it cannot be configured to ignore others — for instance, 'Vary.' As an example, unmodified Squid will never cache stackoverflow.com, because of the presence of "Vary: *" in the header. Apache Traffic Server [6] provides even less cache configuration control. Polipo [8] is a much smaller-scale caching proxy server that can be configured to ignore all caching policy parameters, but it is very fragile: the server crashes after serving our framework's HTTP traffic for less than 10 minutes.

To achieve the control our framework demands, we implemented our own custom caching proxy server. Our server stores every response into its cache before forwarding the response back to Selenium. It ignores all web cache policy parameters in the HTTP header. The same request (URL) from Selenium always elicits the same response from the server. Note that the proxy server does not currently handle HTTPS requests, although this functionality can be built in. The proxy server stores each session's cache in its own directory, with each response in a separate URL-identified file, relying on the OS to load data from disk. A more efficient store would trivially improve the proxy server's performance. Without this addition, however, good performance can be obtained by grouping requests to the same URL.
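
A sketch of the serving logic, using an in-memory map in place of our server's per-session, URL-identified files (names are illustrative):

    // Serve every request from the session cache, ignoring HTTP caching headers.
    // cache maps request URL -> stored response body.
    function handleRequest(cache, url, fetchFromWeb) {
      if (Object.prototype.hasOwnProperty.call(cache, url)) {
        return cache[url];                 // same request (URL), same response
      }
      var response = fetchFromWeb(url);    // forward to the real server
      cache[url] = response;               // store before forwarding back
      return response;
    }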

Our cache had to handle the challenge of cyclic redirects. Some pages use a cyclic redirect process, redirecting a request for URL X to URL Y (with a 'no-cache' policy), and redirecting a request for Y to X, until eventually the originally requested X is ready, and the X response is no longer a redirect. At this point, the response for X contains the final page content that our cache should associate with URL X.

In this scenario, if the proxy server caches everything and there is no mechanism to clear the cache, our system will loop forever: the proxy server will always return the redirects. Our cache addresses this issue by maintaining a redirect table, mapping request URLs to their redirect URLs. Upon receiving a redirect response, the cache checks whether adding the redirect response to the table will create a size-two redirect cycle. If it will, the cache removes the pre-existing redirect entry that causes the cycle. This technique is limited to redirect cycles of size two, but since we have not yet found a redirect cycle of size greater than two, we feel the performance benefits of avoiding full cycle detection justify this limitation.
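
The size-two cycle check can be sketched as follows, where redirects is the redirect table described above (an illustrative sketch, not our server's exact code):

    // Record a redirect from -> to in the table, removing any entry that would
    // complete a size-two cycle (from -> to -> from), so replay cannot loop.
    function recordRedirect(redirects, from, to) {
      if (redirects[to] === from) {
        delete redirects[to];   // evict the pre-existing entry causing the cycle
      }
      redirects[from] = to;
    }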

3.4 User Interaction

As discussed in Section 2.3 above, the user controls our framework from a Java program, creating an instance of the framework, then using the session and stage abstractions. She may interleave this processing with processing of her own.

The other inputs are all the files that are passed to the stages: a JavaScript file for each stage, with a function for each subalgorithm, and a table for each stage, with the URLs and algorithm inputs.

4. Framework Evaluation

To briefly evaluate the scalability of our complete framework, we ran the title extraction benchmark on the first 10,000 sites in the Alexa top sites list [5]. We ran this benchmark on the same machine used in the Section 3.2.4 experiments. We recorded the execution time for every increment of 100 sites. The results appear in Figure 7.

The data reveals that both the completion time and the number of timeouts scale linearly with the size of the input. We conclude that our framework is sufficiently stable, and that its performance meets the needs of the large-scale experiments our framework targets.

5. The Node Addressing Problem

As discussed in Section 1.2, web tools — such as record and replay systems — require a node addressing algorithm that keeps nodes addressable even in the face of pages' changing DOMs. A simple xpath that traces the path from the root to the target node breaks when new wrapper nodes are added, and when new sibling nodes are added before any node along that path. This may occur as the page is redesigned, or simply in response to user actions.


Figure 7: Scalability test on the simple title extraction benchmark: (a) completion time, (b) timeouts.

For instance, many pages add nodes that make suggestions based on recent user actions, as when an airline site suggests that users repeat recent flight searches. Node ids and classes change as pages are redesigned or if a page uses obfuscation, and many nodes lack ids and classes in the first place.

Figure 9 illustrates a few of the characteristics that can be used to develop a node representation, including the xpath from root to node, the node type (a, div, p, and so on), the id, the class, and the inner text.

Figure 10 illustrates some of the common changes that are made to DOMs over time, the sorts of changes that can derail node addressing algorithms.

5.1 Node Addressing Algorithms

Several solutions have been proposed for the node addressing problem, but they have never been tested in a controlled experiment.

We consider the following five plausible node addressing algorithms:

1. xpath: Records the path from the root of the DOM tree to the target node. To identify the corresponding node, it follows the same path, returning null if no node matches.

2. id: Selects the first node whose id matches the recorded id; otherwise returns null.

3. class: Selects the first node whose class matches the recorded class; otherwise returns null.

4. iMacros: Collects the list of nodes with the same node type and inner text as the target node, and records the target node's position in this list. To identify the corresponding node, it constructs the same list on the new page and selects the item at the target node's original index, returning null if no such node exists.

Figure 8: A visualization of the DOM representation algorithm testing task, split into stages for our framework.

Figure 9: An illustration of some of the DOM node characteristics that can be used to construct a DOM 'address.'

5. Ringer: Collects the xpath as described above, along with the id, the class, and the text. To identify the corresponding node, it uses six strategies: the original xpath, xpath suffixes, common variations on the xpath, the class, the id, and the text. Each strategy votes for up to one node; the algorithm returns the most voted-for node, or null if no node receives votes.

The first three algorithms, xpath, id, and class, each use only a single characteristic to identify the corresponding node, relying on that characteristic to identify exactly one node. The latter two, iMacros and Ringer, are in use by real web tools.


Figure 10: A visualization of the types of changes that are made to DOMs as they are redesigned or obfuscated.

The iMacros algorithm comes from the approach used by the iMacros [1] web scripting tool. The Ringer approach is also in use by a real tool, Ringer [3], which is a web record and replay system. We are particularly interested in the Ringer approach, which is being developed for the Ringer tool by Shaon Barman and one of the authors of this paper.
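
As rough sketches of the record/identify split (the address formats here are illustrative), the xpath and iMacros-style strategies might be implemented as:

    // xpath strategy: record the path at time t1; at time t2, follow the same
    // path, yielding null if the path no longer matches a node.
    function identifyByXpath(recordedXpath) {
      return document.evaluate(recordedXpath, document, null,
          XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    }

    // iMacros-style strategy: the recorded address is the target's index among
    // nodes sharing its tag and inner text (addr = {tag, text, index}).
    function identifyByTypeTextIndex(addr) {
      var matches = Array.prototype.filter.call(
          document.getElementsByTagName(addr.tag),
          function (n) { return n.textContent === addr.text; });
      return matches[addr.index] || null;
    }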

5.2 Testing Node Addressing Algorithms

Our framework offers a means of testing node addressing algorithms in a fair setting. For testing in our framework, the experiment is split into four stages, as depicted in Figure 8. The first stage traverses the DOM tree, recording an xpath for each node. The second stage, with an input row for each xpath, clicks on the node at the xpath. Recall that finding a node on one page and then finding it again after a reload is extremely difficult; this problem is in fact the entire motivation for this application. Even after reloading a page only a few seconds later, the original xpath may break. It is thus crucial that stages 1 and 2 occur in the same session, seeing the same instance of each page during both stages. Stage 2 also compares the pre-click URL with the post-click URL, producing an output row only if the URL has changed. That is, each output row corresponds to a reactive node. If clicking on the node has no effect on the state, it will be impossible to check (in stage 4) whether the stage 4 node corresponds to the stage 2 node. Since, for this experiment, we take the URL as a proxy for the state, the URL must change in order for a node to be considered reactive. Stage 3 takes the xpaths of all reactive nodes as input, and runs each node addressing algorithm on each node, producing the node addresses for each node as output.

The fourth stage takes the node addresses as input. It uses the addresses and the corresponding node addressing algorithms to find the appropriate node on the new page, and then click on it. It produces the new post-click URL as output. The new post-click URL can be compared with the stage 2 post-click URL to determine whether each algorithm successfully identified the corresponding node. Figure 11 offers a pictorial representation of the stage 4 algorithm, showing the different node addressing approaches as different algorithms, and the clicking and URL inspecting functions as different subalgorithms.

Thus, this experiment can be cleanly divided into four stages in our framework. Note that the experiment relies on all five of the crucial characteristics identified in Section 1.1:

1. Parallelize test execution — a thorough test demands running this task on many nodes, in order to reveal a sufficiently large number of naturalistically broken addresses to distinguish between approaches, which makes parallel execution highly desirable

Figure 11: A visualization of the stage 4 input program, showing the first two node addressing algorithms. For each algorithm, the first subalgorithm uses the node address to identify the corresponding node according to the given node addressing algorithm. It then clicks. The second subalgorithm checks the URL of the resultant page.

2. Allow DOM interaction — the test must be able to click on the algorithm-identified nodes

3. Run arbitrary JavaScript code — the algorithms cannot be limited to, for instance, returning pass or fail as their outputs

4. Run on live pages from URLs (not only local) — the test should run on the real top Alexa sites, not DOMs built up locally

5. Guarantee same input (page) through experiment — stages 1 through 3 require a same page guarantee

In Table 1, we show content descriptions for the input and output files of each stage.

6. DOM Change Simulator

Testing node addressing algorithms requires a suite of DOMs, each with multiple versions. Alternatively, testing can work over DOMs with naturalistic changes introduced to simulate DOM changes over time. To explore the latter approach, we created a small suite of DOM edits that can be combined to create various DOM workload simulators. We implemented five types of DOM edits:

Wrapper: wrap first-level divs
Wrap every div node that is a direct child of the body node with a center node.


input 1:            url
output 1, input 2:  url | url after redirects | xpath
output 2, input 3:  url | url after redirects | xpath | post-click url
output 3, input 4:  url | url after redirects | post-click url | node address (alg 1) | node address (alg 2) | ...
output 4:           url | url after redirects | post-click url | post-click url (alg 1) | post-click url (alg 2) | ...

Table 1: Column descriptions for the input and output files, for each stage of the node addressing testing task.

Figure 12: A visualization of our five DOM edits.

Insert: insert many nodes
For every div node, insert as the div's first child a new node with the same tag as the div's original first child.

Type: span to p
Convert every span node into a p node.

Move: become sibling's child
Move every node whose next sibling is a div so that it becomes that sibling's first child.

Text: modify node text
Add a letter to the inner text of each node.

Figure 12 gives a pictorial representation of the effects of these edits.
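
As an illustration, the wrapper and type edits each reduce to a few lines of DOM manipulation (a sketch; our simulator's actual code may differ):

    // Wrapper: wrap every first-level div in a center node.
    Array.prototype.slice.call(document.querySelectorAll("body > div"))
      .forEach(function (div) {
        var wrapper = document.createElement("center");
        div.parentNode.replaceChild(wrapper, div);
        wrapper.appendChild(div);
      });

    // Type: convert every span into a p, preserving attributes and children.
    Array.prototype.slice.call(document.getElementsByTagName("span"))
      .forEach(function (span) {
        var p = document.createElement("p");
        while (span.attributes.length > 0) {
          var attr = span.attributes[0];
          span.removeAttribute(attr.name);
          p.setAttribute(attr.name, attr.value);
        }
        while (span.firstChild) p.appendChild(span.firstChild);
        span.parentNode.replaceChild(p, span);
      });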

7. Node Addressing Evaluation

For the purposes of this work, we consider two correctness conditions. The first, a conservative estimate of success rate, considers a result correct if the stage 4 post-click URL is the same as the stage 2 post-click URL. This produces some false negatives. Consider clicking the 'Random Article' link on wikipedia.org, or the top story on a news site: clicking on the correct node will lead to different URLs during different runs. However, whenever this correctness condition is met, the node addressing algorithm has definitely succeeded. Thus, this condition offers a lower bound on robustness.

Figure 13: Node addressing algorithm success rates on day 0 (the day the node addresses were recorded) and day 6. In each grouping, the left bar represents day 0 performance, and the right bar represents day 6 performance. Lower bound is indicated with 'lb' and upper bound with 'ub.'

The second correctness condition may overapproximate success rates, considering a result successful if the stage 4 post-click URL is different from the stage 4 pre-click URL. This approach correctly handles the 'Random Article' and top story cases described above, but may also produce false positives: it allows an algorithm to click on any URL-changing node and still succeed, regardless of whether it is the corresponding node. However, if an algorithm fails by this criterion, it has definitively failed; if no URL effect is produced, the identified node was not the corresponding node. Thus, this condition offers an upper bound on robustness.
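
In code, the two criteria reduce to simple URL comparisons (a sketch using the stage names from above):

    // Lower bound (first condition): definite success only if the later
    // post-click URL matches the stage 2 post-click URL.
    function definitelySucceeded(stage2PostClickUrl, stage4PostClickUrl) {
      return stage4PostClickUrl === stage2PostClickUrl;
    }

    // Upper bound (second condition): definite failure only if clicking the
    // identified node produced no URL change at all.
    function possiblySucceeded(stage4PreClickUrl, stage4PostClickUrl) {
      return stage4PostClickUrl !== stage4PreClickUrl;
    }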

7.1 Robustness Over Time

To evaluate node addressing algorithms' robustness over time, we compared their immediate performance with their performance after a delay. Specifically, we ran them once on the same day that the node addresses were recorded — day 0 — and once almost a week later, on day 6.

We ran our experiment on the top 15 sites in the Alexa top sites list [5], executing the four stages as described in Section 5 and running stage 4 on both day 0 and day 6. The 15 sites yielded 11,373 nodes, 1,106 of which were reactive.

The algorithms' success rates on the day 0 and day 6 runs appear in Figure 13. The lighter colored bars represent the success rate under the first correctness criterion, that is, the lower bound.


Figure 14: Node addressing algorithm success rates across DOM edit types. Lower bound is indicated with 'lb' and upper bound with 'ub.'

The darker colored portions of the bars represent the additional percentage points gained by using the second correctness criterion, the upper bound on the true success rate.

As Figure 13 makes evident, the id and class approaches are largely unsuccessful. In contrast, xpath, iMacros, and Ringer all prove reasonably robust. Ringer definitively outperforms iMacros on day 0, with the lower bound on Ringer's success rate being higher than the upper bound on iMacros'. While Ringer's success region overlaps with xpath's on day 0, both its upper and lower bounds are higher than xpath's respective bounds. The success region for xpath on day 0 does overlap with iMacros', but barely; xpath comes quite close to definitively outperforming iMacros.

All three of the successful approaches see their performance degrade with the passage of time. On day 6, Ringer's upper bound is still greater than all other approaches' upper bounds, and its lower bound is still greater than all other approaches' lower bounds; however, note that it no longer definitively dominates the iMacros approach, there being some overlap in their success regions. Nevertheless, since the lower bound on Ringer's day 6 performance is in fact higher than the lower bound on iMacros' day 0 performance, we consider Ringer's approach largely successful.

Overall, we conclude that both iMacros and Ringer appear to degrade more slowly over time than xpath. However, since Ringer's original performance solidly dominates iMacros', we conclude that the Ringer approach is the more robust.

7.2 Synthetic Benchmark

We evaluated the node addressing algorithms using our synthetic DOM evolution simulators on 4 sites: amazon.com, wordpress.com, bing.com, and ask.com. Each run consists of a single session with 9 stages. The first 3 stages correspond to stages 1-3 in the node addressing task, described in Section 5. The last 6 stages are variations on stage 4, node retrieval. The node retrieval stage runs with 6 different types of DOM modifications: none (original baseline), wrapper, insert, type, move, and text.

This benchmark identified 63 reactive nodes out of 773 total DOM nodes. Figure 14 shows the percentage of nodes that each algorithm successfully identifies. The results for id and class are excluded, since they identified none of the nodes correctly. The 100% success rate of the xpath algorithm when no DOM changes are present confirms that our framework provides the same page guarantee within a session. Note that iMacros could not correctly identify all nodes even when no changes had been applied to the DOMs, because node type, inner text, and list position did not provide enough information.

As evidenced by Figure 14, xpath performs poorly on wrapper, insert, and move. These edits altered the xpaths to most DOM nodes, substantially degrading xpath's effectiveness. It performed well on type because most nodes were not of type span, and it could correctly identify all nodes when only the text content was changed. Ringer consistently performed better than xpath. On insert and move, it benefited greatly from considering xpath suffixes and common variations on the xpath. However, Ringer performed worse than iMacros on insert and move, since Ringer gives xpaths so much weight. iMacros' success rates were relatively consistent across different types of DOM changes, with the exception of text. As expected, iMacros could not identify any node correctly when text content was modified.

8. Related Work

8.1 Frameworks

A large number of tools have been developed for the purpose of running JavaScript tests. Almost all are targeted towards web developers who want to test their own pages, or even only their own JavaScript. Below we cover the main subcategories of this class of tools.

Jasmine [9] is one of the most prominent systems for running JavaScript unit tests. While its ease of use makes it a good platform for small-scale experiments, it lacks many of the characteristics we desire for large-scale, general purpose web research. First, its parallelization mechanism is quite limited. Second, it uses a restrictive programming model, tailored to offer pass/fail responses for each test. Third, it only runs on locally constructed DOMs — it cannot be fed a URL for its tests.

These limitations characterize a large portion of the JavaScript testing space, which very heavily tailors platforms to web developers with limited experimental needs. This category of tool includes projects such as QUnit [17], Mocha [15], and YUI Test [14]. In fact, QUnit, Mocha, and YUI Test do not offer even the limited parallelization that Jasmine provides.

Some tools, like Vows [11], offer parallelization, but are aimed only at testing JavaScript. These typically run on Node, which eliminates any DOM-interactive code from their domains, and naturally any URL-loaded code.

Finally, we consider the three web automation tools on which we built our framework implementations: Ghost.py [10], PhantomJS [16], and Selenium [4]. None of these is explicitly a testing framework. Rather, they are generic web automation tools. Clearly it is possible to build up a testing framework on top of any of them, as this paper has shown. However, we found that the amount of infrastructure and insight necessary to do so was far from trivial. Ghost.py and PhantomJS have no built-in parallelization. There is a variation on Selenium, Selenium Grid [19], that offers parallelization. We note, however, that it is tailored for users who want to run the same tests on multiple browsers and on multiple operating systems, rather than for users with large-scale experiments. In fact, we found the Selenium Grid approach sufficiently unwieldy for our needs that we chose to implement our own parallelization layer.

Ghost.py, PhantomJS, and Selenium do all offer DOM interaction, the ability to run arbitrary JavaScript code, and the ability to load pages from URLs, all characteristics that made them acceptable candidates for serving as our framework's web driver.


However, in the case of Ghost.py and PhantomJS, we found that the ability to run arbitrary code was sometimes hindered by the tools' limited robustness.

Ghost.py, PhantomJS, and Selenium, like all the projects described here, lack any support for same page guarantees. Ultimately, most tools, being targeted towards developers, are targeted towards users who know their test pages will stay the same, or know how they will change. This makes them generally unsuitable for broader web research.

8.2 Node Addressing

There are several existing web tools that require robust node addressing algorithms. Therefore, despite the paucity of literature on the node addressing problem, several solutions have been put into practice.

We have already described two such solutions, iMacros and Ringer. The iMacros [1] approach uses node type and node text to identify a list of nodes, and then uses the target node's position in that list as an address. The Ringer [3] approach uses an xpath, several variations on the xpath, the id, the class, and the text in a voting scheme to identify corresponding nodes.

CoScripter, another web tool which offers some record and replay functionality, takes an iMacros-style approach [12]. Where possible, it associates a target node with text, whether it be the text within the node or text that precedes it — as when a textbox appears to the right of a label. When no related text is available — for instance, if the node is a search button that displays a magnifying glass icon rather than the word 'search' — CoScripter falls back on the position approach. ActionShot [13] also uses CoScripter's technique.

Some tools take even more fragile approaches, relying on programmers to adjust the node addresses by hand as appropriate. For instance, Selenium IDE offers a record and replay option that describes nodes by id when available, describes links by their inner text contents, and backs off to an approach that describes nodes by the combination of node type and parent node type.

Other tools take wholly different approaches. For instance, Chickenfoot [7] allows users to access and interact with nodes via high level commands and pattern-matching. Sikuli [21] takes a visual approach, using screenshots and image recognition to identify nodes.

8.3 DOM Workloads

To our knowledge, there have been no previous tools for DOM evolution simulation. To this point, researchers conducting web experiments who need realistically changing DOM workloads appear to have had two main options. First, they could collect their own workloads by pulling DOMs at the desired intervals. If a researcher needed control over the amount of change, this was the only option. Alternatively, researchers could use DOMs pulled down by archiving operations such as the WayBack Machine [2]. However, this option gives researchers no control over the intervals between collection points, and there is no guarantee that a researcher's sites of interest will have been archived.

9. Conclusion

Existing tools for running JavaScript tests in parallel are limited. They offer limited parallelization, limited DOM interaction, limited programming models, often run only on locally constructed DOMs, and never provide same page guarantees. Our framework offers a system with none of these limitations. The node addressing application revealed that our framework can be used to cleanly structure large and complicated web experiments. We found that our new node addressing approach is more effective than a prior state-of-the-art algorithm. Finally, we introduced a DOM evolution simulator for generating DOM workloads with realistically modified structures. We obtained interesting insights into node addressing algorithms through their use on our synthetic workloads. Ultimately, we believe that large-scale web research is an increasingly important area, with exciting problems to solve and many crucial insights to uncover. Increased tool support should accelerate the pace of discovery in this burgeoning young field of study.

References

[1] Browser scripting, data extraction and web testing by iMacros. http://www.iopus.com/imacros/.
[2] Internet Archive: Wayback Machine.
[3] sbarman/webscript. https://github.com/sbarman/webscript.
[4] Selenium - Web browser automation. http://seleniumhq.org/.
[5] Alexa. Alexa top sites. http://www.alexa.com/topsites, October 2013.
[6] Apache. Apache Traffic Server. http://trafficserver.apache.org/.
[7] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. Automation and customization of rendered web pages. UIST '05.
[8] Juliusz Chroboczek. Polipo: A caching web proxy. http://www.pps.univ-paris-diderot.fr/~jch/software/polipo/.
[9] Jasmine. Jasmine introduction-1.3.1.js. http://pivotal.github.io/jasmine/, October 2013.
[10] Jeanphix. Ghost.py. http://jeanphix.me/Ghost.py/, December 2013.
[11] Vows. Asynchronous BDD for Node. http://vowsjs.org/, December 2013.
[12] Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: Automating & sharing how-to knowledge in the enterprise. CHI '08.
[13] Ian Li, Jeffrey Nichols, Tessa Lau, Clemens Drews, and Allen Cypher. Here's what I did: Sharing and reusing web activity with ActionShot. CHI '10.
[14] YUI Library. YUI Test. http://yuilibrary.com/yui/docs/test/.
[15] Mocha. Mocha - the fun, simple, flexible JavaScript test framework. http://visionmedia.github.io/mocha/.
[16] PhantomJS. http://phantomjs.org/, December 2013.
[17] QUnit. http://qunitjs.com/.
[18] SquidCache. Squid: Optimising web delivery. http://www.squid-cache.org/, May 2013.
[19] ThoughtWorks. Selenium Grid. http://seleniumgrid.thoughtworks.com/, October 2013.
[20] Duane Wessels and k claffy. ICP and the Squid web cache. IEEE Journal on Selected Areas in Communications, 16:345-357, 1998.
[21] Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. Sikuli: Using GUI screenshots for search and automation. UIST '09.
