Page 1: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Software Security Research Group (SSRG), University of Ottawa
In collaboration with IBM

A Statistical Approach for Efficient Crawling of Rich Internet Applications

M. E. Dincturk, S. Choudhary, G.v.Bochmann, G.V. Jourdan, I.V. Onut

University of Ottawa, Canada

Presentation given at the International Conference on Web Engineering (ICWE), Berlin, 2012

Page 2: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Abstract

Modern web technologies, such as AJAX, result in more responsive and usable web applications, sometimes called Rich Internet Applications (RIAs). Traditional crawling techniques are not sufficient for crawling RIAs. We present a new strategy for crawling RIAs. This new strategy is designed based on the concept of "Model-Based Crawling" introduced in [3] and uses statistics accumulated during the crawl to select what to explore next with a high probability of uncovering new information. The performance of our strategy is compared with our previous strategy, as well as with the classical Breadth-First and Depth-First, on two real RIAs and two test RIAs. The results show that this new strategy is significantly better than the Breadth-First and Depth-First strategies (which are widely used to crawl RIAs), and that it outperforms our previous strategy while being much simpler to implement.

Page 3: A Statistical Approach for Efficient Crawling of Rich Internet Applications

SSRG Members, University of Ottawa:
– Prof. Guy-Vincent Jourdan
– Prof. Gregor v. Bochmann
– Suryakant Choudhary (Master student)
– Emre Dincturk (PhD student)
– Khaled Ben Hafaiedh (PhD student)
– Seyed M. Mir Taheri (PhD student)
– Ali Moosavi (Master student)

In collaboration with Research and Development, IBM® Rational® AppScan® Enterprise:
– Iosif Viorel Onut (PhD)

Page 4: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Overview

Background
– The evolving Web - Why crawling
RIA crawling
– State identification - State equivalence
Crawling strategies
– Crawling objectives - Breadth-first and Depth-first - Model-based strategies
Statistical approach
– Probabilistic crawling strategy - Experimental results
On-going work and conclusions

Page 5: A Statistical Approach for Efficient Crawling of Rich Internet Applications

The evolving Web

Traditional Web
– static HTML pages identified by a URL
– HTML pages dynamically created by the server, identified by a URL with parameters
Rich Internet Applications (Web 2.0)
– pages contain executable code (e.g. JavaScript, Silverlight, Adobe Flex...), executed in response to user interactions or time-outs (so-called events); a script may change the displayed page (the "state" of the application changes) – with the same URL
– AJAX: a script may interact asynchronously with the server to update the page

Page 6: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Why crawling

Objective A: find all (or all "important") pages
– for content indexing
– for security testing
– for accessibility testing
Objective B: find all links between pages
– for ranking pages, e.g. Google ranking in search queries
– for building a graph model of the application
• pages (or application states) are nodes
• links (or events) are edges between nodes

Page 7: A Statistical Approach for Efficient Crawling of Rich Internet Applications

IBM Rational AppScan Enterprise Edition

Product overview

IBM Security Solutions

Page 8: A Statistical Approach for Efficient Crawling of Rich Internet Applications

IBM Rational AppScan Suite – Comprehensive Application Vulnerability Management

[Diagram: AppScan products mapped onto the development lifecycle – REQUIREMENTS, CODE, BUILD, QA, PRE-PROD, PRODUCTION – under an overall SECURITY banner ("Application Security Best Practices – Secure Engineering Framework"):]
– Security Requirements Definition: security requirements defined before design & implementation
– AppScan Source: build security testing into the IDE
– AppScan Build: automate security / compliance testing in the build process
– AppScan Tester / AppScan Standard: security / compliance testing incorporated into testing & remediation workflows
– AppScan onDemand: outsourced testing for security audits & production site monitoring
– AppScan Enterprise / AppScan Reporting Console: security & compliance testing, oversight, control, policy, audits

Page 9: A Statistical Approach for Efficient Crawling of Rich Internet Applications

View detailed security issues reports
– Security issues identified with static analysis
– Security issues identified with dynamic analysis
– Aggregated and correlated results
– Remediation tasks
– Security risk assessment

Page 10: A Statistical Approach for Efficient Crawling of Rich Internet Applications

RIA Crawling

Difference from traditional web
– The HTML DOM structure returned by the server in response to a URL may contain scripts.
– When an event triggers the execution of a script, the script may change the DOM structure – which may lead to a new display and a new set of enabled events, that is, a new state of the application.
Crawling means:
– finding all URLs that are part of the application, plus
– for each URL, finding all states reached (from this "seed" URL) by the execution of any sequence of events
• Important note: only the "seed" states are directly accessible by a URL

Page 11: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Difficulties for crawling RIAs

State identification
– A state cannot be identified by a URL.
– Instead, we consider that the state is identified by the current DOM in the browser.
Most links (events) do not contain a URL
– An event included in the DOM may not explicitly identify the next state reached when this event is executed.
– To determine the state reached by such an event, we have to execute that event.
• In traditional crawling, an event contains the URL (the identification) of the next state reached (but not in RIA crawling).

Page 12: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Important consequence

For a complete crawl (a crawl that ensures that all states of the application are found), the crawler has to execute all events in all states of the application
– since for any of these events, we do not know, a priori, whether its execution in the current state will lead to a new state or not.
– Note: In the case of traditional web crawling, it is not necessary to execute all events on all pages; it is sufficient to extract the URLs from these events and get the page for each URL only once.

Page 13: A Statistical Approach for Efficient Crawling of Rich Internet Applications

RIA: Need for DOM equivalence

A given page often contains information that changes frequently, e.g. advertising or time-of-day information. This information is usually of no importance for the purpose of crawling.
In the traditional web, the page identification (i.e. the URL) does not change when this information changes.
In RIAs, states are identified by their DOM. Therefore, similar states with different advertising would be identified as different states (which leads to an excessively large state space).
– We would like to have a state identifier that is independent of the unimportant changing information.
– We introduce a DOM equivalence, and all states with equivalent DOMs have the same identifier.

Page 14: A Statistical Approach for Efficient Crawling of Rich Internet Applications

DOM equivalence

The DOM equivalence depends on the purpose of the crawl.
– In the case of security testing, we are not interested in the textual content of the DOM;
– however, this content is important for content indexing.
The DOM equivalence relation is realized by a DOM reduction algorithm which produces (from a given DOM) a reduced, canonical representation of the information that is considered relevant for the crawl.
If the reduced DOMs obtained from two given DOMs are the same, then the given DOMs are considered equivalent, that is, they represent the same application state (for this purpose of the crawl).
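A DOM reduction of this kind can be sketched as follows. This is a minimal illustration, assuming a security-oriented crawl in which text content is irrelevant; the crawler's actual reduction algorithm is not detailed on this slide, and the element names below are purely illustrative.

```python
# Minimal sketch of a DOM reduction for a security-oriented crawl:
# keep tag names and event-handler attribute names, drop text content.
from xml.etree import ElementTree

def reduce_dom(fragment):
    """Return a canonical string for the given (X)HTML fragment."""
    root = ElementTree.fromstring(fragment)
    parts = []
    for elem in root.iter():
        # Keep only event-handler attributes (on*); ignore text and values.
        handlers = sorted(k for k in elem.attrib if k.startswith("on"))
        parts.append(elem.tag + "".join("@" + h for h in handlers))
    return "|".join(parts)

# Two DOMs differing only in changing text (e.g. advertising) reduce to
# the same canonical form, so they identify the same application state:
assert reduce_dom('<div><span onclick="f()">ad 1</span></div>') == \
       reduce_dom('<div><span onclick="f()">ad 2</span></div>')
```

In a content-indexing crawl the reduction would instead keep the text; as the slide notes, the equivalence relation is a parameter chosen for the purpose of the crawl.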

Page 15: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Form of the state identifiers

The reduced DOM could be used as the state identifier
– however, it is quite voluminous
• we have to store the application model in memory during its exploration, and each edge in the graph contains the identifiers of the current and next states.
Condensed state identifier:
– a hash of the reduced DOM
• used to check whether a state obtained after the execution of some event is a new state or a known one
– the crawler also stores, for each state, the list of events included in the DOM and whether they have been executed or not
• used to select the next event to be executed during the crawl
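The bookkeeping described above can be sketched as follows. Class and method names are our own illustration, not the authors' implementation; the hash function is an arbitrary choice.

```python
# Sketch of condensed state bookkeeping: each state is keyed by a hash of
# its reduced DOM and carries its event list with executed/unexecuted flags.
import hashlib

class StateTable:
    def __init__(self):
        self.states = {}  # state id -> {event name: executed flag}

    def state_id(self, reduced_dom):
        # Condensed identifier: a hash of the reduced DOM.
        return hashlib.sha1(reduced_dom.encode("utf-8")).hexdigest()

    def observe(self, reduced_dom, events):
        """Register the state reached after an event; return (id, is_new)."""
        sid = self.state_id(reduced_dom)
        if sid not in self.states:
            self.states[sid] = {e: False for e in events}
            return sid, True
        return sid, False

    def next_event(self, sid):
        """Select a not-yet-executed event of this state, or None."""
        for event, executed in self.states[sid].items():
            if not executed:
                return event
        return None

table = StateTable()
sid, is_new = table.observe("div|span@onclick", ["e1", "e2"])
assert is_new and table.next_event(sid) == "e1"
```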

Page 16: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Crawling Strategies for RIAs

Most work on crawling RIAs does not intend to build a complete model of the application.
Some consider standard strategies, such as Depth-First and Breadth-First, for building complete models.
We have developed more efficient strategies based on the assumed structure of the application ("model-based strategies", see below).

Page 17: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Disadvantages of standard strategies

Breadth-First:
– no long sequences of event executions
– very many resets
Depth-First:
– Advantage: has long sequences of event executions
– Disadvantage: when reaching a known state, the strategy takes a path back to a specific previous state for further event exploration. This path through known edges is often long and may involve a reset (overhead) – going back to another state with non-executed events may be much more efficient.

Page 18: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Comparing crawling strategies

Objectives
– Complete crawl: given enough time, the strategy terminates the crawl when all states of the application have been found.
– Efficiency of finding states ("finding states fast"): if the crawl is terminated by the user before a complete crawl is attained, the number of discovered states should be as large as possible.
• For many applications, a complete crawl cannot be obtained within a reasonable length of time.
• Therefore the second objective is very important.

Page 19: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Comparing efficiency of finding states

[Chart: number of states discovered vs. cost (number of event executions + reset cost), log scale. Total: 129 states. This is for a specific application; such comparisons should be done for many different types of applications.]

Page 20: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Model-based Crawling

Idea:
– Meta-model: assumed structure of the application
– The crawling strategy is optimized for the case where the application follows these assumptions
– The crawling strategy must be able to deal with applications that do not satisfy these assumptions

Page 21: A Statistical Approach for Efficient Crawling of Rich Internet Applications

State and transition exploration phases

State exploration phase
– finding all states, assuming that the application follows the assumptions of the meta-model
Transition exploration phase
– executing all remaining events in all known states (those that have not been executed during the state exploration phase)
Order of execution
– Start with state exploration; then transition exploration
– If new states are discovered during the transition phase, go back to the state exploration phase, etc.

Page 22: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Three meta-models

Hypercube
– The state reached by a sequence of events from the initial state is independent of the order of the events.
– The enabled events at a state are those at the initial state minus those executed to reach that state.
Menu model
Probability model

[Figure: example of a 4-dimensional hypercube – states labelled by their sets of enabled events {e1,e2,e3,e4}, {e2,e3,e4}, {e1,e3,e4}, ..., {e3,e4}, ..., {e4}, down to {} at the bottom.]
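The 4-dimensional hypercube above can be enumerated directly, since under the meta-model's assumptions a state is fully determined by the set of events not yet executed. A small sketch:

```python
# Enumerate the states of the 4-dim hypercube meta-model: one state per
# subset of events still enabled (i.e. not yet executed).
from itertools import combinations

events = ["e1", "e2", "e3", "e4"]
states = [frozenset(c) for r in range(len(events) + 1)
          for c in combinations(events, r)]

assert len(states) == 2 ** len(events)  # 16 states for 4 events

# Executing e1 from the initial state {e1,e2,e3,e4} leads to {e2,e3,e4},
# regardless of the order in which the remaining events are executed:
assert frozenset(events) - {"e1"} == frozenset({"e2", "e3", "e4"})
```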

Page 23: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Probability strategy

As in the menu model, we use event priorities. The priority of an event is based on statistical observations (during the crawl of the application) of the number of new states discovered when executing the given event.
The strategy is based on the belief that an event which was often observed to lead to new states in the past is more likely to lead to new states in the future.

Page 24: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Probability strategy: state exploration

The probability of a given event e finding a new state from the current state is
P(e) = ( S(e) + pS ) / ( N(e) + pN )
– N(e) = number of executions of e
– S(e) = number of new states found by e
– Bayesian formula; pS = 1 and pN = 2 give an initial probability of 0.5
From the current state s, find a non-executed event e at some state s' such that P(e) is high and the path from s to s' is short
– Note: the path from s to s' is through events already executed
– Question: how to find e and s'?
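The estimate above can be sketched as a small function; pS = 1 and pN = 2 act as a prior that gives every not-yet-executed event an initial probability of 0.5:

```python
# P(e) = (S(e) + pS) / (N(e) + pN): smoothed estimate of the probability
# that executing event e discovers a new state.
def event_probability(new_states_found, executions, p_s=1, p_n=2):
    return (new_states_found + p_s) / (executions + p_n)

assert event_probability(0, 0) == 0.5  # prior for an unexecuted event
# An event that found 3 new states in 4 executions: (3+1)/(4+2) = 2/3.
assert abs(event_probability(3, 4) - 2 / 3) < 1e-12
```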

Page 25: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Choosing an event to explore

• Def: P(s) = max(P(e)) over all non-executed events at state s
• We want an event that
– has a high probability of discovering a state, and
– has a small distance (in terms of event executions) from the current state
• Should we explore e1 or e2 next? Note that although P(e1) is higher, the distance to reach e1 (from the current state) is also higher.

[Figure: from the current state, already-explored events (solid edges) lead towards states s1 and s2; dashed edges are events yet to be explored. e1 and e2 are the unexplored events with maximum probability at s1 and s2, respectively: P(e1) = P(s1) = 0.8, P(e2) = P(s2) = 0.5.]

Page 26: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Choosing an event to explore (2)

• It is not fair to compare P(e1) and P(e2) directly, since the distances (the number of event executions required to explore e1 and e2) are different.
• The number of events executed is the same in the following two cases:
• option 1: explore e1
• option 2: explore e2 and, if a known state is reached, explore one more event
• The probabilities of finding new states with these options are:
• option 1: P = P(e1)
• option 2: P = P(e2) + (1 – P(e2)) Pavg
where Pavg is the probability of discovering a new state averaged over all known states (we do not know which state would be reached).

[Figure: same situation as before – e2 is one step away at s2 with P(e2) = 0.5, and e1 is one step farther at s1 with P(e1) = 0.8.]
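The two options above, in code. The value of Pavg is illustrative; in the real crawl it is measured over all known states:

```python
# Compare exploring the farther e1 directly vs. exploring the nearer e2
# and, if a known state is reached, one more (average) event.
p_e1, p_e2, p_avg = 0.8, 0.5, 0.4  # p_avg is an illustrative value

option1 = p_e1                          # P(e1)
option2 = p_e2 + (1 - p_e2) * p_avg    # P(e2) + (1 - P(e2)) * Pavg

# With these numbers option1 (0.8) beats option2 (0.5 + 0.5*0.4 = 0.7),
# so exploring e1 is preferable despite the longer path.
assert option1 > option2
```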

Page 27: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Choosing an event to explore (3)

• In general, two events e1 and e2 (where e1 requires k more steps to be reached than e2) are compared by looking at
– P(e1)
– 1 – (1 – P(e2)) (1 – Pavg)^k
• the latter value is 1 – (probability of not discovering a state by exploring e2 and k more events)
• Using this comparison, we decide on a state s_chosen where the next event should be explored
• We use an iterative search to find s_chosen
– Initialize s_chosen to be the current state
– At iteration i:
• let s be the state with maximum probability at distance i from the current state
• if s is preferable to s_chosen, update s_chosen to be s
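The core comparison can be sketched as follows, with the iterative search reduced to its essential test. Function names and values are our own illustration:

```python
# Compare a farther event (k extra steps) against a nearer one augmented
# by k average follow-up events: 1 - (1 - P(e2)) * (1 - Pavg)^k.
def augmented_probability(p_near, k, p_avg):
    """1 - (probability of discovering nothing with e2 and k more events)."""
    return 1 - (1 - p_near) * (1 - p_avg) ** k

def prefer_far_event(p_far, p_near, k, p_avg):
    """True if the event k steps farther away is still the better choice."""
    return p_far > augmented_probability(p_near, k, p_avg)

# The previous slide's example: P(e1) = 0.8 one step farther than
# P(e2) = 0.5, with an illustrative Pavg = 0.4: 0.8 > 0.7, so pick e1.
assert prefer_far_event(0.8, 0.5, 1, 0.4)
```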

Page 28: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Choosing an event to explore (4)

• When do we stop the iteration?
– When it is not possible to find a better state than the current s_chosen
• How do we know that it is not possible to find a better state?
– We know the maximum probability, Pbest, among all unexplored events.
– We can stop at a distance d from s_chosen if we have
1 – (1 – P(s_chosen)) (1 – Pavg)^d ≥ Pbest
– That is, if we cannot find a better state within d steps after the last value of s_chosen, then no other state can be better (since even the best event would not be preferable).
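The stopping test, checked numerically on illustrative values:

```python
# Stop the search at distance d once even Pbest, the best probability among
# all unexplored events, cannot beat s_chosen after d average steps.
def can_stop(p_chosen, p_avg, p_best, d):
    return 1 - (1 - p_chosen) * (1 - p_avg) ** d >= p_best

# With P(s_chosen) = 0.5, Pavg = 0.2 and Pbest = 0.9: at distance 1 a
# better state might still exist (0.6 < 0.9), but by distance 8 the bound
# 1 - 0.5 * 0.8^8 ≈ 0.916 exceeds Pbest, so the search can stop.
assert not can_stop(0.5, 0.2, 0.9, 1)
assert can_stop(0.5, 0.2, 0.9, 8)
```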

Page 29: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Experiments

We did experiments with the different crawling strategies using the following web sites:
– Periodic table (local version: http://ssrg.eecs.uottawa.ca/periodic/)
– Clipmarks (local version: http://ssrg.eecs.uottawa.ca/clipmarks/)
– TestRIA (http://ssrg.eecs.uottawa.ca/TestRIA/)
– Altoro Mutual (http://www.altoromutual.com/)

Page 30: A Statistical Approach for Efficient Crawling of Rich Internet Applications

State Discovery Efficiency – Periodic Table

Plots are in logarithmic scale. Cost of reset for this application is 8.

Cost = number of event executions + R * number of resets
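The cost metric used throughout these plots, as a one-line helper (values below are illustrative):

```python
# Cost = number of event executions + R * number of resets,
# where R is the reset cost (8 for the Periodic Table application).
def crawl_cost(event_executions, resets, reset_cost):
    return event_executions + reset_cost * resets

assert crawl_cost(100, 5, 8) == 140   # 100 events + 5 resets at cost 8
```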

Page 31: A Statistical Approach for Efficient Crawling of Rich Internet Applications

State Discovery Efficiency – Clipmarks

Plots are in logarithmic scale. Cost of reset for this application is 18.

Page 32: A Statistical Approach for Efficient Crawling of Rich Internet Applications

State Discovery Efficiency – TestRIA

Plots are in logarithmic scale. Cost of reset for this application is 2.

Page 33: A Statistical Approach for Efficient Crawling of Rich Internet Applications

State Discovery Efficiency – Altoro Mutual

Plots are in logarithmic scale. Cost of reset for this application is 2.

Page 34: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Results: Transition exploration (cost of exploring all transitions)

Cost for a complete crawl
– Cost = number of event executions + R * number of resets
• R = 18 for the Clipmarks web site

Page 35: A Statistical Approach for Efficient Crawling of Rich Internet Applications

On-going work

Exploring regular page structures with widgets
– reducing the exponential blow-up of combinations
Exploring the structure of mobile applications
– applying similar crawling principles to the exploration of the behavior of mobile applets
Concurrent crawling
– for increasing the performance of crawling, consider coordinated crawling by many crawlers running on different computers, e.g. in the cloud

Page 36: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Conclusions

RIA crawling is quite different from traditional web crawling.
Model-based strategies can improve the efficiency of crawling.
We have developed prototypes of these crawling strategies, integrated with the IBM AppScan product.

Page 37: A Statistical Approach for Efficient Crawling of Rich Internet Applications

References

Background:
– Mesbah, A., van Deursen, A. and Lenselink, S., 2011. Crawling Ajax-based Web Applications through Dynamic Analysis of User Interface State Changes. ACM Transactions on the Web (TWEB), 6(1), a23.

Our papers:
– Dincturk, M.E., Choudhary, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Statistical Approach for Efficient Crawling of Rich Internet Applications, in Proceedings of the 12th International Conference on Web Engineering (ICWE 2012), Berlin, Germany, July 2012. 8 pages. A longer version of the paper (15 pages) is also available.
– Choudhary, S., Dincturk, M.E., Bochmann, G.v., Jourdan, G.-V., Onut, I.V. and Ionescu, P., Solving Some Modeling Challenges when Testing Rich Internet Applications for Security, in The Third International Workshop on Security Testing (SECTEST 2012), Montreal, Canada, April 2012. 8 pages.
– Benjamin, K., Bochmann, G.v., Dincturk, M.E., Jourdan, G.-V. and Onut, I.V., A Strategy for Efficient Crawling of Rich Internet Applications, in Proceedings of the 11th International Conference on Web Engineering (ICWE 2011), Paphos, Cyprus, July 2011. 15 pages.
– Benjamin, K., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Some Modeling Challenges when Testing Rich Internet Applications for Security, in First International Workshop on Modeling and Detection of Vulnerabilities (MDV 2010), Paris, France, April 2010. 8 pages.
– Dincturk, M.E., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., A Model-Based Approach for Crawling Rich Internet Applications, submitted to a journal.

Page 38: A Statistical Approach for Efficient Crawling of Rich Internet Applications

Questions?

Comments?

These slides can be downloaded from http://????/RIAcrawlingProb.pptx

