Exploring Tools and Techniques for Distributed Continuous Quality Assurance
Adam Porter ([email protected])
University of Maryland
http://www.cs.umd.edu/projects/skoll
September 8, 2006
Quality Assurance for Large-Scale Systems
- Modern systems are increasingly complex; they:
  - Run on numerous platform, compiler & library combinations
  - Have 10's, 100's, even 1000's of configuration options
  - Are evolved incrementally by geographically-distributed teams
  - Run atop other frequently changing systems
  - Have multi-faceted quality objectives
How do you QA systems like this?
A Typical QA Process
- Developers make changes & perform some QA locally
- Continuous build, integration & test of CVS HEAD
  - Sometimes performed by multiple (uncoordinated) machines
- Lots of QA being done, but systems still break, too many bugs escape to the field & there are too many late integration surprises
- Some contributors to QA process failures:
  - QA only for the most readily-available platforms & default configurations
  - Knowledge, artifacts, effort & results not shared
  - Often limited to compilation & simple functional testing
  - Static QA processes that focus solely on CVS HEAD
General Assessment
- Works reasonably well for a few QA processes on a very small part of the full system space
- For everything else it has deficiencies & leads to serious blind spots as to the system's actual behavior:
  - Large portions of system space go unexplored
  - Can't fully predict effect of changes
  - Hard to interpret from-the-field failure reports & feedback
Distributed Continuous Quality Assurance
- DCQA: QA processes conducted around-the-world, around-the-clock on powerful, virtual computing grids
  - Grids can be made up of end-user machines, company-wide resources or dedicated computing clusters
- General approach:
  - Divide QA processes into numerous tasks
  - Intelligently distribute tasks to clients, who then execute them
  - Merge and analyze incremental results to efficiently complete the desired QA process
DCQA (cont.)
- Expected benefits:
  - Massive parallelization
  - Greater access to resources & environments not readily found in-house
  - Coordinated QA efforts that enable more sophisticated analyses
- Recurring themes:
  - Leverage developer & application community resources
  - Maximize task & resource diversity
  - Opportunistically divide & conquer large QA analyses
  - Steer the process to improve efficiency & effectiveness
  - Gather information proactively
Collaborators
- Sandro Fouché, Atif Memon, Alan Sussman, Cemal Yilmaz (now at IBM TJ Watson) & Il-Chul Yoon
- Alex Orso, Myra Cohen
- Murali Haran, Alan Karr, Mike Last & Ashish Sanil
- Doug Schmidt & Andy Gokhale
Overview
- The Skoll DCQA infrastructure & approach
- Novel DCQA processes
- Future work
Skoll DCQA Infrastructure & Approach

- Clients and server(s) interact as follows (sketched below):
  1. Client registers & receives the client kit
  2. When the client becomes available, it requests a QA task
  3. Server selects the best task that matches the client's characteristics
  4. Client executes the task & returns the results
  5. Server processes the results & updates internal databases
See: A. Memon, A. Porter, C. Yilmaz, A. Nagarajan, D. C. Schmidt, and B. Natarajan, Skoll: Distributed Continuous Quality Assurance, ICSE’04
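The following is a minimal sketch of that client loop, assuming a JSON-over-HTTP transport; the server URL, endpoint names, and message fields are hypothetical stand-ins, not the actual Skoll protocol from the ICSE'04 paper.

```python
import json
import urllib.request

SERVER = "http://skoll.example.edu/api"   # hypothetical server URL

def post(path, payload):
    """Send one JSON message to the server and decode the JSON reply."""
    req = urllib.request.Request(
        SERVER + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def execute(task):
    """Placeholder for locally running the QA task the server sent."""
    return {"status": "pass"}

# 1. Register & receive the client kit; the platform characteristics let
#    the server match future tasks to this client.
kit = post("/register", {"os": "Linux", "compiler": "gcc-3.4"})

# 2-5. When available: request a task, execute it, return the results;
#      the server processes each result & updates its databases.
while True:
    task = post("/task", {"client_id": kit["client_id"]})
    if not task:
        break
    post("/results", {"task_id": task["id"], "result": execute(task)})
```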
Implementing DCQA Processes in Skoll
- Define how the QA analysis will be subdivided by defining generic QA tasks & the QA space in which they operate
  - QA tasks: templates parameterized by a "configuration", the variable information needed to run a concrete QA task
  - QA space: the set of all "configurations" in which QA tasks run
- Define how QA tasks will be scheduled & how results will be processed by defining navigation strategies
  - Navigation strategies "visit" points in the QA space, applying QA tasks & processing incremental results (see the sketch after this list)
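A small sketch of these two abstractions under assumed interfaces: a QA task is a function template instantiated with one configuration, and a navigation strategy picks which point of the QA space to visit next. The option names echo the ATC examples later in the talk; everything else here is illustrative.

```python
import random
from itertools import product

# A toy QA space: all combinations of two options (constraints omitted).
OPTIONS = {
    "TAO_HAS_MINIMUM_CORBA": [0, 1],
    "ORBCollocation": ["global", "per-orb", "no"],
}
qa_space = [dict(zip(OPTIONS, vals)) for vals in product(*OPTIONS.values())]

def run_testsuite(config):
    """Generic QA task template: all variable information lives in config."""
    # ... set options, build, run tests (elided) ...
    return {"config": config, "passed": True}

class RandomNavigation:
    """A trivial navigation strategy: visit unexplored points at random."""
    def __init__(self, space):
        self.unvisited = list(space)

    def next_point(self):
        if not self.unvisited:
            return None
        return self.unvisited.pop(random.randrange(len(self.unvisited)))

    def record(self, result):
        pass  # smarter strategies steer the process using incremental results

nav = RandomNavigation(qa_space)
while (point := nav.next_point()) is not None:
    nav.record(run_testsuite(point))
```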
The ACE+TAO+CIAO (ATC) System
- ATC characteristics:
  - 2M+ line open-source CORBA implementation
  - Maintained by 40+ geographically-distributed developers
  - 20,000+ users worldwide
  - Product-line architecture with 500+ configuration options
  - Runs on dozens of OS and compiler combinations
  - Continuously evolving: 200+ CVS commits per week
  - Quality concerns include correctness, QoS, footprint, compilation time & more
- Beginning to use Skoll for continuous build, integration & test of ATC
Define QA Space

Option                        Type               Settings
Operating System              compile-time       {Linux, Windows XP, ...}
TAO_HAS_MINIMUM_CORBA         compile-time       {True, False}
ORBCollocation                runtime            {global, per-orb, no}
ORBConnectionPurgingStrategy  runtime            {lru, lfu, fifo, null}
ACE_version                   component version  {v5.4.3, v5.4.4, ...}
TAO_version                   component version  {v1.4.3, v1.4.4, ...}
run(ORT/run_test.pl)          test case          {True, False}

Constraints:
- TAO_HAS_AMI → ¬TAO_HAS_MINIMUM_CORBA
- run(ORT/run_test.pl) → ¬TAO_HAS_MINIMUM_CORBA
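One way to represent such a QA space is as option-to-settings maps plus constraint predicates that filter the cross product down to valid configurations. A minimal sketch, using the two constraints from the slide on a reduced set of options:

```python
from itertools import product

options = {
    "TAO_HAS_MINIMUM_CORBA": [True, False],
    "TAO_HAS_AMI": [True, False],
    "run(ORT/run_test.pl)": [True, False],
}

# The slide's constraints, encoded as implications (A → B  ≡  not A or B).
constraints = [
    lambda c: not c["TAO_HAS_AMI"] or not c["TAO_HAS_MINIMUM_CORBA"],
    lambda c: not c["run(ORT/run_test.pl)"] or not c["TAO_HAS_MINIMUM_CORBA"],
]

valid = []
for vals in product(*options.values()):
    cfg = dict(zip(options, vals))
    if all(con(cfg) for con in constraints):
        valid.append(cfg)

print(len(valid), "valid configurations out of", 2 ** len(options))  # 5 of 8
```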
Define Generic QA Tasks
- Implemented as a workflow of "services" provided by the client kit
  - Some built-in: log, cvs_download, upload_results
  - The most interesting ones are application-specific, e.g., build, run_tests
- Example: the run_testsuite QA task invokes services that (a sketch follows):
  - Enable logging
  - Define the component to be built
  - Set configuration options & environment variables
  - Download source from a CVS repo
  - Configure & build the component
  - Compile tests
  - Execute tests
  - Upload results to the server
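A minimal sketch of this workflow idea, assuming each service is a plain Python callable. The service names follow the slide, but the signatures, stub bodies, and CVS path below are hypothetical illustrations, not the client kit's actual API.

```python
# Stub services; the real ones are provided by the Skoll client kit.
def log(msg):            print("LOG:", msg)
def cvs_download(repo):  log(f"checkout {repo}")        # built-in (stub)
def build(component):    log(f"build {component}")      # app-specific (stub)
def run_tests(suite):    log(f"run {suite}")            # app-specific (stub)
def upload_results(res): log(f"upload {res}")           # built-in (stub)

# run_testsuite expressed as an ordered workflow of service invocations.
workflow = [
    (log,          ("run_testsuite: starting",)),
    (cvs_download, ("/cvs/ACE_wrappers",)),   # hypothetical repo path
    (build,        ("TAO",)),
    (run_tests,    ("tests/ORT",)),
]
for service, args in workflow:
    service(*args)
upload_results({"status": "ok"})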
Navigation Strategy: Nearest Neighbor Search

[Figure series: the nearest neighbor strategy animated step by step over the QA space]
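A sketch of the idea behind this strategy as described in the Skoll work: when a configuration fails, schedule its "neighbors" (configurations differing in exactly one option setting) to map out the failing subspace. The function names and the toy example are illustrative.

```python
def neighbors(config, options):
    """Yield every configuration one option-setting away from config."""
    for opt, settings in options.items():
        for s in settings:
            if s != config[opt]:
                n = dict(config)
                n[opt] = s
                yield n

def nearest_neighbor_search(seed, options, run_task):
    """Expand outward from failing configurations; return the failing set."""
    frontier, seen, failing = [seed], set(), []
    while frontier:
        cfg = frontier.pop()
        key = tuple(sorted(cfg.items()))
        if key in seen:
            continue
        seen.add(key)
        if not run_task(cfg):                          # QA task failed here
            failing.append(cfg)
            frontier.extend(neighbors(cfg, options))   # explore outward

    return failing

if __name__ == "__main__":
    opts = {"A": [0, 1], "B": [0, 1, 2]}
    fails = nearest_neighbor_search({"A": 0, "B": 0}, opts,
                                    run_task=lambda c: c["B"] != 0)
    print(fails)   # maps the subspace where B == 0 fails
```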
Extension: Adding Temporary Constraints
[Figure: the nearest neighbor search with a temporarily excluded subspace marked]
Overview
- The Skoll DCQA infrastructure & framework
- Novel DCQA processes & feasibility studies:
  - Configuration-level fault characterization
  - Performance-oriented regression testing
  - Code-level fault modeling
- Future work
Configuration-Level Fault Characterization
- Goal: help developers localize configuration-related faults
- Solution approach:
  - Strategically sample the QA space to test for subspaces in which (1) compilation fails or (2) regression tests fail
  - Build models that characterize the configuration options and specific settings that define the failing subspace
See: C. Yilmaz, M. Cohen, A. Porter, Covering Arrays for Efficient Fault Characterization in Complex Configuration Spaces, ISSTA’04, TSE v32 (1)
Sampling Strategy

- Goals: maximize "coverage" of option-setting combinations, while minimizing the number of configurations tested
- Approach: compute the test schedule from t-way covering arrays
  - A set of configurations in which all ordered t-tuples of option settings appear at least once
- 2-way covering array example (see the check below):

Option  C1 C2 C3 C4 C5 C6 C7 C8 C9
O1      0  0  0  1  1  1  2  2  2
O2      0  1  2  0  1  2  0  1  2
O3      0  1  2  1  2  0  2  0  1
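The covering-array property is easy to check mechanically. The snippet below verifies that the example array above covers every ordered pair of settings for every pair of options using only 9 of the 27 possible configurations:

```python
from itertools import combinations, product

# Rows are options O1..O3; columns are configurations C1..C9 (from the table).
array = {
    "O1": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "O2": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "O3": [0, 1, 2, 1, 2, 0, 2, 0, 1],
}

def covers_pairs(arr, settings=(0, 1, 2)):
    """True iff every option pair exhibits all ordered setting pairs."""
    for o1, o2 in combinations(arr, 2):
        seen = set(zip(arr[o1], arr[o2]))
        if seen != set(product(settings, repeat=2)):
            return False
    return True

print(covers_pairs(array))   # True: all 9 setting pairs appear per option pair
```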
Fault Characterization
- We used classification trees (Breiman et al.) to model the option & setting patterns that predict test failures

[Figure: a classification tree over the options CORBA_MESSAGING, AMI, AMI_CALLBACK & AMI_POLLER, with leaves labeled OK, ERR-1, ERR-2 & ERR-3]
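A sketch of this modeling step, with scikit-learn's DecisionTreeClassifier standing in for the CART-style tool the authors used; the miniature data set below is made up to echo the options in the figure:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: settings of (CORBA_MESSAGING, AMI, AMI_CALLBACK, AMI_POLLER)
# for one tested configuration; y holds the observed test outcome.
X = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0],
     [1, 1, 1, 1], [1, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
y = ["OK", "OK", "ERR-1", "ERR-1", "ERR-3", "ERR-2", "OK", "OK"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=[
    "CORBA_MESSAGING", "AMI", "AMI_CALLBACK", "AMI_POLLER"]))
```

The printed rules ("if AMI = 1 and AMI_POLLER = 1 then ERR-3", etc.) are exactly the option/setting patterns developers use to localize the failing subspace.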
Feasibility Study
- QA scenario: do regression tests pass across the entire QA space?
  - 37,584 valid configurations
  - 10 compile-time options with 12 constraints
  - 96 test options with 120 test constraints
  - 6 run-time options with no constraints & 2 operating systems
- Navigation strategies:
  - Nearest neighbor search
  - t-way covering arrays for each 2 ≤ t ≤ 6
- Applied to one release of ATC using clients running on 120 CPUs
- Exhaustive process takes 18,800 CPU hours (excl. compilation)

t            2     3     4     5     6
size         116   348   1236  3372  9453
% reduction  99.4  98.2  93.4  82.0  49.7
Some Sample Results
- Found 89 possibly option-related failures (40 Linux, 49 Windows)
- Models based on covering arrays (even for t=2) were nearly as good as those based on exhaustive testing
- Several tests failed that don't fail with default option settings
  - Root cause: problems in feature-specific code
- Some failed on every single configuration (even though they don't fail when using default options!)
  - Root cause: problems in option-processing code
- 3 tests failed when ORBCollocation = NO
  - Background: when ORBCollocation = GLOBAL or PER-ORB, local objects communicate directly; with NO, objects always communicate over the network
  - Root cause: data marshalling/unmarshalling broken
Summary
- Basic infrastructure in place and working
- Initial results were encouraging:
  - Defined large control spaces and performed complex QA processes
  - Found many test failures corresponding to real bugs, some of which had not been found before
  - Fault characterization sped up debugging time
  - Documenting & centralizing inter-option constraints proved beneficial
- Ongoing extensions:
  - Expanding the QA space for ATC continuous build, test & integration
  - Extending the model for platform virtualization
Performance-Oriented Regression Testing
- Goal: quickly determine whether a software update degrades performance
- Solution approach:
  - Proactively find a small set of configurations on which to estimate performance across the entire configuration space
  - Later, whenever the software changes, benchmark this set and compare post-update & pre-update performance (a sketch of this comparison follows the citation below)
See: C. Yilmaz, A. Krishna, A. Memon, A. Porter, D. C. Schmidt, A. Gokhale, and B. Natarajan, Main Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems, ICSE’05
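A minimal sketch of the reactive comparison: benchmark the chosen configurations before and after an update and test whether the latency distribution shifted. The numbers are synthetic, and scipy's two-sample Kolmogorov-Smirnov test is an assumed stand-in for whatever distributional comparison is applied:

```python
from scipy.stats import ks_2samp

pre  = [102, 99, 101, 98, 100, 103, 97, 101]     # pre-update latencies (us)
post = [151, 148, 150, 149, 152, 147, 150, 153]  # post-update latencies (us)

stat, p = ks_2samp(pre, post)
if p < 0.05:
    print(f"possible performance degradation (KS statistic {stat:.2f})")
```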
Reliable Effects Screening (RES) Process
- Proactive (see the main-effects sketch below):
  - Compute a formal experimental plan (we use screening designs) for identifying "important" options
  - Execute the experimental plan across the QA grid
  - Analyze the data to identify important options
  - Recalibrate frequently as the system changes
- Reactive:
  - When the software changes, estimate the distribution of performance across the entire configuration space by evaluating all combinations of the important options (called the screening suite), while randomizing the rest
  - Distributional changes signal possible performance degradations
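To make the proactive step concrete: in a two-level design, an option's main effect is the mean response at its high setting minus the mean at its low setting. Real screening designs are fractional factorials; the sketch below uses a tiny full 2^3 design and a synthetic benchmark so the arithmetic is visible:

```python
from itertools import product
from statistics import mean

def benchmark(cfg):                  # stand-in for a real latency benchmark
    a, b, c = cfg
    return 100 + 40 * b + 5 * c      # option B dominates, C is minor

runs = [(cfg, benchmark(cfg)) for cfg in product([0, 1], repeat=3)]

for i, name in enumerate("ABC"):
    hi = mean(r for cfg, r in runs if cfg[i] == 1)
    lo = mean(r for cfg, r in runs if cfg[i] == 0)
    print(f"main effect of {name}: {hi - lo:+.1f}")
# Options with large effects (B here) go into the screening suite.
```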
Feasibility Study
- Several recent changes to the message queuing strategy: has this degraded performance?
- Developers identified 14 potentially important binary options (2^14 = 16,384 configurations)

Option Name                   Option Settings
ORBReactorThreadQueue         {FIFO, LIFO}
ORBClientConnectionHandler    {RW, MT}
ORBConnectionPurgingStrategy  {LRU, LFU}
ORBConcurrency                {reactive, thread-per-connection}
...                           ...
Identify Important Options
- Options B and J: clearly important; options I, C, and F: arguably important
- Screening designs mimic exhaustive testing at a fraction of the cost

[Figure: half-normal probability plots of option effects on latency, for the full data set and for a 128-run screening design]
Estimate Performance with Screening Suite
- Estimates closely track actuals

[Figure: Q-Q plot of latency, screening suite vs. full data set]
Applying RES to Evolving System
- Estimated performance once a day on CVS snapshots of ATC
- Observations:
  - Developers noticed the deviation on 12/14/03, but not before
  - Developers estimated a drop of 5% on 12/14/03; our process more accurately shows around a 50% drop

[Figure: latency distribution over time]
Summary
- Reliable Effects Screening is an efficient, effective, and repeatable approach for estimating the performance impact of software updates
  - Was effective in detecting actual performance degradations
  - Cut benchmarking work 1000x (from 2 days to 5 mins.)
  - Fast enough to be part of the check-in process
- Ongoing extensions:
  - New experimental designs: pooled ANOVA
  - Applications to component-based systems (multilevel interactions)
Code-Level Fault Modeling
- Goal: automatically gain code-level insight into why/when specific faults occur in fielded program runs
- Solution approach:
  - Lightly instrument, deploy & execute program instances
  - Capture outcome-labeled execution profiles (e.g., pass/fail)
  - Build models that explain the outcome in terms of the profile data

See: M. Haran, A. Karr, M. Last, A. Orso, A. Porter, A. Sanil & S. Fouché, Techniques for Classifying Executions of Deployed Software to Support Software Engineering Tasks, UMd Technical Report #cs-tr-4826
Lightweight Instr. via Adaptive Sampling
- Want to minimize overhead (e.g., runtime slowdown, code bloat, data volume & analysis costs)
- Use an incremental data collection approach (sketched below):
  - Individual instances sparsely sample the measurement space
  - Sampling weights are initially uniform
  - Weights are incrementally adjusted to favor high-utility measurements
  - Over time, high-utility measurements are sampled frequently while low-utility ones aren't
- Typically requires an increase in the number of instances observed
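A minimal sketch of this adaptive weighting scheme. The probe names, the utility values, and the exponential-update rule are assumptions for illustration; the paper's actual utility function differs:

```python
import random

probes = [f"m{i}" for i in range(100)]   # the measurement space
weights = {p: 1.0 for p in probes}       # sampling weights start uniform

def sample_probes(k=5):
    """Choose k probes for one deployed instance, favoring high-utility
    ones (with replacement, for simplicity of the sketch)."""
    return random.choices(probes, weights=[weights[p] for p in probes], k=k)

def update_weights(probe, utility, rate=0.1):
    """Incrementally nudge a probe's weight toward its observed utility."""
    weights[probe] = (1 - rate) * weights[probe] + rate * utility

# One round: an instance samples sparsely, then reports probe m7 as useful.
chosen = sample_probes()
update_weights("m7", utility=5.0)   # m7 now gets sampled more often
```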
Association Trees
- The resulting data set is very sparse (i.e., 90-99% missing), which is problematic for many learning algorithms
- Developed a new learning algorithm called association trees (a brute-force sketch follows):
  - For each measurement m, split its data range into 2 parts, e.g., low (m < k) or high (m ≥ k), where k is the best split point
  - Convert the data to items that are present or missing
  - Use the apriori algorithm to find item combinations whose presence is highly correlated with each outcome
- Uses of association trees:
  - Analyze rules for clues to fault location
  - Apply models to new instances, e.g., to determine whether an in-house test case likely reproduces an in-the-field failure
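A sketch of the binarize-then-mine idea on synthetic data. Brute-force enumeration of small itemsets stands in for the apriori algorithm, and "perfectly correlated with failure" stands in for whatever correlation threshold the real algorithm uses:

```python
from itertools import combinations

# Synthetic profiles: measurement -> count, plus a pass/fail outcome label.
runs = [
    ({"m1": 9, "m2": 0, "m3": 7}, "fail"),
    ({"m1": 8, "m2": 1, "m3": 6}, "fail"),
    ({"m1": 1, "m2": 9, "m3": 7}, "pass"),
    ({"m1": 0, "m2": 8, "m3": 1}, "pass"),
]
split = {"m1": 5, "m2": 5, "m3": 5}   # per-measurement split points k

def items(profile):
    """Binarize: 'high' measurements become present items."""
    return frozenset(m for m, v in profile.items() if v >= split[m])

transactions = [(items(p), label) for p, label in runs]

# Find item combinations whose presence occurs only in failing runs.
for size in (1, 2):
    for combo in combinations(split, size):
        combo = frozenset(combo)
        fails = sum(1 for t, l in transactions if combo <= t and l == "fail")
        total = sum(1 for t, l in transactions if combo <= t)
        if total and fails == total:
            print(set(combo), "present only in failing runs")
```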
Feasibility Study
- Subject program: Java Byte Code Analyzer (JABA)
  - 60K statements, 400 classes, 3,000 methods
  - 707 test cases, 14 versions
  - Measured method entry and block execution counts
  - Ran clients on 120 CPUs
- Standard classification techniques using exhaustive measurement yielded near-perfect models (with some outliers)
Some Sample Results
- Competitive with standard methods using exhaustive data
- Measured 5-8% of the data per instance; observed ~4x as many instances in total

[Figure: coverage, misclassification, false positive & false negative rates (0-100%) for the method-counts and block-counts data]
Summary
- Preliminary results: created highly effective models at greatly reduced cost
- Scales well to finer-grained data & larger systems
- Ongoing extensions:
  - Tradeoffs in the number of measurements vs. instances observed
  - New utility functions: cost-benefit models, fixed-cost sampling
  - New applications: lightweight test oracles, failure report clustering
Overview
- The Skoll DCQA infrastructure & framework
- Novel DCQA processes
- Future work
Future Work
- Looking for more example systems (help!)
- New problem classes:
  - Performance and robustness optimization
  - Configuration advice
- Integrate with model-based testing systems
  - GUITAR: http://www.cs.umd.edu/~atif/newsite/guitar.htm
- Distributed QA tasks
- Improved statistical treatment of noisy data
  - Modeling unobservables and/or detecting outliers