Exploring Tools and Techniques for Distributed Continuous Quality Assurance
Adam Porter ([email protected])
University of Maryland
http://www.cs.umd.edu/projects/skoll
September 8, 2006
Quality Assurance for Large-Scale Systems
- Modern systems are increasingly complex; they:
  - Run on numerous platform, compiler & library combinations
  - Have 10's, 100's, even 1000's of configuration options
  - Are evolved incrementally by geographically-distributed teams
  - Run atop other frequently changing systems
  - Have multi-faceted quality objectives
How do you QA systems like this?
A Typical QA Process
- Developers make changes & perform some QA locally
- Continuous build, integration & test of CVS HEAD
  - Sometimes performed by multiple (uncoordinated) machines
- Lots of QA being done, but systems still break, too many bugs escape to the field & there are too many late integration surprises
- Some contributors to QA process failures:
  - QA only for the most readily-available platforms & default configurations
  - Knowledge, artifacts, effort & results not shared
  - Often limited to compilation & simple functional testing
  - Static QA processes that focus solely on CVS HEAD
General Assessment
- Works reasonably well for a few QA processes on a very small part of the full system space
- For everything else it has deficiencies & leads to serious blind spots as to the system's actual behavior:
  - Large portions of system space go unexplored
  - Can't fully predict effect of changes
  - Hard to interpret from-the-field failure reports & feedback
Distributed Continuous Quality Assurance
- DCQA: QA processes conducted around-the-world, around-the-clock on powerful, virtual computing grids
  - Grids can be made up of end-user machines, company-wide resources or dedicated computing clusters
- General approach:
  - Divide QA processes into numerous tasks
  - Intelligently distribute tasks to clients, who then execute them
  - Merge and analyze incremental results to efficiently complete the desired QA process
DCQA (cont.)
- Expected benefits:
  - Massive parallelization
  - Greater access to resources & environments not readily found in-house
  - Coordinated QA efforts that enable more sophisticated analyses
- Recurring themes:
  - Leverage developer & application community resources
  - Maximize task & resource diversity
  - Opportunistically divide & conquer large QA analyses
  - Steer the process to improve efficiency & effectiveness
  - Gather information proactively
Collaborators
- Sandro Fouché, Atif Memon, Alan Sussman, Cemal Yilmaz (now at IBM TJ Watson) & Il-Chul Yoon
- Alex Orso, Myra Cohen
- Murali Haran, Alan Karr, Mike Last & Ashish Sanil
- Doug Schmidt & Andy Gokhale
Overview
- The Skoll DCQA infrastructure & approach
- Novel DCQA processes
- Future work
Skoll DCQA Infrastructure & Approach

- Clients and server(s) interact as follows (sketched below):
  1. Client registers & receives the client kit
  2. When the client becomes available, it requests a QA task
  3. Server selects the best task that matches the client's characteristics
  4. Client executes the task & returns the results
  5. Server processes the results & updates internal databases
See: A. Memon, A. Porter, C. Yilmaz, A. Nagarajan, D. C. Schmidt, and B. Natarajan, Skoll: Distributed Continuous Quality Assurance, ICSE’04
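The following is a minimal sketch of that client loop, assuming a JSON-over-HTTP transport; the server URL, endpoint names, and message fields are hypothetical stand-ins, not the actual Skoll protocol from the ICSE'04 paper.

```python
import json
import urllib.request

SERVER = "http://skoll.example.edu/api"   # hypothetical server URL

def post(path, payload):
    """Send one JSON message to the server and decode the JSON reply."""
    req = urllib.request.Request(
        SERVER + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def execute(task):
    """Placeholder for locally running the QA task the server sent."""
    return {"status": "pass"}

# 1. Register & receive the client kit; the platform characteristics let
#    the server match future tasks to this client.
kit = post("/register", {"os": "Linux", "compiler": "gcc-3.4"})

# 2-5. When available: request a task, execute it, return the results;
#      the server processes each result & updates its databases.
while True:
    task = post("/task", {"client_id": kit["client_id"]})
    if not task:
        break
    post("/results", {"task_id": task["id"], "result": execute(task)})
```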
Implementing DCQA Processes in Skoll
- Define how the QA analysis will be subdivided by defining generic QA tasks & the QA space in which they operate
  - QA tasks: templates parameterized by a "configuration", the variable information needed to run a concrete QA task
  - QA space: the set of all "configurations" in which QA tasks run
- Define how QA tasks will be scheduled & how results will be processed by defining navigation strategies
  - Navigation strategies "visit" points in the QA space, applying QA tasks & processing incremental results (see the sketch after this list)
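A small sketch of these two abstractions under assumed interfaces: a QA task is a function template instantiated with one configuration, and a navigation strategy picks which point of the QA space to visit next. The option names echo the ATC examples later in the talk; everything else here is illustrative.

```python
import random
from itertools import product

# A toy QA space: all combinations of two options (constraints omitted).
OPTIONS = {
    "TAO_HAS_MINIMUM_CORBA": [0, 1],
    "ORBCollocation": ["global", "per-orb", "no"],
}
qa_space = [dict(zip(OPTIONS, vals)) for vals in product(*OPTIONS.values())]

def run_testsuite(config):
    """Generic QA task template: all variable information lives in config."""
    # ... set options, build, run tests (elided) ...
    return {"config": config, "passed": True}

class RandomNavigation:
    """A trivial navigation strategy: visit unexplored points at random."""
    def __init__(self, space):
        self.unvisited = list(space)

    def next_point(self):
        if not self.unvisited:
            return None
        return self.unvisited.pop(random.randrange(len(self.unvisited)))

    def record(self, result):
        pass  # smarter strategies steer the process using incremental results

nav = RandomNavigation(qa_space)
while (point := nav.next_point()) is not None:
    nav.record(run_testsuite(point))
```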
The ACE+TAO+CIAO (ATC) System
- ATC characteristics:
  - 2M+ line open-source CORBA implementation
  - Maintained by 40+ geographically-distributed developers
  - 20,000+ users worldwide
  - Product-line architecture with 500+ configuration options
  - Runs on dozens of OS and compiler combinations
  - Continuously evolving: 200+ CVS commits per week
  - Quality concerns include correctness, QoS, footprint, compilation time & more
- Beginning to use Skoll for continuous build, integration & test of ATC
Define QA Space

Option                        Type               Settings
Operating System              compile-time       {Linux, Windows XP, ...}
TAO_HAS_MINIMUM_CORBA         compile-time       {True, False}
ORBCollocation                runtime            {global, per-orb, no}
ORBConnectionPurgingStrategy  runtime            {lru, lfu, fifo, null}
ACE_version                   component version  {v5.4.3, v5.4.4, ...}
TAO_version                   component version  {v1.4.3, v1.4.4, ...}
run(ORT/run_test.pl)          test case          {True, False}

Constraints:
- TAO_HAS_AMI → ¬TAO_HAS_MINIMUM_CORBA
- run(ORT/run_test.pl) → ¬TAO_HAS_MINIMUM_CORBA
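One way to represent such a QA space is as option-to-settings maps plus constraint predicates that filter the cross product down to valid configurations. A minimal sketch, using the two constraints from the slide on a reduced set of options:

```python
from itertools import product

options = {
    "TAO_HAS_MINIMUM_CORBA": [True, False],
    "TAO_HAS_AMI": [True, False],
    "run(ORT/run_test.pl)": [True, False],
}

# The slide's constraints, encoded as implications (A → B  ≡  not A or B).
constraints = [
    lambda c: not c["TAO_HAS_AMI"] or not c["TAO_HAS_MINIMUM_CORBA"],
    lambda c: not c["run(ORT/run_test.pl)"] or not c["TAO_HAS_MINIMUM_CORBA"],
]

valid = []
for vals in product(*options.values()):
    cfg = dict(zip(options, vals))
    if all(con(cfg) for con in constraints):
        valid.append(cfg)

print(len(valid), "valid configurations out of", 2 ** len(options))  # 5 of 8
```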
Define Generic QA Tasks
- Implemented as a workflow of "services" provided by the client kit
  - Some built-in: log, cvs_download, upload_results
  - The most interesting ones are application-specific, e.g., build, run_tests
- Example: the run_testsuite QA task invokes services that (a sketch follows):
  - Enable logging
  - Define the component to be built
  - Set configuration options & environment variables
  - Download source from a CVS repo
  - Configure & build the component
  - Compile tests
  - Execute tests
  - Upload results to the server
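A minimal sketch of this workflow idea, assuming each service is a plain Python callable. The service names follow the slide, but the signatures, stub bodies, and CVS path below are hypothetical illustrations, not the client kit's actual API.

```python
# Stub services; the real ones are provided by the Skoll client kit.
def log(msg):            print("LOG:", msg)
def cvs_download(repo):  log(f"checkout {repo}")        # built-in (stub)
def build(component):    log(f"build {component}")      # app-specific (stub)
def run_tests(suite):    log(f"run {suite}")            # app-specific (stub)
def upload_results(res): log(f"upload {res}")           # built-in (stub)

# run_testsuite expressed as an ordered workflow of service invocations.
workflow = [
    (log,          ("run_testsuite: starting",)),
    (cvs_download, ("/cvs/ACE_wrappers",)),   # hypothetical repo path
    (build,        ("TAO",)),
    (run_tests,    ("tests/ORT",)),
]
for service, args in workflow:
    service(*args)
upload_results({"status": "ok"})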
Navigation Strategy: Nearest Neighbor Search

[Figure series: the nearest neighbor strategy animated step by step over the QA space]
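A sketch of the idea behind this strategy as described in the Skoll work: when a configuration fails, schedule its "neighbors" (configurations differing in exactly one option setting) to map out the failing subspace. The function names and the toy example are illustrative.

```python
def neighbors(config, options):
    """Yield every configuration one option-setting away from config."""
    for opt, settings in options.items():
        for s in settings:
            if s != config[opt]:
                n = dict(config)
                n[opt] = s
                yield n

def nearest_neighbor_search(seed, options, run_task):
    """Expand outward from failing configurations; return the failing set."""
    frontier, seen, failing = [seed], set(), []
    while frontier:
        cfg = frontier.pop()
        key = tuple(sorted(cfg.items()))
        if key in seen:
            continue
        seen.add(key)
        if not run_task(cfg):                          # QA task failed here
            failing.append(cfg)
            frontier.extend(neighbors(cfg, options))   # explore outward

    return failing

if __name__ == "__main__":
    opts = {"A": [0, 1], "B": [0, 1, 2]}
    fails = nearest_neighbor_search({"A": 0, "B": 0}, opts,
                                    run_task=lambda c: c["B"] != 0)
    print(fails)   # maps the subspace where B == 0 fails
```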
Extension: Adding Temporary Constraints
[Figure: the nearest neighbor search with a temporarily excluded subspace marked]
Overview
- The Skoll DCQA infrastructure & framework
- Novel DCQA processes & feasibility studies:
  - Configuration-level fault characterization
  - Performance-oriented regression testing
  - Code-level fault modeling
- Future work
Configuration-Level Fault Characterization
- Goal: help developers localize configuration-related faults
- Solution approach:
  - Strategically sample the QA space to test for subspaces in which (1) compilation fails or (2) regression tests fail
  - Build models that characterize the configuration options and specific settings that define the failing subspace
See: C. Yilmaz, M. Cohen, A. Porter, Covering Arrays for Efficient Fault Characterization in Complex Configuration Spaces, ISSTA’04, TSE v32 (1)
Sampling Strategy

- Goals: maximize "coverage" of option-setting combinations, while minimizing the number of configurations tested
- Approach: compute the test schedule from t-way covering arrays
  - A set of configurations in which all ordered t-tuples of option settings appear at least once
- 2-way covering array example (see the check below):

Option  C1 C2 C3 C4 C5 C6 C7 C8 C9
O1      0  0  0  1  1  1  2  2  2
O2      0  1  2  0  1  2  0  1  2
O3      0  1  2  1  2  0  2  0  1
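The covering-array property is easy to check mechanically. The snippet below verifies that the example array above covers every ordered pair of settings for every pair of options using only 9 of the 27 possible configurations:

```python
from itertools import combinations, product

# Rows are options O1..O3; columns are configurations C1..C9 (from the table).
array = {
    "O1": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "O2": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "O3": [0, 1, 2, 1, 2, 0, 2, 0, 1],
}

def covers_pairs(arr, settings=(0, 1, 2)):
    """True iff every option pair exhibits all ordered setting pairs."""
    for o1, o2 in combinations(arr, 2):
        seen = set(zip(arr[o1], arr[o2]))
        if seen != set(product(settings, repeat=2)):
            return False
    return True

print(covers_pairs(array))   # True: all 9 setting pairs appear per option pair
```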
Fault Characterization
- We used classification trees (Breiman et al.) to model the option & setting patterns that predict test failures

[Figure: a classification tree over the options CORBA_MESSAGING, AMI, AMI_CALLBACK & AMI_POLLER, with leaves labeled OK, ERR-1, ERR-2 & ERR-3]
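A sketch of this modeling step, with scikit-learn's DecisionTreeClassifier standing in for the CART-style tool the authors used; the miniature data set below is made up to echo the options in the figure:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: settings of (CORBA_MESSAGING, AMI, AMI_CALLBACK, AMI_POLLER)
# for one tested configuration; y holds the observed test outcome.
X = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0],
     [1, 1, 1, 1], [1, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
y = ["OK", "OK", "ERR-1", "ERR-1", "ERR-3", "ERR-2", "OK", "OK"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=[
    "CORBA_MESSAGING", "AMI", "AMI_CALLBACK", "AMI_POLLER"]))
```

The printed rules ("if AMI = 1 and AMI_POLLER = 1 then ERR-3", etc.) are exactly the option/setting patterns developers use to localize the failing subspace.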
Feasibility Study
- QA scenario: do regression tests pass across the entire QA space?
  - 37,584 valid configurations
  - 10 compile-time options with 12 constraints
  - 96 test options with 120 test constraints
  - 6 run-time options with no constraints & 2 operating systems
- Navigation strategies:
  - Nearest neighbor search
  - t-way covering arrays for each 2 ≤ t ≤ 6
- Applied to one release of ATC using clients running on 120 CPUs
- Exhaustive process takes 18,800 CPU hours (excl. compilation)

t            2     3     4     5     6
size         116   348   1236  3372  9453
% reduction  99.4  98.2  93.4  82.0  49.7
Some Sample Results
- Found 89 possibly option-related failures (40 Linux, 49 Windows)
- Models based on covering arrays (even for t=2) were nearly as good as those based on exhaustive testing
- Several tests failed that don't fail with default option settings
  - Root cause: problems in feature-specific code
- Some failed on every single configuration (even though they don't fail when using default options!)
  - Root cause: problems in option-processing code
- 3 tests failed when ORBCollocation = NO
  - Background: when ORBCollocation = GLOBAL or PER-ORB, local objects communicate directly; with NO, objects always communicate over the network
  - Root cause: data marshalling/unmarshalling broken
Summary
- Basic infrastructure in place and working
- Initial results were encouraging:
  - Defined large control spaces and performed complex QA processes
  - Found many test failures corresponding to real bugs, some of which had not been found before
  - Fault characterization sped up debugging time
  - Documenting & centralizing inter-option constraints proved beneficial
- Ongoing extensions:
  - Expanding the QA space for ATC continuous build, test & integration
  - Extending the model for platform virtualization
Performance-Oriented Regression Testing
- Goal: quickly determine whether a software update degrades performance
- Solution approach:
  - Proactively find a small set of configurations on which to estimate performance across the entire configuration space
  - Later, whenever the software changes, benchmark this set and compare post-update & pre-update performance (a sketch of this comparison follows the citation below)
See: C. Yilmaz, A. Krishna, A. Memon, A. Porter, D. C. Schmidt, A. Gokhale, and B. Natarajan, Main Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems, ICSE’05
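A minimal sketch of the reactive comparison: benchmark the chosen configurations before and after an update and test whether the latency distribution shifted. The numbers are synthetic, and scipy's two-sample Kolmogorov-Smirnov test is an assumed stand-in for whatever distributional comparison is applied:

```python
from scipy.stats import ks_2samp

pre  = [102, 99, 101, 98, 100, 103, 97, 101]     # pre-update latencies (us)
post = [151, 148, 150, 149, 152, 147, 150, 153]  # post-update latencies (us)

stat, p = ks_2samp(pre, post)
if p < 0.05:
    print(f"possible performance degradation (KS statistic {stat:.2f})")
```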
Reliable Effects Screening (RES) Process
- Proactive (see the main-effects sketch below):
  - Compute a formal experimental plan (we use screening designs) for identifying "important" options
  - Execute the experimental plan across the QA grid
  - Analyze the data to identify important options
  - Recalibrate frequently as the system changes
- Reactive:
  - When the software changes, estimate the distribution of performance across the entire configuration space by evaluating all combinations of the important options (called the screening suite), while randomizing the rest
  - Distributional changes signal possible performance degradations
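To make the proactive step concrete: in a two-level design, an option's main effect is the mean response at its high setting minus the mean at its low setting. Real screening designs are fractional factorials; the sketch below uses a tiny full 2^3 design and a synthetic benchmark so the arithmetic is visible:

```python
from itertools import product
from statistics import mean

def benchmark(cfg):                  # stand-in for a real latency benchmark
    a, b, c = cfg
    return 100 + 40 * b + 5 * c      # option B dominates, C is minor

runs = [(cfg, benchmark(cfg)) for cfg in product([0, 1], repeat=3)]

for i, name in enumerate("ABC"):
    hi = mean(r for cfg, r in runs if cfg[i] == 1)
    lo = mean(r for cfg, r in runs if cfg[i] == 0)
    print(f"main effect of {name}: {hi - lo:+.1f}")
# Options with large effects (B here) go into the screening suite.
```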
Feasibility Study
- Several recent changes to the message queuing strategy: has this degraded performance?
- Developers identified 14 potentially important binary options (2^14 = 16,384 configurations)

Option Name                   Option Settings
ORBReactorThreadQueue         {FIFO, LIFO}
ORBClientConnectionHandler    {RW, MT}
ORBConnectionPurgingStrategy  {LRU, LFU}
ORBConcurrency                {reactive, thread-per-connection}
...                           ...
Identify Important Options
- Options B and J: clearly important; options I, C, and F: arguably important
- Screening designs mimic exhaustive testing at a fraction of the cost

[Figure: half-normal probability plots of option effects on latency, for the full data set and for a 128-run screening design]
Estimate Performance with Screening Suite
- Estimates closely track actuals

[Figure: Q-Q plot of latency, screening suite vs. full data set]
Applying RES to Evolving System
- Estimated performance once a day on CVS snapshots of ATC
- Observations:
  - Developers noticed the deviation on 12/14/03, but not before
  - Developers estimated a drop of 5% on 12/14/03; our process more accurately shows around a 50% drop

[Figure: latency distribution over time]
Summary
- Reliable Effects Screening is an efficient, effective, and repeatable approach for estimating the performance impact of software updates
  - Was effective in detecting actual performance degradations
  - Cut benchmarking work 1000x (from 2 days to 5 mins.)
  - Fast enough to be part of the check-in process
- Ongoing extensions:
  - New experimental designs: pooled ANOVA
  - Applications to component-based systems (multilevel interactions)
Code-Level Fault Modeling
- Goal: automatically gain code-level insight into why/when specific faults occur in fielded program runs
- Solution approach:
  - Lightly instrument, deploy & execute program instances
  - Capture outcome-labeled execution profiles (e.g., pass/fail)
  - Build models that explain the outcome in terms of the profile data

See: M. Haran, A. Karr, M. Last, A. Orso, A. Porter, A. Sanil & S. Fouché, Techniques for Classifying Executions of Deployed Software to Support Software Engineering Tasks, UMd Technical Report #cs-tr-4826
Lightweight Instr. via Adaptive Sampling
- Want to minimize overhead (e.g., runtime slowdown, code bloat, data volume & analysis costs)
- Use an incremental data collection approach (sketched below):
  - Individual instances sparsely sample the measurement space
  - Sampling weights are initially uniform
  - Weights are incrementally adjusted to favor high-utility measurements
  - Over time, high-utility measurements are sampled frequently while low-utility ones aren't
- Typically requires an increase in the number of instances observed
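A minimal sketch of this adaptive weighting scheme. The probe names, the utility values, and the exponential-update rule are assumptions for illustration; the paper's actual utility function differs:

```python
import random

probes = [f"m{i}" for i in range(100)]   # the measurement space
weights = {p: 1.0 for p in probes}       # sampling weights start uniform

def sample_probes(k=5):
    """Choose k probes for one deployed instance, favoring high-utility
    ones (with replacement, for simplicity of the sketch)."""
    return random.choices(probes, weights=[weights[p] for p in probes], k=k)

def update_weights(probe, utility, rate=0.1):
    """Incrementally nudge a probe's weight toward its observed utility."""
    weights[probe] = (1 - rate) * weights[probe] + rate * utility

# One round: an instance samples sparsely, then reports probe m7 as useful.
chosen = sample_probes()
update_weights("m7", utility=5.0)   # m7 now gets sampled more often
```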
Association Trees
- The resulting data set is very sparse (i.e., 90-99% missing), which is problematic for many learning algorithms
- Developed a new learning algorithm called association trees (a brute-force sketch follows):
  - For each measurement m, split its data range into 2 parts, e.g., low (m < k) or high (m ≥ k), where k is the best split point
  - Convert the data to items that are present or missing
  - Use the apriori algorithm to find item combinations whose presence is highly correlated with each outcome
- Uses of association trees:
  - Analyze rules for clues to fault location
  - Apply models to new instances, e.g., to determine whether an in-house test case likely reproduces an in-the-field failure
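A sketch of the binarize-then-mine idea on synthetic data. Brute-force enumeration of small itemsets stands in for the apriori algorithm, and "perfectly correlated with failure" stands in for whatever correlation threshold the real algorithm uses:

```python
from itertools import combinations

# Synthetic profiles: measurement -> count, plus a pass/fail outcome label.
runs = [
    ({"m1": 9, "m2": 0, "m3": 7}, "fail"),
    ({"m1": 8, "m2": 1, "m3": 6}, "fail"),
    ({"m1": 1, "m2": 9, "m3": 7}, "pass"),
    ({"m1": 0, "m2": 8, "m3": 1}, "pass"),
]
split = {"m1": 5, "m2": 5, "m3": 5}   # per-measurement split points k

def items(profile):
    """Binarize: 'high' measurements become present items."""
    return frozenset(m for m, v in profile.items() if v >= split[m])

transactions = [(items(p), label) for p, label in runs]

# Find item combinations whose presence occurs only in failing runs.
for size in (1, 2):
    for combo in combinations(split, size):
        combo = frozenset(combo)
        fails = sum(1 for t, l in transactions if combo <= t and l == "fail")
        total = sum(1 for t, l in transactions if combo <= t)
        if total and fails == total:
            print(set(combo), "present only in failing runs")
```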
Feasibility Study
- Subject program: Java Byte Code Analyzer (JABA)
  - 60K statements, 400 classes, 3,000 methods
  - 707 test cases, 14 versions
  - Measured method entry and block execution counts
  - Ran clients on 120 CPUs
- Standard classification techniques using exhaustive measurement yielded near-perfect models (with some outliers)
Some Sample Results
- Competitive with standard methods using exhaustive data
- Measured 5-8% of the data per instance; observed ~4x as many instances in total

[Figure: coverage, misclassification, false positive & false negative rates (0-100%) for the method-counts and block-counts data]
Summary
- Preliminary results: created highly effective models at greatly reduced cost
- Scales well to finer-grained data & larger systems
- Ongoing extensions:
  - Tradeoffs in the number of measurements vs. instances observed
  - New utility functions: cost-benefit models, fixed-cost sampling
  - New applications: lightweight test oracles, failure report clustering
Overview
- The Skoll DCQA infrastructure & framework
- Novel DCQA processes
- Future work
Future Work
- Looking for more example systems (help!)
- New problem classes:
  - Performance and robustness optimization
  - Configuration advice
- Integrate with model-based testing systems
  - GUITAR: http://www.cs.umd.edu/~atif/newsite/guitar.htm
- Distributed QA tasks
- Improved statistical treatment of noisy data
  - Modeling unobservables and/or detecting outliers