Evolution in Open Source Software: A Case Study
Michael W. Godfrey Qiang Tu
[paper in ICSM 2000]
Software Architecture Group University of Waterloo
What is software evolution?
“Evolution is what happens while you’re busy
making other plans.”
Usually, we consider evolution to begin once the first version has been delivered:
Maintenance is the planned set of tasks to effect changes.
Evolution is what actually happens to the software.
Previous research
  Lehman’s laws
  Parnas on software geriatrics
  Eick et al. on code decay (10 MLOC telecom)
  Gall et al. (10 MLOC telecom)
  Munro, Burd et al. (2 MLOC gcc)
Lehman’s Laws in a nutshell
Observations:
  (Most) useful software must evolve or die.
  As a software system gets bigger, its resulting complexity tends to limit its ability to grow.
  Development progress/effort is (more or less) constant; growth is at best constant.
Advice:
  Need to manage complexity.
  Do periodic redesigns.
  Treat software and its development process as a feedback system (and not as a passive theorem).
Lehman’s examples
A case study in evolution: The Linux OS kernel
It’s Linux!
Large system, very stable, many releases over several years, many developers
Growing mainstream adoption
Open source development model
Interesting phenomenon in itself
Easy to track, can publish results, many experts
Not much previous study
Methodology
Examined 96 versions of Linux kernel
  34 of the 67 stable releases
  62 of the 369 development releases
All measures considered only .c/.h files contained in the tarball
Counted LOC using “wc -l” and an awk script that ignored comments and blank lines
Counted # of fcns/vars/macros using ctags
Architectural model (SSs hierarchy) based on default directory structure
We plotted growth against calendar time
  Lehman suggests plotting growth against release number
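The LOC measures described in the methodology can be approximated in a few lines of Python. This is not the awk script used in the study (it is not reproduced in these slides); it is a minimal sketch, assuming that "uncommented LOC" means lines that still contain non-whitespace text once C comments are removed. The function names `strip_c_comments`, `count_loc`, and `count_tree` are hypothetical.

```python
import re
from pathlib import Path

# One pattern for block comments, line comments, and string/char literals.
# Literals are matched so that comment markers inside them are left alone.
_TOKEN = re.compile(
    r'/\*.*?\*/|//[^\n]*|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_c_comments(source):
    """Return C source with comments removed and line breaks preserved."""
    def replace(match):
        text = match.group(0)
        if text.startswith("/*"):
            # Keep the newlines so that later lines keep their positions.
            return "\n" * text.count("\n") if "\n" in text else " "
        if text.startswith("//"):
            return ""
        return text  # string or character literal: keep unchanged
    return _TOKEN.sub(replace, source)


def count_loc(source):
    """Return (total lines as "wc -l" counts them, uncommented non-blank lines)."""
    total = source.count("\n")
    stripped = strip_c_comments(source)
    uncommented = sum(1 for line in stripped.splitlines() if line.strip())
    return total, uncommented


def count_tree(root):
    """Sum both measures over every .c/.h file under an unpacked tarball."""
    total = uncommented = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in (".c", ".h"):
            file_total, file_uncommented = count_loc(path.read_text(errors="replace"))
            total += file_total
            uncommented += file_uncommented
    return total, uncommented
```

Calling `count_tree` on each unpacked release and pairing the result with the release date would give one point per release on a growth plot. Preprocessor-disabled code and trigraphs are not handled.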
Growth of # of source files
[Figure: # of source code files (*.[ch]), 0 to 6000, plotted against release date, Jan 1993 to Apr 2001. Series: development releases (1.1, 1.3, 2.1, 2.3); stable releases (1.0, 1.2, 2.0, 2.2).]
Growth of # of global fcns, variables, and macros
[Figure: # of global fcns, variables, and macros, 0 to 140,000, plotted against release date, Jan 1993 to Apr 2001. Series: development releases (1.1, 1.3, 2.1, 2.3); stable releases (1.0, 1.2, 2.0, 2.2).]
Growth of Lines of Code (LOC)
[Figure: total LOC, 0 to 2,500,000, plotted against release date, Jan 1993 to Apr 2001. Series: total LOC ("wc -l") for development and stable releases; total uncommented LOC for development and stable releases.]
Average/median .c file size
[Figure: uncommented LOC, 0 to 700, plotted against release date, Jan 1993 to Apr 2001. Series: average .c file size for dev. and stable releases; median .c file size for dev. and stable releases.]
Average/median .h file size
[Figure: uncommented LOC, 0 to 140, plotted against release date, Jan 1993 to Apr 2001. Series: average .h file size for dev. and stable releases; median .h file size for dev. and stable releases.]
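The file-size plots report both an average and a median because the two can diverge when a few very large files dominate. A minimal sketch of the computation, assuming per-file uncommented LOC counts are already available; the function name `file_size_summary` and the sample numbers are illustrative and are not data from the study.

```python
from statistics import mean, median


def file_size_summary(loc_per_file):
    """Return (average, median) uncommented LOC over a list of per-file counts."""
    if not loc_per_file:
        raise ValueError("need at least one file")
    return mean(loc_per_file), median(loc_per_file)


# Illustrative numbers only: three small files and one very large one.
average, middle = file_size_summary([10, 20, 30, 1000])
print(average, middle)  # the average is pulled up by the single large file
```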
Growth of major SSs (dev. releases)
[Figure: total uncommented LOC, 0 to 1,200,000, plotted against release date, Jan 1993 to Apr 2001. Series: drivers, arch, include, net, fs, kernel, mm, ipc, lib, init.]
SS LOC as percentage of total system
[Figure: percentage of total system uncommented LOC, 0.0 to 70.0, plotted against release date, Jan 1993 to Apr 2001. Series: drivers, arch, include, net, fs, kernel, mm, ipc, lib, init.]
SS LOC as percentage of total system (ignoring drivers)
[Figure: percentage of total system uncommented LOC, 0.0 to 30.0, plotted against release date, Jan 1993 to Apr 2001. Series: arch, include, net, fs, kernel, mm, ipc, lib, init.]
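Since the architectural model follows the default directory structure, the per-subsystem percentages can be sketched as a roll-up by top-level directory. This is only an illustration under that assumption; `subsystem_shares` is a hypothetical name and the LOC figures in the usage example are invented, not taken from the study.

```python
from collections import defaultdict


def subsystem_shares(loc_by_file, ignore=()):
    """Map each top-level directory to its percentage of total uncommented LOC.

    loc_by_file maps a path relative to the kernel root (e.g. "drivers/net/x.c")
    to that file's uncommented LOC. Subsystems named in `ignore` are left out
    of both the numerator and the total.
    """
    totals = defaultdict(int)
    for path, loc in loc_by_file.items():
        subsystem = path.split("/", 1)[0]
        if subsystem not in ignore:
            totals[subsystem] += loc
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: 100.0 * loc / grand_total for name, loc in totals.items()}


# Invented numbers, for illustration only.
example = {
    "drivers/net/a.c": 600,
    "arch/i386/b.c": 200,
    "kernel/sched.c": 100,
    "mm/c.c": 100,
}
print(subsystem_shares(example))
print(subsystem_shares(example, ignore=("drivers",)))
```

The second call corresponds to the "ignoring drivers" view: the remaining subsystems are rescaled against the smaller total.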
Growth of small core SSs
[Figure: total uncommented LOC, 0 to 9000, plotted against release date, Jan 1993 to Apr 2001. Series: kernel, mm, ipc, lib, init.]
Growth of arch SSs
[Figure: total uncommented LOC, 0 to 40,000, plotted against release date, Jan 1993 to Apr 2001. Series: arch/ppc/, arch/sparc/, arch/sparc64/, arch/m68k/, arch/mips/, arch/i386/, arch/alpha/, arch/arm/, arch/sh/, arch/s390/.]
Growth of drivers SSs
[Figure: total uncommented LOC, 0 to 300,000, plotted against release date, Jan 1993 to Apr 2001. Series: drivers/net, drivers/scsi, drivers/char, drivers/video, drivers/isdn, drivers/sound, drivers/acorn, drivers/block, drivers/cdrom, drivers/usb, drivers/"others".]
Observations and hypotheses
Growth along devel. path is super-linear
y = 0.21*x^2 + 252*x + 90,055 (r^2 = 0.997)
  y = size in LOC
  x = days since v1.0
  r^2 is the “coefficient of determination” using least squares
[Lehman/Turski’s model: y’ = y + E/y^2, which gives approximately y = (3Ex)^(1/3)]
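The two growth models can be compared numerically. The sketch below evaluates the fitted quadratic from the slide, iterates the Lehman/Turski inverse-square recurrence, and computes the coefficient of determination for any set of observations and predictions. The function names are hypothetical, and the starting size and effort constant E used in the example are arbitrary illustrative values, not figures from the study.

```python
def linux_loc_model(days_since_v1):
    """Fitted quadratic from the slide: size in LOC as a function of days since v1.0."""
    return 0.21 * days_since_v1 ** 2 + 252 * days_since_v1 + 90_055


def turski_sizes(initial_size, effort, releases):
    """Iterate the inverse-square model y' = y + E / y^2 for a number of releases."""
    sizes = [initial_size]
    for _ in range(releases):
        current = sizes[-1]
        sizes.append(current + effort / current ** 2)
    return sizes


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Quadratic model: growth per day keeps increasing.
print(linux_loc_model(0), linux_loc_model(1000))

# Inverse-square model (arbitrary E): each release adds less than the one before.
print(turski_sizes(100.0, 1e6, 5))
```

The contrast is the point of the slide: under the inverse-square model the increments shrink as the system grows, while the fitted Linux curve has increments that grow with time.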
Linux’s strong growth is continuing. This is stronger growth at MLOC level than
observed by others (Lehman, Gall), even for other OSs.
Why has Linux been able to continue its super-linear growth?
Core code quality is carefully maintained
Architecture/problem domain
  It’s largely drivers
  Much of the code is “parallel”
  It’s not as big as you might think
    Vanilla configuration used only 15% of files
Development model (OSD) and its sociology
  Popularity and visibility have encouraged outsiders (both hackers and industry) to contribute
Growth of pine (email client)
[Figure: # of modules, 0 to 350, plotted against release date, Jan-93 to Apr-01.]
Growth of gcc/g++/egcs
[Figure: # of modules, 0 to 1000, plotted against release date, Aug-87 to Apr-01. Series: g++, gcc, egcs.]
Growth of vim (text editor)
[Figure: total LOC, 0 to 160,000, plotted against release date, May 1990 to Apr 2001. Series: total LOC ("wc -l"); total LOC (ignoring comments and blank lines).]
vim avg % comments and blank lines per file
[Figure: average percent comments + blank lines, 25.0 to 31.0, plotted against release date, May 1990 to Apr 2001.]
vim avg/median file size
[Figure: uncommented LOC, 0 to 1000, plotted against release date, May 1990 to Apr 2001. Series: average uncommented LOC per source file; median uncommented LOC per source file.]
vim’s architecture
Hypotheses
Factors affecting evolution include
  Size and age of system
  Use of traditional sw. eng. principles during development
PLUS
  Problem domain
    Problem complexity, multi-platform, multi-features
  Software architecture
  Process model
  Sociology, market forces, and acts-of-God
Software evolution research: What next?
So far, we have examined only growth.
More case studies needed
  Qualitative and quantitative
  Industrial and open source systems
  Different problem domains, architectures
Supporting tools to aid analysing, visualizing, and querying program evolution
  More than just RCS and perl
  Support for architecture repair
Codified knowledge: Why and how does software change?
  Build catalogue of change patterns and evolutionary narratives
Codified knowledge
Mature engineering disciplines codify knowledge and experience. Arguably, this is lacking in software engineering.
  Software architecture styles [Shaw]
  Design patterns [GoF]
Codified knowledge of how and why programs evolve:
  Evolutionary narratives [Godfrey]
    Long term, coarse granularity
  Change patterns
    Short term, fine granularity
Change patterns and evolutionary narratives
Cathedral style [Raymond]
  careful control and management
  debugging done before committing code
  evolution is slow, planned, rarely undone
Bazaar style (OSD)
  lots of low-level changes, frequent fixes
  lots of “building around” rather than wholesale changing, occasional redesigns
  creeping feature-itis, “complete” dependency graph
Change patterns and evolutionary narratives
Band-aid evolution (just add a layer)
  quick & dirty way to add new functionality, esp. if system is not well understood
  e.g., Y2K fixing, adding portability, new features
“Vestigial features”
  design artifact persists after rationale dies
  e.g., whale fin bone structure resembles hand
Change patterns and evolutionary narratives
Phenomena observed in Linux evolution
  Bandwagon effect
  Contributed third party code
  “Mostly parallel” enables sustained growth
  Clone and hack
  Careful control of core code; more flexibility on contributed drivers, experimental features
Defining, Transforming, and Exchanging High-Level Schemas
A guided journey through the outback
Presented by Michael W. Godfrey
Software Architecture Group (SWAG)
Dept of Comp Sci, Univ of Waterloo
This presentation is available from http://plg.uwaterloo.ca/~migod/papers/
What is a High-Level Schema?
My answer: Any schema above the statement level
I see two distinct levels of abstraction:
1. Programming language entity level
   – Entities are (shared) fcns, vars, types, classes, …
2. Architectural level
   – Entities are modules, subsystems, classes, interfaces, …
Previous Work
Lots of motivational work
  ad hoc extractor snarfing
  experimental translation mechanisms
Examples (many others exist)
  CORUM I and II
  GRAX
  TAXForm (TA eXchange FORMat) using Acacia, Rigiparse
  Rigi using VisualAge C++
  Dali using Sniff+
My (selfish) goals
I would like to be able to use other extractors …
  Want to perform architectural analyses of systems written in languages other than C
  Want to implement BEAGLE (a tool for exploring software evolution)
… but extractors differ in languages modelled, level of detail, robustness, bugs, data format, …
I want to be able to convert data between tools.
  Need agreement (awareness) from tool creators
TAXForm Utopia
[Diagram components: System Artifacts; PBS Extractor (cfx), Rigi Extractor (rigiparse), Dali Extractor (SNiFF+); cfx to TAXForm Converter, Rigi to TAXForm Converter, Dali to TAXForm Converter; TAXForm Repository; Bunch/TAXForm Converter, TAXForm to Rigi Converter; PBS Viewer and Abstraction Tools, Bunch Clustering Tool, Rigi SHriMP Viewer.]
Transforming Between Schemas
[Diagram: schema levels. Universal; High-Level; Procedural (PL/I, C) and Object-Oriented (C++, Java); tool-specific C schemas: Acacia C, Rigi C, PBS C.]
TAXForm: Procedural schema
[Diagram: entities SourceFile, Data Type, Procedure, Data; relations usesfile, defines, usestype, usesdata, usesprocedure.]
TAXForm: High level schema
[Diagram: entities Module, Subsystem; relations depends-on, contains.]
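One way to relate the procedural schema to the high-level one is to "lift" procedure-level uses into module-level depends-on edges through the defines relation. The sketch below is an illustration of that idea only, not TAXForm's actual transformation; it assumes that each source file is treated as a module, and the function name `lift_to_modules` and the sample facts are hypothetical.

```python
def lift_to_modules(defines, uses):
    """Derive module-level depends-on edges from procedure-level facts.

    defines: dict mapping a module (here, a source file) to the set of
             procedures it defines.
    uses:    iterable of (caller, callee) procedure pairs.
    Returns a set of (module, module) depends-on edges. Uses inside one
    module, and uses of procedures with no known definition, are dropped.
    """
    owner = {proc: module for module, procs in defines.items() for proc in procs}
    depends_on = set()
    for caller, callee in uses:
        source = owner.get(caller)
        target = owner.get(callee)
        if source is not None and target is not None and source != target:
            depends_on.add((source, target))
    return depends_on


# Hypothetical facts for illustration.
defines = {"main.c": {"main", "usage"}, "entry.c": {"closeTagFile"}}
uses = [("main", "closeTagFile"), ("main", "usage"), ("main", "printf")]
print(lift_to_modules(defines, uses))
```

Dropping unresolved callees (such as `printf` here) is one possible answer to the "missing files" question; an extractor could equally record them against a placeholder module.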
Back to my (selfish) goals
Would like to concentrate on procedural and OO languages.
  Others are interested in COBOL, JCL etc.
I am interested in high-level info (f calls g) but not in ASGs, code-level metrics
Need to agree on
  Syntax
  Level of granularity and detail
  What to do in case of X, e.g., X = “missing files”
My schema wish list
[influenced by Acacia’s C and C++ data models]
Top-level programming language entities:
  functions, variables, constants, type definitions (procedural languages)
  methods, class member data, static methods and member data (object-oriented languages)
Entity containers:
  files, modules, classes, packages
My schema wish list
Entity attributes:
  Name, unique identifier (UID -- see next section)
  UID of container, UID of containing file (if container is not a file)
  Signature/data type
  Line number information (see below)
  Declared scope/visibility, static or not, final or not
  Definition or declaration (see below)
Entity container attributes:
  name, UID
  relative path (if a file)
  version identifier (if provided)
  UID of container (if not a file), UID of cont. file (if not a file)
My schema wish list
Relationships:
  Function calls, variable uses
  Line number information (see below)
  Container use/inclusion (by other containers)
  Inheritance (various kinds)
  “Friendship”, various template relationships
Relationship attributes:
  Line number information (see below)
  Scope/permission of inheritance
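The wish list can be read as a small data model. The sketch below writes it down as Python dataclasses to make the attributes concrete; the class and field names are my own guesses at a reasonable encoding, not an agreed schema, and the UIDs in the usage example are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Container:
    uid: str
    name: str
    kind: str                             # "file", "module", "class", "package"
    relative_path: Optional[str] = None   # only if the container is a file
    version: Optional[str] = None         # only if provided
    container_uid: Optional[str] = None   # enclosing container, if not a file
    file_uid: Optional[str] = None        # containing file, if not a file


@dataclass(frozen=True)
class Entity:
    uid: str
    name: str
    kind: str                             # "function", "variable", "method", ...
    container_uid: str
    file_uid: Optional[str] = None        # if the container is not a file
    signature: str = ""
    lines: Optional[Tuple[int, int]] = None   # (first line, last line)
    visibility: str = ""
    is_static: bool = False
    is_final: bool = False
    is_definition: bool = True            # False for a declaration


@dataclass(frozen=True)
class Relationship:
    kind: str                             # "calls", "uses", "includes", "inherits", ...
    source_uid: str
    target_uid: str
    line: Optional[int] = None
    scope: str = ""                       # e.g. scope/permission of inheritance


def callers_of(target_uid, relationships):
    """Answer "who calls f": UIDs of entities with a calls edge to target_uid."""
    return sorted(r.source_uid for r in relationships
                  if r.kind == "calls" and r.target_uid == target_uid)


# Invented UIDs, for illustration only.
facts = [Relationship("calls", "u1", "u9", line=42),
         Relationship("uses", "u2", "u9"),
         Relationship("calls", "u3", "u9")]
print(callers_of("u9", facts))
```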
Problems
Some technical problems:
  UID generation? (name-mangling?)
  Line numbering (ranges)?
  Incomplete information?
    ill-formed code, gcc/K&R-isms
    missing header files
    resolving entity use to dfn/dcl (esp. with polymorphism, overloading)
  Pre or post preprocessing?
Problems
We’ve had these conversations before …
“Getting academics to agree on anything is like herding cats.”
Example Extractors/Systems
Included here:
PBS [UWloo]
Acacia [AT&T]
cxref, ctags, cscope
TA++ [UOttawa]
BAUHAUS [UStuttgart]
GUPRO [UKoblenz]
Others:
Rigi [UVictoria]
SPOOL [UMontréal]
Datrix [Bell Canada]
MOOSE [UBern]
SHORE [SD&M]
Neuhold [UVienna]
VisualAge C++ [IBM]
… [many others]
Dimensions of Variation
Intended use
Level of schema (entity level, architectural level, or mixed)
Amount of detail
Languages modelled
  Multi-lingual
  Common super schemas
  Explicit model “cross-overs” (e.g., JCL, embedded SQL)
Hidden assumptions
Known limitations
Notation/approach to store factbase
Support for translations and transformations
What’s particularly novel and noteworthy
PBS [Holt et al. @ UWaterloo]
Portable Bookshelf is a reverse engineering tool for creating software architecture models of large systems:
Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel, TOBEY, …
Consists of fact extractor, fact manipulation engine (“grok”), and visualization tool (“landscape”)
[Diagram: source code → cfx → entity-level facts → grok → architectural facts → landscape viewer]
PBS C Language E/R View
PBS Architectural Schema
Acacia [Chen, Gansner et al. @ AT&T]
History: CIA → CIAO → Acacia
Consists of
  C and C++ extractors
  SQL-like query engine
  visualization with auto-layout
Acacia C++/C Schemas
Entity attributes:
  Hex UID, name, kind (file, function, type, var, macro), filename, datatype (string), typeclass (enum, struct, etc.), linenum info for def/dec, def/dec/undef, param list, template info, scope, storage spec (static, const, inline, inline virtual, etc.), signature
Relationship attributes:
  Linenum info, rel. kind (refers, contains, inherits, instantiates, typedef, etc.), relationship scope
Acacia Queries
SQL-like queries for entities and relationships produce “;”-delimited textual output:
% ksh cdef -u fu closeTagFile
26f53ece;closeTagFile;function;entry.h;void;regular;83;0;83;dec;00000000;(const boolean);;extern;;;;
76e7ae31;closeTagFile;function;entry.c;void;regular;551;553;563;def;00000000;(const boolean);;extern;;;;
% ksh cref -u - - m - file2='osdeps.h'
<all entity1 attrs> ; <all entity2 attrs> ; <rel attrs>
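Because the query output is plain “;”-delimited text, other tools can consume it with very little code. The sketch below parses one entity record of the kind shown for closeTagFile. The mapping of positions to names is inferred from that example output alone and is an assumption, not Acacia's documented field layout; only the positions that are unambiguous in the example are named, and the rest are kept as raw strings. `EntityRecord` and `parse_entity_record` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EntityRecord:
    uid: str          # hex UID (position 0 in the example output)
    name: str         # position 1
    kind: str         # position 2, e.g. "function"
    filename: str     # position 3
    datatype: str     # position 4
    def_or_dec: str   # position 9 in the example: "def" or "dec"
    fields: List[str] # every field, unparsed, for anything not named above


def parse_entity_record(line):
    """Split one ';'-delimited entity record into named and raw fields."""
    fields = line.strip().split(";")
    if len(fields) < 10:
        raise ValueError("expected at least 10 ';'-delimited fields")
    return EntityRecord(
        uid=fields[0],
        name=fields[1],
        kind=fields[2],
        filename=fields[3],
        datatype=fields[4],
        def_or_dec=fields[9],
        fields=fields,
    )


record = parse_entity_record(
    "26f53ece;closeTagFile;function;entry.h;void;regular;83;0;83;dec;"
    "00000000;(const boolean);;extern;;;;"
)
print(record.name, record.kind, record.filename, record.def_or_dec)
```

A converter to an interchange format would map these fields onto the agreed entity attributes; fields such as the parameter list may themselves contain delimiters in other records, which this sketch does not handle.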
ctags, cxref, cscope
These are “open source” Unix tools that perform extractions:
ctags extracts only entity info
  e.g., file, name, line num, kind, etc.
  works with C, C++, Eiffel, Fortran, and Java.
  Used for fast context switching while editing source code with vim/emacs
cxref generates cross-reference table for C systems.
  Often used for webifying source code (e.g., Linux, Mozilla).
cscope used for program comprehension of C systems (e.g., who calls f, who uses v)
  Older commercial Unix tool, recently open sourced.
TA++ [Lethbridge et al. @ UOttawa]
TKSee aids program comprehension
  i.e., what programmers do all day
  TA++ is the data modelling language
Want “full story” from the source code:
  Want pre-preprocessing view of code for all platforms and environments (text editor’s view)
  … but most extractors use a compiler front end and preprocess toward a particular target and option set
  Some extractors keep some macro info
TA++ Combined E/R Model
BAUHAUS [Koschke et al. @ UStuttgart]
Software architecture recovery system
  Parse code, look for hidden/decayed abstractions, then redesign
  Uses various heuristics to perform “clustering”
  Works both at entity level and subsystem level
Built from many tools …
  … including Rigi viewer and a customized C parser/extractor that (optionally) dumps RSF
Example WoSEF problem:
  Cannot derive full includes hierarchy from Bauhaus extracted facts; this was a design decision, as the researchers were not interested in this information
BAUHAUS Entities
BAUHAUS Relationships
BAUHAUS Combined E/R
GUPRO [Ebert, Kullbach, Winter et al. @ UKoblenz]
GUPRO supports simultaneous modelling of inter-related systems written in different programming languages
  In particular, concerned with the COBOL/MVS/JCL mainframe world
GUPRO is notable because:
  Simultaneously multilingual
  Explicitly models “boundary crossings” (!)
  Looks at (very real) problems of the mainframe world
    COBOL, JCL, database migration
GUPRO
Candidate system is modelled in an object-based repository using a graph-based approach:
EER (modelling language)
+GRAL (constraint language)
GReQL mechanism supports structured queries on the repository via restricted first-order logic
GUPRO
JCL schema COBOL schema
GUPRO
Integrated schemas for JCL and COBOL
GUPRO Multi-Language Model
Summary — High-Level Schemas
Lots of sticky issues at the prog. lang. level:
  To pre- or not to pre-process
  Entity resolution often not done (e.g., Datrix)
  What is a function: def, dec, polymorphism, overloading, templates, …
  How to deal with missing libraries, incremental extractions, versioned extractions, non-ANSI-isms, …
Conceptual gaps:
  COBOL/JCL world very different from C/C++/Java world
  “I didn’t know you wanted full includes info…”
Summary — Good News
Many of us seem to be doing similar kinds of extractions.
It seems likely that:
  Many extractors can be used within other tools
  Some form of common interchange format is feasible, though it may not please everyone.
Challenges:
  May want to use multiple tools together
    I have been working on a standalone cxref-based hack to add full includes information to a BAUHAUS converter
  Can we take advantage of the web to set up some sort of distributed fact extraction/conversion factory? [Holt]