Evolution in Open Source Software: A Case Study

Michael W. Godfrey, Qiang Tu

[paper in ICSM 2000]

Software Architecture Group University of Waterloo

What is software evolution?

“Evolution is what happens while you’re busy making other plans.”

Usually, we consider evolution to begin once the first version has been delivered:

Maintenance is the planned set of tasks to effect changes.

Evolution is what actually happens to the software.

Previous research

– Lehman’s laws
– Parnas on software geriatrics
– Eick et al. on code decay (10 MLOC telecom)
– Gall et al. (10 MLOC telecom)
– Munro, Burd et al. (2 MLOC gcc)

Lehman’s Laws in a nutshell

Observations:
– (Most) useful software must evolve or die.
– As a software system gets bigger, its resulting complexity tends to limit its ability to grow.
– Development progress/effort is (more or less) constant; growth is at best constant.

Advice:
– Need to manage complexity.
– Do periodic redesigns.
– Treat software and its development process as a feedback system (and not as a passive theorem).

Lehman’s examples

A case study in evolution: The Linux OS kernel

It’s Linux!
– Large system, very stable, many releases over several years, many developers
– Growing mainstream adoption

Open source development model
– Interesting phenomenon in itself
– Easy to track, can publish results, many experts
– Not much previous study

Methodology

– Examined 96 versions of the Linux kernel
  – 34 of the 67 stable releases
  – 62 of the 369 development releases
– All measures considered only .c/.h files contained in the tarball
– Counted LOC using “wc -l” and an awk script that ignored comments and blank lines
– Counted # of fcns/vars/macros using ctags
– Architectural model (SSs hierarchy) based on default directory structure
– We plotted growth against calendar time
  – Lehman suggests plotting growth against release number
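
The two LOC measures in the methodology (raw line count, and lines left after ignoring comments and blank lines) can be illustrated with a short Python sketch. This is not the authors' awk script, which the slides do not show. The function names `count_loc` and `count_tree` are hypothetical, and the comment handling is a simplification that assumes comment markers do not appear inside string literals.

```python
import re
from pathlib import Path

# Matches C block comments (possibly multi-line) and // line comments.
# Simplification: comment markers inside string literals are not handled.
_COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def count_loc(text: str) -> tuple[int, int]:
    """Return (total_lines, uncommented_nonblank_lines) for C source text."""
    total = len(text.splitlines())  # roughly what `wc -l` reports
    # Replace each comment with the newlines it spanned, so that code
    # before and after a block comment stays on separate lines.
    stripped = _COMMENT_RE.sub(lambda m: "\n" * m.group(0).count("\n"), text)
    code = sum(1 for line in stripped.splitlines() if line.strip())
    return total, code


def count_tree(root: str) -> tuple[int, int]:
    """Sum both counts over every .c/.h file under root."""
    total = code = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".c", ".h"}:
            t, c = count_loc(path.read_text(errors="replace"))
            total += t
            code += c
    return total, code


sample = (
    "/* header\n"
    "   comment */\n"
    "#include <stdio.h>\n"
    "\n"
    "int main(void) { return 0; } // done\n"
)
print(count_loc(sample))
```

Calling `count_tree` on an unpacked kernel tarball directory would give one data point per release; plotting those points against release dates corresponds to the calendar-time view used in the slides.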

Growth of # of source files

[Figure: x-axis: calendar time, Jan 1993 to Apr 2001; y-axis: # of source code files (*.[ch]), 0 to 6000; series: development releases (1.1, 1.3, 2.1, 2.3) and stable releases (1.0, 1.2, 2.0, 2.2).]

Growth of # of global fcns, variables, and macros

[Figure: y-axis: # of global fcns, variables, and macros, 0 to 140,000; series: development releases (1.1, 1.3, 2.1, 2.3) and stable releases (1.0, 1.2, 2.0, 2.2).]

Growth of Lines of Code (LOC)

[Figure: y-axis: Total LOC, 0 to 2,500,000; series: Total LOC ("wc -l") for development and stable releases, Total LOC uncommented for development and stable releases.]

Average/median .c file size

[Figure: y-axis: Uncommented LOC, 0 to 700; series: average .c file size (dev. and stable releases), median .c file size (dev. and stable releases).]

Average/median .h file size

[Figure: y-axis: Uncommented LOC, 0 to 140; series: average .h file size (dev. and stable releases), median .h file size (dev. and stable releases).]

Growth of major SSs (dev. releases)

[Figure: y-axis: Total uncommented LOC, 0 to 1,200,000; series: drivers, arch, include, net, fs, kernel, mm, ipc, lib, init.]

SS LOC as percentage of total system

[Figure: y-axis: Percentage of total system uncommented LOC, 0.0 to 70.0; series: drivers, arch, include, net, fs, kernel, mm, ipc, lib, init.]

SS LOC as percentage of total system (ignoring drivers)

[Figure: y-axis: Percentage of total system uncommented LOC, 0.0 to 30.0; series: arch, include, net, fs, kernel, mm, ipc, lib, init.]

Growth of small core SSs

[Figure: y-axis: Total uncommented LOC, 0 to 9000; series: kernel, mm, ipc, lib, init.]

Growth of arch SSs

[Figure: y-axis: Total uncommented LOC, 0 to 40,000; series: arch/ppc/, arch/sparc/, arch/sparc64/, arch/m68k/, arch/mips/, arch/i386/, arch/alpha/, arch/arm/, arch/sh/, arch/s390/.]

Growth of drivers SSs

[Figure: y-axis: Total uncommented LOC, 0 to 300,000; series: drivers/net, drivers/scsi, drivers/char, drivers/video, drivers/isdn, drivers/sound, drivers/acorn, drivers/block, drivers/cdrom, drivers/usb, drivers/"others".]

Observations and hypotheses

– Growth along devel. path is super-linear:
  y = 0.21*x^2 + 252*x + 90,055, with r^2 = 0.997
  where y = size in LOC, x = days since v1.0, and r^2 is the “coefficient of determination” using least squares
  [Lehman/Turski’s model: y’ = y + E/y^2, i.e., y ≈ (3Ex)^(1/3)]
– Linux’s strong growth is continuing.
– This is stronger growth at MLOC level than observed by others (Lehman, Gall), even for other OSs.
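
To make "super-linear" concrete, the quadratic fit quoted on this slide can be evaluated directly. The sketch below only restates the slide's formula and its derivative, and adds a generic coefficient-of-determination helper; the function names are hypothetical and no measured data from the study is reproduced.

```python
def linux_loc(days: float) -> float:
    """Quadratic fit quoted on the slide: y = 0.21*x^2 + 252*x + 90,055."""
    return 0.21 * days**2 + 252 * days + 90_055


def growth_rate(days: float) -> float:
    """Derivative of the fit, in LOC per day: dy/dx = 0.42*x + 252."""
    return 0.42 * days + 252


def r_squared(observed, predicted) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Print day, modelled size in LOC, and modelled growth rate in LOC/day.
for x in (0, 1000, 2000):
    print(x, round(linux_loc(x)), round(growth_rate(x)))
```

Under this fit the modelled growth rate rises from 252 LOC/day at v1.0 to 1092 LOC/day at day 2000, which is the sense in which growth is super-linear. Turski's inverse-square model instead predicts a growth rate that shrinks as the system gets larger.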

Why has Linux been able to continue its geometric growth?

– Core code quality is carefully maintained
– Architecture/problem domain
  – It’s largely drivers
  – Much of the code is “parallel”
  – It’s not as big as you might think: vanilla configuration used only 15% of files
– Development model (OSD) and its sociology
  – Popularity and visibility have encouraged outsiders (both hackers and industry) to contribute

Growth of pine (email client)

[Figure: x-axis: calendar time, Jan-93 to Apr-01; y-axis: # of modules, 0 to 350.]

Growth of gcc/g++/egcs

[Figure: x-axis: calendar time, Aug-87 to Apr-01; y-axis: # of modules, 0 to 1000; series: g++, gcc, egcs.]

Growth of vim (text editor)

[Figure: x-axis: calendar time, May 1990 to Apr 2001; y-axis: Total LOC, 0 to 160,000; series: Total LOC ("wc -l"), Total LOC (ignoring comments and blank lines).]

vim avg % comments and blank lines per file

[Figure: y-axis: Average percent comments + blank lines, 25.0 to 31.0.]

vim avg/median file size

[Figure: y-axis: Uncommented LOC, 0 to 1000; series: average uncommented LOC per source file, median uncommented LOC per source file.]

vim’s architecture

Hypotheses

Factors affecting evolution include
– Size and age of system
– Use of traditional sw. eng. principles during development

PLUS
– Problem domain: problem complexity, multi-platform, multi-features
– Software architecture
– Process model
– Sociology, market forces, and acts-of-God

Software evolution research: What next?

So far, we have examined only growth.
– More case studies needed
  – Qualitative and quantitative
  – Industrial and open source systems
  – Different problem domains, architectures
– Supporting tools to aid analysing, visualizing, and querying program evolution
  – More than just RCS and perl
  – Support for architecture repair
– Codified knowledge: Why and how does software change?
  – Build catalogue of change patterns and evolutionary narratives

Codified knowledge

– Mature engineering disciplines codify knowledge and experience. Arguably, this is lacking in software engineering.
  – Software architecture styles [Shaw]
  – Design patterns [GoF]
– Codified knowledge of how and why programs evolve:
  – Evolutionary narratives [Godfrey]: long term, coarse granularity
  – Change patterns: short term, fine granularity

Change patterns and evolutionary narratives

– Cathedral style [Raymond]
  – careful control and management
  – debugging done before committing code
  – evolution is slow, planned, rarely undone
– Bazaar style (OSD)
  – lots of low-level changes, frequent fixes
  – lots of “building around” rather than wholesale changing, occasional redesigns
  – creeping feature-itis, “complete” dependency graph

Change patterns and evolutionary narratives

– Band-aid evolution (just add a layer)
  – quick & dirty way to add new functionality, esp. if system is not well understood
  – e.g., Y2K fixing, adding portability, new features
– “Vestigial features”
  – design artifact persists after rationale dies
  – e.g., whale fin bone structure resembles hand

Change patterns and evolutionary narratives

Phenomena observed in Linux evolution
– Bandwagon effect
– Contributed third party code
– “Mostly parallel” enables sustained growth
– Clone and hack
– Careful control of core code; more flexibility on contributed drivers, experimental features

Defining, Transforming, and Exchanging High-Level Schemas

A guided journey through the outback

Presented by Michael W. Godfrey, Software Architecture Group (SWAG), Dept of Comp Sci, Univ of Waterloo

This presentation is available from http://plg.uwaterloo.ca/~migod/papers/

What is a High-Level Schema?

My answer: Any schema above the statement level

I see two distinct levels of abstraction:
1. Programming language entity level
   – Entities are (shared) fcns, vars, types, classes, …
2. Architectural level
   – Entities are modules, subsystems, classes, interfaces, …

Previous Work

– Lots of motivational work
  – ad hoc extractor snarfing
  – experimental translation mechanisms
– Examples (many others exist)
  – CORUM I and II
  – GRAX
  – TAXForm (TA eXchange FORMat) using Acacia, Rigiparse
  – Rigi using VisualAge C++
  – Dali using Sniff+

My (selfish) goals

– I would like to be able to use other extractors …
  – Want to perform architectural analyses of systems written in languages other than C
  – Want to implement BEAGLE (a tool for exploring software evolution)
– … but extractors differ in languages modelled, level of detail, robustness, bugs, data format, …
– I want to be able to convert data between tools.
– Need agreement (awareness) from tool creators

TAXForm Utopia

[Figure: components shown: System Artifacts; PBS Extractor (cfx), Rigi Extractor (rigiparse), Dali Extractor (SNiFF+); cfx to TAXForm Converter, Rigi to TAXForm Converter, Dali to TAXForm Converter, Bunch / TAXForm Converter, TAXForm to Rigi Converter; TAXForm Repository; PBS Viewer and Abstraction Tools, Bunch Clustering Tool, Rigi SHriMP Viewer.]

Transforming Between Schemas

[Figure: schema hierarchy; labels shown: Universal; High-Level; Procedural (PL/I, C); Object-Oriented (C++, Java); Acacia C, Rigi C, PBS C.]

TAXForm — Procedural schema

[Figure: procedural schema. Entities: SourceFile, Data Type, Procedure, Data. Relationships: usesfile, defines, usestype, usesdata, usesprocedure.]

TAXForm — High level schema

[Figure: high-level schema. Entities: Module, Subsystem. Relationships: depends-on, contains.]
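
The relationship between the procedural schema and the high-level schema can be illustrated by "lifting" entity-level facts to their containers: if a procedure in one module uses a procedure in another, the first module depends on the second. The Python sketch below shows this idea only. All fact data and the `lift` function are hypothetical; this is not the TAXForm format or the behaviour of any particular tool.

```python
from collections import defaultdict

# Hypothetical entity-level facts, in the spirit of "f calls g".
calls = [("parse", "lex"), ("parse", "error"), ("main", "parse"), ("lex", "error")]
# Hypothetical containment facts: which module defines each procedure.
defined_in = {"main": "main.c", "parse": "parser.c", "lex": "lexer.c", "error": "util.c"}
# Hypothetical architectural facts: which subsystem contains each module.
contained_in = {
    "main.c": "driver",
    "parser.c": "frontend",
    "lexer.c": "frontend",
    "util.c": "support",
}


def lift(edges, container_of):
    """Lift edges between entities to edges between their containers.

    Edges internal to one container are dropped, since they are not
    dependencies at the higher level.
    """
    lifted = defaultdict(set)
    for src, dst in edges:
        a, b = container_of[src], container_of[dst]
        if a != b:
            lifted[a].add(b)
    return {k: sorted(v) for k, v in sorted(lifted.items())}


# Procedure-level calls become module-level depends-on edges ...
module_deps = lift(calls, defined_in)
module_edges = [(m, d) for m, deps in module_deps.items() for d in deps]
# ... and module-level edges become subsystem-level depends-on edges.
subsystem_deps = lift(module_edges, contained_in)
print(module_deps)
print(subsystem_deps)
```

Applying the same function twice mirrors the two levels of abstraction described earlier: entity level to module level, then module level to subsystem level.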

Back to my (selfish) goals

– Would like to concentrate on procedural and OO languages. Others are interested in COBOL, JCL etc.
– I am interested in high-level info (f calls g) but not in ASGs, code-level metrics
– Need to agree on
  – Syntax
  – Level of granularity and detail
  – What to do in case of X, e.g., X = “missing files”

My schema wish list

[influenced by Acacia’s C and C++ data models]

– Top-level programming language entities:
  – functions, variables, constants, type definitions (procedural languages)
  – methods, class member data, static methods and member data (object-oriented languages)
– Entity containers:
  – files, modules, classes, packages
– Entity attributes:
  – Name, unique identifier (UID -- see next section)
  – UID of container, UID of containing file (if container is not a file)
  – Signature/data type
  – Line number information (see below)
  – Declared scope/visibility, static or not, final or not
  – Definition or declaration (see below)
– Entity container attributes:
  – name, UID
  – relative path (if a file)
  – version identifier (if provided)
  – UID of container (if not a file), UID of cont. file (if not a file)
– Relationships:
  – Function calls, variable uses
  – Line number information (see below)
  – Container use/inclusion (by other containers)
  – Inheritance (various kinds)
  – “Friendship”, various template relationships
– Relationship attributes:
  – Line number information (see below)
  – Scope/permission of inheritance

Problems

Some technical problems:
– UID generation? (name-mangling?)
– Line numbering (ranges)?
– Incomplete information?
  – ill-formed code, gcc/K&R-isms
  – missing header files
  – resolving entity use to dfn/dcl (esp. with polymorphism, overloading)
– Pre or post preprocessing?

Problems

We’ve had these conversations before …

“Getting academics to agree on anything is like herding cats.”

Example Extractors/Systems

Included here:

PBS [UWloo]

Acacia [AT&T]

cxref, ctags, cscope

TA++ [UOttawa]

BAUHAUS [UStuttgart]

GUPRO [UKoblenz]

Others:

Rigi [UVictoria]

SPOOL [UMontréal]

Datrix [Bell Canada]

MOOSE [UBern]

SHORE [SD&M]

Neuhold [UVienna]

VisualAge C++ [IBM]

… [many others]

Dimensions of Variation

– Intended use
  – Level of schema (entity level, architectural level, or mixed)
  – Amount of detail
– Languages modelled
  – Multi-lingual
  – Common super schemas
  – Explicit model “cross-overs” (e.g., JCL, embedded SQL)
– Hidden assumptions
– Known limitations
– Notation/approach to store factbase
– Support for translations and transformations
– What’s particularly novel and noteworthy

PBS [Holt et al. @ UWaterloo]

Portable Bookshelf is a reverse engineering tool for creating software architecture models of large systems:

Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel, TOBEY, …

Consists of fact extractor, fact manipulation engine (“grok”), and visualization tool (“landscape”)

[Figure: PBS pipeline. source code → cfx → entity-level facts → grok → architectural facts → landscape viewer]

PBS C Language E/R View

PBS Architectural Schema

Acacia [Chen, Gansner et al. @ AT&T]

History: CIA → CIAO → Acacia

Consists of
– C and C++ extractors
– SQL-like query engine
– visualization with auto-layout

Acacia C++/C Schemas

Entity attributes: Hex UID, name, kind (file, function, type, var, macro), filename, datatype (string), typeclass (enum, struct, etc.), linenum info for def/dec, def/dec/undef, param list, template info, scope, storage spec (static, const, inline, inline virtual, etc.), signature

Relationship attributes: Linenum info, rel. kind (refers, contains, inherits, instantiates, typedef, etc.), relationship scope

Acacia Queries

SQL-like queries for entities and relationships produce “;”-delimited textual output:

  % ksh cdef -u fu closeTagFile
  26f53ece;closeTagFile;function;entry.h;void;regular;83;0;83;dec;00000000;(const boolean);;extern;;;;
  76e7ae31;closeTagFile;function;entry.c;void;regular;551;553;563;def;00000000;(const boolean);;extern;;;;

  % ksh cref -u - - m - file2='osdeps.h'
  <all entity1 attrs> ; <all entity2 attrs> ; <rel attrs>
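
Because the query output is plain “;”-delimited text, other tools can consume it with very little code. The Python sketch below splits one entity record from the sample output above. The field names for the first four columns (UID, name, kind, filename) are an assumption based on the entity attribute list and on what the sample values look like. The remaining columns are kept uninterpreted, and `EntityRecord` and `parse_entity` are hypothetical names, not part of Acacia.

```python
from typing import NamedTuple


class EntityRecord(NamedTuple):
    uid: str
    name: str
    kind: str
    filename: str
    rest: tuple  # remaining attributes, left uninterpreted


def parse_entity(line: str) -> EntityRecord:
    """Split one ';'-delimited record; only the first four fields are named."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    uid, name, kind, filename, *rest = fields
    return EntityRecord(uid, name, kind, filename, tuple(rest))


# Record copied from the sample query output in the slides.
record = parse_entity(
    "76e7ae31;closeTagFile;function;entry.c;void;regular;551;553;563;"
    "def;00000000;(const boolean);;extern;;;;"
)
print(record.name, record.kind, record.filename)
```

A naive split like this would break if an attribute value could itself contain “;”; whether that can happen is not stated in the slides.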

ctags, cxref, cscope

These are “open source” Unix tools that perform extractions:
– ctags extracts only entity info
  – e.g., file, name, line num, kind, etc.
  – works with C, C++, Eiffel, Fortran, and Java.
  – Used for fast context switching while editing source code with vim/emacs
– cxref generates cross-reference table for C systems.
  – Often used for webifying source code (e.g., Linux, Mozilla).
– cscope used for program comprehension of C systems (e.g., who calls f, who uses v)
  – Older commercial Unix tool, recently open sourced.
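
Since ctags emits one line of entity info per tag, counting entities by kind takes only a few lines of code. The Python sketch below assumes the extended tags-file format written by Exuberant/Universal ctags, in which a one-letter kind follows the `;"` separator, and assumes the C kind letters `f` (function), `v` (variable), and `d` (macro). The sample tag lines are made up for illustration, and `count_kinds` is a hypothetical helper.

```python
from collections import Counter

# Assumed one-letter kinds for C in Exuberant/Universal ctags output.
KINDS = {"f": "functions", "v": "variables", "d": "macros"}


def count_kinds(tag_lines) -> Counter:
    """Count functions, variables, and macros in tags-file lines."""
    counts = Counter()
    for line in tag_lines:
        if line.startswith("!_TAG_"):  # metadata header lines
            continue
        # The extension fields follow the ;" separator after the ex command.
        _, sep, ext = line.rstrip("\n").partition(';"\t')
        if not sep:
            continue  # plain-format line with no kind information
        kind = ext.split("\t")[0]
        if kind.startswith("kind:"):
            kind = kind[len("kind:"):]
        if kind in KINDS:
            counts[KINDS[kind]] += 1
    return counts


# Made-up sample lines in the assumed format.
sample_tags = [
    "!_TAG_FILE_FORMAT\t2\t/extended format/\n",
    "MAX_TAGS\tentry.h\t/^#define MAX_TAGS 100$/;\"\td\n",
    "closeTagFile\tentry.c\t/^extern void closeTagFile (const boolean resize)$/;\"\tf\n",
    "tagFile\tentry.c\t/^tagFile TagFile = {$/;\"\tv\n",
    "openTagFile\tentry.c\t/^extern void openTagFile (void)$/;\"\tf\n",
]
print(count_kinds(sample_tags))
```

This sketch does not distinguish global from file-local entities; doing so would require inspecting further extension fields, which are not modelled here.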

TA++ [Lethbridge et al. @ UOttawa]

– TKSee aids program comprehension, i.e., what programmers do all day
– TA++ is the data modelling language
– Want “full story” from the source code:
  – Want pre-preprocessing view of code for all platforms and environments (text editor’s view)
  – … but most extractors use a compiler front end and preprocess toward a particular target and option set
  – Some extractors keep some macro info

TA++ Combined E/R Model

BAUHAUS [Koschke et al. @ UStuttgart]

– Software architecture recovery system
  – Parse code, look for hidden/decayed abstractions, then redesign
  – Uses various heuristics to perform “clustering”
  – Works both at entity level and subsystem level
– Built from many tools …
  – … including Rigi viewer and a customized C parser/extractor that (optionally) dumps RSF
– Example WoSEF problem:
  – Cannot derive full includes hierarchy from Bauhaus extracted facts; this was a design decision, as the researchers were not interested in this information

BAUHAUS Entities

BAUHAUS Relationships

BAUHAUS Combined E/R

GUPRO [Ebert, Kullbach, Winter et al. @ UKoblenz]

– GUPRO supports simultaneous modelling of inter-related systems written in different programming languages
  – In particular, concerned with the COBOL/MVS/JCL mainframe world
– GUPRO is notable because:
  – Simultaneously multilingual
  – Explicitly models “boundary crossings” (!)
  – Looks at (very real) problems of the mainframe world: COBOL, JCL, database migration

GUPRO

– Candidate system is modelled in an object-based repository using a graph-based approach: EER (modelling language) + GRAL (constraint language)
– GReQL mechanism supports structured queries on the repository via restricted first-order logic

GUPRO

JCL schema COBOL schema

GUPRO

Integrated schemas for JCL and COBOL

GUPRO Multi-Language Model

Summary — High-Level Schemas

– Lots of sticky issues at the prog. lang. level:
  – To pre- or not to pre-process
  – Entity resolution often not done (e.g., Datrix)
  – What is a function: def, dec, polymorphism, overloading, templates, …
  – How to deal with missing libraries, incremental extractions, versioned extractions, non-ANSI-isms, …
– Conceptual gaps:
  – COBOL/JCL world very different from C/C++/Java world
  – “I didn’t know you wanted full includes info…”

Summary — Good News

– Many of us seem to be doing similar kinds of extractions. It seems likely that:
  – Many extractors can be used within other tools
  – Some form of common interchange format is feasible, though it may not please everyone.
– Challenges:
  – May want to use multiple tools together
    – I have been working on a standalone cxref-based hack to add full includes information to a BAUHAUS converter
  – Can we take advantage of the web to set up some sort of distributed fact extraction/conversion factory? [Holt]
