
RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

VICERRECTORÍA DE INVESTIGACIÓN


Jairo Hernán Aponte Melo

Mario Linares Vásquez

Laura Viviana Moreno Cubillos

Christian Adolfo Rodríguez Bustos

editors

Bogotá, D. C., May 2012

© Universidad Nacional de Colombia, Vicerrectoría de Investigación, Dirección de Investigación Sede Bogotá

© Editorial Universidad Nacional de Colombia

© Jairo Hernán Aponte-Melo [email protected]

Mario Linares-Vásquez [email protected]

Laura Viviana Moreno-Cubillos [email protected]

Christian Adolfo Rodríguez-Bustos [email protected]

Editors

Dirección de Investigación Sede Bogotá
Luis Fernando Niño Vásquez
Director

Editorial UN
Editorial Board
Alfonso Correa Motta
María Belén Sáez de Ibarra
Jaime Franky
Julián García González
Luis Eugenio Andrade Pérez
Salomón Kalmanovitz Krauter
Gustavo Silva Carrero

First Edition, 2012
ISBN: 978-958-761-162-5 (paperback)
ISBN: 978-958-761-163-2 (print on demand)
ISBN: 978-958-761-167-0 (e-book)

DIB collection design
Ángela Pilone Herrera

Publisher
Editorial Universidad Nacional de Colombia [email protected]

www.editorial.unal.edu.co

Bogotá, D. C., Colombia, 2012

No part of this book may be reproduced by any means without permission in writing from the owner of the patrimonial rights.

Made and printed in Bogotá, D. C. Colombia

Universidad Nacional de Colombia Cataloging-in-Publication Data

Research topics in software evolution and maintenance / [eds.] Jairo Hernán Aponte Melo ... [et al.]. -- Bogotá: Universidad Nacional de Colombia. Vicerrectoría de Investigación. Dirección de Investigación Sede Bogotá, 2012. xxiv, 256 p. -- (Colección DIB)

Includes bibliographical references and indexes

ISBN: 978-958-761-162-5 (paperback). -- ISBN: 978-958-761-163-2 (print on demand). -- ISBN: 978-958-761-167-0 (e-book)

1. Software engineering 2. Software evolution 3. Software maintenance 4. Software visualization I. Aponte Melo, Jairo Hernán, 1965- II. Series CDD-21 005.1 / 2012

List of contributors

Jairo Aponte
Universidad Nacional de Colombia, sede Bogotá
E-mail: [email protected]

Fernando Cortés
Universidad Nacional de Colombia, sede Bogotá

Miguel Cubides
Universidad Nacional de Colombia, sede Bogotá

Óscar Chaparro
Universidad Nacional de Colombia, sede Bogotá

Víctor Escobar-Sarmiento
Universidad Nacional de Colombia, sede Bogotá

Mario Linares-Vásquez
Universidad Nacional de Colombia, sede Bogotá

David Montaño
Universidad Nacional de Colombia, sede Bogotá

Laura Moreno
Universidad Nacional de Colombia, sede Bogotá

Yury Niño
Universidad Nacional de Colombia, sede Bogotá

Christian Rodríguez-Bustos
Universidad Nacional de Colombia, sede Bogotá

Juan Gabriel Romero-Silva
Universidad Nacional de Colombia, sede Bogotá

Leslie Solorzano
Universidad Nacional de Colombia, sede Bogotá

Henry Roberto Umaña-Acosta
Universidad Nacional de Colombia, sede Bogotá

Angélica Veloza-Suan
Universidad Nacional de Colombia, sede Bogotá


Contents

Preface

Chapter 1. Summarizing Software Artifacts: Overview and Applications
  Abstract
  1.1 Introduction
  1.2 Essentials on Natural Language Summarization
    1.2.1 The Dimensions of Summarization
    1.2.2 Summarization Evaluation
  1.3 Summarizing Software Artifacts: Existing Approaches
    1.3.1 Summarizing Documentation
    1.3.2 Summarizing Source Code
    1.3.3 Combining Software Artifacts
  1.4 Making Easier Software Evolution: Using Software Summaries in Maintenance Activities
    1.4.1 Software Comprehension
    1.4.2 Reverse Engineering
  1.5 Trends and Challenges
  References

Chapter 2. Survey and Research Trends in Mining Software Repositories
  Abstract
  2.1 Introduction
  2.2 Understanding Software Repositories
    2.2.1 Historical Repositories
    2.2.2 Communications Logs
    2.2.3 Source Code
    2.2.4 Other Kind of Repositories
  2.3 Processes of Mining Software Repositories
    2.3.1 Techniques
    2.3.2 Tools
  2.4 Purpose of Mining Software Repositories
    2.4.1 Program Understanding
    2.4.2 Prediction of Quality of Software Systems
    2.4.3 Discovering Patterns of Change and Refactorings
    2.4.4 Measuring of the Contribution of Individuals
    2.4.5 Modeling Social and Development Processes
  2.5 Trends and Challenges
    2.5.1 Thinking in Distributed Version Control Systems
    2.5.2 Integrating and Redesigning Repositories
    2.5.3 Simplifying MSR Techniques
  2.6 Summary
  References

Chapter 3. Software Visualization to Simplify the Evolution of Software Systems
  Abstract
  3.1 Introduction
  3.2 Background on Software Visualization
    3.2.1 How Software Visualization Supports Software Evolutions Tasks
    3.2.2 The Software Visualization Pipeline
    3.2.3 Overview of Visualization Tools
    3.2.4 Sources of Information Commonly Used
    3.2.5 Differences of Software Visualization and Modeling Languages Like UML
  3.3 SV Techniques
    3.3.1 Metaphors
    3.3.2 2D Approaches
    3.3.3 3D Approaches
    3.3.4 Virtual Environments
  3.4 Towards a Better Software Visualization Process
    3.4.1 Other Programming Paradigms
    3.4.2 Include Other Languages
    3.4.3 Better and More Flexible Metaphors
    3.4.4 Educational Issues

Chapter 4. Incremental Change: The Way that Software Evolves
  Abstract
  4.1 Introduction
  4.2 Incremental Change in the Software Development Process
    4.2.1 Software Maintenance vs. Software Evolution
    4.2.2 Activities of Incremental Change
  4.3 Concept and Feature Location
    4.3.1 Software Comprehension
    4.3.2 Concept Location
    4.3.3 Static Techniques
  4.4 Impact Analysis
  4.5 Summary
  References

Chapter 5. Software Evolution Supported by Information Retrieval
  Abstract
  5.1 Introduction
  5.2 Information Retrieval
    5.2.1 Classic Models
    5.2.2 Alternative and Hybrid Models
    5.2.3 Web Models
  5.3 Software Evolution Activities
    5.3.1 Incremental Change
    5.3.2 Software Comprehension
    5.3.3 Mining Software Repositories
    5.3.4 Software Visualization
    5.3.5 Reverse Engineering & Reengineering
    5.3.6 Refactoring
  5.4 Information Retrieval and Software Evolution
    5.4.1 Concept/Feature Location
    5.4.2 Mining Software Repositories (MSR)
    5.4.3 Automatic Categorization of Source Code Repositories
    5.4.4 Summarization of Software Artifacts
    5.4.5 Traceability Recovery

Chapter 6. Reverse Engineering in Procedural Software Evolution
  Abstract
  6.1 Introduction
  6.2 Reverse Engineering Concepts and Relationships
    6.2.1 Reverse Engineering and Software Comprehension
    6.2.2 Reverse Engineering and Software Maintenance
    6.2.3 Reverse Engineering Concepts
  6.3 Techniques in Reverse Engineering
    6.3.1 Standard Techniques
    6.3.2 Specialized Techniques
  6.4 Application of Techniques
    6.4.1 Description of the System
    6.4.2 Considerations for Applying Reverse Engineering
    6.4.3 Application of Standard Techniques
    6.4.4 Application of Specialized Techniques
  6.5 Reverse Engineering Assessment
    6.5.1 Assessment of Techniques
    6.5.2 Assessment of Tools


Chapter 7. Agility Is Not Only About Iterations But Also About Software Evolution
  Abstract
  7.1 Introduction
  7.2 Evolutionary Software Processes
    7.2.1 EVO
    7.2.2 Spiral
    7.2.3 The Unified Process Family
    7.2.4 Staged Model
  7.3 Principles, Agility and the Agile Manifesto
    7.3.1 The Agile Manifesto
    7.3.2 Agile Principles
    7.3.3 Agility in Software Development
  7.4 Agile Methodologies History
    7.4.1 Iterative Development (1970-1990)
    7.4.2 The Birth of Agile Methodologies (1990-2001)
    7.4.3 The Post-manifesto Age (2001-2011)
  7.5 Agile Methodologies Overview
    7.5.1 Extreme Programming (XP)
    7.5.2 SCRUM
    7.5.3 Feature Driven Development (FDD)
    7.5.4 Lean Agile Development: LSD, Kanban, Scrumban
    7.5.5 Agile Versions of UP: AgileUP, Basic/OpenUP
  7.6 Agility and Software Evolution

Chapter 8. Software Development Agility in Small and Medium Enterprises (SMEs)
  Abstract
  8.1 Introduction
  8.2 Legal Definition of SMEs
    8.2.1 Foreign Definitions
    8.2.2 Local Definition (Colombia)
  8.3 Agile Methodologies in the Real World
    8.3.1 Knowledge About Agile Methodologies
    8.3.2 People Who Practiced Agile at a Previous Company
    8.3.3 Roles in Agile Usage
    8.3.4 Reasons for Agile Adoption
    8.3.5 Agile Methodology Used
    8.3.6 What Already Have Been Achieved by Using ASDM
    8.3.7 Barriers to Further Agile Adoption
    8.3.8 Agile Practices
    8.3.9 Plans for Implementing Agile on Future Projects
    8.3.10 Using Agile Techniques on Outsourced Projects
  8.4 Agility Assessment Models
    8.4.1 Boehm and Turner’s Agility and Discipline Assessment
    8.4.2 Pikkarainen and Huomo’s Agile Assessment Framework
  8.5 Weaknesses and Strengths of Agile Methodologies
  8.6 Challenges Adopting Agile Methodologies in SMEs
  8.7 Summary
  References

Chapter 9. Model-driven Development and Model-driven Testing
  Abstract
  9.1 Introduction
    9.1.1 Challenges of Software Development
    9.1.2 Traditional Testing Process
    9.1.3 A solution to Face the Problem
    9.1.4 Model-based Testing
  9.2 Why to Use Model-driven Architecture in Agile Methodologies
  9.3 Model-driven Development
    9.3.1 Philosophy
    9.3.2 Unified Modeling Language
  9.4 Model-driven Architecture
    9.4.1 Software Engineering
  9.5 What the Model-driven Architecture Does Not Do
  9.6 Notations for Modeling Tests
    9.6.1 Transitions-based Modeling
    9.6.2 Pre/Post Modeling
  9.7 Testing from Finite State Machines
    9.7.1 FSM and ModelJUnit
    9.7.2 Case Study
  9.8 Testing from Pre/Post Models
    9.8.1 Object Constraint Language (OCL)
    9.8.2 Case Study
  9.9 Trends and Challenges
  9.10 Summary
  References

Subject Index

Name Index

List of Figures

1.1 Characterization of Natural Language Summarization
1.2 Summary generation process for methods
1.3 Corpus creation process
1.4 Summary generation process for software changes

2.1 Classification of Software Repositories
2.2 Development workflows models

3.1 Structograms (sequence, loop and conditional)
3.2 Jackson diagrams representation (sequence, loop and conditional)
3.3 Basic control structure diagrams (sequence, loop and conditional)
3.4 SeeSoft tool example
3.5 Mozilla Firefox example of evolutionary coupling
3.6 Fractal representation
3.7 Pixel-maps example
3.8 sv3D and SeeIT 3D metaphor
3.9 Sorting steps using a third spatial dimension
3.10 CodeCity metaphor

4.1 Incremental change activities
4.2 Concept location techniques frequently use an intermediate representation of source code

5.1 General architecture of an IR system
5.2 Information retrieval models
5.3 Information retrieval and software evolution activities

6.1 Reverse Engineering process
6.2 Business rules knowledge through abstraction levels
6.3 Modularization process based on clustering/searching algorithms
6.4 Prototype of the Oracle Forms Reverse Engineering Tool
6.5 Example of a CURSOR and its use in a FOR statement
6.6 Example of a code block that does not represent a business rule
6.7 Application of the MECCA approach

7.1 Iterative development
7.2 Incremental development
7.3 Iterative and incremental development
7.4 The Spiral model
7.5 The Unified Process (UP) model
7.6 The RUP phases and their goals
7.7 UP elements
7.8 The Rational Unified Process (RUP) model
7.9 The simple staged model
7.10 The Crystal Family
7.11 Agile methodologies evolution
7.12 SCRUM
7.13 FDD lifecycle
7.14 Kanban board
7.15 OpenUP layers
7.16 Communication modes and effectiveness
7.17 Hub organizational structure

8.1 Software development in Colombia
8.2 Earnings generated by Software development in Colombia
8.3 Dimensions affecting method selection
8.4 An Agile assessment framework
8.5 Data collection planning
8.6 Changes required for adopting Agile methodologies in traditional organizations, and associated challenges/risks

9.1 Representing states and state transitions using a state diagram
9.2 State diagram for the case study
9.3 State diagram for case study

List of Tables

1.1 Software summarization approaches

2.1 Law of program evolution
2.2 Methodologies used in MSR
2.3 Mining Software Repositories tools
2.4 Current state of MSR tools

3.1 Steps in the visualization pipeline
3.2 List of applications

4.1 Classification of IR-based techniques according to source code applicability
4.2 Classification of IR-based techniques according to intermediate representation
4.3 Classification of IR-based techniques according to granularity of results
4.4 Classification of IR-based techniques according to the type of information used

7.1 Agile principles vs. Software features

8.1 Software exports from India, Ireland and Israel
8.2 Growth of the Indian software industry
8.3 Definition of SMEs in selected countries
8.4 Personnel characteristics
8.5 Agility - plan-driven method home grounds and levels of software method understanding and use

9.1 Main OCL constructs
9.2 Test cases generated

Preface

Over its lifetime, a software system undergoes many changes whose fundamental purpose is to adapt it to its operating environment. When the first software systems were developed, only a few adaptive changes were needed to achieve a perfect fit between the new system and its production environment. This phenomenon, almost negligible at that time, has grown gradually and has become what is now called software evolution and maintenance. Nowadays, it is accepted that any software system must be designed to change, because during its lifespan it must be continuously updated to adapt to an ever-changing environment.

As a research group in software engineering at Universidad Nacional de Colombia, we are interested in understanding this evolutionary phenomenon, building models to describe the past, present and future of the evolution of a software system, and designing and implementing tools to support these permanent change processes. The study of the drivers and general properties of the software evolution phenomenon is important because the highest costs of software construction are associated with maintenance tasks, and because the lifespan of a software system depends heavily on the procedures and techniques used for change implementation, corrections and required extensions.

Our research in this subject is aimed at understanding, assessing, implementing and managing the changes needed by all types of software artifacts. This includes evolutionary construction (Agile methods), software understanding, concept location (to identify where a particular functionality is implemented in the source code), impact analysis (to determine which parts of the current system are affected by a particular change already implemented), identification of dependencies and links among artifacts that evolve simultaneously, and reverse engineering (to produce high-quality models to facilitate understanding of a complex legacy system).

What is this Book About?

This book presents specific research topics within the field of software evolution, addressed by members of the ColSWE research group, who have been working on them for the last three years. The first chapter examines several approaches for summarizing software artifacts, and discusses the use of summaries for aiding software comprehension, an unavoidable step of maintenance tasks. The second chapter highlights the importance of software repositories as sources of information that have allowed researchers to understand the evolutionary processes of software systems. It also presents a literature survey on approaches for mining software repositories, and a brief review of the main research trends and current challenges in this field. The third chapter presents the most relevant visualization techniques that have been applied to software entities as tools for supporting software understanding, and explains the major difficulties to be overcome by future visualization tools. The fourth chapter studies the incremental change process, which underpins software evolution because it enables developers to add new functionality to programs, improve existing functionality, and remove bugs. The fifth chapter shows how information retrieval techniques can support the implementation of some activities of the evolutionary model of software development. The sixth chapter presents an overview of reverse engineering, an analysis of its applicability in an industrial legacy software system, and some future trends that will direct research in this discipline. The seventh chapter provides a comprehensive introduction to and explanation of agile software methods, and explains how these methodologies address specific features of software projects such as requirements volatility, users’ volubility, incremental change, and uncertainty in schedules. The eighth chapter is an analysis of how Agile software methods are being used in small and medium sized software development companies in Colombia. Finally, the last chapter introduces the application of model-based design for building and executing the necessary artifacts to perform software testing, as well as the use of test cases for developing a new system.

Who Should Read this Book?

This book is of interest to everyone working in the field of software engineering. It is also aimed at senior undergraduate and graduate students and researchers who are new to the field of software evolution. For them, each chapter of the book provides up-to-date information and links to other resources within a specialized subject. In addition, each chapter can be read independently.

We hope students, researchers, and practitioners will find in this book a gentle introduction and the initial motivation to start their own research in the field of software engineering, and particularly, in software development and maintenance.

Acknowledgments

The authors would like to thank all the people who have contributed to this book. In particular, very special thanks go to:

• Luis Fernando Niño, the head of Dirección de Investigación Sede Bogotá - DIB, who has made possible the publication of this book as part of the collection called “Colección de Investigación de la DIB”.

• Hugo Alberto Herrera, the chair of the Department of Computer Systems and Industrial Engineering, who has provided significant resources to conduct the research presented in this book.

• Leslie Solorzano and Lina Johana Montoya, for reviewing and improving our writing in English in so many ways.

• Our advisors and collaborators at Wayne State University and The College of William and Mary, who have given us their expertise and support for conducting research in Software Engineering.

• Our undergraduate students at Universidad Nacional de Colombia, who have participated as Java developers within the user studies related to this research field.

Bogotá, Colombia, November 2011

Jairo Aponte


Summarizing Software Artifacts: Overview and Applications

Laura Moreno
Jairo Aponte

ABSTRACT

Software maintenance is one of the most time- and effort-consuming phases of the software life cycle. Usually, the design and development of software are accompanied by documentation that is fundamental for supporting maintainers’ tasks. However, as the number and length of software artifacts increase, using them becomes complicated and impractical. To deal with this situation, summarization of software artifacts has become a new area of software engineering research. This chapter presents various existing approaches for summarizing software artifacts, their possible applications in maintenance, and some research challenges.

1.1 INTRODUCTION

Most of the time and effort of software developers and maintainers is devoted to reading and analyzing a huge amount of information about the system they are dealing with. This information is generated during each stage of the software life cycle and, depending on its purpose, is frequently captured in different kinds of software artifacts, such as requirements specifications, use case documents, technical designs, code, bug reports, test cases, and so on. Such artifacts represent the main source of knowledge when maintaining software, since they reflect the domain, design, and functionality of the system.



Nonetheless, on some occasions those software documents can be too numerous or too long to be read completely when performing, for example, an improvement to the system. To reduce the cost, effort and time spent by maintainers when using software artifacts, several approaches have traditionally been proposed from diverse fields, including visualization, data mining and information retrieval. By comparison, the application of concepts and techniques from natural language processing is a novel subject in software engineering, particularly with regard to summarization. The integration of these two subjects is reasonable since the same problem affects both areas: the continuous growth of documents and information.

Accordingly, Section 1.2 states the basics of summarization, including the factors that affect it and the taxonomy of summary evaluation. Section 1.3 describes how summarization has been adopted by the software engineering domain; it also presents some existing approaches for summarizing software artifacts. The role or application of summaries for aiding software maintenance activities is treated in Section 1.4. Finally, some trends and challenges of summarization in the software engineering field are discussed in Section 1.5.

1.2 ESSENTIALS ON NATURAL LANGUAGE SUMMARIZATION

The huge amount of textual information available physically and digitally has stimulated research on Natural Language Processing (NLP) in the last decades, turning it into an extensive field that results from the fusion of a variety of areas of knowledge such as linguistics, statistics, psychology, and computer science. In particular, the continuous growth of the World Wide Web has made data reduction a central point in NLP in order to deal with the information overload problem [1,2]. That reduced form is called a summary, “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that” [3]. In this way, automatic summarization in NLP is basically about generating, from one or more documents, a shorter text that still preserves their important content.


1.2.1 The Dimensions of Summarization

Since the mid-1950s, different automatic techniques have been proposed to represent text documents in a condensed form, from fields such as statistics, machine learning, information retrieval, and natural language analysis itself. These techniques are affected by several conditions called orthogonal views [2], aspects of variation [4], or context factors [5], which are commonly divided into three major categories (Figure 1.1):

• input factors, which define the characteristics of the original document(s);

• purpose factors, which characterize the required transformations according to the summary usage; and

• output factors, which determine the final product, i.e., the summary.

Among the most remarkable input factors are the language (monolingual vs. multilingual), the register or linguistic style, the genre, and the units or size of the source(s). The last one determines whether the input comprises single or multiple documents, and whether the elimination of content redundancy (or repetitive information) becomes a key issue within summarizing techniques.

Purpose factors, on the other hand, include the envelope [5], the target audience, and the usage of a summary. For instance, a summary is critical or evaluative when it points out the opinion of the author on a particular subject; it is indicative if it helps the reader decide whether the sources are worth reading; and it is informative if it completely replaces the reading of the original documents.



[Figure 1.1: diagram of the basic concepts of natural language summarization. Context factors are divided into input factors (form, subject type or specificity, units or source size, header), purpose factors (usage, audience, expansiveness, envelope) and output factors (material, style, coherence, form). Evaluation is divided into intrinsic methods (text quality; content, via co-selection and content-based measures) and extrinsic methods (relevance assessment, reading comprehension, document categorization, information retrieval, question answering).]

Figure 1.1 Characterization of Natural Language Summarization [2,4,5].


Finally, output factors take into account coherence, reduction, and derivation. In detail, derivation means that a summary can be extractive or abstractive: while the former is composed entirely of sequences of words taken literally from the source, the latter is built, at least in part, from words which do not appear in the original document [4]. Producing abstracts is therefore harder, because it poses additional challenges involving analysis, topic fusion and generation of natural language [1,4]. Even so, several approaches have tackled each case, some more successfully than others, as mentioned in [1, 2, 5]. But how is this determined? How is it known whether an approach shows good results, or at least better results than others?

1.2.2 Summarization Evaluation

Currently, there is no clear idea of what constitutes a good summary. In fact, it is possible to obtain various perfectly acceptable summaries from the same sources. Moreover, the lack of a standard framework makes it difficult to construct a baseline for comparing summarizing systems. However, although evaluation is a controversial and challenging concern, it is a major subject in text summarization, because it makes it possible to assess the results of a specific method or system and to compare the results of different techniques; furthermore, some types of evaluation help to explain why a method does not work adequately [6].

The quality of summaries can be determined in two different ways: by analyzing their internal properties (intrinsic methods) or by studying their impact when performing a particular activity (extrinsic methods). In the first case, the features to be evaluated are the text quality and the content of the summary. Measures of text quality assess aspects such as grammar, readability, cohesion, and coherence. This kind of linguistic feature cannot be evaluated by automatic methods, but only by human annotators (online approaches). In contrast, content evaluation is done with significantly less or no human intervention (offline approaches), usually by comparing the system output with a gold standard, i.e., an ideal summary which contains the important content of a given source and can be produced either manually or automatically [5, 7]. Co-selection measures like precision, recall or relative utility are good examples of this type of evaluation [7], since they compare the sentences present in both the peer summary and the gold standard.
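As a rough illustration of co-selection measures, the following Java sketch computes precision, recall and F-score for a peer summary against a gold standard, treating each summary as a set of sentences compared by exact string equality. The class name CoSelection and its method names are hypothetical and chosen only for this example; they do not come from any of the cited systems.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative co-selection scores (hypothetical class, not from the cited works). */
final class CoSelection {

    /** Fraction of distinct peer sentences that also appear in the gold standard. */
    static double precision(List<String> peer, List<String> gold) {
        Set<String> peerSet = new HashSet<>(peer);
        if (peerSet.isEmpty()) {
            return 0.0;
        }
        return (double) shared(peerSet, gold) / peerSet.size();
    }

    /** Fraction of distinct gold-standard sentences that the peer summary selected. */
    static double recall(List<String> peer, List<String> gold) {
        Set<String> goldSet = new HashSet<>(gold);
        if (goldSet.isEmpty()) {
            return 0.0;
        }
        return (double) shared(goldSet, peer) / goldSet.size();
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static double fScore(List<String> peer, List<String> gold) {
        double p = precision(peer, gold);
        double r = recall(peer, gold);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    /** Number of distinct sentences present in both collections. */
    private static int shared(Set<String> a, List<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(new HashSet<>(b));
        return common.size();
    }
}
```

For a peer summary with four sentences, two of which appear in a three-sentence gold standard, precision is 2/4, recall is 2/3 and the F-score is 4/7. Because sentences are matched literally, a paraphrased sentence counts as a miss.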

Nevertheless, such methods leave aside the fact that two sentences might mean the same although they are written differently. As a result, the evaluations of two summaries can diverge even when they express identical ideas in different words. To avoid this situation, content-based measures assess the similarity level between sentences and, by extension, between summaries. As an illustration, cosine similarity establishes how similar two documents are by representing them inside a vector space model and verifying the co-occurrence of words. The ROUGE family also evaluates co-occurrence, but this time of subsequences of words from a given text [8]. The Pyramid method is a good example of content-based evaluation as well. It scores summaries based on summary content units (SCUs), which are prioritized depending on their frequency of appearance in human summaries [9].
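The cosine similarity mentioned above can be sketched in a few lines of Java. The example is only a minimal illustration under simplifying assumptions: documents are represented by raw term frequencies, tokens are lowercased ASCII words, and no stop-word removal, stemming or term weighting is applied. The class name CosineSimilarity and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative cosine similarity over term-frequency vectors (hypothetical class). */
final class CosineSimilarity {

    /** Builds a term-frequency vector: lowercased word -> number of occurrences. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between the two term-frequency vectors; 0 if either is empty. */
    static double cosine(String a, String b) {
        Map<String, Integer> va = termFrequencies(a);
        Map<String, Integer> vb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : va.entrySet()) {
            Integer other = vb.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other; // only co-occurring words contribute
            }
        }
        double norms = norm(va) * norm(vb);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Two texts that share no words score 0, identical texts score 1, and the texts "a b" and "a c" score 0.5. Unlike co-selection, the score changes gradually as the wording of two summaries diverges, although a purely lexical measure like this one still does not recognize synonyms.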

Extrinsic evaluation moves away from the inner properties of the summary in order to take it as a whole and determine its effect, efficiency or usefulness for a particular task. Therefore, task-based evaluation requires performing an activity, manually or automatically, with the support of summaries. In [1], some attempts associated with relevance assessment and reading comprehension are mentioned. In [2], some approaches related to document categorization, information retrieval and question answering are discussed.

A well-formed study on intrinsic summary evaluation is presented in [7]. An approximation to the taxonomy of summary evaluation is given in [2] and in Figure 1.1.

1.3 SUMMARIZING SOFTWARE ARTIFACTS: EXISTING APPROACHES

The software engineering domain is not exempt from information overload problems. Throughout the software development process, developers must work daily with a large amount of data: from requirements specifications and design documents, through maintenance records, to the source code itself.



Sometimes when maintaining software, developers deal with a software system of considerable size whose domain is totally unknown to them, and still they have to make some kind of enhancement, adaptation or correction to it. According to [10, 11], developers spend more time reading and navigating the source code than writing it. Therefore, it is imperative to find a way to facilitate the maintenance of systems. At this point, a short description of software artifacts would be very useful, especially because searching and browsing within source code and documentation are two common tasks when maintaining software.

So, although summarization is an emerging issue within the software engineering field, several approaches have been proposed to reduce the data found in software artifacts. In this context, an artifact is defined as any product derived from the development process which describes the process, functionality, design, or implementation of software. These products include source code, binaries, and, above all, documentation. Currently, there is at least one summarization work that deals with each of them separately, but in some cases artifacts are used together to produce more accurate descriptions of their content.

1.3.1 Summarizing Documentation

From a practical point of view, the approaches that deal with software documentation usually apply natural language summarization techniques to natural language texts, i.e., the traditional summarizing process. A clear example of summarization of software documentation is [12], where bug reports are summarized by means of machine learning techniques.

An important characteristic of bug reports is that they often comprise two parts: one with predefined values in fixed fields, and another with free-form text such as a title, a bug description, and a sequence of comments related to the lifecycle of the bug. In that sense, bug reports are somewhat similar to email conversations and, in consequence, the techniques applied to the latter might be useful for summarizing the former. Along these lines, the approach in [12] extracts the most relevant sentences from bug reports using three classifiers that rely on structural, participant, length and lexical features and are trained on different corpora: annotated email threads, meetings, and bug reports.

The three classifiers were evaluated from several perspectives. First, each system was evaluated against a baseline classifier to measure its effectiveness using the ROC curve. Then, the systems were compared with one another using the standard intrinsic measures of precision, recall, and F-score. Next, content selection quality was assessed via the pyramid method. Not surprisingly, the classifier trained on the bug report corpus outperformed the other two. The informativeness of each feature considered in the summarizers was also evaluated: the features with the highest F-score were those related to length. Extrinsic measures were also used to evaluate the informativeness, redundancy, relevance, and coherence of the summaries, and their results were acceptable.
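As an illustration of the co-selection measures named above, the following minimal sketch computes precision, recall, and F-score for an extractive summary. It assumes that sentences are identified by integer positions and that a human-made reference selection is available; the class and method names are hypothetical and are not taken from [12].

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: co-selection measures for an extractive summary.
final class CoSelection {
    // Fraction of selected sentences that also appear in the reference summary.
    static double precision(Set<Integer> selected, Set<Integer> reference) {
        return selected.isEmpty() ? 0.0 : (double) overlap(selected, reference) / selected.size();
    }

    // Fraction of reference sentences that the summarizer managed to select.
    static double recall(Set<Integer> selected, Set<Integer> reference) {
        return reference.isEmpty() ? 0.0 : (double) overlap(selected, reference) / reference.size();
    }

    // Harmonic mean of precision and recall.
    static double fScore(Set<Integer> selected, Set<Integer> reference) {
        double p = precision(selected, reference);
        double r = recall(selected, reference);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    // Number of sentence identifiers present in both sets.
    private static int overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```

For a summarizer that selects sentences {1, 2, 3, 4} against a reference {2, 3, 5}, two sentences coincide, so precision is 2/4 and recall is 2/3.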

1.3.2 Summarizing Source Code

When the aim of summarization is to describe source code, one crucial issue has to be considered: source code is a mixed artifact that carries information for both humans (the developers) and machines (the compilers). This situation is clearly explained in [13] by means of an example similar to the following one. Consider a chunk of code such as the following (extracted from the ATunes¹ system, a full-featured audio player and manager):

    public static void setLanguage(String fileName) {
        if (fileName != null)
            languageBundle = getLanguageFile(TRANSLATIONS_DIR + '/' + fileName);
        else
            languageBundle = getLanguageFile(TRANSLATIONS_DIR + '/' +
                    DEFAULT_LANGUAGE_FILE);
    }

Program instructions within this method can be interpreted by compilers even if identifiers are replaced by arbitrary words. Functionality is therefore unaltered, although the identifiers no longer reveal the intention of the code:

    public static void method_1(type_1 a) {
        if (a != null)
            b = method_2(c + '/' + a);
        else
            b = method_2(c + '/' + d);
    }

¹ http://www.atunes.org/ (accessed and verified on April 11, 2011)

On the other hand, if the information removed from source code is the formal one, the textual information represented by the terms composing the identifiers still remains, giving readers an idea of the purpose of the code:

    set language string file name file name language bundle get language
    file translations dir file name language bundle get language file
    translations dir default language file

Source code analysis can be performed statically or dynamically. The essential difference between them is that static analysis does not involve executing code, whereas dynamic analysis studies the behavior of code during program execution. Dynamic approaches therefore depend on both the program and its input [14].

Static Approaches

Static approaches usually consider syntactic and semantic properties of source code. The first step in any static approach is commonly tokenization, where compound identifiers are split into words at capital letters, underscores, or other special characters. In this way, method names such as cdda2WavFile or cdda_2_wav_file are transformed into cdda, 2, wav, file.
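The tokenization step described above can be sketched with a single regular expression. This is a minimal illustration under assumed splitting rules (case changes, letter/digit changes, and any non-alphanumeric character); the class name IdentifierSplitter is hypothetical, and real tools usually add abbreviation expansion and dictionary-based splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: splits compound identifiers into lower-case terms.
final class IdentifierSplitter {
    private static final Pattern BOUNDARY = Pattern.compile(
            "[^A-Za-z0-9]+"                    // underscores and other special characters
            + "|(?<=[a-z])(?=[A-Z])"           // lower-to-upper change: wavFile -> wav | File
            + "|(?<=[A-Z])(?=[A-Z][a-z])"      // end of an acronym: XMLParser -> XML | Parser
            + "|(?<=[A-Za-z])(?=[0-9])"        // letter-to-digit change: cdda2 -> cdda | 2
            + "|(?<=[0-9])(?=[A-Za-z])");      // digit-to-letter change: 2Wav -> 2 | Wav

    static List<String> split(String identifier) {
        List<String> terms = new ArrayList<>();
        for (String part : BOUNDARY.split(identifier)) {
            if (!part.isEmpty()) {             // a leading separator yields an empty first piece
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

With these rules, both cdda2WavFile and cdda_2_wav_file yield the terms cdda, 2, wav, file.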

The most notable techniques for describing source code statically can be distinguished by the coherence factor, which is determined by the fluency of the output [4].

A summary is fluent if it consists of well-formed sentences that are related to each other, forming coherent paragraphs. Otherwise, if it comprises individual words or text fragments that bear no relation to each other, the summary is said to be disfluent. As an illustration, for the setLanguage method previously mentioned, an example of a fluent sentence-based summary is:

    This method sets the application language.
    If the file name is defined, it is used; if not, the default language file is applied.

For the same method, a disfluent term-based summary could be:

set language bundle translation default file.

Sentence-based Summaries

One of the most prominent approaches to software summarization that deals directly with source code is [15], which creates descriptions of Java methods. The essentials of this proposal are heuristics, for selecting the central statements of code within a method (s_units), and templates, for generating natural language sentences and reducing redundancy. The algorithm applied to obtain the descriptive comments starts, as usual, with source code preprocessing, which includes tokenization and abbreviation expansion. Then, the action, theme, and secondary arguments of methods are obtained using a Software Word Usage Model (SWUM) that captures linguistic, structural, and occurrence relationships of words within code. Next, some heuristics are applied to identify the most relevant units of code within a method. Five kinds of statements are considered relevant: ending units, void-return units, same-action units, data-facilitating units, and controlling units. The relevance and role of these statements were inferred from a study of a set of comments from open source Java programs, and from an opinion survey of Java developers about the need for certain units within method descriptions. Finally, those units are lexicalized using predefined templates.

For instance, the fixed template for an assignment is:

    action theme secondary-args and get return-type

So, the text generated for the s_unit

    title = getNameWithoutExtension()

is

    Get name without extension and get title.
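The assignment template can be illustrated with a minimal sketch. It assumes, as a strong simplification, that the first word of the method name is the action and the remaining words are the theme; [15] derives these roles from SWUM, which is not reproduced here. Secondary arguments are omitted, and the class AssignmentTemplate is hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: fills the assignment template for "target = method()".
final class AssignmentTemplate {
    // Splits a camelCase name into lower-case words (no abbreviation expansion).
    static String[] words(String identifier) {
        return identifier.replaceAll("(?<=[a-z0-9])(?=[A-Z])", " ")
                .toLowerCase(Locale.ROOT)
                .split(" ");
    }

    static String describe(String target, String methodName) {
        String[] w = words(methodName);
        String action = w[0];                                     // assumed: first word is the verb
        String theme = String.join(" ", Arrays.copyOfRange(w, 1, w.length));
        String phrase = theme.isEmpty() ? action : action + " " + theme;
        String sentence = phrase + " and get " + String.join(" ", words(target)) + ".";
        return Character.toUpperCase(sentence.charAt(0)) + sentence.substring(1);
    }
}
```

Calling describe("title", "getNameWithoutExtension") produces the sentence given in the example above.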

Similar kinds of templates were designed for variables, single method calls, return statements, nested and composed method calls, conditional expressions, and loop expressions. The whole summarizing process is shown in Figure 1.2.

In general, this technique produces acceptable summaries for methods, according to an informal evaluation in which some Java developers were asked about the accuracy, content adequacy, and conciseness of the text generated for individual s_units and for whole summaries. However, developers disagreed about the level of detail required in summaries, which suggests that the selection of statements should be studied carefully. In addition, the proposal is designed for and limited to methods, so it cannot produce that kind of comment at other granularity levels, such as classes.

[Figure: stages shown are Method; Preprocessing; Program analysis; Natural language analysis; Software Word Usage Model construction; Summary comment generator (Relevant s_units selection, Text generation, Text combination and smoothing); Method summary]

Figure 1.2 Summary generation process for methods. Adapted from [15].

Term-based Summaries

In that sense, [16] goes further and generates (mainly) extractive, term-based summaries for methods and classes, applying several text retrieval techniques and comparing their results. Two basic phases are executed: corpus creation and relevant term selection. First, a corpus is created from source code by extracting identifiers and comments; in this particular case, tokenization is an optional step. The corpus can therefore be composed of split identifiers, original identifiers, or both (Figure 1.3).

After filtering out terms that do not carry a specific meaning, the summaries are generated by selecting the terms with the highest scores. The scores are obtained by applying algebraic reduction methods, namely the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), with variations in the term weighting options (e.g., log, tf-idf, binary entropy).
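To make the term selection concrete, the sketch below ranks the terms of one document with a plain tf-idf weight (raw term frequency times the natural logarithm of N divided by the document frequency) and keeps the k best. This particular weighting variant, the tie-breaking by alphabetical order, and the class TermSummary are assumptions for illustration; [16] compares several weighting schemes and also uses LSI, which is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper: term-based summary of one document in a corpus.
final class TermSummary {
    static List<String> topTerms(List<List<String>> corpus, int docIndex, int k) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Term frequency within the document being summarized.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : corpus.get(docIndex)) {
            tf.merge(term, 1, Integer::sum);
        }
        // tf-idf score per term.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) corpus.size() / df.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }
        // Highest score first; ties broken alphabetically for determinism.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(score.entrySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> summary = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            summary.add(ranked.get(i).getKey());
        }
        return summary;
    }
}
```

A term that occurs in every document of the corpus gets a weight of zero, which is why very common identifiers tend to drop out of such summaries.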

In [16], these schemes were compared against random and lead summaries, using intrinsic, online evaluation. Random summaries comprise terms chosen at random from the artifact, whereas lead summaries are built with the first terms of the artifact. The results at this point of the study showed that each variation of the lead method outperformed the other techniques regardless of the weighting and length options. Although VSM summaries also obtained good scores, they were not as good as the lead ones. Nevertheless, it was found that VSM extracts relevant terms from parts of the code that the lead method does not reach, i.e., lines other than the header of the source code artifact. Therefore, these techniques are complementary, and their union produces summaries with a greater number of relevant terms, principally for methods.

[Figure: stages shown are Collection of documents (methods, classes); Extract identifiers and comments; Split identifiers; Keep original identifiers / Discard original identifiers; Use stop words (keywords, prepositions, ...); Corpora (split identifiers, split + original identifiers, original identifiers)]

Figure 1.3 Corpus creation process followed in [16]

Model-based Summaries

Source code descriptions can also be represented through models such as diagrams, maps, or graphs. For instance, [17] presents the software reflection model, where the developer initially selects a high-level, task-specific model of the software system, which is used as a framework for summarizing information in the source code. This high-level model is drawn as interconnected boxes that relate to the task at hand and represent the main modules of the system and the interactions between them. After choosing the high-level model, the developer uses a syntactic analysis tool to extract structural information from the source code, i.e., the methods and the method calls. Next, the developer maps the high-level model to the entities and relationships extracted from the code. A series of tools can support this task (such as regular expression matching or mapping several source code entities to a single map entry), but the mapping is mostly performed manually. Lastly, the developer computes a reflection model, which is a comparison between the high-level model, the source code structural information, and the constructed map, and which represents a summary of the structural information contained in the source code of the system.
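The comparison step can be sketched as follows. The sketch assumes that the high-level model is a set of expected module-to-module edges written as "A->B", that the map assigns each source entity to one module, and that each extracted call is a caller/callee pair. The three labels (convergence, divergence, absence) follow the usual presentation of reflection models and are an assumption with respect to the description above; the class ReflectionModel is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares expected module interactions with extracted calls.
final class ReflectionModel {
    static Map<String, String> compute(Set<String> expectedEdges,
                                       Map<String, String> entityToModule,
                                       List<String[]> calls) {
        // Lift every source-level call to a module-level edge through the map.
        Set<String> observed = new TreeSet<>();
        for (String[] call : calls) {
            String from = entityToModule.get(call[0]);
            String to = entityToModule.get(call[1]);
            if (from != null && to != null && !from.equals(to)) {
                observed.add(from + "->" + to);
            }
        }
        Map<String, String> result = new TreeMap<>();
        // Expected edges are either confirmed by the code or missing from it.
        for (String edge : expectedEdges) {
            result.put(edge, observed.contains(edge) ? "convergence" : "absence");
        }
        // Edges found in the code but not foreseen in the high-level model.
        for (String edge : observed) {
            result.putIfAbsent(edge, "divergence");
        }
        return result;
    }
}
```

Calls between entities of the same module, and calls involving unmapped entities, are ignored in this sketch; a real tool would report unmapped entities so that the developer can refine the map.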

Summaries as By-products

Other approaches, like [13, 18], do not explicitly aim to generate that kind of short description, but their results can still be considered (partial) summaries. For example, the main objective of [13] was to analyze software without taking external documents into account, in order to provide a first impression of an unfamiliar system. In this case, LSI was used to compute the similarity between source code artifacts and then to describe the topic of clusters of artifacts with labels extracted from the same source code. These labels capture the important concepts within each linguistic cluster, revealing the intention of the code.

Similarly, [18] used a combination of LSI and Formal Concept Analysis (FCA) for indexing source code and organizing the results in a concept lattice, respectively. The initial aim of this work was to reduce developers' effort when searching source code by providing them with a list of artifacts related to a query entered by the user. Even so, the resulting representation of relevant information, which labels topics, concepts, and the relationships between them, can be regarded as an approximation to source code summarization.

Dynamic Approaches

As stated before, some methods analyze behavioral aspects and therefore require the execution of a program, or at least of one of its slices or traces. In contrast to static approaches, dynamic ones are able to analyze the behavior of variables and control structures, detect data dependencies, and collect and log temporal information. Moreover, dynamic approaches make it possible to observe the flow and behavior of a program under given conditions.

Summarizing Long Traces

In [19], the most representative software routines of large traces are identified by selection or generalization of the executed content. The summary is thus a simplification of the original input, reduced by removing low-level implementation details and utilities (through a utilityhood metric); in this case the resulting description is represented as a UML² sequence diagram. By utility, [19] refers to any software element designed and implemented to be accessed or called from anywhere in the source code (e.g., accessor methods or constructors), whereas implementation detail refers to any element whose absence does not interfere with the comprehension of a component.

Basically, the algorithm proposed in [19] for extracting the relevant routines from a scenario starts by instrumenting the source code to log method calls. Afterward, a static call graph is built and pruned by removing the low-level implementation details mentioned above. Next, the utilityhood of each remaining node of the graph is computed, based on fan-in and fan-out metrics, and the unnecessary routines are removed, i.e., those nodes whose utilityhood is the highest and which are therefore most likely to be utilities. This process is repeated until the number of routines required by the user is reached. Finally, the routines still present in the graph are represented in a light version of a UML sequence diagram, in order to visualize the trace behavior.
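The pruning loop can be sketched as follows. The score used here, fan-in divided by fan-in plus fan-out, is only an illustrative stand-in and is not the utilityhood formula of [19]. The sketch also assumes that the routines that look most like utilities (highest score) are the ones pruned, and that scores are recomputed after every removal. The class TracePruner is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: prunes utility-like routines from a call graph.
final class TracePruner {
    // callGraph maps each caller to the set of routines it calls.
    static Set<String> prune(Map<String, Set<String>> callGraph, int keep) {
        // Mutable copy in which every routine, including leaf callees, is a key.
        Map<String, Set<String>> graph = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : callGraph.entrySet()) {
            graph.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            for (String callee : e.getValue()) {
                graph.computeIfAbsent(callee, k -> new TreeSet<>());
            }
        }
        // Remove the most utility-like routine until the requested size is reached.
        while (graph.size() > keep) {
            String candidate = null;
            double best = -1.0;
            for (String routine : graph.keySet()) {
                double s = score(graph, routine);
                if (s > best) {
                    best = s;
                    candidate = routine;
                }
            }
            graph.remove(candidate);
            for (Set<String> callees : graph.values()) {
                callees.remove(candidate);
            }
        }
        return new TreeSet<>(graph.keySet());
    }

    // Assumed stand-in metric: called by many, calling few => close to 1.
    static double score(Map<String, Set<String>> graph, String routine) {
        int fanOut = graph.get(routine).size();
        int fanIn = 0;
        for (Set<String> callees : graph.values()) {
            if (callees.contains(routine)) {
                fanIn++;
            }
        }
        return (fanIn + fanOut == 0) ? 0.0 : (double) fanIn / (fanIn + fanOut);
    }
}
```

In a graph where main calls play and log, play calls decode and log, and decode calls log, the routine log is called from everywhere and calls nothing, so it is the first one removed.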

This approach was evaluated informally by analyzing a summary obtained from a particular trace of a program, through a questionnaire answered by developers with intermediate and advanced knowledge of the system. The questions assessed the quality of the summary by asking how well its content represented the traced process and how effective it could be in software maintenance. Again, just as in [15], the amount of detail included in the summaries versus the amount required by the users was a point of disagreement. Still, this kind of short description was rated as useful for understanding different scenarios of a system.

² Unified Modeling Language, a standardized modeling language for object-oriented development.

Summarizing Software Changes

Another direction is taken by [20], where code changes are documented by describing the effects they have on the behavior of a program. This description, commonly known as a commit message in version control systems, is generated automatically by executing two versions of the same system in search of differences between their control flows (the delta). As a result, the proposed algorithm considers the new behavior of the system and the conditions under which it is produced.

To this end, <statement, path predicate> pairs are obtained from running two different versions of the system, using symbolic values as input variables. In this context, the path predicate indicates the conditions under which the statement is executed. By comparing both sets of pairs, the statements whose path predicate has been added, removed, or modified are identified. Of course, not all statements become part of the summary: a filtering process retains only method invocations, field assignments, return statements, and throw statements. Then, some summarization transformations are applied to the statements previously found. For instance, if the first version of a chunk of code is

    if (interrupted) deleteFile();

and the second one is

    if (interrupted) revertProcess();

the whole change transformation for conditions expresses the change as

    if interrupted, do revertProcess() instead of deleteFile().
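The comparison of pairs and the "instead of" transformation can be sketched as follows. The sketch assumes that the pairs have already been obtained (no symbolic execution is performed here) and, as a simplification, that each path predicate guards a single statement, so that each version is a map from predicate to statement. The wording for added and removed behavior is invented for illustration, and the class ChangeDescriber is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: describes differences between two sets of
// <statement, path predicate> pairs, keyed by predicate.
final class ChangeDescriber {
    static List<String> describe(Map<String, String> oldPairs, Map<String, String> newPairs) {
        List<String> lines = new ArrayList<>();
        // Visit every predicate that appears in either version, in sorted order.
        Set<String> predicates = new TreeSet<>(oldPairs.keySet());
        predicates.addAll(newPairs.keySet());
        for (String predicate : predicates) {
            String before = oldPairs.get(predicate);
            String after = newPairs.get(predicate);
            if (before == null) {
                // Behavior that exists only in the new version (assumed wording).
                lines.add("if " + predicate + ", do " + after);
            } else if (after == null) {
                // Behavior that disappeared (assumed wording).
                lines.add("if " + predicate + ", no longer do " + before);
            } else if (!before.equals(after)) {
                // Same condition, different statement: the transformation shown above.
                lines.add("if " + predicate + ", do " + after + " instead of " + before);
            }
        }
        return lines;
    }
}
```

Unchanged pairs produce no output, which keeps the description limited to the delta between the two versions.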

[Figure: stages shown are Revision 1 and Revision 2; Predicate extraction; Text generation; Text optimization and smoothing; Acceptance text; Change summary]

Figure 1.4 Summary generation process for software changes. Adapted from [20].

Similar kinds of templates are defined for predicates, hierarchical structures, and method calls. Other transformations are applied to reduce the size of the documentation, and still others to improve readability. The major steps of the approach are shown in Figure 1.4.

The evaluation of this method assessed quantitative features such as the size and content of the summary. Content was evaluated through a comparison rubric that contrasted the description of changes with a set of human annotations. Additionally, developers considered qualitative features of summaries such as usefulness, readability, and accuracy. In general, the results were satisfactory, and the summaries could complement version control systems, reducing developers' effort when describing changes.

1.3.3 Combining Software Artifacts

All the previous approaches give insight into specific software artifacts. They are explicitly focused on a single type of document, leaving out external files that might in fact complement their results. However, more often than not, the information provided by just one artifact is not enough. This is especially true of maintenance tasks that demand several sources of knowledge in order to be completed successfully, as in the case of bug correction, where data contained in bug reports, source code, and even specification documents can support maintainers' work.

As an illustration, the second model proposed in [17], called the lexical source model, is constructed using keyword matching techniques (e.g., grep or awk) to find structural information in the source code, by specifying regular expressions related to different types of structural constructs (e.g., method declarations, method calls, and variable definitions). In this approach, a developer is able to specify a set of patterns for finding structural information in source code (e.g., a regular expression that matches a function call), a set of actions to be executed when specific structural information is found, and a set of rules to combine the structural information found in different files into one model. Several types of artifacts, such as data files or documentation files, can be scanned using this lexical approach, and the information can be combined into one model. This technique can be used to provide the structural information needed to build the aforementioned reflection model.
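The pattern, action, and combination steps can be illustrated with a small sketch. It assumes one pattern (an identifier immediately followed by an opening parenthesis is treated as a call), one action (record the file in which the name was found), and one combination rule (merge all files into a single map from name to files). These choices and the class LexicalExtractor are hypothetical; they are not the notation used in [17].

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: purely lexical extraction of call-like constructs.
final class LexicalExtractor {
    // Assumed pattern: identifier, optional whitespace, opening parenthesis.
    private static final Pattern CALL =
            Pattern.compile("\\b([A-Za-z_][A-Za-z0-9_]*)\\s*\\(");
    // Control keywords that look like calls lexically but are not.
    private static final Set<String> KEYWORDS =
            Set.of("if", "for", "while", "switch", "return", "catch");

    // files maps a file name to its text; the result maps each called name
    // to the files in which it appears.
    static Map<String, Set<String>> extract(Map<String, String> files) {
        Map<String, Set<String>> model = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            Matcher m = CALL.matcher(file.getValue());
            while (m.find()) {
                String name = m.group(1);
                if (!KEYWORDS.contains(name)) {
                    // Action: record where the construct was found.
                    model.computeIfAbsent(name, k -> new TreeSet<>()).add(file.getKey());
                }
            }
        }
        return model;
    }
}
```

Being lexical, the pattern is approximate: it also matches method declarations and text inside comments or string literals, which is the usual trade-off of this kind of approach compared with a full parser.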

In [21], a framework is proposed for synthesizing the information related to an evolution task or concern, and its interactions across code, revision history, bug reports, code wikis, and available documentation. Basically, the approach follows two phases. In the first one, the knowledge provided by the sources and related to the concern is extracted and inferred using static analysis methods, data mining, and language processing techniques in order to populate an ontology. In the second phase, the ontology is used to complete fixed templates or predefined rules in order to form a text-based summary.

Likewise, [22] applies code analysis and text mining to specification documents, UML diagrams, and user's manuals, with the purpose of building an ontology that makes it possible to cross and match the semantic knowledge between those elements. Source code and documents are processed in such a way that entities are identified, extracted, and associated with a class belonging to the ontology. This approach was evaluated through intrinsic co-selection measures, which showed acceptable levels of precision and recall for the text mining techniques. Nevertheless, the results of code analysis suggested, once again, that naming conventions are fundamental to the quality of the ontology, and also that disambiguation between parts of code is a desirable feature.

Another approach that uses several types of software artifacts is [23]. In it, techniques from the information extraction and natural language processing fields are applied to source code and documentation in order to connect the entities found in both of them, in such a way that the business rules underlying the design and implementation are rebuilt. After preprocessing the source code, an Abstract Syntax Tree is built and stored in a knowledge database. Then, the collected data are simplified and linked to knowledge documentation. To evaluate the results, the associated keyphrases are compared with other sets of documents used by analyzers, through intrinsic measures.

1.4 MAKING SOFTWARE EVOLUTION EASIER: USING SOFTWARE SUMMARIES IN MAINTENANCE ACTIVITIES

Maintenance is the most difficult and longest phase of the software life cycle: according to [24], more than 90% of software costs are spent on maintenance activities. The causes of this situation are related to inherent properties of modern software systems, such as complexity and changeability, but they are also associated with technical problems (e.g., limited understanding, testing costs, impact analysis, and maintainability) and management issues (e.g., alignment with organizational objectives, staffing, process issues, organizational aspects, and outsourcing) [25].

Table 1.1 Software summarization approaches.

| Artifact | Short description | Input | Output | Evaluation |
|---|---|---|---|---|
| Documentation | Summarizing bug reports through machine learning classifiers [12] | Bug reports | Text-based summaries | Intrinsic evaluation by co-selection; informal evaluation |
| Source code (dynamic approaches) | Summarizing the content of large traces through routine filtering [19] | Large traces | UML sequence diagrams | Informal evaluation |
| Source code (dynamic approaches) | Documenting program changes by symbolic execution comparison and natural language processing [20] | Source code of two versions of a program | Textual description of changes in runtime behavior | Quantitative and qualitative evaluation |
| Source code (static approaches) | Building reflection models comparing high-level models and source code structural data [17] | Source code | Reflection models | |
| Source code (static approaches) | Identifying topics in source code using information retrieval and map visualization [13] | Source code | Semantic clusters | |
| Source code (static approaches) | Summarizing source code artifacts applying text retrieval techniques [16, 26] | Source code artifacts | Term-based summaries | Intrinsic evaluation (online and offline) |
| Source code (static approaches) | Summarizing Java methods by relevance heuristics and templates [15] | Source code of methods | Text-based summaries | Informal evaluation |
| Documentation + Source code | Building lexical source models using keyword matching techniques [17] | Source code; data files; documentation | Lexical source models | |
| Documentation + Source code | Summarizing software concerns applying static analysis, information retrieval and natural language processing techniques [21] | Source code; historical repositories; bug repositories; wikis and documentation | Text-based summaries | |
| Documentation + Source code | Building an ontology between code and documentation via source code analysis and text mining [22] | Source code; specification documents; UML diagrams; user's manuals | Ontology linking source code and document semantics | Intrinsic evaluation by co-selection |
| Documentation + Source code | Extracting business rules from code and documentation by information extraction and natural language processing [23] | Source code; specification documents | Knowledge database | |


In order to reduce the effects of these concerns, some techniques are used specifically for maintenance, including but not limited to program comprehension, re-engineering, reverse engineering, impact analysis, and feature/concept location. Such activities require support from the available information about the software, which is mainly found in artifacts. In that sense, short descriptions of artifacts would be entirely appropriate because they would reduce the time and effort spent reading, for example, source code or software documentation. Next, some applications of software summarization to certain maintenance activities are described.

1.4.1 Software Comprehension

Between 40% and 60% of maintenance effort is dedicated to program understanding [25]. Before making any enhancement, adaptation, or correction to a system, the possible changes need to be analyzed to determine where they have to be implemented and how those modifications will be performed. In an ideal situation, maintainers would be provided with concise, sufficiently explanatory, and up-to-date software deliverables to aid software comprehension. However, on many occasions documentation is lengthy, outdated, of poor quality, or, in the worst case, non-existent.

The summarization of software documents represents a viable alternative for minimizing the documentation overload problem, which is a common concern in document-driven processes (e.g., those based on the Rational Unified Process, RUP). In each phase of this type of methodology, a long list of documents intended to help maintain systems is often generated. Although such artifacts are neither the most important nor the most used by maintainers when understanding a system [27], they still constitute a rich but undervalued source of information about software, which can be better exploited if a shorter version of the documents, or at least the gist of their content, is provided to the user. As a result, depending on the purpose of the summary, maintainers would be able to

• identify whether the document is useful and worth analyzing more carefully (if the summary is indicative); or

• use it as a substitute for the document (if the summary is informative).

On the other hand, if the problem to be faced is low-quality or obsolete documentation, source code becomes the only useful artifact for supporting maintenance tasks. In fact, source code is in some cases the only artifact generated by development processes and, not surprisingly, developers consider it the most important and most used artifact when understanding software systems [27]. However, during software maintenance it is not always possible to read and understand the entire implementation of a system: techniques like scanning, skimming, and detailed reading are commonly used by maintainers when dealing with source code. Briefly:

• scanning is about deciding whether a resource is useful or not (e.g., by reading only the header of a method), and allows developers to locate relevant information rapidly;

• skimming is getting the gist of a source code artifact without reading each line, and allows developers to grasp its purpose and role in the system quickly; and

• detailed reading is focusing on artifacts of particular interest by reading them in detail.

As can be noticed, these techniques are closely tied to searching and navigation. Furthermore, developers spend more time reading and navigating source code than writing it [16]. In that sense, source code summarization could generate descriptions whose content offers more information than headers and comments, and whose length is less than that of the chunk of code at hand. Even if reading the summary does not replace detailed reading, it could substitute for scanning and skimming. Along these lines, searching and navigation tools can also be supported by summaries.

1.4.2 Reverse Engineering

Reverse engineering covers every method, technique, and tool whose aim is to recover or acquire knowledge about software systems in order to support the performance of a software engineering task [28]. This is usually done by identifying software components and their interrelationships, and creating representations of the software at higher levels of abstraction [25]. In this regard, software summarization itself can be considered a reverse engineering technique. As a case in point, the automatic summarization of source code artifacts can be used in documentation and re-documentation processes of software systems [15, 16], by generating the leading comments and keywords of classes or methods.

Moreover, summaries of software artifacts have been proposed to improve traceability link recovery. One of the main challenges in the analysis of candidate links is the large number of different software artifacts that have to be studied. In [29], the application of text processing techniques is proposed for summarizing text-based elements such as requirements, user's manuals, design documents, test cases, and bug reports, together with hybrid methods for describing source code. In this way, developers will be provided with summaries of software artifacts that are effective for making decisions about the correctness of links. The summaries for fulfilling this objective are extractive or abstractive depending on the summarized artifact, but they are always intended to be informative.

1.5 TRENDS AND CHALLENGES

The previous background reveals that summarization is a relatively recent concept in the software engineering field. Even though there are currently some approaches that attempt to briefly describe certain types of artifacts, there is still a lot of work to do and many obstacles to overcome. The first obvious one is the lack of approaches aiming to summarize single documents such as requirements specifications, use cases, technical designs, or test cases. Such files are often written in natural language, and their content is organized in formal, semi-standard structures; these features can be exploited using the same techniques applied in natural language processing. Even if they include chunks of source code, they can be treated with mixed methods that take into account the different properties of the discourse.

With regard to source code summarization, there are some issues that need to be addressed as well. Although acceptable summaries are achieved in [15, 16], both approaches have their own drawbacks. For example, the former assumes that the code follows certain conventions and proposes a template for each pattern found, which implies indirect human intervention, extra effort in text generation, and, probably, the lack of a template for unidentified patterns. On the other hand, the latter approach leaves aside the structural information of source code, treating an artifact as a bag of words, which it clearly is not. Moreover, static approaches assume that source code contains good-quality identifiers and comments and, sometimes, that it follows certain writing standards. Therefore, the results of static summarizing methods are strongly correlated with the lexical, syntactic, and semantic adequacy of the terms chosen by programmers.

In that sense, dynamic approaches represent an apparent alternative for summarizing source code, leaving aside those grammatical assumptions by considering its real functionality under arbitrary conditions. However, none of the existing dynamic techniques describes the content of a method, class, or package yet, probably because this kind of analysis requires extra effort to design quality execution scenarios. Furthermore, a better alternative could take advantage of the semantic, syntactic, and functional properties of source code by fusing static and dynamic analyses, similar to the combination presented by [20] to describe software changes. With this in mind, [6] presents a proposal to provide a description of the code that is more informative than the header and the leading comments, yet much shorter than the implementation, while capturing its essential information. The novelty of this approach is the use of lexical and structural information, complemented with heuristics based on the parts of the code that developers consider important when describing specific source code units.

Besides, hybrid techniques embody one of the most promising challenges in software summarization. This is not just because of the value they bring to maintenance activities through the richness of their content (derived from developers' knowledge and captured in different kinds of software artifacts), but also because of the value of algorithms that consider the diverse properties of source code and natural language texts and, beyond that, interrelate their content in some way to generate a more complete and coherent description of artifacts. In this respect, [22, 23] embody well-designed approaches that mix artifacts on the basis of a common language, whereas [21] is so far only a proposal for summarizing systems by using several types of artifacts.

It is important to state that summarization techniques should tend to be fully automatic, or at least to require minimal human intervention, unless the purpose of the intervention is feedback or customization of the results. Thus, although the models generated in [17] help in understanding the structure of a system, the techniques applied to obtain them require a lot of intervention from developers, making the approach unsuitable for maintenance tasks.

Regarding the evaluation of software summaries, the lack of formal frameworks is evident for both intrinsic and extrinsic evaluation. Until now, the informal assessment of software summarization approaches has made it difficult to compare techniques and results. Just as in natural language processing, techniques that do not involve the subjectivity of evaluators are desirable (i.e., offline evaluation). The results presented in [12, 26, 30] suggest that information retrieval metrics are suitable for assessing the intrinsic quality of summaries, whether of documentation or of source code. Even so, evaluation depends on how summaries are represented; if the resulting description is drawn as a diagram or map, ideally the same measures would be adapted to that context in order to facilitate comparison between approaches. Additionally, research on evaluation requires more studies of the usefulness of software summaries within typical activities of software processes, beyond software comprehension and reverse engineering.

Finally, summaries are practical only if they can be easily generated, accessed and used. In that sense, automatic tools for producing summaries should be available within development environments. Moreover, summarizers should allow developers to provide feedback on the summaries in order to improve them by adding or removing details, depending on the needs and expertise of users.

In conclusion, software summarization is an interesting research area in software engineering. Not surprisingly, it has rapidly gained importance and obtained good results at many levels. However, much work remains to be done concerning the coverage of software artifacts, the improvement of summarization techniques, the representation of summaries, the formalization of evaluation, and the availability of summarizers.



REFERENCES

[1] Mani, I. Summarization of Text, Automatic. In: Encyclopedia of Language & Linguistics, (ed.) Brown, K. Elsevier, Oxford: 2006. doi:10.1016/B0-08-044854-2/00957-3.

[2] Steinberger, J. & Ježek, K. Text Summarization: An Old Challenge and New Approaches. In: Foundations of Computational Intelligence Volume 6, (eds.) Abraham, A., Hassanien, A. E., Leon, D. & Snášel, V., volume 206 in Studies in Computational Intelligence. Springer Berlin / Heidelberg: 2009, 127–149. doi:10.1007/978-3-642-01091-0_6.

[3] Radev, D. R., Hovy, E. & McKeown, K. Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4): 2002; 399–408. ISSN 0891-2017. doi:10.1162/089120102762671927.

[4] Hovy, E. & Lin, C. Y. Automated Text Summarization in SUMMARIST. In: Advances in Automatic Text Summarization, (eds.) Mani, I. & Maybury, M. T. MIT Press: 1999.

[5] Jones, K. Automatic summarising: The state of the art. Information Processing & Management, 43(6): 2007. doi:10.1016/j.ipm.2007.03.009.

[6] Aponte, J. & Moreno, L. Summarizing Source Code Artifacts. In: Encuentro Nacional de Investigación y Desarrollo - ENID, (eds.) Niño, L. & Gómez, A. Universidad Nacional de Colombia: 2010.

[7] Hariharan, S. & Srinivasan, R. Studies on intrinsic summary evaluation. 2: 2010.

[8] Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries. In: Proc. ACL workshop on Text Summarization Branches Out: 2004, 10.

[9] Nenkova, A. & Passonneau, R. Evaluating Content Selection in Summarization: The Pyramid Method: 2005.



[10] LaToza, T. D., Venolia, G. & DeLine, R. Maintaining mental models: a study of developer work habits. In: ICSE '06: Proceedings of the 28th international conference on Software engineering. ACM, New York, NY, USA: 2006, 492–501. doi:10.1145/1134285.1134355.

[11] Ko, A., Myers, B., Coblenz, M. & Aung, H. An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. IEEE Transactions on Software Engineering, 32(12): 2006; 971–987. doi:10.1109/TSE.2006.116.

[12] Rastkar, S., Murphy, G. C. & Murray, G. Summarizing software artifacts: a case study of bug reports. In: ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering. ACM, New York, NY, USA: 2010, 505–514. doi:10.1145/1806799.1806872.

[13] Kuhn, A., Ducasse, S. & Gîrba, T. Semantic clustering: Identifying topics in source code. Inf. Softw. Technol., 49(3): 2007; 230–243. doi:10.1016/j.infsof.2006.10.017.

[14] Binkley, D. Source Code Analysis: A Road Map. In: 2007 Future of Software Engineering, FOSE '07. IEEE Computer Society, Washington, DC, USA: 2007. ISBN 0-7695-2829-5, 104–119. doi:10.1109/FOSE.2007.27.

[15] Sridhara, G., Hill, E., Muppaneni, D., Pollock, L. & Shanker, K. V. Towards Automatically Generating Summary Comments for Java Methods. 25th IEEE/ACM International Conference on Automated Software Engineering: 2010; 43–52. doi:10.1145/1858996.1859006.

[16] Haiduc, S., Aponte, J., Moreno, L. & Marcus, A. On the Use of Automated Text Summarization Techniques for Summarizing Source Code: 2010.

[17] Murphy, G. C. Lightweight structural summarization as an aid to software evolution. PhD Thesis, University of Washington: 1996.



[18] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code. In: 15th IEEE International Conference on Program Comprehension (ICPC '07). IEEE, Washington, DC, USA: 2007, 37–48. doi:10.1109/ICPC.2007.13.

[19] Hamou-Lhadj, A. & Lethbridge, T. Summarizing the Content of Large Traces to Facilitate the Understanding of the Behaviour of a Software System. In: Proceedings of the 14th IEEE International Conference on Program Comprehension. IEEE Computer Society: 2006. doi:10.1109/ICPC.2006.45.

[20] Buse, R. P. L. & Weimer, W. R. Automatically documenting program changes. In: ASE '10: Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, New York, NY, USA: 2010, 33–42. doi:10.1145/1858996.1859005.

[21] Rastkar, S. Summarizing software concerns. In: ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering. ACM, New York, NY, USA: 2010, 527–528. doi:10.1145/1810295.1810464.

[22] Witte, R., Li, Q., Zhang, Y. & Rilling, J. Text mining and software engineering: an integrated source code and document analysis approach. IET Software Journal, 2(1): 2008.

[23] Putrycz, E. & Kark, A. Connecting Legacy Code, Business Rules and Documentation, volume 5321. Springer Berlin / Heidelberg: 2008, 17–30.

[24] Koskinen, J. Software Maintenance Costs: 2010. Accessed and verified on 04/10/2011.

[25] IEEE Computer Society. Software Engineering Body of Knowledge (SWEBOK): 2004. Accessed and verified on April 11, 2011.

[26] Haiduc, S., Aponte, J. & Marcus, A. Supporting program comprehension with source code summarization. In: ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, volume 2. ACM, New York, NY, USA: 2010, 223–226. doi:10.1145/1810295.1810335.

[27] de Souza, S. C. B., Anquetil, N. & de Oliveira, K. M. A study of the documentation essential to software maintenance. In: Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, SIGDOC '05. ACM, New York, NY, USA: 2005. ISBN 1-59593-175-9, 68–75. doi:10.1145/1085313.1085331.

[28] Tonella, P., Torchiano, M., Du Bois, B. & Systä, T. Empirical studies in reverse engineering: state of the art and future trends. Empirical Softw. Engg., 12: 2007; 551–571. ISSN 1382-3256. doi:10.1007/s10664-007-9037-5.

[29] Aponte, J. & Marcus, A. Improving Traceability Link Recovery Methods through Software Artifact Summarization. In: Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering (TEFSE 2011): 2011.

[30] Moreno, L., Rodríguez, C. & Aponte, J. Evaluating source-code summaries using Information Retrieval and Text Summarization techniques. In: Tendencias en ingeniería de software e inteligencia artificial, (eds.) Giraldo, G. & Zapata, C. Universidad Nacional de Colombia: 2011.


Survey and Research Trends in Mining Software Repositories

Christian Rodríguez-Bustos
Yury Niño
Jairo Aponte

ABSTRACT

Software repositories such as version control systems, bug tracking systems and archived communication mechanisms are used by software developers to keep track of the activities they perform. In particular, version control systems store and manage the changes made to source code and documentation. Bug tracking systems are employed to track the state of issues, and other repositories such as IRC systems are used to discuss topics related to coordination activities. The data contained in these repositories have been used by researchers to support maintenance tasks, improve software designs, understand software development, predict bugs, handle planning aspects and measure the contribution of individuals. Mining Software Repositories (MSR) is a research field that seeks to discover knowledge by exploring, integrating, processing and analyzing the data contained in these repositories. This document presents a literature survey of MSR approaches, together with various research trends and challenges that require further investigation.

2.1 INTRODUCTION

Since the 1980s, several authors have studied data generated in software projects to understand how software evolves. The most notable results of these studies are software life cycles [1], the laws of software evolution [2], software evolution metrics [3, 4], and the theory of software evolution [5]. These studies were carried out only on industrial systems, which did not have public software repositories; consequently, researchers' efforts were limited to a few companies interested in making good use of historical data that could help improve their development process.

Meir Lehman was one of the pioneers of software evolution theory; he identified the sources of evolution in computer applications and programs and showed that evolution is a never-ending process. One of his main contributions was the formulation of the laws of software evolution in the 1980s. They are summarized in Table 2.1.

Table 2.1 Laws of program evolution [1].

Law | Description
Continuing change | This law expresses the fact that large programs are never completed. The change or decay process continues until it is judged more cost effective to replace the system with a recreated version.
Increasing complexity | This law proposes that because an evolving program is continually changed, its complexity reflects a deteriorating structure.
The fundamental law of program evolution | This law proposes that the evolution of a program is subject to a dynamic. It makes the programming process self-regulating with statistically determinable trends and invariances.
Conservation of organizational stability | This law proposes that during the active life of a program, the global activity rate in a programming project is statistically invariant.
Conservation of familiarity | This law describes the fact that during the active life of a program the release content (changes, additions, deletions) of the successive releases of an evolving program is statistically invariant.

Recently, with the success of open source software projects, the data available in public software repositories have grown exponentially¹. These repositories include version control systems, bug tracking systems and archived communication mechanisms. Version control systems allow developers to store and manage changes made to source code and documentation. Bug tracking systems are employed to track the state of issues. Other repositories such as Internet Relay Chat (IRC) systems are used to discuss topics related to coordination activities. These repositories are generated and populated during software evolution; consequently, they hold a wealth of information and provide a unique view of the actual evolutionary path taken to realize a software system [6].

¹ http://sourceforge.net, the most important website for the development of open source projects, had 2.7 million registered developers and 260,000 projects as of February 2011.

In this context, Mining Software Repositories (MSR) emerged as a research field that seeks new knowledge by exploring, integrating, processing and analyzing the data contained in software repositories. The goal is to make the acquired knowledge useful for present and future decision-making processes [7].

The term MSR has been coined to describe a broad class of related investigations, which can be classified according to the type of issue they tackle. In particular, researchers have used MSR in various ways, for instance: analyzing software ecosystems and mining repositories across multiple projects, assisting in program understanding and visualization [8–10], predicting the quality of software systems [11–13], studying the evolution of software systems, characterizing software defects [14], discovering patterns of change and refactorings [15, 16], understanding the origins of code cloning [17], measuring the contribution of developers [18–23], and modeling social and development processes [24–26].

The goal of this chapter is to provide a general overview of MSR. Section 2.2 describes software repositories; Section 2.3 explains the phases commonly used in MSR and describes the functionality of some MSR tools. Section 2.4 presents the state of the art of research in MSR and, finally, Section 2.5 mentions some research challenges and trends.

2.2 UNDERSTANDING SOFTWARE REPOSITORIES

Software repositories are artifacts produced and archived during the software life cycle. Software developers use them to help manage the progress of software projects. They can be classified into historical repositories, communications logs, source code, runtime, documentation, and test executions. Figure 2.1 illustrates the classification of software repositories according to the data they store.

Figure 2.1 Classification of Software Repositories. [Diagram: software repositories are divided into historical repositories (version control systems, which may be local only, centralized or distributed, and bug tracker systems), communications logs (IM conversations, emails, forums), source code (SourceForge, Google Code, GitHub) and other kinds (test results, deployment logs, error messages, build warnings).]

2.2.1 Historical Repositories

Historical repositories include version control and bug tracking systems. The following subsections present a general description of both.

Version Control Systems

Version control systems keep a record of the changes made to files over time and have been used by developers to manage source code in terms of revisions, versions or patches. Version control systems also allow developers to create workflows through branching and merging operations [27]. Changes stored in historical repositories generally have detailed information associated with every transaction or commit, namely: date, author, modification performed, comments and extra information [28]. These repositories are classified as centralized or distributed. Centralized repositories, such as CVS², Subversion³ and Perforce⁴, have a single server that contains all the versioned files, allowing clients to check out files from that central place [27]. In distributed repositories, such as Git⁵, Mercurial⁶, Bazaar⁷ or Darcs⁸, clients not only check out the latest snapshot of the files, but also mirror the entire central repository or other developers' repositories [29]. This model is explained later in Section 2.5.1.
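The commit information listed above (date, author, comments) is usually the first thing an MSR study extracts. The following minimal Python sketch shows one possible way to load it into records. It assumes, for illustration only, that the log has been exported as one line per commit with fields separated by the ASCII unit separator; for a Git repository such lines could be produced with `git log --pretty=format:%H%x1f%an%x1f%aI%x1f%s`. The `Commit` class, the `parse_log` function and the sample data are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime

FIELD_SEP = "\x1f"  # ASCII unit separator; assumed not to occur in comments


@dataclass(frozen=True)
class Commit:
    revision: str
    author: str
    date: datetime
    comment: str


def parse_log(text: str) -> list[Commit]:
    """Turn an exported log (one commit per line) into Commit records."""
    commits = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # ignore blank lines in the export
        revision, author, date, comment = line.split(FIELD_SEP, 3)
        commits.append(
            Commit(revision, author, datetime.fromisoformat(date), comment)
        )
    return commits


# Hypothetical sample export with two commits.
SAMPLE_LOG = "\n".join([
    FIELD_SEP.join(["a1b2c3", "alice", "2011-02-01T10:15:00+00:00",
                    "Fix bug 42 in parser"]),
    FIELD_SEP.join(["d4e5f6", "bob", "2011-02-03T09:00:00+00:00",
                    "Add export feature"]),
])
commits = parse_log(SAMPLE_LOG)
```

Once commits are available as typed records, later steps such as grouping by author or ordering by date operate on plain Python values instead of raw log text.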

Bug Tracking Systems

Bug tracking systems such as Bugzilla⁹ or Mantis¹⁰ have been used to store bug reports, requests for enhancements and new feature requests. Developers employ these repositories to submit reports and provide feedback about the results of testing activities. Reports are assigned to developers, who fix the corresponding bugs and then mark them as solved. In bug tracking systems, users may add comments and propose source code patches [28].

2.2.2 Communications Logs

In open source software projects, most developers are geographically distributed. Communication mainly occurs through electronic mail, in most cases in the context of electronic mailing lists, forums, or through Internet Relay Chat (IRC) systems [28]. These systems are used by developers to ask each other questions and get instant responses. Consequently, the corresponding archives are a rich source of information related to the decisions taken throughout the life of a project.

² CVS Project webpage: http://www.nongnu.org/cvs/
³ SVN Project webpage: http://subversion.tigris.org
⁴ Perforce Project webpage: http://www.perforce.com/
⁵ Git Project webpage: http://git-scm.com/
⁶ Mercurial Project webpage: http://mercurial.selenic.com/
⁷ Bazaar Project webpage: http://bazaar.canonical.com/en/
⁸ Darcs Project webpage: http://darcs.net/
⁹ Bugzilla project website: http://www.bugzilla.org/
¹⁰ Mantis project webpage: http://www.mantisbt.org/



2.2.3 Source Code

Sites such as SourceForge¹¹, Google Code¹² or GitHub¹³ are web-based source code repositories that allow developers to create open source software projects and manage the development and distribution process. These repositories act as a centralized location where software developers control and manage projects. Some of these systems provide revision control systems, bug trackers, wikis, metrics, access to databases and unique URLs¹⁴.

2.2.4 Other Kinds of Repositories

Other, less common kinds of repositories include documentation, test executions, runtime logs, deployment logs and build reports. These archives contain valuable dynamic and static information about the systems. Although extracting knowledge from these archives can be difficult given the unstructured nature of the data they contain, they are a rich source of information in the operation and maintenance phases.

2.3 PROCESSES OF MINING SOFTWARE REPOSITORIES

The following subsections survey the methodologies that have been developed or adopted in MSR. They describe the techniques employed in recent MSR studies and explain the functionality of tools proposed in the literature.

2.3.1 Techniques

Several methodologies have been proposed for solving software engineering problems using MSR. Huzefa [6, 30] classified some methodologies according to the techniques used in studies found in the MSR literature. Table 2.2 describes several popular techniques that have been reported.

¹¹ SourceForge webpage: http://sourceforge.net/
¹² Google Code webpage: http://code.google.com/
¹³ GitHub webpage: https://github.com/
¹⁴ URLs are useful for publishing projects on the Internet.



Table 2.2 Methodologies used in MSR.

Technique | Purpose
Metadata Analysis | To study the links between metadata (e.g., author, severity, date, id) using regular expressions, heuristics and common subsequence matching.
Static Source Code Analysis | To study changes in individual versions of source code using static program analysis.
Source Code Differencing | To study differences between versions of source code using semantic differencing and analysis of changes in micro patterns.
Software Metrics | To study software metrics that assess aspects of software products by analyzing module size, developers' effort, phase cost, functionality implemented and software quality, complexity, efficiency, reliability and maintainability.
Visualization | To study computer-based and interactive visual representations of data mined from software repositories in order to amplify cognition and understanding.
Clone Detection Methods | To study source code entities with similar textual, structural and semantic composition using techniques such as text-based and token-based comparison, program dependency graphs and ASTs.
Frequent Pattern Mining | To study frequent patterns in metadata, source code data and difference data through itemset mining and sequential pattern mining.
Information Retrieval Methods | To study traceability, program comprehension and software reuse through classification and clustering of textual units based on various similarity concepts using IR methods.
Classification with Supervised Learning | To study the classification of bug reports or changes using techniques that automatically acquire and integrate knowledge in order to improve performance for a task.
Social Network Analysis | To study the relationships between human entities using social network analysis for measuring contributions and discovering developers' roles and associations between projects' contributors.
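To make the first row of Table 2.2 more concrete, the following minimal Python sketch illustrates metadata analysis with regular expressions: it links commits to bug reports by looking for bug identifiers in commit comments. The regular expression, the function names and the sample data are assumptions made for this example and are not taken from the surveyed studies; a real project would need a pattern tuned to its own commit conventions.

```python
import re

# Heuristic pattern (an assumption): a keyword such as "bug", "issue" or
# "fixes" followed by an optional "#" and a number. It can yield false
# positives, e.g. "fixed 2 typos", which is why candidate ids are checked
# against the ids known to the bug tracker.
BUG_ID = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE)


def linked_bug_ids(comment: str) -> set[int]:
    """Return the candidate bug ids mentioned in a commit comment."""
    return {int(number) for number in BUG_ID.findall(comment)}


def link_commits_to_bugs(commits, known_bugs):
    """Map each known bug id to the revisions whose comments mention it.

    commits: iterable of (revision, comment) pairs.
    known_bugs: set of ids that exist in the bug tracking system.
    """
    links = {}
    for revision, comment in commits:
        for bug_id in sorted(linked_bug_ids(comment) & known_bugs):
            links.setdefault(bug_id, []).append(revision)
    return links


# Hypothetical sample data.
sample_commits = [
    ("r1", "Fix bug 42"),
    ("r2", "Refactor, fixes #42 and issue 7"),
    ("r3", "Update docs"),
]
links = link_commits_to_bugs(sample_commits, known_bugs={42, 99})
```

Intersecting the candidates with the ids that actually exist in the bug tracker is a simple way to discard numbers that only look like bug references.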

2.3.2 Tools

In order to carry out MSR studies, some researchers have developed special-purpose tools; some of them are shown in Table 2.3, together with a short description of the purpose of each tool.



Table 2.3 Mining Software Repositories tools.

Tool | Functionality
Alitheia Core | The Alitheia Core tool is an extensible platform for software quality analysis. It is designed to facilitate software engineering research on large and diverse data sources by integrating data collection and preprocessing phases with an array of analysis services; it presents the researcher with an easy-to-use extension mechanism [31, 32].
CVSAnaly | CVSAnaly is a tool for remotely measuring and analyzing large libre software projects using publicly available data from their version control repositories. It consists of three steps: preprocessing, database insertion, and postprocessing [33].
CVSChecker | CVSChecker is a tool designed to analyze the performance of individual developers and the work distribution patterns of teams based on historical source code repository data. It is developed as a plug-in for the Eclipse IDE and assumes CVS as the underlying source code repository [34].
CVSScan | CVSScan is an integrated multiview environment oriented to displaying changing code, in which each version is represented by a column and the horizontal direction is used for time. Separate linked displays show various metrics, as well as the source code itself. A large variety of options is provided to visualize a number of different aspects [35].
DrJones | DrJones is a system for performing software archaeology analysis on software archived in a versioned repository. It is used for studying how old the current code base of software applications is [33].
GlueTheos | GlueTheos is a system that allows users to retrieve and analyze data from public software repositories. It can access CVS repositories and archives of source packages. It is designed in a highly modularized way. By using external tools it can perform various analyses on the fetched data and produce various kinds of reports [36].
MailingListStats | MailingListStats is a tool for analyzing activity in mailing lists. The tool extracts information about authors, dates, subjects and other fields related to email, and stores them in a database [37].
Programeter | Programeter is a metrics tracking kit for software development. It monitors a set of metrics throughout the software development life cycle and generates full and objective analytics on coding, quality assurance and project management trends.
SoftChange | SoftChange is a tool for the extraction, enhancement and visualization of software trails such as CVS. It has an extractor, a SQL relational database management system, a fact enhancer and a visualizer [38].
SLOCCount | SLOCCount is a tool that performs advanced counting of physical source lines of code. It uses various heuristics to determine the programming language, and then filters comments [39].
Xia | Xia is a visualization tool for the navigation and exploration of software version history and associated human activities. It has exploration mechanisms to query the information space. The tool was integrated with Eclipse, an integrated development environment [40].

2.4 PURPOSE OF MINING SOFTWARE REPOSITORIES

Researchers have mined data and metadata gathered from software repositories to extract pertinent information for solving problems related to system growth, program understanding, quality prediction of software systems, refactorings and change patterns, measurement of individual contributions, and the modeling and understanding of social and development processes. The next subsections present the state of the art of some works related to the problems mentioned above.

2.4.1 Program Understanding

A good understanding of the software system is needed to reduce the cost and time of maintenance activities. Various research works have been proposed to assist developers in this process. Some of them are described next.

In 2001, Sayyad et al. [8] described the application of inductive methods on data extracted from both source code and software maintenance records. They extracted the relations that indicated which files in a legacy system were relevant to each other, in the context of program maintenance. They proposed a methodology for extracting and evaluating the relations and found that the precision and recall measures increased compared with other research works.

In 2004, Hassan et al. [41] described an approach that recovers valuable information from source control systems and attaches this information to the static dependency graph of a software system. This information is used to help understand the architecture of large software systems. To demonstrate the viability of their approach, they used it to understand the architecture of NetBSD.

In 2007, Canfora et al. [9] presented a technique to track the evolution of source code lines, identifying whether a CVS change is due to line modifications rather than to additions and deletions. The technique compares the sets of lines added and deleted in a change set using vector space models with the Levenshtein distance. They obtained results that indicated that the proposed approach ensured both high precision and high recall.
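The idea of distinguishing modified lines from pure additions and deletions can be sketched in a few lines of Python. The sketch below covers only the Levenshtein part and omits the vector space model step; the greedy pairing strategy, the normalization by the longer line, the threshold value of 0.6 and all function names are assumptions made for illustration, not the procedure of [9].

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical lines."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify_change(deleted, added, threshold=0.6):
    """Greedily pair each deleted line with its most similar added line.

    Pairs at or above the threshold are reported as modifications; the
    remaining lines are pure deletions and pure additions.
    """
    remaining_added = list(added)
    modified, pure_deletions = [], []
    for old in deleted:
        best = max(remaining_added,
                   key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            modified.append((old, best))
            remaining_added.remove(best)
        else:
            pure_deletions.append(old)
    return modified, pure_deletions, remaining_added


# Hypothetical change set: one renamed variable, one unrelated replacement.
modified, deletions, additions = classify_change(
    deleted=["int count = 0;", "return null;"],
    added=["int counter = 0;", 'log.info("done");'],
)
```

In this example the first deleted line differs from an added line by two characters and is therefore classified as a modification, whereas the second pair of lines is too dissimilar and is reported as a deletion plus an addition.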

Improving these kinds of techniques could be useful for the software industry because knowledge about the system could be extracted from the system itself. There would then be no dependency on human experts, which makes work easier for newcomers, distributed teams or teams working on legacy systems. One major drawback is that there are few generic tools that implement these approaches, which restricts their applicability in industry.

2.4.2 Prediction of Quality of Software Systems

Participants in software projects spend a large amount of resources on quality assurance, but these resources are usually limited. Because of this high cost, managers need to invest resources in those modules that have the highest risk of producing failures. Some research works proposed in that direction are presented below.

In 2006, Knab et al. [11] presented an approach that applies a decision tree learner to evolution data extracted from the Mozilla open source web browser project in order to predict defect density. The data include different source code, modification and defect measures, computed from seven recent Mozilla releases. Their experiments showed that a simple tree learner can produce good results with various sets of input data. They found that lines of code have less value for predicting defect densities than the number of past bug reports.

In 2007, Morisaki et al. [12] described an empirical study to reveal rules associated with defect correction effort, defined as a quantitative variable, and extended association rule mining to handle such quantitative variables directly. An extended rule describes the statistical characteristics of a ratio or interval scale variable in the consequent part of the rule by its mean value and standard deviation, so that conditions producing distinctive statistics can be discovered. They found that it is necessary to pay attention to the types of defects that have a larger mean and standard deviation of effort.
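The notion of a rule whose consequent is described by a mean and a standard deviation can be illustrated with a small Python sketch. Given a condition on defect attributes (the antecedent), it computes the support of the condition and the mean and standard deviation of the correction effort of the matching defects. The record fields (`type`, `phase`, `effort`), the function name and the data are hypothetical, and the sketch only evaluates a single given rule; it does not search for rules as a mining algorithm would.

```python
from statistics import mean, pstdev


def rule_statistics(records, antecedent, target="effort"):
    """Describe the target variable for the records matching an antecedent.

    records: list of dicts describing defects.
    antecedent: dict of attribute values that a record must match.
    Returns None when no record matches.
    """
    matching = [
        record[target]
        for record in records
        if all(record.get(key) == value for key, value in antecedent.items())
    ]
    if not matching:
        return None
    return {
        "support": len(matching) / len(records),
        "mean": mean(matching),
        "std": pstdev(matching),  # population standard deviation
    }


# Hypothetical defect records; effort is measured in hours.
defects = [
    {"type": "logic", "phase": "design", "effort": 8.0},
    {"type": "logic", "phase": "coding", "effort": 12.0},
    {"type": "ui", "phase": "coding", "effort": 2.0},
    {"type": "ui", "phase": "coding", "effort": 4.0},
]
logic_rule = rule_statistics(defects, {"type": "logic"})
ui_rule = rule_statistics(defects, {"type": "ui"})
```

Comparing the statistics of different antecedents, here logic defects against user interface defects, is what allows conditions with distinctively high mean effort or high variability to stand out.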

In 2008, Zimmermann et al. [42] researched how code complexity, problem domain, past history, and process quality affect software defects. They performed a case study on five Microsoft projects and observed significant correlations between complexity metrics, both object-oriented (OO) and non-OO, and post-release defects.

In 2007, Panjer et al. [13] explored the viability of using data mining tools to predict the time spent on fixing a bug, having only basic information about the bug's lifetime. They used a historical portion of the Eclipse Bugzilla database for modeling and predicting bug lifetimes. A bug history transformation process is described and various data mining models are built and tested. They found that an accuracy of 34.9% can be achieved by using primitive attributes associated with a bug.

In 2010, Guo et al. [43] performed an empirical study to characterize factors that affect which bugs get fixed in Windows Vista and Windows 7. They focused on factors related to bug reports and relationships between people involved in handling the bug. They built a statistical model to predict the probability of fixing a new bug and found that bugs reported by people with better reputations were more likely to be fixed, as well as those bugs handled by people on the same team and working in geographical proximity.

It is well known that the software industry spends a lot of resources on quality assurance of its products. Since these resources are limited, progress in this research field has made it possible to spend them more effectively, obtaining the best quality and the lowest risk of failure. Open source project communities have benefited from these results because they have bug tracking systems that contain a wealth of information about software failures: how they occurred, who was affected, and how they were fixed. In previous works, this information has been used to predict future software properties, such as where the defects are, how to fix them and their associated cost [42].

2.4.3 Discovering Patterns of Change and Refactorings

Change propagation is a central aspect of software development. When developers modify software entities, such as functions or variables, to introduce new features or fix bugs, they must ensure that other entities in the software system are updated to be consistent with these new changes [15]. Some research works published in the literature are described below.

In 2004, Hassan et al. [15] proposed various heuristics to predict change propagation. They presented a framework to measure the performance of these heuristics and validated the results empirically using data obtained by analyzing the development history of five large open source software systems.

In 2004, Ying et al. [16] showed how useful change pattern mining is and introduced a set of criteria for evaluating the usefulness of change pattern recommendations. Their approach consisted of three stages: first, data are extracted from a software configuration management (SCM) system; second, the data are preprocessed to be suitable as input for a data mining algorithm; finally, a mining algorithm (based on association rules) is applied to construct change patterns and recommend relevant source files. Although the precision and recall found after the process were not high, the recommendations revealed valuable dependencies that may not be apparent from other existing analyses.
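The three stages described above can be sketched in Python as follows. As a simplification, the sketch replaces the association rule mining algorithm with plain counting of file pairs that change together, which is only a stand-in for the technique used in [16]; the thresholds `min_support` and `max_files`, the function names and the sample transactions are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def mine_co_changes(transactions, min_support=2, max_files=50):
    """Count how often two files are changed in the same transaction.

    transactions: lists of file names, one list per commit (stage 1 output).
    Very large transactions are skipped as a preprocessing step (stage 2),
    since they tend to be merges or mass edits rather than related changes.
    Pairs seen fewer than min_support times are discarded (stage 3).
    """
    pair_counts = Counter()
    for files in transactions:
        unique = sorted(set(files))
        if len(unique) > max_files:
            continue
        pair_counts.update(combinations(unique, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def recommend(changed_file, patterns):
    """Rank the files that frequently changed together with changed_file."""
    scores = Counter()
    for (first, second), count in patterns.items():
        if first == changed_file:
            scores[second] += count
        elif second == changed_file:
            scores[first] += count
    return [name for name, _ in scores.most_common()]


# Hypothetical transactions extracted from a version control system.
history = [
    ["Parser.java", "Lexer.java"],
    ["Parser.java", "Lexer.java", "README"],
    ["Lexer.java", "Parser.java"],
    ["Parser.java", "Ast.java"],
    ["Ast.java", "Parser.java"],
]
patterns = mine_co_changes(history)
suggestions = recommend("Parser.java", patterns)
```

A developer who edits `Parser.java` would then be pointed to `Lexer.java` and `Ast.java`, even if no static dependency between those files is visible in the code.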

2.4.4 Measuring the Contribution of Individuals

Measuring and monitoring the contribution of developers is important for managers because human resources are the source of most of the costs. In this sense, some models described in the literature have proposed solutions for visualizing the progress and planning of teams; these solutions revealed weaknesses in released products, identified bottlenecks within working groups and supported monitoring of the contribution of developers.

In 2006, Amor et al. [18] presented a model for measuring the contribution of developers in free software projects. This model considered human effort as the sum of two individual costs: internal costs and external costs. The internal costs considered developers assigned to companies participating in the project, while external costs referred to the effort of third parties, usually volunteers.

In 2006, Amor et al. [19] characterized the coding activity of developers on a software project. They used a methodology for classifying the interactions of developers with versioning systems. The methodology is based on the analysis of the textual descriptions attached to each transaction. They presented the results of applying this methodology on the FreeBSD CVS repository.

In 2007, Anvik et al. [20] presented an empirical evaluation of two approaches to determine implementation expertise, based on the data contained in source and bug repositories. They used the “Line 10 Rule” heuristic, which is used in programs such as Expertise Recommender and the Expertise Browser. This heuristic was used for creating bug reports, and expertise sets were created and compared to those provided by experts. This comparison was evaluated using the measures of precision and recall, and the results showed that both approaches are good at finding all the appropriate developers, although they may vary in how many false positives are returned.

In 2009, Gousios et al. [22] presented a model for measuring the developer’s contribution in agile and distributed projects. This model combines traditional contribution metrics with data mined from software repositories; with these data they created clusters of similar projects to extract weights that are then applied to the actions that a developer performs on project tasks, in order to extract a combined measurement of the developer’s contribution. The model has been developed as a plug-in to the Alitheia Core software evaluation tool.

In 2010, Niño et al. [44] presented a model to measure the contribution of developers in open source projects. They defined a contribution function associated with each developer; the function is composed of two factors: the first is associated with the size of the contributions and the second with their quality. The values are defined on a set of software size and quality metrics. The size metrics include the number of added, modified or deleted artifacts, and the quality metrics include the cyclomatic complexity of these artifacts. They implemented a tool named DevMeter that automatically calculates the values of these metrics and the value of the contribution function.
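As an illustration only, a two-factor contribution function of this kind could be sketched as follows. The way the two factors are computed and combined here (a product, with quality decreasing as mean cyclomatic complexity grows) is an assumption made for the example; it is not the function defined in [44] or implemented in DevMeter.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Hypothetical summary of one developer's work in a period."""
    added: int              # artifacts added
    modified: int           # artifacts modified
    deleted: int            # artifacts deleted
    mean_complexity: float  # mean cyclomatic complexity of the touched artifacts


def size_factor(activity):
    """Size: total number of artifacts touched."""
    return activity.added + activity.modified + activity.deleted


def quality_factor(activity):
    """Quality: assumed to shrink as mean complexity grows (illustrative choice)."""
    return 1.0 / (1.0 + activity.mean_complexity)


def contribution(activity):
    """Assumed combination: size weighted by quality."""
    return size_factor(activity) * quality_factor(activity)


example = Activity(added=10, modified=5, deleted=5, mean_complexity=3.0)
score = contribution(example)
```

Any real model would need calibrated weights and a justified quality measure; the sketch only shows how size and quality metrics can feed a single per-developer value.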

Industry and academia have paid special attention to measuring the contribution of developers. Human resources are the source of most of the costs, and a good team is essential for effective performance [45]. Unfortunately, only a third of all software companies use techniques to measure their products and development projects [46]; the reason is related to the difficulty of measuring non-tangible aspects such as the contribution of developers. The results obtained from automating productivity measurement will reduce the costs associated with the software production process; they will show weaknesses in released products and can be used to identify bottlenecks within working groups.

2.4.5 Modeling Social and Development Processes

The theory of complex networks is based on representing complex systems as graphs. In software engineering this approach has been successfully used for modeling social processes.

In 2004, Lopez et al. [47] represented a network in terms of actors: each vertex is associated with a particular person, and two vertices are linked when they belong to the same group of people. They also represented the network in terms of groups: each vertex is associated with a group, and two groups are linked by an edge when at least one person belongs to both at the same time. They used this approach to characterize open source software projects, their evolution over time and their internal structure.
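The two representations can be sketched in a few lines. The group names and members below are hypothetical, and the functions are only an illustration of the construction described above, not the tooling used in [47].

```python
from itertools import combinations


def actor_network(memberships):
    """Actor view: link two people when they share at least one group."""
    edges = set()
    for members in memberships.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


def group_network(memberships):
    """Group view: link two groups when at least one person belongs to both."""
    edges = set()
    for g1, g2 in combinations(sorted(memberships), 2):
        if set(memberships[g1]) & set(memberships[g2]):
            edges.add((g1, g2))
    return edges


# Hypothetical data: group name -> people who committed to it.
memberships = {
    "core": {"ana", "luis"},
    "docs": {"luis", "marta"},
    "i18n": {"pablo"},
}
actors = actor_network(memberships)
groups = group_network(memberships)
```

Here `luis` connects the `core` and `docs` groups, while `i18n` stays isolated in both views.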

In 2004, Crowston et al. [24] examined 120 project teams from SourceForge, representing a wide range of FLOSS project types, for their communications centralization as revealed in the interactions in the bug tracking system. They found that FLOSS development teams vary widely in their communications centralization, from projects completely centered on one developer to projects that are highly decentralized and exhibit a distributed pattern of conversation between developers and active users.

In 2005, Ohira et al. [48] developed a tool named Graphmania to collect data on projects and developers at SourceForge, and to visualize the relationships among them using collaborative filtering and social network techniques. They performed a case study applying Graphmania to F/OSS¹⁵ project data collected from SourceForge; they found a common practice in which similar projects have similar names, and they showed the benefits of knowing a developer’s neighborhood.

In 2005, Huang et al. [49] used a representative model named Legitimate Peripheral Participation (LPP) to describe the interactions in the open source development process. They divided developers and modules into groups according to the revision histories of the open source software repository, and they divided the modules into kernel and non-kernel types. Their results showed some process of relative importance on the graph constructed from the project development. The graph revealed certain subtle relationships in the interactions between core and non-core team developers, and in the interfaces between kernel and non-kernel modules.

¹⁵ Acronym for free and open-source software.

In 2007, Yu et al. [25] applied clustering techniques to data gathered from CVS repositories, in order to determine whether a developer belongs to the core or not. They discovered that for small (14 members) or medium (55 members) teams, the core was formed by 5 and 11 developers, respectively. It was remarked that there is no strong relation between core and non-core developers.

In 2010, Yuan et al. [50] mined 7779 projects hosted on SourceForge to understand the structure of OSS in terms of the relationships between the roles involved in the development process; the goal was to analyze the role structure around those projects in order to have a quantitative way to measure them. The authors discovered that the number of roles involved in a project is related to the rank of the project inside the community; they therefore argued that analyzing the role structure of an OSS project leads to an evaluation of the project. They also found that specific roles are better suited to interact with certain other roles, and that projects with more established roles tend to be better ranked inside the SourceForge community.

In the software engineering field, people are responsible for the success or failure of a project; for this reason, analyzing the invisible relationships between team members should be a priority. Code contribution, roles, communication flows and code ownership are a few examples of the kinds of invisible relationships or information that can be extracted using MSR techniques. Improving, implementing and adopting techniques that allow team leaders to calculate this kind of metric could provide valuable data for organizing, measuring, sponsoring or encouraging team members.




The previous section showed that MSR is a multi-purpose research field related to the software evolution area, but there are new challenges and opportunities that researchers have started to work on. Some of them focus on making the mining process easier, exploring new data sources or simplifying access to MSR techniques. These challenges and opportunities are presented in the following paragraphs.

2.5.1 Thinking in Distributed Version Control Systems

All the works mentioned in Section 2.4, regardless of the problem they addressed, were based on data extracted from centralized version control systems (CVCSs) such as SVN or CVS, but we think that new opportunities are emerging with the growth of Distributed Version Control Systems (DVCSs). During the last 10 years several DVCSs, such as Git, Bazaar and Mercurial, have been developed, and many of them are still evolving rapidly and gaining adopters. This is highlighted by the adoption of this technology by large and important open source projects, such as the Mozilla Project, the Linux Kernel Project, MySQL, Perl and the Eclipse Foundation, and is evidence of the current importance of this new flavor of version control systems.

From a technical point of view, the most important feature of DVCSs is that they do not have a fixed client-server architecture, allowing individual developers to act as servers or clients depending on the circumstances, as in peer-to-peer models. In this way, developers can work on source code without being connected to a central repository, and therefore most operations are performed faster since no network is involved. From a conceptual point of view, DVCSs work in terms of changes instead of versions, forcing developers to think about the changes themselves as a first-class concept within their version control operations. The major argument behind this crucial variation is that when someone manages changes (patches) instead of versions, merging works better, and therefore developers can branch any time they need, because going back is easier, as mentioned in [51] and in the technical talk¹⁶ given in 2007 by Linus Torvalds, the creator and leader of the Git project. In this regard, David Miller, a senior Linux developer from the Red Hat project, said two years after the migration of the Linux Kernel to BitKeeper (another DVCS):

[Before] BitKeeper I was in merge hell every time a new kernel was released. Now I do real work instead of wasting time on repeated merging.

Regarding team structure, DVCSs allow large teams, such as those found in some OSS projects, to easily create or implement (or switch between) different workflow models. Thus, besides the classic central model, the “dictator and lieutenant model” and the “integration model manager model” are two examples of the possibilities brought by the use of DVCSs [27]. These models are shown in Figure 2.2.

[Figure 2.2: three workflow diagrams. Central model: developers and a shared repository. Dictator and lieutenant model: developers, lieutenants, a dictator and a blessed repository. Integration model manager model: private and public developer repositories, an integrator and a blessed repository.]

Figure 2.2 Development workflow models. Adapted from [27].

Recently, many open and closed source projects have proposed migrating, or have already migrated, their repositories from a CVCS to a DVCS [52]. Researchers are starting to investigate the rationale behind these migration decisions and to determine the consequences for the organization and activities of the development team. Moreover, they think that this new generation of version control systems comes with the promise of new data, which will lead to new research questions about how DVCSs affect the processes, products and people around software projects.

¹⁶ Tech talk: Linus Torvalds on Git: http://www.youtube.com/watch?v=4XpnKHJAok8

The promises and perils of DVCSs are also discussed in [52], this time from a research point of view. The authors observe that researchers could recover more history from repositories, since any developer can create her own repository or branch. Those local workspaces contain all the information about the merges, branches and commits recorded in the DAGs¹⁷ of multiple repositories. It is also possible to recover information about the original committer or reviewer of a change from the information stored in each patch, in contrast to centralized repositories, where only authorized committers (those with write permission) appear as code contributors to a project.
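A minimal sketch of the kind of information that becomes available is given below. It assumes that commit records have already been exported from a DVCS as dictionaries with parent identifiers, an author and a committer (Git, for example, stores an author and a committer for every commit); the identifiers and names are hypothetical.

```python
# Hypothetical commit records forming a small DAG.
commits = [
    {"id": "a1", "parents": [], "author": "ana", "committer": "ana"},
    {"id": "b2", "parents": ["a1"], "author": "luis", "committer": "ana"},
    {"id": "c3", "parents": ["a1"], "author": "marta", "committer": "marta"},
    {"id": "d4", "parents": ["b2", "c3"], "author": "ana", "committer": "ana"},
]


def merge_commits(commits):
    """A commit with more than one parent joins two lines of development."""
    return [c["id"] for c in commits if len(c["parents"]) > 1]


def branch_points(commits):
    """A commit that is the parent of several commits is where history forks."""
    children = {}
    for c in commits:
        for parent in c["parents"]:
            children.setdefault(parent, []).append(c["id"])
    return sorted(p for p, kids in children.items() if len(kids) > 1)


def hidden_contributors(commits):
    """Authors who never commit themselves; a committer-only view would miss them."""
    authors = {c["author"] for c in commits}
    committers = {c["committer"] for c in commits}
    return authors - committers


merges = merge_commits(commits)
forks = branch_points(commits)
hidden = hidden_contributors(commits)
```

In this toy history, `luis` wrote a change that `ana` committed, so an analysis based only on committers would not count him as a contributor.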

Technically, this new way of managing historical data brings two benefits to MSR research: first, distributed repositories are smaller than centralized ones but contain more information about contributions and workflows; second, data extraction is faster, since a connection to a remote repository is not required. It is also important to mention that this kind of version control system has been little explored to date, which makes it an attractive research area.

2.5.2 Integrating and Redesigning Repositories

Some MSR studies, for example those intended to support maintenance tasks, need to link different data sources. Section 2.2 mentioned that the data generated during the evolution of software products can be stored in several kinds of repositories; this may cause a loss of traceability between related artifacts. For instance, a bug report stored in a bug tracker and its corresponding code fix stored in a version control system have an evident relationship that most of these systems cannot handle. Researchers have addressed this problem by developing mechanisms for linking data between repositories; most of them have focused their efforts on relating bug reports and source code patches.

¹⁷ DAG: acronym for Directed Acyclic Graph; it refers to the relationship between different patches.

In 2009, Bachmann and Bernstein [53] presented an automatic method for linking bug reports stored in Bugzilla with source code patches stored in SVN, using XML parsers and regular expression analyzers to search for bug report identifiers in the CVS commit log; they mentioned several problems they had to tackle in the extraction, processing and validation steps. As a result, they found that their proposal had a higher precision than other authors’ techniques. We think that one of the problems of this kind of study is its dependency on the “good practices” of the development teams, because the content of commit logs can change from one team to another; so, if two teams describe commits in different ways, the linking technique may get better or worse results [53].
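The core idea of searching commit messages for bug identifiers can be sketched as follows. This is not the method of [53]; the regular expression, the revision names and the set of known bug identifiers are assumptions chosen for the example, and they also show why the result depends on how each team writes its commit messages.

```python
import re

# Assumed convention: a keyword followed by an optional '#' and a number.
BUG_ID = re.compile(r"(?:bug|issue|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE)


def link_commits_to_bugs(commits, known_bug_ids):
    """Return (revision, bug id) pairs for identifiers that exist in the tracker."""
    links = []
    for revision, message in commits:
        for match in BUG_ID.finditer(message):
            bug_id = int(match.group(1))
            # Validation step: ignore numbers that are not real bug reports.
            if bug_id in known_bug_ids:
                links.append((revision, bug_id))
    return links


# Hypothetical commit log entries and bug tracker contents.
commits = [
    ("r101", "Fix bug #4521: null pointer in parser"),
    ("r102", "Refactor lexer, no functional change"),
    ("r103", "fixes 77 and cleans up imports"),
    ("r104", "Bug 9999 workaround"),
]
known = {4521, 77}
links = link_commits_to_bugs(commits, known)
```

A team that writes “see ticket 4521” instead of “bug #4521” would not be matched by this pattern, which is the dependency on team practices noted above.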

In a similar way to Herraiz, Robles and Gonzalez-Barahona in [29], we think that MSR researchers could approach the linking problem by adapting or extending current open source repository systems (or developing new ones) to support automatic, or at least assisted, linking between data stored in different repositories, e.g., bug reports, source code artifacts, IRC logs, documentation, etc.

2.5.3 Simplifying MSR Techniques

Table 2.3 shows that tools have been developed for solving specific MSR tasks; some of them help with measuring contribution, visualizing historical data or measuring software quality. For those tools and four additional ones cited by Hassan in [7], we determined the date of the last source code activity and the availability of source code and of binaries or runnable scripts; we also noted whether the tools are web-based or IDE plug-ins. These data are shown in Table 2.4.

We found that most of the projects are not under active development: only 5 of 13 have been updated in the last year, and 2 do not have a web page. Also, the tools are commonly offered as source code or as binaries or scripts; this can be considered a limitation on their use, because only experienced users are likely to be able to use them freely. In contrast, we found that DevMeter and MyLyn (both under active development) are implemented as a web tool and an IDE plug-in, respectively; this could facilitate the adoption of MSR techniques by non-researcher users.

Table 2.4 Current state of MSR tools.

Tool                      Last Code Update  Source Code  Binary or Scripts  Web Tool  IDE Integration  Web Page
Alitheia Core             2011              yes          no                 yes       no               yes
CVSAnaly                  2010              yes          yes                no        no               yes
CVSScan                   2005              no           yes                no        no               yes
DevMeter                  2010              yes          yes                yes       no               yes
DrJones                   no info           no info      no info            no info   no info          no info
eRose                     2005              no           no                 no        Eclipse          yes
GlueTheos                 2007              yes          no                 no        no               yes
HATARI                    no info           no           no                 no        Eclipse          yes
Hipikat                   2005              no           no                 no        Eclipse          yes
MailingListStats          2007              yes          no                 no        no               yes
Mylar (MyLyn)             2011              yes          yes                no        Eclipse          yes
Programeter (Commercial)  no info           no           no                 yes       no               yes
SLOCCount                 2006              yes          yes                no        no               yes
SoftChange                2003              yes          yes                no        no               yes
XIA                       no info           no info      no info            no info   no info          no info

As Hassan mentions in [7], researchers must think in terms of users, taking into account the usability and accessibility of their tools, because MSR techniques must help team members solve real-life problems (fixing bugs, understanding systems, measuring quality, etc.), not only research ones. Therefore, new efforts in this field should aim to ease access to MSR techniques by integrating them into IDEs or offering them as web tools, so that developers, team leaders and users can reach them quickly and easily.

2.6 SUMMARY

This chapter presented a short review of the main concepts of mining software repositories (MSR); this field was born to support evolution and maintenance tasks by mining the data stored in the software repositories of development projects. The most common repositories can be categorized as historical repositories, communication logs, source code repositories or other repositories (this categorization depends on the data they store).



Some of the problems related to mining large software repositories lie in the preprocessing, extraction and analysis tasks, because they involve processing the large amounts of data generated during the software life cycle; some researchers have tackled this problem by developing tools or designing techniques that facilitate the mining process. Examples of these techniques are metadata analysis, static source code analysis, and social network analysis.

Also, three main research trends and opportunities were presented. First, the popularization and adoption of distributed version control systems by large open source projects highlight the emerging importance of this technology, which promises new data and technical benefits for developers and MSR researchers. Second, the limitations of current repository systems need to be addressed by developing extensions or creating new systems that support the mining process naturally. The last trend stresses the importance of simplifying access to MSR techniques for researchers and non-researchers; this could be done by implementing tools as web-based tools and as IDE plug-ins, bringing the benefits of this kind of study to new users.

Finally, MSR was presented as an active field of research that can bring multiple benefits to software evolution researchers and to new or active software developers.



REFERENCES

[1] Lehman, M. Programs, Life Cycles, and Laws of Software Evolution. In: Proceedings of the IEEE: 1980.

[2] Lehman, M. On Understanding Laws, Evolution and Conservation in the Large Program Life Cycle. Journal of Systems and Software, 1: 1980; 213–221.

[3] Lehman, M., Ramil, J. F., Wernick, P. D., Perry, D. E. & Turski, W. M. Metrics and Laws of Software Evolution - The Nineties View. In: Proceedings 4th International Software Metrics Symposium (METRICS ’97): 1997.

[4] Ramil, J. F. & Lehman, M. Metrics of Software Evolution as Effort Predictors - A Case Study. In: Proc. Int Software Maintenance Conf: 2000, 163–172. doi:10.1109/ICSM.2000.883036.

[5] Lehman, M. & Ramil, J. F. An Approach to a Theory of Software Evolution. In: Proceedings of International Workshop on Principles of Software Evolution (IWPSE’01), Vienna, Austria: 2001.

[6] Kagdi, H., Collard, M. L. & Maletic, J. I. A Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution. J. Softw. Maint. Evol., 19(2): 2007; 77–131. ISSN 1532-060X. doi:http://dx.doi.org/10.1002/smr.344.

[7] Hassan, A. E. The road ahead for Mining Software Repositories. In: Proc. FoSM 2008. Frontiers of Software Maintenance: 2008, 48–57. doi:10.1109/FOSM.2008.4659248.

[8] Sayyad, J. & Lethbridge, C. Supporting Software Maintenance by Mining Software Update Records. In: ICSM ’01: Proceedings of the IEEE International Conference on Software Maintenance (ICSM’01). IEEE Computer Society, Washington, DC, USA: 2001. ISBN 0-7695-1189-9, 22.



[9] Canfora, G., Cerulo, L. & Penta, M. D. Identifying Changed Source Code Lines from Version Repositories. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[10] Hassan, A. E. Mining Software Repositories to Assist Developers and Support Managers. PhD thesis, University of Waterloo, Ontario, Canada: 2004.

[11] Knab, P., Pinzger, M. & Bernstein, A. Predicting defect densities in source code files with decision tree learners. In: Proceedings of the 2006 international workshop on Mining software repositories: 2006.

[12] Morisaki, S., Monden, A., Matsumura, T., Tamada, H. & Matsumoto, K. Defect Data Analysis Based on Extended Association Rule Mining. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[13] Panjer, L. Predicting Eclipse Bug Lifetimes. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[14] Lamkanfi, A., Demeyer, S., Giger, E. & Goethals, B. Predicting the severity of a reported bug. In: Proc. 7th IEEE Working Conf. Mining Software Repositories (MSR): 2010, 1–10. doi:10.1109/MSR.2010.5463284.

[15] Hassan, A. E. & Holt, R. C. Predicting Change Propagation in Software Systems. In: ICSM ’04: Proceedings of the 20th IEEE International Conference on Software Maintenance. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2213-0, 284–293.

[16] Ying, A. T. T., Murphy, G. C., Ng, R. & Chu-Carroll, M. C. Predicting Source Code Changes by Mining Change History. IEEE Trans. Softw. Eng., 30(9): 2004; 574–586. ISSN 0098-5589. doi:http://dx.doi.org/10.1109/TSE.2004.52.



[17] Krinke, J., Gold, N., Yue, J. & Binkley, D. Cloning and Copying Between GNOME Projects. In: Proc. 7th IEEE Working Conf. Mining Software Repositories (MSR): 2010, 98–101. doi:10.1109/MSR.2010.5463290.

[18] Amor, J., Robles, G. & González, J. Effort Estimation by Characterizing Developer Activity. In: 8th Workshop on Software Engineering Economics, Shanghai, China: 2006.

[19] Amor, J., Robles, G., González, J. & Navarro, A. Discriminating Development Activities in Versioning Systems: A Case Study. In: Proceedings PROMISE 2006: 2nd International Workshop on Predictor Models in Software Engineering: 2006.

[20] Anvik, J. & Murphy, G. Determining Implementation Expertise from Bug Reports. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[21] Linstead, E., Rigor, P., Bajracharya, S., Lopes, C. & Baldi, P. Mining Eclipse Developer Contributions via Author-Topic Models. In: Proc. Fourth Int. Workshop Mining Software Repositories ICSE Workshops MSR ’07: 2007. doi:10.1109/MSR.2007.20.

[22] Kalliamvakou, E., Gousios, G., Spinellis, D. & Pouloudi, N. Measuring Developer Contribution from Software Repository Data. In: 4th Mediterranean Conference on Information Systems: 2009, 129–132.

[23] Koch, S. & Schneider, G. Effort, Cooperation and Coordination in an Open Source Software Project: GNOME. Information Systems Journal, 12: 2002; 27–42.

[24] Crowston, K. & Howison, J. The social structure of Free and Open Source software development. First Monday, 10(2): 2005; 1–21.

[25] Yu, L. & Ramaswamy, S. Mining CVS Repositories to Understand Open-Source Project Developer Roles. In: Proc. Fourth Int. Workshop Mining Software Repositories ICSE Workshops MSR ’07: 2007, 8. doi:10.1109/MSR.2007.19.

[26] Shihab, E., Jiang, Z. M. & Hassan, A. E. On the use of Internet Relay Chat (IRC) meetings by developers of the GNOME GTK+ project. In: Proc. 6th IEEE Int. Working Conf. Mining Software Repositories MSR ’09: 2009, 107–110. doi:10.1109/MSR.2009.5069488.

[27] Chacon, S. Pro Git. Apress, Berkeley, CA, USA: 2009. ISBN 1430218339, 9781430218333.

[28] Fogel, K. Producing Open Source Software. O’Reilly: 2005.

[29] Herraiz, I., Robles, G. & González, J. M. Research friendly software repositories. In: IWPSE-Evol ’09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops. ACM, New York, NY, USA: 2009. ISBN 978-1-60558-678-6, 19–24. doi:http://doi.acm.org/10.1145/1595808.1595814.

[30] Kagdi, H. Mining Software Repositories to Support Software Evolution. PhD thesis, Kent State University: 2008.

[31] Gousios, G. & Spinellis, D. Alitheia core: An extensible Software Quality Monitoring Platform. In: Proceedings of the 31st International Conference of Software Engineering Research Demos Track: 2009.

[32] Georgios, G. & Diomidis, S. A Platform for Software Engineering Research. In: MSR ’09: Proceedings of the 6th Working Conference on Mining Software Repositories: 2009.

[33] Robles, G. Empirical Software Engineering Research on Libre Software: Data Sources, Methodologies and Results. PhD thesis, Escuela Superior de Ciencias Experimentales y Tecnologia, Universidad Rey Juan Carlos: 2005.

[34] Liu, Y., Stroulia, E. & Erdogmus, H. Understanding the Open-Source Software Development Process: A Case Study with CVSChecker. In: Proceedings of the First International Conference on Open Source Systems: 2005, 11–15.

[35] Voinea, L., Telea, A. & van Wijk, J. J. CVSscan: Visualization of Code Evolution. In: Proceedings of the 2005 ACM symposium on Software visualization. ACM, St. Louis, Missouri: 2005. ISBN 1-59593-073-6, 47–56. doi:10.1145/1056018.1056025.

[36] Robles, G., González, J. M. & Rishab, G. GlueTheos: Automating the Retrieval and Analysis of Data from Publicly Available Software Repositories. In: Proceedings 1st International Workshop on Mining Software Repositories: 2004, 20–32.

[37] Herraiz, I., Robles, G., Amor, J., Teofilo, R. & González, J. M. The Processes of Joining in Global Distributed Software Projects. In: Proceedings of the 2006 international workshop on Global software development for the practitioner. ACM, Shanghai, China: 2006. ISBN 1-59593-404-9, 27–33. doi:10.1145/1138506.1138513.

[38] German, D. Mining CVS Repositories, the SoftChange Experience. IEE Seminar Digests, 2004(917): 2004; 17–21. doi:10.1049/ic:20040469.

[39] Amor, J., Robles, G., González, J. M. & Herraiz, I. From Pigs to Stripes: A travel Through Debian. In: DebConf5 (Debian Annual Developers Meeting), Helsinki, Finland: 2005.

[40] Wu, X. Visualization of Version Control Information. Final degree project, University of Victoria: 2003.

[41] Hassan, A. E. & Holt, R. C. Using Development History Sticky Notes to Understand Software Architecture. In: IWPC ’04: Proceedings of the 12th IEEE International Workshop on Program Comprehension. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2149-5, 183.

[42] Zimmermann, T., Nagappan, N. & Zeller, A. Software Evolution, chapter Predicting Bugs from History. Springer: 2008, 69–88.



[43] Guo, P., Zimmermann, T., Nagappan, N. & Murphy, B. Characterizing and Predicting Which Bugs Get Fixed: An Empirical Study of Microsoft Windows. In: Proceedings of the 32nd International Conference on Software Engineering (ICSE 2010): 2010.

[44] Nino, Y. & Aponte, J. DevMeter: Una Herramienta que Mide la Contribucion de los Desarrolladores: 2010.

[45] Winter, M. Developing a Group Model for Student Software Engineering Teams. Final degree project, University of Saskatchewan: 2004.

[46] Ebert C., B. M. S. A., Dumke R & R., D. Best Practices in Software Measurement. Springer, 1st edition: 2004. ISBN 3540208674.

[47] López, L., Robles, G. & González, J. Applying Social Network Analysis to the Information in CVS Repositories. IEE Seminar Digests, 2004(917): 2004; 101–105. doi:10.1049/ic:20040485.

[48] Ohira, M., Ohsugi, N., Ohoka, T. & ichi Matsumoto, K. Accelerating Cross-project Knowledge Collaboration Using Collaborative Filtering and Social Networks. SIGSOFT Softw. Eng. Notes, 30: 2005; 1–5. ISSN 0163-5948. doi:http://doi.acm.org/10.1145/1082983.1083163.

[49] Huang, S.-K. & Liu, K. M. Mining version histories to verify the learning process of Legitimate Peripheral Participants. In: Proceedings of the 2005 international workshop on Mining software repositories, MSR ’05. ACM, New York, NY, USA: 2005. ISBN 1-59593-123-6, 1–5. doi:http://doi.acm.org/10.1145/1082983.1083158.

[50] Yuan, L., Wang, H., Yin, G., Shi, D. & Mi, H. Mining Roles of Open Source Software. In: Proc. 2nd Int Software Engineering and Data Mining (SEDM) Conf: 2010, 548–554.

[51] Ruparelia, N. B. The history of version control. SIGSOFT Softw. Eng. Notes, 35(1): 2010; 5–9. ISSN 0163-5948. doi:http://doi.acm.org/10.1145/1668862.1668876.



[52] Bird, C., Rigby, P., Barr, E., Hamilton, D., German, D. & Devanbu, P. The promises and perils of mining git. In: Proc. 6th IEEE Int. Working Conf. Mining Software Repositories MSR ’09: 2009, 1–10. doi:10.1109/MSR.2009.5069475.

[53] Bachmann, A. & Bernstein, A. Data Retrieval, Processing and Linking for Software Process Data Analysis. Technical report, University of Zurich, Department of Informatics: 2009.


Software Visualization to Simplify the Evolution of Software Systems

David Montaño
Leslie Solorzano
Henry Roberto Umaña-Acosta

ABSTRACT

Software Visualization can be used to address the problems associated with the evolution of software systems. It offers a wide set of possibilities, as has been shown since visualization first appeared in the early 1980s. Based on the progress made in the area, this chapter presents the different techniques developed in the past, as well as the most novel solutions. However, despite the efforts made by researchers, the field needs to face new challenges posed by the software industry, such as new programming paradigms.

3.1 INTRODUCTION

The size and complexity of software systems make understanding and maintenance tasks difficult to complete. Besides, these systems are abstract entities that cannot be represented in a way that humans can understand. Software Visualization (SV), through the use of various kinds of imagery techniques, provides strategies to facilitate the understanding and reduce the apparent complexity of a software system [1]. Specifically, the challenge of Software Visualization is to find effective mappings between software aspects and their graphical representations, using visual metaphors [1] that allow the software system to be understood.



Considering this, three main sources of information have been defined: static, dynamic and evolutionary history [2]. These sources have been used in the development of several visualization tools, most of them based on graphs, geometrical figures, colors, and combinations of them. These tools include Tarantula [3], Structure101¹, CodeCrawler [4], SeeSoft [5], X-Ray [6], STAN², CVSscan [7] and EPOSee³, among others. These sources of information have also been used in the development of three-dimensional tools, such as sv3D [8], Vizz3D [9] and CodeCity [10]. Despite the favorable and promising aspects of three-dimensional space, such as less error-prone perception [11] and a more direct mapping between software artifacts and their representation, no system has yet been designed that users feel comfortable with.

3.2 BACKGROUND ON SOFTWARE VISUALIZATION

As stated before, software visualization aims to become a tool that supports the tasks associated with the development of software systems, mainly understanding and maintenance tasks. Given this goal, SV fits naturally in the software evolution research field, whose objective is to study the phenomenon of an evolving system that changes according to external factors, such as the environment, technologies, and business modeling.

This section gives an overview of what has been done in the SV area and of how it helps the software evolution process.

3.2.1 How Software Visualization Supports Software Evolution Tasks

Mens et al. proposed in [12] a set of challenges associated with software evolution, and SV systems can contribute to at least seven of them. First, they help to preserve and improve software quality, since SV tools can support, for example, understanding processes and error-fixing tasks. Second, they support model evolution by visualizing different software artifacts, allowing developers to embrace the underlying model in a more practical way, and by providing different views (such as filtering mechanisms) that promote better modeling of the system. Third, they support multi-language systems by using metaphors that are independent of the formats of the information sources. Fourth, they increase managerial awareness by exposing views that managers and stakeholders can easily understand. Fifth, they integrate data from various sources and in this way provide a more complete view of the system. Sixth, they help analyze huge amounts of data by using 3D systems, in which more information can be analyzed. Seventh, they assist software evolution training and teaching by making it easier to highlight concepts when students can see them [13].

¹ http://www.headwaysoftware.com/products/structure101/index.php
² http://stan4j.com/
³ http://www.st.uni-trier.de/eposoft/eposee/index.html

3.2.2 The Software Visualization Pipeline

The visualization process is divided into three main steps, namely: information extraction, analysis, and visualization. This process can be used in several software areas, such as data mining, software repository analysis, and reverse engineering. With the basic elements of this process in mind, it can be seen that SV is easily extendable to support other tasks. However, its effectiveness will depend on the metaphor chosen, as will be explained in the next sections. Each step in the visualization pipeline is explained in Table 3.1.

3.2.3 Overview of Visualization Tools

Since the first appearance of SV in algorithm visualization [14], hundreds of tools intended to provide a simple mechanism to understand a software system have been developed. Table 3.2 briefly describes a set of tools developed in the area. It highlights the wide spectrum of these tools: they range from 2D to 3D spaces and from graph representations to animations, and they draw on static, dynamic, and evolutionary sources of information.



Table 3.1 Steps in the visualization pipeline.

| Step | Description |
|---|---|
| Information extraction | This step is in charge of taking the different sources of information (static, dynamic or evolutionary), processing them, and finally, getting them ready for the next step. |
| Analysis | Once the information is loaded, it needs to be processed (e.g., calculate metrics values) in order to prepare a high level view of the raw data that the user can study and understand easily. |
| Visualization | Once the information is analyzed, it needs to be rendered to the user. At this point the visualization tool needs to define a graphical metaphor that correctly represents the underlying information. |
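The three steps of Table 3.1 can be sketched as three small functions chained together. This is only an illustrative sketch: the function names, the choice of non-blank lines of code as the metric, and the text bar used as a stand-in metaphor are assumptions made for the example, not part of any of the tools discussed in this chapter.

```python
def extract(sources):
    """Information extraction: turn raw artifacts into (name, text) pairs.

    `sources` is a hypothetical dict mapping a file name to its content.
    """
    return [(name, text) for name, text in sources.items()]


def analyze(artifacts):
    """Analysis: compute one simple metric per artifact (non-blank lines)."""
    return {
        name: sum(1 for line in text.splitlines() if line.strip())
        for name, text in artifacts
    }


def visualize(metrics, width=20):
    """Visualization: render each metric as a text bar scaled to `width`."""
    largest = max(metrics.values(), default=0) or 1
    rows = []
    for name, value in sorted(metrics.items()):
        bar = "#" * round(width * value / largest)
        rows.append(f"{name:<10} {bar} {value}")
    return rows
```

Calling `visualize(analyze(extract(sources)))` runs the whole pipeline; swapping `visualize` for another renderer changes the metaphor without touching the first two steps.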

3.2.4 Sources of Information Commonly Used

As shown in Table 3.2, the applications developed in the area of SV are based on three main sources of information:

• Static information refers to the data that can be extracted without running the program; therefore, it contains no runtime information. The main source of static information is the source code, but documents, diagrams, requirements, and other artifacts can also be used.

• Dynamic information is the data extracted from the execution of the program, and a visualization that depends on it is called dynamic. It is based on runtime information, such as the content of variables, the conditions executed, the stack size, and so on. This kind of data is difficult to obtain because of the lack of mechanisms to gather information from the program memory.

• Evolutionary information comes from the history of the system. The study of software evolution is a complete research field, and several interesting proposals have been written regarding the visualization of data resulting from the analysis of this source of information. It was proposed in [15] that the basic unit of information needed to visualize software evolution is the Maintenance Request (MR). It refers to the delta of change in a software system (a committed revision in a version control system like SVN or CVS). These deltas are taken together to analyze and visualize the evolutionary information about a software system.
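As a rough illustration of how deltas can be "taken together", the sketch below models an MR as a small record and counts how often each file appears across a history. The field names (`revision`, `author`, `changed_files`) are hypothetical and do not correspond to the data model of SVN, CVS, or [15].

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceRequest:
    """One delta of change: a committed revision and the files it touched."""
    revision: str
    author: str
    changed_files: tuple


def change_frequency(requests):
    """Aggregate MRs into a per-file change count.

    Each file is counted at most once per MR, even if listed twice.
    """
    counts = Counter()
    for mr in requests:
        counts.update(set(mr.changed_files))
    return counts
```

A per-file count like this is the kind of value that an evolutionary view could later map to a color scale or a size.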

Table 3.2 List of applications taken from [13].

| Tool | Description | Visualization techniques | Source of information |
|---|---|---|---|
| sv3D [8] | It uses three dimensional polycylinders as a metaphor | It uses 3D, colors, height and depth to show information | Static |
| Vizz3D [9] | Visualization as cities | It uses color, heights and a real metaphor to show software components and their relations | Static |
| Tarantula [3] | It uses a SeeSoft-like representation to show the results of a set of tests | It uses color and geometrical shapes to show the results | Dynamic |
| Structure [16]* | Visualization of Java dependencies at different levels of abstraction | It uses a two dimensional metaphor, based on the Eclipse platform set of icons and graphs to show relations | Static |
| SeeSoft [5] | Visualization of changes through time | It uses color and geometrical shapes to show the results of the analysis | Evolution |
| X-Ray [6] | It visualizes relationships in Java source code | It uses geometrical shapes, links (lines) and some basic colors to represent dependencies | Static |
| EPOSee [17] | It shows information about the evolution of files in the source code | It uses pixel-maps and supports graph representations to show relations among files in the version control system | Static |
| SHriMP [18] | It shows the dependencies in a program code and other kinds of artifacts like architectural design and documentation | It uses a two dimensional approach to show the analyzed information. It is based on rectangles, color and links among the parts of the visualization | Static |
| X-Tango [19] | It visualizes the execution of a program | It uses an animated version of geometrical shapes and color to show the behavior of the program | Dynamic |

* http://www.headwaysoftware.com/products/structure101/index.php



3.2.5 Differences between Software Visualization and Modeling Languages Like UML

UML [20] was first conceived as a union of the methodologies proposed by Grady Booch, Ivar Jacobson, and James Rumbaugh. They participated in the evolution of what became the first UML specification, which came out in 1997 under the organization of the Object Management Group⁴. UML clearly follows the same principles as SV (i.e., it tries to make the process of maintaining and understanding software easier). The main difference is the way they approach the problem: while software visualization tries to provide a view of an already written software system, the UML specification aims to provide a view before the actual code is written. UML is a modeling language and also a visualization tool, although it does not provide a view of a system in terms of its metric values. Despite the effort invested in the modeling stage of any methodology, there will always be the need to determine the current state of a system. This current state and its metrics can be determined by a visualization tool that analyzes, for example, the source code.

3.3 SV TECHNIQUES

Based on the three sources of information, researchers have developed a wide spectrum of techniques. They try to use as many visual techniques as possible, in order to enrich the metaphor with enough data about the analyzed system. In the following sections, the definition of metaphor as well as the techniques developed in the past are explored. The techniques are divided into different sections depending on the mechanism used to represent the software system.

3.3.1 Metaphors

One of the fundamental concepts behind any kind of visualization is the metaphor. It was defined by Lakoff and Johnson [21] as “a rhetorical figure whose essence is the understanding and experiencing one kind of thing in terms of another”. In software visualization, metaphors are the most important concern because the information that is going to be visualized does not have a natural visual representation.

⁴ After a negotiation for the “UML” name with Rational.

Since the beginning of SV, metaphors have been developed using different techniques, such as bar charts, pie charts, cylinders, pixel-maps, buildings within cities, and even galaxies in the universe.

When building metaphors, designers must consider a set of basic aspects, as defined by Graçanin et al. [15], before the metaphors can be included in a visualization system. These aspects include:

1. Scope of representation: Software systems usually consist of thousands of lines of code, and the visualization tool has to render information related to them. This vast amount of information often confuses the final user. Any metaphor should allow users to limit the scope of the information that is being visualized, so that they can decide what information is relevant.

2. Medium of representation: The medium is the kind of space used for the visualization, for example 2D or 3D. It plays an important role when building a software visualization system, as the choice usually depends on the kind of information that is being visualized.

3. Visual metaphor: This aspect refers to the visual elements that the metaphor uses to display information to the user. These include geometric shapes, such as lines, dots, circles, squares, and polygons, or real-world entities, such as buildings, trees, planets, and so on. These elements may have a color (in a color scale) to represent another aspect of a software artifact. Considering these elements and how they are used, two important aspects of a metaphor are:

• Consistency of the metaphor: It refers to the correct use of the metaphor. This means that there must be a clear mapping between software entities and entities in the visualization, to avoid misunderstandings caused by representing different software entities or properties with the same graphical element.

• Semantic richness of the metaphor and complexity: The metaphor should be rich enough to provide as many representations as there are different aspects of the software being visualized.

4. Abstractedness (Ellison): The user of the visualization system should be able to choose the level of detail at which the software system is evaluated. In this way the user may choose from direct representation, structural representation, synthesized representation, and analytical representation.

5. Ease of navigation and interaction: Since the visualization system normally provides a large amount of information, it should let the user know what information is being visualized, what part of the system is being visualized, and what level of abstraction has been selected, and it should allow the user to navigate in an understandable way according to some usability criteria. This is an important aspect in 3D visualizations, where the user can easily get lost.

6. Level of automation: Software visualization systems need to be automated (i.e., extract, analyze, and render all the information from a software system with minimal interaction with the user).

7. Effectiveness [22]: It indicates the efficacy of the metaphor as a medium for representing the information. The metaphor should be able to convey the analyzed information from the software system. For example, it could be important to know if the system is able to show ordinal values as well as cardinal ones.

8. Expressiveness [22]: It refers to the capability of the metaphor to visually represent all the analyzed data. As in the case of semantic richness, the metaphor must provide a considerable number of visual properties so that the parameters obtained by the analysis can be represented in the view.
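Two of the aspects above, consistency and expressiveness, lend themselves to simple mechanical checks. The following sketch assumes a metaphor is described by a plain dictionary from software entity kinds to visual elements, and by a list of available visual properties; both representations and the function names are hypothetical simplifications chosen for the example.

```python
def inconsistent_elements(mapping):
    """Consistency check: visual elements used for more than one entity kind.

    `mapping` maps a software entity kind to the visual element that
    represents it. An empty result means no element is shared.
    """
    by_element = {}
    for entity, element in mapping.items():
        by_element.setdefault(element, []).append(entity)
    return {
        element: sorted(entities)
        for element, entities in by_element.items()
        if len(entities) > 1
    }


def unrepresented_metrics(metrics, visual_properties):
    """Expressiveness check: metrics left over once every property is used."""
    return list(metrics[len(visual_properties):])
```

For instance, a mapping that draws both classes and methods as buildings would be reported by the first function, and a view with only height and width available could not show a third metric.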

3.3.2 2D Approaches

As mentioned in Section 3.2.4, one of the main sources of information is the source code. To address the challenges posed by the analysis of this source, techniques have been created that make use of a bidimensional space, mainly diagrams, such as:

• Control-flow diagrams: In 1947, Goldstine and Von Neumann [23] introduced one of the most famous ways to visualize the flow of a program, using geometrical figures to represent actions or events within the application. These actions are represented by rectangles when the flow of the program refers to events, activities, processes, functions, and other general statements, and by diamonds when the flow reaches a point where a decision has to be made. This is a simple yet powerful way to explain and understand basic algorithms, for example sorting algorithms, but it gradually degrades as the program gets bigger. For this reason, researchers have since developed tools to automatically generate diagrams with the proper layout and configuration.

• Structograms: Nassi and Shneiderman proposed a new way to diagram programs based on rectangles (see Figure 3.1). Since this notation has no representation of the GOTO statement, the programmer is forced to write programs without it (when this representation was proposed, Object-Oriented Programming was not as popular as it is today), thus making them more structured and easier to maintain and understand.

Figure 3.1 Structograms (sequence, loop and conditional) [2].

• Jackson diagrams: They represent a program as a tree hierarchy, providing a format to depict the structure of the source code. Figure 3.2 shows the diagrams for three control structures.

Figure 3.2 Jackson diagrams representation (sequence, loop and conditional) [2].

• Control structure diagrams: These diagrams put the control-flow charts and the source code together. They embed the diagram into the source code by showing, at the side of each statement, the figure that represents it. For example, if a line contains a conditional, it is marked with a diamond on the left side; similarly, for loops a vertical bar spanning the whole block is placed on the left side, as shown in Figure 3.3.

Figure 3.3 Basic control structure diagrams (sequence, loop and conditional) [2].
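The idea behind control structure diagrams, a marker placed in the margin next to each statement, can be approximated in a few lines. The sketch below is a deliberately simplified assumption: it works on pseudo-code that uses the keywords visible in Figure 3.3 (`if ... then`, `for(...) loop`, `end loop`), and it uses `<>` for the diamond and `|` for the vertical bar. A conditional inside a loop shows only the diamond.

```python
def annotate(lines):
    """Prefix each pseudo-code line with a two-character margin marker."""
    annotated = []
    loop_depth = 0
    for line in lines:
        stripped = line.strip()
        opens_loop = stripped.startswith("for")
        if stripped.startswith("if "):
            marker = "<>"  # diamond: a decision point
        elif loop_depth > 0 or opens_loop:
            marker = "| "  # vertical bar: the line belongs to a loop block
        else:
            marker = "  "  # plain statement outside any loop
        # Update the nesting depth only after the marker has been chosen,
        # so that "end loop" is still drawn as part of the bar.
        if opens_loop:
            loop_depth += 1
        elif stripped.startswith("end loop"):
            loop_depth -= 1
        annotated.append(f"{marker} {line}")
    return annotated
```

Printing the result of `annotate` on a short listing gives a textual margin similar in spirit to the left-hand column of Figure 3.3.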

One of the most important visualization tools within this category is SeeSoft [5]. This tool intends to provide a view of how the software system has evolved through time. It uses a bidimensional space based on rectangles that represent each file of the system, whereas the content of each rectangle shows how each line of code has changed. This metaphor is depicted in Figure 3.4, where the color scale represents how much the file has changed during the development or maintenance of the software system.

Figure 3.4 SeeSoft tool example. Taken from [2].

Other bidimensional visualizations are based on evolutionary information, that is, on the history of a program. In this case, researchers have developed interesting techniques, such as support graphs [24], fractals [25], and pixel maps. The first of them, a support graph, takes the concept of evolutionary coupling and represents the evolution of a system using a graph-like representation, where each node represents a file and a measure of coupling between files is used as a weight. This weight is later interpreted as a distance between nodes. Figure 3.5 shows an example of this kind of visualization. The second technique, fractals, was developed as a tool to help stakeholders comprehend and measure the effort of each developer within a development team. It is based on a set of nested rectangles where the size of each rectangle represents the contribution of a developer. Each example illustrated in Figure 3.6 represents a type of development environment using this method. Finally, the third technique, pixel maps, also represents the concept of evolutionary coupling: a matrix is built from the files of the system, where each row and each column represents a file. Each cell has a color indicating the strength of the coupling between the two files represented by the corresponding column and row. By using this mechanism, it is possible to identify emerging blocks that indicate the presence of related concepts in the code, leading to an easier understanding and helping to support impact analysis.

A similar approach to highlight associations among files is employed in pixel maps, as shown in Figure 3.7. In this case the files are arranged as a matrix where each cell represents, by using a color scale, the weight of the relation between the two corresponding files. With this kind of visualization it is possible, once again, to see emerging patterns in the system architecture.
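A pixel map of this kind can be sketched by building a file-by-file matrix from the commit history. In the sketch below, the coupling measure is simply the number of commits in which two files changed together, and the "color scale" is a short string of characters; both are assumptions made for illustration, and the tools cited in this section may use different measures.

```python
from itertools import combinations


def coupling_matrix(commits):
    """Build a symmetric co-change matrix from a list of commits.

    Each commit is an iterable of file names. Cell [i][j] counts the
    commits in which file i and file j were changed together.
    """
    files = sorted({name for commit in commits for name in commit})
    index = {name: position for position, name in enumerate(files)}
    matrix = [[0] * len(files) for _ in files]
    for commit in commits:
        for first, second in combinations(sorted(set(commit)), 2):
            matrix[index[first]][index[second]] += 1
            matrix[index[second]][index[first]] += 1
    return files, matrix


def render(matrix, shades=" .:#"):
    """Map every cell to a character, darker meaning stronger coupling."""
    peak = max((value for row in matrix for value in row), default=0) or 1
    top = len(shades) - 1
    return ["".join(shades[value * top // peak] for value in row) for row in matrix]
```

Blocks of dark cells in the rendered rows correspond to groups of files that tend to change together.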

Figure 3.5 Mozilla Firefox example of evolutionary coupling. Taken from [2].

Figure 3.6 Fractal representation: (a) Developer, (b) Few developers, (c) Many balanced developers. Taken from [2].



Figure 3.7 Pixel-maps example. Taken from [2].

3.3.3 3D Approaches

These approaches have not been as widely explored as the 2D techniques. However, Stasko and Wehrli state in [26], about visualizing in a three-dimensional world, that “by adding an extra spatial dimension, we supply visualization designers with one more possibility for describing some aspect of a program or system”; thus, more information can be represented. In addition, when using a 3D metaphor, it has been suggested that perception is less error prone if software objects are mapped to visual objects, as there is a natural mapping between them [11].

There are some cases where a third dimension⁵ is needed. An example of this is the visualization of structural change: it needs at least two dimensions to reflect the internals of a system at a specific time, and another one to render the information about how the system has changed over time.

Techniques like the one used by SeeSoft have been extended to a 3D space. This is the case of sv3D [8] and its successor SeeIT 3D [13], where a software artifact is represented by a container and a set of polycylinders represents the components of the artifact, e.g., a class is a container and its methods are the polycylinders. This representation makes it possible to visualize more information, and, as stated before, the user feels more comfortable with it than with 2D approaches. It can even represent a wide spectrum of software artifacts, such as source code and relational databases. Figure 3.8 depicts the concept of this metaphor.

⁵ Can be spatial or temporal.

Figure 3.8 sv3D and SeeIT 3D metaphor.

A special use case of a metaphor like the one presented above is shown in Figure 3.9. It shows the stages of a sorting algorithm, using a third dimension to represent how the algorithm has solved the problem. In the front row, the polycylinders are unsorted; step by step (each step represented by a row) they become sorted. In the last row they are fully sorted, showing how the sorting process was completed.
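The rows of such a view are just snapshots of the data after each step. The sketch below records one row per swap; selection sort is used here only as an assumption for the example, since the text does not say which sorting algorithm Figure 3.9 depicts.

```python
def sorting_rows(values):
    """Return the successive states of a selection sort, one row per swap.

    The first row is the unsorted input and the last row is fully sorted;
    each value would become the height of one polycylinder in its row.
    """
    row = list(values)
    rows = [tuple(row)]
    for position in range(len(row)):
        smallest = min(range(position, len(row)), key=row.__getitem__)
        if smallest != position:
            row[position], row[smallest] = row[smallest], row[position]
            rows.append(tuple(row))
    return rows
```

Rendering each tuple as one row of cylinders, front to back, reproduces the layout described above.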

Another important example of using a 3D space and its advantages is the CodeCity [10] tool. The purpose of this tool is to render a software system as a city in which users can orient themselves easily. CodeCity employs a mapping from software components to city objects: each package is represented by a district and each class by a building, where the height of the building is given by the number of methods and its width by the number of attributes. Figure 3.10 illustrates this metaphor at work.
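The mapping just described can be written down directly. The sketch below only illustrates the mapping rules stated in the text; the type names and fields are hypothetical and are not taken from the CodeCity implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassMetrics:
    """Metrics of one class, as produced by some earlier analysis step."""
    package: str
    name: str
    methods: int
    attributes: int


@dataclass(frozen=True)
class Building:
    """Visual counterpart of a class in the city metaphor."""
    label: str
    height: int  # number of methods
    width: int   # number of attributes


def to_city(classes):
    """Group buildings into districts, one district per package."""
    city = {}
    for item in classes:
        building = Building(item.name, item.methods, item.attributes)
        city.setdefault(item.package, []).append(building)
    return city
```

A renderer would then only need to lay out each district and draw its buildings with the given height and width.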



Figure 3.9 Sorting steps using a third spatial dimension. Taken from [2].

Figure 3.10 CodeCity metaphor. Taken from [10].

Finally, it is worth mentioning a special case of 3D visualization based on UML. After the definition of this language, an enhancement was proposed in [16, 27]. The authors used a 3D space to visualize the same elements of UML, but the proposal was not as successful as UML itself due to the difficulty of drawing the shapes on paper or on a whiteboard.



3.3.4 Virtual Environments

Virtual environments give the user a unique type of immersion and navigation because of the way they represent and render the information. This level of interaction is achieved by presenting the user with a world in which they can interact with different objects that are mapped to software components or artifacts.

Research in this area is far more complicated, as it requires more human and technological resources. Therefore, work in this area has not been as popular as 2D approaches. Despite the costs, some work has been done; this is the case of ImsoVision [28], which visualizes C++ source code. It employs geometrical figures to represent the different components in the software system. While this work has had an impact on the C++ community, Software World [1] has done a similar job for the Java language; it uses a metaphor based on real-world elements, such as countries, districts and buildings, to represent the source code. Additionally, one of the most important visualization tools is CAVE⁶, proposed in [29], which uses a cubicle where the user interacts with the world presented in it.

Distributed VEs are a special kind of virtual environment where many users, located in different places, interact with the visualization at the same time. By providing a tool capable of supporting this kind of operation, all users involved in the visualization can interact with each other and work in a collaborative style.

3.4 TOWARDS A BETTER SOFTWARE VISUALIZATION PROCESS

Despite the efforts made in the SV area, emerging technologies and other paradigms pose challenges that need to be addressed by new metaphors and techniques. They include new software programming paradigms, such as Aspect Oriented Programming (AOP), more dynamic languages, the processing of higher level languages, more flexible metaphors, and some educational issues.

⁶ Not only for software visualization.



3.4.1 Other Programming Paradigms

With more and more languages available for writing programs, it is necessary to build better visualization systems based on these new models, such as aspect oriented programming. AOP is an extension of the object-oriented model based on cross-cutting concerns within a system (e.g., logging capabilities). If an SV tool is going to analyze a system of this type, it needs to know how the interactions of aspects are defined and implemented. In this way, an SV system could adequately explain the underlying system and achieve the objective of providing better understanding. This specific case was explored in [30], but few of the tools developed so far provide a novel solution to this problem. Instead, they have tried to extend the models used in object-oriented systems or even those employed in older models.
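To make the notion of a cross-cutting concern concrete, the sketch below uses a Python decorator as a rough analogue of a logging aspect. This is an assumption made purely for illustration: decorators are not a full AOP system (there are no pointcuts or weaving), and the names `logged` and `CALL_LOG` are hypothetical. It does show why a tool that only looks at the bodies of `deposit` and `withdraw` would miss the logging behavior.

```python
import functools

CALL_LOG = []  # stands in for a log file shared by the whole system


def logged(func):
    """Wrap a function so that every call is recorded in CALL_LOG."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CALL_LOG.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@logged
def deposit(balance, amount):
    """Business logic only; logging is added from the outside."""
    return balance + amount


@logged
def withdraw(balance, amount):
    """Business logic only; logging is added from the outside."""
    return balance - amount
```

A visualization of such a system would need to show, for each function, which cross-cutting behavior is attached to it in addition to its own code.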

Dynamic languages are a second example where SV has not been widely introduced. These languages differ from compiled languages in that behaviors are determined at runtime rather than defined at compile time⁷. Although these behaviors can be emulated in compiled languages, the syntax and the understanding process are much easier in dynamic languages because they are built for that specific purpose. Examples of dynamic languages are Groovy, JavaScript, Objective-C, PHP, and Python. Therefore, new metaphors that enable the presentation of this kind of information need to be defined. The information can be runtime information, which is more important in this case, or static information obtained from the source code.
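The gap between static and runtime information in a dynamic language can be shown with a small Python experiment. The sketch below compares the methods found by parsing the source text with the methods actually present on the class after one is attached at runtime; the class `Report` and both helper functions are hypothetical names chosen for the example.

```python
import ast

SOURCE = '''
class Report:
    def render(self):
        return "plain"
'''


def static_methods(source, class_name):
    """Static view: method names found by parsing the source text."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
    return set()


def runtime_methods(cls):
    """Runtime view: callables currently defined on the class itself."""
    return {
        name
        for name, value in vars(cls).items()
        if callable(value) and not name.startswith("__")
    }


# Load the class, then attach a new behavior at runtime.
namespace = {}
exec(SOURCE, namespace)
Report = namespace["Report"]
setattr(Report, "export", lambda self: "exported")
```

After the last line, the static view still reports only `render`, while the runtime view also contains `export`, which is exactly the kind of information a purely source-based visualization would not show.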

3.4.2 Include Other Languages

In software development there are more languages apart from the classic programming languages (e.g., Java, PHP, .Net, C, and C++). Domain Specific Languages (DSLs), Architecture Description Languages (ADLs), and metamodeling are examples of such languages. DSLs are languages or specifications designed to be applied in a specific domain, for example, R for statistics. Considering that not every aspect of a software system is specified in a general purpose language, it is necessary to be able to analyze sources like DSLs or even analyze the languages in which they can be written (e.g., the Groovy language).

⁷ See http://en.wikipedia.org/wiki/Dynamic_programming_language for more information.

Another topic that has been only partially explored using the classic sources of information is the architecture of a software system. Specifically, ADLs should be considered as sources of information for SV tools. An ADL is a language to define and express how the architecture of a system is implemented. Hence, ADLs could be very useful sources of information that would facilitate the visualization of bigger components, and therefore, enable the rendering of high level views of the analyzed system.

Finally, a last example of other languages that could be visualized is metamodeling techniques. Metamodeling is a mechanism used to define constraints, terms, rules, and concepts about the model employed to solve a domain specific problem. A visualization tool may add this information to the model itself and help to address the inherent problems in the software evolution process.

As can be seen, there are other languages that can be considered as alternative or additional sources of information for SV tools.

3.4.3 Better and More Flexible Metaphors

Most of the metaphors defined for software visualization have been designed for a particular problem or source of information, as explained in Section 3.3. The next step in SV should be to provide better and more flexible metaphors that allow the user to visualize several sources of information with just one view. If a mechanism of this nature is provided, the learning curve of the visualization tools is reduced, because the final user only needs to comprehend how one metaphor behaves instead of learning a different metaphor for each source of information. The results would be even better if the tool considers other sources and languages like the ones explained in the previous sections.

3.4.4 Educational Issues

At the educational level, the SV process can be improved if this subject is taught as part of a software evolution course. At least at Universidad Nacional de Colombia, there is no clear way to teach the benefits of SV and how it can be used in production environments to support the software evolution process.

Students of software engineering or software evolution courses can start by reviewing the state of the art of software visualization. Next, they can use various visualization tools when doing software maintenance tasks. After that, they will be able to work on the design and implementation of new tools based on real world needs, instead of researchers’ assumptions about a world they may not know.

3.5 SUMMARY

This chapter presented the bases of software visualization. As was seen, the approaches range from 2D techniques using graph-like representations, pixel maps, geometrical shapes and colors, to virtual environments where the user is able to navigate a world representing the analyzed software system. These approaches are mainly based on three sources of information: static, dynamic, and evolutionary. Static information refers to the data that are extracted from sources that do not need the program to be running (source code, documents, and so on). Dynamic information is gathered from a running system, for example, the values placed in the execution stack of the program. Evolutionary information is processed from the different software repositories, such as bug trackers and version control systems.

At the end of the chapter, some considerations that should be taken into account for improving the visualization process and tools were presented. They include the adoption of new programming paradigms, languages other than the classic ones, the development of new metaphors, and some basic considerations that could lead to a bigger community around software visualization tools.



REFERENCES

[1] Knight, C. & Munro, M. Comprehension with[in] virtual environment visualisations. In: Program Comprehension, 1999. Proceedings. Seventh International Workshop on: 1999, 4–11.

[2] Diehl, S. Software visualization: visualizing the structure, behaviour, and evolution of software. Springer Verlag: 2007.

[3] Jones, J. A., Harrold, M. J. & Stasko, J. T. Visualization for fault localization. In: Proceedings of ICSE 2001 Workshop on Software Visualization. Toronto, Ontario, Canada: 2001, 71–75.

[4] Lanza, M. CodeCrawler - lessons learned in building a software visualization tool. In: CSMR ’03 Proceedings of the Seventh European Conference on Software Maintenance and Reengineering, volume 2003: 2003.

[5] Eick, S. C., Steffen, J. L. & Sumner Jr., E. E. Seesoft - a tool for visualizing line oriented software statistics. IEEE Transactions on Software Engineering, 18(11): 1992; 957–968.

[6] Malnati, J. X-Ray: An Eclipse Plug-in for Software Visualization. Final degree project, Università della Svizzera italiana: 2007.

[7] Voinea, L., Telea, A. & van Wijk, J. J. CVSscan: visualization of code evolution. In: Proceedings of the 2005 ACM symposium on Software visualization, SoftVis ’05. ACM, New York, NY, USA: 2005. ISBN 1-59593-073-6, 47–56.

[8] Marcus, A., Feng, L. & Maletic, J. I. 3D representations for software visualization. In: Proceedings of the 2003 ACM symposium on Software visualization, SoftVis ’03. ACM, New York, NY, USA: 2003. ISBN 1-58113-642-0, 27–ff.

[9] Lowe, W. & Panas, T. Rapid construction of software comprehension tools. International Journal of Software Engineering and Knowledge Engineering, 15(6): 2005; 995–1025.



[10] Wettel, R. & Lanza, M. CodeCity: 3D visualization of large-scale software. In: Companion of the 30th international conference on Software engineering, ICSE Companion ’08. ACM, New York, NY, USA: 2008. ISBN 978-1-60558-079-1, 921–922.

[11] Ware, C., Hui, D. & Franck, G. Visualizing object oriented software in three dimensions. In: Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: software engineering - Volume 1, CASCON ’93. IBM Press: 1993, 612–620.

[12] Mens, T., Wermelinger, M., Ducasse, S., Demeyer, S., Hirschfeld, R. & Jazayeri, M. Challenges in software evolution. In: Principles of Software Evolution, Eighth International Workshop on: 2005, 13–22.

[13] Montano, D. Development of a 3D tool for visualization of different software artifacts and their relationships. Final degree project, Universidad Nacional de Colombia, Departamento de Ingeniería de Sistemas e Industrial: 2010.

[14] Baecker, R. & Sherman, D. Sorting out sorting. Video shown at SIGGRAPH-81: 1981.

[15] Graçanin, D., Matković, K. & Eltoweissy, M. Software visualization. Innovations in Systems and Software Engineering, 1(2): 2005; 221–230.

[16] Irani, P., Tingley, M. & Ware, C. Using Perceptual Syntax to Enhance Semantic Content in Diagrams. IEEE Computer Graphics and Applications, 21(5): 2001; 76–85. ISSN 0272-1716.

[17] Burch, M., Diehl, S. & Weißgerber, P. EPOSee: A Tool For Visualizing Software Evolution. In: 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis: 2005.

[18] Storey, M.-A., Best, C. & Michaud, J. SHriMP Views: An Interactive Environment for Exploring Java Programs. International Conference on Program Comprehension, 0: 2001; 0111.



[19] Stasko, J. T. Tango: A Framework and System for Algorithm Animation. ACM SIGCHI Bulletin, 21: 1990; 59–60. ISSN 0736-6906.

[20] OMG. UML Specification. http://www.omg.org/spec/UML/2.2/Infrastructure/PDF/.

[21] Lakoff, G. & Johnson, M. Metaphors we live by. Chicago, London: 1980.

[22] Mackinlay, J. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5: 1986; 110–141. ISSN 0730-0301.

[23] Goldstine, H. H. & Von Neumann, J. Planning and coding of problems for an electronic computing instrument. Institute for Advanced Study: 1947.

[24] Zimmermann, T., Weisgerber, P., Diehl, S. & Zeller, A. Mining Version Histories to Guide Software Changes. In: Proceedings of the 26th International Conference on Software Engineering, ICSE ’04. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2163-0, 563–572.

[25] D’Ambros, M., Lanza, M. & Gall, H. Fractal Figures: Visualizing Development Effort for CVS Entities. In: Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis, VISSOFT ’05. IEEE Computer Society, Washington, DC, USA: 2005. ISBN 0-7803-9540-9, 16–. doi:http://dx.doi.org/10.1109/VISSOF.2005.1684303.

[26] Stasko, J. T. & Wehrli, J. Three-dimensional computation visualization. In: Visual Languages, 1993., Proceedings 1993 IEEE Symposium on: 1993, 100–107.

[27] Irani, P. & Ware, C. Diagrams based on structural object perception. In: Proceedings of the working conference on Advanced visual interfaces. ACM, Palermo, Italy: 2000. ISBN 1-58113-252-2, 61–67. doi:10.1145/345513.345254.



[28] Maletic, J. I., Leigh, J., Marcus, A. & Dunlap, G. Visualizing object oriented software in virtual reality. In: Proceedings of the 9th International Workshop on Program Comprehension. ACM: 2001, 21–13.

[29] Cruz-Neira, C., Sandin, D. J., DeFanti, T. A., Kenyon, R. V. & Hart, J. C. The CAVE: audio visual experience automatic virtual environment. Communications of the ACM, 35(6): 1992; 64–72.

[30] Pfeiffer, J. H. & Gurd, J. R. Visualisation-based tool support for the development of aspect-oriented programs. In: Proceedings of the 5th international conference on Aspect-oriented software development, AOSD ’06. ACM, New York, NY, USA: 2006. ISBN 1-59593-300-X, 146–157. doi:http://doi.acm.org/10.1145/1119655.1119676.

Incremental Change: The Way that Software Evolves

Juan Romero-Silva
Mario Linares-Vásquez
Jairo Aponte

ABSTRACT

Change is an unavoidable characteristic of software. The research field of software maintenance and software evolution emerges precisely from this characteristic. Incremental Change (IC) is an essential part of software evolution because it deals with the addition of features and properties to the software. In this chapter, we review several techniques related to Incremental Change as a method to implement new features or correct bugs in software. IC embraces change in the same way that iterative and agile software development methodologies do. Thus, IC is also a fundamental piece of iterative and agile development.

4.1 INTRODUCTION

Change is a fundamental element of software development processes. However, plan-driven methodologies traditionally try to avoid change by borrowing strategies from other branches of industry and engineering (for example, planning at the beginning of a project in order to “freeze” requirements). These models have been successful for product manufacturing, and that success was the reason for adopting the plan-driven waterfall model in software development. However, in software development it is not easy to know the requirements in advance. Thus, a “complete design” before the implementation is impossible to achieve, and using plan-driven methodologies in software becomes almost useless. Rajlich [1] quotes a report stating that only about 16% of software projects using the waterfall methodology succeeded. The remaining 84% failed because they far exceeded their budget, their schedule, or both, or because they were canceled.

Several authors have studied the problem of changing requirements from several points of view; some of the most notable works include:

• The software evolution laws described by Lehman [2,3].

• The software maintenance categorization proposed by Swanson [4].

• The staged model for the software life cycle proposed by Bennett and Rajlich [5].

Incremental change appears whenever software needs to evolve because the environment changes, users want new features, or some kind of issue must be resolved. These cases usually occur after initial software delivery, but if software is developed with an agile or even an iterative methodology, incremental change appears as soon as the first iteration is completed. Thus, this chapter intends to show the importance of IC in the software development process and how researchers are addressing the steps related to IC. The rest of the chapter is structured as follows: Section 4.2 describes the steps needed to complete an incremental change. Section 4.3 reviews research related to concept and feature location and establishes the differences and similarities between them. Section 4.4 reviews techniques for impact analysis. Finally, Section 4.5 draws some conclusions.

4.2 INCREMENTAL CHANGE IN THE SOFTWARE DEVELOPMENT PROCESS

After several years of trying to avoid change during software development processes, the acceptance of change as an inherent feature of software helped developers to adopt iterative and agile methodologies. These methodologies embrace change as a fundamental piece, and thus have much in common with incremental change. Febbraro and Rajlich [6] present an agile methodology based on the implementation of one change per iteration. Figure 4.1 presents the IC activities. The starting point for an IC process is a change request. The subsequent activities are the extraction of concepts from the request, the location of these concepts in the source code, and the change impact analysis. Next, the developer prepares the source code for the change by doing a preliminary refactoring (an optional activity), updates the code by implementing the changes, and incorporates these changes into the source code (by propagating the effects through the elements identified during impact analysis). Sometimes, incorporation and update are done together and cannot be easily separated into two activities. Finally, the programmer propagates changes that were not expected and performs a new refactoring. This refactoring is only needed when “bad smells” are found after updating, incorporation, and change propagation.

[Diagram boxes: Change Request, Concept Extraction, Concept Location, Impact Analysis, Prefactoring, Actualization, Incorporation, Change Propagation, Postfactoring]

Figure 4.1 Incremental change activities.

4.2.1 Software Maintenance vs. Software Evolution

Incremental change was first described only as a maintenance task that can be performed only after the release of the software system, when change requests are made. Agile software development processes based on incremental change treat the results of the first iteration as the first release. From this point of view, the subsequent iterations increment the existing software by adding features, which gives an evolutionary view of incremental change.

According to [7], the differences between maintenance and evolution are deep, because maintenance denotes the idea of keeping a system running without fundamental changes in the design, while evolution instead tries to produce enhancements that often require profound design modifications. In software, changes are produced in order to meet the unsatisfied needs of users. Occasionally these changes require no architectural modifications, but frequently they affect the architecture widely. In the latter case, changes do not seek preservation, but innovation. This is why the term evolution is considered more adequate when discussing software.

In addition, in [5], Rajlich and Bennett present even more drastic differences between evolution and maintenance. They propose a software life cycle that is composed of five stages and strongly distinguishes the evolution stage (the second stage, after initial development) from the servicing stage (after the evolution stage). In the evolution stage, software changes without degrading its architecture; during this stage, the software incorporates new features while maintaining the integrity of its structure. When the software moves to the servicing stage, it stops evolving and maintenance begins. In this stage the architecture begins to degrade and changes become increasingly hard to make. Maintenance finishes when the architecture is so degraded that changes become impossible or prohibitively expensive.

4.2.2 Activities of Incremental Change

There is a consensus on the activities related to incremental change. Occasionally some activities are not performed; for example, if the software architecture is prepared for the change, the prefactoring activity is not needed, and if the program is short enough and the programmer knows it very well, concept location may be a trivial task done intuitively. However, in large and complex software systems it is likely that all activities have to be performed. According to several works by Rajlich [1, 6, 8, 9], the activities of incremental change are:

1. Change request: The discovery of a bug and a request for a new feature or an enhancement are typical change requests, which can be raised by the users or by someone in the development team. These change requests are usually written in natural language and are formulated in terms of domain concepts [8]. Since the goal is to implement the functionality described using the concepts in the change request, the next steps of incremental change make extensive use of concepts.

2. Concept extraction: As the request is made in natural language, it contains many words, and the domain concepts must be found among them. In this step, the developer extracts the domain concepts from the change request in order to formulate the queries that locate the concepts in the source code. As developers extract concepts from the change request, the ability to compare them with concepts in the source code could increase the effectiveness of the next activity; recent work has been done to enhance the extraction of concepts from source code [10,11].

3. Concept location: The developer uses the domain concepts extracted from the change request to locate the parts of the code that have to be changed. This activity is called concept location. It is done by formulating a query and processing some of the results, which show possible places where the concepts in the query are implemented in the source code. In early research [12], concept location was used as a synonym of feature location, but this chapter presents a comparison that shows the differences between both terms (according to the current definitions from the same authors [13]).

4. Impact analysis: As different entities or classes have to collaborate to achieve an objective, the implementation of a new functionality usually involves several parts of the code. During this activity, developers search for other parts of the code that have to be modified along with the change request. The source code is not the only artifact affected by the change. There are also non-executable files (configuration files, media files, user manuals, etc.) that are impacted by the change. These non-source code artifacts also need to be identified in order to complete the change successfully.

5. Prefactoring: Refactoring is defined as an activity in which the internals of the program change while the observable behavior remains the same. It is done to maintain or even improve the architecture of a software application. Prefactoring is a refactoring task done before implementing the change identified in the previous steps. It is necessary because sometimes the architecture is not adequately prepared for the change. The objective of this activity is to make the change easier.

6. Actualization: The concept is fully implemented, perhaps by adding new classes or by modifying existing ones. Which of the two applies depends on whether the concept is implicit or explicit.

7. Incorporation: When the change implemented during the actualization activity requires the addition of new classes, these classes need to be intertwined with the old code. This is called incorporation.

8. Change propagation: Once the change is implemented, all source code elements interacting with the changed elements may be affected. During impact analysis the programmer identifies the parts of the code affected by the change. Change propagation is the activity in which all these identified changes are actually implemented. While doing this task, parts that need modification and that were not identified during impact analysis can always appear. Put differently, impact analysis is a part of incremental change design, while change propagation is a part of incremental change implementation [14].

9. Post-factoring: During the implementation of changes, there is always the possibility of introducing “bad smells” into the code. Post-factoring is another refactoring done with the objective of reducing or removing the negative impact of the changes on the architecture of the program.

10. Testing: Testing is done throughout the whole process. Existing tests must not be broken after the change is implemented; in some cases, tests need to be updated to match the updated code. This is done especially during refactoring, actualization, and incorporation. This task also shows another significant similarity between incremental change and agile methodologies: both treat testing as a fundamental activity.

11. Documentation: Non-source code artifacts also have to be updated when a change is made in the code; this updating task is done throughout the process. After post-factoring, some changes that were not previously identified have to be reflected in the documentation. That is why documentation can be considered the final step of an incremental change.
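The ordering and the optional nature of some activities can be illustrated with a minimal sketch. The names `Activity` and `plan_activities` and the two boolean flags are assumptions made only for this illustration; testing and documentation are left out because they run throughout the process rather than at a fixed position.

```python
from enum import Enum
from typing import List


class Activity(Enum):
    """Sequential IC activities that follow a change request (illustrative names)."""
    CONCEPT_EXTRACTION = "concept extraction"
    CONCEPT_LOCATION = "concept location"
    IMPACT_ANALYSIS = "impact analysis"
    PREFACTORING = "prefactoring"
    ACTUALIZATION = "actualization"
    INCORPORATION = "incorporation"
    CHANGE_PROPAGATION = "change propagation"
    POSTFACTORING = "post-factoring"


def plan_activities(architecture_ready: bool, bad_smells_found: bool) -> List[Activity]:
    """Return the ordered activities for one change request.

    Prefactoring is skipped when the architecture is already prepared for the
    change, and post-factoring is scheduled only when bad smells were found.
    """
    skipped = set()
    if architecture_ready:
        skipped.add(Activity.PREFACTORING)
    if not bad_smells_found:
        skipped.add(Activity.POSTFACTORING)
    # Enum members iterate in definition order, which is the process order.
    return [activity for activity in Activity if activity not in skipped]


plan = plan_activities(architecture_ready=True, bad_smells_found=False)
```

In this sketch a change to a well-prepared code base that leaves no bad smells goes through six activities instead of eight.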

4.3 CONCEPT AND FEATURE LOCATION


Only source code that is understood can be modified. That is why software comprehension is an important issue in incremental change. While there are many definitions of concept in the literature, this work uses the definition given by Rajlich in [15]: a concept is a unit of knowledge that can be processed by a human mind. The kinds of concepts to consider in the software development domain are:

• Domain concepts: These concepts form the vocabulary of end users. They are used to describe the problem.

• High-level design concepts: These are related to the implementation of functionalities in source code. Architectural patterns, among other terms from the solution domain, may be included.

• Conditions of failure: Software system concepts that the user barely understands. Some of these probably remain hidden until programming tasks.

Later, Rajlich extends the definition of concept [16], presenting it as a triad composed of name, intension, and extension:

• Name is the label used to identify the concept.

• Intension represents the meaning.

• Extension includes the set of all the things described by the concept.
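The triad can be pictured as a small record type. The following is a minimal sketch; the class name `Concept`, the field name `intension` for the meaning, and the drawing-application example are assumptions made for illustration and are not taken from [16].

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Concept:
    name: str        # label used to identify the concept
    intension: str   # the meaning of the concept
    extension: Set[str] = field(default_factory=set)  # things the concept describes


# Hypothetical domain concept from a drawing application.
shape = Concept(
    name="shape",
    intension="a figure that can be drawn on the canvas",
    extension={"Circle", "Rectangle", "Triangle"},
)
```

Under this view, locating a concept in code amounts to finding the program elements that belong to its extension, starting from its name and meaning.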

The work described above shows the central role of concepts in software comprehension. In [17] the authors state that one of the biggest difficulties in achieving software comprehension is what they call the concept assignment problem: the problem of knowing which parts of the source code implement a specific requirement of the software [18]. One of the principal causes of the concept assignment problem is that the concepts used in the requirements specification belong to the problem domain, while the concepts found in source code are high-level design concepts from the solution domain. But this is not the only cause. Another important difficulty comes from the fact that some concepts cannot be implemented in just one software component, but are implemented across several components; this is known as the delocalization problem [19].

4.3.2 Concept Location

Once the software development team receives a change request, the first task is to identify the parts of the software artifacts that need to be modified. This is done in two steps: first, the developer locates where to start making changes (concept location), and then the developer performs the impact analysis of these changes (discussed in the next section).

Usually, the change request is expressed in natural language and includes the domain concepts that must be implemented. The programmer now needs to find the parts of the source code related to the concepts in the change request, because those are the parts that will probably change. The problem arises precisely at this point, because the concepts from the change request are domain concepts, while the concepts in the source code (where the developer is searching) are design concepts or failure conditions; therefore, these are concepts from different abstraction levels.

In order to minimize the gap between the domain concepts and the concepts expressed in source code, most concept location techniques use an intermediate representation, as seen in Figure 4.2. Reducing this difference between levels of representation is the motivation behind concept location. If the objective is achieved, the search space that the programmer explores when looking for the place to implement a change request will be smaller than before.

[Diagram elements: Change request (human-oriented: some change request specification, usually expressed in natural language, using domain concepts); associations; Intermediate representation; Source code]

Figure 4.2 Concept location techniques frequently use an intermediate representation of source code [20].

There are some situations where concept location is done intuitively; for example, when the programmer has a lot of experience with the source code being changed, or when the software system is small enough. However, when the project is big, when the developer does not have much experience with the code, or when an experienced developer has not been in contact with the project for a while, alternatives beyond intuitive location have to be used. Concept location techniques are broadly classified into two types: static and dynamic techniques.

4.3.3 Static Techniques

Static techniques do not require the software to be executed. They are largely based on the textual information present in software artifacts. For this reason, these techniques are usually inexpensive and can be used even if the source code is not complete.

String Pattern Matching

The search is conducted directly on the source code, looking for strings that match a regular expression. This is probably the technique most used by software developers, perhaps because it is easy to learn and almost natural for them. The most important part of this technique is the selection of the search pattern. As this step is done by the programmer, the method is considered highly dependent on the user’s judgment. Another weakness is that it uses only the text found in the source code as the search space, leaving its structure unused.
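A grep-like search of this kind can be sketched in a few lines. The function name `grep_concept`, the in-memory file contents, and the search pattern are hypothetical and chosen only for illustration; a real search would read the files from disk.

```python
import re
from typing import Dict, List, Tuple


def grep_concept(pattern: str, sources: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (file, line number, line) for every source line matching the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, number, line.strip()))
    return hits


# Hypothetical source files held in memory.
sources = {
    "cart.py": "def add_item(cart, item):\n    cart.append(item)\n",
    "billing.py": "def compute_discount(order):\n    return order.total * 0.1\n",
}

# The developer guesses which words may name the concept in the code.
hits = grep_concept(r"discount|rebate", sources)
```

The quality of the result depends entirely on the pattern: if the code had named the concept "promotion", this query would return nothing.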

Dependency Search

In [12], the use of an Abstract System Dependence Graph (ASDG) as a representation of source code is described. The ASDG is used to ease the developer’s search. The authors’ abstraction works at the level of functions and global variables. The technique starts by selecting one source code component to begin the search and constructing a search graph with its neighbors. The graph is extended as more components are visited. The process finishes when the components that implement the concept are found. The selection of the first component is one of the critical steps of this technique. It can be chosen randomly, as the product of a previous exploration, or by selecting the top component (perhaps a main method).
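The exploration can be sketched as a traversal of a dependency graph. This is only an approximation under stated assumptions: in the technique the developer decides which neighbor to visit next and whether a component implements the concept, whereas here a breadth-first order and the predicate `is_relevant` stand in for those decisions. The graph and all names are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple


def dependency_search(
    graph: Dict[str, List[str]],
    start: str,
    is_relevant: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Visit components from `start` until one implementing the concept is found.

    Returns the component found (or None) and the visited components in order,
    which correspond to the nodes of the search graph built so far.
    """
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        component = queue.popleft()
        visited.append(component)
        if is_relevant(component):
            return component, visited
        # Extend the search graph with the neighbors of the visited component.
        for neighbor in graph.get(component, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None, visited


# Hypothetical call dependencies, starting the search at the top component.
graph = {
    "main": ["parse_args", "run"],
    "run": ["load_order", "compute_discount"],
}
found, visited = dependency_search(graph, "main", lambda name: "discount" in name)
```

The length of `visited` gives a rough idea of the effort spent, which is why the choice of the starting component matters.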

Information Retrieval-based Techniques

IR techniques have been used successfully in tasks aimed at extracting information from unstructured sources. That is why researchers decided to use them to perform the location task. In [21], several approaches are shown:

• VSM: The Vector Space Model was used by Zhao et al. [22]. It constructs feature documents with words extracted from the requirements and design documentation. Then, these documents are matched against query documents derived from the identifiers of functions in the source code. This approach considers two documents similar if they share the same terms, which leads to polysemy and synonymy problems.

• LSI: Latent Semantic Indexing analyzes the relation between words and documents. In the case of concept location, the words are extracted from the queries, while the documents are the source code artifacts. LSI generates a vector for each document, and then uses it to compute similarity with other documents (the query is constructed as another document) [18]. According to [23], LSI outperforms VSM because it deals with polysemy and synonymy.

• Language Modeling: This technique calculates the conditional probability of generating a query Q given a document D, based on a probabilistic language model derived from document D [21].
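The vector space idea behind these approaches can be sketched as follows. This is a generic term-frequency and cosine-similarity example, not the specific procedure of Zhao et al. [22]; the tokenizer, the function names, and the two sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lower-case words, breaking camelCase and snake_case identifiers."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank documents (here, one per function) by similarity to the query."""
    query_vector = Counter(tokenize(query))
    scores = [
        (name, cosine(query_vector, Counter(tokenize(text))))
        for name, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents built from function source text.
documents = {
    "compute_discount": "def compute_discount(order): return order.total * rate",
    "add_item": "def add_item(cart, item): cart.append(item)",
}
ranking = rank("apply discount to order", documents)
```

Because the comparison relies on shared terms only, a query that says "rebate" would score zero against `compute_discount`, which is the synonymy problem that LSI tries to mitigate.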

Dynamic Techniques

Contrary to static techniques, these do require the execution of the software. Therefore, their requirements are higher than those of static techniques. For execution to be possible, the source code must be complete. These techniques also need test cases, because the execution starts with a test case and ignores any software artifact that cannot be executed.

The Software Reconnaissance method [24] is one of the first dynamic approaches to the concept location problem. First, the source code is prepared so that traces can be produced. Then test cases are selected: some of them execute the feature the developer is searching for, and some are unrelated to it. After the execution, the traces are compared. The components exercised by the test cases related to the feature, and not exercised by the unrelated test cases, are the results of the technique.
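The comparison of traces reduces to a set difference. The sketch below assumes that each trace has already been collected as a set of executed component names; the function name and the sample traces are hypothetical.

```python
from typing import Iterable, Set


def reconnaissance(
    with_feature: Iterable[Set[str]],
    without_feature: Iterable[Set[str]],
) -> Set[str]:
    """Components exercised by some feature test case and by no unrelated test case."""
    exercised = set().union(*with_feature)
    unrelated = set().union(*without_feature)
    return exercised - unrelated


# Hypothetical traces: one test case applies a discount, the other does not.
traces_with = [{"main", "load_order", "compute_discount", "print_total"}]
traces_without = [{"main", "load_order", "print_total"}]

candidates = reconnaissance(traces_with, traces_without)
```

Components shared by both kinds of test cases, such as `main`, are discarded, so the result depends strongly on how well the unrelated test cases cover the common code.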

In [25], some slicing techniques that can be used in the context of concept location are presented. These methods restrict the behavior of the software to a specific zone of interest (a slice). In this way, the search space is reduced to the components in the traces related to the slice.

In short, dynamic techniques collect traces and then analyze them in order to create sets of components related and unrelated to the concept of interest.

Hybrid Techniques

Hybrid techniques that combine static and dynamic analysis have attracted the interest of researchers. In [23], a technique that uses static and dynamic analysis is described. It uses Latent Semantic Indexing (LSI), as described above, as the static part, and Scenario-based Probabilistic Ranking (SPR) as the dynamic part. In SPR, some scenarios that exercise the feature and some scenarios that do not are defined in order to collect traces. The collected traces are examined to split the events into two sets, one with relevant and one with irrelevant events. Then, the traces are used to find events whose frequency in the relevant set is greater than their frequency in the irrelevant set. The two techniques are applied independently and the result of each technique is treated as the opinion of an expert. Finally, a weight is assigned to each expert and the weighted results are combined into the final ranking. An interesting fact is that the results were better when the weights were approximately the same for each expert (0.5).
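The combination of the two experts can be sketched as a weighted sum. This is a simplified illustration and not the exact formula of [23]: it assumes that both experts already produce scores normalized to the range 0 to 1, and the function name and sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def combine_experts(
    static_scores: Dict[str, float],
    dynamic_scores: Dict[str, float],
    static_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank methods by a weighted sum of a static and a dynamic expert score.

    A method missing from one expert's results is scored 0.0 by that expert.
    """
    dynamic_weight = 1.0 - static_weight
    methods = set(static_scores) | set(dynamic_scores)
    combined = {
        method: static_weight * static_scores.get(method, 0.0)
        + dynamic_weight * dynamic_scores.get(method, 0.0)
        for method in methods
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical normalized scores from the static (LSI-like) and dynamic (SPR-like) experts.
static_scores = {"compute_discount": 0.8, "print_total": 0.6, "add_item": 0.1}
dynamic_scores = {"compute_discount": 0.9, "print_total": 0.2}

ranking = combine_experts(static_scores, dynamic_scores, static_weight=0.5)
```

Changing `static_weight` shifts the ranking toward one expert, which is the knob the reported experiments varied.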

A technique proposed in [26] uses dynamic analysis to collect a single execution trace. After that, LSI is used to rank only the methods in the execution trace (not all methods in the source code). That way, the dynamic part (the collected execution trace) filters information for LSI (the static part). Finally, the authors add one more filter by using web mining techniques in order to exclude irrelevant elements.
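The filtering role of the trace can be sketched as follows. To keep the example self-contained, a plain term-overlap score stands in for LSI, and the web mining filter is omitted; both simplifications, as well as all names and data, are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_trace_methods(
    query_terms: Iterable[str],
    method_terms: Dict[str, Set[str]],
    trace: Set[str],
) -> List[Tuple[str, float]]:
    """Rank only the methods that appear in the execution trace.

    The score is the fraction of query terms found in the method's terms,
    a stand-in for the textual similarity an IR model would compute.
    """
    query = set(query_terms)
    ranked = []
    for method, terms in method_terms.items():
        if method not in trace:  # dynamic filter: ignore methods never executed
            continue
        score = len(query & terms) / len(query) if query else 0.0
        ranked.append((method, score))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical terms extracted from each method and one collected trace.
method_terms = {
    "compute_discount": {"compute", "discount", "order"},
    "print_discount_help": {"print", "discount", "help"},
    "load_order": {"load", "order"},
}
trace = {"compute_discount", "load_order"}

ranking = rank_trace_methods(["discount", "order"], method_terms, trace)
```

Here `print_discount_help` shares a term with the query but is never ranked, because it was not executed in the scenario.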

Feature or Concept?

The words feature and concept were used as synonyms for a long time; only recently have these terms begun to be used in specific ways. According to [13], features are a subset of concepts. A feature is a concept that can be exposed through a user interface and can be selected by a system user. In other words, a feature is a special kind of concept that describes the observable functionality of the software while it is executed.

Concept and feature location can be viewed as slightly different tasks. However, since features are a subset of concepts, feature location can be described as a specialization of concept location. That is, any concept location technique should be able to locate where to start the implementation of a feature, but not all concepts can be located using a feature location approach.

According to [18], dynamic techniques are better suited for feature location (understanding features as concepts that can be observed by the user when running the program with appropriate data), while static approaches can better locate the remaining concepts, which are present in the source code but cannot necessarily be selected by an end user. With this in mind, classifying the change request could help the developer select the best technique: dynamic techniques (or hybrid techniques with more weight on the dynamic part) when the concepts searched for are features, and static techniques (or hybrid techniques with more weight on the static part) in other cases.

CL Techniques Classification

Besides the classic taxonomy that divides techniques into static and dynamic (and, more recently, hybrid), here we consider other ways of classifying techniques.

Source

Taking into account the source of information used during the application of the technique, we basically divide them into techniques that use source code artifacts and techniques that also incorporate non-executable artifacts (diagrams, help documentation, etc.). Most techniques do not use non-source code artifacts at all, even though IR-based models are good at extracting information from text, and some of these artifacts are composed of text. As can be seen in Table 4.1, of the techniques we know, only Cognitive Assignment [21] uses both types of sources. Table 4.1 shows that IR-based techniques have been widely applied to source code artifacts. It also shows that the same techniques have not been used on non-source code artifacts, even though they could be applied to these types of elements.

Table 4.1 Classification of IR-based techniques according to source code applicability.

Technique                                Source code    Non-source code
Software Reconnaissance                       x
Grep-based [20]                               x
ASDG [12]                                     x
VSM (IR) [22]                                 x
LSI (IR) [18]                                 x
Cognitive Assignment [21]                     x               x
Set-based trace recollection [24,25]          x
Set-based with FCA & ASDG [27]                x
SPR [28]                                      x
PROMESIR [23]                                 x
FCA + IR [29]                                 x
Data fusion [26]                              x

Intermediate Representation

As previously stated, intermediate representations are used to reduce the gap between domains of concepts. The representation is constructed in order to conduct the search in it, instead of performing it directly on the source code. As the representation is an abstraction of source code, the search space is thereby reduced.

Traces are representations of particular executions of software, so they can only be used by dynamic techniques. As Table 4.2 shows, every dynamic or hybrid technique makes use of traces.

IR-based approaches use vectors that represent documents. The indexes of the vectors are the terms extracted from the documents, and the values in the vector are either the frequency of the term in the document or just a binary value that states whether the term appears in the document.

Graphical representations seem to be used more by hybrid techniques. Both dynamic and static approaches are capable of obtaining dependencies that can be easily represented in graphs on their own. Another graphical representation used is the lattice. Grep-based techniques do not use any representation at all; the search is conducted directly in the source code.

Table 4.2 shows a classification of the techniques according to the intermediate representation used.

Table 4.2 Classification of IR-based techniques according to intermediate representation.

Technique Graphical Documents Traces None
Software Reconnaissance x
Grep-based x
ASDG x x
VSM (IR) x
LSI (IR) x
Cognitive Assignment x
Set-based trace recollection x
Set-based with FCA & ASDG x x
SPR x
PROMESIR x x
FCA + IR x x
Data fusion x x x

Granularity

As Table 4.3 shows, most techniques scope the search at the method or function level. Class or file level is typically considered by researchers to be too coarse, because some files can be really large. On the other hand, statement-level granularity is extremely fine-grained and probably too expensive to achieve, while the benefits obtained would be minimal.

Table 4.3 Classification of IR-based techniques according to granularity of results.

Technique Method Statement
Software Reconnaissance x
Grep-based x
ASDG x
VSM (IR) x
LSI (IR) x
Cognitive Assignment x
Set-based trace recollection x
Set-based with FCA & ASDG x
SPR x
PROMESIR x
FCA + IR x
Data fusion x

Information Type

Source code, as the principal source used, exposes semantic information (mainly nouns and verbs), but also structural information (inheritance, dependencies, coupling, etc.). Later hybrid techniques tend to use both types, while earlier techniques choose one and stick to it. Table 4.4 shows the information type used by each technique reviewed in this chapter.

Table 4.4 Classification of IR-based techniques according to the type of information used.

Technique Semantic Structural
Software Reconnaissance x
Grep-based x
ASDG x
VSM (IR) x
LSI (IR) x
Cognitive Assignment x
Set-based trace recollection x
Set-based with FCA & ASDG x
SPR x
PROMESIR x x
FCA + IR x x
Data fusion x x

4.4 IMPACT ANALYSIS

Once the first place to implement the changes has been located, the next step is to predict the impact these changes will have on all software artifacts. This is called software change impact analysis. The implementation of the concept located in the previous step introduces some constraints on the related software components. For example, changing a method’s signature or adding a component are changes that have effects on the rest of the program. The output of this activity is the impact set, a collection of components that need to be changed in order to maintain consistency in the source code. The impact set is also important for the software engineer because it helps in estimating the cost of a change.

Earlier papers explain change impact analysis as a task intended to better understand the scope of a change and to determine its possible effects. The results of this task can be used to support the planning and management of software change [30]. In this way, impact analysis is an activity related to planning, analyzing, designing, and managing an incremental software change. This task also helps avoid the injection of bugs into the program.

Just like concept location techniques, impact analysis techniques have traditionally been divided into static and dynamic. Static techniques that use dependencies between components can be found in [31]. According to [32], previous papers show that techniques that rely on expert judgment or source code inspections are inaccurate or too expensive.

Orso et al. [33] compare two dynamic techniques called PathImpact and CoverageImpact. Both need to instrument the system and execute it with a test suite, and both are considered to include every method that can be affected. PathImpact collects and compresses the traces. It requires large amounts of disk space to save the traces, and large amounts of time to compress them. CoverageImpact intersects a slice for all the definitions in the function with a particular execution of it. Both techniques show high precision. The principal weakness of these approaches is the amount of resources they consume.

In [34], two techniques that perform the analysis while the program is executing are presented. These techniques are categorized as online techniques. One of them is called PI_Allin1. As the name suggests, this algorithm is similar to PathImpact, since it also instruments the program and considers that the impact set of a function f includes every function called after f and every function that f can return into. The difference is that while PathImpact collects the trace, compresses it, and then analyzes it, PI_Allin1 analyzes the trace during execution by means of a matrix that records the functions that can be affected. These algorithms (PI and PI_Allin1) are better suited for programs that use global variables. They are considered pessimistic because the impact set includes every function that can be affected. More optimistic algorithms can be used in object-oriented programs, because the probability that a function called after f will be impacted, when it is called neither directly nor transitively by f, is much lower.
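The rule that defines the impact set of a function f can be sketched over an uncompressed trace. This is a simplified illustration of the rule stated above, not the implementation from [33] or [34]: it works on a single, well-formed trace given as call and return events, and it performs no trace compression. The event format and all names are assumptions.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]  # ("call", function) or ("return", function)


def impact_set(trace: List[Event], changed: str) -> Set[str]:
    """Functions called after `changed`, plus those it returns into, plus itself."""
    impacted: Set[str] = set()
    stack: List[str] = []
    seen_changed = False
    for kind, name in trace:
        if kind == "call":
            if name == changed:
                seen_changed = True
                impacted.add(name)
                impacted.update(stack)  # callers on the stack: `changed` returns into them
            elif seen_changed:
                impacted.add(name)      # any function called after `changed`
            stack.append(name)
        elif kind == "return":
            stack.pop()                 # assumes calls and returns are balanced
    return impacted


# Hypothetical trace: main calls a, then f (which calls g), then h.
trace = [
    ("call", "main"), ("call", "a"), ("return", "a"),
    ("call", "f"), ("call", "g"), ("return", "g"), ("return", "f"),
    ("call", "h"), ("return", "h"), ("return", "main"),
]
impacted = impact_set(trace, "f")
```

Note that `h` is reported even though `f` never calls it, which shows why the approach is described as pessimistic.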

In [35], a static technique based on singular value decomposition (SVD) is proposed and compared to PathImpact and CoverageImpact. The technique uses software change records to find components that change together. This information is then used to obtain an impact set based on historical evidence of change. The technique achieves lower precision, but its performance is considerably better.
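The idea of deriving an impact set from components that change together can be sketched with plain co-change counting. Note the substitution: the technique in [35] is based on SVD, whereas this example only counts how often two files appear in the same change record. The threshold `min_support`, the function names, and the change records are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def co_change_counts(commits: Iterable[List[str]]) -> Counter:
    """Count how many change records contain each unordered pair of files."""
    counts: Counter = Counter()
    for files in commits:
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


def historical_impact(commits: List[List[str]], changed: str, min_support: int = 2) -> Set[str]:
    """Files that changed together with `changed` at least `min_support` times."""
    impacted: Set[str] = set()
    for (first, second), count in co_change_counts(commits).items():
        if count < min_support:
            continue
        if first == changed:
            impacted.add(second)
        elif second == changed:
            impacted.add(first)
    return impacted


# Hypothetical change records (one list of files per commit).
commits = [
    ["cart.py", "billing.py"],
    ["billing.py", "invoice.py", "cart.py"],
    ["billing.py", "README"],
    ["invoice.py", "billing.py"],
]
impacted = historical_impact(commits, "billing.py")
```

Because only change records are inspected, non-executable artifacts such as `README` can appear in the result once they co-change often enough.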

Another static technique, shown in [32], uses historical change requests and revision comments to find textual similarities and computes the impact set based on the results. The weakness of static techniques is their poor precision compared to dynamic techniques. The trade-off is performance, because static methods are far less resource intensive. Dynamic techniques usually show better precision, but are unable to find impacted non-executable software artifacts. By their nature, static techniques are able to find documents, help files, and configuration files that need to be changed.

One major problem during change impact analysis is hidden dependencies. When classes A and B are not related, but class C is related to both and silently propagates a change, there is a hidden dependency. These are extremely difficult to find. In [36], the technique executes the whole test suite, extracts invariants from the traces, and uses them to search for hidden dependencies.

4.5 SUMMARY

Software changes can be considered from an evolution or a maintenance perspective. IC can be the way that software evolves, but it is also the path to degradation of the architecture. It is extremely important to manage it in order to keep software in the evolution stage and avoid an early move to maintenance.

Another important distinction currently available in the literature is the one made between concept and feature. With features being concepts “touchable” by users during normal use of the application, dynamic approaches seem better suited for feature location.

This chapter shows how some techniques can be used in different tasks related to software incremental change. Almost the same techniques applied to perform concept location can be applied in search of the whole change impact set. Most techniques make use only of source code, but since every artifact that can be affected is important in impact analysis, we think more research can be done to incorporate non-executable components. The static and dynamic techniques in both IC activities show strengths and weaknesses. Combinations of the approaches seem to enhance the results in a significant way, so we believe several works using hybrid techniques are still waiting to appear. It can also be observed that IR plays an important role as a source of implementations that improve the ability to perform IC activities.

Incremental changes are an unavoidable characteristic of software. Managing these changes in a systematic manner gives developers opportunities to improve comprehension, evolution, and maintenance tasks.



REFERENCES

[1] Rajlich, V. Changing the paradigm of software engineering. Communications of the ACM, 49(8): 2006; 67–70. ISSN 00010782. doi:10.1145/1145287.1145289.

[2] Lehman, M. & Belady, L. A. Program Evolution: Processes of software change. Academic Press: 1985.

[3] Lehman, M., Ramil, J., Wernick, P., Perry, D. & Turski, W. Metrics and laws of software evolution-the nineties view. In: Software Metrics Symposium, 1997. Proceedings., Fourth International: 1997, 20–32. doi:10.1109/METRIC.1997.637156.

[4] Swanson, E. B. The dimensions of maintenance. In: Proceedings of the 2nd international conference on Software engineering, ICSE ’76. IEEE Computer Society Press, Los Alamitos, CA, USA: 1976, 492–497.

[5] Rajlich, V. & Bennett, K. A staged model for the software life cycle. Computer, 33(7): 2000; 66–71. ISSN 0018-9162. doi:10.1109/2.869374.

[6] Febbraro, N. & Rajlich, V. The Role of Incremental Change in Agile Software Processes. In: AGILE 2007: 2007, 92–103. doi:10.1109/AGILE.2007.58.

[7] Godfrey, M. & German, D. The past, present, and future of software evolution. In: Frontiers of Software Maintenance, 2008. FoSM 2008.: 2008, 129–138. doi:10.1109/FOSM.2008.4659256.

[8] Rajlich, V. & Gosavi, P. A case study of unanticipated incremental change. In: Software Maintenance, 2002. Proceedings. International Conference on: 2002. ISSN 1063-6773, 442–451. doi:10.1109/ICSM.2002.1167801.

[9] Rajlich, V. & Gosavi, P. Incremental change in object-oriented programming. Software, IEEE, 21(4): 2004; 62–69. ISSN 0740-7459. doi:10.1109/MS.2004.17.



[10] Abebe, S. & Tonella, P. Natural Language Parsing of Program Element Names for Concept Extraction: 2010. ISSN 1063-6897, 156–159. doi:10.1109/ICPC.2010.29.

[11] Ratiu, D. & Heinemann, L. Utilizing Web Search Engines for Program Analysis: 2010. ISSN 1063-6897, 94–103. doi:10.1109/ICPC.2010.26.

[12] Chen, K. & Rajlich, V. Case study of feature location using dependence graph. In: Program Comprehension, 2000. Proceedings. IWPC 2000. 8th International Workshop on: 2000, 241–247. doi:10.1109/WPC.2000.852498.

[13] Chen, K. & Rajlich, V. Case Study of Feature Location Using Dependence Graph, after 10 Years. In: Program Comprehension (ICPC), 2010 IEEE 18th International Conference on: 2010. ISSN 1063-6897, 1–3. doi:10.1109/ICPC.2010.40.

[14] Buckner, J., Buchta, J., Petrenko, M. & Rajlich, V. JRipples: a tool for program comprehension during incremental change. In: Program Comprehension, 2005. IWPC 2005. Proceedings. 13th International Workshop on: 2005. ISSN 1092-8138, 149–152. doi:10.1109/WPC.2005.22.

[15] Rajlich, V. & Wilde, N. The role of concepts in program comprehension. In: Program Comprehension, 2002. Proceedings. 10th International Workshop on: 2002. ISSN 1092-8138, 271–278. doi:10.1109/WPC.2002.1021348.

[16] Rajlich, V. Intensions are a key to program comprehension. In: Program Comprehension, 2009. ICPC ’09. IEEE 17th International Conference on: 2009. ISSN 1063-6897, 1–9. doi:10.1109/ICPC.2009.5090022.

[17] Biggerstaff, T., Mitbander, B. & Webster, D. The concept assignment problem in program understanding. In: Software Engineering, 1993. Proceedings., 15th International Conference on: 1993. ISSN 0270-5257, 482–498. doi:10.1109/ICSE.1993.346017.



[18] Marcus, A., Sergeyev, A., Rajlich, V. & Maletic, J. An information retrieval approach to concept location in source code: 2004. ISSN 1095-1350, 214–223. doi:10.1109/WCRE.2004.10.

[19] Letovsky, S. & Soloway, E. Delocalized Plans and Program Comprehension. Software, IEEE, 3(3): 1986; 41–49. ISSN 0740-7459. doi:10.1109/MS.1986.233414.

[20] Marcus, A., Rajlich, V., Buchta, J., Petrenko, M. & Sergeyev, A. Static techniques for concept location in object-oriented code: 2005. ISSN 1092-8138, 33–42. doi:10.1109/WPC.2005.33.

[21] Cleary, B., Exton, C., Buckley, J. & English, M. An empirical analysis of information retrieval based concept location techniques in software comprehension. Empirical Software Engineering, 14(1): 2009; 93–130.

[22] Zhao, W., Zhang, L., Liu, Y., Sun, J. & Yang, F. SNIAFL: towards a static non-interactive approach to feature location. In: International Conference on Software Engineering (ICSE 04). ACM/IEEE.: 2004.

[23] Poshyvanyk, D., Gueheneuc, Y.-G., Marcus, A., Antoniol, G. & Rajlich, V. Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval. Software Engineering, IEEE Transactions on, 33(6): 2007; 420–432. ISSN 0098-5589. doi:10.1109/TSE.2007.1016.

[24] Wilde, N. & Casey, C. Early field experience with the Software Reconnaissance technique for program comprehension. In: Software Maintenance 1996, Proceedings., International Conference on: 1996, 312–318. doi:10.1109/ICSM.1996.565034.

[25] Gallagher, K. & Lyle, J. Using program slicing in software maintenance. Software Engineering, IEEE Transactions on, 17(8): 1991; 751–761. ISSN 0098-5589. doi:10.1109/32.83912.

[26] Revelle, M., Dit, B. & Poshyvanyk, D. Using Data Fusion and Web Mining to Support Feature Location in Software. In: Program Comprehension (ICPC), 2010 IEEE 18th International Conference on: 2010. ISSN 1063-6897, 14–23. doi:10.1109/ICPC.2010.10.

[27] Koschke, R. & Quante, J. On dynamic feature location. In: Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, ASE ’05. ACM, New York, NY, USA: 2005. ISBN 1-58113-993-4, 86–95. doi:http://doi.acm.org/10.1145/1101908.1101923.

[28] Antoniol, G. & Gueheneuc, Y. G. Feature identification: A novel approach and a case study. In: Software Maintenance, 2005. ICSM 05 Proceedings of the 21st IEEE International Conference on: 2005, 357–366.

[29] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code: 2007. ISSN 1063-6897, 37–48. doi:10.1109/ICPC.2007.13.

[30] Bohner, S. Impact analysis in the software change process: a year 2000 perspective. In: Software Maintenance 1996, Proceedings., International Conference on: 1996, 42–51. doi:10.1109/ICSM.1996.564987.

[31] Arnold, R. & Bohner, S. Impact analysis-Towards a framework for comparison. In: Software Maintenance, 1993. CSM-93, Proceedings., Conference on: 1993, 292–301. doi:10.1109/ICSM.1993.366933.

[32] Canfora, G. & Cerulo, L. Impact analysis by mining software and change request repositories. In: Software Metrics, 2005. 11th IEEE International Symposium: 2005. ISSN 1530-1435, 9–29. doi:10.1109/METRICS.2005.28.

[33] Orso, A., Apiwattanapong, T., Law, J., Rothermel, G. & Harrold, M. An empirical comparison of dynamic impact analysis algorithms. In: Software Engineering, 2004. ICSE 2004. Proceedings. 26th International Conference on: 2004. ISSN 0270-5257, 491–500. doi:10.1109/ICSE.2004.1317471.



[34] Breech, B., Tegtmeyer, M. & Pollock, L. A Comparison of Online and Dynamic Impact Analysis Algorithms. Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005.

[35] Sherriff, M. & Williams, L. Empirical Software Change Impact Analysis using Singular Value Decomposition. In: Software Testing, Verification, and Validation, 2008 1st International Conference on: 2008, 268–277. doi:10.1109/ICST.2008.25.

[36] Vanciu, R. & Rajlich, V. Hidden dependencies in software systems. In: Software Maintenance (ICSM), 2010 IEEE International Conference on: 2010. ISSN 1063-6773, 1–10. doi:10.1109/ICSM.2010.5609657.


Software Evolution Supported by Information Retrieval

Angélica Veloza-Suan

Mario Linares-Vásquez

Henry Roberto Umaña-Acosta

ABSTRACT

Information Retrieval (IR) techniques have traditionally been used in the analysis of free text and documents. With the emergence of new work areas proposed by software evolution research, IR techniques are becoming a necessary support for the activities of the evolutionary model of software development; some of them, such as incremental change and software comprehension, deal with problems related to information extraction and querying in source code. One of the challenges of software evolution activities is that software evolution analysis requires dealing with large-scale repositories, and IR is a good method for extracting information from large repositories. Thus, this chapter shows how the implementations of some activities of the evolutionary model are supported by IR techniques.

5.1 INTRODUCTION

The evolutionary model of software development [1] appears as a paradigm in which the development process is addressed from the perspective of adaptation and not from the standpoint of prediction, as in traditional development. Software is a product that is evolving, with a life cycle [2] in which incremental change (IC) is continuous. This incremental change requires a set of activities that support the development and maintenance of software, in order to ensure application quality and to facilitate its implementation by any team member (newcomer or expert) at any stage of the development process. Thus, incremental change has positioned itself as the focus of research in software development, with activities such as concept location, automatic categorization of repositories, automatic summarization, and traceability, among others. Because the inputs of incremental change are software artifacts (documentation and code) or some kind of query that represents change, IC activities require the use of text analysis techniques like those used in information retrieval.

The information retrieval approach is used in the same way across software evolution tasks: there is a query, retrieval is executed on the information source using that query, and finally the relevant results are presented. Some implementation details differ for each task, such as the information sources, the format in which results are presented, and the way in which the query is built.

The purpose of this chapter is to show how the techniques and the concept of information retrieval are used in the context of the activities/tasks of the evolutionary model of software development. The structure of the chapter is as follows: Section 2 describes the purpose of information retrieval and the techniques which are generally used for the case of document analysis. Section 3 describes the activities of the evolutionary model of software development, with emphasis on those in which information retrieval is relevant. Section 4 presents how information retrieval is applied to the software evolution activities. Section 5 presents the summary.

5.2 INFORMATION RETRIEVAL

Information Retrieval (IR) addresses issues of representation, storage, organization, and access to information [3]. IR attempts to model, design, and implement systems capable of providing fast and efficient access to large amounts of information, in order to present to the user the most relevant elements from a collection or repository of objects (documents, multimedia, images). The relevance of the objects is estimated based on a query that expresses the needs of the user [4].


The overall operation of IR systems is based on the search for occurrences of terms in a repository of objects, which are modeled using an intermediate numerical representation. This representation symbolizes the importance (weight) of the terms with respect to the objects. Based on those weights, a numerical calculation is performed, and the result is the degree of similarity (relevance) between the objects and the query of the user; at the end of the process, the most relevant objects are displayed in a ranked list.

The general architecture of an IR system (see Figure 5.1) consists of a repository, an indexing module, a query module, and a ranking module. The repository holds the objects on which retrieval will be done. The indexing module creates an intermediate representation of the corpus in order to make searching faster on large repositories. The intermediate representation usually reduces the dimensionality and represents the latent semantics of the corpus. The query module uses the intermediate representation to find the documents that best match the query. The ranking module prioritizes the retrieved objects by using a similarity measure. Then the top-ranked results are displayed to the user.

[Figure: diagram with the components User, Query Module, Repository (Objects), Indexing Module, and Ranking Module, connected by six numbered steps labeled: user query, documents/objects, documents that match with the query, and relevant documents for the query.]

Figure 5.1 General architecture of an IR system.

Several models have been developed in order to address the issues of Information Retrieval. Figure 5.2 is a taxonomy of Information Retrieval models. The models are described below.



[Figure: taxonomy of Information Retrieval models. Classical Models: Boolean Retrieval, Vector Space Model, Probabilistic Models. Alternative and Hybrid Models: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA). Web Models: Page Rank, Hypertext Induced Topic Selection (HITS).]

Figure 5.2 Information retrieval models.

5.2.1 Classic Models

Boolean Retrieval

It is a model based on set theory and mathematical logic. Each document is represented as a set of binary weights of the index terms, and queries are represented as Boolean expressions. The relevant documents are retrieved by applying the logical operations of the query to the document representation; that is, the documents that satisfy the query, according to their sets of terms, are considered relevant to the user [5].
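A minimal Python sketch of this model is shown next, assuming a toy collection of three documents (the document texts and the helper name `build_index` are illustrative, not taken from [5]). Binary weights are represented by membership of a document in the set associated with each term, and the Boolean operators map to set operations.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {
    "d1": "change impact analysis",
    "d2": "concept location in source code",
    "d3": "impact of source code change",
}
index = build_index(docs)

# Query: impact AND change AND NOT concept
result = (index["impact"] & index["change"]) - index.get("concept", set())

# Query: concept OR analysis
either = index["concept"] | index["analysis"]

print(sorted(result), sorted(either))
```

A document either satisfies the expression or it does not, so this model produces an unranked result set.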

Vector Space Model

It is a model in which the documents and the query are represented as vectors of terms. The representation of the documents is a matrix in which the columns are the documents, the rows are the terms, and each position in the matrix contains a value that represents the number of occurrences of the term in the document. The similarity is calculated as the dot product between the query vector and the document vector [6].
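The description above can be sketched in a few lines of Python. The function names and the sample documents are assumptions for illustration; the weighting is the raw occurrence count and the similarity is the plain dot product, as stated in the text.

```python
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw term counts."""
    terms = sorted({t for text in docs for t in text.lower().split()})
    counts = [Counter(text.lower().split()) for text in docs]
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def rank(query, docs):
    """Score each document by the dot product with the query vector."""
    terms, matrix = term_document_matrix(docs)
    q = Counter(query.lower().split())
    q_vec = [q[t] for t in terms]
    scores = [
        sum(q_vec[i] * matrix[i][j] for i in range(len(terms)))
        for j in range(len(docs))
    ]
    order = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
    return order, scores


docs = ["open file read file", "close network socket", "read socket data"]
order, scores = rank("read file", docs)
print(order, scores)
```

Because the dot product grows with document length, practical systems often normalize the vectors (cosine similarity) or use weights such as tf-idf; those refinements are outside this sketch.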



Probabilistic Models

The foundation of probabilistic models is to determine the probability that a given document satisfies a query. This model requires an initial hypothesis to establish the relevance, so there is no need to take into account the frequency of the terms [5, 7].

5.2.2 Alternative and Hybrid Models

Latent Semantic Indexing (LSI)

It is an extension of the Vector Space Model. LSI applies Singular Value Decomposition (SVD) to the matrix of documents and terms in order to reduce the dimension of the matrix and to mitigate problems of ambiguity and synonymy. Thus, LSI generates a latent semantic structure by modeling associations between terms and concepts of the documents. LSI is based on the assumption that terms used in similar contexts have similar meanings [5, 7, 8].
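The dimension reduction can be written compactly. The notation is an assumption introduced here and follows the usual textbook presentation rather than a specific formulation from [5, 7, 8]: A is the m × n term-by-document matrix, k is the number of retained singular values, q is the query vector in term space, and \hat{d}_j is the j-th row of V_k, that is, document j in the reduced space.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q, \qquad
\mathrm{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Projecting the query with the same matrices used for the documents places both in the same k-dimensional space, so documents can be ranked even when they share no literal term with the query.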

Latent Dirichlet Allocation (LDA)

It is a Bayesian network method in which the distribution of terms in each document is generated by a Dirichlet probability distribution. The documents are modeled using a Dirichlet random variable which assumes a set of predefined categories. Each category is a multinomial distribution over the terms. The probability of a document is given by the likelihood that its words appear in a category [8, 9].

5.2.3 Web Models

Web models quantify the importance of web pages using link analysis, search, and visualization, among others. Link analysis is based on the structure of the Web, which is seen as a directed graph of pages and links.



Page Rank

This technique analyzes the links on each web page and assigns a value to the page through a probability distribution, which depends on the number and importance of the links that the page has. Then, given a query made by the user, a number of features, such as proximity to the terms, are combined with the page rank of the web page to return the most significant pages for the query [5, 7].
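The link-based value can be approximated with a power iteration, sketched below in Python. The damping factor of 0.85, the fixed number of iterations, and the sample graph are assumptions chosen for illustration; the sketch covers only the link analysis, not the later combination with query features.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}.

    Every link target is assumed to appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

The ranks always sum to one, so they can be read as the probability distribution mentioned in the text.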

Hypertext Induced Topic Selection (HITS)

This technique is based on two indicators to assess and rank the importance of a Web page. The first indicator is the authority, which is based on the hub values of the pages that link to it. The second indicator, the hub, is based on the authority values of the pages that it links to, which means that it takes into account the quality of the information that would be obtained by following its links to other pages. A hub value is high when the page links to many pages with high authority scores [5, 7].
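The mutual definition of the two indicators leads naturally to an iterative computation. The following Python sketch is one common way to do it, assuming a small hypothetical link graph, a fixed number of iterations, and Euclidean normalization after each update.

```python
import math


def hits(links, iterations=20):
    """Return (authority, hub) scores for a dict {page: [pages it links to]}."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of a page: sum of the hub scores of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub of a page: sum of the authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub


links = {"portal": ["docs", "api"], "blog": ["docs"], "docs": [], "api": []}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this graph the page named docs receives links from both hubs and becomes the strongest authority, while portal, which links to both authorities, becomes the strongest hub.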

5.3 SOFTWARE EVOLUTION ACTIVITIES

The concept of software evolution arose with software methodologies such as Evo [10], Spiral [11], and the Rajlich Staged Model [2]. Additionally, the high rate of adoption of agile methodologies and the decline of the waterfall process have strengthened the concept of evolution as an essential feature of software, which must be embraced by the development process.

Activities in Software Evolution can be grouped into Incremental Change (IC), Software Comprehension, Mining Software Repositories (MSR), Software Visualization, Reverse Engineering & Reengineering, and Refactoring. Each of them is described below.

5.3.1 Incremental Change

It is the foundation of software evolution and the manifestation of the volatility of requirements in software development processes. IC starts with a change request (new feature, bug, enhancement) and ends with the implementation of the request in the existing code. According to [2, 12], IC includes the following activities:

• Concept extraction. Change requests are described using domain concepts and can be written as user stories, feature lists, or free text specifications. Thus, concept extraction consists in extracting the relevant domain concepts from change request specifications.

• Concept location. It consists in locating the places where the relevant domain concepts are implemented in the existing code. These places are the candidates where the change request can be implemented.

• Impact analysis. It consists in identifying the set of classes which can be affected by implementing the change request.

• Prefactoring. It is an opportunistic refactoring which is aimed at preparing the architecture for the implementation of the change request.

• Actualization and change propagation. It consists in implementing the change request and fixing inconsistencies or bugs.

• Postfactoring. It is aimed at removing “bad smells” which could have been injected by the change implementation.


5.3.2 Software Comprehension

It is the process by which developers understand software artifacts using domain, semantic, and syntactic knowledge, in order to build a mental model of the software and its relationship with the environment. The software comprehension process includes building models from software artifacts and cognitive processes of the stakeholders. The tasks associated with software comprehension are:

• Summarization of software artifacts.

• Software categorization.

• Traceability recovery between requirements specifications and source code.



For further information refer to [13] and [14].

5.3.3 Mining Software Repositories

The aim of MSR is to analyze software repositories (artifacts, issue reports) in order to extract relevant information for learning, understanding, modeling, and managing evolution in software systems. The most important MSR tasks are:

• Clone detection in code.

• Analysis of developers' contributions.

• Traceability analysis between change implementations and change requests.

• Automatic assignment of tasks.

• Software defect prediction.

• Analysis of patterns in the code and in the process.

See [15] for further information on MSR.

5.3.4 Software Visualization

Visualization is the process of transforming information into visual representations in order to improve its comprehension; software visualization consists in building visual representations of several software aspects by using metaphors. For example, Code City is a 2D metaphor in which several software metrics are visualized as a city by using poly cylinders and containers [16].

5.3.5 Reverse Engineering & Reengineering

Reengineering is the analysis and modification of software systems in order to produce a new implementation [17]. The process includes a stage of reverse engineering, which provides models of the systems. Reengineering is usually applied to legacy systems for migration purposes, such as changing the database engine, programming language, or architecture.



5.3.6 Refactoring

It consists in applying transformations to the code to improve the internal structure of software, preserving features and external behavior.

5.4 INFORMATION RETRIEVAL AND SOFTWARE EVOLUTION

Figure 5.3 shows the software evolution activities that use information retrieval techniques. These activities are explained below.

[Figure: Information Retrieval and Software Evolution. Boxes: Concept/Feature Location; Mining Software Repositories (MSR); Identification and Elimination of Clones; Static Source Code Analysis; Other Applications; Automatic Categorization of Source Code Repositories; Summarization of Software Artifacts; Traceability Recovery.]

Figure 5.3 Information retrieval and software evolution activities.

5.4.1 Concept/Feature Location

Methods for concept/feature location based on IR share the following general workflow:

1. Preprocess the source code and documentation.

2. Create the corpus from the preprocessed source code and documentation.

3. Index the corpus and create an intermediate representation. This representation is modeled as a relationship between terms (variables, methods, comments) and documents (class files) of the corpus. The indexing process includes the decomposition of the entire search space into documents; depending on the chosen granularity and the kind of language, each document can represent a package, a class, or a method.

4. Formulate the query based on words that represent the concepts/features. The query is formulated manually or assisted by the model.

5. Run the query on the corpus.

6. Retrieve the results and display them in a ranked list.
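The six steps above can be sketched end to end in Python. This is a minimal illustration under several assumptions: each document is a method, the intermediate representation is a tf-idf vector space (the works cited below use LSI instead), and the corpus, the method names, and the helper functions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Steps 1-2: split identifiers on camelCase boundaries and lowercase them."""
    words = re.findall(r"[A-Za-z][a-z]*", text)
    return [w.lower() for w in words if len(w) > 1]


def build_index(corpus):
    """Step 3: build one tf-idf vector per document ({name: source text})."""
    counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    n = len(corpus)
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = {name: {t: k * idf[t] for t, k in c.items()}
               for name, c in counts.items()}
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def locate(query, corpus):
    """Steps 4-6: run the query and return (document, score) pairs, best first."""
    vectors, idf = build_index(corpus)
    q = {t: k * idf.get(t, 0.0) for t, k in Counter(tokenize(query)).items()}
    ranked = sorted(((cosine(q, v), name) for name, v in vectors.items()),
                    reverse=True)
    return [(name, score) for score, name in ranked]


corpus = {
    "Account.withdraw": "void withdraw(double amount) { balance -= amount; } // withdraw money",
    "Account.deposit": "void deposit(double amount) { balance += amount; }",
    "Report.print": "void printReport() { // print account summary }",
}
for name, score in locate("withdraw money", corpus):
    print(name, round(score, 3))
```

The developer then inspects the top of the list; changing the granularity only changes what is stored as a document in the corpus.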

In [18], LSI is used for concept location. The corpus is built from the identifiers and the comments in the code. In this case, each function and each block of code external to the functions are represented as documents.

[19] extends the model in [18] by building a lattice of concepts from the list of relevant documents generated by LSI. The lattice is built using Formal Concept Analysis (FCA) and is used to select the most relevant attributes of the documents.

In [20], LSI and SPR (Scenarios Probabilistic Ranking) are used for feature location; each technique generates a list of relevant documents, as if the two were experts performing the evaluation independently. Finally, the two lists are combined by a linear transformation function and a translation function, to display a ranked list with the location.

In [21], the authors present the state of the art of concept location techniques based on information retrieval and additionally present a new model, called the cognitive mapping technique, in which the location is based on an expansion of the query. The query expansion is based on the flows and the co-occurrence of information between artifacts.

5.4.2 Mining Software Repositories (MSR)

IR techniques are used in MSR for several tasks, such as clone detection, clone removal, and static analysis.



Clones Detection and Removal

The general process for clone detection based on IR techniques is:

1. Preprocess the code of interest and split it into units of source code (preprocessing).

2. Build an intermediate representation of the code through the application of mining and processing techniques (transformation).

3. Compare the transformed code units in order to find similarities between them and to detect the clones (clone detection).

4. Map the clones to the original code (mapping).

5. Display the clones so that the user can filter out false positives (filtering).

6. Cluster the clones into classes or families, to reduce the amount of data and facilitate the process of analysis (clustering).
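The transformation and detection steps can be illustrated with a small Python sketch. It is a deliberately simplified stand-in: each code unit is reduced to its set of word tokens and pairs are compared with the Jaccard coefficient, whereas the work discussed next applies LSI to identifiers and comments. The unit names, the sample code, and the threshold of 0.7 are assumptions.

```python
import re
from itertools import combinations


def identifiers(code):
    """Transformation: reduce a code unit to its set of lowercase word tokens."""
    return {w.lower() for w in re.findall(r"[A-Za-z_]\w*", code)}


def jaccard(a, b):
    """Similarity between two token sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_clones(units, threshold=0.7):
    """Detection: return the pairs of unit names that reach the threshold."""
    reps = {name: identifiers(code) for name, code in units.items()}
    return [(x, y) for x, y in combinations(sorted(reps), 2)
            if jaccard(reps[x], reps[y]) >= threshold]


units = {
    "sum_a": "total = 0\nfor item in items: total += item.price",
    "sum_b": "total = 0\nfor item in orders: total += item.price",
    "log": "print(message)",
}
print(candidate_clones(units))
```

The returned pairs are only candidates; the mapping, filtering, and clustering steps still decide which of them are reported as clones.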

In [22], LSI is applied in the transformation stage in order to find similarities between code segments. This work is limited to the comparison of comments and identifiers, returning two pieces of code as potential clones, or a cluster of potential clones, when there is a high degree of similarity between their sets of identifiers and comments. [23] shows the benefits of grouping clone classes using LSI to improve their comprehension (relations among the classes of clones). [24] presents a comprehensive qualitative comparison of techniques and tools for clone detection. [25] argues for the need to include information retrieval techniques in MSR research.

Static Analysis

In [26], the vector space model and machine learning techniques are applied to free text records (logs) to identify performance problems in the code, without manual intervention. The methodology proposed in [26] follows the steps listed below:



1. Find messages inside the log that present particular scenarios (e.g., those that have been reported many times, or that have many different values and appear in multiple types of messages).

2. Group the messages based on the values of the variables from the previous step.

3. For each group of messages, create a vector in which the number of occurrences of each message in the group is counted.

Other Applications

Other applications that involve IR and MSR are presented in [15] and [27]. In [27], software changes are classified automatically based on their descriptions, using data available in version control systems to discover qualitative and quantitative information on several characteristics of software development.

[28] proposes a method to predict the impact of change requests on source code. In this case, the IR probabilistic model is used for linking the descriptions of change requests with the set of historical revisions of source code affected by similar change requests made previously.

[29] and [30] propose a tool that uses the Vector Space Model to infer (as an implicit memory) the possible relationships between objects stored in a project, and to recommend relevant artifacts to developers who work on a certain task.

[31] proposes a technique that evaluates and recommends open source applications whose implementations contain relevant features, based on their functionality. To do this, it combines probabilistic reordering techniques and program analysis. In [32], the authors focus on the use of version control to estimate the association between the changes of software modules, using the probabilistic model.

[33] presents a model in which the LSI technique is used to extract the vocabulary of the source code and to generate automatic suggestions of developers who have more experience handling error reports.

In [34], data mining techniques and the probabilistic model are combined to analyze the version history and then suggest additional changes that must be made. The objective of the analysis is to prevent errors due to incomplete changes by detecting coupling between elements that is not detected by program analysis. [35] uses LDA in order to find bugs automatically. [36] presents a work based on LDA for automatic categorization of software systems implemented in different programming languages.

5.4.3 Automatic Categorization of Source Code Repositories

Automatic categorization of software is achieved using IR through an overall process like this:

1. Define the categories, manually or automatically using IR.

2. Index the corpus, which consists in assigning the projects in the repository to the categories.

3. Build the intermediate representation.

4. Categorize the new project using the similarity between the new project and the elements in the intermediate representation. Thus, the category for the new project is that of the most similar elements in the corpus.
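Step 4 amounts to a nearest-neighbor decision, sketched below in Python. The categories, the sample term lists, and the function names are hypothetical; the intermediate representation is assumed to be a plain bag of words compared with cosine similarity, which is simpler than the LSA and LDA representations of the works discussed next.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words representation of a project description or term list."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def categorize(new_project, labeled):
    """Return the category of the most similar labeled item.

    `labeled` is a list of (category, text) pairs that are already indexed.
    """
    target = vector(new_project)
    best = max(labeled, key=lambda item: cosine(target, vector(item[1])))
    return best[0]


labeled = [
    ("database", "sql query table index transaction"),
    ("graphics", "render pixel shader texture image"),
    ("network", "socket packet http request server"),
]
print(categorize("http server request handler socket", labeled))
```

A variant that supports multiple categorization would return every category whose similarity exceeds a threshold instead of only the best one.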

In [37], the problem of categorization consists in indexing the components of a code library so that they can be grouped by similar characteristics. The indexing is performed by generating a profile of each component, extracting the representative characteristics from the available documentation (manuals and comments). Each profile is a list of binary lexical relations between the words in the text that carry the greatest amount of information. According to the retrieval scheme selected by the user, retrieval can be done with a classical vector space model or with a hierarchy of clusters of profiles.

[38] proposes a tool called MUDABlue, which allows automatic and multiple categorization by analyzing the application source code. An application is modeled as a document, and the identifiers in the code as words in the document. The categories are modeled as clusters and are defined automatically from the code, using LSA over the identifiers and then grouping them using clustering. The similarity measure used is the cosine similarity. The identifier of each category is composed of the ten most representative terms of each cluster.

In [39], the identifiers are preprocessed to obtain readable names for the categories. The categories are built using the identifiers and the comments in the code. The intermediate representation is built with LDA, and the categories are grouped into clusters using the cosine similarity measure, as in the case of MUDABlue [38].

5.4.4 Summarization of Software Artifacts

Summaries can be generated from the documentation or from the code. Additionally, they can be general summaries or summaries that are relevant to a query made by the user. Generating a summary that is relevant to a query is the same task as text retrieval and is basically an extraction of relevant words or phrases (as in the case of automatic categorization). In the case of general summaries, artifacts are analyzed without a user query that leads the process. The general process for generating summaries by using IR is:

1. Decompose documents into sentences or paragraphs (preprocessing).

2. Build the matrix of word occurrences per sentence and per document, and then apply transformations such as SVD or another IR technique (intermediate representation).

3. Calculate the relevance of each sentence or paragraph to the document, and construct the summary from the sentences or paragraphs with the highest relevance. If the summary must be relevant to a query, calculate the relevance of each sentence or paragraph against the query instead (summary generation).
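The three steps above can be sketched as a small extractive summarizer. This is only an illustration under stated assumptions: it scores sentences with cosine similarity over raw term counts and omits the SVD transformation mentioned in step 2; the function names (`tokenize`, `summarize`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def summarize(document, n=1, query=None):
    """Return the n most relevant sentences, in their original order."""
    # Step 1: decompose the document into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document)
                 if s.strip()]
    # Step 2: build one term vector per sentence (no SVD in this sketch).
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Step 3: score against the query if given, else against the document.
    target = Counter(tokenize(query if query is not None else document))
    scores = [cosine(v, target) for v in vectors]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


text = ("The parser reads the file. The parser builds the tree. "
        "Users like coffee.")
print(summarize(text, n=2))
print(summarize(text, n=1, query="coffee"))
```

The same function covers both cases described in the text: without a query it produces a general summary, and with a query it ranks sentences by their relevance to that query.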

A model that creates general summaries of text using LSI is presented in [40]. The approaches in [41] and [42] build summaries from source code.


5.4.5 Traceability Recovery

The general approach to traceability recovery using IR applies several processes to the documents and the source code before analyzing their similarity. Free-text documents are indexed by a vocabulary extracted from the documents themselves, using the following preprocessing steps:

1. Apply stemming and remove stop-words.

2. Convert all capital letters to lowercase.

3. Convert plural words to singular, and verbs to the infinitive.

The query is built from the identifiers in the source code. The steps for building the query are:

1. Split composite identifiers.

2. Apply the same preprocessing used for free-text documents.

After the preprocessing steps, the classifier computes the similarity between queries and documents, and returns a ranked list of documents for each source code component.
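The preprocessing and ranking steps described above can be sketched as follows. The sketch makes several simplifying assumptions that do not come from the cited works: a tiny hypothetical stop-word list, a deliberately crude `naive_stem` that only strips a final "s" (a real system would use a proper stemmer), and cosine similarity over raw term counts as the classifier.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "for", "in"}


def split_identifier(identifier):
    """Split composite identifiers (camelCase, snake_case) into words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [p.lower() for p in parts]


def naive_stem(word):
    """Crude plural stripping; stands in for real stemming."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(words):
    """Lowercase, remove stop-words, and stem."""
    lowered = (w.lower() for w in words)
    return [naive_stem(w) for w in lowered if w not in STOP_WORDS]


def document_terms(text):
    """Index a free-text document by its preprocessed vocabulary."""
    return Counter(preprocess(re.findall(r"[A-Za-z]+", text)))


def query_terms(identifiers):
    """Build the query from the identifiers of a source code component."""
    words = []
    for identifier in identifiers:
        words.extend(split_identifier(identifier))
    return Counter(preprocess(words))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(identifiers, documents):
    """Return document names ordered by similarity to the component."""
    query = query_terms(identifiers)
    scored = [(cosine(query, document_terms(text)), name)
              for name, text in documents.items()]
    return [name for _, name in sorted(scored, key=lambda p: p[0],
                                       reverse=True)]


docs = {
    "login": "The user enters a password.",
    "billing": "The invoices are sent to the customer.",
}
print(rank_documents(["sendInvoice", "customerAccount"], docs))
```

For a component whose identifiers mention invoices and customers, the billing document is ranked first, which is the candidate traceability link an analyst would then confirm or reject.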

In [43], the vector space model and probabilistic models are used to recover traceability links between source code and free-text documents. In [8], the authors extend the vector space model and compare its performance with LSI and the probabilistic model. In [44], several variants of the vector space model are applied to trace requirements to UML artifacts, source code, and test cases. In [45], LSI is used to recover traceability links between free-text documents and source code. In [46] and [47], traceability recovery is performed between different types of artifacts (interaction diagrams, test cases, and use cases, among others) using LSI. In [48], the performance of dynamic requirements traceability is improved by incorporating three strategies into the probabilistic retrieval algorithm. Finally, [49] analyzes the equivalence of several traceability recovery methods; the techniques analyzed were Jensen-Shannon, the vector space model, LSI, and LDA.


5.5 SUMMARY

The aim of software evolution tasks is to support the software development process by improving the mechanisms for software comprehension, change implementation, maintenance, and task assignment in development teams. For example, analyzing the explicit and implicit semantics of source code is a cross-cutting challenge for software evolution tasks, and it is one of the classic problems in Information Retrieval.

Using IR techniques to support software evolution tasks has contributed to the development of research fields in software evolution. The general Information Retrieval model has been applied naturally to software evolution tasks (except for Software Visualization). How IR techniques are applied differs according to how the query is used in the task. For example, in concept location, the query is defined by the user or generated automatically with the assistance of the model; in clone detection, the query is built automatically from software artifacts.

IR and Software Evolution have been integrated successfully because software artifacts can be treated as documents. Software evolution tasks are complex for machine learning techniques, but if software artifacts are documents, intermediate representations can be generated, and IR techniques can then be applied to extract information through queries that represent the user's needs. IR has provided software evolution with a framework for resolving issues related to its tasks and to the nature of its research problems.

Despite the use of IR techniques in software evolution, some tasks, such as refactoring, software visualization, and some MSR tasks, are not yet supported by IR. In contrast, tasks such as incremental change, software categorization, and summarization have made wide use of several IR models.


REFERENCES

[1] Rajlich, V. Changing the Paradigm of Software Engineering. In: Communications of the ACM: 2006.

[2] Rajlich, V. & Bennett, K. A Staged Model for the Software Life Cycle. In: Computer, Vol. 33, Issue 7: 2000.

[3] Baeza-Yates, R. & Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley Longman: 1999.

[4] Baeza-Yates, R. Information Retrieval in the Web: Beyond Current Search Engines. International Journal on Approximated Reasoning, 34: 2003; 97 – 104.

[5] Dominich, S. The Modern Algebra of Information Retrieval. Springer-Verlag Berlin Heidelberg: 2008.

[6] Grossman, D. & Frieder, O. Information Retrieval: Algorithms and Heuristics. In: Springer, Second edition: 2004.

[7] Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval. Cambridge University Press: 2009.

[8] Deerwester, S., Dumais, S. T., Furnas, G., Landauer, T. K. & Harshman, R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41: 1990; 391 – 407.

[9] Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 2003; 993 – 1022.

[10] Gilb, T. Evolutionary Development. ACM SIGSOFT Software Engineering Notes, 6: 1981; 17.

[11] Boehm, B. A spiral model of software development and enhancement. IEEE Computer, 21: 1988; 61 – 72.

[12] Febbraro, N. & Rajlich, V. The Role of Incremental Change in Agile Software Processes. In: Agile: 2007.


[13] Storey, M. A. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14: 2006; 187 – 208.

[14] O’Brien, M. Software Comprehension - A Review & Research Direction. Technical report, Department of Computer Science & Information Systems. University of Limerick: 2004.

[15] Kagdi, H., Collard, M. & Maletic, J. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice, 19: 2007; 77 – 131.

[16] Wettel, R. & Lanza, M. Program Comprehension through Software Habitability. In: Proceedings of the 15th IEEE International Conference on Program Comprehension: 2007.

[17] Chikofsky, E. J. & Cross, J. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1): 1990; 13 – 17.

[18] Marcus, A., Sergeyev, Rajlich, V. & Maletic, J. An Information Retrieval Approach to Concept Location in Source Code. In: 11th Working Conference on Reverse Engineering: 2004.

[19] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code. In: 15th IEEE International Conference on Program Comprehension: 2007.

[20] Poshyvanyk, D., Gueheneuc, Y., Marcus, A., Antoniol, G. & Rajlich, V. Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information. IEEE Transactions on Software Engineering, 33: 2007; 420 – 432.

[21] Cleary, B., Exton, C., Buckley, J. & English, M. An empirical analysis of information retrieval based concept location techniques in software comprehension. Empirical Software Engineering, 14: 2009; 93 – 130.


[22] Marcus, A. & Maletic, J. Identification of High-Level Concept Clones in Source Code. In: 16th IEEE International Conference on Automated Software Engineering: 2001.

[23] Tairas, R. & Gray, J. An Information Retrieval Process to Aid in the Analysis of Code Clones. In: Empirical Software Engineering: 2009.

[24] Roy, C., Cordy, J. & Koschke, R. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. In: Science of Computer Programming: 2009.

[25] Walenstein, A. & Lakhotia, A. Clone Detector Evaluation Can Be Improved: Ideas from Information Retrieval. In: Second International Workshop on the Detection of Software Clones: 2003.

[26] Wei, X., Huang, L., Fox, A., Patterson, D. & Jordan, M. Detecting large-scale system problems by mining console logs. In: ACM SIGOPS 22nd symposium on Operating Systems Principles: 2009.

[27] Mockus, A. & Votta, L. G. Identifying reasons for software changes using historic databases. In: 16th IEEE International Conference on Software Maintenance: 2000.

[28] Canfora, G. & Cerulo, L. Impact Analysis by Mining Software and Change Request Repositories. In: 11th IEEE International Symposium on Software Metrics: 2009.

[29] Cubranic, D. & Murphy, G. C. Hipikat: Recommending pertinent software development artifacts. In: 25th International Conference on Software Engineering: 2003.

[30] Cubranic, D., Murphy, G. C., Singer, J. & Booth, K. S. Hipikat: A project memory for software development. In: IEEE Transactions on Software Engineering: 2005.

[31] Grechanik, M. & Poshyvanyk, D. Evaluating recommended applications. In: International workshop on Recommendation systems for Software Engineering: 2008.


[32] Colaco, M., Mendonca, M. & Rodrigues, F. Mining Software Change History in an Industrial Environment. In: Brazilian Symposium on Software Engineering: 2009.

[33] Matter, D., Kuhn, A. & Nierstrasz, O. Assigning Bug Reports using a Vocabulary-Based Expertise Model of Developers. In: 6th IEEE International Working Conference on Mining Software Repositories: 2009.

[34] Zimmermann, T., Zeller, A., Weissgerber, P. & Diehl, S. Mining Version Histories to Guide Software Changes. IEEE Transactions on Software Engineering, 31: 2005; 429 – 445.

[35] Lukins, S. K., Kraft, N. A. & Etzkorn, L. H. Source Code Retrieval for Bug Localization Using Latent Dirichlet Allocation. In: Working Conference on Reverse Engineering: 2008.

[36] Tian, K., Revelle, M. & Poshyvanyk, D. Using Latent Dirichlet Allocation for automatic categorization of software. In: 6th IEEE International Working Conference on Mining Software Repositories: 2009.

[37] Maarek, Y. S., Berry, D. M. & Kaiser, G. E. An information retrieval approach for automatically constructing software libraries. IEEE Transactions in Software Engineering, 17: 1991; 800 – 813.

[38] Kawaguchi, S., Garg, P. K., Matushita, M. & Inoue, K. MUDABlue: An automatic categorization system for Open Source repositories. Journal of Systems and Software, 79: 2006; 939 – 953.

[39] Tian, K., Revelle, M. & Poshyvanyk, D. Using Latent Dirichlet Allocation for Automatic Software Categorization of Software. In: 6th IEEE Working Conference on Mining Software Repositories: 2009.

[40] Gong, Y. & Liu, X. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In: 24th annual international ACM SIGIR conference on Research and development in information retrieval: 2001.

[41] Haiduc, S., Aponte, J. & Marcus, A. Supporting program comprehension with source code summarization. In: 32nd ACM/IEEE International Conference on Software Engineering: 2010.

[42] Haiduc, S., Aponte, J., Moreno, L. & Marcus, A. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In: 17th Working Conference on Reverse Engineering: 2010.

[43] Antoniol, G., Canfora, G., Casazza, G., De Lucia, A. & Merlo, E. Recovering Traceability Links between Code and Documentation. In: IEEE Transactions on Software Engineering: 2002.

[44] Settimi, R., Cleland-Huang, J., Khadra, O. B., Mody, J., Lukasik, W. & DePalma, C. Supporting Software Evolution through Dynamically Retrieving Traces to UML Artifacts. In: 7th International Workshop on Principles of Software Evolution: 2004.

[45] Marcus, A. & Maletic, J. I. Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing. In: 25th International Conference on Software Engineering: 2003.

[46] Lucia, A. D., Fasano, F., Oliveto, R. & Tortora, G. Enhancing an Artefact Management System with Traceability Recovery Features. In: 20th IEEE International Conference on Software Maintenance: 2004.

[47] Lucia, A. D., Fasano, F., Oliveto, R. & Tortora, G. Recovering Traceability Links in Software Artifact Management System using Information Retrieval Methods. In: ACM Transactions on Software Engineering and Methodology: 2007.

[48] Cleland-Huang, J., Settimi, R., Duan, C. & Zou, X. Utilizing Supporting Evidence to Improve Dynamic Requirements Traceability. In: International Requirements Engineering Conference: 2005.

[49] Oliveto, R., Gethers, M., Poshyvanyk, D. & DeLucia, A. On the Equivalence of Information Retrieval Methods for Automated Traceability Link Recovery. In: IEEE 18th International Conference on Program Comprehension: 2010.

Reverse Engineering in Procedural Software Evolution

Óscar Chaparro
Fernando Cortés
Jairo Aponte

ABSTRACT

Software Engineering is much more than the development of new software; it also involves understanding, maintenance, re-engineering and quality assurance of software. In these and other processes, Reverse Engineering plays an important role, especially in legacy systems. In Software Understanding, Reverse Engineering is vital since it abstracts detailed information and presents structured knowledge of software to the user. In the same way, it allows performing maintenance activities in a more controlled way because it provides information to measure the impact of changes. Reverse Engineering also gives useful information about how a system is designed and allows assessing software from many perspectives, for example, to identify software clones or to measure attributes such as coupling or cohesion. This chapter presents an overview of some techniques in the field, an analysis of their applicability in an industrial procedural software system, and some future trends that will direct research in this discipline.

6.1 INTRODUCTION

Software Reverse Engineering is a field that consists of techniques, tools and processes for obtaining knowledge from software systems that have already been built. The main goal of this discipline is to retrieve information about how a software system is composed and how it behaves according to the relations among its components [1]. Reverse Engineering aims at creating representations of software in other forms or at higher levels of abstraction, independently of the implementation (e.g., visual metaphors or UML diagrams). Techniques in the field are employed for many purposes [1]: to reduce the complexity of software systems, to generate alternative views of software from several architectural perspectives, to retrieve information “hidden” as a consequence of a long evolution of systems, to detect side effects or unplanned design branches [2], to synthesize software, to facilitate the reuse of code and components, to assess the correctness and reliability of systems, and to locate and track defects or changes faster.

Reverse Engineering (RE) has become an important area of software engineering because it is necessary in Software Evolution. Software Evolution is an inevitable process, basically because most factors related to software (and technology) change [3]: business factors, such as business concepts, paradigms, and processes, and technology factors, such as hardware, software, and software engineering paradigms. RE is especially important for legacy systems, since they suffer degradation and have a long operational life, which leads to a loss of knowledge about how they are built. Generally, changes to this kind of software do happen but are not well documented. Instead, knowledge about changes is kept by people, but people are volatile: they leave projects and companies, and they forget easily. Therefore, knowledge needs to be extracted directly from the software (source code) and from its behavior (run-time information), which are the two main sources of information for RE. In this sense, the general problem is how to extract information or knowledge from software artifacts.

Reverse Engineering can be considered a more general field than Software Comprehension, since it covers more fields of action than code comprehension alone (e.g., RE is also used for fault localization), although comprehension is sometimes a secondary result of it. The philosophy of RE is the application of techniques to learn how software works, how it is designed, and how this design allows the software to behave the way it does. It is therefore perfectly natural for comprehension to appear in this process as an indirect consequence or as a motivation.


Since Reverse Engineering is vital for software development processes, this chapter presents an overview of some techniques in the area. The chapter also presents an analysis of the applicability of some techniques to an industrial procedural software system, as a first step in the development of an RE tool for Oracle Forms applications. The chapter is organized as follows: Section 6.2 reviews the needs, benefits and purposes of RE in the context of Software Understanding and Maintenance. Section 6.3 presents a review of some techniques in the field. An analysis of the applicability of techniques to an industrial Oracle Forms application is presented in Section 6.4. Then, Section 6.5 addresses the assessment of RE techniques and tools, since the results on this topic allow setting a common research and development environment in RE for future work. Finally, future trends in the field and conclusions are presented in Section 6.6.

6.2 REVERSE ENGINEERING CONCEPTS AND RELATIONSHIPS

The term Reverse Engineering has been used to refer to methods and tools related to understanding (or comprehending) and maintaining software systems. RE techniques have been used to examine systems, so RE has been a useful means of supporting Software Understanding and Maintenance.

6.2.1 Reverse Engineering and Software Comprehension

Software Comprehension is the process performed by an engineer or a developer to understand how a software system works internally. The understanding process involves comprehending the structure, the behavior and the context of operation of a program. Along with these attributes, an explanation of the problem domain relationships is required [4]. Understanding is one of the most important problems in Software Evolution; it is said that between 50% and 90% of the effort in maintenance stages is devoted to this task [5]. Commonly, system documentation is out of date, incorrect or even nonexistent, which increases the difficulty of understanding what the system does, how it works and why it is coded that way [6].


Several comprehension models and cognitive theories are reviewed by Storey in [7]. The main models, or at least the most common ones, are top-down and bottom-up. The top-down comprehension strategy is basically the mapping between previous system or domain knowledge and the code, through the formulation, verification and rejection of hypotheses. In terms of the implementation of an RE tool, this process typically includes rule matching to detect how code chunks achieve sub-goals within a specific feature or plan [8]. When performing automatic RE on legacy software, the bottom-up approach is commonly used because the top-down approach requires detailed knowledge about the “goals the program is supposed to achieve” [9]. In bottom-up understanding, software instructions are taken to form and infer logical and semantic groups, categories and goals. The automation of this approach is very complex [8, 9], and it is not expected to be solved by a single technique, because of the semantic gap between the code and the domain knowledge of a system. Additionally, developers need a variety of functionalities and information that a single technique or implementation cannot provide.

6.2.2 Reverse Engineering and Software Maintenance

Software Maintenance, in turn, is usually defined as the process performed on software after its delivery to production [10]. Common maintenance activities are the correction of defects, performance improvement, and software adaptation due to changes in requirements or in business rules. Software Maintenance is divided into four categories [11]: Corrective maintenance, Adaptive maintenance, Performance enhancement and Perfective maintenance.

Reverse Engineering comprises the first step in Software Maintenance: the examination process. The software is changed later; therefore, RE does not involve the changing process [10].

6.2.3 Reverse Engineering Concepts

Software Reverse Engineering was defined by Chikofsky et al. [1] as the process of analyzing a system to:

• Identify the system’s components and their inter-relationships, and

• Create representations of the system in another form or at a higher level of abstraction.

A discussion of this definition is developed in [12]. In that work, the authors state that this definition does not fit all techniques in the field; for example, program slicing does not recover a system’s components and relationships. The definition also does not specify what kinds of representations are considered or the context in which the process is executed, so the role of automation and the knowledge acquisition process are not clear. For this reason, the authors propose a more complete definition: “Reverse Engineering includes every method aimed at recovering knowledge about an existing software system in support to the execution of a software engineering task”.

The process of Reverse Engineering is divided into four phases [13]: context parsing, component analyzing, design recovering and design reconstructing. Figure 6.1 shows the whole process.

Asif [14] presents the elements involved in the Reverse Engineering process:

• Extraction at different levels of abstraction,

• Abstraction for scaling through more abstract representations,

• Presentation for supporting other processes such as maintenance, and

• User specification, allowing the user to manage the process, the mappings for the transformation of representations, and the software artifacts.

Khan et al. [2] define the general process of Reverse Engineering, which consists of four phases: data extraction from software artifacts, data processing and analysis, knowledge persistence in a repository, and presentation of knowledge. The authors also present some benefits and applications of Reverse Engineering. RE is used:

• To ensure system consistency and completeness with specification.

Figure 6.1 Reverse Engineering process according to [13]. Labels in the figure: Source code; Context Parsing; Component Analyzing; Design Recovery; Design Reconstruction; Intermediary Representation; Format; Component Specification; Component Interrelationship; System Architecture; System Requirement; Design Model; Domain Knowledge; Documentation; Expert Knowledge.

• To support verification and validation phases.

• To assess the correctness and reliability of the system in the development phase, before it is delivered.

• To trace down software defects.

• To evaluate the impact of a change in the software (for estimating and controlling the maintenance process).

• To facilitate understanding by allowing the user to navigate through software in a graphical way (software visualization is an important aspect of RE; this topic is detailed in Chapter 3).

• To provide more speed in Software Maintenance and Understanding. A requirement of an RE tool is the fast and correct generation of cross-reference information and different representations of software. Section 6.5.2 reviews some features of tools.

• To measure re-usability through pattern identification.

RE embraces a broad range of techniques, from simple ones, such as call graph extraction, to more elaborate ones, such as architecture recovery. The trend is towards more sophisticated and automatic methods. The next section presents some techniques in the field.

6.3 TECHNIQUES IN REVERSE ENGINEERING

Methods or techniques (terms used interchangeably in this chapter) in Reverse Engineering are automatic solutions to one or more problems in the field [12]. This section provides a partial review of techniques, divided into two categories: standard and specialized techniques.

6.3.1 Standard Techniques

Standard techniques include mostly basic descriptive techniques, such as dependency models or structural diagrams. The objective of standard techniques is to obtain the system structure and dependencies at different levels of abstraction, by applying basic source code analysis and Abstract Syntax Tree (AST)¹ processing. According to [2], an RE tool typically provides the following views:

• Module charts: They present relationships between system components. A module is a group of software units based on a criterion. Theoretically, modules must have a well-defined function or purpose.

• Structure charts: In a general sense, they present software objects categorized and linked by some kind of relationship, generally method calls. According to [15]², this kind of diagram also shows data interfaces between modules. Examples of structure charts are entity-relationship models and class diagrams. Module charts are, in fact, structure charts.

• Call graphs: They describe calls and dependencies between objects at different levels of granularity. These diagrams are built by analyzing the AST of the code. The level of granularity is set (for instance, functions, methods, classes or variables), and then the AST is traversed to find the usage of objects. Call graphs are important for change propagation analysis.

• Control-flow diagrams: At low and medium levels of granularity, they present the system's execution flow, in which control structures (e.g., IF or FOR / WHILE) guide the flow. Another diagram of this type is the Control Structure Diagram (CSD) [16], in which source code is enriched with several graphical constructs called “CSD program components/units” to improve its comprehensibility.

• Data-flow diagrams: They are graphs that show the flow of data (in parameters and variables) through functionalities, modules, or functional processes [15].

• Global and local data structures, and parameter lists: These allow going to a fine-grained level of software.

¹ An AST is a tree representation of the syntactic structure of source code. For example, AST View is an Eclipse plug-in for visualizing the AST of Java programs: http://www.eclipse.org/jdt/ui/astview/

² See Chapter 15 of [15]: Additional Modeling Tools.

The automatic extraction of these diagrams involves several operational tasks on the code and its AST. For example, in data-flow diagrams, the transformation and the storing of data must be obtained; in this case, every parameter needs to be tracked between and inside methods or functions to know which operations include it and how it is used. Before this process, modules need to be determined. In addition, some of these diagrams are the source of information for techniques such as dependency analysis of code and data, and the evaluation of the impact of changes in code (see Impact Analysis in Chapter 4).
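As an illustration of AST-based extraction, the following minimal sketch builds a function-level call graph and uses it for a simple change propagation query. It relies on Python's standard `ast` module as a stand-in for the parsers an RE tool would use; the function names (`build_call_graph`, `impacted_by`) are hypothetical, and only direct calls by simple name from top-level functions are resolved (method calls and dynamic dispatch are ignored).

```python
import ast


def build_call_graph(source):
    """Map each top-level function to the set of simple names it calls."""
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            # Traverse the function's subtree looking for call expressions.
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)):
                    callees.add(inner.func.id)
            graph[node.name] = callees
    return graph


def impacted_by(graph, changed):
    """Functions that directly or transitively call `changed`."""
    impacted = set()
    frontier = [changed]
    while frontier:
        target = frontier.pop()
        for caller, callees in graph.items():
            if target in callees and caller not in impacted:
                impacted.add(caller)
                frontier.append(caller)
    return impacted


SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    print(parse("data.txt"))
'''

graph = build_call_graph(SOURCE)
print(graph)
print(impacted_by(graph, "load"))
```

Setting the granularity to functions keeps the sketch short; the same traversal could collect classes or variables instead, as the call graph description above suggests.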

6.3.2 Specialized Techniques

Other techniques, called specialized techniques in this chapter, include operations on software artifacts that are intended not only to describe software but also to extract knowledge from it. Business rules extraction [17, 18], natural text processing [18], execution trace processing [19, 20], pattern extraction [21], module extraction [22, 23], and programming plan matching using artificial intelligence techniques [9] are some examples. Some of them are described briefly in this section.

Business Rules Extraction

Business rules extraction refers to the discovery of important conditions that trigger business actions. The key is to find those conditions and translate them into the business domain. Recovering business rules supports the preservation of legacy business assets, allows optimizing the business model, and assists system forward engineering [17, 18]. The process of business rules extraction is depicted in Figure 6.2³.

Figure 6.2 Business rules knowledge through abstraction levels.

In [17] and [18], the authors address business rules extraction from COBOL systems through the analysis of the AST and of some particular constructs of the COBOL language. The authors define a format to express business rules: <conditions> <actions>, where <conditions> are boolean expressions and <actions> are “action expressions” that are executed only if the conditions are true. This format is based on the definition of production rules in the specification of SBVR (Semantics of Business Vocabulary and Business Rules) [24].

³ Picture based on the image located at http://blog.erikputrycz.net/projects/business-rules-extraction/ (February, 2011).

The procedure proposed to extract business rules is the following: first, the branching and calculation statements are identified; second, the context of those statements is constructed. The branching statements contain the CALL and PERFORM keywords; the first executes external programs and the second transfers the control flow to a paragraph⁴. The context of a statement is the set of all conditions (local and global) under which the branching and calculation operations happen.

The key point of their work is that not only production rules are retrieved, but also comments, identifiers, assignments, code blocks, business rule dependencies, and exceptions. CALL and PERFORM statements and the analysis of local and global conditions define a context, while comments, identifiers, and other “artifacts” allow assigning semantic information to that context.
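The idea of pairing each action with the conditions that form its context can be illustrated with a small sketch. It is not the procedure of [17] or [18], which targets COBOL: here Python source and the standard `ast` module are used as a stand-in, only nested `if` statements at module level are followed, and the function name `extract_rules` is hypothetical. `ast.unparse` requires Python 3.9 or later.

```python
import ast


def extract_rules(source):
    """Return (conditions, action) pairs: each assignment or call statement
    together with the conditions of the `if` statements that enclose it."""
    rules = []

    def visit(statements, context):
        for stmt in statements:
            if isinstance(stmt, ast.If):
                condition = ast.unparse(stmt.test)
                # The body runs when the condition holds ...
                visit(stmt.body, context + [condition])
                # ... and the else branch when it does not.
                visit(stmt.orelse, context + [f"not ({condition})"])
            elif isinstance(stmt, (ast.Assign, ast.AugAssign, ast.Expr)):
                # Statements without any enclosing condition are skipped.
                if context:
                    rules.append((tuple(context), ast.unparse(stmt)))

    visit(ast.parse(source).body, [])
    return rules


SOURCE = '''
if customer_level >= 2:
    if total > 100:
        discount = total * 0.1
    else:
        discount = 0
notify(customer)
'''

for conditions, action in extract_rules(SOURCE):
    print(" and ".join(conditions), "->", action)
```

Each printed line has the conditions-then-actions shape described above; assigning business meaning to names such as `discount` would still require the comments, identifiers, and other artifacts mentioned in the text.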

According to Baxter et al. [25], however, automated extraction of business rules from code can only be heuristic, because business vocabulary and business rules are independent of the implementation and are not present in the code. As they say, code simply suggests or hints at business rules, so these clues in code (code fragments, operations and functions, error messages, program comments, etc.) are extracted to form approximations of business rules. Of course, some rules will be missed or incorrect, so the problem is how to get better approximations. Other approaches for extracting business rules are slicing criterion identification and program slicing [26–28], data analysis [27, 28] and text processing from documents [29].

Programming Plans Matching

One problem in Reverse Engineering is how to find the meaning of code. Automating this process almost always requires analyzing code identifiers and extracting concepts. However, another approach is to match chunks of code against programming plans stored in a repository [8, 9]. A programming plan is a “design element in terms of common implementation patterns” [8]. A plan can be considered a programming pattern or template, and can be generic or domain-specific. Some examples of plans are READ-PROCESS-LOOP and READ-EMPLOYEE-INFO-CALCULATE-SALARY. The first is a generic plan that means “reading input values and perform some actions to each value” at the implementation level. The second is a domain-specific plan that specializes the first one: at the implementation level it is almost the same as the first plan, but it carries a more semantic stereotype. This means that the repository holds a set of generic and specialized plans organized in a hierarchy. In [8], the author states that recognizing programming plans against the AST information of the code is cheaper, in terms of search cost, if the plan library is highly organized, that is, if each plan has indexing and specialization/implication links to other plans⁵.

⁴ A paragraph in COBOL is a sequence of statements.

However, the matching task is an NP problem, so the process is computationally expensive [9]. The work presented in [9] addresses this problem by applying artificial intelligence techniques. The approach is a two-step process. The first step runs a Genetic Algorithm to perform an initial filtering of the plan library, based on “relaxed” matching between code chunks [30] and the programming plans stored in the library (the repository). The second step uses Fuzzy Logic to perform a deeper matching. The output of the whole approach is a set of programming plans ranked by their similarity to a chunk of code.

In summary, the objective of these works is to find programming plans similar to a portion of code. Programming plans are stereotypical patterns of code (generic or domain-specific); therefore, once the matching process has been performed, it is possible to assign high-level concepts to programs.

Execution Traces Analysis

As a complement to Static Analysis in RE, the processing of run-time information is commonly used in what is called Dynamic Analysis. Perhaps the most common source is the system execution trace.

⁵ According to Quilici [8], a plan consists of inputs, components, constraints, indexes, and implications.



Execution trace analysis refers to processing execution traces to find patterns that have a specific function. The advantage of traces is that they show the portions of software that are executed in a specific execution scenario. In this way, the search space is smaller than the one in static analysis because only the executed portions of code are considered.

Two problems arise in dynamic processing: first, knowledge of the system is required to perform this analysis, and second, dynamic analysis produces huge amounts of information (long traces). The former problem refers to the fact that it is not possible to capture the entire (infinite) execution domain of a system. If there is no knowledge about how the system works, using and executing all its functionalities is not possible. If it is not necessary to know about the entire system, and knowledge about the use of a specific functionality actually exists, this is not a problem. The latter problem depends on how the software is built at a low level, how the code is instrumented⁶ and how much information the user needs. Other problems related to dynamic analysis are low performance, high storage requirements and cognitive load on humans [4].

Object-oriented software has been the most common object of study in trace analysis. For example, in [20], the problem of identifying clusters of classes is addressed with a technique that reduces the “noise” in execution traces by detecting what the authors call “temporally omnipresent elements”, which represent execution units (groups of statements) distributed throughout the trace. In this sense, noise represents information that is not specific to a behavior of interest. For this, samples of a trace are taken and the distribution of each element along the samples is calculated through a measure of temporal occurrence. Dynamic correlation, in turn, is used to cluster elements. Two elements are dynamically correlated if they appear in the same samples, so the measure of correlation is based on the number of samples in which they occur. The clustering part takes all elements whose correlation is higher than a fixed threshold, thus grouping elements into components.
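The sampling, noise filtering and clustering steps just described can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [20]: the function names, the fixed-size sampling, the occurrence ratio used as "temporal occurrence", and the Jaccard-style correlation are all hypothetical choices made for the example.

```python
from itertools import combinations


def split_into_samples(trace, sample_size):
    """Cut a trace (list of element names) into consecutive samples (sets)."""
    return [set(trace[i:i + sample_size])
            for i in range(0, len(trace), sample_size)]


def temporal_occurrence(samples):
    """Fraction of samples in which each element appears (assumed measure)."""
    elements = set().union(*samples)
    return {e: sum(e in s for s in samples) / len(samples) for e in elements}


def remove_omnipresent(samples, max_occurrence):
    """Drop elements that occur in more than max_occurrence of the samples."""
    occurrence = temporal_occurrence(samples)
    noise = {e for e, ratio in occurrence.items() if ratio > max_occurrence}
    return [s - noise for s in samples], noise


def correlation(a, b, samples):
    """Assumed measure: samples containing both / samples containing either."""
    both = sum(a in s and b in s for s in samples)
    either = sum(a in s or b in s for s in samples)
    return both / either if either else 0.0


def cluster(samples, threshold):
    """Group elements linked by a correlation above the threshold."""
    elements = sorted(set().union(*samples))
    parent = {e: e for e in elements}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in combinations(elements, 2):
        if correlation(a, b, samples) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for e in elements:
        groups.setdefault(find(e), set()).add(e)
    return sorted(sorted(g) for g in groups.values())
```

In this sketch an element that shows up in almost every sample (a logging routine, for instance) is treated as noise and removed before the remaining elements are grouped; both thresholds would have to be tuned empirically.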

⁶ Code instrumentation refers to the use of software tools or additional portions of code in the system, through which execution traces and behavior information of the system are gathered.



In summary, the noise is removed from the traces and then clustering is applied to the filtered traces. That work presents an industrial experiment with the approach. The system under study was a two-tier client-server application: the client was a Visual Basic 6 application of 240 KLOC and the server comprised 90 KLOC of Oracle PL/SQL (Procedural Language/Structured Query Language) code. The client code was instrumented since no trace generation environment was found, and in the case of the server, Oracle tracing functions were used. The system was executed over a use case, producing a trace of 26,000 calls.

Similarly, the authors in [19] define a “utility element” as any element in a program that is accessed from multiple places within a scope of the program. Their objective is to remove these utility elements or classes from the analysis. This is achieved by calculating, for each class, the proportion of classes that call it (fan-in analysis), iteratively reducing the analysis scope or applying the technique to a set of packages. The utility-hood metric, U, of a class C is defined as

U = |IN| / (|S| − 1)

where S is the set of classes considered in the analysis and IN is the subset of classes that use C. Besides this metric, the standard score (z-score) was considered to determine possible utility classes: classes with large, positive z-score values are possible utilities. Once the filtering is performed, the components are depicted by a tool that generates Use Case Maps⁷. In this latter step, calls between classes and conditions of execution (control-flow statements) are considered.
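As a small worked illustration of the metric, the sketch below computes U = |IN| / (|S| − 1) for every class in a scope and then the z-score of each value. The dictionary layout (`uses` maps a class to the set of classes it calls) and the function names are assumptions made for the example; they do not come from [19], and the scope is assumed to contain at least two classes.

```python
from statistics import mean, pstdev


def utilityhood(uses, scope):
    """Return U(C) = |IN| / (|S| - 1) for every class C in the scope S.

    uses[d] is the set of classes that class d calls (hypothetical layout).
    """
    scores = {}
    for c in scope:
        callers = {d for d in scope if d != c and c in uses.get(d, set())}
        scores[c] = len(callers) / (len(scope) - 1)
    return scores


def z_scores(scores):
    """Standard score of each utility-hood value within the scope."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    return {c: (u - mu) / sigma if sigma else 0.0 for c, u in scores.items()}
```

A class whose z-score is large and positive, such as a logger called from every other class, would be a candidate utility to exclude from the analysis; the cut-off value itself is a judgment call that the source text does not fix.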

As can be seen, the main challenge in execution trace analysis is how to reduce and process the trace so that the final result is a good abstraction of what a system does under a specific execution scenario. The main problems to be addressed are the definition of the information that the traces should contain, the metrics and procedures that should be used for filtering, and the way to analyze and represent the reduced traces so that they can express knowledge about the system.

Other dynamic analysis works employ Web mining [31], association rules and clustering [32–34], and reduction techniques [35]⁸.

⁷ For more information about this type of model go to www.usecasemaps.org

⁸ For more information about dynamic analysis techniques see [4].



Module Extraction

An important task in architecture recovery and other areas of RE is module extraction. This is performed by defining a set of criteria that provide a semantic clustering of components to form cohesive modules. For example, in [22] and [23], Hill Climbing and Simulated Annealing are used to form clusters of components based on fan-in/out analysis. The approach starts by building a Module Dependency Graph; then random partitions (clusters) of the graph are formed as the initial clustering configuration, and later the partitions are rearranged iteratively by moving one component from one cluster to another. The objective is to find the optimal configuration based on the concepts of low coupling and high cohesion, which is achieved by considering fan-in/out information. This is accomplished by maximizing an objective function that the authors call Modularization Quality (MQ). The general process is shown in Figure 6.3.

Figure 6.3 Modularization process based on clustering/searching algorithms [23].
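The search loop described above can be sketched as a plain hill climber over a Module Dependency Graph. The MQ formula used here (a sum of per-cluster factors 2μ / (2μ + ε), with μ the intra-cluster edges and ε the edges crossing the cluster boundary) is an assumption about how the objective is computed, and the function names, the fixed number of clusters and the seeding are hypothetical; the sketch is not the Bunch tool of [22,23].

```python
import random


def mq(edges, assignment):
    """Assumed Modularization Quality: sum of per-cluster cohesion factors."""
    intra, inter = {}, {}
    for src, dst in edges:
        a, b = assignment[src], assignment[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            # A crossing edge penalizes both clusters it touches.
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    return sum(2 * mu / (2 * mu + inter.get(c, 0)) for c, mu in intra.items())


def hill_climb(nodes, edges, n_clusters, seed=0):
    """Start from a random partition and move one component at a time."""
    rng = random.Random(seed)
    assignment = {node: rng.randrange(n_clusters) for node in nodes}
    best = mq(edges, assignment)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            best_cluster = assignment[node]
            for candidate in range(n_clusters):
                assignment[node] = candidate
                score = mq(edges, assignment)
                if score > best + 1e-12:
                    best, best_cluster, improved = score, candidate, True
            # Keep the best placement found for this component.
            assignment[node] = best_cluster
    return assignment, best
```

Hill climbing only guarantees a local optimum, which is one reason to restart from several random partitions or to switch to Simulated Annealing, as the text mentions.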

Text Processing

Text processing refers to the processing of source code as text. This implies the processing of morphological, syntactic, semantic and lexical



elements of code. Text processing takes advantage of identifiers and of how statements are organized to extract semantic information about artifacts: classes, methods, packages, etc. The requirement is that the code is well written, i.e., the identifiers express semantic information about the business and follow some general conventions about how they are defined.

In [18], the key-phrase extraction algorithm KEA⁹ is used to translate business rules into domain-specific business terms taken from documentation. For this, the authors connect documents to business rules. The key point is that the documents contain technical descriptions of variables, so it is possible to establish a direct mapping between rules and documents.

More information about text processing is presented in Chapter 1.

6.4 APPLICATION OF TECHNIQUES

This section analyzes the applicability of some techniques to a procedural business software application.

6.4.1 Description of the System

The business system is called Financial Management System (SGF)¹⁰. It is an Oracle Forms¹¹ business application that manages the financial and administrative information of an organization.

The main characteristics of SGF are:

• It is a two-tier client-server application. It is implemented in Oracle Forms and in PL/SQL and SQL.

• The system has about 2072 DB tables, 507 views, 1689 DB stored procedures and 1097 forms. The system has about 700,000 lines of code.

⁹ For more information about the KEA algorithm see http://www.nzdl.org/Kea (February, 2011)

¹⁰ Spanish acronym for “Sistema de Gestión Financiera”.

¹¹ The latest version of Oracle Forms to date is 11g. For more information about this technology please visit http://www.oracle.com/technetwork/developer-tools/forms or http://en.wikipedia.org/wiki/Oracle_Forms (March, 2011)



• The system is implemented in Oracle Forms 6i technology, which is no longer supported by Oracle.

• It has evolved over a life cycle of about ten years and it has several clients such as Universidad Nacional de Colombia, Policía Nacional de Colombia and EPM Bogotá¹².

• The coupling of this system is relatively high because business logic is implemented on both the server side and the client side.

RE is applied to this system for three purposes: first, to redocument and understand the system because it was previously owned by another company; second, to facilitate maintenance tasks on the system; and third, to obtain representations of the system at different levels of granularity as a first step for the re-engineering process that the owner company, ITC S.A.S¹³, will carry out in the short/medium term.

6.4.2 Considerations for Applying Reverse Engineering

In order to perform RE on this system, applying both static and dynamic approaches is convenient. On the one hand, as PL/SQL is a procedural language there is no object resolution at run-time¹⁴, so static methods play a prominent role. On the other hand, it is possible to apply dynamic analysis using execution traces in order to restrict the search space; this is useful when a developer wants to focus on some part of the system and on a particular execution scenario. In this case, there are two alternatives: using the tracing mechanisms of Oracle¹⁵, or performing code instrumentation, which provides a better

¹² These are large organizations in Colombia.

¹³ http://www.itc.com.co

¹⁴ Oracle provides the programming technique called Dynamic SQL. According to several developers who maintain the system, the usage of this technique is low, so this topic is not addressed in the RE process, at least for now.

¹⁵ Some Web sites about Oracle tracing mechanisms are: http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_trace.htm, http://docstore.mik.ua/orelly/oracle/prog2/ch26_01.htm and http://www.devshed.com/c/a/Oracle/Debugging-PLSQL-Code/1/ (March, 2011)



degree of customization. Two problems arise here: first, the places in the code where the tracing code is inserted need to be determined (for example, at the beginning of a procedure or just after every control statement), and second, the trace produced could be huge. How serious these problems are depends on the specific user needs and on the required balance between trace detail and trace size, so the main way (maybe the only way) to tackle them is by performing empirical studies.

Additionally, it is possible to try to infer more knowledge from what is already known about the system. In this way, a mix of top-down and bottom-up approaches is convenient and practical [8]. On the other hand, applying some RE techniques for Object-Oriented Software (OOS) to procedural software is possible to some extent, especially when modeling behavior, since these two paradigms have a common component: the procedural part [36]¹⁶.

6.4.3 Application of Standard Techniques

Standard views and techniques are suitable for almost all kinds of software. The application of some views to SGF is described below.

• Module charts: The problem in module extraction is how to group objects into modules. For example, a basic approach is to group objects by their operations on tables, i.e., grouping forms, procedures or packages that perform DELETE, UPDATE or INSERT operations on the same tables. Then, based on a heuristic (for example, the number of objects that use a specific table, given a threshold) the modules can be formed.

• Structure chart: This can be viewed as just a categorization of objects and their usages. For example, the natural structure of a form (composed of canvases, data blocks, record groups, etc.¹⁷) or a table (columns, triggers, etc.).

¹⁶ In OOS, the data is encapsulated together with procedures, but in procedural software the latter are separated from data.

¹⁷ For more information about Oracle Forms components see [37] or review the presentation hosted at http://www.cse.iitb.ac.in/dbms/Data/Courses/DBIS/DBIS-Fall2000/course_demos/OracleForms/FormsBuilder.ppt (March, 2011)



• Call graph: It captures the call dependencies between PL/SQL objects: procedures, functions, program units, triggers, etc. Call graphs allow searching for paths, evaluating the impact of a change (for example, a change in the signature of a procedure) or calculating metrics such as coupling.

• Control flow diagrams: Beyond constructing this kind of diagram, which is relatively easy, the identification of key control structures in code is proposed (see Section 6.4.4).

• Data-flow diagram: It displays the flow of data (in parameters and variables) across procedures, functions and packages.
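The table-based grouping heuristic mentioned under module charts can be sketched as follows. This is a rough illustration, assuming the source text of each form, procedure or package is available as a string; the regular expression, the `min_objects` threshold and every name in the code are hypothetical, and a real extractor would need a proper PL/SQL parser instead of pattern matching.

```python
import re
from collections import defaultdict

# Matches the table named after INSERT INTO, UPDATE or DELETE FROM.
DML = re.compile(
    r"\b(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+"
    r"([A-Za-z_][\w$#]*(?:\.[A-Za-z_][\w$#]*)?)",
    re.IGNORECASE,
)

# Keywords that may follow UPDATE without naming a table, as in
# trigger headers (UPDATE ON ...) or cursors (FOR UPDATE OF ...).
NOT_TABLES = {"ON", "OF", "OR", "NOWAIT", "SKIP"}


def tables_written(source):
    """Names of the tables modified by a piece of PL/SQL source text."""
    names = {m.group(1).upper() for m in DML.finditer(source)}
    return names - NOT_TABLES


def group_by_table(objects, min_objects=2):
    """Candidate modules: tables written by at least min_objects objects."""
    by_table = defaultdict(set)
    for name, source in objects.items():
        for table in tables_written(source):
            by_table[table].add(name)
    return {table: sorted(names)
            for table, names in by_table.items()
            if len(names) >= min_objects}
```

Each entry of the result pairs a table with the objects that modify it, which gives a first, coarse proposal of modules that a maintainer can then merge or split by hand.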

Some of these diagrams have already been implemented in a prototype of a Reverse Engineering Tool for Oracle Forms applications¹⁸, which has been tested on SGF. Figure 6.4 shows the graphical interface of the prototype.

Figure 6.4 Prototype of the Oracle Forms Reverse Engineering Tool.

6.4.4 Application of Specialized Techniques

The specialized techniques presented in this chapter could feasibly be applied to SGF. However, only some of them are considered in this section.

¹⁸ The tool can be found at http://code.google.com/p/retool/



Simple Business Rules Extraction

The format (<conditions> <actions>) presented by Putrycz et al. [17] is a good starting point for formalizing business rules. The conditions could be extracted from IF and FOR statements. In the case of FOR statements, the analysis of cursors is very important because they define specific restrictions on the actions to be performed. For example, in Figure 6.5, there is a tax calculation which depends on the data queried by the cursor C_MOVIDP, which has several restrictions: the type of the process (PECL_TIPO IN (‘DC’, ‘DF’, ‘DD’, ‘AM’)) and the date of the movement (MOVI_FECHA > ‘01/01/2010’).

Figure 6.5 Example of a CURSOR and its use in a FOR statement.

The business rule associated with this code could be the following:

• <conditions>: for all MOVI_MOVI, with PECL_TIPOE IN (‘DC’, ‘DF’, ‘DD’, ‘AM’), with MOVI_FECHA > ‘01/01/2010’.

• <actions>: TOTAL_TAX := TOTAL_TAX + CALCULATE_TAX(REC.MOVI)

To some extent this rule is readable, but it depends on the meaning of the identifiers. In this case, there are two alternatives: using replacement rules (for example, MOVI_MOVI → movement) or using the descriptions of tables and columns, assuming that this information is available. The first option could be exploited because there are some conventions in



SGF with respect to identifiers; for example, a variable named v_cias refers to a company (cias → company). So, the initial idea is to generate simple rules (and some heuristics) to perform this replacement by simple matching; the process would not be completely automatic.
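A minimal sketch of such replacement rules is shown below. Only the two mappings mentioned in the text (cias → company, MOVI → movement) are taken from the chapter; treating `v_` as a variable prefix to discard, the function name and the way unknown fragments are kept are assumptions made for illustration.

```python
# Replacement rules: abbreviation -> business term (from the text above).
ABBREVIATIONS = {"cias": "company", "movi": "movement"}

# Assumption: a leading "v" marks a local variable and carries no meaning.
PREFIXES = {"v"}


def expand_identifier(identifier):
    """Rewrite an identifier as business words using simple matching."""
    parts = [p for p in identifier.lower().split("_") if p]
    if parts and parts[0] in PREFIXES:
        parts = parts[1:]
    words = []
    for part in parts:
        word = ABBREVIATIONS.get(part, part)
        # Collapse repetitions such as MOVI_MOVI -> "movement".
        if not words or words[-1] != word:
            words.append(word)
    return " ".join(words)
```

Fragments without a rule are left untouched (for instance, `MOVI_FECHA` becomes "movement fecha"), which makes the gaps visible to the person who maintains the rule table; this is one reason the process cannot be fully automatic.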

Text Processing and Dependencies Analysis

The basis of this analysis is constructing business rules from the processing of control structures in code. One problem is that not all control structures are important when extracting rules; the implementation may contain control structures that do not contribute to any rule. For instance, Figure 6.6 shows a code block that does not represent a business rule per se; instead, it is just a validation of parameters and the recording of its result.

Figure 6.6 Example of a code block that does not represent a business rule.

Consequently, the initial hypothesis is that some control statements in PL/SQL code, together with non-control statements, actually express rules about how the system should behave, according to the business it models. Then, the problem to address is how to find these statements.

The preliminary proposal for addressing this problem is to apply text processing techniques first and then dependency analysis. The input of the process is the code of a PL/SQL object (package, function, procedure, trigger, etc.) and the output is a list of control statements



(the conditions) that are likely to represent business rules. Each control statement would have an operational context, which would be translated into non-control statements (the actions) that complement the conditions.

The objective of the text processing step is to recognize identifiers (names of variables, functions, procedures, etc.) that could be important, together with their places in the code. This step explores the code without performing a deep analysis, using simple techniques such as word frequency and relationships between words.

The second step would exploit those statements that contain the identifiers previously recognized. The control statements and their inner code blocks are analyzed by considering variable, procedure and function dependencies, and data operations such as INSERTs, UPDATEs or DELETEs. The basic idea is to find a heuristic that evaluates the probability of a statement being part of a business rule. For example, the heuristic should take into account the following elements in statements: if-then statements, exception and raise statements, update, insert and delete operations, etc.
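One possible shape for such a heuristic is sketched below. It is only an illustration of the idea: the point values, the regular expressions and the function name are invented for the example, and the result is an uncalibrated score between 0 and 1 rather than a true probability.

```python
import re

# (pattern, points) pairs; the weights are arbitrary assumptions.
FEATURES = [
    (re.compile(r"\bIF\b.*?\bTHEN\b", re.IGNORECASE | re.DOTALL), 3),
    (re.compile(r"\bRAISE(?:_APPLICATION_ERROR)?\b|\bEXCEPTION\b",
                re.IGNORECASE), 2),
    (re.compile(r"\b(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\b",
                re.IGNORECASE), 3),
]
KEY_IDENTIFIER_POINTS = 2
MAX_POINTS = 10


def rule_likelihood(block, key_identifiers):
    """Score how likely a code block is to take part in a business rule."""
    points = sum(weight for pattern, weight in FEATURES
                 if pattern.search(block))
    # Identifiers come from the earlier text processing step.
    if any(re.search(rf"\b{re.escape(name)}\b", block, re.IGNORECASE)
           for name in key_identifiers):
        points += KEY_IDENTIFIER_POINTS
    return points / MAX_POINTS
```

Blocks could then be ranked by this score and only the top ones passed on to the step that builds readable rules; suitable weights would have to be found empirically on the system under study.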

Once the business statements are recognized, the next step is to process them to form human-readable business rules. The initial approach to this is to use parsing, syntactic analysis, and business rule templates.

6.5 REVERSE ENGINEERING ASSESSMENT

Assessment of techniques and tools is an important subject in RE, since the results of this process allow setting up a common research and development environment in RE for future work. This section details some concepts and procedures about how the evaluation should be carried out and what factors should be considered when assessing RE techniques and tools.

6.5.1 Assessment of Techniques

Assessment of techniques refers to measuring their effectiveness when performing a specific task, for example, a maintenance task such as defect fixing. In general, assessment of techniques and tools is difficult because Reverse Engineering is a goal-driven process, so there is a wide range



of scenarios to compare [12]. Effectiveness includes correctness and performance.

Empirical studies are the main instrument for measuring the characteristics of techniques and tools. According to [12], the first step is to have a defined scope of the discipline as a requirement for empirical evaluation. The second step is the definition of a common and agreed taxonomy of the methods and tools under investigation¹⁹. Finally, as a third step, a framework of empirical studies in the field is required. The authors propose some taxonomy criteria: method or tool, dynamic or static, input required, output produced, interaction supported, required user guidance, task applicability and scalability. According to this framework, there are six dimensions for designing and classifying empirical studies in Reverse Engineering:

• Type of study: Experience report, case study, experiment, observational study and systematic review.

• Object of study: It is basically a method or a tool.

• Purpose: Conceptual proposition, proof of concept, quantification, comparison, conditioned comparison, review and post-facto.

• Focus: Usefulness and usability.

• Population: Humans or programs.

• Context: For example, factors that influence Software Maintenance tasks.

6.5.2 Assessment of Tools

Assessment of tools can be considered in terms of quality evaluation, but this depends on the concept of quality. Khan et al. [2] mention the following quality criteria: absence of defects or errors, which is important especially in RE tools that display specific information (for estimation

¹⁹ The authors attempt to provide this taxonomy through an open wiki, http://lore.cmi.ua.ac.be/reWiki/index.php/Main_Page, although the wiki is outdated.



and impact analysis this factor is critical); and user requirements satisfaction. In the latter case, the goals that a developer can achieve with the tool are more important: how well they can evaluate the impact of a specific maintenance task (time and precision) and the knowledge obtained (automatic knowledge acquisition). This is associated with the features that a tool provides; in fact, the literature on tool assessment mostly considers the number and type of functionalities as the quality judgment.

Storey, in [7], presents some general features that Program Comprehension tools should have. Some of them are required in Reverse Engineering tools as well:

• Documentation: In top-down comprehension, program-level documentation improves maintenance tasks in terms of time and number of errors.

• Browsing and navigation support: Developers switch between top-down and bottom-up models, so flexible browsing needs to be supported. Control-flow and data-flow links or breadth-first and depth-first browsing should be provided by tools.

• Searching and querying: Developers focus on specific comprehension objects, so filtering mechanisms should be provided.

• Multiple views: Combined and cross-referenced views are required since developers employ several comprehension strategies. This is also important when describing a system from different perspectives.

• Context-driven views: Depending on some attributes, metrics or conditions, some views are more appropriate than others, even for different kinds of users (e.g., developers, architects, etc.).

• Cognitive support: More cognitive support benefits comprehension, through visualization, browsing and navigation, etc.

Other features required in an RE tool include concept assignment capabilities, searching and keeping of history, control-flow, call graphs, pruned



call trees, entity fan-in, capabilities for maintainers, software visualization support, and integration with Integrated Development Environments (IDE), among others.

In [38], the authors perform an evaluation of four Reverse Engineering tools (Refine/C2, Imagix 4D, Rigi, and SNiFF+), following the experimental framework for evaluating software technology presented in [39]. The context of the evaluation, used to determine the differences between the tools, was the recovery of architectural information from embedded software systems. First, based on the authors’ knowledge of using RE tools in commercial contexts, a set of assessment criteria grouped in functional categories was defined. Second, the tools were assessed according to the criteria by reviewing the degree of availability of functionalities. The assessment categories and some of their criteria are listed below:

• Analysis: This category refers to the parsing process. Some criteria are source languages, incremental parsing, fault-tolerant parsing, abortable parsing, parsing results, and parse speed.

• Representation: In this category the usability and graphical representations are assessed. Some criteria are speed of generation, static/dynamic views, filters, scopes and grouping, sorting, layered views, and view editing.

• Editing/browsing: Switching between abstract and fine-grained levels is a requirement when performing a task that involves Reverse Engineering. Some criteria are searching functions, history of browsing, editor integration, and highlighting of objects and source code.

• General capabilities: For instance, supported platforms, multi-user support, toolset extensibility, storing capabilities, exporting, automatic generation of documentation and on-line help.

However, this and other approaches give a good list of assessment criteria but they do not give a quantitative evaluation methodology. In this sense, a quantitative approach is presented in [2], which is based on the MECCA (Multi-Element Component Comparison and Analysis) method. The idea is to define a set of mandatory attributes that a Reverse Engineering tool must have. These attributes



are arranged hierarchically and split into sub-attributes. Later, a weight percentage is assigned to each attribute and sub-attribute, and a score is assigned to each last-level sub-attribute according to a scale. The result of the evaluation is a number that indicates how appropriate the tool is with respect to the defined attributes. For example, if the result for a tool X is 7 and the result for a tool Y is 8, then tool Y would be better than tool X in terms of functionality²⁰.

The MECCA method is only applied to evaluate a tool if the mandatory attributes regarding functionality are present. According to the authors, attributes are grouped in five categories: user functions or functionality, interface, I/O operation, metrics, and verification strategy (Figure 6.7).
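The weighting scheme can be illustrated with a small calculation. The sketch below assumes that the final result is a plain weighted sum over the attribute tree and that leaf scores lie on a 0 to 10 scale; neither is stated in this chapter, which refers to [40] for the actual computational method. The pairing of percentages with attributes is read off Figure 6.7 and should be taken as an assumption as well, and the function name is hypothetical.

```python
# Attribute -> (weight %, {sub-attribute: weight %}), as read from Figure 6.7.
WEIGHTS = {
    "Functionality": (38, {"Mandatory functions": 90,
                           "Optional functions": 10}),
    "Interface": (23, {"Man-machine interaction ability": 60,
                       "Interface to other tools": 40}),
    "I/O operation": (19, {"Accessibility to the repository": 40,
                           "Scope of programming languages": 30,
                           "Report generation capability": 30}),
    "Metrics": (11, {"Metrics collection": 65,
                     "Metrics analysis": 35}),
    "Verification strategy": (9, {"Consistency checking": 60,
                                  "Constraining mechanisms": 40}),
}


def mecca_score(leaf_scores):
    """Weighted sum of leaf scores (assumed aggregation, see lead-in)."""
    total = 0
    for attribute, (weight, subs) in WEIGHTS.items():
        for sub, sub_weight in subs.items():
            total += weight * sub_weight * leaf_scores[attribute][sub]
    # Both weight levels are percentages, hence the division by 100 * 100.
    return total / 10_000
```

With this aggregation a tool that scores 7 on every leaf obtains an overall 7, and a low score on mandatory functions dominates the result because that branch carries the largest combined weight.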


Reverse Engineering provides useful methods and tools in many domains such as Software Maintenance and Comprehension. Since at least 1990 [1], a lot of work has been done to successfully solve different problems in RE [41]. In this chapter, some methods and techniques proposed in the research literature were reviewed in order to, first, show the wide spectrum of proposals and solutions; second, present the general landscape of the area; and third, analyze how some techniques could be applied to an industrial procedural software system, as evidence of their applicability.

One of the main trends in RE is the development of more automatic, usable and useful tools for software developers and stakeholders, to support comprehension and maintenance tasks. Also, important new approaches will continue to emerge; for instance, the use of Model Driven Development and Model Driven Architecture (MDD/MDA) has increased in recent years, so the Architecture Driven Modernization (ADM) process is expected to change RE concepts and methods in terms of models and visual languages across several levels of abstraction. Some work on ADM can be found in [42] and [43].

²⁰ For the computational method of the MECCA model please refer to [40].



[Figure 6.7: weighted attribute hierarchy of a Reverse Engineering tool. Functionality 38% (Mandatory Functions 90%, Optional Functions 10%); Interface 23% (Man-Machine Interaction Ability 60%, Interface to Other Tools 40%); I/O operation 19% (Accessibility to the Repository 40%, Scope of Programming Languages 30%, Report Generation Capability 30%); Metrics 11% (Metrics Collection 65%, Metrics Analysis 35%); Verification Strategy 9% (Consistency Checking 60%, Constraining Mechanisms 40%).]

Figure 6.7 Application of the MECCA approach [2].

Besides, proposing new combined techniques is important since it allows common problems in Reverse Engineering, Software Comprehension and Maintenance to be solved more effectively. In this sense, Section 6.4 presents some elements that combine business rules extraction, text processing, and dependency analysis. In addition, more applications of the field will emerge; for example, RE for assessing security issues in software systems, or for inferring new business rules.



From another perspective, the field also has problems to solve. The research and academic community should work towards a more structured field of study. Standard taxonomies and frameworks are needed [12,44] for, among other purposes, assessing RE methods and tools in a quantifiable way [12], and for defining a strong basis for research in each subdomain of RE: business rules extraction, module and architecture recovery, concept extraction, etc.

In the same way, the development of usable and useful tools (useful for different tasks) is a critical requirement. For example, tools should be flexible, extensible and scalable, and they should provide filtering and searching mechanisms. As Tonella et al. state [12], “the possibility to customize and extend a tool clearly affects its usability and adaptability”. Therefore, research into new methods and the development of useful tools are mandatory because, as industry keeps adopting practices and tools, RE will continue to consolidate as a major field.

Finally, the applicability analysis and the proposal presented in Section 6.4 will continue evolving towards a more structured and effective method and solution. A more detailed analysis will be performed to determine what other factors would be convenient to consider from the RE perspective.



REFERENCES

[1] Chikofsky, E. J. & Cross II, J. H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Softw., 7(1): 1990; 13–17. ISSN 0740-7459. doi:10.1109/52.43044.

[2] Skramstad, T. & Khan, M. Assessment of reverse engineering tools: A MECCA approach. In: Assessment of Quality Software Development Tools, 1992., Proceedings of the Second Symposium on: 1992, 120–126. doi:10.1109/AQSDT.1992.205845.

[3] Bennett, K. H. & Rajlich, V. Software maintenance and evolution: a roadmap. In: ICSE ’00: Proceedings of the Conference on The Future of Software Engineering. ACM, New York, NY, USA: 2000. ISBN 1-58113-253-0, 73–87. doi:10.1145/336512.336534.

[4] Cornelissen, B. Evaluating Dynamic Analysis Techniques for Program Comprehension. PhD thesis, Delft University of Technology: 2009.

[5] Müller, H. A., Tilley, S. R. & Wong, K. Understanding software systems using reverse engineering technology perspectives from the Rigi project. In: Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: software engineering - Volume 1, CASCON ’93. IBM Press: 1993, 217–226.

[6] Baxter, I. D. & Mehlich, M. Reverse engineering is reverse forward engineering. Science of Computer Programming, 36(2-3): 2000; 131–147. ISSN 0167-6423. doi:10.1016/S0167-6423(99)00034-9.

[7] Storey, M.-A. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14: 2006; 187–208. ISSN 0963-9314. doi:10.1007/s11219-006-9216-4.

[8] Quilici, A. A memory-based approach to recognizing programming plans. Commun. ACM, 37: 1994; 84–93. ISSN 0001-0782. doi:10.1145/175290.175301.



[9] Burnstein, I., Saner, R. & Limpiyakorn, Y. Using an artificial intelligence approach to build an automated program understanding/fault localization tool. In: Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on: 1999. ISSN 1082-3409, 69–76. doi:10.1109/TAI.1999.809768.

[10] Canfora, G. & Cimitile, A. Software Maintenance, volume 2, chapter 2. World Scientific Pub. Co: 2002, 15–20.

[11] Swanson, E. B. The dimensions of maintenance. In: Proceedings of the 2nd international conference on Software engineering, ICSE ’76. IEEE Computer Society Press, Los Alamitos, CA, USA: 1976, 492–497.

[12] Tonella, P., Torchiano, M., Du Bois, B. & Systa, T. Empirical studies in reverse engineering: state of the art and future trends. Empirical Softw. Engg., 12: 2007; 551–571. ISSN 1382-3256. doi:10.1007/s10664-007-9037-5.

[13] Lu, C. W., Chu, W., Chang, C. H., Chung, Y. C., Liu, X. & Yang, H. Reverse Engineering, volume 2, chapter 18. World Scientific Pub. Co: 2002, 447–466.

[14] Asif, N. Software reverse engineering process: Factors, elements and features. International Journal of Library and Information Science, 2(7): 2010; 124–136.

[15] Yourdon, E. Structured Analysis Wiki. http://yourdon.com/strucanalysis: 2011.

[16] Jgrasp.org. The Control Structure Diagram (CSD) (tutorial). http://www.jgrasp.org/: 2009.

[17] Putrycz, E. & Kark, A. Recovering Business Rules from Legacy Source Code for System Modernization. In: Advances in Rule Interchange and Applications, (eds.) Paschke, A. & Biletskiy, Y., volume 4824 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg: 2007, 107–118. doi:10.1007/978-3-540-75975-1_9.



[18] Putrycz, E. & Kark, A. Connecting Legacy Code, Business Rules and Documentation. In: Rule Representation, Interchange and Reasoning on the Web, (eds.) Bassiliades, N., Governatori, G. & Paschke, A., vol. 5321 in Lecture Notes in Computer Science. Springer Berlin / Heidelberg: 2008, 17–30. doi:10.1007/978-3-540-88808-6_5.

[19] Hamou-Lhadj, A., Braun, E., Amyot, D. & Lethbridge, T. Recovering Behavioral Design Models from Execution Traces. In: Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005. ISSN 1534-5351, 112–121. doi:10.1109/CSMR.2005.46.

[20] Dugerdil, P. Using trace sampling techniques to identify dynamic clusters of classes. In: Proceedings of the 2007 conference of the center for advanced studies on Collaborative research, CASCON ’07. ACM, New York, NY, USA: 2007, 306–314. doi:http://doi.acm.org/10.1145/1321211.1321254.

[21] Gall, H. C., Rol, R. R. K. & Mittermeir, T. Abstract Pattern-Driven Reverse Engineering: 1995.

[22] Mancoridis, S., Mitchell, B., Chen, Y. & Gansner, E. Bunch: a clustering tool for the recovery and maintenance of software system structures. In: Software Maintenance, 1999. (ICSM ’99) Proceedings. IEEE International Conference on: 1999, 50–59. doi:10.1109/ICSM.1999.792498.

[23] Mitchell, B. & Mancoridis, S. On the automatic modularization of software systems using the Bunch tool. Software Engineering, IEEE Transactions on, 32(3): 2006; 193–208. ISSN 0098-5589. doi:10.1109/TSE.2006.31.

[24] OMG. Semantics of Business Vocabulary and Business Rules (SBVR): 2002.

[25] Baxter, H. S., I. A Standards-Based Approach to Extracting Business Rules. OMG’s Architecture Driven Modernization Workshop: 2005.



[26] Huang, H., Tsai, W., Bhattacharya, S., Chen, X., Wang, Y. & Sun, J. Business rule extraction from legacy code. In: Computer Software and Applications Conference, 1996. COMPSAC ’96., Proceedings of 20th International: 1996, 162–167. doi:10.1109/CMPSAC.1996.544158.

[27] Wang, X., Sun, J., Yang, X., He, Z. & Maddineni, S. Business rules extraction from large legacy systems. In: Software Maintenance and Reengineering, 2004. CSMR 2004. Proceedings. Eighth European Conference on: 2004. ISSN 1534-5351, 249–258. doi:10.1109/CSMR.2004.1281426.

[28] Shekar, S., Hammer, J., Schmalz, M. & Topsakal, O. Knowledge Extraction in the SEEK Project Part II: Extracting Meaning from Legacy Application Code through Pattern Matching. Technical report, University of Florida: 2003.

[29] Martinez-Fernandez, J. L., Gonzalez, J. C., Villena, J. & Martinez, P. A Preliminary Approach to the Automatic Extraction of Business Rules from Unrestricted Text in the Banking Industry. In: Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems, NLDB ’08. Springer-Verlag, Berlin, Heidelberg: 2008. ISBN 978-3-540-69857-9, 299–310. doi:http://dx.doi.org/10.1007/978-3-540-69858-6_29.

[30] Burnstein, I. & Roberson, K. Automated chunking to support program comprehension. In: Program Comprehension, 1997. IWPC ’97. Proceedings., Fifth International Workshop on: 1997, 40–49. doi:10.1109/WPC.1997.601262.

[31] Zaidman, A., Calders, T., Demeyer, S. & Paredaens, J. Applying Webmining Techniques to Execution Traces to Support the Program Comprehension Process. In: Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005. ISSN 1534-5351, 134–142. doi:10.1109/CSMR.2005.12.



[32] Lo, D. Mining specifications in diversified formats from execution traces. In: Software Maintenance, 2008. ICSM 2008. IEEE International Conference on: 2008. ISSN 1063-6773, 420–423. doi:10.1109/ICSM.2008.4658094.

[33] Lo, D., cheng Khoo, S. & Liu, C. Mining temporal rules from program execution traces. Int. Work. on Prog. Comprehension, vol. 20, no. 4: 2007; 227–247.

[34] Safyallah, H. & Sartipi, K. Dynamic Analysis of Software Systems using Execution Pattern Mining. In: Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on: 2006, 84–88. doi:10.1109/ICPC.2006.19.

[35] Cornelissen, B., Moonen, L. & Zaidman, A. An Assessment Methodology for Trace Reduction Techniques. In: Proceedings of the 24th International Conference on Software Maintenance, (ed.) Hong Mei, K. W. IEEE Computer Society: 2008. ISBN 978-1-4244-2614-0, 107–116.

[36] White, G. & Sivitanides, M. Cognitive Differences Between Procedural Programming and Object Oriented Programming. Information Technology and Management, 6: 2005; 333–350. ISSN 1385-951X. doi:10.1007/s10799-005-3899-2.

[37] Andrade, L., Gouveia, J., Antunes, M., El-Ramly, M. & Koutsoukos, G. Forms2Net - Migrating Oracle Forms to Microsoft .NET. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4143 LNCS. Braga, Portugal: 2006. ISSN 0302-9743, 261–277.

[38] Bellay, B. & Gall, H. An evaluation of reverse engineering tool capabilities. Journal of Software Maintenance, 10: 1998; 305–331. ISSN 1040-550X. doi:10.1002/(SICI)1096-908X(199809/10)10:5<305::AID-SMR175>3.3.CO;2-Z.

[39] Brown, A. & Wallnau, K. A framework for evaluating software technology. Software, IEEE, 13(5): 1996; 39–49. ISSN 0740-7459. doi:10.1109/52.536457.



[40] Khan, M., Ramakrishnan, M. & Lo, B. Assessment Model for Software Maintenance Tools: A Conceptual Framework. In: PACIS 1997 Proceedings. Paper 51. http://aisel.aisnet.org/pacis1997/51: 1997.

[41] Canfora, G. & Di Penta, M. New Frontiers of Reverse Engineering. In: FOSE ’07: 2007 Future of Software Engineering. IEEE Computer Society, Washington, DC, USA: 2007. ISBN 0-7695-2829-5, 326–341. doi:http://dx.doi.org/10.1109/FOSE.2007.15.

[42] Cánovas, J. L. & Molina, J. G. A Domain Specific Language for Extracting Models in Software Modernization. In: ECMDA-FA ’09: Proceedings of the 5th European Conference on Model Driven Architecture - Foundations and Applications. Springer-Verlag, Berlin, Heidelberg: 2009. ISBN 978-3-642-02673-7, 82–97. doi:http://dx.doi.org/10.1007/978-3-642-02674-4_7.

[43] Izquierdo, J. & Molina, J. An Architecture-Driven Modernization Tool for Calculating Metrics. Software, IEEE, 27(4): 2010; 37–43. ISSN 0740-7459. doi:10.1109/MS.2010.61.

[44] Rasool, G., Maeder, P. & Philippow, I. Evaluation of design pattern recovery tools. Procedia Computer Science, 3: 2011; 813–819. ISSN 1877-0509. doi:10.1016/j.procs.2010.12.134. World Conference on Information Technology.



Agility Is Not Only About Iterations But Also About Software Evolution

Mario Linares-Vásquez
Jairo Aponte

ABSTRACT

What is agility in software development? Agility is related to high-quality and fast software development. Most people think that agile methodologies are only about development with frequent releases over short iterations. However, agility is more than that; it is based on a set of values and principles that embrace the real nature of software development and try to address the quality required by the stakeholders. The nature of software development is defined by phenomena such as requirements’ volatility, users’ volubility, incremental change, and uncertainty in schedules. The best way to address this nature is through an adaptive, evolutionary process, because software systems and their environments are in continuous evolution. Agility and agile methodologies were conceived some years ago to develop software as an adaptive process; now they are also a way to achieve an evolutionary process in software development.

7.1 INTRODUCTION

Plan-driven methodologies are based on two well-defined stages, design and construction; the former requires creative people to build detailed plans for construction; the latter is less expensive than design and can be achieved in a predictable way. This manufacturing model was initially adopted by software development because the software process was considered a predictable one. Consequently, the iterative waterfall process proposed by Royce [1] in the early 1970s has been used as a sequential, plan-driven methodology¹ with a long up-front elicitation stage. This has been recognized as a big mistake, because software development is not a predictable process. Software development is an adaptive process [2], and prominent researchers and practitioners such as Winston Royce, Frederick Brooks, Tom Gilb and Barry Boehm stated so during the 1980s. In 1987, Brooks proposed rapid prototyping and requirements refinement as suitable approaches for dealing with the essential difficulties of software development [3]; in 1988, Boehm proposed an iterative, risk-driven model that uses prototypes and user feedback [4]; also in 1988, Gilb suggested an evolutionary model for delivering value to stakeholders, fast and continuously [5].

Agile methodologies (formerly called light-weight methodologies) arose as a natural reaction of the software developer community against plan-driven methodologies, promoting adaptive development. Although adaptive and evolutionary development is not a new idea, the values and principles behind agility have promoted it during the last 10 years. Thus, agile methodologies recognize the essential difficulties and the nature of software development in order to provide a way to develop software using adaptive processes. For us, followers of the agile philosophy, the most relevant features that describe the nature of software are:

• F1 - Requirements’ volatility and evolution. Heraclitus said in ancient times, “There is nothing permanent except change”, and software systems are no exception, since they are open and operate in ecosystems together with other elements such as people, hardware, laws, business rules, and other systems. Evolution is a fact, and requirements’ volatility is a manifestation of evolution because all the elements in the ecosystem are continuously changing.

• F2 - Users’ volubility. Users often come to understand their needs by using software prototypes. Thus, the initial requirements are usually raw and fuzzy, and can only be refined through exploration and experimentation (with prototypes). However, while users are testing and using software prototypes, the infinite space that represents their requirements can easily change (the odds are high). The experience of a user with a new prototype is a source of new requirements and a trigger for mental processes that refine the requirements space.

¹ The real waterfall process proposed by Royce was conceived as an iterative process.

• F3 - The users’ needs are also tacit knowledge. Business rules and business logic are in the stakeholders’ minds and are described in software artifacts. Explicit knowledge is easily expressed in artifacts, even if the artifacts are written in ambiguous languages (natural languages, modeling languages). However, business rules are usually tacit knowledge, and getting them out of users’ minds and into software artifacts is a hard task.

• F4 - Software invisibility. Software is intangible; it lives in silicon chips, hard disks, and software designs. Two consequences of this feature are: (i) most users think that software development is an easy task, and (ii) there are no complete spatial representations of software designs with which to perform early verification and validation.

• F5 - Code decay. Software evolution has a limit, because incremental changes gradually degrade the architecture of a system throughout its lifespan. Therefore, software systems die just like living systems do [6].

• F6 - Developers are non-linear components. Any software process is carried out by people, and we firmly believe that software development requires creative people during each stage of the process (analysis, design, implementation, testing, deployment). However, software professionals are humans and cannot be considered plug-ins or replaceable parts [7].

These features describe common issues in software development processes. Can a software development process be predictable? We think NO; the answer of the creators, practitioners and promoters of iterative development is NO; and the answer of agile people is NO. The answer of every person involved in software development should be NO. Therefore, in this chapter we show how agile methodologies embrace software evolution and provide developers with the means for dealing with the nature of software. The rest of the chapter is structured as follows: Section 7.2 describes evolutionary software models. Section 7.3 discusses agility concepts, values and principles. Section 7.4 briefly presents the history of agile methodologies. Section 5 describes the main agile methods. Finally, Section 6 examines the relationship between software evolution and agility.

7.2 EVOLUTIONARY SOFTWARE PROCESSES

Software evolution is not a new term; it dates from the 1960s, and it was explicitly recognized by the community when Lehman [8] explained that evolution is an activity different from post-deployment maintenance. There are several definitions of and considerations about what software evolution is, but we adopt the viewpoint in which evolution is the process of adapting the software to its environment through planned and unplanned activities.

Figure 7.1 Iterative development.

The foundations of evolutionary software processes are iterative and incremental development. Figure 7.1 represents iterative development with n iterations. Iterative development means making several product releases over a series of iterations. Each iteration is associated with a full sequential waterfall process, and there is a chain of responsibility between iterations: each iteration generates a product release, each release implements a set of features, and the iterations form a sequence. In iterative development the outcome of an iteration can be a throwaway prototype that is discarded before the next iteration, or a product release that can be used to build a new release in a later iteration.

Figure 7.2 Incremental development.

Incremental development is about making evolvable or incremental products, where each one is an increment of the previous release. Figure 7.2 summarizes an incremental development with three product releases; each one includes the features implemented in previous releases and is built on the previous product. Each release can be built using some stages of the sequential waterfall process. However, when incremental development includes well-defined iterations (each product/prototype is built using a sequential waterfall process), it is called iterative and incremental development. Figure 7.3 depicts an iterative and incremental development process. Processes of this kind have also been called evolutionary processes; some of them are listed next.
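The combination of iterations and increments can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the `Release` type, the `run_waterfall` stand-in for one sequential pass, and the feature names are hypothetical and do not come from any of the processes discussed in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """A product release and the features it contains (hypothetical names)."""
    number: int
    features: list[str] = field(default_factory=list)


def run_waterfall(planned: list[str]) -> list[str]:
    """Stand-in for one sequential pass (analysis, design, coding, testing)."""
    return list(planned)


def iterative_incremental(plan: list[list[str]]) -> list[Release]:
    """Run one full pass per iteration; each release extends the previous one."""
    releases: list[Release] = []
    accumulated: list[str] = []
    for number, planned in enumerate(plan, start=1):
        accumulated = accumulated + run_waterfall(planned)
        releases.append(Release(number, list(accumulated)))
    return releases


releases = iterative_incremental([["login"], ["search"], ["reports", "export"]])
for release in releases:
    print(release.number, release.features)
```

Each printed release contains every feature of the earlier releases plus the features of its own iteration, which is the property that distinguishes an incremental product from a throwaway prototype.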

7.2.1 EVO

The Evolutionary Development Model (EVO) proposed by Gilb [5, 9] is an iterative and incremental process for early and on-time delivery of results. EVO relies on making small and frequent releases of high-value-first results to the stakeholders. EVO is based on the “Plan-Do-Study-Act” cycle: plan the EVO step, perform the EVO step, analyze the feedback results from the EVO step and the current environment, then decide what to do next. Its principles are [10]:

1. Capablanca’s next move: There is only one move that really counts, the next one.

2. Do the juicy bits first: Do whatever gives the biggest gains. Do not let the other stuff distract you!

3. Better the devil you know: Successful visionaries start from where they are, and what they and their customers have.

4. You eat an elephant one bite at a time: System stakeholders need to digest new systems in small increments.

5. Cause and effect: If you change in small stages, the causes of effects are clearer and easier to correct, if needed.

6. The early bird catches the worm: Your customers will be happier with an early long-term stream of their priority improvements, than with years of promises, culminating in late disaster.

7. Strike early, while the iron is still hot: Release and test quickly with people who are most interested and motivated.

8. A bird in the hand is worth two in the bush: Your next step should give the best result you can get now.

9. No plan survives first contact with the enemy: A little practical experience beats a lot of committee meetings.

10. Adaptive architecture: Since you cannot be sure where or when you are going, your first priority is to equip yourself to go almost anywhere, anytime.

Figure 7.3 Iterative and incremental development.

The characteristics of EVO are:

• Frequent delivery of system changes (steps).

• Steps delivered to stakeholders for real use.

• Feedback obtained from stakeholders to determine next step(s).

• The existing system is used as the initial system base.

• Small steps (ideally between 2% and 5% of total project cost and time).

• Steps with the highest value and benefit-to-cost ratios are given the highest priority for delivery.

• Feedback used “immediately” to modify long-term plans and requirements.

• Result-oriented (“delivering the results” is the prime concern).

Typical steps of the EVO process are:

1. Gather from all the key stakeholders the top few (5 to 20) most critical goals that the project needs to deliver.

2. For each goal, define a scale of measure and a “final” goal level.

3. Define approximately four budgets for your most limited resources.

4. Write up these plans for the goals and budgets.

5. Negotiate with the key stakeholders to formally agree on the goals and budgets.

6. Plan to deliver some goals in weekly increments (EVO steps).

7. Implement the project in EVO steps.
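The prioritization rule behind these steps (small steps, highest benefit-to-cost ratio first) can be sketched as follows. This is only an illustration under stated assumptions: the `Step` type, the `next_evo_step` function, the backlog entries and their numbers are hypothetical and are not taken from Gilb's description; the 5% limit mirrors the upper bound on step size listed among the EVO characteristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    value: float  # estimated stakeholder value (hypothetical units)
    cost: float   # estimated cost, in the same units as the budget; must be > 0


def next_evo_step(candidates, total_budget, max_share=0.05):
    """Pick the small-enough step with the best value-to-cost ratio.

    Steps costing more than max_share of the total budget are treated as
    too large; they should be split before being scheduled.
    Returns None when no candidate is small enough.
    """
    small_enough = [s for s in candidates if s.cost <= max_share * total_budget]
    if not small_enough:
        return None
    return max(small_enough, key=lambda s: s.value / s.cost)


backlog = [
    Step("export to PDF", value=8, cost=4),
    Step("faster search", value=9, cost=3),
    Step("rewrite billing", value=30, cost=20),
]
chosen = next_evo_step(backlog, total_budget=100)
print(chosen.name)  # faster search
```

In a real project the function would be called again after every delivered step, with values and costs re-estimated from stakeholder feedback, which corresponds to the "Study" and "Act" parts of the cycle.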



7.2.2 Spiral

The Spiral model proposed by Boehm [4] was conceived as a risk-driven approach to software development. Each cycle in the spiral evolves the product by repeating the same sequence of steps for each portion of the product and for each of its levels of elaboration (specification, modules, components, methods). With each cycle the product evolves by increments. Typical steps of a spiral cycle are (Figure 7.4):

1. Identify the functional requirements (objectives) of the portion of the product that is going to be elaborated in the current cycle.

2. Identify the possible strategies/alternatives for implementing the product, and the constraints imposed by each strategy.

3. Evaluate the alternatives relative to the objectives and constraints. This includes evaluating the risks and proposing strategies for mitigating them (prototyping, simulation, benchmarking, etc.).

4. Implement and test the product in order to reduce the dominant risks.

5. Make a review involving the primary people or organizations concerned with the product. The review covers all products developed during the previous cycle, including the plans for the next cycle and the resources required to carry them out.

6. Prepare the plan for the next cycle.
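Steps 2 to 4 of the cycle can be illustrated with a small sketch of risk-driven selection. It rests on assumptions that are not part of the description above: risk is quantified as exposure (probability times loss), alternatives are compared by their total exposure, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    description: str
    probability: float  # chance of the unwanted outcome, between 0 and 1
    loss: float         # cost if it happens (hypothetical units)

    @property
    def exposure(self) -> float:
        # Assumed risk measure: probability times loss.
        return self.probability * self.loss


def choose_alternative(alternatives):
    """alternatives maps a strategy name to its list of risks;
    return the name of the strategy with the lowest total exposure."""
    return min(alternatives, key=lambda name: sum(r.exposure for r in alternatives[name]))


def dominant_risk(risks):
    """Return the risk with the highest exposure; it is addressed first."""
    return max(risks, key=lambda r: r.exposure)


alternatives = {
    "build in-house": [Risk("schedule slip", 0.5, 40), Risk("staff turnover", 0.2, 30)],
    "buy component": [Risk("poor fit", 0.3, 50), Risk("vendor lock-in", 0.1, 20)],
}
chosen = choose_alternative(alternatives)
first_to_mitigate = dominant_risk(alternatives[chosen])
print(chosen, "->", first_to_mitigate.description)
```

With these numbers the second alternative has the lower total exposure, and its dominant risk is the one a prototype or benchmark in the current cycle would target.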

7.2.3 The Unified Process Family

The Unified Process (UP) is a widely used model for large projects. UP is based on previous models such as the Ericsson approach (1967), the Rational Objectory Process (1997) and the Rational Unified Process (1998). The main features of UP are [11]:

• It is an iterative and incremental model.

Figure 7.4 The Spiral model [4]. (Quadrants: 1. Determine objectives; 2. Identify and resolve risks; 3. Development and test; 4. Plan the next iteration. Axes: cumulative cost and progress.)

• It is use-case-driven. This means that use cases are used to specify the functional requirements, and the iterations are planned and evaluated against use cases. Use cases drive the work through each iteration.

• It is centered on the architecture definition because RUP is focused on early development and baselining of an executable architecture.

• It is an adaptable framework because it may be tailored according to the team and the project.

• It is a bidimensional model (Figure 7.5). The horizontal axis represents the dynamic aspect and the vertical axis represents the static aspect. The horizontal axis shows time and how the phases and iterations define the lifecycle aspects. The four phases of UP are inception, elaboration, construction, and transition. Each of these has a set of well-defined goals, artifacts and milestones; a summary of the goals is shown in Figure 7.6. UP is serial in the large because the whole process is split into four sequential phases, and it is iterative in the small because each phase is split into iterations. The vertical axis represents the disciplines in the lifecycle and the effort required for each one through the iterations. Figure 7.7 summarizes the main elements in the vertical dimension of UP.

• It includes risk mitigation by using iterations with product releases; assessments at the end of each phase in order to decide whether the project continues; and an inception phase aimed at defining the project scope and the business case for the system.
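The phrase "serial in the large, iterative in the small", together with the end-of-phase assessments, can be sketched as two nested loops. This is a hedged illustration only: the `unified_process` function, the `go_ahead` callback and the iteration counts are hypothetical, and real UP projects decide the number of iterations per phase during planning.

```python
PHASES = ["inception", "elaboration", "construction", "transition"]


def unified_process(iterations_per_phase, go_ahead=lambda phase: True):
    """Return the (phase, iteration) pairs executed, in order.

    Phases run strictly one after another (serial in the large); each
    phase is split into iterations (iterative in the small); the project
    stops if the assessment at the end of a phase fails.
    """
    log = []
    for phase in PHASES:
        for iteration in range(1, iterations_per_phase[phase] + 1):
            log.append((phase, iteration))
        if not go_ahead(phase):
            break
    return log


plan = {"inception": 1, "elaboration": 2, "construction": 3, "transition": 1}
full_run = unified_process(plan)
cancelled = unified_process(plan, go_ahead=lambda phase: phase != "elaboration")
print(len(full_run), len(cancelled))
```

The second call models a project that fails its assessment at the end of elaboration, so no construction or transition iterations are executed.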

The Rational Unified Process (RUP) is the commercial version of UP, developed initially by Rational [12]. RUP has a dual nature because it is both a process and a product. A big difference between UP and RUP is that RUP has nine disciplines grouped into two types, (i) engineering (development) and (ii) support (Figure 7.8), whereas UP has only the development disciplines.

Another flavor of UP is the Enterprise Unified Process (EUP) proposed by Ambler [13]. It is an extension of RUP aimed at covering weaknesses of RUP and UP. The scope of RUP is the software process; therefore, RUP does not include activities of real development processes such as design of the IT architecture, maintenance, operation and support in the production environment, and portfolio management. Thus, EUP adds new disciplines and phases to RUP, described as follows:

• Production and Retirement phases: The new phases represent the lifecycle after a system has been deployed. The purpose of the Production phase is to keep the software in production until it is either replaced with a new version or retired and removed from production. The purpose of the Retirement phase is the successful removal of a system from production.

• Enterprise disciplines: These are seven new enterprise management disciplines related to cross-system issues that organizations should address to be successful at IT. These disciplines are: enterprise business modeling, portfolio management, enterprise architecture, strategic reuse, people management, enterprise administration, and software process improvement.

• A new support discipline called Operations & Support: It is related to operating and supporting the software, especially after the system is deployed in the production environment.

Figure 7.5 The Unified Process (UP) model [11].

A full list of the flavors of UP is given in [14], and a summary of the history of UP is given in [15].

7.2.4 Staged Model

Figure 7.6 The RUP phases and their goals [12].

Inception: Define project scope. Estimate cost and schedule. Define risks. Develop business case. Prepare project environment. Identify architecture.

Elaboration: Specify requirements in greater detail. Validate architecture. Evolve project environment. Staff project team.

Construction: Model, build and test system. Develop supporting documentation.

Transition: System testing. User testing. System rework. System deployment.

Maintenance is not just post-delivery work, and it is not a uniform task over the software life cycle, because it can take several forms (perfective, adaptive, corrective, preventive). Maintenance is a series of distinct stages, each one with different activities, tools, and business consequences. This is the motivation for the Staged Model proposed by Rajlich et al. [16]. In this model, the software life cycle consists of five stages (Figure 7.9):

1. Initial development: The system is built from scratch to meet the initial requirements.

2. Evolution: The capabilities and functionality of the system are extended to meet user needs. It is possible to make major changes in the architecture.

3. Servicing: Minor defects are repaired and simple functional changes are performed through servicing patches.

4. Phaseout: Servicing is discontinued, while the owners seek to generate revenue from the system for as long as possible.

5. Closedown: The system is retired from the production environment.
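The five stages behave like a small state machine, which the following sketch makes explicit. The stage names come from the list above and the event names from the labels of Figure 7.9; the `Stage` enum, the `TRANSITIONS` table and the `advance` function are hypothetical names introduced only for this illustration.

```python
from enum import Enum


class Stage(Enum):
    INITIAL_DEVELOPMENT = 1
    EVOLUTION = 2
    SERVICING = 3
    PHASEOUT = 4
    CLOSEDOWN = 5


# (current stage, event) -> next stage. There is no way back to an
# earlier stage: once evolvability is lost, only servicing remains.
TRANSITIONS = {
    (Stage.INITIAL_DEVELOPMENT, "first running version"): Stage.EVOLUTION,
    (Stage.EVOLUTION, "evolution change"): Stage.EVOLUTION,
    (Stage.EVOLUTION, "loss of evolvability"): Stage.SERVICING,
    (Stage.SERVICING, "servicing patch"): Stage.SERVICING,
    (Stage.SERVICING, "servicing discontinued"): Stage.PHASEOUT,
    (Stage.PHASEOUT, "switchoff"): Stage.CLOSEDOWN,
}


def advance(stage, event):
    """Return the stage reached from `stage` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed during {stage.name}") from None


stage = Stage.INITIAL_DEVELOPMENT
for event in ["first running version", "evolution change", "loss of evolvability",
              "servicing patch", "servicing discontinued", "switchoff"]:
    stage = advance(stage, event)
print(stage.name)  # CLOSEDOWN
```

Encoding the transitions as a table makes the one-way nature of the model visible: for example, an "evolution change" requested during servicing is rejected rather than silently accepted.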

Figure 7.7 UP elements. (Elements: Template, Artifact, Activity, Workflow, Worker/Role, Discipline. Relations: is responsible for, performs, follows, creates/updates, is part of, is defined by.)

7.3 PRINCIPLES, AGILITY AND THE AGILE MANIFESTO

7.3.1 The Agile Manifesto

On February 11-13, 2001, at The Lodge at Snowbird ski resort in the Wasatch mountains of Utah, 17 people representing Extreme Programming, SCRUM, DSDM, Adaptive Software Development, Crystal, Feature-Driven Development, Pragmatic Programming, and other “light-weight” methodologies met to talk about a new way of developing software, different from “heavy-weight” processes (plan/document-driven methods). The results were an alliance around agile development and a “Manifesto for Agile Software Development” [17]. This manifesto is a set of value statements and principles that describe how people should implement agility in software development:


Figure 7.8 The Rational Unified Process (RUP) model [12].

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

Individuals and interactions over processes and tools.
Working software over comprehensive documentation.
Customer collaboration over contract negotiation.
Responding to change over following a plan.

That is, while there is value in the items on the right, we value the items on the left more [17].

Each of the value statements indicates a preference (the first segment) and an item of lesser priority (the second segment). This does not mean that agile people dismiss the second segments or that the second segments are bad practices. For agile people, the first segments in the statements are more important than the second segments. For example, agile practitioners recognize the importance of processes and tools, with the additional recognition that the interaction among skilled individuals has even greater importance.

Figure 7.9 The simple staged model [16]. (Stages: Initial development, Evolution, Servicing, Phaseout, Closedown. Transitions: first running version, loss of evolvability, servicing discontinued, switchoff. Loops: evolution changes, servicing patches.)

7.3.2 Agile Principles

The second part of the manifesto is a set of principles which agile practitioners must follow:


• P1: Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

• P2: Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.

• P3: Deliver working software frequently, from a couple of weeks to a couple of months, with a preference for the shorter timescale.

• P4: Business people and developers must work together daily throughout the project.

• P5: Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.

• P6: The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

• P7: Working software is the primary measure of progress.

• P8: Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.

• P9: Continuous attention to technical excellence and good design enhances agility.

• P10: Simplicity –the art of maximizing the amount of work not done– is essential.

• P11: The best architectures, requirements, and designs emerge from self-organizing teams.

• P12: At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.


7.3.3 Agility in Software Development

Iterative does not mean agile, and incremental (evolutionary) does not necessarily mean agile. However, agile methodologies are iterative and incremental. Agile methodologies consider the software process an adaptive one and are the evolution of iterative and incremental processes. Agility in software development is supported by a set of values and principles that represent ways of dealing with the inherent features and difficulties of software development. These values and principles are captured in the Agile Manifesto. The main features of agile methodologies are:

• Iterative and incremental.

• Frequent planning and feedback.

• Frequent and rapid delivery of business value in the form of high-quality working software.

• Planning based on prioritized requirements. Users define high-value requirements.

• Minimum usage of bureaucracy and overhead within the development lifecycle.

• Embracing and managing changing requirements and business priorities.

• Collaborative decision making and continuous customer involvement.

• Usage of effective and efficient ways of communication.

• Empowered teams.

• Promotion of team’s skills improvement.

• Usage of creative and ergonomic environments².

• Self-organizing and adaptive teams.

² An agile team-room wish list is presented in [18].


7.4 AGILE METHODOLOGIES HISTORY

The birth of agile methodologies is a process that started before the waterfall method [1]. The first steps of iterative methods were taken in the period between 1930 and 1970; iterative and incremental practices (time-boxed iterations and test-driven development) were used in military projects such as X-15 and Mercury [19]. Larman et al. [19] present an interesting chronology of iterative and incremental practices in software development.

7.4.1 Iterative Development (1970-1990)

Iterative methodologies were widely used during the 1970s and the 1980s in research centers such as NASA and the IBM T. J. Watson Research Center. Furthermore, the waterfall method was wrongly adopted as a non-iterative model, with poor results in several projects. Facts like these were the reasons why people started to promote the principles of iterative methodologies by the end of the 1980s. Concurrent movements and publications such as [3–5, 20] are examples of the discussion that arose around the effectiveness and performance of the waterfall model.

Another element of the discussion was the need to improve the results of the development process using people’s experience with large software systems over thirty years. Brooks [3] proposed fast and iterative prototyping as a suitable approach for software development. Victor Basili and Albert Turner [20] proposed a method for developing software by iterative enhancement; the method starts with an implementation of a subset of the problem and then iteratively enhances the versions until the full system is implemented; it is an application of the stepwise refinement proposed in [21]. Tom Gilb introduced EVO, an iterative and incremental methodology, in [5, 22]. He also started the discussion about the “evolutionary delivery” of software, based on fast and incremental delivery of high-value results. Gilb is perhaps the person who most promoted iterative models in the 1980s. Boehm [4] introduced his Spiral model as an enhancement of the waterfall and other classic models.

By the end of the 1980s, the discussion about iterative methodologies, plus the ideas of Boehm, Gilb, Shultz, and Lantz [23], were used by James Martin to define a process called Rapid Application Development (RAD) [24]. RAD is also recognized as a foundation for DSDM (the first agile methodology) and XP (the most radical of the agile methodologies). The main features of RAD are iterative prototyping using CASE tools, short iterations, and small teams of motivated and experienced people with high development skills.

7.4.2 The Birth of Agile Methodologies (1990-2001)

The 1990s was the decade of the agile methodologies. All the efforts during the 1970s and 1980s to promote iterative and incremental models resulted in the foundation of the Agile Alliance and the birth of a new methodology [2]; this new methodology is based on a set of value statements and principles written in the Agile Manifesto.

In January 1994, a group of 16 RAD practitioners met in the United Kingdom to define an iterative process based on RAD. The result of this meeting was the Dynamic Systems Development Method (DSDM) [25]. DSDM is a method based on nine principles and four values. The DSDM process includes three phases (pre-project phase, project life-cycle phase, and post-project phase) and five stages (feasibility study, business study, functional model iteration, design and build iteration, and implementation).

SCRUM appeared on the scene as an iterative process with time-boxed iterations of 30 days, used by Ken Schwaber and Jeff Sutherland during 1993 and 1994 at EASEL Corp. SCRUM is based on an iterative model used at Honda, Canon, and Fujitsu; this model was published in the article The New Product Development Game [26] and is considered the first version of SCRUM, with adaptive and self-directed team practices in a Sashimi approach³. SCRUM was first published as a software development process in [27], and a refined version was presented in [28].

³ Sashimi is a Japanese dish consisting of fish or shellfish served in thin slices. In the context of SCRUM, Sashimi refers to delivering a slice of completed functionality in each time-boxed iteration.

Feature Driven Development (FDD) is one of the agile methodologies that is process-oriented rather than people-oriented. It was used in 1995 as a set of five processes designed by Jeff de Luca for implementing a system of the Bank of Singapore with 50 developers and a schedule of 15 months. FDD was influenced by Peter Coad’s approach to modeling with UML and colors, which makes FDD the only agile methodology that explicitly uses UML as its modeling language. The FDD description was first published in [29], and a deeper description appears in [30].

Figure 7.10 The Crystal Family.

Another agile methodology conceived during the 1990s is Crystal. The Crystal family is the result of a ten-year empirical study carried out by Alistair Cockburn (it started around 1991), which aimed to design an effective methodology for software development and to analyze the impact of team size on the success of projects. The study analyzed several teams and projects (especially successful projects) in order to find and document similarities between them. Two of the conclusions were:

• People-centered methodologies work better than process-centered methodologies.

• The right methodology for a project depends on the team size and the project objectives.

Thus, the Crystal family is a set of software development methodologies (clear, yellow, orange, red) to be applied according to team size and criticality level [31–33]. Figure 7.10 shows the Crystal family.

Adaptive Software Development (ASD) is the result of two experiences of Jim Highsmith in the early 1990s: as a Microsoft consultant and as author, together with Sam Bayer, of a RAD methodology (RADical Application Development⁴) in 1994 [34]. These two experiences, combined with the use of complex adaptive systems, are the main elements in the conception of ASD [35].

Extreme Programming (XP) is the consolidation of the ideas of three people (Kent Beck, Ron Jeffries, and Ward Cunningham) in the C3 team [36]. The Chrysler Comprehensive Compensation (C3) system was initially conceived as the solution for Chrysler’s complex payroll system. The C3 project was canceled, but Beck’s experience at Tektronix, the practices used at Chrysler, and the help of Jeffries (the first XP coach) were used to define the XP methodology [37,38].

7.4.3 The Post-manifesto Age (2001-2011)

The foundation of the Agile Alliance and the Agile Manifesto were the starting point for the agile community and the boom of agility in software development. This is confirmed by the large number of publications in high-impact journals and magazines (Software Development, IEEE Software, IEEE Computer, Cutter IT Journal, Software Testing and Quality Engineering, The Economist, etc.); the publication of Web sites and the organization of agile conferences; the adoption of agile methodologies in industry; and the foundation of new agile companies. New methodologies have been conceived during this period, such as Agile Modeling (AM), Agile Unified Process (AUP), eXtreme Unified Process (XUP), Dynamic Systems Development Method version 4.2 (DSDM 4.2), DSDM Atern, Microsoft Solutions Framework for Agile Development (MSF4), OpenUP, and the Lean methodologies.

⁴ It was a methodology with one-week iterations.



[Figure: timeline from 1970 to 2010 of iterative and agile methodologies and related events, from the Waterfall model (1970) to Scrumban (2008).]

Figure 7.11 Agile methodologies evolution.

In the post-manifesto movement, the most representative methodologies are those based on lean manufacturing concepts: Lean Software Development (LSD) [39, 40], Kanban [41], and Scrumban [42]. Figure 7.11 shows the evolution of agile methodologies since 1970 and the relationships between agile methodologies and some historical events.



As a consequence of the agile movement, the Declaration of Interdependence (DOI) was published and signed as a set of six management principles initially intended for agile project managers:

Agile and adaptive approaches for linking people, projects and value: We are a community of project leaders that are highly successful at delivering results. To achieve these results:

• We increase return on investment by making continuous flow of value our focus.

• We deliver reliable results by engaging customers in frequent interactions and shared ownership.

• We expect uncertainty and manage for it through iterations, anticipation, and adaptation.

• We unleash creativity and innovation by recognizing that individuals are the ultimate source of value, and creating an environment where they can make a difference.

• We boost performance through group accountability for results and shared responsibility for team effectiveness.

• We improve effectiveness and reliability through situationally specific strategies, processes and practices [43].

7.5 AGILE METHODOLOGIES OVERVIEW

7.5.1 Extreme Programming (XP)

XP is the most popular and controversial of the agile methods. It is based on five values (simplicity, communication, feedback, courage, and respect), five principles (rapid feedback, assuming simplicity, incremental change, embracing change, quality work), and thirteen supporting and mandatory practices (whole team, planning game, small releases, customer tests, simple design, pair programming, test-driven development, refactoring, continuous integration, collective code ownership, coding standard, metaphor, sustainable pace). According to Fowler, the XP practices are concrete things that a team can do day-to-day, while values are the fundamental knowledge and understanding that underpin the approach [2]. The main features of XP are high customer involvement, rapid feedback loops, continuous testing, continuous planning, and close teamwork to deliver working software at very frequent intervals, typically every 1-3 weeks.


7.5.2 SCRUM

SCRUM is a lightweight management framework for iterative and incremental development; it can be applied to any kind of product. In SCRUM, each iteration (sprint) is time-boxed (30 days) and development is driven by a set of prioritized requirements (sprint backlog). This backlog is a subset of the full list of requirements for the product (product backlog). SCRUM has two moments for planning:

• Daily SCRUM: It is a daily meeting driven by three questions that each member of the team has to answer (What have I done since the last daily scrum? What will I do today? What problems did I have?).

• Sprint planning: The customer chooses the items for the sprint backlog and the team estimates the effort for each item.
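The relationship between the product backlog and the sprint backlog can be illustrated with a minimal Python sketch. The names `BacklogItem` and `plan_sprint`, the numeric priority convention, and the greedy selection rule are assumptions made for illustration only; in SCRUM the customer chooses the items, and the rule below merely stands in for that choice.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    title: str
    priority: int      # assumed convention: lower number = higher priority
    effort_days: int   # effort estimated by the team


def plan_sprint(product_backlog, capacity_days):
    """Select, in priority order, the items that still fit in the sprint."""
    sprint_backlog = []
    remaining = capacity_days
    for item in sorted(product_backlog, key=lambda i: i.priority):
        if item.effort_days <= remaining:
            sprint_backlog.append(item)
            remaining -= item.effort_days
    return sprint_backlog


# Hypothetical product backlog for a 30-day sprint.
product_backlog = [
    BacklogItem("Export report", priority=3, effort_days=12),
    BacklogItem("User login", priority=1, effort_days=10),
    BacklogItem("Search", priority=2, effort_days=15),
    BacklogItem("Audit trail", priority=4, effort_days=8),
]
sprint = plan_sprint(product_backlog, capacity_days=30)
print([item.title for item in sprint])  # ['User login', 'Search']
```

The sprint backlog is always a subset of the product backlog; items that do not fit stay in the product backlog for a later sprint.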

Figure 7.12 SCRUM.

7.5.3 Feature Driven Development (FDD)

FDD is a model-driven, short-iteration process. It begins with a startup phase that aims to:

1. Establish an overall model of the system by modeling in color with UML (domain object modeling).

2. Build a features list by using the domain model and the feature naming template.

3. Make a development plan according to the features list.

Then it continues with the construction phase, a series of two-week “design by feature, build by feature” iterations (Figure 7.13). FDD relies heavily on color-coded artifacts such as the domain object model, the graphic summary of features, and the plan view.

Figure 7.13 FDD lifecycle.

7.5.4 Lean Agile Development: LSD, Kanban, Scrumban

Lean agile methodologies are based on the principles of Lean Manufacturing [44,45]:


1. Specify value from the standpoint of the end customer by product family.

2. Identify all the steps in the value stream for each product family, eliminating, whenever possible, those steps that do not create value.

3. Make the value-creating steps occur in tight sequence, so the product will flow smoothly toward the customer.

4. As flow is introduced, let customers pull value from the next upstream activity.

5. As value is specified, value streams are identified, wasted steps are removed, and flow and pull are introduced, begin the process again and continue it until a state of perfection is reached in which perfect value is created with no waste.

These principles were initially applied by Mary and Tom Poppendieck to develop the first agile methodology based on lean manufacturing: LSD [39,40]. Lean Software Development (LSD) has seven principles:

1. Optimize the whole: Optimizing a part of a system will always, over time, sub-optimize the overall system.

2. Eliminate waste: Waste is anything that does not deliver value to the customer.

3. Build quality in: If you routinely find defects in your verification process, your process is defective.

4. Learn constantly: Planning is useful. Learning is essential.

5. Deliver as fast as possible: Start with a deep understanding of all stakeholders and what they will value. Create a steady, even flow of work, pulled from this deep understanding of value.

6. Engage everyone: The time and energy of creative people are the scarce resources in today’s economy, and the basis of competitive advantage.

7. Keep getting better: Results are not the main point, the point is to develop people and systems capable of delivering results.

Figure 7.14 Kanban board.

A Kanban is a physical card used in the Toyota Production System (TPS) as a lean manufacturing tool. Kanban cards are used to support non-centralized “pull” production control. In software development, this tool is used together with walls or whiteboards to visualize project progress. Thus, in a Kanban system the aim is to minimize the work-in-progress (WIP) by “pulling” the parts or tasks when needed and “pushing” instructions about how to do the tasks. Every time a task is pulled from the WIP queue, a Kanban card signals production. A kanban card is therefore information exchanged between separate processes; in software development, each process is a step of the development lifecycle. Figure 7.14 shows a typical kanban board.

Kanban systems are based on a strict discipline described by the “six rules of kanban”:

1. Customer processes withdraw or “pull” items in the precise amounts specified on the Kanban (Downstream processes).

2. Supplier produces items in the precise amounts and sequences specified by the Kanban (Upstream processes).

3. No items are made or moved without a Kanban.

4. A Kanban should accompany each item, every time.

5. Defects and incorrect amounts are never sent to the next downstream process⁵.

6. The number of Kanban is reduced carefully to reduce inventories and to reveal problems.

These principles and the philosophy of kanban systems are used in new agile methodologies such as Kanban [41, 46] and a version of SCRUM called Scrumban [42]. The main features of software development models based on kanban are:

• Requirements are split into pieces (tasks) and each one must be written on a card and put on the wall.

• On the kanban board (wall) there are named columns to illustrate where each task is in the workflow.

• Workflow and progress of the process are always visible to the team on Kanban boards.

• Work in progress must be limited by using explicit limits on how many tasks may be in progress at each workflow step.

• The process is optimized by measuring the lead time.
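The features listed above (named columns, explicit WIP limits, and lead time as the optimization measure) can be sketched in a few lines of Python. The class `KanbanBoard`, its method names, the column names, and the day counter are hypothetical choices made for this illustration and do not come from any of the cited methods.

```python
class KanbanBoard:
    """A board with ordered columns and an optional WIP limit per column."""

    def __init__(self, wip_limits):
        # wip_limits maps column name -> maximum number of tasks (None = no limit).
        self.order = list(wip_limits)
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in self.order}
        self.started = {}     # task -> day it entered the board
        self.lead_times = {}  # task -> days from entering the board to the last column

    def _check_limit(self, column):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise RuntimeError(f"WIP limit reached in column {column!r}")

    def add(self, task, day):
        """Put a new card in the first column."""
        first = self.order[0]
        self._check_limit(first)
        self.columns[first].append(task)
        self.started[task] = day

    def pull(self, task, day):
        """Pull a card into the next column, if that column has capacity."""
        current = next(c for c in self.order if task in self.columns[c])
        position = self.order.index(current)
        if position == len(self.order) - 1:
            raise ValueError(f"{task!r} is already in the last column")
        target = self.order[position + 1]
        self._check_limit(target)
        self.columns[current].remove(task)
        self.columns[target].append(task)
        if target == self.order[-1]:
            self.lead_times[task] = day - self.started[task]


board = KanbanBoard({"To do": None, "In progress": 2, "Done": None})
for name in ("A", "B", "C"):
    board.add(name, day=0)
board.pull("A", day=1)
board.pull("B", day=1)
try:
    board.pull("C", day=2)   # a third task would exceed the limit of 2
    blocked = False
except RuntimeError:
    blocked = True
board.pull("A", day=4)       # A is done, which frees a slot
board.pull("C", day=4)
print(blocked, board.lead_times)  # True {'A': 4}
```

Refusing the pull, rather than queuing the task silently, is what makes a bottleneck visible on the board; the recorded lead times are the measure the last bullet refers to.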

7.5.5 Agile Versions of UP: AgileUP, Basic/OpenUP

⁵ Upstream processes produce parts only if their downstream processes need them. Downstream workers withdraw or “pull” the parts they need from their upstream processes.

The UP family has two agile versions: AUP and Basic/OpenUP. The Agile Unified Process (AUP) is a simplified version of RUP proposed by Scott Ambler [47]. AUP preserves the essentials of RUP and uses agile techniques such as test-driven development, agile modeling, agile change management, and database refactoring. The AUP lifecycle is quite different from that of RUP, because the AUP disciplines are modeling⁶, implementation, test, deployment, configuration management, project management, and environment. The AUP principles are:

1. The AUP product provides high-level guidance.

2. Simplicity.

3. Agility.

4. Focus on high-value activities.

5. Tool independence.

6. Tailor the AUP without taking a course or buying a product.

BasicUP (BUP) is a lightweight version of RUP for small projects. It is an iterative and incremental process based on scenario-driven development, risk management, and an architecture-centric approach. OpenUP is simply the renamed version of BUP published in [49]. OpenUP is driven by the four core principles listed below:

1. Collaborate to align interests and share understanding.

2. Balance competing priorities to maximize stakeholder value.

3. Focus on the architecture early to minimize risks and organize development.

4. Evolve to continuously obtain feedback and improve.

OpenUP is organized in two different dimensions: method content and process content. The method content is related to the definition of method elements (roles, tasks, artifacts, and guidance), and the process content is where the method elements are applied in a temporal sense. One of the innovations of OpenUP over RUP is that OpenUP addresses the organization of work at the personal, team, and stakeholder levels (Figure 7.15). The OpenUP method is focused on the following disciplines: requirements, architecture, development, test, project management, and configuration and change management.

⁶ The goal of the modeling discipline is to understand the business of the organization, the problem domain being addressed by the project, and to identify a viable solution to address the problem domain, by using agile modeling techniques [48].

Figure 7.15 OpenUP layers - http://epf.eclipse.org/wikis/openup/ (accessed and verified on April 10, 2011).

7.6 AGILITY AND SOFTWARE EVOLUTION

Practitioners of agile methods do not expect a detailed and complete set of requirements at the beginning of the project, because the environment of a software system and its requirements are subject to frequent change. To deal with this characteristic of software, agile methods use several strategies to minimize the risk and impact of new requirements on the process. Consider a typical example from control theory: systems without feedback loops are likely to drift away from the desired behavior, which means there is a high probability of reaching an entropic state. Entropy exists because change exists and also because evolution is an inevitable natural phenomenon. If the environment evolves, then the systems in the environment must evolve. Thus, the strategy for dealing with entropy is to introduce feedback into the system, so that the desired behavior can be compared with the actual behavior in order to make internal decisions (inside the system) and correct undesirable results. The corrective actions then depend on the frequency and the quality of the feedback. The shorter the feedback period, the better the results, but there are limits, because it is impossible to build systems with feedback periods equal to zero.
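The effect of the feedback period can be illustrated with a deliberately idealized simulation in Python. The function `max_deviation`, the constant drift per step, and the assumption that each feedback event corrects the accumulated error completely are all simplifications introduced here for illustration; real systems correct only partially and with delay.

```python
def max_deviation(steps, drift_per_step, feedback_period=None):
    """Largest gap between desired and actual behavior over a run.

    Assumes a constant drift per step and a perfect correction at every
    feedback event; both are idealizations made for this sketch.
    """
    error = 0.0
    worst = 0.0
    for step in range(1, steps + 1):
        error += drift_per_step              # the environment pushes the system off target
        worst = max(worst, abs(error))
        if feedback_period and step % feedback_period == 0:
            error = 0.0                      # feedback: compare with the target and correct
    return worst


print(max_deviation(30, 1.0))       # no feedback loop: 30.0
print(max_deviation(30, 1.0, 10))   # feedback every 10 steps: 10.0
print(max_deviation(30, 1.0, 2))    # feedback every 2 steps: 2.0
```

Under these assumptions the worst deviation grows linearly with the feedback period, which is the intuition behind short iterations; the sketch ignores the cost of each feedback event, which is what sets a practical lower bound on the period.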

Figure 7.16 Communication modes and effectiveness [50].

But how is this related to software development? Simply put, software is a system, and the development process is a system too; the environment in which software and its process “live” is continuously evolving, so the software and the process must evolve as well. How, then, can the development team know whether the process is drifting away from the desired results? By frequently releasing the product to the stakeholders and measuring how much value the product gives them (feedback loops). There are limits here as well: if iterations are too short, the development process can degenerate into a chaotic “code and fix”. That is why agile processes are iterative and incremental.

Feedback quality is also a factor in the successful evolution of systems, and agile practitioners know it; thus, agile methodologies use and recommend practices for obtaining good feedback, based on communication effectiveness. Alistair Cockburn [50] describes various modes of communication for increasing effectiveness. Figure 7.16 plots the relation between the richness of communication modes and their effectiveness. Face-to-face conversation and design on whiteboards are the most effective; both are the preferred modes of communication in agile methodologies.

Table 7.1 Agile principles vs. Software features.

Principle vs. Features F1 F2 F3 F4 F5 P6P1 X X X XP2 X X XP3 X X XP4 X XP5 XP6 X X XP7 XP8 XP9 XP10 XP11 XP12 X

Agile methodologies embrace the real nature of software development. Table 7.1 shows the relationship between each of the agile principles and the software features presented in the introduction. Thus, agile methodologies are a good choice for dealing with software difficulties and features. However, there is still an open discussion about the effectiveness of agile methods in large projects with large teams, because these methods were initially designed for small teams. Agile methods can be scaled by using structures of small, distributed agile teams instead of typical hierarchical structures, together with management practices for coordinating the small teams. Highsmith [51] presents a structure called “hub”, which consists of a network of small agile teams; each node in the hub is a team, and the network may include several feature teams, a customer team, an architecture team, a project-management team, an integration and build team, and even a center of excellence (Figure 7.17). Other examples of strategies and models for scaling Scrum and agile teams are in [52] and [53].


Figure 7.17 Hub organizational structure [51].


In December 2009 the SEMAT⁷ (Software Engineering Method and Theory) initiative was launched as a new community effort to reshape software engineering based on a solid theory, proven principles, and best practices. The founders and first signatories of this initiative are world-class experts who have pioneered the development of software engineering since its inception. This group of eminent members includes most of the creators of agile methodologies. Many new supporters are joining from different areas of the field and from around the world, and there is now an organization that defines several tracks of work.

⁷ SEMAT initiative: http://www.semat.org/

One of the key points is that the SEMAT initiative aims to define a kernel of widely agreed elements needed to build software using a minimalist approach: the kernel would be complete when no other element can be removed from it. This approach is completely opposite to the one used in the definition of heavy methodologies, where almost every type of potentially useful element was included. This means that the initiative recognizes that agile principles embody the smart way to build software, and therefore the core of the intersection of agile methods will surely be part of this kernel.

However, the goals of SEMAT go beyond current agile methods. One of them is to obtain a foundation for the agile creation and enactment of software engineering methods (agile or more traditional) by practitioners themselves. Ideally, development teams would be able to customize, tailor, and adapt the process they use, not just at the beginning of a project, but continuously as necessary over the course of a development effort. This means that agile principles will influence the way we think about well-defined processes, since in the future software development teams will not have a prescribed method that they must follow step by step, regardless of specific circumstances and changes in the environment around them. Instead, the method in use will be flexible enough to be adapted at any time during the software development effort. In this regard, a consistent notation or language for describing software engineering practices is needed so that developers can formally express changes in their method and, consequently, be effective and scalable while remaining flexible and agile. Accordingly, an agile method will evolve to suit the particular circumstances of the project in which it is used.


REFERENCES

[1] Royce, W. Managing the development of large software systems. In: Proceedings of the 9th International Conference on Software Engineering: 1970.

[2] Fowler, M. The New Methodology: 2005. Accessed and verified on April 11, 2011.

[3] Brooks, F. No Silver Bullet - Essence and Accidents of Software Engineering. Computer, 20: 1987; 10–19.

[4] Boehm, B. A spiral model of software development and enhancement. Computer, 21: 1988; 61–72.

[5] Gilb, T. Principles of Software Management. Addison-Wesley: 1988.

[6] Eick, S., Graves, T., Karr, A., Marron, J. & Mockus, A. Does Code Decay? Assessing the Evidence from Change Management Data. IEEE Transactions on Software Engineering, 27(1): 2001.

[7] Cockburn, A. Characterizing people as non-linear, first-order components in software development. Technical report, Humans and Technology: 1999.

[8] Lehman, M. Programs, Life Cycles, and Laws of Software Evolution. Proceedings of the IEEE, 68(9): 1980; 1060–1076.

[9] Gilb, K. Evolutionary Project Management and Product Development: 2007.

[10] Gilb, T. Competitive Engineering: A Handbook For Systems Engineering, Requirements Engineering, and Software Engineering Using Planguage. Butterworth-Heinemann: 2005.

[11] Jacobson, I., Booch, G. & Rumbaugh, J. The Unified Software Development Process. Addison-Wesley Professional: 1999.


[12] Kruchten, P. The Rational Unified Process: An Introduction. Addison-Wesley: 2003.

[13] Ambler, S., Nalbone, J. & Vizdos, M. The Enterprise Unified Process: Extending the Rational Unified Process. Prentice Hall: 2005.

[14] Ambler, S. The Unified Process From A to Z: 2009. Accessed and verified on April 11, 2011.

[15] Ambler, S. History of the Unified Process: 2010. Accessed and verified on April 11, 2011.

[16] Rajlich, V. & Bennett, K. A staged model for the software life cycle. Computer, 33(7): 2000; 66–71.

[17] Beck, K. Manifesto for Agile Software Development: 2001. Accessed and verified on April 11, 2011.

[18] Hartmann, D. Designing Collaborative Spaces for Productivity: 2007. Accessed and verified on April 11, 2011.

[19] Larman, C. & Basili, V. Iterative and Incremental Development. Computer, 36(6): 2003; 47–56.

[20] Basili, V. & Turner, A. Iterative Enhancement: A Practical Technique for Software Development. IEEE Transactions on Software Engineering, SE-1(4): 1975; 390–396.

[21] Wirth, N. Program Development by Stepwise Refinement. Communications of the ACM, 14(4): 1971; 221–227.

[22] Gilb, T. Software Metrics. Winthrop Publishers: 1977.

[23] Lantz, K. Prototyping Methodology. Reston Pub Co: 1986.

[24] Martin, J. Rapid Application Development. Macmillan Coll Div: 1991.

[25] Stapleton, J. DSDM: The Method in Practice. Addison-Wesley: 1997.


[26] Takeuchi, H. & Nonaka, I. The new product development game. Harvard Business Review, January-February: 1986; 137–146.

[27] Sutherland, J. & Schwaber, K. Controlled Chaos: Living on the edge. Technical report, Advanced Development Methods: 1996.

[28] Beedle, M. Scrum: An extension pattern language for hyperproductive software development. Pattern Language of Program Design, 4: 1999; 637–651.

[29] Coad, P., LeFebvre, E. & Luca, J. Java Modelling in Color with UML: Enterprise Component and Process. Prentice Hall: 2000.

[30] Palmer, S. & Felsing, J. A Practical Guide to Feature-Driven Development. Prentice Hall PTR: 2002.

[31] Cockburn, A. Surviving Object-Oriented Projects. Addison-Wesley Professional: 1998.

[32] Cockburn, A. People and methodologies in software development. Doctoral thesis, Faculty of Mathematics and Natural Sciences, University of Oslo: 2003.

[33] Cockburn, A. Crystal Clear: A Human-Powered Methodology for Small Teams. Addison-Wesley Professional: 2004.

[34] Bayer, S. & Highsmith, J. Radical Software Development. American Programmer Magazine, 7: 1994; 35–42.

[35] Highsmith, J. Adaptive Software Development: A Collaborative Approach to Managing Complex Systems. Dorset House Publishing: 2000.

[36] Beck, K. Chrysler goes to extreme. Distributed Computing: 1998; 24–28.

[37] Beck, K. Extreme Programming Explained. Addison-Wesley: 2000.


[38] Beck, K. & Fowler, M. Planning Extreme Programming. Ad: 2000.

[39] Poppendieck, M. & Poppendieck, T. Lean Software Development: An Agile Toolkit. Addison-Wesley: 2003.

[40] Poppendieck, M. & Poppendieck, T. Implementing Lean Software Development: From Concept to Cash. Addison-Wesley: 2006.

[41] Kniberg, H. & Skarin, M. Kanban and Scrum - making the most of both. InfoQ: 2010.

[42] Ladas, C. Scrumban - Essays on Kanban Systems for Lean Software Development. Modus Cooperandi Press: 2009.

[43] Highsmith, J. Declaration of Interdependence: 2005. Accessed and verified on April 11, 2011.

[44] Womack, J. & Jones, D. The Machine that Changed the World: The Story of Lean Production. Harper Perennial: 1991.

[45] Womack, J. & Jones, D. Lean Thinking. Simon & Schuster: 1996.

[46] Anderson, D. Kanban. Blue Hole Press: 2010.

[47] Ambler, S. The Agile Unified Process (AUP): 2009. Accessed and verified on April 11, 2011.

[48] Ambler, S. Agile Modelling. Wiley: 2001.

[49] Kroll, P. & MacIsaac, B. Agility and Discipline Made Easy: Practices from OpenUP and RUP. Addison-Wesley Professional: 2006.

[50] Cockburn, A. Agile Software Development. Addison-Wesley Professional: 2001.

[51] Highsmith, J. Agile project management. Addison-Wesley: 2004.


[52] Woodward, E., Surdek, S. & Ganis, M. A Practical Guide to Distributed Scrum. IBM Press: 2010.

[53] Larman, C. & Vodde, B. Practices for Scaling Lean & Agile Development. Addison-Wesley: 2010.


Software Development Agility in Small and Medium Enterprises (SMEs)

Victor Escobar-Sarmiento
Mario Linares-Vásquez
Jairo Aponte

ABSTRACT

The creation of new small and medium-sized companies in the software development industry has been influenced by the worldwide acceptance of software as an important aspect of daily life, and by the continued growth of the software development industry during the last decade. The rapid pace at which these companies are founded and enter into business makes them experience some drawbacks, such as informality in their processes and management models, and methodological deficiencies. On the other hand, agile methodologies have been conceived as a model for high-quality software development that uses small workgroups without hierarchical organization and reduces the ceremony in internal processes. These methodologies are characterized by constant planning, continuous feedback, and permanent interaction with the client. Agile methodologies appear as a way to improve the performance and results of companies interested in software development, especially small and medium ones. In this chapter we present the features of small and medium companies and the challenges they face in adopting agile methodologies.


8.1 INTRODUCTION

In recent decades, the uncontrollable evolution of technology has been a crucial factor in the birth and growth of new companies, which provide different kinds of services worldwide. Nowadays, software development companies fill the market, offering new products or solutions. Acs et al. [1] argued that small firms are indeed the engines of global economic growth, technology advancement, and employment opportunities. In most countries, SMEs (Small and Medium Enterprises) dominate the industrial and commercial infrastructure. Economists believe that the wealth of nations and the growth of their economies depend on the performance of their SMEs [2].

SMEs are the most important companies in terms of software production. Tables 8.1 and 8.2 show the importance of India, Ireland, and Israel in the software market. For example, Ireland is recognized as one of the principal countries in this field; the software sector is one of the strongest in its economy, and it grew 25% faster than international markets in the nineties [3]. In Ireland, almost 99% of companies are SMEs and each one employs fewer than 50 people; however, they account for over 68% of private sector employment [4]. A report by ICT Ireland [5] shows that the employment contribution of Irish SMEs is just 11%, compared with 15% in the United States. Another remarkable country in the genesis of software development SMEs is India. During the 70s, companies in India started to experience unexpected growth, employing about 345,000 people in 2004 and generating earnings of almost US$12 billion, equivalent to 3.3% of global services spending.

Table 8.1 Software exports from India, Ireland and Israel (US$ millions) [6].

                         India     Ireland   Israel
1990                     105       2,132     90
2000                     6,200     8,865     2,600
2002                     7,500     12,192    3,000
2003                     8,600     11,819    N/A
Employment 2003          260,000   23,930    15,000
Revenue/employee 2003    33,076    493,988   273,000

Table 8.2 Growth of the Indian software industry [6].

Year   Total exports    No. of   Average revenue   Average revenue      Exports/Revenue
       (US$ millions)   firms    per firm (US$)    per employee (US$)   (%)
1980   4                21       190,476           16,000               50
1984   25.3             35       722,857           18,471               50
1990   105.4            700      150,571           16,215               N/A
2000   5,287            816      7,598,039         32,635               71.8
2004   12,200           3,170    7,004,154         35,362               73.9

Software industry in Colombia has taken an important role in econ-omy; in the last years its growth has been significant, with revenuesaround US$465 millions in 2009 [7], with SMEs being main participants.Reports of Proexport 1 show that the Colombian market duplicated itsearnings from 2006 to 2009 [7]. However, according to [8], the softwareproducts of Colombian companies do not have the recognition outsideof the local boundaries because their market is not sophisticated and isoriented just for customers in Latin America. This study also reports alist of factors to be improved in the industry, mostly related to SMEsconfiguration, industrial maturity, government support, infrastructure,and the human resources involved in the processes. Additionally [9]and [10] describe a primary set of obstacles for SMEs’ development inColombia; the set includes the following items:

• Difficulties in recognizing and accessing appropriate technology.

• Formalization and absorption of new technologies.

• Technical and competitive limitations.

• Poor physical infrastructure.

• Lack of managers with management skills and strategic thinking.

• Lack of qualified human resources.

• Limited access to external markets, among others.

¹ Proexport is the entity that promotes tourism, foreign investment and exports in Colombia. http://www.proexport.com.co/ (accessed and verified on November 17, 2011)



Figure 8.1 Software development in Colombia [7]: (a) software development companies in Colombia (2005-2009); (b) participation by size of software development companies in Colombia.

Figure 8.2 Earnings generated by Software development in Colombia [7].

Therefore, smart and effective solutions to these kinds of problems are needed, and we think agile methodologies are better suited to the characteristics of SMEs than traditional plan-driven methodologies.

This chapter is organized as follows: Section 8.2 explains the legal definition of SMEs globally and locally; Section 8.3 describes the current situation of agile methodologies around the world; Section 8.4 presents some agility assessment models; Section 8.5 discusses the weaknesses and strengths of agile methodologies; Section 8.6 describes the challenges of adopting agile methodologies in SMEs; and the last section draws conclusions and summarizes the chapter.



8.2 LEGAL DEFINITION OF SMES

In this section we present international and local definitions of SMEs.

8.2.1 Foreign Definitions

The European Commission defines SMEs in Recommendation 2003/361/EC of 6 May 2003, which took effect on 1 January 2005 (published in OJ L 124 of 20.5.2003, p. 36):

• A micro enterprise has a headcount of fewer than 10, and a turnover or balance sheet total of not more than €2 million.

• A small enterprise has a headcount of fewer than 50, and a turnover or balance sheet total of not more than €10 million.

• A medium-sized enterprise has a headcount of fewer than 250, and a turnover of not more than €50 million or a balance sheet total of not more than €43 million.

Other definitions can be found in Table 8.3.
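The three European Commission thresholds can be read as a simple decision rule. The following Python sketch is illustrative only: the `Company` type, the `classify_eu` function and the "large" label for anything above the medium band are assumptions introduced here, not part of the Recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    """Hypothetical record; monetary amounts are in euros."""
    headcount: int
    turnover: float
    balance_sheet_total: float


def classify_eu(company: Company) -> str:
    """Apply the headcount and financial ceilings, smallest band first."""
    # (label, headcount strictly below, turnover ceiling, balance sheet ceiling)
    bands = [
        ("micro", 10, 2e6, 2e6),
        ("small", 50, 10e6, 10e6),
        ("medium", 250, 50e6, 43e6),
    ]
    for label, max_headcount, max_turnover, max_balance in bands:
        # Either financial ceiling is enough, as in the definitions above.
        within_finance = (company.turnover <= max_turnover
                          or company.balance_sheet_total <= max_balance)
        if company.headcount < max_headcount and within_finance:
            return label
    return "large"  # assumed label for anything above the medium band


print(classify_eu(Company(8, 1.5e6, 1.0e6)))    # micro
print(classify_eu(Company(40, 9e6, 12e6)))      # small
print(classify_eu(Company(200, 60e6, 40e6)))    # medium
```

Note that a company with fewer than 10 employees but more than €2 million in both turnover and balance sheet total falls through to the next band in this sketch.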

According to Schatz [12], the main characteristics of these kinds of companies are:

1. SMEs are strongly owner-manager driven. Most of the decision makers’ time is spent on routine tasks. In many cases, they are family run.

2. SMEs are driven by the demand for improving productivity, cutting costs and ever-decreasing life-cycle phases.

3. SMEs do not have extensive processes or structures. They are run by one individual or a small team, who make decisions with a short-term time horizon.

4. SMEs are generally more flexible, and can quickly adapt the way they do their work around a better solution.

5. SME entrepreneurs are generally “all-rounders” with basic knowledge in many areas. They are good at multi-tasking.



Table 8.3 Definition of SMEs in selected countries (adapted from [11]). * (CBI, 2009); ** (ISIPO, 2009).

Country | Category of enterprise | Employee numbers | Turnover | Other measures
European Commission | Small | 10-50 employees | Less than €10 (13.5 USD) million turnover | Balance sheet total: less than €10 million
European Commission | Medium | Fewer than 250 employees | Less than €50 (67.6 USD) million turnover | Balance sheet total: less than €43 million
Iran | Small | Less than 10*; less than 50** | - | -
Iran | Medium | 10-100*; 50-250** | - | -
Malaysia | Small | Between 5-50 employees | Between RM 250,000 (75,000 USD) and less than RM 10 (3 USD) million | -
Malaysia | Medium | Between 50-150 employees | Between RM 10 (3 USD) million and RM 25 (7.5 USD) million | -

6. SMEs are more people-dependent than process-dependent. Specific individuals do certain tasks; their experience and knowledge enable them to do so.

7. SMEs are often less sophisticated, since it is hard for them to recruit and retain technology professionals.

8. SMEs are more focused on medium-term survival than long-term profits.

9. SMEs do not focus on efficiencies. They end up wasting a lot of time and money on general and administrative expenses.

10. SMEs are time-pressured.

11. SMEs want a solution, not a particular machine or service.



12. SMEs focus on gaining instant gratification with technology solutions. These solutions must be simple to use and easy to deploy, and must provide clear tangible benefits.

13. SMEs do not necessarily need to have the “latest and greatest” technology. The solution can use “lag technology”, which makes it cheaper to obtain and use.

8.2.2 Local Definition (Colombia)

The SME definition in Colombia is established by Law 590 of 2000. It defines SMEs as every unit of economic exploitation, established by natural or legal persons, in business, agricultural, industrial, commercial or services activities, that meets the following parameters:

• Medium enterprises

– Number of employees: Between 51 and 200.

– Total assets: Between 5,001 and 15,000 SMLV².

• Small enterprises

– Number of employees: Between 11 and 50.

– Total assets: Between 501 and 5,000 SMLV.

• Micro enterprises

– Number of employees: Less than 10.

– Total assets: Less than 500 SMLV.

For SMEs whose employee count and total assets fall into different categories, total assets are the determining factor.
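A minimal Python sketch of this rule follows. The function names, the "large" label above 15,000 SMLV, and the treatment of the boundary values that the list leaves open (exactly 10 employees, exactly 500 SMLV) are assumptions made for illustration.

```python
def _band(value, ceilings):
    """Return the first label whose ceiling is not exceeded, else 'large' (assumed label)."""
    for ceiling, label in ceilings:
        if value <= ceiling:
            return label
    return "large"


# Upper bounds taken from the parameters listed above; boundary values
# (10 employees, 500 SMLV) are assigned to the lower band as an assumption.
HEADCOUNT_CEILINGS = [(10, "micro"), (50, "small"), (200, "medium")]
ASSET_CEILINGS_SMLV = [(500, "micro"), (5000, "small"), (15000, "medium")]


def classify_colombia(employees, total_assets_smlv):
    """Return (category, conflict); total assets decide when the two criteria disagree."""
    by_headcount = _band(employees, HEADCOUNT_CEILINGS)
    by_assets = _band(total_assets_smlv, ASSET_CEILINGS_SMLV)
    return by_assets, by_assets != by_headcount


print(classify_colombia(30, 6000))  # ('medium', True): assets override headcount
print(classify_colombia(8, 300))    # ('micro', False)
```

The second element of the result flags the cases in which the two criteria disagree and total assets settled the category.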

² Legal minimum wage (SMLV, from its Spanish initials: Salario Mínimo Legal Vigente); equivalent to US$302.



8.3 AGILE METHODOLOGIES IN THE REAL WORLD

This section reports the results of the 5th annual State of Agile Development survey [13]. The report shows the main participants in ASDM³ usage, its impact within organizations, and the tools used in ASDM implementations. The survey was conducted between August 11 and October 30, 2010. The data were analyzed and compiled into a summary report covering a total of 4770 responses. The respondents were most commonly project managers (almost 20%) and other managerial staff; the distribution of the roles of the rest of the respondents was:

• 9.7% were development managers.

• 9.2% were team leaders.

• 9% were developers.

• 8.9% were other respondents.

More than 90% of the survey respondents reported some knowledge of ASDM, but only about a third of them had practiced ASDM in previous jobs (Section 8.3.2). The companies or organizations in the study that have implemented ASDM approve of agile usage at a rate of more than 60%; ASDM and its practices helped them improve the results of their software development processes. The survey indicates that many people want to get more involved in agile methodologies, and that those who have already worked with them want to improve their skills and achieve a faster, friendlier and more effective software development process. The results of the survey are listed as follows:

8.3.1 Knowledge About Agile Methodologies

• 42.7% were moderately knowledgeable.

• 25.3% were extremely knowledgeable.

• 23.7% were knowledgeable.

• 8.3% had very little knowledge.

³ Agile Software Development Methodologies.



8.3.2 People Who Practiced Agile at a Previous Company

• 34% did.

• 66% did not.

The survey also identifies concerns prior to the adoption of ASDM; the most common are the following:

• 224 people are concerned about loss of management control.

• 210 people are concerned about lack of up-front planning.

• 204 people are concerned about management opposed to change.

• 173 people are concerned about lack of documentation.

8.3.3 Roles in Agile Usage

According to the survey, the role most often named as the closer is the VP/director of development, with 21.5%, in contrast to the QA group, with 0.9%. The results are:

• 21.5% think the closer is VP/director of development.

• 17.9% think the closer is project manager.

• 15.4% think the closer is development manager.

• 12.8% think the closer is product manager.

• 7.9% think the closer is team lead.

• 7.4% think the closer is developer.

8.3.4 Reasons for Agile Adoption

Enhancing speed is one of the most important reasons to implement ASDM. Respondents rated the reason “Accelerate Time to Market” as “Very important” (41.5%) or “Highest importance” (37.4%), whereas they rated the reason “Manage distributed teams” as “Not important at all” (45.8%) or “Somewhat important” (31.8%). For the reason “Enhance Software Quality”, 49% of respondents answered “Very important” and 23.6% answered “Highest importance”. On the other hand, 40.3% of respondents answered that “Improved Increased Engineering Discipline” is “Somewhat important” for adopting agile methods.

8.3.5 Agile Methodology Used

Scrum and the Scrum/XP hybrid were the most used methodologies, with a combined 75.6% of answers, while each of the other methodologies accounted for less than 6%. The detailed results were:

• 58% used Scrum.

• 17.6% used a Scrum/XP hybrid.

• 5.4% used a custom hybrid.

• 3.7% used Extreme programming (XP).

• 2.1% used Feature-driven development (FDD).

• 3.3% did not know what methodology they used.

8.3.6 What Has Already Been Achieved by Using ASDM

On this topic, 46% of the respondents answered that “Enhance ability to manage changing priorities” was significantly improved after adopting Agile; others (38.5%) said that ASDM has significantly improved project visibility. According to the survey results, project performance improved after adopting ASDM.

8.3.7 Barriers to Further Agile Adoption

The most likely barriers to further ASDM adoption are:

• 1286 respondents (17.1%) said that the ability to change organizational culture was a barrier.

• 1023 respondents (13.6%) answered that the availability of personnel with the necessary skills was a barrier.

• 1018 respondents (13.5%) responded that general resistance to change was a barrier.

• 867 respondents (11.5%) said that managerial support was a barrier.

8.3.8 Agile Practices

The most commonly used techniques were iteration planning (83%), daily stand-up meetings (82%) and unit testing (77%), whereas the least used techniques were collective code ownership (36%), Kanban (18%) and behavior-driven development (9%).

8.3.9 Plans for Implementing Agile on Future Projects

Most respondents, 436 people (63.3%), indicated that they or their companies plan to implement Agile methods on future projects; 193 people (28%) said that they do not know whether they or their companies are planning to implement Agile methods; and only 60 respondents (8.7%) answered that they will not.
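The percentages quoted for this question follow from the raw counts. As a quick consistency check, a short Python sketch (the dictionary keys are labels chosen here for illustration):

```python
# Counts as reported for this survey question; the keys are illustrative labels.
counts = {"plan to": 436, "do not know": 193, "will not": 60}


def shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


print(sum(counts.values()))  # 689
print(shares(counts))        # {'plan to': 63.3, 'do not know': 28.0, 'will not': 8.7}
```

The three answers add up to 689 responses, so this question appears to have been answered by only a subset of the survey's respondents.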

8.3.10 Using Agile Techniques on Outsourced Projects

The survey reports that 1082 respondents (41%) do not outsource. On the other hand, 845 (32.1%) outsource or are planning to do so. Others (12%) answered that they are using Agile methods on outsourced projects but are not planning to do so in the future, and 10.8% neither use Agile methods currently nor plan to do so in the future.

8.4 AGILITY ASSESSMENT MODELS

Adoption of agile methods depends on the project, the team, and the company. Accordingly, several models have been proposed to assess the level of agility and discipline in companies. Below we describe some of these models.

8.4.1 Boehm and Turner’s Agility and Discipline Assessment

Boehm et al. [14] argued that, depending on a set of factors in software projects, agility and discipline should be combined to achieve good results in software development. “Discipline is the foundation for any successful endeavor” [14]; it can be compared to athletes’ training or musicians’ practicing. Without discipline, success would be occasional. The authors highlight that discipline creates well-organized memories, history, and experience in an organization. However, discipline has its counterpart, “agility”. To continue the athletes’ example, agility gives them the ability to make an unexpected play, just as it gives engineers the ability to embrace changing technology, needs and requirements. Agility, according to the authors, uses memory and history to adjust to new environments, react and adapt to changes, take advantage of unexpected opportunities, and update the experience for the future. It is clear that people develop software using both agility and discipline, agile and plan-driven methods; thus, the best way to approach development is to find a balance between the two models.

Plan-driven and Agile Methods

The plan-driven approach is the traditional way to develop software. It is based on the formal methodology adopted from traditional engineering disciplines, such as civil and mechanical engineering. For example, in the construction of a new vehicle in an assembly factory, the design phase focuses on specifying the principal characteristics of the new vehicle, such as materials, structure and shape, among others. Then the process moves to the assembly line, which executes the plan made in the design phase. At this stage, all the materials, personnel and dependencies necessary to build the vehicle are in place. In software development, this approach is represented by a requirements/design/build paradigm with standard, well-defined processes that organizations often improve continuously.



Boehm et al. [14,15] say that the plan-driven approach is a systematic engineering approach that carefully adheres to specific processes that move a software product through a series of representations from requirements to finished code. Thus, there is a need for complete documentation at every step of the process. This is the traditional way of viewing the development cycle as a waterfall from concept to product release. The plan-driven method requires management support, organizational infrastructure and an environment where the participants understand the importance of common processes to their personal work and to the success of the enterprise.

Boehm et al. [14,15] describe agile methods as lightweight processes that employ short iterative cycles; actively involve users to establish, prioritize, and verify requirements; and rely on the tacit knowledge of the team as opposed to documentation. Agile methods have the following attributes:

• Iterative (several cycles).

• Incremental (the entire product is not delivered at once).

• Self-organizing (teams determine the best way to handle work).

• Emergence (processes, principles, and work structures are recognized during the project rather than predetermined).

The rapidly changing nature of software demands greater speed from software developers and new techniques to achieve their goals. The problem of change arises in long development cycles that yield code that may be well written but does not meet user expectations. Agile methods address this problem. However, they have some requirements for success, such as close relationships with the customer and the final users of the system under development; motivated and knowledgeable team members; and minimal documentation effort.

Agile and Plan-driven Methods Home Grounds

Boehm et al. [14] identified five critical decision factors related to agile and plan-driven home grounds (Table 8.5), and summarized them in a five-axis plane (Figure 8.3) where:



• Size axis represents the number of persons working in the project.

• Culture axis represents the balance between chaos and order.

• Dynamism axis is an estimate of how much the team or organization likes to work on the edge of chaos or with more planning and defined practices and procedures.

• Personnel axis represents the skills of the team. (The different skill levels are explained in Table 8.4.)

• Criticality axis represents the criticality of the project measured asloss of lives resulting from defects that may exist in the process.

Table 8.4 Personnel characteristics [14,15].

Level | Characteristics
3 | Able to revise a method (break its rules) to fit an unprecedented situation.
2 | Able to tailor a method to fit a precedented new situation. Can manage a small, precedented agile or plan-driven project but would need level 3 guidance on complex, unprecedented projects.
1A | With training, able to perform discretionary method steps (e.g., sizing tasks for project timescales, composing patterns, architecture reengineering). With experience, can become level 2. 1A’s perform well in all teams with guidance from level 2 people.
1B | With training, able to perform procedural method steps (e.g., coding a class method, using a CM tool, performing a build/installation/test, writing a test document). With experience, can master some level 1A skills. May slow down an agile team but will perform well in a plan-driven team.
-1 | May have technical skills, but unable or unwilling to collaborate or follow shared methods. Not good on an agile or plan-driven team.

These five factors associated with ASDM and plan-driven methods can be summarized in a radar plot (Figure 8.3), which allows us to visualize the agility needed in each organization. Therefore, the level of agility depends on the kind of projects under way and the organization in charge of them.
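One way to make such a five-factor profile concrete is to record a score per axis and see which way the axes lean. The sketch below is a hypothetical illustration, not Boehm and Turner's procedure: the 0-to-1 normalization (0 = agile home ground, 1 = plan-driven home ground), the 0.5 threshold and all names are assumptions introduced here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProjectProfile:
    """Hypothetical scores per axis: 0.0 = agile home ground, 1.0 = plan-driven."""
    size: float
    culture: float
    dynamism: float
    personnel: float
    criticality: float


def assess(profile: ProjectProfile, threshold: float = 0.5):
    """Return (leaning, sorted list of axes that lean plan-driven)."""
    scores = asdict(profile)
    plan_axes = sorted(axis for axis, score in scores.items() if score > threshold)
    if not plan_axes:
        return "agile", plan_axes
    if len(plan_axes) == len(scores):
        return "plan-driven", plan_axes
    return "mixed", plan_axes


print(assess(ProjectProfile(0.1, 0.2, 0.3, 0.2, 0.1)))  # ('agile', [])
print(assess(ProjectProfile(0.2, 0.3, 0.2, 0.3, 0.9)))  # ('mixed', ['criticality'])
```

In the mixed case, the axes returned are the ones that pull away from the agile home ground and would deserve a closer look before choosing a method.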



Table 8.5 Agility - plan-driven method home grounds and levels of software method understanding and use [15].

Characteristics | Agile | Plan-driven
APPLICATION
Primary Goal | Rapid value; responding to change | Predictability, stability, high assurance
Size | Smaller teams and projects | Larger teams and projects
Environment | Turbulent; high change; project focused | Stable; low change; project/organization focused
MANAGEMENT
Customer Relations | Dedicated on-site customers; focused on prioritized increments | As-needed customer interactions; focused on contract provisions
Planning and Control | Internalized plans; qualitative control | Documented plans, quantitative control
Communications | Tacit interpersonal knowledge | Explicit documented knowledge
TECHNICAL
Requirements | Prioritized informal stories and test cases; undergoing unforeseeable change | Formalized project, capability, interface, quality, foreseeable evolution requirements
Development | Simple design; short increments; refactoring assumed inexpensive | Extensive design; longer increments; refactoring assumed expensive
Test | Executable test cases define requirements, testing | Documented test plans and procedures
PERSONNEL
Customers | Dedicated, collocated CRACK* performers | CRACK* performers, not always collocated
Developers | At least 30% full-time Cockburn Level 2 and 3 experts; no Level 1B or -1 personnel** | 50% Cockburn Level 3s early; 10% throughout; 30% Level 1Bs workable; no Level -1s**
Culture | Comfort and empowerment via many degrees of freedom (thriving on chaos) | Comfort and empowerment via framework of policies and procedures (thriving on order)

* Collaborative, Representative, Authorized, Committed, Knowledgeable
** See Table 8.4. These numbers will particularly vary with the complexity of the application

8.4.2 Pikkarainen and Huomo’s Agile Assessment Framework

Figure 8.3 Dimensions affecting method selection [15].

According to [16], agility means the ability of companies to respond to change and to balance organizational flexibility and stability. The assessment framework proposed by Pikkarainen et al. [16] is based on a broad understanding of the different practices, methods and tools that have been proven to increase agility in software development companies. The authors state that “The purpose of agile assessment is thus to integrate the software process assessment methods and the knowledge of agile practices together”. The aim of this framework is to answer the following questions:

1. How to evaluate the agility of the software product development?

2. How to tailor agile practices, methods and tools to fit the needsof the projects?

3. How to tailor agile practices, methods and tools to fit the needsof the organization?

The first version of the assessment focuses on the requirements and project management process areas. The main purpose here is to evaluate a selected group of the company’s software development projects through the following aspects:



1. CMMI⁴ process descriptions, which give a basic structure for analysis.

2. Agile principles, through which the processes, methods and techniques are analyzed.

3. Agile practices, methods and tools.

4. Boehm and Turner agile dimensions [14,15].

Thus, the authors define the goals of their agile assessment as:

• Analyze the agility of the evaluated project (e.g., using the Boehm and Turner agility dimensions).

• Analyze both the agile and the plan-driven practices, methods and tools that are currently used in an organization or in a project.

• Discover and evaluate the most suitable agile practices for the development needs of the organization.

• Evaluate the efficiency of the agile practices, methods and tools in use.

• Support the deployment of the agile practices, methods and tools at the organizational level.

The agile assessment process used by [16] is depicted in Figure 8.4. Its main steps are focus definition, agility evaluation, data collection planning, interviews, analysis, and workshops and learning phases.

Agility evaluation is divided into four phases. It starts with the context factor analysis, using Boehm and Turner’s model. The next phase is aimed at defining the agile principles, methods, practices and tools used in the evaluated projects. Phases three and four are alternatives; the selection depends on whether we want to find the best agile practices for plan-driven software development or to evaluate the efficiency of agile projects. The data collection planning is explained in Figure 8.5. We consider this last phase the most important one for assessing agility in SMEs, because it could be used to define a data structure capable of assessing agility in three different dimensions: the company (administration), the projects and the work teams.

Figure 8.4 An Agile assessment framework [16].

⁴ CMMI (Capability Maturity Model Integration) is a process improvement approach that provides organizations with the essential elements of effective processes, which will improve their performance. http://www.sei.cmu.edu/cmmi/

Figure 8.5 Data collection planning [16].



After the agile assessment, the agile practices and improvements are prioritized and further analyzed in the company’s internal meetings. The results should be validated by testing them in pilot projects and comparing the final outcomes.

8.5 WEAKNESSES AND STRENGTHS OF AGILE METHODOLOGIES

Petersen et al. [17] present a case study of software development with agile and iterative models. The main contribution of the study is to help managers decide whether to adopt agile methods by showing the problems that have to be addressed as well as the benefits that can be gained from ASDMs. The issues and advantages they formulate are listed as follows:

• Small projects allow requirements to be implemented and released in faster iterations, which reduces requirements volatility in projects.

• The waste of unused work (documented requirements, implemented components, etc.) is reduced through small iterations.

• Requirements in iterations are precise, and estimates are accurate due to the small scope.

• Small teams with people in different roles require only small amounts of documentation, because it is replaced by direct communication, which facilitates learning and understanding for everyone on the team.

• Frequent integration and deliveries to subsystem tests allow the team to receive early and frequent feedback on its work.

• Rework caused by faults is reduced, because testing priorities are made clearer by prioritized features and because testers and designers work closely together.

• Testers’ time is used more efficiently, because in small teams testing and design can be easily parallelized thanks to short lines of communication between designers/developers and testers (instant testing).

• Testing the latest product release makes problems and successes transparent (testing and integration per iteration) and thus generates strong incentives for designers/developers to deliver high quality.

Other issues can become barriers to agile processes. Boehm et al. [14] stated that, although ASDMs are a good solution to some old problems, they also have characteristics that make the process vulnerable to failure. For example, one of the biggest conflicts in using ASDMs lies in estimation, resource loading and slack calculations, because the level of uncertainty and ambiguity in the long-term estimates of an iterative process is higher with agile approaches. Based on [14], we consider the most important barriers to agile adoption in companies to be:

• Cost estimation: One of the main problems in ASDMs is cost estimation, because there are no long-term estimates.

• Business process conflicts: An often overlooked difference between agile and traditional engineering processes is the way everyday business is conducted. Estimation, resource loading, and slack calculations can vary significantly.

• Human resources policies and processes: Agile development team members often cross the boundaries of standard development position descriptions and might require significantly more skills and experience to perform adequately.

• Process standards assessed: Most agile methods do not support the degree of documentation and infrastructure required for lower-level certification; it could, in fact, make agile methods less effective.

• Resistance to change: The paradigm change is aimed at empowering individuals by supporting new processes, but many people are reluctant to change because of the fear of failure in a project.

• Customer empowerment: Agile methods require close relationships between the customer and the development team.

8.6 CHALLENGES ADOPTING AGILE METHODOLOGIES IN SMES

The primary challenge in adopting agile practices in large organizations is the integration of agile projects with existing processes [18]. A company may adopt ASDM for different reasons, such as increasing customer satisfaction, improving team performance or producing software faster. Therefore, before explaining these challenges, it is necessary to explain the role of the company in the ASDM implementation, taking into consideration the following topics proposed by [19]:

• The development team evolution inside the methodology: It is common to find concerns in the work team when it faces a change. The organization must understand this and take actions such as training, communication, and continued support to manage the change correctly. With this in mind, it is important to be aware of the time it takes to implement the new ASDM.

• Business addressing: It is important to keep top-level people involved in the agile process, making them aware that continuity is required for success in the ASDM implementation. The organization has the duty of leading the team toward incremental progress.

• Ensure project success: To succeed in implementing an ASDM, both managers and stakeholders need to be involved in the process. Both parties should understand the project requirements and try to use a common language to cover them.

• Managing the company expectations: The company can obtain a return on investment through partial deliveries to the client. To achieve this, it is necessary to prioritize the most important user stories, giving greater value to each one and minimizing the risk.
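The prioritization of user stories mentioned in the last item can be sketched as a simple ordering of the backlog. This is a minimal illustration under stated assumptions: the `UserStory` fields, the 1-to-10 scales and the rule "highest value first, lowest risk breaking ties" are hypothetical and are not prescribed by [19].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    title: str
    value: int  # assumed business value, 1 (low) to 10 (high)
    risk: int   # assumed delivery risk, 1 (low) to 10 (high)


def prioritize(stories):
    """Order stories by descending value; lower risk breaks ties."""
    return sorted(stories, key=lambda s: (-s.value, s.risk))


backlog = [
    UserStory("Export report", value=5, risk=2),
    UserStory("Customer login", value=9, risk=6),
    UserStory("Online payment", value=9, risk=3),
]
for story in prioritize(backlog):
    print(story.title)
# Online payment
# Customer login
# Export report
```

Each partial delivery would then take stories from the top of the ordered list; other weightings of value against risk are equally plausible.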



A group of students at Carleton University, Ottawa, Canada, identified the major changes needed to implement agile software development practices (Figure 8.6) and the main challenges that companies using traditional methodologies have to address, from a management point of view. Understanding these challenges and having a strategy to overcome them can make the adoption of agile methodologies easier for organizations. In software projects, people and processes are important. People can be trained, and processes can be built and improved. The use of agile software development methodologies is now rising among software professionals. Engineers’ experience in agile software construction reveals the changes and challenges involved in agile software projects. SMEs widely use the conventional way of developing software, in which requirements are fixed, documentation is written and there is a single delivery. However, following an agile software development methodology could allow software teams to develop quickly and react to the changes that arise throughout projects.

Figure 8.6 Changes required for adopting Agile methodologies in traditional organizations, and associated challenges/risks [20].



8.7 SUMMARY

The main aim of this chapter was to show the challenges that small companies could face when adopting agile methodologies. It is necessary to understand that the growth of the software market is the best opportunity for these companies to get involved in agile methodologies. Nowadays, SMEs across the software development market offer all kinds of products and solutions. For example, Colombia grew 48% in this field between 2000 and 2004 [9]. However, this kind of company cannot use conventional methodologies such as UP, RUP or CMMI, because they do not have clearly defined hierarchical structures, their development teams are small, and plan-driven methods require a lot of bureaucracy. The use of agile methods has helped to find practices that achieve better results than plan-driven methods for all kinds of projects. Therefore, it is important to find a way of assessing and assuring the success of being agile.

The agile culture is gaining popularity, becoming one of the most discussed topics in software development. This can be seen in the number of people who are adopting agile methods and the companies that are beginning to use them in their processes. Thus, it is necessary to consider all the challenges associated with implementing new processes in a company; for example, the issues related to integrating agile projects with an existing company environment cannot be avoided. Additionally, the profiles of the company and the project need to be assessed in order to identify the level of agility. According to that level, strategies for adopting agile methods should be designed to minimize the negative impact on the productivity of the team.



REFERENCES

[1] Acs, Z. J. & Preston, L. Small and Medium-Sized Enterprises,Technology, and Globalization: Introduction to a Special Issue onSmall and Medium-Sized Enterprises in the Global Economy. SmallBus.Econ., 9: 1997; 1–6.

[2] Schroder, H. H. & Kraaijenbrink, J. IN: KnowledgeIntegration-The Practice of Knowledge Management in Small andMedium Enterprises. Physica-Verlag HD: 2006.

[3] Consulting, M. Manpower, Education and Training Study ofthe Irish Software Sector. Report submitted to the Software Train-ing Advisory Committee and FAS: 1998.

[4] Richardson, I. & Avram, G. Having a Foot on Each Shore,Bridging Global Software Development in the Case of SMEs. 2008IEEE International Conference on Global Software Engineering:2008.

[5] Ireland, I. Key Industry Statistics: 2005.

[6] Dossani, R. Origins and Growth of the Software Industry in India.Asia-Pacific Research Center - Stanford University.

[7] Proexport. Sector de software y servicios TI.http://www.slideshare.net/inviertaencolombia/sector-servicios-de-tiproexport?src=related_normal&rel=1187105: 2010. Accessedand verified on May 3, 2011.

[8] Ministerio de Comercio, I. y. t. Programa MIDAS - De-sarrollando el Sector de TI Como uno de Clase Mundial. Informetécnico: 2008.

[9] Rodríguez, A. G. La relidad de la Pyme colombiana, Desafiopara el desarrollo. FUNDES Internacional: 2003.

[10] Sánchez, J. J. & Osorio, J. Algunas Aproximaciones Al Pro-blema De Financiamiento De Las Pymes En Colombia. Scientia etTechnica Ano XIII: 2007; 321–324.

[11] Ebrahim, N. A., Ahmed, S. & Taha, Z. Virtual R & D teams in small and medium enterprises: A literature review. Sci. Res. Essay: 2009; 1575–1590.

[12] Schatz, C. A Methodology for Production Development. Doctoral thesis, Norwegian University of Science and Technology: 2006.

[13] VersionOne. 5th Annual State of Agile Development Survey. Final summary report: 2010.

[14] Boehm, B. & Turner, R. Balancing Agility and Discipline - A Guide for the Perplexed. Addison-Wesley: 2004.

[15] Boehm, B. & Turner, R. Observations on Balancing Discipline and Agility: 2004.

[16] Pikkarainen, M. & Huomo, T. Agile Assessment Framework. Information Technology for European Advancement: 2005; 1–44.

[17] Petersen, K. & Wohlin, C. A Comparison of Issues and Advantages in Agile and Incremental Development between State of the Art and an Industrial Case. Journal of Systems and Software: 2009; 1–14.

[18] Lindvall, M., Muthig, D., Dagnino, A., Wallin, C., Stupperich, M., Kiefer, D., May, J. & Kahkonen, T. Agile software development in large organizations. Computer, 37(12): 2004; 26–34. ISSN 0018-9162. doi:10.1109/MC.2004.231.

[19] Mahanti, A. Challenges in Enterprise Adoption of Agile Methods. Journal of Computing and Information Technology, 3: 2006; 197–206.

[20] Misra, S. C., Kumar, U., Kumar, V. & Grant, G. The Organizational Changes Required and the Challenges Involved in Adopting Agile Methodologies in Traditional Software Development Organizations. IEEE Computer Society: 2006.


Model-driven Development and Model-driven Testing

Henry Roberto Umaña-Acosta
Miguel Cubides

ABSTRACT

Software modeling is useful in the specification and comprehension of software requirements. A UML specification, for example, helps developers understand the intended solution through static and dynamic views of the design. In this chapter we explain several modeling approaches used in software development and testing.

9.1 INTRODUCTION

ColSWE, the Software Engineering research group at Universidad Nacional de Colombia, has two main research areas in software modeling: development modeling and test modeling. Both are motivated by the need to support software comprehension.

In the software development area, we explain the advantages and disadvantages of modeling; in the software testing area, we focus on writing and generating tests automatically from models. The latter technique belongs to the wider area of Model-based Testing. Building models costs developers considerable time and effort, but it increases development productivity. This area is promising both in academic research and in industry.


9.1.1 Challenges of Software Development

Software development has, among others, two well-known difficulties: maintenance and evolution, and the comprehension and communication of the different artifacts that compose the software.

Software is intrinsically evolvable: over time it has to adapt to its environment, incorporating solutions for new requirements and for modifications of previous ones. These modifications have to be both studied and implemented: studied in terms of the direct and indirect impact that a modification would have on the system, and implemented by making all the changes that the impact analysis identifies.

On the other hand, developing high-quality software requires the development team to fully comprehend the specifications of the artifacts produced in the stages preceding the current one. For example, in the development phase, the artifacts from the requirements, specification and design phases should be perfectly understood. This is difficult because, among other reasons, different phases involve different people with distinct levels of expertise and experience, and even with diverse intellectual and cultural backgrounds.

9.1.2 Traditional Testing Process

In the life cycle of a software project, sooner or later the team needs to deal with tests. In the formal approach, someone, usually the analyst or the tester, designs the Test Cases based on the Use Cases or on the functional requirements. Then the tester executes each Test Case step by step, compares the result with the expected output and decides whether the Test Case was successful.

9.1.3 A Solution to Face the Problem

Popular wisdom says that “a picture is worth a thousand words”, and that explains why, for the comprehension, development and communication of the different software artifacts, graphical specifications have been developed, supported by specialized modeling tools. We have, then, diagrams that represent the software and what it should do, that support documentation and that facilitate communication between parties with different backgrounds and knowledge. These diagrams also offer two perspectives on the software: a static view and a dynamic one.

9.1.4 Model-based Testing

Model-based Testing (MBT), on the other hand, starts by defining the SUT (System Under Test), that is, the parts of the system that need to be tested. Then the team models this SUT with one of several approaches and generates the Test Cases from the model. The generated Test Cases are abstract and not directly executable: they need to be transformed into a test script in some language in order to be executed. Finally, the tester analyzes the result of each test (pass or fail) and reports it to the development team.
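The last steps of this process, turning an abstract test case into executable calls and assigning a verdict, can be sketched as follows. This is only a minimal illustration and not the output of any MBT tool: the `Counter` SUT, the action names `inc` and `dec`, and the `AbstractTestRunner` class are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal sketch: executing an abstract test case against a SUT through an adapter.
// Counter, the action names and the expected values are hypothetical.
final class AbstractTestRunner {

    // Hypothetical SUT: a counter that never goes below zero.
    static final class Counter {
        private int value;
        void increment() { value++; }
        void decrement() { if (value > 0) value--; }
        int value() { return value; }
    }

    // Adapter: maps each abstract action name to a concrete call on the SUT.
    private static final Map<String, Consumer<Counter>> ADAPTER = Map.of(
            "inc", Counter::increment,
            "dec", Counter::decrement);

    // Runs the abstract steps on a fresh SUT and compares the final value with the expected one.
    static String run(List<String> abstractSteps, int expectedValue) {
        Counter sut = new Counter();
        for (String step : abstractSteps) {
            Consumer<Counter> call = ADAPTER.get(step);
            if (call == null) {
                throw new IllegalArgumentException("Unknown abstract step: " + step);
            }
            call.accept(sut);
        }
        return sut.value() == expectedValue ? "pass" : "fail";
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("inc", "inc", "dec"), 1));
        System.out.println(run(List.of("dec"), -1));
    }
}
```

The adapter table is the part that a team normally writes by hand: it is what makes the abstract steps produced from the model concrete for one particular SUT.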

This chapter gives an introductory description of the proper use of Model-driven Development (MDD) during software development. It first discusses the reasons for using MDD in agile methodologies (modeling a specification does not slow software development down) and then describes model-driven development and its philosophy. Next, the Unified Modeling Language (UML) specification is described, which leads to Model-driven Architecture (MDA): its capabilities and potential, its use in the analysis and design phases of software, and the development facilities offered by some CASE tools. Finally, we describe a common error, namely assuming that MDA will by itself improve the quality of the developed software, and present some conclusions.

In this chapter we also review two works in the area of Model-based Testing, one of the research areas of ColSWE.

These works follow the structure and explanations given by Mark Utting and Bruno Legeard [1]. We cover two MBT techniques: one based on Finite State Machines (FSM) and the other based on Pre/Post notations, specifically OCL. In both cases, we use and evaluate tools: ModelJUnit for FSM and Qtronic (QML) from Conformiq for the Pre/Post notation.

9.2 WHY TO USE MODEL-DRIVEN ARCHITECTURE IN AGILE METHODOLOGIES

According to Jordi Cabot [2], there are two questions to consider: can agile methodologies benefit from modeling? And can modeling benefit from agile methodologies?

Regarding the first question, agile methodologies, in spite of favoring emergent design, still make use of modeling, supported by different specifications depending on the capacity and experience of the development team. One example is the agile modeling proposed by Scott Ambler [2, 3], who suggests model storming and iterative modeling only as needed, keeping a light but representative model of the system. The suggested models range from abstract representations of the system to representations of the tests to be run on the product.

The second question is approached from experience and is offered as a research topic: the proposal is to establish a methodology or specification in which the modeling process itself adopts the characteristics that are common in agile methodologies.

From this it can be observed that there is, first, the possibility of developing a light model to support development under an agile methodology and, second, the possibility of opening a line of research on adding new characteristics to existing modeling methodologies, which are currently oriented to classic development methodologies.

9.3 MODEL-DRIVEN DEVELOPMENT

9.3.1 Philosophy

A model is a generalized representation of the system. It is built using specific, specialized diagrams for each of the different parts of the product.


The use of MDD is suggested for three reasons:

Documentation and Comprehension of the System

Because a model is simple and gives a global view, it is easier to grasp the general idea conveyed by a diagram than the one conveyed by text. That is why diagrams are a great support for documentation (even though a diagram does not replace textual documentation), helping reviewers to better comprehend the characteristics of the system.

Internal Communication and Client Communication

Once the documentation is complete, it supports communication with the client, helping to elicit the initial requirements and to represent what the development team has understood about what the application should do.

Communication between the different internal divisions of the software team also becomes easier, since a diagram is expressed in a universal language that team members will understand easily regardless of their specialty.

Automatic Code Generation and Evolution-Related Facilities

With computer-aided software engineering (CASE) tools, code can be generated automatically from a model, which reduces the time that the different resources involved must dedicate to the development phase of the application.
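To make the idea of generating code from a model concrete, the following sketch emits Java source text from a very small class model (a class name plus a map of attribute names to types). It is a toy illustration under stated assumptions, not how any particular CASE tool works; `ClassGenerator`, `Account`, `balance` and `owner` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy code generator: turns a minimal class model into Java source text.
// The model format (class name + attribute name/type pairs) is an assumption made for this sketch.
final class ClassGenerator {

    // Emits one private field and one getter per attribute, in the order given by the map.
    static String generate(String className, Map<String, String> attributes) {
        StringBuilder src = new StringBuilder("public class " + className + " {\n");
        attributes.forEach((name, type) ->
                src.append("    private ").append(type).append(' ').append(name).append(";\n"));
        attributes.forEach((name, type) -> {
            String getter = "get" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            src.append("    public ").append(type).append(' ').append(getter)
               .append("() { return ").append(name).append("; }\n");
        });
        return src.append("}\n").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the attributes in declaration order.
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("balance", "int");
        attributes.put("owner", "String");
        System.out.print(generate("Account", attributes));
    }
}
```

When the model changes (for instance, a new attribute is added), regenerating the source keeps code and model aligned, which is the productivity gain the paragraph above refers to.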

These tools can also reverse engineer the code of an existing application, generating diagrams that model the system and allowing its documentation to be produced.

Another characteristic that helps to optimize model-driven development is the ability to locate the parts of the system affected by a change. This matters in the evolution process, where modifying a requirement or adding a new one changes the product in such a way that the change has to be sized through an impact analysis (even though low-coupling development is recommended, there are cases in which the impact is very high).
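One simple way to picture such an impact analysis is as a traversal of the dependency relations recorded in the model: starting from the changed element, collect everything that depends on it directly or indirectly. The sketch below assumes the model is available as a plain map from each element to its direct dependents; the element names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of impact analysis as a breadth-first traversal of model dependencies.
final class ImpactAnalysis {

    // dependents.get(x) lists the elements that depend directly on x.
    // Returns every element reachable from 'changed', i.e. the direct and indirect impact.
    static Set<String> impactedBy(String changed, Map<String, List<String>> dependents) {
        Set<String> impacted = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>(List.of(changed));
        while (!pending.isEmpty()) {
            String current = pending.poll();
            for (String dependent : dependents.getOrDefault(current, List.of())) {
                // Skip the changed element itself and anything already visited (handles cycles).
                if (!dependent.equals(changed) && impacted.add(dependent)) {
                    pending.add(dependent);
                }
            }
        }
        return impacted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependents = Map.of(
                "Account", List.of("AccountService"),
                "AccountService", List.of("AtmController", "ReportJob"));
        System.out.println(impactedBy("Account", dependents));
    }
}
```

In a loosely coupled design the returned set stays small; a large set is an early warning that the change will be expensive.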

There is a further advantage in combining impact analysis with automatic code generation: the coding required by evolution can itself be automated.

9.3.2 Unified Modeling Language

The OMG1 group worked on the specification of a language for creating a model of any system that could be universally understood. The result was the Unified Modeling Language (UML), which in version 2.0 specifies more than 10 diagram types, divided into dynamic and static diagrams [3].

This specification includes, among others, use case, class, package, sequence, communication, state, activity, component, deployment and object diagrams. With this group of diagrams a system can be represented from a general perspective down to the level of granularity required.

Use case diagrams represent the specification of system requirements. Class, package, state, component and deployment diagrams represent the system statically, while sequence, communication, activity and object diagrams represent it dynamically. Strictly speaking, in this context, dynamic refers to the way in which the system behaves under certain conditions; this is why, for example, the object diagram, which represents the system in a specific state, is counted as dynamic.

One of the great advantages of UML is that it can be extended through profiles, as Lidia Fuentes explains in her work [4]; profiles make it possible to produce a specification tailored to a specific need or to the development policies of a company.

1 Acronym for Object Management Group; more information at http://www.omg.org/


9.4 MODEL-DRIVEN ARCHITECTURE

Since UML was developed to standardize systems modeling through diagrams, an MDD methodology that builds on UML emerged naturally: Model-driven Architecture (MDA), developed by the OMG group. The difference between the two is that MDD is the general methodology, while MDA is the one that uses UML.

Regarding tools for automatic code generation from models, the more mature of the two methodologies is the one that uses the UML standard, that is, MDA, as Nikiforova shows in [5].

MDA proposes the creation of a meta model and a model [6–10]. The meta model is language-independent and gives an initial holistic view of the system, which provides a first approximation for communicating with the stakeholders. The model is a specification targeted at a particular programming language; it makes it possible to automate code generation tasks, using UML profiles for greater accuracy and greater independence between activities.

9.4.1 Software Engineering

Software creation implies an engineering process that, depending on the methodology adopted, comprises different phases.

Agile methodologies, despite being managed iteratively, contemplate planning or analysis, architecture and design, development, testing and review, and implementation phases. Depending on the specific methodology, these phases receive greater or lesser emphasis; XP, for example, contemplates an emergent design [11].

As explained above, software development is strengthened and facilitated by adopting MDD. In particular, thorough engineering work in the analysis and design phases makes it possible to better comprehend, in advance, the product to be created.


The Analysis and Design Phase

Both phases focus on comprehending the application to be developed, creating different artifacts that capture the characteristics of the system: what the product has to do and its limits, how it should be developed, and which characteristics will be implemented.

It is in the design phase that models with different purposes are created to analyze how the product will be built, which modules it will contain and how those modules will communicate with each other. This is the phase in which models representing the system are used most heavily, and where the UML standard is recommended because it eases implementation and development (for example, through automatic code generation).

MDA for Software Engineering

Since the UML standard provides different diagrams that can represent a system from different perspectives [3, 12], it can be used during design to obtain a more complete abstraction easily, which helps to comprehend, in the earlier production phases, what the system has to do and how to do it.

The more perspectives a system description covers, the more complete it is. For example, a system should be analyzed both from the user’s point of view (what it should do) and from the performance point of view (how it should do it). For this purpose, use case and state diagrams can be used to see what the system must do, and object, activity and sequence diagrams to see how it must do it. The former show the system from a static point of view, while the latter give a dynamic view.

9.5 WHAT THE MODEL-DRIVEN ARCHITECTURE DOES NOT DO

A typical error made by software development companies that adopt model-driven design, as Bell explains [13], is to believe that creating an initial model of the system will by itself optimize their development practices. It should be clearly understood that model-driven design is just a tool that, when used well, yields great advantages in software development.

9.6 NOTATIONS FOR MODELING TESTS

In order to build the model, Utting and Legeard [1] give some recommendations:

• Choose only the classes related to the SUT

• Include only the methods to be tested

• Include only the class attributes needed to reflect the behavior of the methods

With this scope in mind, you may choose among several notations for your model. We explain those that we studied in our research.

9.6.1 Transitions-based Modeling

We model the behavior of the system as transitions between states triggered by events. The model is usually represented by a Finite State Machine (Figure 9.1).
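A transition-based model can be written down directly as a transition table. The sketch below encodes a small, hypothetical door FSM (it is not the machine shown in the figure): states and events are enums, and `fire` looks up the next state for the current state and the incoming event.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch of a transition-based model: a hypothetical door with three states.
final class DoorFsm {

    enum State { CLOSED, OPENED, LOCKED }
    enum Event { OPEN, CLOSE, LOCK, UNLOCK }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);
    static {
        TRANSITIONS.put(State.CLOSED, Map.of(Event.OPEN, State.OPENED, Event.LOCK, State.LOCKED));
        TRANSITIONS.put(State.OPENED, Map.of(Event.CLOSE, State.CLOSED));
        TRANSITIONS.put(State.LOCKED, Map.of(Event.UNLOCK, State.CLOSED));
    }

    private State state = State.CLOSED;

    State state() {
        return state;
    }

    // Fires an event; returns false and leaves the state unchanged if no transition is defined.
    boolean fire(Event event) {
        State next = TRANSITIONS.get(state).get(event);
        if (next == null) {
            return false;
        }
        state = next;
        return true;
    }

    public static void main(String[] args) {
        DoorFsm door = new DoorFsm();
        System.out.println(door.fire(Event.LOCK) + " -> " + door.state());
        System.out.println(door.fire(Event.OPEN) + " -> " + door.state());
    }
}
```

Every path through such a table is a candidate test sequence, which is what makes this notation convenient for test generation.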

9.6.2 Pre/Post Modeling

Another way to represent a system is through a set of variables and their values at a specific point in time. We model this in the specification of the use cases with preconditions and post-conditions. We can also be more formal and write the specification in the Z language, Spec# or OCL (Object Constraint Language). Preconditions specify the conditions that must be true before an operation is executed. As an example [14], consider these instructions in OCL:

context Player::calculateFinalScore() : Integer
pre: self.isComplete = true

The precondition of the operation “calculateFinalScore()” states that the player has completed the game.

Figure 9.1 Representing states and state transitions using a state diagram.

Post-conditions specify the conditions that must be true after the operations have been executed. From the same authors, another specification in OCL:

context GameEvent::processPlayerChoices() : Integer
post: result = 0

According to this example, the post-condition of the operation “processPlayerChoices()” states that the player has no choices left to play.
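Preconditions and post-conditions of this kind can be checked at run time in the implementation. The sketch below mirrors the precondition on `calculateFinalScore()` quoted above; everything else (the `points` field, the helper methods and the post-condition `result >= 0`) is a hypothetical addition made only so that the example is complete.

```java
// Sketch: enforcing a precondition and a post-condition at run time.
final class ContractDemo {

    static final class Player {
        private boolean complete;   // corresponds to self.isComplete in the OCL example
        private int points;         // hypothetical attribute

        void addPoints(int amount) { points += amount; }
        void finishGame() { complete = true; }

        // pre:  self.isComplete = true   (from the OCL example)
        // post: result >= 0              (assumed here for illustration)
        int calculateFinalScore() {
            if (!complete) {
                throw new IllegalStateException("Precondition violated: the game is not complete");
            }
            int result = Math.max(points, 0);
            if (result < 0) {
                throw new AssertionError("Post-condition violated: negative score");
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Player player = new Player();
        player.addPoints(7);
        try {
            player.calculateFinalScore();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        player.finishGame();
        System.out.println(player.calculateFinalScore());
    }
}
```

A Pre/Post model used for test generation plays the same role as these checks, but it lives outside the code, so a tool can derive inputs that satisfy or violate each condition.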

9.7 TESTING FROM FINITE STATE MACHINES

The next sections present the work developed by Miguel Cubides, researcher at ColSWE, and submitted as a requirement for the undergraduate degree in Systems Engineering in 2009 [15].

9.7.1 FSM and ModelJUnit

ModelJUnit [16] is a plug-in that extends the functionality of JUnit. It takes a Finite State Machine as its model. In addition, ModelJUnit reports several metrics about the testing process, as presented by Utting [16].


9.7.2 Case Study

We developed a small prototype to show the functionality of ModelJUnit. The case chosen is a very basic ATM with two operations, credit and debit, and a few restrictions. We start by modeling the system with a finite state machine (Figure 9.2).

Figure 9.2 State diagram for the case study [15].

Next, we code this model using the classes offered by ModelJUnit.

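// Package names as used by ModelJUnit 2.x; older releases used a different package.
import nz.ac.waikato.modeljunit.Action;
import nz.ac.waikato.modeljunit.FsmModel;

// FSM model of the ATM. Each @Action method is a transition; the method named
// <action>Guard() tells ModelJUnit in which states that transition is enabled.
public class PruebasModel implements FsmModel {

    public enum Estados { VERIFICAR_DATOS, MOVIMIENTO, CANCELAR, TERMINAR }

    private Estados estado;
    private ControlCajeroTest test;   // adapter that exercises the real ATM controller

    public PruebasModel() {
        test = new ControlCajeroTest();
        estado = Estados.VERIFICAR_DATOS;
    }

    public String getState() {
        return String.valueOf(estado);
    }

    public void reset(boolean testing) {
        test.reset();
        estado = Estados.VERIFICAR_DATOS;
    }

    public boolean verificarDatosCorrectosGuard() {
        return estado == Estados.VERIFICAR_DATOS;
    }

    @Action
    public void verificarDatosCorrectos() throws Exception {
        test.testVerificarDatosCorrectos();
        estado = Estados.MOVIMIENTO;
    }

    public boolean verificarDatosIncorrectosGuard() {
        return estado == Estados.VERIFICAR_DATOS;
    }

    @Action
    public void verificarDatosIncorrectos() throws Exception {
        test.testVerificarDatosIncorrectos();
        estado = Estados.VERIFICAR_DATOS;
    }

    public boolean debitarGuard() {
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void debitar() throws Exception {
        test.testDebitar();
        // The state update is missing in the source listing.
    }

    public boolean cancelarGuard() {
        // Guard body missing in the source listing; assumed to mirror debitarGuard().
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void cancelar() throws Exception {
        test.testCancelar();
        // The state update is missing in the source listing.
    }

    public boolean salirGuard() {
        // Guard body missing in the source listing; assumed to mirror debitarGuard().
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void salir() throws Exception {
        test.testCancelar();
        // The state update is missing in the source listing.
    }
}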


ModelJUnit has a graphical tool that shows the FSM embedded in the code. Figure 9.3 depicts the FSM of the code for the case study.

Figure 9.3 State diagram for case study [15].

The tool also reports some metrics about the tests:

• Number of tests: 10

– State coverage = 2/3

– Transition coverage = 4/6

– Transition pair coverage = 6/16

– Action coverage = 4/6

The main advantage of using the tool is obtaining the possible sequences of events in the system through the implementation of the Finite State Machine. In other words, we obtain the Test Cases by running the code.
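To show where figures such as state coverage and transition coverage come from, the following sketch performs a random walk over a transition table and counts the distinct states and transitions it touches. It uses only the Java standard library and does not call ModelJUnit; the three-state, six-transition machine is a hypothetical stand-in and is not claimed to be the FSM of the case study.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

// Sketch: random-walk test generation over an FSM with state and transition coverage.
final class RandomWalkCoverage {

    // Hypothetical FSM: state -> (action -> next state). TreeMap keeps iteration order stable.
    static final Map<String, Map<String, String>> FSM = new TreeMap<>();
    static {
        FSM.put("VERIFY", new TreeMap<>(Map.of("ok", "MOVE", "bad", "VERIFY")));
        FSM.put("MOVE", new TreeMap<>(Map.of("debit", "MOVE", "credit", "MOVE", "cancel", "END")));
        FSM.put("END", new TreeMap<>(Map.of("restart", "VERIFY")));
    }

    // Walks 'steps' random transitions from VERIFY; returns {visited states, fired transitions}.
    static int[] walk(int steps, long seed) {
        Random random = new Random(seed);
        Set<String> states = new HashSet<>();
        Set<String> transitions = new HashSet<>();
        String current = "VERIFY";
        states.add(current);
        for (int i = 0; i < steps; i++) {
            List<String> actions = new ArrayList<>(FSM.get(current).keySet());
            String action = actions.get(random.nextInt(actions.size()));
            transitions.add(current + "--" + action);
            current = FSM.get(current).get(action);
            states.add(current);
        }
        return new int[] { states.size(), transitions.size() };
    }

    static int totalTransitions() {
        return FSM.values().stream().mapToInt(Map::size).sum();
    }

    public static void main(String[] args) {
        int[] covered = walk(10, 42L);
        System.out.println("State coverage = " + covered[0] + "/" + FSM.size());
        System.out.println("Transition coverage = " + covered[1] + "/" + totalTransitions());
    }
}
```

Short walks typically leave some transitions untouched, which is why coverage is reported as a fraction; longer walks or a different traversal strategy raise it.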

9.8 TESTING FROM PRE/POST MODELS

The next sections present the work developed by Luis Alberto Bonilla [17], researcher at ColSWE, and submitted as a requirement for the undergraduate degree in Systems Engineering in 2010.

9.8.1 Object Constraint Language (OCL)

The OCL notation is used to make modeling with UML diagrams more precise. For example, OCL allows us to specify the preconditions and post-conditions of the operations of a class. Each OCL expression has a context, usually the class or method to which it belongs. Table 9.1 lists some important OCL constructs.


Table 9.1 Main OCL constructs [1].

OCL construct
context class inv: predicate
context class def: name: type = expr
context class::attribute init: expr
context class::method pre: predicate
context class::method post: predicate
context class::method body: expr

9.8.2 Case Study

The system to be modeled is the triangle classifier from Myers’s book (2004) [18]:

The program reads three integer values from an input dialog. The three values represent the lengths of the sides of a triangle. The program displays a message that states whether the triangle is scalene, isosceles, or equilateral.

The context is the Triangle with its three values (the lengths of the sides) as input and a text message with the classification as output.

context Triangle::kindOfTriangle(a:int, b:int, c:int) : String

The precondition requires the three integers to be positive:

pre: a>0 and b>0 and c>0

The post-conditions check that the three integers form a triangle, that is, that a + b > c holds for every combination of sides. They also contain the rules that classify the triangle as equilateral, isosceles or scalene:

post: if (a + b <= c or a + c <= b or b + c <= a) then
        result = 'notriangle'
      else
        if (a = b or b = c or a = c) then
          if (a = b and b = c) then
            result = 'equilateral'
          else
            result = 'isosceles'
          endif
        else
          result = 'scalene'
        endif
      endif


In this assessment of OCL as a modeling notation for generating test cases, Bonilla uses QML from Qtronic [19], an OCL-like notation that allows test cases to be generated from the specification, as shown in Table 9.2.

Table 9.2 Test cases generated. Bonilla [17].

Test  Input port/value   Output port/value
1     in/(−1, 0, 0)      out/badside
2     in/(0, 0, 0)       out/badside
3     in/(1, −1, 0)      out/badside
4     in/(2, 0, 0)       out/badside
5     in/(1, 1, −1)      out/badside
6     in/(1, 2, 0)       out/badside
7     in/(1, 1, 2)       out/notriangle
8     in/(1, 1, 9)       out/notriangle
9     in/(1, 2, 1)       out/notriangle
10    in/(1, 9, 1)       out/notriangle
11    in/(2, 1, 1)       out/notriangle
12    in/(9, 1, 1)       out/notriangle
13    in/(1, 1, 1)       out/equilateral
14    in/(5, 5, 9)       out/isosceles
15    in/(9, 1, 9)       out/isosceles
16    in/(9, 9, 1)       out/isosceles
17    in/(1, 9, 9)       out/isosceles
18    in/(3, 9, 7)       out/scalene
19    in/(9, 3, 7)       out/scalene

The main advantage of this technique is its coverage. In fact, the algorithm behind the tool is based on Decision Coverage, also known as branch coverage.
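The generated test cases can be read as a table-driven test for any implementation of the specification. The sketch below is one possible Java implementation written for illustration (it is not Bonilla's code); it returns `badside` when the precondition is violated, matching the verdicts listed in the table of generated test cases, and checks a few of those rows. Each `if` corresponds to one decision of the specification, so a suite with decision coverage must make every condition both true and false at least once.

```java
// Sketch: a triangle classifier following the pre/post specification, checked against sample rows.
final class TriangleClassifier {

    static String kindOfTriangle(int a, int b, int c) {
        // Precondition of the specification; "badside" is the verdict used in the generated tests.
        if (a <= 0 || b <= 0 || c <= 0) {
            return "badside";
        }
        // Triangle inequality; long arithmetic avoids overflow for very large sides.
        if ((long) a + b <= c || (long) a + c <= b || (long) b + c <= a) {
            return "notriangle";
        }
        if (a == b && b == c) {
            return "equilateral";
        }
        if (a == b || b == c || a == c) {
            return "isosceles";
        }
        return "scalene";
    }

    public static void main(String[] args) {
        // A few rows taken from the table of generated test cases.
        int[][] inputs = { { -1, 0, 0 }, { 1, 1, 2 }, { 1, 1, 1 }, { 5, 5, 9 }, { 3, 9, 7 } };
        String[] expected = { "badside", "notriangle", "equilateral", "isosceles", "scalene" };
        int failures = 0;
        for (int i = 0; i < inputs.length; i++) {
            String actual = kindOfTriangle(inputs[i][0], inputs[i][1], inputs[i][2]);
            if (!actual.equals(expected[i])) {
                failures++;
            }
        }
        System.out.println("failures: " + failures);
    }
}
```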


We are currently conducting research in another area related to MBT: how can we derive test cases automatically from models that represent GUIs? In the medium term, we plan to address another challenge: turning the abstract tests generated from the model into concrete, executable tests.

On the other hand, in the software modeling area, our research includes two sub-areas: MDD for mobile applications, and the optimization of data transfer by applying MDA, focused on the data access object in a multilayer architecture.

9.10 SUMMARY

Model-based software development has three main advantages.

1. It is an excellent support for documenting the different artifacts of the development process.

2. It allows the people involved in development to easily comprehend what the software is and what it is designed for.

3. It makes development more agile through automatic code generation tools, which also ease evolution tasks.

Adopting a model-driven software development methodology enables the analysis of a project from both dynamic and static standpoints.

One of the most advanced methodological lines in MDD is the one offered by the OMG group: MDA, which, by building on UML, applies MDD with a standardization that makes results easy to comprehend and evaluate across different parties.


REFERENCES

[1] Utting, M. & Legeard, B. Practical model-based testing: a tools approach. Morgan Kaufmann: 2007. ISBN 9780123725011.

[2] Cabot, J. Agile and Modeling / MDE: friends or foes? | MOdeling LAnguages. http://modeling-languages.com/blog/content/agile-and-modeling-mde-friends-or-foes.

[3] Ambler, S. The Elements of UML 2.0 Style. Cambridge University Press: 2005.

[4] Fuentes, L. & Vallecillo, A. Una Introduccion a los Perfiles UML. Novática, 168: 2004; 6–11.

[5] Nikiforova, O., Cernickins, A. & Pavlova, N. Discussing the Difference between Model Driven Architecture and Model Driven Development in the Context of Supporting Tools. Fourth International Conference on Software Engineering Advances, 1: 2009; 446–451.

[6] Belaunde, M., Burt, C. & Casanave, C. MDA Guide Version 1.0.1. OMG: 2003.

[7] Caramazana, A. Tecnologias MDA para el desarrollo de software. In: I Jornada Academica de Investigacion en Ingenieria Informatica: 2004.

[8] Franky, M. C. MDA: Arquitectura Dirigida por Modelos: 2010.

[9] Schmidt, D. C. Model-Driven Engineering. IEEE Computer, 39: 2006; 25–31.

[10] Schmidt, D. C. Guest Editor’s Introduction to Model-Driven Engineering. Computer, 39: 2006; 25–31. ISSN 0018-9162. doi:http://doi.ieeecomputersociety.org/10.1109/MC.2006.58.

[11] XP Design and Documentation | xProgramming.com. http://xprogramming.com/articles/ferlazzo/.


[12] Favre, L. Formalizing MDA-Based Reverse Engineering Processes. In: Australian Software Engineering Conference: 2008, 153–160. doi:10.1109/SERA.2008.21.

[13] Bell, A. E. Death by UML fever. DSPs, 2: 2004; 11–23.

[14] Eriksson, H., Penker, M., Lyons, B. & Fado, D. UML 2 Toolkit. John Wiley & Sons: 2004.

[15] Umana, H. & Cubides, M. Pruebas basadas en maquinas de estado Finitas (FSM). Revista Tendencias en Ingenieria de Software e Inteligencia Artificial, 4: 2009.

[16] The ModelJUnit test generation tool. http://www.cs.waikato.ac.nz/~marku/mbt/modeljunit/.

[17] Bonilla, L. Como escribir modelos Pre/Post adecuados para la automatizacion de pruebas. Paper presented as part of the undergraduate degree project in Systems Engineering: 2010.

[18] Myers, G. J. The Art of Software Testing. John Wiley & Sons, 1st edition: 1979. ISBN 0471043281.

[19] Automated Test Design | Model-Based Testing (Conformiq). http://www.conformiq.com/.


Subject Index

AAbstract Syntax Tree (AST); 17,

35, 133-135, 137Abstract System Dependence

Graph (ASDG); 90, 94,95, 96

Agileassessment framework; 215,

216, 217, 218, 219manifesto; 173, 181, 182methodologies;

evolution; 164history; 164, 178in traditional organizations,

and associated chal-lenges/risks; 189, 220

overview; 183methods required for adopt-

ing; 190-194, 210-214,217, 220-223

practices; 211, 216, 221principles; 173, 175, 192, 194,

217principles vs. software fea-

tures; 83roles; 209

UP, Basic/Open; 188Agility; 212-218, 223

and the Agile Manifesto; 173,177, 179

and software evolution; 161,164

assessment models; 204, 211evolutionary software pro-

cesses; 177, 181, 190, 201Architecture; 37, 44, 68, 74, 84,

86, 98, 107, 111, 133,140, 151, 153, 163, 167-172, 189, 190, 192, 214,229, 230, 234, 242

Description Language (ADL);73, 74

Driven Modernization (ADM);151

Aspect Oriented Programming(AOP); 72, 73

Automatic categorization; 106,113, 117, 118

BBusiness rules; 17, 18, 130, 136,

141, 147, 152, 162, 163

246

SUBJECT INDEX

Extraction; 134, 135, 145, 153knowledge through abstrac-

tion levels; 88

CCentralized version control systems

(CVCSs); 44, 45Changes; 7, 15-19, 22, 32, 35, 39,

44, 58, 61, 82Implementation;Managing; 29, 31

Clustering; 35, 43, 115, 118, 138-140

CodeCity metaphor; 112Colombia, software development

in; 203, 204earnings generated; 204

Communications; 42, 79, 215logs; 32, 33,modes and effectiveness

Concept location; vii, 19, 26, 83-85, 88-92, 97, 98, 111,114, 120

Control-flow diagrams; 66, 134Control structure diagrams; 65, 66,

134Corpus creation process; 11, 12Crystal Family; 180, 181

DData mining algorithm; 40Distributed Version Control Sys-

tems (DVCSs); 44, 45, 49Domain Specific Languages

(DSL); 73

Documentation; 1, 7, 15-23, 29-34, 47, 87, 90, 93, 106,113, 117, 118, 129, 132,141, 149, 150, 156, 172,174, 209, 213, 219, 220,222, 241, 231

Summarizing; 7

Entropy; 190Binary; 11

Evolution; 17, 57, 61, 83, 105,110, 113, 127, 161, 190,182

Evolutionary Development Model(EVO); 165, 167, 178,182

Extreme programming (XP); 173,179, 181, 183, 210, 233

FFDD lifecycle; 185Feature location; 87Financial Management System

(SGF); 141, 143, 144, 146Formal Concept Analysis (FCA);

13, 94-96, 114Fractal representation; 68

HHistorical repositories; 18, 32Hub organizational structure; 193

IImpact analysis; 18, 19, 67, 82, 83,

85, 86, 88, 96-99, 134,149, 228, 232

247


Incremental change; 81-89, 98,105, 106, 110, 120,161,183

Activities; 84Incremental development; 164-

166, 184Information retrieval (IR); 2, 3, 4,

6, 18, 23, 35, 90, 94, 105-108, 113, 114, 115, 120

and software evolution activi-ties; 113

models; 93techniques; 90, 93-96, 105

Internet Relay Chat (IRC); 29, 31,33

IR-based techniques; 93-96IR system, general architecture;

107Iterative and incremental develop-

ment; 164-166, 184

JJackson diagrams; 65, 66

KKanban board; 187, 188

LLanguage(s); 2-10, 17, 18, 21-23,

36, 59, 62, 71-74, 85, 88-91, 112, 114, 135, 139,142, 180, 194, 221, 229,231-235, 239

C++; 72Groovy; 73, 74JavaScript; 73Objective-C; 73

PHP; 73Python; 73Java; 72

Latent; 11, 90, 92, 107-09Dirichlet Allocation (LDA);

108, 109, 117-119Semantic Indexing (LSI); 11,

13, 90-96, 108, 109, 114-119

Law of program evolution; 30

MMaintenance; 17, 83, 130MECCA approach; 152Mining Software Repositories

(MSR); 29-36, 48, 59,110, 112-14

Modeling language; 62,180, 232Model(s); viii, 1, 6, 10-13, 16, 33,

39-42, 45, 59, 69, 73, 74,81, 82, 90, 91, 105-111,114-120, 132, 135, 151,161, 162, 165, 168, 169,171, 172, 174, 175, 178,179, 182, 184, 185, 201,217, 230-237, 241, 242

architecture in agile method-ologies;

based testing; ix, 227, 229model-driven; 184, 227, 230,

233architecture; 230, 234development; 227, 230testing; 227

notations for modeling tests;235

248

SUBJECT INDEX

probabilistic; 108, 116, 119staged; 171testing from finite state ma-

chines; 236testing from pre/post models;

239trends and challenges; 2, 21,

29, 44, 151, 193, 241Web; 108, 109

Modularization; 140process based on cluster-

ing/searching algorithms;140

Quality (MQ); 140Mozilla Firefox; 68Multi-Element Component Com-

parison and Analysis(MECCA); 150-152

NNatural language; 2-7, 21, 22, 85,

88, 89, 163analysis; 3processing (NLP); 2, 18, 23concepts and techniques; 2summarization; 2, 4, 7

O
Object-oriented; 38, 65, 77, 97, 138, 143
    software (OOS); 143
OCL, main constructs; 229, 235, 236, 239, 240, 241
OpenUP layers; 190
Operating environment; vii
Oracle; 129
    forms; 129, 141, 142, 144
        6i technology; 142
        business application; 141
        reverse engineering tool; 144, 150, 152
    PL/SQL (Procedural Language/Structured Query Language); 139

P
Pixel-maps; 63, 66, 67, 69, 75
Process(es);
    Enterprise Unified Process (EUP); 170, 182
    of mining software repositories; 34, 112, 114
    Rational Objectory Process; 168, 182
    Rational Unified Process (RUP); 19, 168, 169, 170, 172, 174, 182, 188, 189, 223
    summarization; 2, 10, 11, 15
    traditional testing; 228
Production environment; 170, 171, 172
Program(s); 30, 116
    Instructions; 8
    understanding; 19, 31, 37

Q
Quality of Software Systems; 31, 38



R
Refactoring; 83, 86, 87, 110, 111, 113, 120, 183, 189, 215
Reverse engineering (RE); 19, 20, 110, 112, 127-133, 137, 140, 142-144, 147-153
    assessment; 147
    concepts and relationships; 129, 133
    considerations for applying; 142
    in procedural software evolution; 127
    process; 131, 132
    techniques; 133
    trends and challenges; 21, 154
RUP (Rational Unified Process) model; 19, 168, 169, 170, 172, 174, 182, 188, 189, 223

S
Scenario-based Probabilistic Ranking (SPR); 92-96, 114
SCRUM; 173, 179, 182, 184, 188, 192, 210
SeeSoft tool; 67
Semantics of Business Vocabulary and Business Rules (SBVR); 136
Simple staged model; 175
Software;
    artifacts; 1, 6, 16, 106, 113, 118, 120, 134, 163, 228
        bug reports; 1, 7, 8, 16-18, 21, 33, 35, 38, 39, 41, 47
        comprehension; 4, 19, 87, 111, 129
        requirements specification; 1, 6, 88
        technical designs; 1, 21
        test cases; 20, 91, 119, 215, 228, 229, 241
        trends and challenges; 21
        use cases documents; 1
    code; 8, 34, 117
    communications logs; 32, 33
    comprehension; 19, 87, 111, 129
    configuration management system (SCM); 40
    costs; 7, 17, 40
    development; 82, 177, 201, 204, 228
        in Colombia; 204
    engineering; 233
    evolution; 17, 83, 105, 110, 113, 127, 161, 190
    historical repositories; 32
    life cycle; 1, 17, 31, 49, 82, 84, 171, 172
    maintenance; 1, 2, 14, 20, 37, 75, 81-83, 130, 132, 148, 151
    production process; 41
    properties; 39
    reflection model; 12
    repositories; 29, 31, 32, 34, 36, 112, 114
    summarization; 3, 5, 118
    trends and challenges; 24, 154, 193, 241


    types; vii, 16, 17, 21, 38
    understanding; 127, 129
    visualization; 53, 57-59, 62, 64, 72, 112
    Word Usage Model (SWUM); 10, 11
Software Development Agility in Small and Medium Enterprises (SMEs); 201, 205, 206, 221, 245
    assessment models; 204, 211
    definition; 207
    weaknesses and strengths; 219, 228
Source code; 8, 34, 89, 94, 117
Spiral; 110, 168, 169
    model; 168, 169
State diagram; 236, 237, 239
Structograms; 65, 66
    conditional; 65
    loop; 66
    sequence; 66
Summary Generation process; 11, 15

T
Theory of complex networks; 42
Tools; 35, 36, 48, 59, 148
    assessment; 148
    visualization; viii, 58, 59, 66, 72, 74, 75
Traceability recovery; 111, 113, 119

U
Unified Modeling Language (UML); 14, 17, 18, 62, 71, 119, 128, 180, 184, 227, 229, 232
    sequence diagram; 14
Unified Process (UP); 19, 168-171, 173, 174, 188, 223
    elements; 173
    family; 168
    model; 171, 174

V
Vector Space Model (VSM); 6, 11, 12, 90, 91, 94-96, 108, 109, 115-119
Virtual environments; 72, 75
Visualization; 57-59, 62, 112
    pipeline; 59-61
    process; 72
    steps; 60

W
Web mining; 92, 139

X
XML (eXtensible Markup Language); 47


Name Index

A
Acs, Z. J.; 202
Ambler, S.; 170, 188, 230
Amor, J.; 40
Anvik, J.; 40
Aponte, J.; 1, 29, 81, 127, 161, 201
Asif, N.; 131

B
Bachmann, A.; 47
Basili, V.; 178
Baxter, I.; 136
Bayer, S.; 181
Beck, K.; 181
Bennett, K.; 82, 84
Bernstein, A.; 47
Boehm, B.; 162, 168, 178, 212, 213, 217, 220
Bonilla, L. A.; 215, 241
Booch, G.; 62
Brooks, F.; 162, 178

C
Cabot, J.; 230
Canfora, G.; 37
Chikofsky, E. J.; 130
Coad, P.; 180
Cockburn, A.; 180, 191, 215
Crowston, K.; 42
Cubides, M.; 227, 236
Cunningham, W.; 181

D
de Luca, J.; 180

G
Gilb, T.; 162, 165, 178
Gonzalez, J. M.; 47
Gousios, G.; 40
Graçanin, D.; 63
Guo, P.; 39

H
Hassan, A. E.; 37, 39, 47, 48
Herraiz, I.; 47
Highsmith, J.; 181, 192
Huang, S. K.; 42


J
Jacobson, I.; 62
Jeffries, R.; 181

K
Kanban; 182, 185, 187, 188, 211
Khan, M.; 131
Knab, P.; 38

L
Lantz, K.; 178
Larman, C.; 178
Legeard, B.; 211, 235
Lehman, M.; 30, 82, 164
Lopez-Fernandez, L.; 42

M
Martin, J.; 179
Mens, T.; 58
Montaño, D.; 57
Moreno, L.; 1
Morisaki, S.; 38
Myers, G. J.; 240

N
Nikiforova, O.; 233
Niño, L. F.; ix
Niño, Y.; 29, 41

O
Ohira, M.; 42
Orso, A.; 97

P
Panjer, L.; 38
Petersen, K.; 219
Pikkarainen, M.; 215
Poppendieck, M. & T.; 186
Putrycz, E.; 145

R
Rajlich, V.; 82, 84, 87, 110, 172
Robles, G.; 47
Royce, W.; 162
Rumbaugh, J.; 62

S
Sayyad, J.; 37
Schwaber, K.; 179
Scrumban; 182, 185, 188
Shultz, S.; 178
Stasko, J. T.; 69
Storey, M. A.; 130, 149
Sutherland, J.; 179

T
Tonella, P.; 153
Turner, A.; 178, 217

U
Utting, M.; 229, 235, 236

V
Von Neumann, J.; 65

Y
Ying, A. T. T.; 40
Yu, L.; 43
Yuan, L.; 43



Z
Zhao, W.; 90
Zimmermann, T.; 38
