Toward Mining “Concept Keywords” from Identifiers in Large Software Projects Masaru Ohba and...

Toward Mining “Concept Keywords” from Identifiers in Large Software Projects

Masaru Ohba and Katsuhiko Gondow

Tokyo Institute of Technology

What are “concept keywords”?

• Most programmers try to name identifiers meaningfully.
• Concept keywords are defined terms that describe key concepts to aid in program understanding.
– e.g. read_dirent(): dirent is a concept keyword.

Human-selected concept keywords and other category words in udos

Concept keywords: dirent, root, PTE, tss, path, signal, yield

Grouping words: kbd_, vga_, FAT12_, sys_, H, t

Attributes, less important concepts: busy, byte, offset, name, memory, end, int8, again

Generic verbs: read, set, is, move, wait, print, dump, make, init

Suggestion

• We should use more “concept keywords” in program understanding tools.
– concept keywords are concise and descriptive
• Our solution:
– provides a way to mine concept keywords.
  • ckTF/IDF methods / Identifier Exploratory Framework
– could be used to build tools that support and utilize extracted concept keywords (future work).
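As a rough illustration of what mining keywords from identifiers can involve, the sketch below splits identifiers into words and ranks them per file with plain TF-IDF. This is the textbook TF-IDF weighting, not the authors' ckTF/IDF method, whose details are not given on these slides. The function names and the toy file contents are assumptions made for the example.

```python
import math
import re
from collections import Counter

def split_identifier(identifier):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in identifier.split("_"):
        # Acronyms, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

def tf_idf(files):
    """files maps a file name to its identifiers (hypothetical input shape).
    Returns a dict mapping each file name to {word: score}."""
    counts = {
        name: Counter(w for ident in idents for w in split_identifier(ident))
        for name, idents in files.items()
    }
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    scores = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values()) or 1
        scores[name] = {
            word: (n / total) * math.log(len(counts) / doc_freq[word])
            for word, n in word_counts.items()
        }
    return scores

# Toy input; the identifiers are invented for illustration.
files = {
    "fat12.c": ["read_dirent", "write_dirent", "read_root"],
    "task.c": ["sys_signal", "sys_kill", "read_signal"],
    "kbd.c": ["kbd_read", "kbd_init"],
}
scores = tf_idf(files)
top = max(scores["fat12.c"], key=scores["fat12.c"].get)
print(top)
```

In this toy input the generic verb read occurs in every file and therefore scores zero, while dirent ranks first for fat12.c.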

Future work

• Applying concept keywords to a Bug Tracking System (BTS) to see the relationship between a bug report and the corresponding problem source code.

Bug-report no.1, Overview: It could not read directories.

Bug-report no.3, Overview: I could not catch system calls.

fat12.c (concept keyword: dirent): read_dirent() { return NULL; }

task.c (concept keyword: signal): sys_signal() { sys_kill(); }

Concept keywords can bridge the gap between bug reports and source code.

IBM Watson Research Center

© 2005 IBM Corporation

Source code that talks: an exploration of Eclipse task comments and their implication to repository mining

Annie Ying (joint work with Jim Wright & Steve Abrams)

Annie Ying et al., IBM Research


In a software development task...

[Diagram: development artifacts (reqs, source code, emails, change reports) carry task-oriented info and communication. Example source comment: "// Joan, please fix this" above void m1() in class Foo.]



Empirical study on Eclipse task comments

Eclipse task comments

// TODO an ugly hack for now –sue. Joan, please fix it

// TODO eliminate this once ECR 317 complete

[Chart: categories of task comments: communication, pointer to a change request, past task, current task, future task, location marker]



Conclusion

Presented observations on uses of comments

– e.g., task-oriented info and communication

Take-home message:

– When mining software repositories, consider analyzing comments.



The End



Challenges in analyzing Eclipse task comments

Eclipse task comments

// TODO an ugly hack for now –sue. Joan, please fix it

// TODO eliminate this once ECR 317 complete

// TODO explain why this method is public

// TODO once we have Eclipse-icon-decorator mechanism, use it here

// TODO workaround for ......
// End workaround

Challenges: implied context, informality, fuzzy scope
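A first step in analyzing task comments is simply pulling them out of source files. The sketch below is a minimal, regex-based extractor for single-line // TODO comments and is not the tooling used in the study. The pattern names are assumptions, and the "ECR <number>" pattern is inferred only from the example comment above.

```python
import re

# Assumed patterns: single-line "// TODO" comments, and change-request
# pointers written as "ECR <number>" as in the example on the slide.
TASK_COMMENT = re.compile(r"//\s*TODO\b[:\s]*(.*)")
CHANGE_REQUEST = re.compile(r"\bECR\s+(\d+)")

def extract_task_comments(source):
    """Return (line_number, text) for each single-line // TODO comment."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = TASK_COMMENT.search(line)
        if match:
            found.append((number, match.group(1).strip()))
    return found

sample = """class Foo {
    // TODO eliminate this once ECR 317 complete
    void m1() {
    }
    // TODO explain why this method is public
    public void m2() {
    }
}"""

tasks = extract_task_comments(sample)
for number, text in tasks:
    request = CHANGE_REQUEST.search(text)
    pointer = "ECR " + request.group(1) if request else "none"
    print(number, text, "| change request:", pointer)
```

A line-based extractor like this captures the comment text but not the code region it applies to, which is the fuzzy scope problem listed above. It also ignores block comments and would match TODO text inside string literals.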

Text Mining for Software Engineering: How Analyst Feedback Impacts Final Results

Jane Huffman Hayes, Alex Dekhtyar, Senthil Karthikeyan Sundaram

*Funded by NASA

Department of Computer Science

University of Kentucky

Question of the Day

What can Data Mining do for Software Engineering?

Answer 1: Help study the process
– After-the-fact
– Exploratory
– Conclusions help future projects

Answer 2: Help improve the process!



Our Approach

Use Data Mining during the process

[Diagram: Task, Automated “Mining” Tool, Analyst, Feedback Loop, Final Result. Labels: Objective Study (RE’04, PROMISE’05), Subjective Study]

Use Mining During the Process?

Ultimately, we are interested in the accuracy of the final result.

[Diagram: Analyst, Final Result]

Preliminary Study

Question: What would the analyst do with machine-generated data?

Task: Requirements Tracing

Metrics: Precision, Recall

[Diagram: Analyst, Final Result]


Emulated candidate link lists, the analyst's final results, and the change:

Candidate link lists | Final result | Change
Pr    Rec            | Pr    Rec    | ΔPr    ΔRec
40%   60%            | 45%   56%    | +5%    -4%
20%   90%            | 58%   65%    | +38%   -25%
80%   30%            | 23%   27%    | -57%   -2%
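For reference, precision and recall of a link list can be computed as in the sketch below, which also reports the change between a tool-generated list and an analyst-edited one. The requirement and design element names and both link lists are invented for illustration and are not data from the study.

```python
def precision_recall(candidate_links, true_links):
    """Precision and recall of a candidate link list against the true links."""
    candidate_links, true_links = set(candidate_links), set(true_links)
    correct = candidate_links & true_links
    precision = len(correct) / len(candidate_links) if candidate_links else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall

# Hypothetical trace links between requirements (R*) and design elements (D*).
true_links = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4"), ("R3", "D5")}
tool_list = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R2", "D9"), ("R3", "D8")}
analyst_list = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4")}

tool_pr, tool_rec = precision_recall(tool_list, true_links)
final_pr, final_rec = precision_recall(analyst_list, true_links)
delta_pr, delta_rec = final_pr - tool_pr, final_rec - tool_rec
print(f"tool:    Pr={tool_pr:.0%} Rec={tool_rec:.0%}")
print(f"analyst: Pr={final_pr:.0%} Rec={final_rec:.0%}")
print(f"change:  dPr={delta_pr:+.0%} dRec={delta_rec:+.0%}")
```

In this made-up case the analyst removes two false links and adds one true link, so both metrics improve. The table above shows that real analysts can also make a list worse.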

[Scatter plot: Precision and Recall, both axes 0 to 100; series T1, T3, T4, From RE2003 reg]

Trend???

(Not Quite) Conclusions

• New Field of Study
• Larger Study Needed

Call for Help!

WANTED! VOLUNTEERS!

Thank You!

Signature Change Analysis

Sunghun Kim, Jim Whitehead, Jennifer Bevan

{hunkim, ejw, jbevan}@cs.ucsc.edu

University of California, Santa Cruz

Biological and Software Evolution


• Can we shape the software evolution path?
– LOC
– Number of Changes
– Structural Changes
– Signature Changes

[Diagram: versions v1, v2, v3]

Found Signature Change properties

• The most common signature change kinds are complex data type, parameter addition, parameter ordering, and parameter deletion.

[Bar chart, axis 0 to 60: signature change kinds (parameter name change, only ordering change, addition, deletion, modifier change, array/pointer, complex type name change, primitive type change) for A 1.3, A 2, APR, APU, CVS, GCC, SVN, and AVG]


• More than half of function signatures never change. About 90% of function signatures change less than three times.

• A function’s signature changes after every 5-15 function body changes.

• A project’s average number of parameters per function remains relatively constant over time.

• Functions typically have parameter lists with 1, 2, or 3 parameters.


• Weak correlations between signature change and other changes including LOC and function body changes.

• Each project has its own signature change patterns, and the pattern can be discovered after analyzing the first 1000 to 1500 revisions.
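To make the change kinds concrete, the sketch below labels the difference between two versions of a parameter list. It is a simplified heuristic written for illustration and not the authors' analysis. The labels, the (type, name) representation, and the rule that only one kind is reported per change are all assumptions.

```python
def classify_signature_change(old_params, new_params):
    """Label the change between two parameter lists.

    Each parameter is a (type, name) pair. Returns one label, or None
    when the lists are identical. Only one kind is reported per change,
    which is a simplification.
    """
    if old_params == new_params:
        return None
    if len(new_params) > len(old_params):
        return "addition"
    if len(new_params) < len(old_params):
        return "deletion"
    if sorted(old_params) == sorted(new_params):
        return "ordering"
    old_types = [param_type for param_type, _ in old_params]
    new_types = [param_type for param_type, _ in new_params]
    if old_types == new_types:
        return "name change"
    return "type change"

# Hypothetical versions of one C function's parameter list.
v1 = [("int", "fd")]
v2 = [("int", "fd"), ("char *", "buf")]
v3 = [("char *", "buf"), ("int", "fd")]
v4 = [("char *", "buffer"), ("int", "fd")]

for old, new in [(v1, v2), (v2, v3), (v3, v4)]:
    print(classify_signature_change(old, new))
```

Counting these labels over a project's revision history would give per-project distributions of the sort summarized in the bullets above.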

[Bar charts for SVN and A 1.3, axis 0 to 60: signature change kinds (parameter name change, only ordering changes, addition, deletion, modifier change, complex type name change) after the first 100, 200, 300, 500, 1000, 1500, 2000, and 5000 revisions and for all revisions (6029 for SVN, 7747 for A 1.3)]


• Probability of a change kind depends on previous changes.

[Figure: trees of successive change kinds (nodes labeled A, D, O, C) annotated with probabilities, for (a) APR and (b) Apache 2]
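One simple way to quantify how a change kind depends on the previous one is to estimate first-order transition probabilities from each function's change history. The sketch below does only that and may differ from how the figures on the slide were computed. The histories and the label names are invented for the example.

```python
from collections import Counter, defaultdict

def transition_probabilities(histories):
    """Estimate P(next kind | previous kind) from per-function histories.

    histories is a list of change-kind sequences, one per function.
    """
    counts = defaultdict(Counter)
    for history in histories:
        for previous, current in zip(history, history[1:]):
            counts[previous][current] += 1
    return {
        previous: {kind: n / sum(following.values()) for kind, n in following.items()}
        for previous, following in counts.items()
    }

# Hypothetical signature change histories of four functions.
histories = [
    ["addition", "type", "type"],
    ["addition", "type"],
    ["addition", "deletion"],
    ["ordering", "type"],
]
probabilities = transition_probabilities(histories)
for previous, following in probabilities.items():
    print(previous, following)
```

A first-order model looks only one change back. Conditioning on longer prefixes of the history would need sequence-valued keys instead of single labels.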

Future Work

• Signature change analysis on OOP (Java)
– The results presented here are based on open source projects written in a procedural programming language (C): Apache HTTP 1.3, Apache HTTP 2.0, Apache Portable Runtime, APR utility, CVS, GCC, and Subversion
– Find OOP signature change properties and compare them with those from a procedural language
• Changes inside Struct/Class
– Variable addition/deletion
– Variable renaming
– Method addition/deletion


Linear Predictive Coding and Cepstrum coefficients for mining time variant information from software repositories

G. Antoniol, F. Rollo and G. Venturi

RCOST – University of Sannio - Italy

LPC Idea

• Model a time series with a polynomial approximation
• LPC Cepstrum: smooth the spectrum
• Define the distance between two time series as the distance between their polynomial approximations
• Use distance to cluster time series with identical or similar evolutions.
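The idea of comparing time series through their LPC models can be sketched as follows. The code estimates LPC coefficients with the autocorrelation method (Levinson-Durbin recursion) and uses the Euclidean distance between coefficient vectors. It omits the cepstrum step mentioned above, so it is a simplified stand-in and not the authors' procedure. The file names, the size series, and the choice of order 2 are assumptions.

```python
import math

def autocorrelation(series, max_lag):
    """Autocorrelation values r[0..max_lag] of a series."""
    n = len(series)
    return [
        sum(series[i] * series[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def lpc_coefficients(series, order):
    """LPC coefficients a[1..order] via the Levinson-Durbin recursion."""
    r = autocorrelation(series, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error == 0:
            break  # degenerate series, e.g. all zeros
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1 - k * k
    return a[1:]

def lpc_distance(first, second, order):
    """Euclidean distance between the LPC coefficient vectors of two series."""
    ca, cb = lpc_coefficients(first, order), lpc_coefficients(second, order)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

def similar_pairs(named_series, order, threshold):
    """Pairs of series whose LPC distance is at most the threshold."""
    names = sorted(named_series)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if lpc_distance(named_series[p], named_series[q], order) <= threshold
    ]

# Hypothetical file sizes over eight releases; b.c is a.c scaled by two.
sizes = {
    "a.c": [100, 110, 125, 130, 150, 170, 175, 190],
    "b.c": [200, 220, 250, 260, 300, 340, 350, 380],
    "c.c": [500, 100, 480, 90, 510, 120, 470, 80],
}
print(similar_pairs(sizes, order=2, threshold=1e-3))
```

Because LPC coefficients do not change when a series is multiplied by a constant, a.c and b.c are grouped together even though their absolute sizes differ, while the oscillating c.c is not.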

LPC and Linux Kernel

• 211 Linux releases, about 1700 files
• Study the influence of the number of coefficients
• Study the influence of distance thresholds
• Mine files with similar evolution: create groups of files with the same or very similar size evolution

[Figure: Similar pair of evolving files (axis values 0 to 800 and 1 to 248)]

[Figure: Similar pairs for different thresholds and coefficients used (logarithmic axis 100 to 10000; coefficients 12, 16, 20, 32; thresholds 1E-3, 1E-4, 1E-5)]

Complementing Each Other: GQM & DMAIC

GQM (Goal-Question-Metric)

• CMM sometimes criticized for emphasizing repeatability over improving productivity.

• GQM strong in defining metrics appropriate to business goals and nature of the process.

DMAIC (Define-Measure-Analyze-Improve-Control)

• Six Sigma sometimes criticized as inappropriate for processes characterized by knowledge efforts.

• DMAIC strong in focus on continuous iterative process improvement.

CMM+6σ Process Improvement Cycle

[Cycle diagram: Define, Measure, Analyze, Improve, Control, with the activities Assess and Collect Data. Other labels: Baselines, Weaknesses, Opportunities; Requirements, Activities, Changes, Time, Results; Progress, Defects, Delays, Dissatisfactions; Hypotheses, Trends, Indicators, Causes]

Areas of Concern

• Architecture
– Design weaknesses
– General or for new demands
• Bottlenecks
– Areas for focused attention
• Causal Connections
– System view of process
– Root cause analysis

Mining Version Histories to Verify the Learning Process of LPP

• Mining the Boundary of Openness of an Open Source Software Project

• Explore if we can apply Open Source Development (OSD) Process to Proprietary Software

• Show the Boundary of Openness during OSD

Shih-Kung Huang, Kang-min Liu

National Chiao Tung University

Method

• Team Members
– Core = Relatively Important Developers
– NonCore = All – Core
• Source Code
– Kernel = All – NonKernel
– NonKernel = {d | d is touched by one of the NonCore}
• Project Characteristic function
– f(x) = {y | y is the kernel ratio with respect to the core ratio of x}
– Kernel Ratio = (Kernel Size) / All
– Core Ratio = (Core Team Size) / All
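The definitions above can be turned into a small computation. The sketch below grows the core team one developer at a time and records the resulting (core ratio, kernel ratio) points. The developer ranking, the use of file counts as "size", and the sample data are assumptions made for illustration and are not taken from the study.

```python
def characteristic_function(touched_by, ranked_developers):
    """Return (core_ratio, kernel_ratio) points for growing core teams.

    touched_by maps each developer to the set of files they changed.
    ranked_developers lists developers from most to least important;
    how importance is measured is left open here.
    """
    all_files = set().union(*touched_by.values())
    points = []
    for core_size in range(1, len(ranked_developers) + 1):
        non_core = ranked_developers[core_size:]
        # NonKernel: everything touched by at least one non-core developer.
        non_kernel = set()
        for developer in non_core:
            non_kernel |= touched_by[developer]
        kernel = all_files - non_kernel
        core_ratio = core_size / len(ranked_developers)
        kernel_ratio = len(kernel) / len(all_files)
        points.append((core_ratio, kernel_ratio))
    return points

# Hypothetical version history summary.
touched_by = {
    "alice": {"core.c", "io.c", "ui.c", "doc.txt"},
    "bob": {"io.c", "ui.c"},
    "carol": {"ui.c", "doc.txt"},
    "dave": {"doc.txt"},
}
ranked = ["alice", "bob", "carol", "dave"]
points = characteristic_function(touched_by, ranked)
for core_ratio, kernel_ratio in points:
    print(f"core ratio {core_ratio:.2f} -> kernel ratio {kernel_ratio:.2f}")
```

A curve that stays low until the core ratio is large suggests that most of the code is open to peripheral contributors, while a curve that rises early indicates a kernel guarded by a small core team.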

[Figures for the projects gallery, moodle, phpmyadmin, GCC, Slashcode, and Pugs]

Conclusions

• Obtain the characteristic function of each project team
– Reveal different team constitutions with varied involvement in the software
• An implication: develop a hybrid software process model that embeds OSD into commercial software.
– OpenDarwin: Mac OS X
– Helix: Real Network Server

Towards a Taxonomy of Approaches for Mining of Source Code Repositories

Huzefa H. Kagdi, Michael L. Collard, Jonathan I. Maletic

Software Development Laboratory <SDML>

Department of Computer Science

Kent State University

Kent, Ohio, USA

Motivation

• A number of approaches have been proposed to derive and express changes from source code repositories in a more source-code “aware” manner

• We need better insight of the current research in the MSR community in order to facilitate building efficient and effective MSR tools

Building a Taxonomy

• Draw similarities and variations between six MSR approaches based on three dimensions
– Entity type and granularity
– How changes are expressed and defined
– Type of MSR question

• Define notations to describe MSR to facilitate a taxonomic description of approaches

An Initial Taxonomy

Approach | Entity | Change | Question

Annotation Analysis
Gall et al | class | syntax and semantic - hidden dependencies | market basket and prevalence
German | file & comment | syntax and semantic - file coupling | market basket and prevalence

Heuristic
Hassan et al | function & variable | syntax and semantic - dependencies | market basket

Data Mining (association rule)
Zimmerman et al | class & method | syntax and semantic - association rules | market basket

Differencing
Raghavan et al | logical statement | syntax and semantic - move | prevalence
Collard et al | logical statement | syntax - add, delete, modify | prevalence

Conclusions

• Most of the approaches except Differencing work with fairly high-level entities

• Very different semantic information is used in these approaches

• Further investigation is necessary to distinguish how changes are expressed

A Framework for Describing and Understanding Mining Tools in Software Development

D.M. German, D. Čubranić, and M.-A. Storey

University of Victoria

Introduction

• Software engineering is a collaborative activity → activity awareness is important

• Can be provided by mining software repositories

• A variety of mining tools → how to compare?

• Do we mine what is easy to mine and think about the uses for it later?

Proposal

• Develop a framework for describing tools for mining software repositories

• Purpose:
– Help designers understand and compare tools
– Help users assess tools
– Identify new research areas

• Keep the specific user needs and tasks in the forefront!

The Framework

• Intent
– Role, time, cognitive support
• Information
– Change management, program code, defect tracking
– Informal communication, local history, correlated information
• Infrastructure
– Requirements, offline/online, storage backend

What Next?

• Applied the framework to three tools:
– softChange
– Hipikat
– Xia/Creole

• We invite researchers to apply it to their tools and give us feedback on their experiences
