Toward Mining “Concept Keywords” from Identifiers in Large Software Projects
Masaru Ohba and Katsuhiko Gondow
Tokyo Institute of Technology
What are “concept keywords”?
• Most programmers try to name identifiers meaningfully.
• Concept keywords are defined terms that describe key concepts to aid in program understanding.
– e.g. read_dirent(): dirent is a concept keyword.
Human-selected concept keywords and other category words in udos:
Concept keywords: dirent, root, PTE, tss, path, signal, yield
Grouping words: kbd_, vga_, FAT12_, sys_, H, t
Attributes, less important concepts: busy, byte, offset, name, memory, end, int8, again
Generic verbs: read, set, is, move, wait, print, dump, make, init
Suggestion
• We should use more “concept keywords” in program understanding tools.
– Concept keywords are concise and descriptive.
• Our solution:
– provides a way to mine concept keywords.
  • ckTF/IDF methods / Identifier Exploratory Framework
– could be used to build tools that support and utilize extracted concept keywords (future work).
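To make the mining idea concrete, here is a minimal sketch that splits identifiers into words and ranks them with plain TF-IDF. It is not the authors' ckTF/IDF method, whose details the slides do not give; the function names and the sample identifier lists are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(identifier):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in identifier.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def tf_idf(files):
    """files maps a file name to the list of identifiers found in it."""
    term_counts = {
        name: Counter(w for ident in idents for w in split_identifier(ident))
        for name, idents in files.items()
    }
    # Document frequency: in how many files does each word occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    scores = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores[name] = {
            w: (c / total) * math.log(len(files) / doc_freq[w])
            for w, c in counts.items()
        }
    return scores


# Hypothetical identifier lists, loosely modelled on the udos examples.
example_files = {
    "fat12.c": ["read_dirent", "write_dirent", "read_fat"],
    "task.c": ["sys_signal", "sys_kill", "read_task"],
}
example_scores = tf_idf(example_files)
```

In this toy input the generic verb "read" occurs in both files and scores zero, while "dirent" is specific to fat12.c and scores above zero, which is the kind of separation between generic verbs and concept keywords the slides describe.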
Future work
• Applying concept keywords to a Bug Tracking System (BTS) to see the relationship between bug reports and the corresponding problem source code.
Bug-report no.1, Overview: It could not read directories.
  concept keyword: dirent
  fat12.c: read_dirent() { return NULL; }
Bug-report no.3, Overview: I could not catch system calls.
  concept keyword: signal
  task.c: sys_signal() { sys_kill(); }
Concept keywords can bridge the gap between bug reports and source code.
IBM Watson Research Center
© 2005 IBM Corporation
Source code that talks: an exploration of Eclipse task comments and their implication to repository mining
Annie Ying (joint work with Jim Wright & Steve Abrams)
In a software development task...
[Diagram: development artifacts (reqs, source code, emails, change reports) carry task-oriented info and communication. Source code example:]
class Foo {
  // Joan, please fix this
  void m1() {
  }
}
Empirical study on Eclipse task comments
Eclipse task comments
// TODO an ugly hack for now –sue. Joan, please fix it
// TODO eliminate this once ECR 317 complete
[Figure labels: communication, pointer to a change request, past task, current task, future task, location marker]
Conclusion
Presented observations on uses of comments
– e.g., task-oriented info and communication
Take-home message:
– When mining software repositories, consider analyzing comments.
Challenges in analyzing Eclipse task comments
Eclipse task comments
// TODO an ugly hack for now –sue. Joan, please fix it
// TODO eliminate this once ECR 317 complete
// TODO explain why this method is public
// TODO once we have Eclipse-icon-decorator mechanism, use it here
// TODO workaround for ...
...
// End workaround
implied context
informality
fuzzy scope
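Before any of these challenges can be studied, the task comments have to be pulled out of the source. The following is a minimal sketch, assuming single-line `//` comments and the tag list TODO, FIXME, XXX; the tag list and the function name are assumptions, and block comments and string literals are not handled.

```python
import re

# Assumed tag list; a project may configure different task tags.
TASK_COMMENT = re.compile(r"//\s*(TODO|FIXME|XXX)\b[:\s]*(.*)")


def find_task_comments(source):
    """Return (line number, tag, text) for every single-line task comment."""
    found = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TASK_COMMENT.search(line)
        if match:
            found.append((line_number, match.group(1), match.group(2).strip()))
    return found


example_source = """class Foo {
  // TODO an ugly hack for now -sue. Joan, please fix it
  void m1() {
  }
  // TODO eliminate this once ECR 317 complete
}
"""
example_comments = find_task_comments(example_source)
```

The extracted text is only the starting point: the implied context, informality, and fuzzy scope listed above still have to be resolved by whatever analysis comes next.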
Text Mining for Software Engineering: How Analyst Feedback Impacts Final Results
Jane Huffman Hayes, Alex Dekhtyar, Senthil Karthikeyan Sundaram
*Funded by NASA
Department of Computer Science
University of Kentucky
Question of the Day: What can Data Mining do for Software Engineering?
Answer 1: Help study the process
– After-the-fact
– Exploratory
– Conclusions help future projects
Answer 2: Help improve the process!
Use Mining During the Process?
[Diagram: Task, Automated “Mining” Tool, Analyst, Feedback Loop, Final Result]
Objective Study (RE’04, PROMISE’05); Subjective Study
Ultimately, we are interested in the accuracy of the final result.
Preliminary Study
[Diagram: Automated “Mining” Tool (emulated), Analyst, Final Result]
Question: What would the analyst do with machine-generated data?
Task: Requirements Tracing
Metrics: Precision, Recall

Candidate link lists (emulated) | After analyst | Change
Pr    Rec                       | Pr    Rec     | ΔPr    ΔRec
40%   60%                       | 45%   56%     | +5%    -4%
20%   90%                       | 58%   65%     | +38%   -25%
80%   30%                       | 23%   27%     | -57%   -2%
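For reference, the two metrics can be computed from a candidate link list and the set of true links as sketched below. The link identifiers are made up for illustration and are not data from the study.

```python
def precision_recall(candidate_links, true_links):
    """Precision and recall of a candidate link list against the true links."""
    candidate_links, true_links = set(candidate_links), set(true_links)
    correct = candidate_links & true_links
    precision = len(correct) / len(candidate_links) if candidate_links else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall


# Hypothetical requirement-to-design links.
true_links = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4"), ("R3", "D5")}
tool_list = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R2", "D9"), ("R3", "D8")}
# The analyst drops two false links, wrongly drops one true link, and adds one.
analyst_list = {("R1", "D1"), ("R1", "D2"), ("R3", "D4")}

tool_pr, tool_rec = precision_recall(tool_list, true_links)
final_pr, final_rec = precision_recall(analyst_list, true_links)
delta_pr, delta_rec = final_pr - tool_pr, final_rec - tool_rec
```

The ΔPr and ΔRec columns above are differences of this kind between the analyst's final list and the candidate list the analyst was given.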
[Plot: Precision vs. Recall, both axes 0 to 100; legend: T1 T3 T4 From RE2003 reg]
Trend???
(Not Quite) Conclusions
• New Field of Study
• Larger Study Needed
Signature Change Analysis
Sunghun Kim, Jim Whitehead, Jennifer Bevan
{hunkim, ejw, jbevan}@cs.ucsc.edu
University of California, Santa Cruz
Biological and Software Evolution
• Can we shape software evolution path?
– LOC
– Number of Changes
– Structural Changes
– Signature Changes
[Diagram: versions v1, v2, v3]
Found Signature Change properties
• The most common signature change kinds are complex data type, parameter addition, parameter ordering, and parameter deletion.
[Bar chart: signature change kinds per project, y-axis 0 to 60. Kinds: Parameter name change, Only ordering change, Addition, Deletion, Modifier change, Array/Pointer, Complex type name change, Primitive type change. Projects: A 1.3, A 2, APR, APU, CVS, GCC, SVN, AVG]
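A much simplified sketch of how a signature change could be assigned to one of these kinds. Parameters are modelled as (type, name) pairs, the three type-related kinds in the chart are lumped into "type change", and the function name is hypothetical; the slides do not describe the actual classification procedure.

```python
def classify_signature_change(old_params, new_params):
    """Classify the change between two parameter lists of (type, name) pairs."""
    if old_params == new_params:
        return "no change"
    if len(new_params) > len(old_params):
        return "addition"
    if len(new_params) < len(old_params):
        return "deletion"
    # Same number of parameters from here on.
    if sorted(old_params) == sorted(new_params):
        return "only ordering change"
    if [t for t, _ in old_params] == [t for t, _ in new_params]:
        return "parameter name change"
    return "type change"


old = [("int", "fd"), ("char *", "buf")]
examples = {
    "addition": classify_signature_change(old, old + [("size_t", "len")]),
    "deletion": classify_signature_change(old, old[:1]),
    "ordering": classify_signature_change(old, old[::-1]),
    "rename": classify_signature_change(old, [("int", "handle"), ("char *", "buf")]),
    "type": classify_signature_change(old, [("long", "fd"), ("char *", "buf")]),
}
```

A real classifier would also need to handle combined changes, for example an addition together with a rename, which this sketch reports only as an addition.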
Found Signature Change properties
• More than half of function signatures never change. About 90% of function signatures change fewer than three times.
• A function’s signature changes after every 5-15 function body changes.
• A project’s average number of parameters per function remains relatively constant over time.
• Functions typically have parameter lists with 1, 2, or 3 parameters.
Found Signature Change properties
• Weak correlations between signature change and other changes including LOC and function body changes.
• Each project has its own signature change patterns, and the pattern can be discovered after analyzing the first 1000 to 1500 revisions.
[Bar chart, SVN: signature change kinds, y-axis 0 to 60. Kinds: Parameter name change, Only ordering changes, Addition, Deletion, Modifier change, Complex type name change. Series: 100, 200, 300, 500, 1000, 1500, 2000, 5000, 6029]
[Bar chart, A 1.3: signature change kinds, y-axis 0 to 60. Kinds: Parameter name change, Only ordering changes, Addition, Deletion, Modifier change, Complex type name change. Series: 100, 200, 300, 500, 1000, 1500, 2000, 5000, 7747]
Found Signature Change properties
• Probability of a change kind depends on previous changes.
[Figure: trees of change-kind sequences with node labels A, D, O, C and a probability on each edge; panels (a) APR and (b) Apache 2]
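One way such conditional probabilities could be estimated is by counting consecutive change kinds in each function's history, as in the sketch below. The letters follow the figure's node labels; the histories are invented, and the first-order counting scheme is an assumption rather than the authors' stated procedure.

```python
from collections import Counter, defaultdict


def transition_probabilities(histories):
    """Estimate P(next change kind | previous change kind) from histories."""
    counts = defaultdict(Counter)
    for history in histories:
        for previous, current in zip(history, history[1:]):
            counts[previous][current] += 1
    return {
        previous: {kind: n / sum(following.values()) for kind, n in following.items()}
        for previous, following in counts.items()
    }


# Hypothetical per-function histories, one letter per signature change.
example_histories = ["ACC", "AOC", "CC"]
example_probabilities = transition_probabilities(example_histories)
```

In this toy input a change of kind A is followed by C or O with probability 0.5 each, and a change of kind C is always followed by another C.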
Future Work
• Signature change analysis on OOP (Java)
– The results presented here are based on open source projects written in a procedural programming language (C): Apache HTTP 1.3, Apache HTTP 2.0, Apache Portable Runtime, APR utility, CVS, GCC, and Subversion
– Find OOP signature change properties and compare them with those from a procedural language
• Changes inside Struct/Class
– Variable addition/deletion
– Variable renaming
– Method addition/deletion
Linear Predictive Coding and Cepstrum coefficients for mining time variant information from software repositories
G. Antoniol, F. Rollo and G. Venturi
RCOST, University of Sannio, Italy
LPC Idea
• Model a time series with a polynomial approximation
• LPC Cepstrum smooths the spectrum
• Define the distance between two time series as the distance between their polynomial approximations
• Use distance to cluster time series with identical or similar evolutions.
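The idea above can be sketched as follows, assuming that the "polynomial approximation" is the LPC predictor polynomial, that coefficients are estimated with the autocorrelation method (Levinson-Durbin), and that the distance is Euclidean over cepstral coefficients. The slides do not state these choices, and all function names are hypothetical.

```python
import math


def autocorrelation(series, max_lag):
    n = len(series)
    return [sum(series[i] * series[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(series, order):
    """Levinson-Durbin; returns [1, a1..ap] for A(z) = 1 + sum a_k z^-k."""
    r = autocorrelation(series, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # nothing left to predict; remaining coefficients stay 0
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + k * previous[i - j]
        a[i] = k
        error *= 1.0 - k * k
    return a


def lpc_cepstrum(a):
    """Cepstral coefficients c1..cp of the all-pole model 1/A(z)."""
    order = len(a) - 1
    c = [0.0] * (order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c[1:]


def cepstral_distance(series_a, series_b, order):
    ca = lpc_cepstrum(lpc(series_a, order))
    cb = lpc_cepstrum(lpc(series_b, order))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def similar_pairs(series_by_file, order, threshold):
    """All file pairs whose cepstral distance is within the threshold."""
    names = sorted(series_by_file)
    return [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
            if cepstral_distance(series_by_file[p], series_by_file[q], order) <= threshold]


# Hypothetical series: b.c is a scaled copy of a.c, c.c evolves differently.
decay = [0.5 ** n for n in range(50)]
example_series = {
    "a.c": decay,
    "b.c": [2.0 * v for v in decay],
    "c.c": [(-0.5) ** n for n in range(50)],
}
example_pairs = similar_pairs(example_series, order=1, threshold=1e-3)
```

Under these assumptions the coefficients do not depend on the scale of the series, so two files whose sizes evolve with the same shape end up at distance zero even if one is twice as large as the other.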
LPC and Linux Kernel
• 211 Linux releases, about 1700 files
• Study the influence of the number of coefficients
• Study the influence of distance thresholds
• Mine files with similar evolution: create groups of files with the same or very similar size evolution
[Plot: similar pair of evolving files; y-axis 0 to 800, x-axis 1 to 248]
[Plot: similar pairs for different thresholds and coefficients used; logarithmic y-axis 100 to 10000; coefficients 12, 16, 20, 32; thresholds 1E-3, 1E-4, 1E-5]
Complementing Each Other: GQM & DMAIC
GQM (Goal-Question-Metric)
• CMM sometimes criticized for emphasizing repeatability over improving productivity.
• GQM strong in defining metrics appropriate to business goals and nature of the process.
DMAIC (Define-Measure-Analyze-Improve-Control)
• Six Sigma sometimes criticized as inappropriate for processes characterized by knowledge efforts.
• DMAIC strong in focus on continuous iterative process improvement.
CMM+6σ Process Improvement Cycle
[Diagram: cycle of Define, Measure, Analyze, Improve, Control, with the activities Assess and Collect Data. Items shown around the cycle: Baselines, Weaknesses, Opportunities; Requirements, Activities, Changes, Time, Results; Progress, Defects, Delays, Dissatisfactions; Hypotheses, Trends, Indicators, Causes]
Areas of Concern
• Architecture
– Design weaknesses
– General or for new demands
• Bottlenecks
– Areas for focused attention
• Causal Connections
– System view of process
– Root cause analysis
Mining Version Histories to Verify the Learning Process of LPP
• Mining the Boundary of Openness of an Open Source Software Project
• Explore if we can apply Open Source Development (OSD) Process to Proprietary Software
• Show the Boundary of Openness during OSD
National Chiao Tung University
Shih-Kung Huang, Kang-min Liu
Method
• Team Members
– Core = Relatively Important Developers
– NonCore = All - Core
• Source Code
– Kernel = All - NonKernel
– NonKernel = {d | d is touched by one of the NonCore}
• Project Characteristic function
– f(x) = {y | y is the kernel ratio with respect to the core ratio of x}
– Kernel Ratio = (Kernel Size) / All
– Core Ratio = (Core Team Size) / All
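The definitions above can be sketched in code as follows. Two assumptions are made that the slides leave open: kernel size is measured as a number of files, and "relatively important" developers are ranked by how many files they touched. File and developer names are invented.

```python
from collections import Counter


def kernel_ratio(touched_by, core):
    """touched_by maps each source file to the set of developers who changed it."""
    all_developers = set().union(*touched_by.values())
    non_core = all_developers - set(core)
    # NonKernel = files touched by at least one NonCore developer.
    non_kernel = {f for f, devs in touched_by.items() if devs & non_core}
    kernel = set(touched_by) - non_kernel
    return len(kernel) / len(touched_by)


def characteristic_function(touched_by):
    """(core ratio, kernel ratio) points, growing the core one developer at a time."""
    activity = Counter(dev for devs in touched_by.values() for dev in devs)
    # Assumed importance ranking: most files touched first, ties by name.
    ranked = [dev for dev, _ in sorted(activity.items(), key=lambda item: (-item[1], item[0]))]
    return [(k / len(ranked), kernel_ratio(touched_by, ranked[:k]))
            for k in range(1, len(ranked) + 1)]


# Hypothetical version history summary.
example_history = {
    "kernel/sched.c": {"alice"},
    "kernel/mm.c": {"alice", "bob"},
    "drivers/net.c": {"bob", "carol"},
    "docs/readme": {"carol", "dave"},
}
example_curve = characteristic_function(example_history)
```

Plotting the resulting points gives one curve per project, and the shape of that curve is what distinguishes teams with different degrees of involvement.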
Conclusions
• Obtain the characteristic function of each project team
– Reveals different team constitutions with varied involvement in the software
• An implication: develop a hybrid software process model that embeds OSD into commercial software.
– OpenDarwin: Mac OS X
– Helix: Real Network Server
Towards a Taxonomy of Approaches for Mining of Source Code Repositories
Huzefa H. Kagdi, Michael L. Collard, Jonathan I. Maletic
Software Development Laboratory <SDML>
Department of Computer Science
Kent State University
Kent, Ohio, USA
Motivation
• A number of approaches have been proposed to derive and express changes from source code repositories in a more source-code “aware” manner
• We need better insight into the current research in the MSR community in order to facilitate building efficient and effective MSR tools
Building a Taxonomy
• Draw similarities and variations between six MSR approaches based on three dimensions
– Entity type and granularity
– How changes are expressed and defined
– Type of MSR question
• Define notations to describe MSR to facilitate a taxonomic description of approaches
An Initial Taxonomy

Approach | Entity | Change | Question
Annotation Analysis
Gall et al. | class | syntax and semantic - hidden dependencies | market basket and prevalence
German | file & comment | syntax and semantic - file coupling | market basket and prevalence
Heuristic
Hassan et al. | function & variable | syntax and semantic - dependencies | market basket
Data Mining (association rule)
Zimmermann et al. | class & method | syntax and semantic - association rules | market basket
Differencing
Raghavan et al. | logical statement | syntax and semantic - move | prevalence
Collard et al. | logical statement | syntax - add, delete, modify | prevalence
Conclusions
• Most of the approaches except Differencing work with fairly high-level entities
• Very different semantic information is used in these approaches
• Further investigation is necessary to discern between how changes are expressed
A Framework for Describing and Understanding Mining Tools in
Software Development
D.M. German, D. Čubranić, and M.-A. Storey
University of Victoria
Introduction
• Software engineering is a collaborative activity → activity awareness is important
• Can be provided by mining software repositories
• A variety of mining tools → how to compare?
• Do we mine what is easy to mine and think about the uses for it later?
Proposal
• Develop a framework for describing tools for mining software repositories
• Purpose:
– Help designers understand and compare tools
– Help users assess tools
– Identify new research areas
• Keep the specific user needs and tasks in the forefront!
The Framework
• Intent
– Role, time, cognitive support
• Information
– Change management, program code, defect tracking
– Informal communication, local history, correlated information
• Infrastructure
– Requirements, offline/online, storage backend