International Conference on Program Comprehension (ICPC) 2008
A Traceability Technique for Specifications
Aharon Abadi, Mordechai Nisenson and Yahalomit Simionovici
ICPC 2008
2 A Comparison of Traceability Techniques for Specifications
Outline
Motivation
Goals
Our Solution: Outline of Traceability Link Process
IR Techniques
Experiments
Conclusions
Future work
Traceability
The ability to link between different artifacts
– Example artifacts: code, user manuals, design documentation, development wikis, etc.
In particular, link code to:
– Relevant requirements
– Sections in design documents
– Test-cases
– Other structured and free-text artifacts
Also, link from requirements, design documents, etc. to code
What’s Traceability Good For?
Program Comprehension
– Top-down
– Bottom-up
• Particularly relevant for the maintenance of legacy systems
Impact analysis
– Keeping non-code artifacts up-to-date
Requirement Tracing
– Discover what code needs to change to handle a new requirement
– Aid in determining whether a specification is completely implemented and covered by tests
Challenges
Scalability
– Large # of artifacts
Heterogeneity
– Large # of different document formats and programming languages
Noise
– Free-text information (natural language): conjunctions, prepositions, abbreviations, etc.
– Some information may be outdated, or just plain wrong
Prior work:
– Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods [De Lucia et al., 2007]
– Recovering Traceability Links between Code and Documentation [Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic, 2003]
Example
/** The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}
Goals
Examine the effectiveness of IR techniques for traceability between code and documentation on “real world” data
Most prior work compared 2 specific algorithms, LSI and VSM
– Is LSI really better?
– How does LSI compare with other dimensionality reduction techniques?
– How does it compare with techniques that do not reduce dimensionality?
How do different levels of abstraction affect the choice of the best methods?
– How to fit a method and parameters to a dataset?
Traceability Link Process
[Figure: traceability link process]
Offline processes: documents → sectoring → sections → text preprocessing → IR index
Query: partial code → words extraction → words expansion → text preprocessing → query construction, yielding (word1,rank1),…,(wordm,rankm) → ranking against the IR index → sections
Text Preprocessing
• Lower-casing; removal of stop-words, numbers, etc.
• Stemming
Example:
– Input: …Copyright owners grant member companies of the OMG permission to make a limited …
– Output: …copyright owner grant member compani omg permiss make limit …
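The first preprocessing step can be sketched in Java as follows. This is a minimal illustration, not the authors' implementation: the class name `TextPreprocessor`, the tiny stop-word list, and the tokenization rule (split on anything other than ASCII letters and digits) are all assumptions. Stemming is left out; in a real pipeline a stemmer would be applied to each surviving token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TextPreprocessor {
    // Tiny illustrative stop-word list (an assumption; a real system would use a fuller list).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "of", "the", "to", "in", "is", "for"));

    /** Lower-cases the text, tokenizes it, and drops stop-words and pure numbers. */
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: anything other than ASCII letters and digits separates tokens.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) continue;             // split may yield an empty leading token
            if (raw.matches("[0-9]+")) continue;     // drop pure numbers
            if (STOP_WORDS.contains(raw)) continue;  // drop stop-words
            tokens.add(raw);                         // a stemmer would be applied here
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [permission, make, limited, number, copies]
        System.out.println(preprocess("Permission to make a limited number of 42 copies"));
    }
}
```

Because no stemmer is applied, the sketch keeps "limited" and "permission" whole, whereas the slide's example reduces them to "limit" and "permiss".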
Words Extraction
Words are extracted from:
• Class name
• Public function names
• Public function arguments and return type
• Comments
• Super class name
Example:
/** The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}
Extracted words:
– Class name: FileImpl
– Public function name and argument: FileImpl, nativePath
– Super class name: FilePOA
– Comments: "Creates a new File…", "The File interface provides…"
Words Expansion
• Use well-known coding standards for sub-word separation
Example:
– Input: …NativePath, fileName, delete_all_elements…
– Output: …NativePath, Native, Path, fileName, File, Name, delete_all_elements, Delete, all, elements…
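A minimal Java sketch of sub-word separation, assuming the "well-known coding standards" are camelCase and underscore-separated identifiers. The class name `WordExpander` and the splitting rule are illustrative assumptions, not the authors' code. Sub-words keep their original case here, on the assumption that lower-casing happens later in text preprocessing.

```java
import java.util.ArrayList;
import java.util.List;

final class WordExpander {
    /**
     * Returns the identifier followed by its sub-words, split on underscores
     * and on lower-to-upper camelCase boundaries.
     */
    static List<String> expand(String identifier) {
        List<String> result = new ArrayList<>();
        result.add(identifier);
        // Split at "_" or between a lower-case letter/digit and an upper-case letter.
        String[] parts = identifier.split("_+|(?<=[a-z0-9])(?=[A-Z])");
        for (String part : parts) {
            // Skip empty pieces and the case where nothing was split.
            if (!part.isEmpty() && !part.equals(identifier)) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("NativePath"));          // [NativePath, Native, Path]
        System.out.println(expand("delete_all_elements")); // [delete_all_elements, delete, all, elements]
    }
}
```

Keeping the original identifier alongside its parts lets a query match documentation that mentions either the full name or its components.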
Information Retrieval (IR) Methods
Vector Space Model (VSM) [Salton et al., 1975] implemented by Lucene
– Each document, d, is represented by a vector of ranks of the terms in the vocabulary: $v_d = [r_d(w_1), r_d(w_2), \ldots, r_d(w_{|V|})]$
– The query is similarly represented by a vector
– The similarity between the query and the document is the cosine of the angle between their respective vectors
Jensen-Shannon Similarity Model [Abadi et al., 2008]
– Each document, d, is represented by its empirical probability distribution over words: $p_d(w)$
– The query is similarly represented
– The similarity score is calculated as $1 - \mathrm{JS}(p_q, p_d)$, where JS is the Jensen-Shannon divergence
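Both similarity scores can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than the Lucene or the authors' implementation: vectors and distributions are stored as sparse maps, the class name `Similarity` is hypothetical, and the Jensen-Shannon divergence uses base-2 logarithms so that it lies in [0, 1] (the slides do not state the base).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class Similarity {
    /** Cosine of the angle between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0, normQ = 0.0, normD = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            normQ += e.getValue() * e.getValue();
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : d.values()) normD += w * w;
        if (normQ == 0.0 || normD == 0.0) return 0.0; // empty vector: define similarity as 0
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }

    /** Normalizes raw term counts into an empirical probability distribution. */
    static Map<String, Double> distribution(Map<String, Integer> counts) {
        double total = 0.0;
        for (int c : counts.values()) total += c;
        Map<String, Double> p = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            p.put(e.getKey(), e.getValue() / total);
        }
        return p;
    }

    /** 1 - JS(p, q), where JS = 0.5 * KL(p || m) + 0.5 * KL(q || m) and m = (p + q) / 2. */
    static double jsSimilarity(Map<String, Double> p, Map<String, Double> q) {
        Set<String> words = new HashSet<>(p.keySet());
        words.addAll(q.keySet());
        double js = 0.0;
        for (String w : words) {
            double pw = p.getOrDefault(w, 0.0);
            double qw = q.getOrDefault(w, 0.0);
            double mw = 0.5 * (pw + qw);
            if (pw > 0.0) js += 0.5 * pw * log2(pw / mw); // terms with zero probability contribute 0
            if (qw > 0.0) js += 0.5 * qw * log2(qw / mw);
        }
        return 1.0 - js;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = new HashMap<>();
        query.put("file", 2);
        query.put("path", 1);
        Map<String, Integer> section = new HashMap<>();
        section.put("file", 1);
        section.put("interface", 1);
        System.out.println(jsSimilarity(distribution(query), distribution(section)));
    }
}
```

With base-2 logarithms, identical distributions score 1 and distributions with no words in common score 0, which makes the score directly usable for ranking sections against a query.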
Dimensionality Reduction Methods
LSI [Deerwester et al., 1990]
– Commonly used in prior studies
– An algebraic method
– Dimensions represent orthogonal topics
PLSI [Hofmann, 1999]
– Probabilistic extension to LSI
– Based on the assumption that documents are mixtures of topic distributions
– Words and documents are conditionally independent given the topic
SDR [Globerson and Tishby, 2003]
– Based on information theory
– Topics are sufficient statistics in information theory terms
– These statistics are functions that capture maximum mutual information between words and documents
Datasets
Software Communication Architecture (SCA) is an open architecture framework that defines how software and hardware elements operate within a software defined radio.
Common Object Request Broker Architecture (CORBA) is OMG's open, vendor-independent architecture and infrastructure that computer applications use to work together over networks.
Documentation details:
Dataset   Size (MB)   Sections   Vocabulary size
SCA       0.41        1311       4827
CORBA     1.79        3340       7161

Queries details:
Dataset   # classes   # relevant results / query   Total # of relevant results
SCA       7           6 – 13                       65
CORBA     4           5 – 20                       58
IR Quality Measures
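Precision @ n: $P(n) = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}_n|}{n}$
Recall @ n: $R(n) = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}_n|}{|\mathrm{relevant}|}$
Average precision: $AP = \frac{\sum_{n=1}^{N} P(n)\,\mathrm{rel}(n)}{|\mathrm{relevant}|}$
Here $\mathrm{retrieved}_n$ is the set of the top n retrieved results, $\mathrm{rel}(n)$ is 1 if the n-th result is relevant and 0 otherwise, and N is the number of retrieved results.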
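The three measures, in their commonly used form, can be computed from a ranked result list and a set of relevant items as sketched below. The class name `RankingMetrics` and the use of string identifiers for sections are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RankingMetrics {
    /** Number of relevant items among the top n results. */
    private static int hits(List<String> ranked, Set<String> relevant, int n) {
        int count = 0;
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) count++;
        }
        return count;
    }

    /** Precision @ n: relevant items in the top n, divided by n (assumes n >= 1). */
    static double precisionAt(List<String> ranked, Set<String> relevant, int n) {
        return (double) hits(ranked, relevant, n) / n;
    }

    /** Recall @ n: relevant items in the top n, divided by the number of relevant items. */
    static double recallAt(List<String> ranked, Set<String> relevant, int n) {
        return relevant.isEmpty() ? 0.0 : (double) hits(ranked, relevant, n) / relevant.size();
    }

    /** Average precision: sum of P(n) over the ranks n holding a relevant item, divided by |relevant|. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double sum = 0.0;
        int found = 0;
        for (int n = 1; n <= ranked.size(); n++) {
            if (relevant.contains(ranked.get(n - 1))) {
                found++;
                sum += (double) found / n; // P(n) * rel(n) with rel(n) = 1
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("s1", "s2", "s3", "s4");
        Set<String> relevant = new HashSet<>(Arrays.asList("s1", "s3"));
        // Relevant items sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
        System.out.println(averagePrecision(ranked, relevant));
    }
}
```

Mean Average Precision (MAP) is then the mean of `averagePrecision` over all queries.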
Mean Average Precision (MAP) versus Method
Mean Average Precision (MAP) versus Dimension
Precision versus Recall
Dimensionality of Datasets
[Figure: PLSI results; panels: SCA, CORBA]
Precision versus Recall over Algorithms for SCA
Precision versus Recall over Algorithms for CORBA
MAP versus Method – Combined over SCA & CORBA
Conclusions
Our most significant results are:
– IR techniques are effective for recovering traceability between code and documentation in real-world systems.
– For realistic datasets, the Vector Space Model and the Jensen-Shannon model, which do not perform dimensionality reduction, were shown to be the most effective.
– SDR was shown to be the best dimensionality reduction model; in particular, it is better than LSI.
– As the documentation links become more abstract, the performance of VSM, the JS model and SDR becomes equivalent.
Additional results:
– SDR was shown to be robust to the dataset's abstractness level
– LSI and PLSI are sensitive to the dataset's abstractness level
– We believe that PLSI's poor performance is due to the difficulty of modeling very short documents, which could result in severe overfitting
Future work
Development of new measures for evaluation of different IR algorithms and datasets, specifically for traceability
– Example: developing a measure of “abstractness” for a specification which will help with tuning of parameters such as dimensionality
Using dimensionality reduction techniques to create a thesaurus from the indexed data, and using it to add synonyms to the query
Traceability for other types of documents and links
Investigate alternative methods for query construction
References
A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods. ACM Trans. Softw. Eng. Methodol., 16(4):13, 2007.
G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng., 28(10):970-983, 2002.
S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
A. Marcus and J.I. Maletic. Recovering Documentation to Source Code Traceability Links using Latent Semantic Indexing. In ICSE ’03: Proceedings of the 25th International Conference on Software Engineering, 125-135, 2003.
G. Salton, A. Wong, and C.S. Yang. A Vector Space Model for Automatic Indexing. Commun. ACM, 18(11):613-620, 1975.
T. Hofmann. Probabilistic Latent Semantic Indexing. In SIGIR, 50-57, 1999.
A. Globerson and N. Tishby. Sufficient Dimensionality Reduction. Journal of Machine Learning Research, 3:1307-1331, 2003.
Thank You!