Detection of Plagiarism In University Projects Using Metrics-Based Similarity
Dagstuhl Seminar on Duplication, Redundancy, and Similarity in Software
July, 2006
Ettore Merlo, Ecole Polytechnique de Montréal
Ettore Merlo (Ecole Polytechnique de Montréal), (C) Copyright, July 2006
Context
• Detect plagiarism in first-year programming projects at university
– Programming skills have to be developed during courses
Plagiarism Detection
• Comparison of sets of syntactic blocks
• Spectral analysis of similarity
– Increasing thresholds
– Spectral shape parameters are computed
• Projects are ranked by similarity spectrum
• The most similar projects are considered as candidates for plagiarism
Plagiarism Problem
• Detect code transformations that require little programming effort and make apparent differences in source code
– Changed identifiers by editing operations
– Changed source code layout (comments, indentation, order of procedures, functions, and methods, file structure)
– Changed constants (initialization, loops)
Metrics-Based Similarity
• Definition
– Two code fragments are similar if their associated vectors of metrics satisfy some similarity criterion
Similarity Identification Process
[Figure: processing pipeline. Source code → Parsing and Analysis → Abstract Syntax Tree → Metrics Extraction → Metrics → Clones Extraction → Clones. The metrics stage yields one vector per fragment: F1: (m11, m12, ..., m1k), ..., Fj: (mj1, mj2, ..., mjk).]
Metrics Extraction
• Metrics for similarity detection
– Volume
– Complexity
– Module/function interface
– Call graph structure
– Local memory
– Global memory
– Dataflow
Metrics Matching
• similar(fI, fJ) = | mk(fI) – mk(fJ) | <= thk
– for all k within the size of the metrics vector
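The matching criterion above can be sketched in a few lines of Python. This is an illustration only: the function signature, the sample vectors, and the choice of three metrics (CALLS, LOCALS, STMNT) are assumptions made for the example, not the tool used in the talk.

```python
from typing import Sequence


def similar(metrics_i: Sequence[float],
            metrics_j: Sequence[float],
            thresholds: Sequence[float]) -> bool:
    """Return True when every metric differs by at most its own threshold."""
    if not (len(metrics_i) == len(metrics_j) == len(thresholds)):
        raise ValueError("metric vectors and thresholds must have the same length")
    return all(abs(a - b) <= th
               for a, b, th in zip(metrics_i, metrics_j, thresholds))


# Hypothetical metric vectors, in the assumed order (CALLS, LOCALS, STMNT).
f1 = (2, 3, 10)
f2 = (3, 3, 12)   # each metric is within its threshold of f1
f3 = (2, 6, 10)   # LOCALS differs from f1 by 3, above its threshold of 1
thresholds = (1, 1, 3)

print(similar(f1, f2, thresholds))
print(similar(f1, f3, thresholds))
```

Comparing every pair of fragments with this predicate is the quadratic exact approach; the quantization slides describe a cheaper alternative.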
Metrics Matching Complexity
• n = | fragments_set |
• Exact solution algorithms show a worst-case O(n^2) complexity in general
• Linear complexity exact solutions exist for specific sub-problems
• Opportunistic strategies and heuristics may reduce the average-case complexity
• Approximate solutions may reduce the worst-case complexity
Threshold-Based Quantization
Threshold-Based Quantization (2)
• Clusters represent the following hyper-parallelepiped:
• Clusters represent a partition of all fragments
• Complexity is O(M·n) where:
– M is the cardinality of metrics
– n is the total number of fragments
– often M << n
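A minimal sketch of how threshold-based quantization can reach O(M·n) follows. The exact cell boundaries are not given in the text, so this example assumes that the k-th coordinate of a cell is floor(m_k / th_k). The function name, fragment names, and metric values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def quantize(fragments: Dict[str, Sequence[int]],
             thresholds: Sequence[int]) -> Dict[Tuple[int, ...], List[str]]:
    """Group fragments by the grid cell that contains their metric vector.

    ASSUMPTION: cells are aligned on multiples of each threshold, so the
    cell coordinate for metric k is floor(m_k / th_k).
    """
    clusters: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for name, metrics in fragments.items():
        # M integer divisions per fragment, one hash insertion per fragment.
        key = tuple(m // th for m, th in zip(metrics, thresholds))
        clusters[key].append(name)
    return dict(clusters)


# Hypothetical fragments with metrics in the assumed order (CALLS, LOCALS, STMNT).
fragments = {
    "a.f": (2, 3, 10),
    "b.g": (2, 3, 11),
    "c.h": (5, 0, 30),
}
clusters = quantize(fragments, thresholds=(1, 1, 3))
for cell, members in clusters.items():
    print(cell, members)
```

Each fragment lands in exactly one cell, so the clusters form a partition, and a single pass over n fragments with M metrics each gives the stated cost (assuming constant-time hashing of the cell key).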
Quantization Error
• Fragments in neighboring clusters may be closer than (thi / 2) and still be in different clusters
• Errors for threshold level (thi) disappear for threshold levels (k·thi), (k > 1)
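The effect can be seen on a single metric. The sketch below assumes floor-based binning (the cell boundaries are not specified in the text) and uses made-up values; it shows one pair that is split at threshold th and merged at 2·th, and does not claim that every pair behaves this way for every multiple.

```python
def cell(value: int, threshold: int) -> int:
    """Index of the quantization interval (floor-based binning is an assumption)."""
    return value // threshold


th = 4
a, b = 3, 4                      # hypothetical metric values, 1 apart

closer_than_half = abs(a - b) < th / 2
split_at_th = cell(a, th) != cell(b, th)              # cells 0 and 1
merged_at_2th = cell(a, 2 * th) == cell(b, 2 * th)    # both in cell 0

print(closer_than_half, split_at_th, merged_at_2th)
```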
Project Comparison
• Compute structural similarity spectrum
– Compute similarity for increasing threshold levels in s steps
• Quantize projects for the current threshold level
• Traverse current clusters to check for commonality in compared project
• Count common structurally-similar fragments under current threshold level
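The steps above can be sketched as follows. Several details are assumptions made for illustration: threshold level i uses i times the base thresholds, cells use floor-based binning, and "common" counts the fragments of the first project whose cell is also occupied by the second project. The function names and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[int]


def cell_of(vector: Vector, thresholds: Sequence[int]) -> Tuple[int, ...]:
    """Quantize one metric vector (floor-based binning is an assumption)."""
    return tuple(m // th for m, th in zip(vector, thresholds))


def similarity_spectrum(project1: List[Vector],
                        project2: List[Vector],
                        base_thresholds: Sequence[int],
                        steps: int) -> List[int]:
    """Count common fragments at each of `steps` increasing threshold levels."""
    spectrum = []
    for level in range(1, steps + 1):
        # ASSUMPTION: level i scales every base threshold by i.
        thresholds = [level * th for th in base_thresholds]
        occupied = {cell_of(v, thresholds) for v in project2}
        common = sum(1 for v in project1 if cell_of(v, thresholds) in occupied)
        spectrum.append(common)
    return spectrum


# Hypothetical projects, one metric vector per function.
p1 = [(2, 3, 10), (5, 0, 30)]
p2 = [(2, 3, 11), (5, 1, 31)]
spectrum = similarity_spectrum(p1, p2, base_thresholds=(1, 1, 3), steps=3)
print(spectrum)  # [1, 2, 2]
```

Each level quantizes both projects once and performs one set lookup per fragment, which is consistent with a cost that grows linearly in the number of steps, the number of metrics, and the combined project size.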
Project Comparison (2)
• Complexity: O(s·M·(n1 + n2))
– n1, n2: size of projects
– M: cardinality of metrics
– s: threshold steps
• Rationale:
– Plagiarism is hard to hide deeply when little programming effort is invested
– Surface differences are quickly ignored by thresholds of increasing levels
Project Comparison (3)
• Typical spectrum
Parameters
• Granularity: functions and methods
• Steps: 5
• Metrics and thresholds:
– CALLS: 1
– LOCALS: 1
– NONLCALS: 1
– PARNUM: 1
– STMNT: 3
– NBRANCHES: 1
– NLOOPS: 1
Plagiarism problem
• Projects are composed of a variable number of fragments
– Problem similar to class comparison or to software evolution analysis
• Identify projects with high spectral similarity
– p = number of projects
– Galaxy approach
• O(p)
– Pair comparison
• O(p^2)
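The pair-comparison variant can be sketched as a ranking over all unordered project pairs. The scoring function here is a placeholder: reducing a similarity spectrum to a single number is an assumption of this example, and the project names and cell sets are made up. The Galaxy approach is not reproduced because its algorithm is not given in the text.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

Cells = Set[Tuple[int, ...]]


def rank_pairs(projects: Dict[str, Cells],
               score: Callable[[Cells, Cells], float]) -> List[Tuple[float, str, str]]:
    """Score all p*(p-1)/2 unordered pairs and sort by decreasing similarity."""
    results = [(score(projects[a], projects[b]), a, b)
               for a, b in combinations(sorted(projects), 2)]
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def shared_cells(x: Cells, y: Cells) -> float:
    """Placeholder score: number of quantization cells occupied by both projects."""
    return float(len(x & y))


# Hypothetical projects, each already reduced to its set of occupied cells.
projects = {
    "project_a": {(2, 3, 3), (5, 0, 10)},
    "project_b": {(2, 3, 3), (5, 0, 10)},
    "project_c": {(9, 9, 9)},
}
ranking = rank_pairs(projects, shared_cells)
for value, first, second in ranking:
    print(first, second, value)
```

The top of the ranking is the list of candidate pairs to inspect by hand.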
Galaxy
• Algorithm:
Procedural Projects
OO Projects
Clone Visualization
• Visual display of differences between source code fragments
• DP-matching algorithm on tokens
Matching Algorithms
• Compute the sets of lexical changes
– Dynamic programming
– Sub-optimal and heuristic algorithms
Matching Example
int restore_stack ( object info ) {
int restore_list ( int index , object info ) {
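A token-level dynamic-programming alignment of the two signatures restore_stack and restore_list can be sketched as follows. This is a plain longest-common-subsequence alignment chosen for illustration; the actual DP-matching variant and cost model used for the visualization are not described in the text, and the function name `token_diff` is hypothetical.

```python
from typing import List, Tuple


def token_diff(a: List[str], b: List[str]) -> List[Tuple[str, str]]:
    """Align two token lists with an LCS dynamic program.

    Returns (op, token) pairs where op is '=' (kept), '-' (only in a),
    or '+' (only in b).
    """
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal edit script.
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", token) for token in a[i:])
    ops.extend(("+", token) for token in b[j:])
    return ops


old = "int restore_stack ( object info ) {".split()
new = "int restore_list ( int index , object info ) {".split()
ops = token_diff(old, new)
print(" ".join(op + token for op, token in ops))
```

On this pair the alignment keeps `int ( object info ) {`, removes `restore_stack`, and adds `restore_list int index ,`, which is the kind of lexical change set a clone viewer can highlight.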
Remarks
• Similarity contrast is very good for procedural code
• Distribution of similarity for OO code is less sharp
– Reference classes were given as a part of the projects
– Methods tend to be smaller
– More methods tend to be similar
– Class structure could be taken into consideration
– Inter-class relationship could be taken into account
Administrative Approach
• Identify most similar projects
• Do not make any hypothesis about the causes of similarity
• Shift the burden of explanation to the authors of a project
Conclusions
• A metrics-based plagiarism detection approach in an academic environment has been presented
• The presented approach has been successfully used to discourage plagiarism in course projects
Further Contacts
Ettore Merlo
Ecole Polytechnique de Montréal
tel: +1 (514) 340 4711 ext. 5758
fax: +1 (514) 340 [email protected]