Date post: 19-Dec-2015
CodeSimian CS491B – Andrew Weng
Page 1: CodeSimian CS491B – Andrew Weng. Motivation Academic integrity is a universal issue Plagiarism is still common today Kaavya Viswanathan (Harvard Student)

CodeSimian CS491B – Andrew Weng

Motivation

• Academic integrity is a universal issue

• Plagiarism is still common today

  • Kaavya Viswanathan (Harvard student)

    • Book contains many plagiarized passages

  • Yoshihiko Wada (painter, Japan)

    • Artwork plagiarized from Alberto Sughi

  • Scott D. Miller (Wesley College President)

    • Plagiarized material found on his website

Is Plagiarism Harmful?

• Who does plagiarism really hurt?

  • The student

  • The class

  • The University

• Plagiarism is not only a question of protecting intellectual property rights

Plagiarism Detection

Benefits of Utilizing Plagiarism Detection

• Prevention

• Enforcement

• Objective standpoint

Platform Overview

• Developed on Visual Studio .NET 2005

• Coded in Microsoft Visual C# .NET

• Windows Forms application

• Simple and familiar GUI (Windows)

• Intended focus is ease of use

Theoretical Overview

CodeSimian is based on two primary principles

• Kolmogorov Complexity

• Information Distance

Kolmogorov Complexity

• Simple definition: the length of the shortest program that can be written for a universal Turing machine to produce a specified output

• Purely theoretical

• Impossible to calculate exactly

Kolmogorov Complexity

Define x to be a desired output string

K(x) = the length of the shortest program that produces x

K(x|y) = the length of the shortest program that produces x given y as an input

K(xy) = the length of the shortest program that produces x concatenated with y

Kolmogorov Complexity

Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1:

π = 3.1415926535897932384626433832795…

n = 0.5234958723957329875320935293853…

K(π) is a small and finite number, which represents the code required to generate the value of π to an infinite number of digits

Kolmogorov Complexity

π = 3.1415926535897932384626433832795…

K(π) is a small and finite number, which represents the code required to generate the value of π to an infinite number of digits

Perhaps something as simple as the implementation of Leibniz’s formula:

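$$\pi = 4\sum_{n=0}^{\infty}\frac{(-1)^n}{2n+1} = 4\left(\frac{1}{1}-\frac{1}{3}+\frac{1}{5}-\frac{1}{7}+\frac{1}{9}-\frac{1}{11}+\cdots\right)$$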
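To make the point concrete, here is a minimal sketch of Leibniz's series as a program. The function name `leibniz_pi` and the choice of Python are illustrative assumptions, not part of the original slides. The program stays the same few lines no matter how many digits are wanted, which is why K(π) is small.

```python
def leibniz_pi(terms: int) -> float:
    """Approximate pi with the first `terms` terms of Leibniz's series."""
    total = 0.0
    for n in range(terms):
        # n-th term of the series: (-1)^n / (2n + 1)
        total += (-1) ** n / (2 * n + 1)
    return 4 * total

print(leibniz_pi(100_000))
```

The series converges slowly (the error shrinks roughly like 1/terms), and floating point limits the digits actually obtained, but neither affects the length of the program itself.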

Kolmogorov Complexity

n = 0.5234958723957329875320935293853…

In order to generate the full output of a truly random number n, the program would have to be infinitely long.

The code would essentially be System.out.println("0.52349587…");
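The contrast between regular and random data can be observed with any general-purpose compressor, whose output size is a computable stand-in for K. The sketch below uses Python's `zlib` as that stand-in; the helper name `compressed_size` and the sample data are assumptions made for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data, a rough upper-bound stand-in for K."""
    return len(zlib.compress(data, 9))

rng = random.Random(0)                                     # fixed seed so the run is repeatable
patterned = b"0123456789" * 1000                           # 10,000 highly regular bytes
noisy = bytes(rng.randrange(256) for _ in range(10_000))   # 10,000 pseudo-random bytes

print(compressed_size(patterned), compressed_size(noisy))
```

Strictly speaking, the output of a seeded generator has low Kolmogorov complexity (the seed plus the generator is a short program). The compressor cannot see that structure, which illustrates why compression only approximates K.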

Kolmogorov Complexity

So how does this apply to plagiarism detection?

Define x = π and y = π/4

K(x|y) would be a very small value: given y, one can compute π with a single multiplication by 4.

Information Distance

The distance (or difference) between two objects

Formula used:

$$d(x,y) = 1 - \frac{K(x) - K(x|y)}{K(xy)}$$

Information Distance

• Similarity Factor

If we take the amount of information in x, subtract what is still needed to produce x once y is known, and normalize by the amount of information in x and y together, we obtain a percentage of similarity

$$s(x,y) = \frac{K(x) - K(x|y)}{K(xy)}$$
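The similarity factor and the information distance are complements: d(x,y) = 1 − s(x,y). A minimal sketch of the arithmetic follows; the function names and the three sample values are hypothetical and chosen only to show the calculation.

```python
def similarity(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy)"""
    return (k_x - k_x_given_y) / k_xy

def distance(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """d(x, y) = 1 - s(x, y)"""
    return 1.0 - similarity(k_x, k_x_given_y, k_xy)

# Hypothetical sizes: x needs 100 units alone, 10 units once y is known,
# and x concatenated with y needs 110 units.
print(similarity(100, 10, 110))
print(distance(100, 10, 110))
```

With these made-up numbers, knowing y removes 90 of the 100 units needed for x, so the pair scores 90/110 on similarity.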

Implementation

What does CodeSimian do to obtain the similarity factors?

1. Parse and Tokenize the code

2. Compress the tokenized strings

3. Compare the compressed strings

Parsing the Code

• Utilized ANTLR to parse and tokenize the code

• ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

Tokenizing the Code

• The tokenized output is a string of characters, each of which represents a token within the code

• For Example:

{ int c = 0; } contains 7 “letters”

Open Bracket

Integer type declaration

Variable name

Assignment operator

Integer Value

Statement end

Close Bracket
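The idea of one letter per token can be sketched with a small regular-expression tokenizer. This is not ANTLR and not CodeSimian's actual grammar: the token classes, the letters assigned to them, and the function name `tokenize` are all assumptions for illustration.

```python
import re

# Hypothetical token classes, each paired with the single letter that stands for it.
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|float|double|char|boolean)\b"),  # type declaration
    ("V", r"[A-Za-z_]\w*"),                                # variable name
    ("N", r"\d+"),                                         # integer value
    ("=", r"="),                                           # assignment operator
    (";", r";"),                                           # statement end
    ("{", r"\{"),                                          # open bracket
    ("}", r"\}"),                                          # close bracket
]
MASTER = re.compile("|".join(f"(?P<g{i}>{pat})" for i, (_, pat) in enumerate(TOKEN_SPEC)))

def tokenize(source: str) -> str:
    """Return one letter per recognized token; whitespace and unknown characters are skipped."""
    letters = []
    for match in MASTER.finditer(source):
        index = int(match.lastgroup[1:])   # group name "g3" -> 3
        letters.append(TOKEN_SPEC[index][0])
    return "".join(letters)

print(tokenize("{ int c = 0; }"))  # {TV=N;}
```

Because every variable name maps to the same letter, `{ int counter = 42; }` produces the same seven-letter string, which is what makes renaming variables ineffective as a disguise.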

Compressing the String

This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers

• As the string is read, a library of previously seen sequences is built up.

• When a repeat is detected, a pointer into the library is used to recreate the required section
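The slides do not say which Lempel-Ziv variant is used. As one possible reading, the sketch below follows the LZ78 scheme, where the library grows without limit and each output item is a pointer to a library entry plus one new symbol. The function name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(text: str) -> list:
    """Parse text into (library index, next symbol) pairs, LZ78 style."""
    library = {"": 0}   # phrase -> index; never pruned, so the library is unbounded
    output = []
    phrase = ""
    for symbol in text:
        if phrase + symbol in library:
            phrase += symbol                       # keep extending a phrase already seen
        else:
            output.append((library[phrase], symbol))
            library[phrase + symbol] = len(library)
            phrase = ""
    if phrase:
        # Flush a trailing phrase that is already in the library.
        output.append((library[phrase[:-1]], phrase[-1]))
    return output

print(lz78_phrases("ababab"))  # [(0, 'a'), (0, 'b'), (1, 'b'), (1, 'b')]
```

The number of pairs produced serves as a simple measure of compressed size: the more a string repeats itself, the fewer pairs are needed relative to its length.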

Compressing the String

• Normally, limits exist on the library size and on the length of the “words” stored

• Memory utilization and efficiency are not a concern here, so those limits are lifted

• Lempel-Ziv is suitable for this application

Comparing the Compressed String

• K(x) is approximated by the size of the tokenized and compressed code x.

• K(x|y) is approximated by the size of the tokenized and compressed code x, given y as a “free” library.

• K(xy) is approximated by the size of the tokenized and compressed concatenation of x and y.
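These three quantities can be approximated with an off-the-shelf compressor. The sketch below substitutes Python's `zlib` (DEFLATE with a 32 KB window) for CodeSimian's unbounded Lempel-Ziv compressor, and models the “free” library with zlib's preset dictionary (`zdict`). The function names and the randomly generated token strings are assumptions for illustration, not data from the project.

```python
import random
import zlib

def size(data: bytes) -> int:
    """Compressed size in bytes, used as a stand-in for K(data)."""
    return len(zlib.compress(data, 9))

def size_given(x: bytes, y: bytes) -> int:
    """Compressed size of x with y preloaded as a dictionary, a stand-in for K(x|y)."""
    compressor = zlib.compressobj(level=9, zdict=y)
    return len(compressor.compress(x) + compressor.flush())

def similarity(x: bytes, y: bytes) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy), with compressed sizes in place of K."""
    return (size(x) - size_given(x, y)) / size(x + y)

# Made-up token strings, one letter per token.
rng = random.Random(1)
letters = "{}TV=N;()+-*/<>!&|,."
original = "".join(rng.choice(letters) for _ in range(2000)).encode()
padded_copy = original[:1000] + b"V=N;" * 5 + original[1000:]   # same tokens plus "junk"
unrelated = "".join(rng.choice(letters) for _ in range(2000)).encode()

print(similarity(original, padded_copy))   # close to 1
print(similarity(original, unrelated))     # close to 0
```

Fixed header overhead in the compressed streams can push the score slightly below 0 for unrelated inputs, and inputs longer than the 32 KB window would need a compressor with larger buffers.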

Results

Using the test on trivial examples:

• LinkedList.java

• LinkedList2.java

• LinkedList3.java

• Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output.

• All files came out as >85% similar

Results

Using the test on a small real-world sample

Professor Kang’s CS201 HW1

• Relatively simple homework assignment

• 30-50% similarity average

• 95% similarity detected on one pair of submissions

• Confirmed by Professor Kang as correct

Results

Using the test on another small real-world sample

Professor Kang’s CS201 HW4

• More complex homework assignment involving 2-3 files; Java files broken down according to function

• Open question: might specialized function files present false positives?

• 30-70% similarity average

• 95+% similarity detected on pairs of submissions

• Confirmed by Professor Kang as correct

Results

• Things to note…

• The results showed a similarity of 80% on one pair of files, which the application deems significant but which is not necessarily conclusive

• Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes

Conclusions

• Successful test cases

• Simple and straightforward to use

• Based on an objective principle which works!

Future Work

• Enhancing the application to be able to compare internal “blocks” of code

• Improving the compression algorithm to better handle and adapt to “approximate matches”

• Improving the functionality with the GUI

• Adding the ability to print reports for whole directories
