A New Approach for Cross-Language Plagiarism Analysis
Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante
Universidade Federal do Rio Grande do Sul
CLEF 2010
Introduction
• Plagiarism is one of the most serious forms of academic misconduct
• It is defined as “the use of another person's written work without acknowledging the source”
• A study with over 80,000 students in the US and Canada found that many of them have already committed a plagiarism offense
– 36% of undergraduate students
– 24% of graduate students
• Several types
– Word-for-word, paraphrasing, text translation, etc.
Introduction
• Cross-language plagiarism is becoming more common
– Evolution of automatic translation systems
– Increasing availability of textual content in many different languages
• Common scenario
– A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors, and presents it as their own work
• It can also involve self-plagiarism
– Usually aims at increasing the number of publications
Introduction
• What is the task?
– Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents, even if the documents are written in different languages
• Known as external plagiarism analysis
Related Work
• Monolingual Plagiarism Analysis
– Fingerprints, fuzzy-fingerprints, ...
• Cross-Language Plagiarism Analysis
– Statistical bilingual dictionary + bilingual text alignment
– Use EuroWordNet to transform words into a language-independent representation
• PAN competition
– Enables different methods to be compared against each other
– It was held as an evaluation lab in conjunction with CLEF 2010
The Proposed Approach
[Figure: overview of the method. Suspicious and original documents go through (1) Language Normalization; the normalized original documents are indexed and, for each normalized suspicious document, (2) Retrieval returns candidate documents; a training corpus feeds (3) Feature Selection + Classifier Training, which produces a classification model; (4) Plagiarism Analysis yields a preliminary result; (5) Post-Processing yields the final result.]
(1) Language Normalization
• All documents are converted into a common language
• English was chosen
– More translation resources
– One of the easiest languages to translate into
• Used a language guesser and an automatic translation tool
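The normalization step can be sketched as follows. This is only an illustration under stated assumptions: `guess_language` and `translate` are hypothetical callables standing in for whatever language guesser and translation tool are available, and the toy versions below exist purely so the example runs.

```python
from typing import Callable, Dict


def normalize_corpus(
    documents: Dict[str, str],
    guess_language: Callable[[str], str],   # hypothetical: text -> language code
    translate: Callable[[str, str], str],   # hypothetical: (text, source code) -> English text
    target: str = "en",
) -> Dict[str, str]:
    """Return a copy of `documents` in which every text is in `target`."""
    normalized = {}
    for doc_id, text in documents.items():
        source = guess_language(text)
        # Documents already in the target language are left untouched.
        normalized[doc_id] = text if source == target else translate(text, source)
    return normalized


# Toy stand-ins, for illustration only (not real language tools).
def toy_guess(text: str) -> str:
    return "pt" if "plágio" in text else "en"


def toy_translate(text: str, source: str) -> str:
    return text.replace("plágio", "plagiarism")


corpus = {"d1": "plagiarism detection", "d2": "plágio"}
normalized_corpus = normalize_corpus(corpus, toy_guess, toy_translate)
```

Passing the tools in as parameters keeps the sketch independent of any particular translation product.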
(2) Retrieval of Candidate Documents
• Problem: It is not feasible to perform exhaustive comparisons
• Solution: Use passages from the suspicious document as a query to be sent to an IR system
• Note that documents are divided into subdocuments (paragraphs) in order to reduce the amount of text that must be analyzed
• At the end of this phase, we have a list of at most ten candidate subdocuments for each passage in the suspicious document
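A minimal sketch of this phase is shown below. It does not use the IR system from the experiments; a tiny in-memory TF-IDF ranking is swapped in so the example is self-contained, and the names `split_subdocuments`, `ToyIndex`, and `search` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def split_subdocuments(doc_id, text):
    """Split a document into paragraph-level subdocuments on blank lines."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [((doc_id, i), p) for i, p in enumerate(paragraphs)]


class ToyIndex:
    """Tiny in-memory TF-IDF index; a stand-in for a real IR system."""

    def __init__(self, subdocs):
        self.subdocs = dict(subdocs)
        self.tf = {key: Counter(tokenize(text)) for key, text in self.subdocs.items()}
        df = Counter(term for counts in self.tf.values() for term in counts)
        n = len(self.subdocs)
        # Smoothed IDF so that very common terms still contribute a little.
        self.idf = {term: math.log((n + 1) / (d + 1)) + 1.0 for term, d in df.items()}

    def search(self, passage, k=10):
        """Return up to k (subdoc_key, score) pairs, best first."""
        query = Counter(tokenize(passage))
        scored = []
        for key, counts in self.tf.items():
            score = sum(q * counts[t] * self.idf[t] ** 2
                        for t, q in query.items() if t in counts)
            if score > 0:
                scored.append((key, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


originals = {
    "src1": "The parliament adopted the resolution.\n\nFishing quotas were discussed at length.",
    "src2": "The weather was mild in Strasbourg.",
}
index = ToyIndex(sd for doc_id, text in originals.items()
                 for sd in split_subdocuments(doc_id, text))
hits = index.search("fishing quotas were discussed")
```

Here `hits` holds at most ten ranked subdocuments for the query passage, mirroring the cap described above.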
(3) Feature Selection and Classifier Training
• The goal is to build a classification model that can learn how to distinguish between a plagiarized and a non-plagiarized text passage
• Annotated synthetic examples used for training
• J48 classification algorithm
• Features
– The cosine similarity between the suspicious passage and the candidate subdocument
– The similarity score assigned by the IR system
– The position of the candidate subdocument in the rank generated
– The length (in characters) of the suspicious and the candidate subdocument
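The feature list above could be turned into one instance per (suspicious passage, candidate subdocument) pair roughly as follows. This is a sketch, not the original implementation: the cosine is computed here over raw term frequencies, which is an assumption, and `build_features` and its field names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_features(passage, candidate, ir_score, rank):
    """One training/test instance for a (suspicious passage, candidate) pair."""
    return {
        "cosine": cosine_similarity(term_vector(passage), term_vector(candidate)),
        "ir_score": ir_score,            # score assigned by the IR system
        "rank": rank,                    # position of the candidate in the ranking
        "passage_len": len(passage),     # length in characters
        "candidate_len": len(candidate),
    }


features = build_features("the cat sat", "the cat sat", ir_score=7.5, rank=1)
```

Instances of this shape, together with a plagiarized / non-plagiarized label, are what a decision-tree learner such as J48 would be trained on.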
(4) Plagiarism Analysis
• Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments
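[Figure: each passage of the suspicious document (Passage 1, Passage 2, ..., Passage 5, ...) is sent to the index for retrieval; the retrieved subdocuments (SubDoc 1, SubDoc 2, ..., SubDoc 10) are passed to the classifier, which outputs the class labels Plagiarized or Non-Plagiarized.]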
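The analysis loop can be sketched as below, assuming two hypothetical callables: `retrieve`, which returns the ranked candidates for a passage, and `classify`, which stands in for the trained model. Neither signature comes from the original work.

```python
def analyze_passages(passages, retrieve, classify):
    """Map each suspicious passage index to the candidates labeled as its source.

    retrieve(passage) -> ranked list of (candidate_id, candidate_text, ir_score)
    classify(passage, candidate_text, ir_score, rank) -> True if plagiarized
    """
    detections = {}
    for i, passage in enumerate(passages):
        ranked = retrieve(passage)[:10]  # at most ten candidates per passage
        sources = [
            cand_id
            for rank, (cand_id, cand_text, score) in enumerate(ranked, start=1)
            if classify(passage, cand_text, score, rank)
        ]
        if sources:
            detections[i] = sources
    return detections


# Toy stand-ins, for illustration only.
def toy_retrieve(passage):
    return [("src1#0", "fishing quotas were discussed", 4.2),
            ("src2#0", "the weather was mild", 0.3)]


def toy_classify(passage, candidate_text, ir_score, rank):
    return passage == candidate_text  # a real model would use the learned tree


result = analyze_passages(
    ["fishing quotas were discussed", "unrelated text"], toy_retrieve, toy_classify
)
```

A passage with no candidate labeled as plagiarized simply does not appear in the result.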
(5) Result Post-Processing
• Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score
• The granularity score is a measure that assesses whether the plagiarism method reports a plagiarized passage as a whole or as several small plagiarized passages
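Joining contiguous detections can be sketched as an interval merge. The exact joining rule is not stated here, so this is an assumption: detections are (start, end) character offsets in the suspicious document with an exclusive end, only the suspicious side is considered, and `max_gap` is a hypothetical tolerance parameter.

```python
def merge_contiguous(detections, max_gap=0):
    """Merge detections given as (start, end) character offsets, end exclusive.

    Two detections are joined when the second starts at most `max_gap`
    characters after the first ends.
    """
    merged = []
    for start, end in sorted(detections):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous detection instead of reporting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_contiguous([(0, 100), (100, 250), (400, 500)])
```

In this example the first two detections touch and become one, so a plagiarized passage spanning characters 0 to 250 is reported once rather than twice.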
Experiments
• Multilingual Test Collection
– ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French)
– An analogous monolingual corpus was also assembled
– Available at http://www.inf.ufrgs.br/~viviane/eclapa.html
• Terrier IR System (Porter Stemmer + Stop-Word Removal)
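• Weka (J48 classification algorithm)
• Google Translator (as language guesser)
• LEC Power Translator
• Evaluation Measures (PAN competition)
– Recall
$$R = \frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \frac{\#\,\text{detected chars of } s_i}{|s_i|}$$
– Precision
$$P = \frac{1}{|\mathcal{R}|} \sum_{i=1}^{|\mathcal{R}|} \frac{\#\,\text{plagiarized chars of } r_i}{|r_i|}$$
– Granularity
$$G = \frac{1}{|\mathcal{S}_{\mathcal{R}}|} \sum_{i=1}^{|\mathcal{S}_{\mathcal{R}}|} \#\,\text{detections of } s_i \text{ in } \mathcal{R}$$
– Overall Score
$$O = \frac{F}{\log_2(1 + G)}$$
where $\mathcal{S}$ is the set of plagiarized passages $s_i$, $\mathcal{R}$ is the set of detections $r_i$, $\mathcal{S}_{\mathcal{R}}$ is the subset of $\mathcal{S}$ detected at least once, and $F$ is the F-measure of $P$ and $R$.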
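The four measures can be sketched in code under simplifying assumptions: a single suspicious document, passages and detections given as (start, end) character offsets with an exclusive end, and only the suspicious side taken into account. The function names are hypothetical, and returning a granularity of 1 when nothing is detected is a choice made for this sketch.

```python
import math


def overlap_chars(span, others):
    """Number of characters of `span` covered by at least one span in `others`."""
    start, end = span
    covered = set()
    for o_start, o_end in others:
        covered.update(range(max(start, o_start), min(end, o_end)))
    return len(covered)


def recall(plagiarized, detected):
    return sum(overlap_chars(s, detected) / (s[1] - s[0])
               for s in plagiarized) / len(plagiarized)


def precision(plagiarized, detected):
    return sum(overlap_chars(r, plagiarized) / (r[1] - r[0])
               for r in detected) / len(detected)


def granularity(plagiarized, detected):
    counts = [sum(1 for r in detected if overlap_chars(s, [r]) > 0)
              for s in plagiarized]
    counts = [c for c in counts if c > 0]  # keep only passages detected at least once
    return sum(counts) / len(counts) if counts else 1.0


def overall_score(plagiarized, detected):
    r, p = recall(plagiarized, detected), precision(plagiarized, detected)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return f / math.log2(1 + granularity(plagiarized, detected))


truth = [(0, 100)]
split_score = overall_score(truth, [(0, 50), (50, 80)])   # one passage, two detections
merged_score = overall_score(truth, [(0, 80)])            # same characters, one detection
```

Both detection sets cover the same 80 characters, yet the split version has a granularity of 2 and therefore a lower overall score, which is the effect that joining contiguous detections is meant to avoid.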
Experiments - Results
• Monolingual vs. Multilingual
• Recall was the most affected measure
– Loss of information due to the translation process
• The multilingual run reached 86% of the overall score of the monolingual baseline
Measure        Monolingual  Multilingual  % of Monolingual
Recall         0.8648       0.6760        78.16%
Precision      0.5515       0.5118        92.80%
F-Measure      0.6735       0.5825        86.48%
Granularity    1.0000       1.0000        100%
Overall Score  0.6735       0.5825        86.48%
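As a quick check of how the reported figures relate, the F-measure can be recomputed as the harmonic mean of the precision and recall reported for each run, and the overall score as F divided by log2(1 + granularity). The helper names are hypothetical; because the inputs are already rounded to four decimals, the results agree with the reported values only to within rounding.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def overall(f, granularity):
    return f / math.log2(1 + granularity)


mono_f = f_measure(0.5515, 0.8648)
multi_f = f_measure(0.5118, 0.6760)

# With granularity 1.0, log2(1 + 1) = 1, so the overall score equals F.
mono_o = overall(mono_f, 1.0)
multi_o = overall(multi_f, 1.0)

ratio = multi_o / mono_o  # share of the monolingual score kept by the multilingual run
```

Since both runs have a granularity of exactly 1, the overall score coincides with the F-measure, and the ratio comes out at roughly 86%.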
Experiments - Results
• Detailed analysis
• The larger the passage the easier the detection
• Plagiarized passages detected
– Monolingual 90% vs. 77% Multilingual
Run       Monolingual             Multilingual
Size      Short  Medium  Large    Short  Medium  Large
Detected  435    1289    239      242    1190    239
Total     607    1323    239      607    1323    239
%         71     97      100      39     90      100
Summary
• We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA)
– Used a classification algorithm in order to decide whether a text passage is plagiarized or not
• We assembled an artificial cross-language plagiarism test collection to evaluate the method
– It is freely available
• The cross-language experiment achieved 86% of the performance of the monolingual baseline
Future Work
• Reduce the time spent during the analysis of each suspicious document
– Analyze each suspicious passage on a different computer?
• Test other features during the classifier training phase
• Evaluate the method while detecting plagiarism between documents written in unrelated languages
– English vs. Chinese/Japanese
– Many plagiarism cases happen between these pairs of languages
• Citation analysis