A New Approach for Cross-Language Plagiarism Analysis

Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante

Universidade Federal do Rio Grande do Sul

CLEF 2010

Outline

• Introduction
• Related Work
• The Proposed Approach
• Experiments
• Summary and Future Work

Introduction

• Plagiarism is one of the most serious forms of academic misconduct

• It is defined as “the use of another person's written work without acknowledging the source”

• A study with over 80,000 students in the US and Canada found that many of them had already committed a plagiarism offense
– 36% of undergraduate students
– 24% of graduate students

• Several types
– Word-for-word, paraphrasing, text translation, etc.

Introduction

• Cross-language plagiarism is becoming more common
– Evolution of automatic translation systems
– Increasing availability of textual content in many different languages

• Common scenario
– A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors, and presents it as his or her own work

• It can also involve self-plagiarism
– Usually aims at increasing the number of publications

Introduction

• What is the task?
– Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents, even if the documents are written in different languages

• Known as external plagiarism analysis

Related Work

• Monolingual Plagiarism Analysis
– Fingerprints, fuzzy-fingerprints, ...

• Cross-Language Plagiarism Analysis
– Statistical bilingual dictionary + bilingual text alignment
– Use EuroWordNet to transform words into a language-independent representation

• PAN competition
– Enables different methods to be compared against each other
– It was held as an evaluation lab in conjunction with CLEF 2010

The Proposed Approach

[Figure: pipeline overview. Suspicious and original documents pass through (1) Language Normalization, yielding normalized suspicious and original documents. The normalized original documents are indexed; for each suspicious document, (2) Retrieval returns candidate documents. A training corpus feeds (3) Feature Selection + Classifier Training, which produces a classification model. (4) Plagiarism Analysis yields a preliminary result, and (5) Post-Processing yields the final result.]

(1) Language Normalization

• All documents are converted into a common language

• English was chosen
– More translation resources
– One of the easiest languages to translate into

• A language guesser and an automatic translation tool were used
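The normalization step can be sketched as follows. This is only an illustration: `guess_language` and `translate_to_english` are hypothetical callables standing in for the language guesser and the translation tool, whose actual interfaces are not shown in the slides.

```python
from typing import Callable, Dict


def normalize_documents(
    documents: Dict[str, str],
    guess_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return a copy of `documents` in which every text is in English.

    Both callables are assumptions for this sketch:
    - guess_language(text) returns a language code such as "en" or "pt".
    - translate_to_english(text, source_language) returns the translation.
    """
    normalized = {}
    for doc_id, text in documents.items():
        language = guess_language(text)
        if language == "en":
            # Already in the common language: keep the text untouched.
            normalized[doc_id] = text
        else:
            normalized[doc_id] = translate_to_english(text, language)
    return normalized
```

Passing the two tools in as parameters keeps the sketch independent of any particular translation service.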



(2) Retrieval of Candidate Documents

• Problem: It is not feasible to perform exhaustive comparisons

• Solution: Use passages from the suspicious document as a query to be sent to an IR system

• Note that documents are divided into subdocuments (paragraphs) in order to reduce the amount of text that must be analyzed

• At the end of this phase, we have a list of at most ten candidate subdocuments for each passage in the suspicious document
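A minimal sketch of this retrieval phase, under stated assumptions: subdocuments are taken to be blank-line-separated paragraphs, and a plain TF-IDF score stands in for the ranking of a real IR system. The function names are hypothetical, and the scoring is not claimed to match the system used in the experiments.

```python
import math
import re
from collections import Counter


def split_into_subdocuments(text):
    """Split a document into paragraph-level subdocuments (blank-line separated)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_candidates(passage, subdocuments, top_k=10):
    """Rank subdocuments against a query passage with a simple TF-IDF score.

    Returns at most `top_k` (index, score) pairs, best first. Subdocuments
    that share no term with the passage are left out.
    """
    doc_tokens = [Counter(tokenize(d)) for d in subdocuments]
    n_docs = len(subdocuments)

    # Document frequency: in how many subdocuments each term occurs.
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(tokens.keys())

    query_terms = set(tokenize(passage))
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        score = sum(
            tokens[t] * math.log(1 + n_docs / doc_freq[t])
            for t in query_terms
            if t in tokens
        )
        if score > 0:
            scored.append((idx, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The default `top_k=10` mirrors the limit of ten candidate subdocuments per passage stated above.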



(3) Feature Selection and Classifier Training

• The goal is to build a classification model that can learn how to distinguish between a plagiarized and a non-plagiarized text passage

• Annotated synthetic examples used for training

• J48 classification algorithm

• Features
– The cosine similarity between the suspicious passage and the candidate subdocument
– The similarity score assigned by the IR system
– The position of the candidate subdocument in the generated rank
– The length (in characters) of the suspicious passage and of the candidate subdocument
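The features listed above could be assembled roughly as follows. This is a sketch that assumes raw term-frequency vectors for the cosine similarity; the slides do not specify the term weighting, and the feature names are chosen here for illustration.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


def build_features(passage, candidate, ir_score, rank):
    """Collect the features for one (passage, candidate) pair.

    `ir_score` and `rank` come from the retrieval phase; the dictionary
    keys are hypothetical names, not taken from the original system.
    """
    return {
        "cosine": cosine_similarity(passage, candidate),
        "ir_score": ir_score,
        "rank": rank,
        "passage_length": len(passage),
        "candidate_length": len(candidate),
    }
```

Each annotated training example would then pair one such feature dictionary with a plagiarized or non-plagiarized label.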



(4) Plagiarism Analysis

• Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments

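[Figure: each passage of the suspicious document (Passage 1, Passage 2, ...) is sent as a query to the index; the retrieved candidate subdocuments (SubDoc 1, SubDoc 2, ..., SubDoc 10) are passed to the classifier, which assigns the class label Plagiarized or Non-Plagiarized.]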
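The analysis step amounts to running the trained model over the candidates of each passage. The sketch below assumes a hypothetical `classify` callable in place of the trained model; the threshold rule in `toy_classifier` is an invented stand-in for illustration only and is not the decision tree learned in the experiments.

```python
def analyze_passage(candidates, classify):
    """Return the candidates that the classifier labels as plagiarized.

    candidates: list of (subdocument_id, features) pairs for one suspicious
                passage, in rank order.
    classify:   hypothetical callable mapping a feature dict to True
                ("plagiarized") or False ("non-plagiarized").
    """
    return [sub_id for sub_id, features in candidates if classify(features)]


def toy_classifier(features):
    """Invented stand-in for the trained model: one cosine threshold."""
    return features["cosine"] >= 0.5
```

For example, `analyze_passage([("s1", {"cosine": 0.9}), ("s2", {"cosine": 0.1})], toy_classifier)` keeps only `"s1"`.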



(5) Result Post-Processing

• Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score

• The granularity score is a measure that assesses whether the plagiarism method reports a plagiarized passage as a whole or as several small plagiarized passages
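Joining contiguous detections can be sketched as an interval merge. The representation is an assumption: each detection is a `(source_id, start, end)` triple of character offsets in the suspicious document (end exclusive), and only detections pointing at the same source are joined. The `max_gap` parameter is hypothetical and defaults to strict adjacency.

```python
def merge_contiguous(detections, max_gap=0):
    """Merge detections that touch or overlap in the suspicious document.

    detections: iterable of (source_id, start, end), with end exclusive.
    max_gap:    largest number of characters allowed between two detections
                that are still joined (0 means they must touch or overlap).
    """
    merged = []
    # Sorting groups detections by source and orders them by start offset.
    for source_id, start, end in sorted(detections):
        if merged and merged[-1][0] == source_id and start <= merged[-1][2] + max_gap:
            previous = merged[-1]
            merged[-1] = (source_id, previous[1], max(previous[2], end))
        else:
            merged.append((source_id, start, end))
    return merged
```

Two adjacent detections such as `("d1", 0, 100)` and `("d1", 100, 250)` become the single detection `("d1", 0, 250)`, so the passage is reported once instead of twice.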

Outline

Introduction Related Work The Proposed Approach• Experiments• Summary and Future Work

Experiments

• Multilingual Test Collection
– ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French)
– An analogous monolingual corpus was also assembled
– Available at http://www.inf.ufrgs.br/~viviane/eclapa.html

• Terrier IR System (Porter Stemmer + Stop-Word Removal)

• Weka (J48 classification algorithm)

• Google Translator (as language guesser)

• LEC Power Translator

• Evaluation Measures (PAN competition)

Recall

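$R = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{\#\,\text{detected chars of } s_i}{|s_i|}$

Precision

$P = \frac{1}{|R|} \sum_{i=1}^{|R|} \frac{\#\,\text{plagiarized chars of } r_i}{|r_i|}$

Granularity

$G = \frac{1}{|S_R|} \sum_{i=1}^{|S_R|} \#\,\text{detections of } s_i \text{ in } R$

Overall Score

$O = \frac{F}{\log_2(1 + G)}$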
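The four measures can be computed as in the sketch below. It rests on assumptions not spelled out in the slides: plagiarized passages (the set S) and detections (the set R) are given as non-empty character spans in one suspicious document, S_R is the subset of S overlapped by at least one detection, and F is the harmonic mean of precision and recall.

```python
import math


def pan_scores(plagiarized, detections):
    """Compute recall, precision, granularity and overall score.

    plagiarized, detections: lists of non-empty (start, end) character
    spans, with end exclusive.
    """
    def chars(span):
        return set(range(span[0], span[1]))

    s_sets = [chars(s) for s in plagiarized]
    r_sets = [chars(r) for r in detections]
    all_s = set().union(*s_sets)
    all_r = set().union(*r_sets)

    # Recall: average fraction of each plagiarized passage that was detected.
    recall = sum(len(s & all_r) / len(s) for s in s_sets) / len(s_sets) if s_sets else 0.0
    # Precision: average fraction of each detection that is truly plagiarized.
    precision = sum(len(r & all_s) / len(r) for r in r_sets) / len(r_sets) if r_sets else 0.0

    # Granularity: average number of detections per detected passage.
    hits = [sum(1 for r in r_sets if s & r) for s in s_sets]
    hits = [h for h in hits if h > 0]
    granularity = sum(hits) / len(hits) if hits else 1.0

    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    overall = f_measure / math.log2(1 + granularity)
    return {
        "recall": recall,
        "precision": precision,
        "granularity": granularity,
        "f_measure": f_measure,
        "overall": overall,
    }
```

With one plagiarized passage `(0, 100)` reported as two detections `(0, 50)` and `(50, 100)`, recall and precision are both 1 but granularity is 2, so the overall score is divided by log2(3). Reporting the passage as a single detection `(0, 100)` brings granularity back to 1 and leaves the overall score equal to F.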

Experiments - Results

• Monolingual vs. Multilingual

• Recall was the most affected measure
– Loss of information due to the translation process

• The multilingual run reached 86% of the overall score of the monolingual baseline

| Measure | Monolingual | Multilingual | % of Monolingual |
|---|---|---|---|
| Recall | 0.8648 | 0.6760 | 78.16% |
| Precision | 0.5515 | 0.5118 | 92.80% |
| F-Measure | 0.6735 | 0.5825 | 86.48% |
| Granularity | 1.0000 | 1.0000 | 100% |
| Overall Score | 0.6735 | 0.5825 | 86.48% |
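The reported figures can be cross-checked with a few lines of arithmetic, assuming the F-measure is the harmonic mean of precision and recall. Because the inputs below are the rounded table values, the results agree with the table only up to rounding.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    return 2 * precision * recall / (precision + recall)


def overall_score(f, granularity):
    """Overall score: F divided by log2(1 + granularity)."""
    return f / math.log2(1 + granularity)


# Rounded values taken from the results table.
mono = overall_score(f_measure(0.5515, 0.8648), 1.0)
multi = overall_score(f_measure(0.5118, 0.6760), 1.0)

# With granularity 1, log2(2) = 1, so the overall score equals the F-measure.
ratio = multi / mono
print(f"monolingual={mono:.4f} multilingual={multi:.4f} ratio={ratio:.2%}")
```

The ratio comes out at roughly 86%, in line with the last column of the table.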

Experiments - Results

• Detailed analysis

• The larger the passage, the easier the detection

• Plagiarized passages detected
– Monolingual 90% vs. Multilingual 77%

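| | Monolingual Short | Monolingual Medium | Monolingual Large | Multilingual Short | Multilingual Medium | Multilingual Large |
|---|---|---|---|---|---|---|
| Detected | 435 | 1289 | 239 | 242 | 1190 | 239 |
| Total | 607 | 1323 | 239 | 607 | 1323 | 239 |
| % | 71 | 97 | 100 | 39 | 90 | 100 |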
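The aggregate detection rates quoted above follow from summing the per-length counts in the table. A short check, using only the table values (the variable names are chosen here for illustration):

```python
# Counts copied from the detailed-analysis table.
total = {"short": 607, "medium": 1323, "large": 239}
detected = {
    "monolingual": {"short": 435, "medium": 1289, "large": 239},
    "multilingual": {"short": 242, "medium": 1190, "large": 239},
}


def detection_rate(counts):
    """Fraction of all plagiarized passages that were detected."""
    return sum(counts.values()) / sum(total.values())


rates = {setting: detection_rate(counts) for setting, counts in detected.items()}
for setting, rate in rates.items():
    print(f"{setting}: {rate:.1%}")
```

Short passages account for most of the gap: the multilingual run misses 365 of 607 short passages, against 172 in the monolingual run.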

Summary

• We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA)
– Used a classification algorithm to decide whether a text passage is plagiarized or not

• We assembled an artificial cross-language plagiarism test collection to evaluate the method
– It is freely available

• Cross-language experiment achieved 86% of the performance of the monolingual baseline

Future Work

• Reduce the time spent analyzing each suspicious document
– Analyze each suspicious passage on a different computer?

• Test other features during the classifier training phase

• Evaluate the method when detecting plagiarism between documents written in unrelated languages
– English vs. Chinese/Japanese
– Many plagiarism cases happen between these pairs of languages

• Citation analysis

Questions?
