Authorship Attribution CS533 – Information Retrieval Systems Metin KOÇ Metin TEKKALMAZ Yiğithan DEDEOĞLU 7 April 2006
Transcript
Slide 1

Authorship Attribution
CS533 Information Retrieval Systems
Metin KOÇ, Metin TEKKALMAZ, Yiğithan DEDEOĞLU
7 April 2006

Slide 2: Outline

- Overview
  - What is Authorship Attribution?
  - Brief History
  - Where and How to Use It?
- Stylometry
  - Style Markers
- Classification Methods
  - Naïve Bayes
  - Support Vector Machine
  - k-Nearest Neighbor

Slide 3: What is Authorship Attribution?

- The task of determining who wrote a text when its authorship is unclear.
- Useful when two or more people claim to have written something, or when no one is willing (or able) to say that (s)he wrote the piece.
- In a typical scenario, a set of documents with known authorship is used for training; the problem is then to identify which of these authors wrote the unattributed documents.

Slide 4: A Brief History

- The advent of non-traditional authorship attribution techniques can be traced back to 1887, when Mendenhall first proposed counting features such as word length.
- His work was followed by Yule (1938) and Morton (1965), who used sentence lengths to judge authorship.

Slide 5: Where to Use It?

Authorship attribution can be used in a broad range of applications:
- To analyze anonymous or disputed documents/books, such as the plays of Shakespeare (shakespeareauthorship.com).
- Plagiarism detection: it can be used to establish whether claimed authorship is valid.

Slide 6: Where to Use It? (Cont'd)

- Criminal investigation: Ted Kaczynski was targeted as a primary suspect in the Unabomber case because authorship attribution methods determined that he could have written the Unabomber's manifesto.
- Forensic investigations: verifying the authorship of e-mails and newsgroup messages, or identifying the source of a piece of intelligence.
Slide 7: Motivation

- Many publications exist, but no detailed work has been done for Turkish literature.
- The idea originated from "Kayıp Yazarın İzi, Elias'ın Gizi" by S. Oğuzertem.
- Will our work support his idea?

Slide 8: How to Do It?

- When authors write, they use certain words unconsciously.
- Find some underlying "fingerprint" for an author's style.
- The fundamental assumption of authorship attribution is that each author has habits in wording that make their writing unique.

Slide 9: How to Do It? (Cont'd)

- It is well known that certain writers can be quickly identified by their writing style.
- Extract features from text that distinguish one author from another.
- Apply a statistical or machine learning technique to training data, showing examples and counterexamples of an author's work.

Slide 10: How to Do It? Problems

- Highly interdisciplinary area: expertise in linguistics, statistics, text authentication, literature...
- Too many style measures to apply?
- Should the statistical method be complicated or simple? Many methods exist in the literature as well.

Slide 11: How to Do It? (Cont'd)

- Determine style markers.
- Parse all of the documents and extract the features.
- Combine the results to obtain characteristics of the authors.
- Apply each of the statistical/machine learning approaches to assign a given document to the most likely author.

Slide 12: Stylometry

- The science of measuring literary style.
- What are the distinguishing styles?
- Study the rarest, most striking features of the writer?
- Study how writers use bread-and-butter words (e.g. "to", "with", etc. in English)?
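A minimal sketch of extracting a few simple style markers of this kind (token count, type count, type/token ratio, average sentence length). The whitespace/regex tokenization and punctuation handling are simplifying assumptions for illustration, not the project's actual preprocessing:

```python
import re

def style_markers(text):
    """Compute simple style markers: token count, type count,
    type/token ratio, and average sentence length (in tokens)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)  # unique (context-free) word forms
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }

# The example sentence used later in the deck: 7 tokens, 6 types.
print(style_markers("I cannot bear to see a bear"))
```

Lowercasing before building the type set is what makes the two occurrences of "bear" count as one type.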
Slide 13: Stylometry

- "People's unconscious use of everyday words comes out with a certain stamp." (David Holmes, stylometrist at the College of New Jersey)
- "Rare words are noticeable words, which someone else might pick up or echo unconsciously. It's much harder for someone to imitate my frequency pattern of 'but' and 'in'." (John Burrows, emeritus English professor at the University of Newcastle in Australia)

Slide 14: Style Markers in Our Study

- Frequency of most frequent words
- Token and type lengths
  - Token: all words; Type: unique words
  - For the sentence "I cannot bear to see a bear": 7 tokens, 6 (context-free) types
- Sentence lengths
- Syllable count in tokens
- Syllable count in types

Slide 15: Style Markers in General

Some commonly used style markers:
- Average sentence length
- Average syllables per word
- Average word length
- Distribution of parts of speech
- Function word usage
- The type-token ratio
- Word frequencies
- Vocabulary distributions

Slides 16-19: Test Set

(The test-set tables are not reproduced in this transcript.)

Slide 20: Classification Methods

How are the style markers used?
Several methods exist, such as:
- k-NN (k-Nearest Neighbor)
- Bayesian analysis
- SVM (Support Vector Machines)
- PCA (Principal Components Analysis)
- Markovian models
- Neural networks
- Decision trees

We are planning to use:
- Naïve Bayes
- SVM
- k-NN

Slide 21: Naïve Bayes Approach

- In general, each style marker is considered to be a feature or a feature set.
- Existing text whose author is known is used for training.
- Several choices are possible for estimating the distributions of the feature values in text with a known author, such as:
  - Maximum likelihood estimation
  - Bayesian density estimation
  - Expectation-Maximization, etc.

Slide 22: Naïve Bayes Approach

- The values of the features (x) for the unattributed text are computed.
- Since the probability densities are known for each author, Bayes' formula is used to find the author of the anonymous text:

  A* = argmax over A_i of P(A_i | x) = argmax over A_i of p(x | A_i) P(A_i)

Slide 23: An Oversimplified Sample Scenario

Assume that:
- There are texts from two authors (two classes).
- The only style marker is the number of words with 3 characters (one feature).
- The classifier is trained with the texts, and the pdf's (probability density functions) are obtained.

Slide 24: An Oversimplified Sample Scenario

- Assume that the unattributed text has 10 words with 3 characters.
- Check whether author 1 or author 2 has the higher probability of producing 10 words with 3 characters.
- The unattributed text is assigned to the author with the higher probability for 10 words with 3 characters.

Slide 25: Support Vector Machines (SVMs)

- Supervised learning method for classification and regression.
- Quite popular and successful in text categorization (Joachims et al.)
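The oversimplified Naïve Bayes scenario from slides 23-24 can be sketched as follows. The Gaussian form of the densities and the training counts are illustrative assumptions, not the project's actual estimates:

```python
from statistics import NormalDist, mean, stdev

# Number of 3-character words observed in training texts of each author
# (made-up counts, purely for illustration).
training = {
    "author1": [8, 10, 9, 11, 10, 9],
    "author2": [18, 20, 19, 21, 17, 20],
}

# Fit one Gaussian density p(x | A_i) per author (maximum likelihood fit
# of mean and standard deviation).
densities = {a: NormalDist(mean(xs), stdev(xs)) for a, xs in training.items()}
priors = {a: 1 / len(training) for a in training}  # uniform P(A_i)

def attribute(x):
    """Assign x to the author maximizing p(x | A_i) * P(A_i)."""
    return max(densities, key=lambda a: densities[a].pdf(x) * priors[a])

# An unattributed text with 10 three-character words.
print(attribute(10))  # -> author1
```

With uniform priors the decision reduces to comparing the two likelihoods, exactly the check described on slide 24.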
- Seeks a hyperplane separating two classes by:
  - Maximizing the margin
  - Minimizing the classification error
- The solution is obtained using quadratic optimization techniques.

Slides 26-32: Support Vector Machines (SVMs)

(Figures adapted from Andrew Moore's SVM slides, showing candidate separating lines between the +1 and -1 classes and the margin; not reproduced in this transcript.)

Slide 33: Support Vector Machines (SVMs)

- Support vectors define the hyperplane.
- The maximum-margin linear classifier is the simplest SVM.
- Support vectors lie on the margin and carry all the relevant information.

Slide 34: Support Vector Machines (SVMs)

(Figure not reproduced in this transcript.)

Slide 35: Support Vector Machines (SVMs)

How to find the hyperplane?
Slides 36-37: Support Vector Machines (SVMs)

- Move the training data into a higher dimension with kernel functions.
- The hyperplane may not be linear in the original space.

(The accompanying one-dimensional "x = 0" figures are not reproduced in this transcript.)

Slide 38: Support Vector Machines (SVMs)

Basis functions and common kernel functions (formulas rendered as images in the original):
- Polynomial
- Sigmoidal
- Radial basis

Slide 39: Multi-class SVM

- SVM only works for binary classification; how do we handle multi-class (N-class) cases?
- Create N SVMs:
  - SVM 1 learns Output == 1 vs. Output != 1
  - SVM 2 learns Output == 2 vs. Output != 2
  - ...
  - SVM N learns Output == N vs. Output != N
- When predicting the output, assign the label of the SVM that puts the input point into the furthest positive region.

Slide 40: SVM Issues

- Choice of kernel functions
- Computational complexity of the optimization problem

Slide 41: k-Nearest Neighbour Classification Method

- Key idea: keep all the training instances.
- Given a query example, take a vote among its k neighbours.
- Neighbours are determined by using a distance function.

Slide 42: k-Nearest Neighbour Classification Method

- Figures for k = 1 and k = 4 (adapted from Rong Jin's slides) are not reproduced; the probability interpretation estimates p(y|x) as the fraction of the k nearest neighbours having label y (formula rendered as an image in the original).

Slide 43: k-Nearest Neighbour Classification Method

- Advantages:
  - Training is really fast.
  - Can learn complex target functions.
- Disadvantages:
  - Slow at query time; efficient data structures are needed to speed up the query.

Slide 44: How to choose k?
Use validation with the leave-one-out method:

For k = 1, 2, ..., K:
  Err(k) = 0
  1. Select a training data point and hide its class label.
  2. Using the remaining data and the given k, predict the class label for the held-out point.
  3. Err(k) = Err(k) + 1 if the predicted label differs from the true label.
  Repeat the procedure until all training examples have been tested.
Choose the k whose Err(k) is minimal.

(Slides 45-47 repeat this procedure verbatim while stepping through an example: with k = 1, the first tested point is misclassified, giving Err(1) = 1.)
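The k-NN vote from slides 41-43 and this leave-one-out selection of k can be sketched together. The small 2-D dataset stands in for per-document style-marker vectors and is invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def choose_k(train, max_k):
    """Leave-one-out: for each k, hide each point in turn, predict it from
    the remaining points, and count errors; return the k minimizing Err(k)."""
    err = {}
    for k in range(1, max_k + 1):
        err[k] = sum(
            knn_predict(train[:i] + train[i + 1:], x, k) != y
            for i, (x, y) in enumerate(train)
        )
    return min(err, key=err.get), err

# Invented 2-D feature vectors for two authors, "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
best_k, err = choose_k(train, max_k=3)
```

On ties in Err(k) this returns the smallest such k; the slides' worked example instead lands on k = 2 because its error counts differ.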
Slides 48-49: How to choose k? (Cont'd)

(These slides repeat the leave-one-out procedure while completing the worked example.)

Final counts: Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen.

Slide 50: Future Work & Conclusion

- Preliminary feature distributions seem discriminative.
- We will apply the classification methods on the feature set.
- We will rank the features by success rate.
- We may come up with new style markers.

