Date post: | 04-Apr-2018 |
Upload: | iqra-javed |
7/30/2019 text minning
1/32
Presented By:
Iqra Javed
BSE 2005-2009
Presentation Rundown
Introduction.
Proposed Approach.
Techniques and methodology.
Application Architecture.
System Design.
Conclusion and Results.
AUTOMATIC AUTHORSHIP
ATTRIBUTION
Automatic authorship attribution is the task of deciding which author, from a given list of authors, wrote a
given unattributed document.
A number of documents with known authorship serve as a training set.
Attributed:
Unattributed:
Stylometry
Proposed Approach
The proposed research of this thesis uses a machine
learning model (based on a nearest neighbor classifier) with
entropy weighting to identify the class of a test
document, and implements it using different architectures to gain
a better understanding of the system as well as more reliable results.
Data Mining:
Data mining refers to the mining and extraction of desired
knowledge from a large amount of data.
Text Mining:
The process of deriving high-quality information from text.
High-quality information is typically derived by
identifying patterns and trends through means such as
statistical pattern learning.
[Figure: text mining process. Labels: Sample documents; Transformed representation; Learning models; Domain-specific templates/models; Text document; Visualizations]
Text Mining Methods:
Information Retrieval (IR)
Information Extraction (IE)
Natural Language Processing (NLP)
Nearest Neighbor Classification (NN).
The k-NN method was first introduced in the early 1950s. Because the
method is computationally intensive, it remained
unpopular until the 1960s, when greater computing power became
available.
Nearest neighbor classifiers are based on learning by analogy,
that is, by comparing a given test tuple with the
training tuples that are similar to it.
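The idea of learning by analogy can be illustrated with a minimal sketch. It assumes Euclidean distance and a simple majority vote among the k closest training tuples; the function names, the toy vectors, and the labels `author_a` and `author_b` are hypothetical and are not taken from the presented system.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training_tuples, query, k=3):
    """Return the majority label among the k training tuples closest to query."""
    nearest = sorted(training_tuples, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: (feature vector, author label). Values are illustrative only.
training_tuples = [
    ([1.0, 0.0], "author_a"),
    ([0.9, 0.2], "author_a"),
    ([0.0, 1.0], "author_b"),
    ([0.1, 0.8], "author_b"),
]

predicted = knn_classify(training_tuples, [0.8, 0.1], k=3)
print(predicted)
```

With k = 3, two of the three closest tuples carry the label `author_a`, so that label wins the vote.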
Used Terminologies.
Entropy.
Entropy indicates how large the uncertainty in the information content
of a clustering result is with respect to the given classification.
Cosine Similarity.
Cosine similarity is a measure of similarity between two vectors of n dimensions, found by computing the cosine of the angle between
them.
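Both terms can be sketched in a few lines. The sketch assumes Shannon entropy (base 2) over the class labels found in one cluster, and the standard cosine formula; the function names and sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


mixed_cluster = entropy(["a", "a", "b", "b"])   # evenly mixed: maximum uncertainty
pure_cluster = entropy(["a", "a", "a"])         # single class: no uncertainty
parallel = cosine_similarity([1, 2], [2, 4])    # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])  # no shared terms
```

A cluster containing a single class has zero entropy, while an evenly mixed two-class cluster has an entropy of one bit.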
Object Oriented Architecture.
Object-oriented architecture focuses on the
relationships between classes that are combined into one large
binary executable.
In traditional object-oriented programming, once the classes are compiled, the
result is monolithic binary code. All the classes share the same
physical deployment unit (typically an EXE), process, address
space, security privileges, and so on.
If multiple developers work on the same code base, they have to share source files.
A change to one class requires redeployment of all the other classes, which makes
the application a burden to manage.
Component Based Architecture.
A component-oriented application comprises a collection of interacting
binary application modules, that is, its components and the calls that bind them.
The motivation for breaking down a monolithic application into multiple
binary components is analogous to that for placing the code for
different classes into different files.
A component-oriented application is easier to extend.
TECHNIQUES AND METHODOLOGY
The Automatic Authorship Attribution System
accepts plain text documents. The system works in
two phases, training and testing, and both
phases work in a similar way.
Application Architecture of System
Text Pre-Processing.
Text pre-processing is the process of transforming the unstructured document into a structured format.
Tokenization.
Stop Word Filtering.
Lemmatization.
Stemming.
Bag of Words.
Domain Dictionary Creation.
Text Pre-Processing.
Tokenization:
The process of splitting text into its constituent tokens is called tokenization.
Stop Word Filtering:
Filtering is used to remove stop words from the dictionary as well as from the documents.
Examples: the, are, but, for, in, etc.
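The two steps above can be sketched as follows. The sketch assumes that tokens are lowercase alphabetic runs and that the stop list contains only the five example words from the slide; a real stop list would be much longer, and the function names are hypothetical.

```python
import re

# Only the example stop words named on the slide; an assumption for illustration.
STOP_WORDS = {"the", "are", "but", "for", "in"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The apples are red, but small.")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```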
Text Pre-Processing.
Lemmatization.
Lemmatization methods try to map verb forms to the infinitive
and nouns to the singular form.
Stemming.
Word stemming is an important feature
supported by present-day indexing and search systems.
Stemming broadens results to include both
word roots and word derivations.
Porter's Stemming Algorithm.
The Porter Stemmer is a conflation stemmer developed by
Martin Porter at the University of Cambridge in 1980.
Porter stemming is a process for removing the commoner
morphological and inflexional endings from words in English.
It is based on 5 steps, with a stem length of 2 (in the proposed system).
Porter's Algorithm Steps.
Step #1: deals with plurals and past participles.
Ex. plastered -> plaster, motoring -> motor
Step #2: deals with pattern matching on some common suffixes.
Ex. happy -> happi, relational -> relate
Step #3: deals with special word endings.
Ex. triplicate -> triplic, hopeful -> hope
Step #4: checks the stripped word against more suffixes in case
the word is compounded.
Ex. revival -> reviv, allowance -> allow
Step #5: checks if the stripped word ends in a vowel and fixes it
appropriately.
Ex. probate -> probat, cease -> ceas, controll -> control
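As an illustration of Step #1 only, the following is a simplified sketch of the plural and past-participle rules. It assumes lowercase input and deliberately omits the clean-up rules that the full algorithm applies after removing "ed" or "ing" (for example restoring a final "e" or undoing a doubled consonant), so it is not a complete Porter stemmer. All function names are hypothetical.

```python
def _is_consonant(word, i):
    """Porter's definition: 'y' counts as a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def _measure(stem):
    """Number of vowel-to-consonant transitions (the 'm' of the algorithm)."""
    pattern = ["c" if _is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


def _contains_vowel(stem):
    return any(not _is_consonant(stem, i) for i in range(len(stem)))


def step1_simplified(word):
    """Simplified Step #1: plurals, then -eed / -ed / -ing endings."""
    # Plurals
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("ss"):
        pass
    elif word.endswith("s"):
        word = word[:-1]
    # Past participles and -ing forms
    if word.endswith("eed"):
        if _measure(word[:-3]) > 0:
            word = word[:-1]
    elif word.endswith("ed") and _contains_vowel(word[:-2]):
        word = word[:-2]
    elif word.endswith("ing") and _contains_vowel(word[:-3]):
        word = word[:-3]
    return word


for example in ["plastered", "motoring", "caresses", "ponies", "sing"]:
    print(example, "->", step1_simplified(example))
```

The vowel check is what keeps short words such as "sing" intact: removing "ing" would leave a stem without any vowel.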
BOW and Dictionary Creation
Bag-of-Words (BOW):
The BOW contains all the words in the
document irrespective of their order, since word order
makes no difference.
Example:
There are two red apples on a red table.
BOW = {there, two, red, apples, red, table}
Domain Dictionary Creation:
It is based on the collection of unique words from all
documents.
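A minimal sketch of both structures, using the example sentence above after stop word removal. It assumes the bag is stored as word counts and the dictionary as a sorted list of unique words; the second document and the function names are hypothetical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Word counts for one document; word order is discarded."""
    return Counter(tokens)


def build_dictionary(documents):
    """Sorted list of the unique words found across all documents."""
    vocabulary = set()
    for tokens in documents:
        vocabulary.update(tokens)
    return sorted(vocabulary)


doc1 = ["there", "two", "red", "apples", "red", "table"]
doc2 = ["two", "green", "apples"]  # hypothetical second document

bow1 = bag_of_words(doc1)
dictionary = build_dictionary([doc1, doc2])
print(bow1)
print(dictionary)
```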
Vector Space Model (VSM).
It represents documents as vectors in n dimensions so that
vector operations can be performed on them.
One of the main tasks of vector space representation is to find
an appropriate encoding of the feature vector, where each element of
the vector represents a word of the document collection.
Multi-term encoding is used in the system to capture both the
presence and the frequency of each word in a document.
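One way to picture the encoding is sketched below. It assumes that each vector position corresponds to one dictionary word, and it produces both a presence vector (0/1) and a frequency vector (counts); reading "multi-term encoding" this way is an assumption, and the names are hypothetical.

```python
from collections import Counter


def to_vectors(tokens, dictionary):
    """Encode one document as (presence vector, frequency vector) over the dictionary."""
    counts = Counter(tokens)
    frequency = [counts[term] for term in dictionary]
    presence = [1 if n > 0 else 0 for n in frequency]
    return presence, frequency


dictionary = ["apples", "green", "red", "table", "there", "two"]
document = ["there", "two", "red", "apples", "red", "table"]

presence, frequency = to_vectors(document, dictionary)
print(presence)
print(frequency)
```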
Centroid-Based Classifier.
1. Input: a new document d = (w1, w2, ..., wn) for the training of the system.
2. Predefined categories: C = {c1, c2, ..., cl}, based on training.
3. Compute the centroid vector of each category on the basis of the vector magnitudes of
the documents.
4. Similarity model: cosine function.
5. Compute the similarity, and use the nearest matching class cosine value to declare the class of the testing document.
6. Output: assign to document d the category cmax.
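Similarity between two documents:
$$\mathrm{Simil}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|_2 \, \|d_j\|_2} = \frac{\sum_l w_{il} w_{jl}}{\sqrt{\sum_l w_{il}^2} \, \sqrt{\sum_l w_{jl}^2}}$$
Similarity between a category centroid and a document:
$$\mathrm{Simil}(c_i, d) = \cos(c_i, d)$$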
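The classifier steps can be sketched as follows. The sketch assumes that each category centroid is the component-wise mean of its training vectors, which is one common choice and may differ from the magnitude-based computation used in the presented system. The function names, labels, and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors (assumed centroid definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(labelled_vectors):
    """Map each category to the centroid of its training vectors."""
    return {label: centroid(vectors) for label, vectors in labelled_vectors.items()}


def classify(centroids, d):
    """Return the category whose centroid has the highest cosine with d."""
    return max(centroids, key=lambda label: cosine(centroids[label], d))


# Toy term-frequency vectors for two authors; values are illustrative only.
training = {
    "author_a": [[2, 0, 1], [3, 1, 0]],
    "author_b": [[0, 2, 3], [1, 3, 2]],
}
centroids = train(training)
print(classify(centroids, [4, 1, 1]))
print(classify(centroids, [0, 1, 1]))
```

Because cosine similarity depends only on the angle between vectors, a long and a short document with the same word proportions are treated alike.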
Used Terms:
Euclidean Distance:
The Euclidean distance of each point from the origin of the term space is measured in order to compute the
vector magnitude, while ignoring the zero terms.
$|D_i| = \sqrt{x_i^2 + y_i^2}$
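Dot-Product:
Calculate the dot product for all the documents, while ignoring
the zero values to lessen the computational burden. The dot product of a query vector Q and a document vector Di is the sum of the products of their corresponding term weights:
$Q \cdot D_i = \sum_k q_k \, d_{ik}$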
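Ignoring zero terms is straightforward when vectors are stored sparsely. The sketch below assumes each vector is a mapping from term to non-zero weight, so absent terms are never visited; the names and weights are hypothetical.

```python
import math


def magnitude(vector):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(weight * weight for weight in vector.values()))


def dot_product(q, d):
    """Dot product of two sparse vectors; only shared terms contribute."""
    if len(q) > len(d):
        q, d = d, q  # iterate over the smaller vector
    return sum(weight * d[term] for term, weight in q.items() if term in d)


document = {"red": 3.0, "apple": 4.0}
query = {"red": 1.0, "table": 2.0}

print(magnitude(document))
print(dot_product(query, document))
```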
SYSTEM DESIGN
System Requirement:
Hardware Requirement:
A Pentium 4 computer with 1 GB of RAM is required.
Software Requirement:
Microsoft Windows XP/2000 or above, with Microsoft Office Access 2007 and .Net Framework 2008.
Workflow of Automatic Authorship System:
Results and Discussions
User Interface:
Application User Interface
Graphic Analysis of Application
Help And User Manual
Application Using Object Oriented
Architecture.
Consists of creating classes within the application and
calling their functions from the user interface event handlers in order to
perform the functionality of the authorship attribution system.
Handling changing requirements is not an easy task
with this architectural approach.
Application Using Component Based
Architecture.
Consists of creating two base classes, one for data
preprocessing and the other for classification.
By creating objects of the base classes in a child class and calling
their respective functions, the system performs its
functionality.
Changing requirements can be handled easily with
this architectural approach.
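The separation described above might look roughly like the sketch below. Everything in it is hypothetical: the class names, the five-word stop list, and the word-overlap classifier are stand-ins chosen for brevity and do not reproduce the presented application. The point is only that each component can be replaced without touching the other.

```python
import re


class Preprocessor:
    """Component 1: turns raw text into a set of filtered tokens."""

    STOP_WORDS = {"the", "are", "but", "for", "in"}

    def process(self, text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return {token for token in tokens if token not in self.STOP_WORDS}


class Classifier:
    """Component 2: stand-in classifier based on word overlap with author profiles."""

    def __init__(self):
        self.profiles = {}

    def train(self, author, tokens):
        self.profiles.setdefault(author, set()).update(tokens)

    def classify(self, tokens):
        return max(self.profiles, key=lambda author: len(self.profiles[author] & tokens))


class AttributionSystem:
    """Holds one object of each component and calls their functions in turn."""

    def __init__(self, preprocessor, classifier):
        self.preprocessor = preprocessor
        self.classifier = classifier

    def train(self, author, text):
        self.classifier.train(author, self.preprocessor.process(text))

    def attribute(self, text):
        return self.classifier.classify(self.preprocessor.process(text))


system = AttributionSystem(Preprocessor(), Classifier())
system.train("author_a", "Apples are red")
system.train("author_b", "The sky is blue")
result = system.attribute("red apples")
print(result)
```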
Conclusion
The execution time of the component-based architecture is somewhat less than that of the object-oriented architecture,
which makes the component-based architecture preferable in terms of execution time.
The architectural approach has some effect on the application
and its results: based on the experimental results of the application, the maintenance, execution and test results of the component-based approach are comparatively better than those of the object-oriented approach.