The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student):
Detecting Arabic Text
May Ali AL-Farsi
Computer Science (BSc)
2010/2011
Summary
This report investigates machine learning algorithms that are used in text retrieval. It aims to
implement classification to categorize verses of the Quran and sayings in the Hadith. The data used
in this task had to be reformatted into corpus form before classification. The classification was
performed using two corpora: the first includes all the text of the holy Quran verses or of the
prophet's sayings, and the second includes only the text that holds the interesting verses or sayings.
Since this project classifies Arabic text, these two data sets were selected for their availability in
that language. Another reason is that this project uses a supervised method, which requires prior
knowledge of the data set in order to label it. Both data sets were pre-classified: the holy Quran in a
previous text classification implementation, and the Hadith by Islamic scholars in their books, such
as Sahih Muslim. The classification task was to divide the text into two classes, interesting and
non-interesting, based on a defined subject. The subject of hereafter concepts was selected to
classify verses of the Holy Quran or sayings of the prophet as interesting. Predefined features,
selected on the basis of Islamic scholars' research, were used to help in the classification tasks.
WEKA was used to implement the classification task after retrieving the required information.
Acknowledgements
In the beginning I would like to thank my supervisor, Eric Atwell, whose help and
supervision led me throughout the project. I would also like to thank Katja Markert for her
useful feedback on my mid-report.
Finally, I would like to thank all PhD students at the University of Leeds who supported me in any
respect during the completion of the project.
Contents
Chapter 1 ............................................................................................................................................. 1
Introduction ............................................................................................................................................. 1
1.1 Overview ....................................................................................................................................... 1
1.2 Aim ............................................................................................................................................... 2
1.3 Objectives ..................................................................................................................................... 2
1.4 Minimum Requirements ............................................................................................................... 3
Chapter 2 ................................................................................................................................................. 4
Background Research ............................................................................................................................. 4
2.1 Machine Learning ......................................................................................................................... 4
2.2 Natural Language Processing........................................................................................................ 4
2.3 Text Classification ........................................................................................................................ 5
2.3.1 Text Classification methods ................................................................................................... 6
2.3.2 Classification Examples ......................................................................................................... 8
2.4 WEKA......................................................................................................................................... 11
2.4.1 Graphical interface & Performance ..................................................................................... 11
2.4.2 Data format: ARFF file and Processing ............................................................................... 12
Chapter 3 ............................................................................................................................................... 13
Project Plan ........................................................................................................................................... 13
3.1 Procedures and Deliverables ....................................................................................................... 13
3.2 Schedule ...................................................................................................................................... 14
3.3 Methodology ............................................................................................................................... 14
Chapter 4 ............................................................................................................................................... 16
Design and Implementation .................................................................................................................. 16
4.1 Classification Overview .............................................................................................................. 16
4.2 Features ....................................................................................................................................... 16
4.3 Approaches ................................................................................................................................. 17
4.3.1 The holy Quran data set ....................................................................................................... 19
4.3.2 The hadith data set ............................................................................................................... 23
Chapter 5 ............................................................................................................................................... 27
Results and Evaluation .......................................................................................................................... 27
5.1 WEKA Results ............................................................................................................................ 27
5.1.1 The holy Quran data set results ............................................................................................ 27
5.1.2 The hadith data set results .................................................................................................... 32
5.2 Evaluation of the Model .............................................................................................................. 36
5.2.1 Evaluation on data sets ......................................................................................................... 36
5.2.2 Evaluation of feature ............................................................................................................ 36
5.3 Analysis....................................................................................................................................... 37
5.4 Future Work ................................................................................................ 40
Bibliography ......................................................................................................................................... 42
Appendix A ........................................................................................................................................... 44
Personal Reflection ............................................................................................................................... 44
Appendix B ........................................................................................................................................... 46
Materials used in the Project ................................................................................................................. 46
Chapter 1
Introduction
1.1 Overview
This project classified interesting and non-interesting text by designing a system that retrieves
information useful in the implementation of a classifier, and then attempts a number of approaches to
evaluate the classifier's accuracy. In order to retrieve the required information, features that
specify the interesting subject in the text were defined. According to ChengXiang Zhai [1], one way of
representing the data used in text mining is topic model labelling, implemented through supervised
methods for text classification. This project labels the data in several different ways, which will be
explained later in the design section.
The data sets selected for text classification in this project are the holy Quran and, as an
additional data set, the Hadith. Since this project classifies Arabic text, these two data sets were
available in that language. Another reason for selecting them is that this project uses a supervised
method, which requires prior knowledge of the data set: instances are first labelled with the classes
they belong to, and supervised algorithms are then used to train and test the classification. Both
data sets are pre-classified, the holy Quran in a previous text classification implementation and the
Hadith by Islamic scholars in their books, such as Sahih Bukhari and Sahih Muslim. Both data sets
hold the hereafter concepts, so this subject was selected to classify verses of the Holy Quran or
sayings of the prophet as interesting. The first implementation was on the holy Quran, and the
classification was tested on the English data set. The English version of the holy Quran was provided
by Claire Brierley, a researcher at the University of Leeds, while the Arabic version was downloaded
from the internet and structured for use in the implementation. The Arabic version of the subset was
created manually using the provided English subset and the downloaded full Arabic version of the holy
Quran. In addition, the Hadith was added as a new data set to implement the classifier and test
whether it behaved as expected. This data set was created manually from Sahih Muslim to include the
sayings of the prophet Muhammad (May peace be upon him) and was later used in the implementation.
Creating the Hadith data set was challenging, since it had never been created before and it was
essential to reformat it into a corpus that could be used in the implementation. The main challenge,
however, was improving the result of the classification when the data was skewed.
The outline of the report follows the procedures that were carried out to understand the problem
of text classification and to implement a solution. The following chapter is the background review,
which represents the first attempt to understand the problem from previous work and implementations.
The third chapter describes the procedures and deliverables, the schedule, and the methodology applied
in this project. The fourth chapter covers design and implementation: it includes an overview of the
classification task, a description of the features defined to aid the classification, and the
approaches that were considered together with the classification results from WEKA. The final chapter
is the evaluation, which includes an evaluation of the data sets and the selected features, an
analysis of the results, and future work. In addition, Appendix A contains a personal reflection, and
Appendix B describes the contents of the appendix CD.
1.2 Aim
The aim of the project is to classify the verses of the Arabic version of the Holy Quran into
two classes: interesting and non-interesting. This is implemented using supervised learning
algorithms, training sets, and the Quran corpus. The system builds classifications based on predefined
features in an open-source software tool called WEKA, which helps in analysing the performance of the
defined features.
1.3 Objectives
The main objectives of the project are to:
1. Provide a good background review of text classification and machine learning algorithms.
2. Build a system that classifies data sets into interesting and non-interesting types based on
features defined by the user to describe the interesting set. In this case the interesting set is
the verses of the holy Quran that relate to the hereafter concept.
3. Research how to define beneficial features in order to improve the text classification. In this
project the defined features must help in identifying the verses that concern the hereafter; this
may include signs of the hereafter, names of the hereafter, and the rewards in the hereafter.
4. Generate class labels for all verses in the holy Quran, ready for training and testing.
5. Try 10-fold cross-validation as a testing method.
6. Consider WEKA's classification options applied to the designed .arff file.
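Objective 5 calls for 10-fold cross-validation. As a rough sketch of the idea (illustrative only; the project itself would rely on WEKA's built-in cross-validation rather than code like this), the following Java fragment splits instance indices into ten disjoint folds:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of splitting instance indices into k disjoint
    folds for cross-validation. */
public class CrossValidation {

    /** Returns 'folds' disjoint index lists that together cover 0..n-1.
        Indices are dealt round-robin; a real split would shuffle first. */
    public static List<List<Integer>> foldIndices(int n, int folds) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < folds; f++) result.add(new ArrayList<>());
        for (int i = 0; i < n; i++) result.get(i % folds).add(i);
        return result;
    }

    public static void main(String[] args) {
        // Each fold serves once as the test set; the other nine train.
        List<List<Integer>> folds = foldIndices(100, 10);
        System.out.println(folds.size() + " folds, first fold size " + folds.get(0).size());
    }
}
```

Each fold in turn serves as the test set while the remaining nine are used for training, so every instance is tested exactly once.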
1.4 Minimum Requirements
The minimum requirements are:
1. Understand how categorizing should be done in this project and how to build an accurate
classifier.
2. Build a Java program that helps in retrieving the required information from the holy Quran and
generates an .arff file for use in classification tools.
3. Train and test the data set.
4. Implement the classifier and evaluate the results.
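The .arff file mentioned in requirement 2 is WEKA's plain-text data format: a header declaring attributes, followed by a @data section of comma-separated rows. The sketch below shows one possible way such a file could be generated; the relation name, attribute names and keyword-count features are illustrative assumptions, not the project's actual design.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of building WEKA .arff text for verse classification. */
public class ArffWriterSketch {

    public static String toArff(List<String> verses, List<String> labels,
                                List<String> keywords) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation quran_verses\n\n");
        // One numeric attribute per keyword, plus the class attribute.
        for (String kw : keywords)
            sb.append("@attribute ").append(kw).append("_count numeric\n");
        sb.append("@attribute class {interesting,non-interesting}\n\n@data\n");
        for (int i = 0; i < verses.size(); i++) {
            for (String kw : keywords)
                sb.append(countOccurrences(verses.get(i), kw)).append(",");
            sb.append(labels.get(i)).append("\n");
        }
        return sb.toString();
    }

    static int countOccurrences(String text, String word) {
        int count = 0, idx = 0;
        while ((idx = text.indexOf(word, idx)) != -1) { count++; idx += word.length(); }
        return count;
    }

    public static void main(String[] args) {
        System.out.print(toArff(Arrays.asList("paradise and paradise"),
                Arrays.asList("interesting"), Arrays.asList("paradise")));
    }
}
```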
Chapter 2
Background Research
2.1 Machine Learning
Machine learning is a scientific field within artificial intelligence. It enables computers to find
patterns in, classify, or cluster a given set of text. Classification is one of the important tasks
that can be accomplished in machine learning via natural language processing. Computers are trained to
classify instances from given attributes, and experts then evaluate the algorithm to check whether it
meets the requirements and to report the percentage accuracy of the outcome [2]. Natural language
processing tasks can be implemented using machine learning methods.
2.2 Natural Language Processing
Natural Language Processing (NLP) can be defined as a computational approach to analysing
text, based on a set of theories and technologies. The term NLP is normally used to describe the
function of software or hardware components in a computer system which analyse or synthesize spoken or
written language. The description "natural" is meant to distinguish human speech and writing from more
formal languages such as mathematical or logical notations, or computer languages such as Java and
C++ [3]. To achieve its tasks NLP uses linguistic tools that help in text mining and information
retrieval, and it includes many important techniques for applying and extracting knowledge
automatically from text. Documents are split into paragraphs, then into sentences, which are
eventually split into words. These words are tagged by part of speech, grammatical analysis and other
features before the sentence is parsed. Parsing therefore builds on sentence delimiters, tokenizers,
stemmers and part-of-speech taggers. In general, sentence boundaries can be detected accurately by
stating delimiters that rely on regular expressions or punctuation signs, which can disambiguate, for
example, the end of a sentence; this type of analysis can also rely on training corpora, or use richer
evidence such as part-of-speech frequencies. Furthermore, tokenization can be used to disambiguate
punctuation characters, since tokenizers are lexical analysers that divide a stream of characters into
meaningful units called tokens. Parsing cannot proceed in the absence of lexical analysis, so stemmers
are also required: stemmers are morphological analysers that identify alternative terms for a word
using its root form. Part-of-speech taggers build upon tokenizers and sentence delimiters to label
every word with its proper tag, such as noun, verb or adjective. Finally, parsing may be accomplished
by addressing a simplified variant of the full parsing problem, which helps in extracting interesting
parts of the text, or it can be done with respect to a grammar, which is a set of rules stating which
combinations of parts of speech generate well-formed phrase and term structure [3].
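As a minimal illustration of the sentence-delimiting and tokenizing steps described above, the following Java sketch uses regular expressions. A real pipeline would also have to disambiguate abbreviations, and Arabic text would need fully Unicode-aware handling; this sketch ignores both.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal regex-based sentence delimiting and tokenization. */
public class SimplePipeline {

    /** Split on sentence-final punctuation followed by whitespace. */
    public static List<String> sentences(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    /** Tokenize a sentence into lower-case word tokens. */
    public static List<String> tokens(String sentence) {
        return Arrays.asList(sentence.toLowerCase()
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(sentences("One sentence. Another one!"));
        System.out.println(tokens("Hello, world!"));
    }
}
```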
2.3 Text Classification
Text mining is the practice of analysing text and applying specific algorithms to it to detect
and extract useful content. It uses techniques primarily developed in the fields of information
retrieval, statistics, and machine learning, and is achieved in three main stages. The first is the
text preparation stage, in which the text is selected and managed, producing training sets created
manually by AI experts. The second is the text processing stage, which applies text mining algorithms
to the prepared data sets. At this stage a fully featured natural language processing system defines
rules, attributes and features of the provided data to be used in designated algorithms or techniques
such as decision trees [4]. NLP systems usually convert the input to an internal representation that
interacts with external language resources such as dictionaries, in order to produce a useful analysis
and annotation of the input text. This can be used in many types of application, such as automatic
question answering, text summarization, machine translation into another language, analysis of
customer preferences, and automated tagging of internet advertising [2]. The third is the text
analysis stage, which consists of the evaluation and demonstration of the assumptions made by the
experts at the beginning; the extracted text is passed to text tools to visualise and eventually
present a detailed analysis [4]. According to Khitam Jbara [5], automatic text classification is an
essential research subject in text classification research, owing to the large number of digital
documents in use. In addition, according to Al-Kabi et al. [6], Automatic Text Categorization (ATC)
refers to producing software that uses predefined categories to handle "unseen" text files.
Text classification has been the subject of much research in different languages around the
world. Projects on techniques to classify texts in languages such as English and various European and
Asian languages are widespread. This project is an example of one such effort: classifying Arabic
text. Automatic Arabic text classification is mostly implemented in the same sequence: first, compile
the text documents into a corpus, optionally labelling it; secondly, select the most suitable features
for classification; and thirdly, select the classification algorithms to be trained and tested [7].
2.3.1 Text Classification methods
Data are divided into two main types based on their attributes. First, data in which the class
attributes are known in advance and intended to be used are called labelled data. This type of data
uses supervised learning methods for data mining; if the attribute to predict is categorical the task
is classification, while predicting a numerical attribute is called regression. Secondly, data for
which no class attributes are available are called unlabelled data, from which the aim is to extract
information. This type of data uses unsupervised learning methods. In addition, semi-supervised
learning is another machine learning method used in text classification, and is a combination of the
two previously mentioned methods.
2.3.1.1 Unsupervised Learning
Unsupervised learning uses training samples that include a number of instances, but these
instances do not direct the results of the training [8]. Clustering is considered an alternative term
for unsupervised learning: it can be the objective in the implementation of some solutions to a
problem, and it can be hierarchical or model-based [9]. Since information extraction from unlabelled
data is done with unsupervised methods, it is important to understand that clustering groups objects
that are similar to each other into a cluster. One example is k-means clustering, considered one of
the simplest unsupervised learning algorithms. It identifies groups of similar instances in a data
set, without labelling or guidance, by defining points in a dimensional space, where the number of
attributes defines the space's dimensions [10]. Another example is hierarchical clustering, which can
be top-down or bottom-up. According to Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze [11], the bottom-up type is more commonly used in information retrieval, and it can be
visualised in a dendrogram. The top-down method, on the other hand, repeatedly splits a cluster into
points until each point stands as a cluster by itself.
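The k-means procedure described above, assigning each point to its nearest centroid and recomputing centroids as cluster means, can be sketched in one dimension as follows. The data and starting centroids are invented for illustration.

```java
import java.util.Arrays;

/** Minimal 1-D k-means sketch: assign each point to the nearest
    centroid, recompute centroids as cluster means, repeat. */
public class KMeans1D {

    public static double[] cluster(double[] points, double[] initial, int iters) {
        double[] centroids = initial.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                // Find the nearest centroid for this point.
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            // Move each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] c = cluster(new double[]{1, 2, 10, 11}, new double[]{0, 12}, 5);
        System.out.println(Arrays.toString(c));
    }
}
```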
2.3.1.2 Supervised Learning
Supervised learning uses a corpus to build and train a system based on specific features. A
system of this type determines in advance which text content should be considered and what types of
information to use from that content, and it determines the available methods of combining the
contextual evidence from the feature values in the training data [12]. There are many approaches to
supervised machine learning, such as decision trees, the Naïve Bayes classifier and the
nearest-neighbour algorithm [13]. According to [3], classification using supervised machine learning
methods is achieved if the following conditions are fulfilled:
1. Pre-classification of the data that will be analysed.
2. "In the simplest case, these classes should be disjoint."
3. If the data cannot be split into classes, then convert the classification into n corresponding
sub-problems, where each sub-problem classifies data into those that belong to the corresponding
category and those that do not {Yes, No}.
One example of a supervised classification method is the Naïve Bayes classifier, which uses
probability theory to reach the possible classifications. Bayes' theorem was constructed by Thomas
Bayes, the first mathematician to use probability in this way [13]. Naïve Bayes looks at the
distribution of terms, either with respect to their frequencies or with respect to their presence or
absence: it considers the probability of a certain feature given the data set and classifies
accordingly, by calculating a function over the frequencies of the terms' occurrences [3].
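The Naïve Bayes calculation described above, a class prior combined with per-class word probabilities, can be sketched as follows. This is a toy two-class version with Laplace smoothing, for illustration only; it is not the project's actual classifier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy two-class Naïve Bayes over word frequencies with Laplace smoothing. */
public class TinyNaiveBayes {
    private final List<Map<String, Integer>> wordCounts =
            Arrays.<Map<String, Integer>>asList(new HashMap<>(), new HashMap<>());
    private final int[] totalWords = new int[2];
    private final int[] docCounts = new int[2];
    private final Set<String> vocab = new HashSet<>();

    /** Count the words of one training document under its class label. */
    public void train(String doc, int label) {
        docCounts[label]++;
        for (String w : doc.toLowerCase().split("\\s+")) {
            wordCounts.get(label).merge(w, 1, Integer::sum);
            totalWords[label]++;
            vocab.add(w);
        }
    }

    /** Return the class (0 or 1) with the higher log-posterior. */
    public int classify(String doc) {
        double[] logProb = new double[2];
        int totalDocs = docCounts[0] + docCounts[1];
        for (int c = 0; c < 2; c++) {
            // Smoothed class prior plus smoothed per-word likelihoods.
            logProb[c] = Math.log((docCounts[c] + 1.0) / (totalDocs + 2.0));
            for (String w : doc.toLowerCase().split("\\s+"))
                logProb[c] += Math.log((wordCounts.get(c).getOrDefault(w, 0) + 1.0)
                        / (totalWords[c] + vocab.size()));
        }
        return logProb[1] > logProb[0] ? 1 : 0;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("paradise reward hereafter", 1);
        nb.train("trade market money", 0);
        System.out.println(nb.classify("reward paradise"));
    }
}
```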
Another way to apply supervised learning is to build a tree that contains features from the data
sets. The root of the tree is the categorisation used to differentiate the classes; the nodes are
decision points that test a feature, and a branch corresponds to a value of the test's result. If the
tree is built from a pre-classified training set, this forms an inductive learning technique.
According to [3], "the decision tree method works best with large data sets; training sets that are
too small will lead to overfitting. The data must be in a regular attribute-value format. Thus each
datum must be capable of being characterized in terms of a fixed set of attributes and their values,
whether symbolic, ordinal or continuous. Continuous values can be tested by thresholding." Assuming
that they are applicable, decision tree methods can have a number of advantages over more conventional
statistical methods [3]:
1. They make no assumptions about the distribution of attributes' values.
2. They do not assume conditional independence of attributes.
2.3.1.3 Semi-Supervised Learning
Another method of machine learning is semi-supervised learning. This method is essentially a
combination of supervised and unsupervised learning: the data set includes labelled data along with
unlabelled data. It is claimed that semi-supervised learning yields higher accuracies than the other
two methods. According to
Steven Abney [14], there are six different implementations of semi-supervised learning methods, which
are self-training, agreement-based methods (the co-training algorithm), clustering algorithms,
boundary-oriented methods, label propagation in graphs, and spectral methods. According to [15], the
self-training method has been used in text categorisation by training a classifier on labelled data
and then applying it to unlabelled data. Co-training is another way of implementation, accomplished by
splitting the features into subsets and using two classifiers. The two classifiers are trained on the
labelled data, each on one of the subsets, and then each classifier is used on the unlabelled data.
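The self-training loop can be loosely illustrated as follows. Here a seed keyword "classifier" (an invented stand-in for a real trained model) labels documents in the unlabelled pool, and the words of documents it labels positive are absorbed as new keywords for the next round; this is only a sketch of the idea, not a full self-training implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Loose illustration of one self-training round with a keyword
    "classifier" standing in for a real trained model. */
public class SelfTrainingSketch {

    public static Set<String> expandKeywords(Set<String> seed, List<String> pool) {
        Set<String> keywords = new HashSet<>(seed);
        for (String doc : pool) {
            String[] words = doc.toLowerCase().split("\\s+");
            // "Classify" the document: positive if any keyword appears.
            boolean positive = false;
            for (String w : words) if (keywords.contains(w)) positive = true;
            // Absorb the document's vocabulary if it was labelled positive.
            if (positive) keywords.addAll(Arrays.asList(words));
        }
        return keywords;
    }

    public static void main(String[] args) {
        Set<String> seed = new HashSet<>(Arrays.asList("paradise"));
        System.out.println(expandKeywords(seed,
                Arrays.asList("paradise reward", "market trade")));
    }
}
```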
2.3.2 Classification Examples
Islamic scholars have shown interest in classifying the verses of the Quran for a number of
reasons. For example, classification of verses helps in understanding the context of the chapters,
analysing the evolution and wisdom behind some Islamic rulings in the Quran, and understanding the
history of the Islamic nation and the biography of the Prophet Muhammad (May peace be upon him) [2].
In addition, according to Khitam Jbara [5], "Al hadith is the saying of Prophet Muhammad (May peace
be upon him) and the second religious source for all Muslims". It is important to be able to classify
the hadith into subjects that serve as classes in research. Two examples are discussed in this
section: Quranic classification and hadith classification. Both required a predefined corpus to train
and test the classifier. A corpus is a natural source of language information, since it illustrates
linguistic knowledge; corpora are used as data analysis tools to determine patterns and for other
language processing tasks, and a corpus may be created manually to serve the aims and objectives of a
study. Text mining procedures can be carried out on large data sets such as the Quran corpus, using
words and interesting features defined by the user to retrieve information. The use of a corpus helps
to reduce the time and effort of classifying manually, and most text classification examples use
corpus-based research. The following examples explain how each corpus was created and the methods and
algorithms that were implemented to achieve the classification.
Quranic classification
According to Al-Kabi, G. Kanaan et al. [6], the objective of the study was to classify the
verses (Ayat) of The Opening (Fatiha) and Yaseen (Ya-seen) chapters according to Islamic scholars,
using a linear classification function. This was accomplished by building a system intended to
categorize the different verses in each chapter. The system was designed in Microsoft Visual Basic
because it supports Unicode for the Arabic language. The implementation began with the selection of
the chapter and verse to classify; a list of words was generated as features, their occurrences
counted, and a check made of which subject they related to in order to create a class for that
subject. A corpus was created to build a list of subjects, generated manually for the selected
chapters. Secondly, the verses were normalised by removing punctuation and parsing the verses into
tokens. Categorisation into classes was made according to the subjects created by the system; a
function was then defined to represent the percentage match for a specified subject, and the
highest-scoring classifications were recorded. This system scored 91% accuracy in classifying the
different verses.
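The keyword-counting approach just described, scoring a verse against each subject's word list and keeping the highest-scoring subject, can be sketched as follows. The subjects and word lists here are invented placeholders, and ties are broken arbitrarily.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of subject scoring by keyword counting: the subject whose
    word list matches the most verse tokens wins. */
public class SubjectScorer {

    public static String classify(String verse, Map<String, List<String>> subjects) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, List<String>> e : subjects.entrySet()) {
            int score = 0;
            for (String token : verse.toLowerCase().split("\\s+"))
                if (e.getValue().contains(token)) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subjects = new HashMap<>();
        subjects.put("hereafter", Arrays.asList("paradise", "reward"));
        subjects.put("worship", Arrays.asList("prayer", "fasting"));
        System.out.println(classify("the reward of paradise", subjects));
    }
}
```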
In addition, research at the University of Leeds was conducted to classify the Quran verses
into Meccan and Madinan. In this research the classification was based on the migration of the
Prophet Muhammad (May peace be upon him). The algorithm relied on the pre-knowledge of Quran scholars,
who had already defined some of the chapters as Meccan or Madinan. The accuracy of the algorithm
depended on the feature set of keywords, the availability of a training set from the chapters that had
been classified earlier, and the use of the developed Quranic corpus. In contrast to the previous
study, this research defined 14 features obtained from scholars. These features were converted into
countable keywords and their frequencies obtained. After defining these features, the classification
was carried out using the open source software WEKA [2].
Hadith classification
Another example of text classification is hadith classification. According to Khitam Jbara
[5], the motivation for the study was the importance of the hadith, and of its correct classification, to
Islamic studies. The hadith text set was obtained from Sahih al-Bukhari, the hadith book used by
most research involving hadith. The research classified the hadith into the subjects that the prophet
had talked about at that time, resulting in thirteen classes. The system the scholars proposed included
four main procedures: processing; training, which included the features selected to help the
classification; classification, which included a method to classify the hadith; and finally the analysis
of the classification results.
The scholars had to create a hadith corpus to carry out their classification. To obtain the
training set, the part of each hadith that names the individuals who transmitted it from the prophet
was first removed. Secondly, the hadith were tokenised into words and punctuation marks were
removed. In addition, words such as pronouns and prepositions, together with names of people or
places, were removed from the hadith set, and what remained was treated as features. The last step
in constructing the corpus was stripping prefixes and suffixes from the features and eliminating
words that did not make sense after stemming. The scholars eventually had 19 features that were
weighted and used to classify the hadith. Supervised classification was used, as they created a
training text file from which to extract the features. Three classification methods were implemented:
AL-Kabi's method, Word Based Classification (WBC), and Stem Expansion Classification (SEC).
SEC achieved the best results of all the methods that were implemented.
Medical classification
According to Aphinyanaphongs and Aliferis [16], there were many publications on the
internet promoting "unproven treatments" and inaccurate medications, while cancer patients are in a
vulnerable condition. In order to overcome these problems, research was carried out to identify the
web pages that made such claims. One of the claims addressed was that people who are not real
physicians gave medical advice and treatments; moreover, some cancer patients had made online
purchases of abnormal medication. The set used to retrieve the information was gathered from
unproven treatments identified by the Quackwatch website. To create the corpus the scholars
selected eight unproven
treatments, to name a few: "Cure for all Cancers", "Metabolic Therapy", "Cellular Health", and
"Insulin Potentiation Therapy". The web pages were identified by appending the words cancer and
treatment and retrieving the top results from a Google query. The web pages were labelled as
positive or negative: pages that include unproven claims were labelled positive, the others negative.
Eventually the corpus included 191 web pages labelled positive and 98 labelled negative. The web
pages were converted to lists of words by removing script tags and
replacing punctuation with spaces to obtain words. The words were then stemmed, and words that
appeared in fewer than three web pages were removed. A number of algorithms were implemented
and compared with each other. One of the algorithms counted the frequencies of the terms and
applied a user-defined threshold.
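The frequency-with-threshold algorithm just mentioned can be sketched as a simple filter. This is an illustration only; the class and method names, the sample terms, and the threshold value are invented for the sketch and do not come from the paper:

```java
import java.util.*;

public class ThresholdFilter {
    // Count how often each term occurs, then keep only terms whose
    // frequency reaches a user-defined threshold.
    static Map<String, Integer> frequentTerms(List<String> terms, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : terms) counts.merge(t, 1, Integer::sum);
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cancer", "cure", "cancer", "cure", "cancer", "therapy");
        System.out.println(frequentTerms(terms, 2)); // prints {cancer=3, cure=2}
    }
}
```

A TreeMap is used here only so the printed output has a stable order.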
2.4 WEKA
WEKA was developed at the University of Waikato in New Zealand; the name stands for
Waikato Environment for Knowledge Analysis. It is open source software written in Java and
released under the terms of the GNU General Public License. WEKA runs on any platform or
operating system. It offers an easy interface to many different learning algorithms, along with
methods for pre- and post-processing and for evaluating the results of learning schemes on any given
dataset [17].
2.4.1 Graphical interface & Performance
WEKA has three graphical interfaces. The first, the Explorer, gives the user access to all of
WEKA's facilities through menu selection and form filling. The second is the Knowledge Flow
interface, which lets the user design configurations for streamed data processing: the user specifies a
data stream by connecting components representing data sources, pre-processing tools, learning
algorithms, evaluation methods, and visualisation modules. The third interface, the Experimenter, is
designed to help the user answer basic practical questions that arise when applying classification and
regression techniques.
Figure 1: WEKA Explorer
WEKA provides a variety of learning algorithms that the user can easily apply to a dataset.
It also includes a selection of tools for transforming datasets, such as algorithms for discretisation.
The user can apply any method to a dataset by loading the dataset into a learning algorithm and
analysing the results. WEKA comprises methods for the standard data mining problems, such as
regression, classification and clustering. One way of using WEKA is to apply a learning method to a
dataset and analyse its output to learn about the data. Another is to use models that have already
been learnt to produce predictions on new instances. A third is to apply several different learning
methods and compare their performance in order to choose one for prediction [17].
2.4.2 Data format: ARFF file and Processing
The dataset is loaded into the WEKA Explorer in a number of formats, such as spreadsheets
and database formats, but the built-in format is the ARFF file. This file has three parts:
1. Relation name: the first line in the file should be a relation name, starting with @relation,
and can be any meaningful name [2].
2. Attribute list: each attribute starts with the @attribute keyword, followed by a name defined
by the user and the type of the attribute [2].
3. Data set: finally, the WEKA ARFF file expects the data itself, introduced by @data. Each
line is a row of comma-separated attribute values [2].
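As an illustration, a minimal ARFF file in the spirit of this project might look like the following; the relation name, attribute names and data rows here are hypothetical, not taken from the project's actual file:

```
@relation quran-hereafter

@attribute id string
@attribute hereafter numeric
@attribute judgment numeric
@attribute class {Yes, No}

@data
"1|4",0,1,No
"2|16",0,0,No
```

Each @data row supplies one value per declared attribute, in declaration order, with the class label last.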
After the data is loaded into the Explorer, the attributes defined in the .arff file appear in the
Preprocess tab. In addition to the Preprocess tab, the Explorer provides a Classify tab in which the
user can build and train a classifier; many classification options are provided, along with a description
and the parameters that can be set. Clustering can be implemented using the Cluster tab. Both the
Classify and Cluster tabs provide options for testing and training, and the results are displayed in the
output panel. In addition, WEKA provides an attribute selection facility, accessed from the "Select
attributes" tab, which helps in choosing smaller sets of attributes that are most suitable for
classification and clustering. The last tab in the Explorer is the Visualize tab, which displays a 2D
distribution of the data [18].
Chapter 3
Project Plan
3.1 Procedures and Deliverables
This project was divided into a number of procedures to achieve the stated aim and
requirements. To keep track of progress, a blog [19] was created to record all the information taken
from the papers used in the background reading chapter, the steps carried out to define the features,
and the design and implementation of the classifier. Understanding the problem and the requirements
was the starting point, and a workable schedule was then planned. In the process of implementation
there were a number of Java programs and reports to deliver. The procedure is explained in the
following steps:
1. Understand the natural language processing algorithms that help in text classification, as
described in the background research, to gain a better understanding of the problem and the
possible ways of implementing solutions.
2. Classification implementation on English Quran based on predefined features that will
characterise the interesting verses.
3. Design a Java program that creates the complement set that is used as training data.
(Deliverables: Java program, complement set text file, WEKA arff file).
4. Use the deliverables from the designed Java code in WEKA to test the classifier.
5. Write up a mid-project report, including the introduction, background research, and the plan
schedule (Deliverables: mid-term report).
6. Classification implementation on Arabic Quran based on predefined features that will
characterise the interesting verses.
7. Design a Java program that creates the complement set and the random subsets that are used
as training data. (Deliverables: Java program, complement set text file, random subset,
WEKA arff file).
8. Use the deliverables from the designed Java code in WEKA to test the classifier.
9. A demonstration was organized and presented to the assessor and supervisor to show the
work that was accomplished.
10. Create Hadith corpus as additional data set.
11. Classification implementation on Arabic Hadith based on predefined features that will
characterise the interesting sayings.
12. Design a Java program that creates the complement set that is used as training data.
(Deliverables: Java program, complement set text file, random subset, WEKA arff file).
13. Use the deliverables from the designed Java code in WEKA to test the classifier.
14. Evaluation of the system implemented.
15. Write-up of the final report (Deliverables: The final report).
3.2 Schedule
At the beginning of the project, a suitable schedule was planned. The schedule was organised
to fit all requirements in the time provided. It was obvious that some adjustments were necessary even
though an effort was made to keep to the timescale. One reason for the adjustments was the inclusion
of a presentation that wasn't previously planned for. In addition, the background reading took more
time than expected, both to get hold of recent papers published after the year 2000 and to find papers
related specifically to the project, since text classification is a broad area of study. It was also
necessary to get
hold of other examples, besides the holy Quran, that contain the hereafter concept and would help in
testing the performance of the classifier on different texts. One example of such a data set is the
hadith: Sahih Muslim was used to create a data set with a similar format to the holy Quran data set.
The design of this data set, which was created manually, took some time since no hadith corpus was
available over the internet. The data set was then used to test the classification on this text and to see
whether the selected features were appropriate for this classification.
3.3 Methodology
Information retrieval and extraction are techniques used when information is required from some
type of document for use in a wider application. These techniques are an important field within
Natural Language Processing (NLP). Information retrieval can operate on text, speech, video, and
images. Text categorization is an important part of NLP that uses information retrieval and extraction
algorithms to design and implement solutions. Text
categorization, or classification, may for example involve applying a language identifier to a
document of unknown origin, which helps in identifying the region from which it came [20]. "Text
classification" is the term used by computing scholars, but linguistics scholars tend to call the task
stylometry when it concerns classifying a document by its author's style. An example of NLP
classification that may interest linguists is determining whether a newly discovered poem was written
by Shakespeare or by another author. In addition, text classification can apply information extraction
by highlighting important, interesting, or suspicious text to a user, using features defined by the user.
Detecting suspicious text is a research topic carried out at the University of Leeds as part of the
"Making Sense"
project. This project could be useful in detecting "suspicious" texts in a large corpus of surveillance
and intercepted texts from terrorist suspects, by implementing an algorithm that classifies different
data sets into suspicious and non-suspicious.
The classification in this project was processed in a number of steps to achieve the
requirements and evaluate the language model that was built. In order to accomplish this, two Java
programs were designed and run on the English data set of the holy Quran. However, changes were
made later to achieve the same results on different Arabic data sets including the Arabic version of the
holy Quran. The other Arabic data set that was used in implementation is the hadith data set after
completing the main steps of designing the data set. The reason for the changes is that when the
classification was implemented on the English sets of the holy Quran it did not perform correctly and
it was obvious also that it would not perform correctly on the Arabic set of the holy Quran or Hadith.
An example of the changes is a Java program that extracts features from the random subset selected
from the complement text file, counts the frequencies of these features line by line, and outputs an
.arff file. The changes that were made will be explained in full in the design and
implementation part. In order to try to increase the accuracy of the results a number of features were
considered. The word frequency count feature was carried out on the holy Quran data set and the
Hadith data set. This feature was considered in two different ways that will be explained in the next
chapter. The other feature that was added in the stage of designing the solution to the classification
problem was the Meccan and Madinan features that would define the verses. This can be used as an
extra feature in which a verse that was classified to be interesting should be contained in a Meccan
chapter. Another feature that was considered is the J-O feature that was implemented on the holy
Quran data set, in which a verse that was classified to be interesting should be contained in the 30th
section of the holy Quran, chapters 78 to 114. Since the testing and training sets were available for
both data sets, the 10-fold cross-validation provided by WEKA was an option for evaluating some of
the results. One option for performing the classification was decision trees; another was Naïve Bayes
classifiers.
Chapter 4
Design and Implementation
4.1 Classification Overview
The classification is based on a designed Java program that produces the .arff file used in the
implementation to classify Arabic text files, collected from different sources, into interesting and
non-interesting. Although the designed systems were initially implemented on the English version of
the holy Quran, they were finally implemented on the Arabic version, along
with other examples that may include the hereafter concept. The holy Quran data set was provided by
the university, including the subset that contains the verses on the hereafter subject. As additional
work, a hadith data set was created manually; it was selected because it was accessible, in Arabic,
and holds the hereafter concept that was chosen to define the interesting
text. Since the interesting text must be on the hereafter concept a number of features were defined in
the Java program to retrieve from a verse. However, when the classifier was designed and
implemented on the English version of the holy Quran, the skewed data problem appeared. The cause
of this problem was the imbalance between the non-interesting data, about 6000 verses labelled No,
and the interesting data, about 100 verses labelled Yes. As a result, under-sampling the data set was
one option for overcoming the skewed data problem. The approaches section describes the
procedures that were carried out on the English text files, along with the results that revealed the
problem. It also covers the procedures carried out on the Arabic text files: descriptions of the data
sets that were obtained or created, descriptions of the designed Java programs, procedures to
overcome the skewed data problem, and the results of the implementation in WEKA.
4.2 Features
The main features that classify the verses are a list of words that relate to the topic a user is
looking for. Since this classification concerns the hereafter concept, the Qurany Explorer was
consulted: it defines the hereafter as a sub-subject under the faith category. This website is a corpus
that includes an ontology of the subjects mentioned in the holy Quran, and was built by Noorhan
Abbas, a PhD student at the University of Leeds. The Qurany Explorer's hierarchy of concepts was
obtained from 'Mushaf Al Tajweed', which includes an ontology of almost 1200 subjects mentioned
in the holy Quran. Qurany Explorer was used to define the features. According to [21], the
words selected could relate to belief in the hereafter, the proofs of the hereafter, the signs which
precede it, the names of the hereafter, the resurrection, the reward, and many other sub-concepts
defined in the ontology. The words selected for this project are from the names-of-the-hereafter
sub-subject. The list of features is:
The Hereafter "الآخرة".
Day of Resurrection "الحاقة".
The Hour of Resurrection "الساعة".
Rise from the Dead "البعث".
The Judgment Day "الدين".
Day of Resurrection "القيامة".
An additional feature that was considered was the Meccan/Madinan feature that would
describe the verses. According to [2], Meccan chapters give more emphasis to end-of-days topics.
This feature was added to improve the classification on the holy Quran data set. Since this
classification was implemented at the University at chapter level, it was important to include a
method in the designed Java code to restructure the meccan-madinan.arff file. This file has three
parts. First, the directive part @relation, named Mecca-Madinan. Second, the @attribute part, which
includes the list of features that defines the classes Meccan (K) and Madinan (D). Finally, the @data
part, which includes the data set; in this part every line corresponds to a chapter in the holy Quran,
resulting in 114 lines, each ending with its class label. This file was used in the
Java code that was designed for the Arabic version of the holy Quran, to add the Meccan and Madinan
features to my .arff file. Another feature that was added to help to improve the results of the
classification task is checking if a verse of a chapter belongs to the 30th section of the holy Quran.
According to the islamiyyat website [22], the 30th section of the Quran talks about the hereafter. As
a result, a feature was added that labels all verses in chapters belonging to this section as Judgement
(J), and all other verses as Other (O). Chapter 78 is the first chapter in the 30th section and chapter
114 is the last.
Moreover, another word was added to the word list to count its frequency of use in the text
classification in the hadith data set. This is the Day of Judgment, but with an alternative term in
Arabic; this term is “الحساب”.
4.3 Approaches
The first step towards the implementation of the hereafter classification on the holy Quran
data set was on the English set. The data set was provided by Claire Brierley, a researcher at the
University of Leeds. The allQuran.txt file contains all 6236 verses of the holy Quran;
csubsetQuran.txt contains all the verses that concern the hereafter concept, namely 113 verses.
The required .arff file to be used in WEKA was retrieved by designing two Java programs. The first
Java program of the design is cnoarff.java which delivers the complement set in a text file. This text
file was created by comparing the verses in the all Quran text file and the subset Quran text file. If the
verse did not appear in the subset, it was written to the complement set. The number of verses that
was eventually written to the complement text file is 6123. The second part of the first Java program
creates the .arff file for the complement set. This file is called noQuran.arff file which contains all
verses that do not talk about the hereafter subject. The second Java program of the design is
allQuran.java which delivers the allQuran.arff that contains the combination of the verses of all
Quran that are labelled Yes, or No. The difference between the two .arff files that result from the two
programs is that the second output will contain the header of an .arff file. The header will include the
relation, attributes and their type, and the class. The allQuran.arff file was used in WEKA to classify
the verses into interesting or non-interesting based on the hereafter subject. The step that followed the
design of the Java program, and retrieving information that was required for the .arff file was training
the classifier on the data and testing it. The allQuran.arff file was opened in WEKA for processing.
Using 10-fold cross-validation to train and test the classifier divides the data set into 10 folds; this
can be changed to any other number in the Folds text field. Precision and recall show the positive and
negative statistics of the classification. The confusion matrix shows the instances that are assigned to
each class in the processed data against the instances as classified by WEKA. The following
formulae are
the equations which are used to get the results of the classification.
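The formulae themselves did not survive into this copy of the text; the standard definitions, which WEKA's output follows, are given below, where TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives respectively:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```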
The methods that were selected to test the classifier's accuracy are Decision Table, Naïve
Bayes and J48. In the confusion matrix, WEKA classifies the instances into the defined classes
{Yes, No}: 'a' denotes the Yes class and 'b' the No class. The 'a' column shows how many instances
WEKA classified as 'a' (Yes); the rows give the real number of instances defined as belonging to 'a'
(Yes) or 'b' (No).
The results show that no instances were correctly classified into the Yes class by any of the methods.
The results are summarised in the following table:
allQuran.arff                        Decision Table      Naïve Bayes        J48
Correctly Classified Instances       6123 = 98.1879%     6120 = 98.1398%    6123 = 98.1879%
Incorrectly Classified Instances      113 =  1.8121%      116 =  1.8602%     113 =  1.8121%
Precision                            0.964               0.964              0.964
Recall                               0.982               0.981              0.982

Confusion matrix (Decision Table):
   a    b   <-- classified as
   0  113 | a = Yes
   0 6123 | b = No

Confusion matrix (Naïve Bayes):
   a    b   <-- classified as
   0  113 | a = Yes
   3 6120 | b = No

Confusion matrix (J48):
   a    b   <-- classified as
   0  113 | a = Yes
   0 6123 | b = No
Table 1: Results in WEKA using about 6000 lines of No and 113 of Yes.
When these results were analysed it was obvious that the data was skewed since the data that
was labelled No was larger than the subset that was labelled Yes. At this point new approaches were
defined in order to improve the results of the classification. There were different approaches to test the
classifier that would be useful to use later in broader researches, such as detecting suspicious text. For
example, sub-sampling the large data set, which was defined in the complement set and labelled No,
into random subsets of different sizes. This section describes the procedures used to design and
implement the classification on two different data sets that hold the hereafter concept, using the same
features explained above. The first is the Arabic text of the holy Quran, which will be explained in
the following section; the second is the Arabic text of hadith, whose approaches will be explained in
section 4.3.2.
4.3.1 The holy Quran data set
Description
The first implementation of the classification was on the verses of the holy Quran and was
based on the frequency of defined features as has been mentioned previously. The implementation of
the solution was accomplished by firstly defining the features that the classification is based on. These
features can be a list of words which will be extracted to retrieve their frequencies in an .arff file. The
list of features that were defined will be extracted from the Arabic set allQuranArabic.txt,
RandomSubsetArabic.txt, subsetQuranArabic.txt and meccan-madinan.arff. The allQuranArabic.txt
file contains all 6236 verses of the holy Quran. subsetQuranArabic.txt contains all the verses that
concern the hereafter concept, namely 113 verses, and was defined by Claire Brierley, a researcher at
the University of Leeds. RandomSubsetArabic.txt contains 400 verses selected randomly from the
complement set (later changed to 100 verses). These
files contain the verses of the Quran in a certain format. This format is shown in Figure 2. The first
number is the chapter number, the second number is the verse number and the string is the verse.
In addition to this approach to the implementation of the classification, combining an existing
classification could be used to increase the accuracy of the classification. This can be carried out by
using the Meccan and Madinan classification that was implemented at the University of Leeds. To
accomplish this, the Java code that was designed should add another feature to the selected verses that
will produce the final .arff file. This feature will be added after retrieving the frequencies of the
features and will describe the verses if they were Meccan or Madinan by reading the first number in
the line which refers to the chapter number. The reason for using the chapter number to describe the
verse is that the classification of the holy Quran into Meccan or Madinan was on the chapter level.
Secondly, the frequencies and the Meccan-Madinan description obtained from the designed systems
are used in the implementation in WEKA. The implementation of this classification first classifies
based on the frequencies to obtain the interesting set, and then adds the Meccan-Madinan feature.
This step includes training the system to classify based on a predefined set of text. Finally, the
percentage of correctly classified verses is presented.
Design
A Java program was designed in order to obtain an .arff file that is used later in the
implementation of the classifier for the first approach. As has been explained previously, the first step
was to design two Java programs that used the English corpus of the holy Quran. However, when it
was time to use the Arabic corpus, a new Java program working as a full system was preferred. The
new Java program is called ArabicQuran.java. The first method defined in the
program is the main method in which the method processFile is called up. In the method processFile
input and output files are declared by firstly defining allQuranArabic.txt and subsetQuranArabic.txt as
inputs, and it uses the readFile method to read them. The output files are ComplementArabic.txt,
RandomSubsetArabic.txt and QuranArabic.arff. These files are created using a BufferedWriter for
each output.
1|1|String
Figure 2: Arabic Quran dataset format
The first part of the Java program obtains the complement set. This is accomplished by first reading
the input files and storing them in ArrayLists: two ArrayLists are created, containing the strings from
allQuranArabic.txt and subsetQuranArabic.txt. The complement set is built into an array list by a
designed method called GetComplement. The ReadFile method creates these ArrayLists using a
FileInputStream and an InputStreamReader that read the files with
UTF8 encoding. There was a problem while running the program: after the conversion to UTF8, the
array list would store an extra character at the beginning of every text file created, which then failed
to match the chapter and verse numbers stored in the array list. To overcome this problem, a loop
was created to skip this character if it existed and move on to the next character in the array list,
which is the chapter number. The GetComplement method returns an ArrayList and takes the
allQuranArabic and subsetArabic ArrayLists, together with a Writer for output. It loops over the
input text files and compares the IDs of the verses using the GetId method, which will be explained
later. If a verse line is in the allQuran text file but not in the subset text file, it is written to the
complement array list. The GetId method obtains the first numbers of each line, separated by "|", for
example "##|##".
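The complement-set step can be sketched as follows; this is a minimal illustration under the report's "chapter|verse|text" line format, not the actual ArabicQuran.java code, and the class name is invented for the sketch:

```java
import java.util.*;

public class ComplementBuilder {
    // Extract the "chapter|verse" ID from a line formatted "chapter|verse|text".
    static String getId(String line) {
        int first = line.indexOf('|');
        int second = line.indexOf('|', first + 1);
        return line.substring(0, second);
    }

    // Verses present in the full set but absent from the subset form the complement.
    static List<String> getComplement(List<String> all, List<String> subset) {
        Set<String> subsetIds = new HashSet<>();
        for (String s : subset) subsetIds.add(getId(s));
        List<String> complement = new ArrayList<>();
        for (String verse : all) {
            if (!subsetIds.contains(getId(verse))) complement.add(verse);
        }
        return complement;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("1|1|verse one", "1|2|verse two", "2|1|verse three");
        List<String> subset = Arrays.asList("1|2|verse two");
        System.out.println(getComplement(all, subset)); // [1|1|verse one, 2|1|verse three]
    }
}
```

Comparing IDs rather than whole lines mirrors the report's use of GetId for the match.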
The second part of the Java program obtains the random subset, which requires the
complement set. The random subset is built by a method called CreateRandomSubset. This method
takes the complement set, an integer defined beforehand as the limit on the number of lines in the
random subset (the value of 400 verses, later changed to 100), and writes an output. The selected
verses are inserted in the order of their position in allQuranArabic.txt. This method uses the
Randomvalue method, which declares the size of the random set; it is implemented using
mathematical functions to select the required number of lines, initially 400, from the complement
set.
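One way to sketch this order-preserving random sub-sampling is shown below. The class and method names are illustrative, and a fixed seed is used only to make the example reproducible; the report's actual Randomvalue method may work differently:

```java
import java.util.*;

public class RandomSubset {
    // Pick `limit` distinct line indices at random, then emit those lines in
    // their original order, mirroring how the report keeps verses in
    // allQuranArabic.txt order.
    static List<String> createRandomSubset(List<String> complement, int limit, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < complement.size(); i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<Integer> chosen = new ArrayList<>(indices.subList(0, limit));
        Collections.sort(chosen);              // restore original ordering
        List<String> subset = new ArrayList<>();
        for (int i : chosen) subset.add(complement.get(i));
        return subset;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println(createRandomSubset(lines, 3, 42L));
    }
}
```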
This program is mainly about creating the .arff file that will later be used in WEKA to
classify the verses into interesting and non-interesting. To accomplish this, the header of the .arff
file is written using the writeFile method, which takes in the string to be written and the file it
should be written to. In addition, two other methods help in creating the .arff file. First, ProcessLine
obtains the attributes' frequencies by tokenizing the lines using a StringTokenizer and applying
regular-expression matching so that an attribute is counted even when it carries a prefix or a suffix.
Second, the GetEndLine method writes Yes or No at the end of every line of the .arff file: if the line
occurs in the random subset it writes No, and if it occurs in the subset it writes Yes. Figure 3 is a
simple diagram that helps in understanding the flow of the Java program. The output line looks like
this: "1|4",0,0,0,0,1,0,NO.
Additional Method
Additional methods were added later to improve the classification. These methods can easily be
removed by omitting some lines, which changes the output. Two new methods were added to this
Java code to add the Meccan-Madinan feature. The first, TrimMeccan, removes the header part
from the meccan-madinan.arff file: it loops through the lines until the first character of a line is 0,
then stops and returns the remaining lines. The .arff file is structured to include a header, which is
removed, and a data part of 114 lines, one per chapter. The second method, GetLetter, uses the
trimmed meccan-madinan.arff file to retrieve the feature. The letter K was assigned to Meccan
chapters and D to Madinan chapters. The method reads the first number of each selected line for the
QuranArabic.arff file, looks at the line in that position in the meccan-madinan.arff file, and takes the
feature from it. The output lines following @data in the .arff file will look like the following:
"1|4",0,0,0,0,1,0,K,NO
In addition, ProcessLineSingleCount was added to combine the frequencies into one count for the list
of word features that is defined. The output lines following @data in the .arff file will look like the
following:
"2|16",0,D,NO
Figure 3: Diagram of the sequence of the methods that were created in the Java program
In order to add the J-O feature, the GetSecondLetter method was added. This method reads the
first number of the line and, depending on it, adds the letter J for verses that belong to a
chapter in the 30th section of the Quran, and the letter O for all the verses that do not. The
output lines following @data in the .arff file will look like the following:
"2|25",0,D,O,NO
Implementation
The implementation of the solution was carried out in WEKA using different data set sizes and
features, with different classifiers used to train and test. As has been mentioned previously,
the three classification methods that were selected to analyse and evaluate the classifier are
Decision Table, Naïve Bayes and J48. The first implementation was on QuranArabic.arff in three
different versions of random sets. 10-fold cross-validation was used to train and test the data
on a number of .arff files. The arff files consist of the frequencies of the defined features in
400 lines of verses labelled No and 113 lines of verses labelled Yes, later changed to 100 lines
labelled No and 113 lines labelled Yes. In addition, the Meccan-Madinan feature was included and
the classifier was tested on the word frequencies together with this feature, again defining arff
files with 400 random lines of verses labelled No and 113 labelled Yes, then changed to 100 lines
labelled No and 113 labelled Yes. The final results of the WEKA classifiers on this data set were
obtained from 100 lines of verses labelled No and 113 labelled Yes, including the Meccan-Madinan
feature, with the frequency count combined into one and the Judgement-Other feature added. The
results of processing the arff files using the Java programs described previously will be
provided in the following chapter.
4.3.2 The hadith data set
Description
The second implementation of the classification was on hadith and was based on the
frequency of defined features, as mentioned previously. The hadith data set was created manually.
The design of the data sets was based on the al-eman website [23]. The hadith data set was created
from Sahih Muslim, which was mentioned in the classification overview. The subset data was created
using the web-based search tool on this website. According to Abdul Hamid Siddiqui [24], a saying
or deed of the prophet Muhammad (May peace be upon him) is called a hadith. These sayings are the
second source, after the holy Quran, that Muslims refer to on the subjects of life and Islamic
laws. Imam Muslim is an important and famous scholar in Islam, and Sahih Muslim is a book that
includes the collection of hadith made by this scholar. The book, in most English translations, is
divided into 43 sections containing all the hadith; however, it originally had 57 sections that
include up to 7500 sayings in the Arabic version of Sahih Muslim. The data set that was designed
to implement the classification for this project used the Arabic version. The book includes all
sayings that were transmitted by different chains of narrators; as a result, if a saying had
another chain of narrators it was counted as a different saying. These repeated sayings were
removed from the designed data set, since the important part of the hadith was the saying itself
and not the narrators. In addition, the subset was created using the Ibn Kathir Tafsir book that
was provided online by the Quran complex website [25]. The hadiths were gathered from this
electronic version of the book by firstly going through the verses that were in the Quran data set
and looking for the corresponding sayings of the prophet that were available in Sahih Muslim.
Secondly, the al-eman website that was previously mentioned was used in order to keep track of the
reference numbers of the sayings gathered from Ibn Kathir and to retrieve the complement set. The
final Arabic version of the hadith data set used in the classification is SahihMuslim.txt, which
includes 7748 hadith, and subsetHadithArabic.txt, which in turn includes 86 hadith that hold the
hereafter concept.
The implementation of the solution was accomplished by firstly creating the data sets that
were required for training and testing, as explained previously. Secondly, the features that the
classification is based on were defined. These features can be a list of words whose frequencies
will be extracted into an .arff file. The defined list of features will be extracted from the
Arabic sets AllHadithArabic.txt, RandomSubsetArabicHadith.txt and subsetHadithArabic.txt. The
AllHadithArabic.txt file contains all the quotes that were attributed to the prophet Muhammad
(May peace be upon him), which number about 7500. subsetHadithArabic.txt contains all the sayings
(86) that concern the hereafter concept. RandomSubsetArabicHadith.txt contains 400 sayings that
were selected randomly from the complement set to begin with, later changed to 100 sayings. These
files contain the sayings in the format shown in Figure 4: the number represents the hadith
number in Sahih Muslim and the string is the hadith. To begin the implementation of this
classification, the system had to be trained and tested, which was done using 10-fold
cross-validation. Finally, the percentage of correctly classified hadith will be presented.
Design
A Java program was designed in order to obtain an .arff file that is used later in the
implementation of the classifier to test on different data sets that would hold the hereafter
concept. In order to accomplish this, the first step is to design the data set. The second step
is designing the Java program that would help in retrieving the information required for the
implementation. Two new Java programs were designed; the first is SetupData.java and the second
is ArabicHadith.java. A description of these programs will be provided in this section.

1|“String”
Figure 4: Arabic Hadith dataset Format
SetupData.java description
The first method defined in the program is the main method, in which the method processFile is
called. In processFile, input and output files are declared: SahihMuslim.txt and subsetHadith.txt
are defined as inputs, and the readFile method is used to read them. The output files are
AllHadithArabic.txt and SubsetHadithArabic.txt; these files are created using a BufferedWriter
for each output. Firstly, the Java program reads the input files and stores them in ArrayLists.
Two ArrayLists are created: one holds the strings from SahihMuslim.txt, which produce
AllHadithArabic.txt, and one holds the subset, which produces SubsetHadithArabic.txt.
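Reading the files with an explicit UTF-8 decoder, as the program's ReadFile method does, can be sketched as follows. The class and method names here are hypothetical, but the FileInputStream/InputStreamReader pattern matches the description above; relying on the platform default charset instead is what typically corrupts Arabic text.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of ReadFile: open a file explicitly as UTF-8 so Arabic text
// survives the read, and store the lines in an ArrayList.
public class ReadFileSketch {

    static List<String> readFile(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file containing Arabic text, then read it back.
        java.nio.file.Path tmp = java.nio.file.Files.createTempFile("hadith", ".txt");
        java.nio.file.Files.write(tmp,
                "1|\u0642\u0627\u0644".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFile(tmp.toString()).size()); // prints 1
    }
}
```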
The method ReadFile reads the files into the ArrayLists using FileInputStream and an
InputStreamReader that reads the files as UTF-8 encoding. As before, a problem with storing the
lines that are in UTF-8 encoding was present and was addressed. The method that follows ReadFile
is setupFile. This method splits each line of hadith after the first bar that appears in the line
and saves the strings that are between quotation marks. As has been explained previously, the
sayings of the prophet are the strings between quotation marks; the program stores the strings
that hold the odd positions after splitting, as explained in Figure 5. The final method in this
program is writeFile, which prints the lines to the files.
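The split-after-the-first-bar, keep-the-quoted-string behaviour of setupFile can be sketched like this. It is an assumed reconstruction: the real program iterates over the odd-numbered positions of the split (there may be several quoted spans per line), while this sketch takes just the first quoted string, and the class name is invented.

```java
// Sketch of setupFile: keep the hadith number before the first bar, and
// extract the saying, i.e. the text between quotation marks in the rest
// of the line. Splitting on the quote character puts quoted text at the
// odd positions of the resulting array.
public class SetupFileSketch {

    // Input format (Figure 4): number|"saying ..."
    static String extractSaying(String line) {
        String afterBar = line.substring(line.indexOf('|') + 1);
        String[] parts = afterBar.split("\"");
        // parts[0] is anything before the first quote; parts[1] is the saying.
        return parts.length > 1 ? parts[1] : "";
    }

    public static void main(String[] args) {
        System.out.println(extractSaying("1|\"a saying of the prophet\""));
        // prints: a saying of the prophet
    }
}
```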
ArabicHadith.java description
The second Java program is ArabicHadith.java. Its main method calls the method processFile. In
processFile, input and output files are declared: AllHadithArabic.txt and SubsetHadithArabic.txt
are defined as inputs, and the readFile method is used to read them. The output files are
ComplementArabicHadith.txt, RandomSubsetArabicHadith.txt and HadithArabic.arff. This Java program
is designed in the same way as ArabicQuran.java. As a result, the first part of the program
obtains the complement set; this is accomplished using the GetComplemet, ReadFile and GetId
methods. The second part of the program obtains the random subset, which requires the complement
set. The random subset is likewise created using the methods CreateRandomSubset and Randomvalue.
This program is mainly about creating the .arff file that will be used later in WEKA to
classify the sayings into interesting and non-interesting. To accomplish this, the header of the
.arff file is written using the writeFile method. The ProcessLine method gets the attributes'
frequencies by tokenizing the lines using StringTokenizer and applying regular expression
matching, so that an attribute is counted even when it carries a prefix or a suffix. In addition,
the GetEndLine method is used to write Yes or No at the end of every line of the arff file. This
is done by checking the line: if it occurs in the random subset then it writes No, and if it
occurs in the subset then it writes Yes. The output line looks like this:
"108",0,0,0,0,1,0,YES
Additional methods were added later to improve the classification; they can be removed easily by
omitting some lines, and the output will change accordingly. ProcessLineSingleCount was added to
this Java code to combine the frequencies into one count for the defined list of word features.
The output lines following @data in the .arff file will look like the following:
"108",1,YES
Implementation
The implementation of the classification task on the hadith dataset was done in WEKA using
different data set sizes and features. As has been mentioned previously, the three classification
methods that were selected to analyse and evaluate the classifier are Decision Table, Naïve Bayes
and J48. The first implementation was on a HadithArabic.arff that consists of the frequencies of
the defined features in 200 lines of sayings labelled No and 86 lines of sayings labelled Yes,
later changed to 100 lines labelled No and 86 lines labelled Yes. The results of processing the
arff files using the Java programs described previously will be provided in the following
chapter.
Chapter 5
Results and Evaluation
5.1 WEKA Results
5.1.1 The holy Quran data set results
The results of the implementation are shown in Tables 2 and 3 below. Table 2 shows the
results of the WEKA classifiers for 100 lines of No and 113 of Yes that include the {K, D} feature
and one frequency count. Three different versions were used for training and testing, which are
numbered 10, 11 and 12. Table 3 shows the results of the WEKA classifiers for 100-113 lines that
include the Meccan-Madinan feature, with the frequency count combined into one and the
Judgement-Other feature added. Similarly, three different versions were used for training and
testing, which are numbered 13, 14 and 15. However, the first implementation on this data set was
on the dataset which was resized to
400 lines of the data labelled No and 113 labelled Yes using the frequency count feature with three
different versions of the .arff files. The results of the classification for the first example,
QuranArabic1.arff, in WEKA using Decision Table, Naïve Bayes and J48 are as follows: the best
result was 402 correctly classified instances, which is equal to 78.3626%; that was scored by the
Naïve Bayes classifier. Precision is 0.743 and Recall is 0.784. The confusion matrix was the
following:
a b <-- classified as
7 106 | a = YES
5 395 | b = NO
The second example is QuranArabic2.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 405, which is equal to 78.9474%; that
was scored by the Decision Table and Naïve Bayes classifiers. Precision is 0.787 and Recall is
0.789. The confusion matrix was the following:
a b <-- classified as
7 106 | a = YES
2 398 | b = NO
The final example is QuranArabic3.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 403, which is equal to 78.5575%; that
was scored by the Naïve Bayes classifier too. Precision is 0.755 and Recall is 0.786. The
confusion matrix was the following:
a b <-- classified as
7 106 | a = YES
4 396 | b = NO
All three versions of the arff files scored the same lowest result using J48: 400 correctly
classified instances = 77.9727%. Precision is 0.608 and Recall is 0.78.
The results of the classification in WEKA when the dataset was resized to 100 lines of the data
labelled No and 113 labelled Yes, using the frequency count with three different versions of the
.arff files, were exactly the same. The best results were by the Decision Table and J48
classifiers: the correctly classified instances are 113, which is equal to 53.0516%, and the
incorrectly classified instances are 100, which is equal to 46.9484%. Precision is 0.281 and
Recall is 0.531. The confusion matrix was the following:
a b <-- classified as
113 0 | a = YES
100 0 | b = NO
Clearly these were not reliable results, and resizing did not improve the classification results
using the frequency count. In order to improve the results, the data sets were resized back to
400 lines of the data labelled No and 113 labelled Yes; however, the Meccan-Madinan feature was
added to this version of the arff files and the frequency count was changed to one combined count
for all the words that were defined previously. The first example is QuranArabic7.arff, using
Decision Table, Naïve Bayes and J48, and the best results are as follows: correctly classified
instances are 400, which is equal to 77.9727%; that was scored by the Decision Table and J48
classifiers. Precision is 0.608 and Recall is 0.78. The confusion matrix was the following:
a b <-- classified as
0 113 | a = YES
0 400 | b = NO
However, the Naïve Bayes classifier produced better results in classifying some True positive
instances. The confusion matrix produced by Naïve Bayes on this version of the arff file was as
follows:
a b <-- classified as
11 102 | a = YES
14 386 | b = NO
The second example is QuranArabic8.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 397, which is equal to 77.3879%; that
was scored by the Naïve Bayes classifier, with the other classifiers lower by one instance.
Precision is 0.719 and Recall is 0.774. The confusion matrix was the following:
a b <-- classified as
14 99 | a = YES
17 383 | b = NO
The final example is QuranArabic9.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 400, which is equal to 77.9727%; that
was scored by the Decision Table and J48 classifiers. Precision is 0.608 and Recall is 0.78. The
confusion matrix was the following:
a b <-- classified as
0 113 | a = YES
0 400 | b = NO
Again, the Naïve Bayes classifier produced better results in classifying some True positive
instances. The confusion matrix produced by Naïve Bayes on this version of the arff file was as
follows:
a b <-- classified as
10 103 | a = YES
15 385 | b = NO
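WEKA reports Precision and Recall as weighted averages over the two classes (each class's value weighted by the class size). As a check, the 0.743/0.784 figures reported for QuranArabic1.arff can be recomputed from its confusion matrix. This sketch is not WEKA code, just the weighted-average formula written out:

```java
// Weighted-average precision and recall for a 2x2 confusion matrix, in the
// style WEKA reports them. m[0] = {TP, FN} for YES; m[1] = {FP, TN} for NO.
public class WeightedMetricsSketch {

    static double weightedPrecision(int[][] m) {
        double total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        // Per-class precision: of everything classified YES (or NO), how
        // much was actually YES (or NO)?
        double pYes = (m[0][0] + m[1][0]) == 0 ? 0 : (double) m[0][0] / (m[0][0] + m[1][0]);
        double pNo  = (m[0][1] + m[1][1]) == 0 ? 0 : (double) m[1][1] / (m[0][1] + m[1][1]);
        // Weight each class's precision by that class's size.
        return ((m[0][0] + m[0][1]) * pYes + (m[1][0] + m[1][1]) * pNo) / total;
    }

    static double weightedRecall(int[][] m) {
        double total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        // Weighted recall reduces to overall accuracy for two classes.
        return (m[0][0] + m[1][1]) / total;
    }

    public static void main(String[] args) {
        // QuranArabic1.arff, Naive Bayes: YES row (7, 106), NO row (5, 395).
        int[][] m = { {7, 106}, {5, 395} };
        System.out.printf("precision %.3f recall %.3f%n",
                weightedPrecision(m), weightedRecall(m));
        // prints: precision 0.743 recall 0.784
    }
}
```

The recomputed values match the reported 0.743 and 0.784, which confirms how these columns should be read throughout the tables.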
Table 2: Results in WEKA using 100 lines of No and 113 of Yes (including the {K, D} feature and
one frequency count). Confusion matrices are given as the YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
QuranArabic10.arff
  Correctly Classified Instances   126 = 59.1549%    126 = 59.1549%    126 = 59.1549%
  Incorrectly Classified Instances 87 = 40.8451%     87 = 40.8451%     87 = 40.8451%
  Precision                        0.609             0.618             0.609
  Recall                           0.592             0.592             0.592
  Confusion Matrix                 97 16 / 71 29     100 13 / 74 26    97 16 / 71 29
QuranArabic11.arff
  Correctly Classified Instances   132 = 61.9718%    130 = 61.0329%    132 = 61.9718%
  Incorrectly Classified Instances 81 = 38.0282%     83 = 38.9671%     81 = 38.0282%
  Precision                        0.64              0.64              0.64
  Recall                           0.62              0.61              0.62
  Confusion Matrix                 97 16 / 65 35     100 13 / 70 30    97 16 / 65 35
QuranArabic12.arff
  Correctly Classified Instances   124 = 58.216%     124 = 58.216%     124 = 58.216%
  Incorrectly Classified Instances 89 = 41.784%      89 = 41.784%      89 = 41.784%
  Precision                        0.597             0.606             0.597
  Recall                           0.582             0.582             0.582
  Confusion Matrix                 97 16 / 73 27     100 13 / 76 24    97 16 / 73 27
Table 3: Results in WEKA using 100 lines of No and 113 of Yes (including the {K, D} feature, the
{J, O} feature and one frequency count). Confusion matrices are given as the YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
QuranArabic13.arff
  Correctly Classified Instances   122 = 57.277%     115 = 53.9906%    123 = 57.7465%
  Incorrectly Classified Instances 91 = 42.723%      98 = 46.0094%     90 = 42.2535%
  Precision                        0.583             0.534             0.592
  Recall                           0.573             0.54              0.577
  Confusion Matrix                 96 17 / 74 26     83 30 / 68 32     97 16 / 74 26
QuranArabic14.arff
  Correctly Classified Instances   128 = 60.0939%    125 = 58.6854%    125 = 58.6854%
  Incorrectly Classified Instances 85 = 39.9061%     88 = 41.3146%     88 = 41.3146%
  Precision                        0.609             0.59              0.593
  Recall                           0.601             0.587             0.587
  Confusion Matrix                 92 21 / 64 36     89 24 / 64 36     91 22 / 66 34
QuranArabic15.arff
  Correctly Classified Instances   111 = 52.1127%    116 = 54.4601%    108 = 50.7042%
  Incorrectly Classified Instances 102 = 47.8873%    97 = 45.5399%     105 = 49.2958%
  Precision                        0.506             0.541             0.484
  Recall                           0.521             0.545             0.507
  Confusion Matrix                 91 22 / 80 20     92 21 / 76 24     90 23 / 82 18
5.1.2 The hadith data set results
This section will provide the results of processing the data sets using different classifiers in
WEKA. The results of the implementation are shown in the tables below. Table 4 shows the results
of the WEKA classifiers for 200-86 lines; three different versions of HadithArabic.arff files
were used for training and testing, numbered 1, 2 and 3. Similarly, Table 5 shows the results of
the WEKA classifiers for 100-86 lines, with three different versions used for classifier training
and testing, numbered 4, 5 and 6 in the table. When the word frequencies were combined, the
results did not change much. The results of the classification when the dataset was resized to
200 lines of the data labelled No and 86 labelled Yes, using one count of the frequency with
three different versions of the .arff files, were different for each version of the data set
used, whichever classifier was applied.
The first example is HadithArabic7.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 236, which is equal to 83.6879%;
incorrectly classified instances are 46, which is equal to 16.3121%. Precision is 0.838 and
Recall is 0.837. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
8 192 | b = NO
The second example is HadithArabic8.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 228, which is equal to 80.8511%;
incorrectly classified instances are 54, which is equal to 19.1489%. Precision is 0.801 and
Recall is 0.809. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
16 184 | b = NO
The final example is HadithArabic9.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 230, which is equal to 81.5603%;
incorrectly classified instances are 52, which is equal to 18.4397%. Precision is 0.809 and
Recall is 0.816. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
14 186 | b = NO
The results of the classification in WEKA when the dataset was resized to 100 lines of the data
labelled No and 86 labelled Yes, using the one combined count of the frequency with three
different versions of the .arff files, were exactly the same. The correctly classified instances
are 139, which is equal to 76.3736%, and the incorrectly classified instances are 43, which is
equal to 23.6264%. Precision is 0.797 and Recall is 0.764. The confusion matrix was the
following:
a b <-- classified as
44 38 | a = YES
5 95 | b = NO
Table 4: Results in WEKA using 200 lines of No and 86 of Yes. Confusion matrices are given as the
YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
HadithArabic1.arff
  Correctly Classified Instances   226 = 80.1418%    227 = 80.4965%    219 = 77.6596%
  Incorrectly Classified Instances 56 = 19.8582%     55 = 19.5035%     63 = 22.3404%
  Precision                        0.795             0.798             0.768
  Recall                           0.801             0.805             0.777
  Confusion Matrix                 39 43 / 13 187    41 41 / 14 186    31 51 / 12 188
HadithArabic2.arff
  Correctly Classified Instances   231 = 81.9149%    232 = 82.2695%    225 = 79.7872%
  Incorrectly Classified Instances 51 = 18.0851%     50 = 17.7305%     57 = 20.2128%
  Precision                        0.821             0.824             0.805
  Recall                           0.819             0.823             0.798
  Confusion Matrix                 39 43 / 8 192     40 42 / 8 192     31 51 / 6 194
HadithArabic3.arff
  Correctly Classified Instances   224 = 79.4326%    209 = 74.1135%    216 = 76.5957%
  Incorrectly Classified Instances 58 = 20.5674%     73 = 25.8865%     66 = 23.4043%
  Precision                        0.785             0.735             0.753
  Recall                           0.794             0.741             0.766
  Confusion Matrix                 39 43 / 15 185    15 67 / 6 194     30 52 / 14 186
Table 5: Results in WEKA using 100 lines of No and 86 of Yes. Confusion matrices are given as the
YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
HadithArabic4.arff
  Correctly Classified Instances   134 = 73.6264%    138 = 75.8242%    135 = 74.1758%
  Incorrectly Classified Instances 48 = 26.3736%     44 = 24.1758%     47 = 25.8242%
  Precision                        0.792             0.806             0.804
  Recall                           0.736             0.758             0.742
  Confusion Matrix                 37 45 / 3 97      41 41 / 3 97      37 45 / 2 98
HadithArabic5.arff
  Correctly Classified Instances   127 = 69.7802%    134 = 73.6264%    133 = 73.0769%
  Incorrectly Classified Instances 55 = 30.2198%     48 = 26.3736%     49 = 26.9231%
  Precision                        0.75              0.772             0.767
  Recall                           0.698             0.736             0.731
  Confusion Matrix                 32 50 / 5 95      40 42 / 6 94      39 43 / 6 94
HadithArabic6.arff
  Correctly Classified Instances   134 = 73.6264%    134 = 73.6264%    130 = 71.4286%
  Incorrectly Classified Instances 48 = 26.3736%     48 = 26.3736%     52 = 28.5714%
  Precision                        0.778             0.772             0.762
  Recall                           0.736             0.736             0.714
  Confusion Matrix                 39 43 / 5 95      40 42 / 6 94      35 47 / 5 95
5.2 Evaluation of the Model
As has been mentioned previously, the procedures to accomplish the text classification
problem were obtaining a dependable data set and selecting features that would assist the
classifier. By applying a number of Java programs, the required data files to be used in the
classification were retrieved. Using the different types of classifiers provided in WEKA, the
results allowed the evaluation of the accuracy of the classifier and of the defined features.
Improvements to the defined features were necessary; these improvements will be explained in the
following section.
5.2.1 Evaluation on data sets
Two different data sets were processed in this project in order to implement the classification
task. The first data set is the English version of the holy Quran data set that was provided by a
researcher at the University of Leeds. However, since this project was to process Arabic text,
the data set was converted to Arabic: the full Arabic Quran data set was downloaded from the
internet and the subset was retrieved using the English subset that was provided.
Additionally, the hadith data set was created manually using the website that was mentioned in
the previous chapter. This design step was not expected and took time to complete, since there
has been no recent natural language processing research on hadith, other than the example given
in the background reading, and the corpus that was used is not available. The data set was first
gathered from the website and then processed into a certain format to retrieve the .arff file.
The subset was first gathered from different sources and then formatted for use in the
classification task. To perform the classification task on this data set, a Java program was
created to help improve the format of the data.
Two other Java programs were designed to retrieve the .arff file that was used in WEKA to
implement the classification. Some issues appeared when the implementation was carried out on the
English data set. The problem was skewed data, a result of the large data set labelled No
compared to the small subset that was all labelled Yes. According to Andrew McKinley [26],
"Subsampling the whole data set (i.e. both test and train) so the classes are more balanced will
help a lot." As a result, the arff files that were used in the classification of the Arabic
version of the holy Quran and the Hadith were resized.
5.2.2 Evaluation of features
Feature selection was an important procedure in this project, helping to design the solution and
to implement a suitable classification model. The features were selected based on my
pre-knowledge of the holy Quran and with the help of Quran Explorer. In the beginning, the
feature was defined to be a list of highly frequent words in the subset Quran data set. Key words
for the hereafter and Day of Judgment were defined in order to retrieve their frequencies from
the holy Quran data set. Since this subject is categorized into many sections, the selected
category was names of the Day of Judgment. As has been mentioned in the previous chapter, the
features that were selected are The Hereafter, Day of Resurrection, Rise from the Dead and The
Judgment Day. However, there were more terms that were not used, and a number of terms in the
English version could be mapped to different terms in the Arabic version; for example,
Resurrection can be mapped to at least three different terms in Arabic. This was one of the
things that changed the classification result, since the word list used was smaller than in the
English version. Many other words could be added to the word list to improve the classification
results, such as words that describe the hereafter, like heaven and hell, and the signs which
lead to the hereafter. Such a list of words could improve the classification results because they
are semantically related to the Day of Judgment. In relation to the results of the project
implementation, each of the defined words whose frequencies were retrieved stands for one word;
these were later combined into a single frequency count. In addition to this improvement on the
holy Quran, the Meccan and Madinan feature was added along with the combined frequencies, and the
results improved. Moreover, adding the J-O feature was another option for testing the classifier
results with this additional feature.
The results of the features on the hadith data set were different from those on the Quran data
set. Although the features were the same, except for the additional word that was added to the
frequency count feature, the results improved. These results will be explained in the following
section, where an analysis will describe and discuss the results tables that were given in the
previous chapter.
5.3 Analysis
In the beginning, the classification model was implemented on the English data set, and from the
results of the WEKA classifiers represented in Table 1 it was obvious that the data was skewed.
The next step was to resize the data set used in the classification, using the Arabic data set.
According to Bilal M. Zahran and Ghassan Kanaan [27], there are some difficulties in classifying
Arabic text. For example, the Arabic language has different syntax and semantics from other
languages. Comparing English to Arabic, Arabic letters have many written forms, and the
punctuation associated with them changes the meaning. Moreover, comparing Arabic roots to English
roots is complex, and so natural language processing for Arabic differs from that for English. In
relation to the implementation section in the previous chapter, different approaches were used in
order to test and perform the classification task. Using the Arabic data set, resizing the data
was an additional option when looking for different classification results.
In relation to QuranArabic.arff versions number (1, 2 and 3): the J48 classifier did not produce
any useful results, because changing the version of the dataset that was used did not have any
effect on the results, and J48 did not produce any graph that showed the Yes-No class. The
Decision Table did not show much difference in the classification results; however, the highest
result using this classifier was 405 correctly classified instances, which is equal to 78.9474%.
This number of instances includes 7 of the True positive and 398 of the True negative instances.
Naïve Bayes also scored the highest percentage of 405 correctly classified instances, equal to
78.9474%, and its lowest percentage of 402 correctly classified instances, equal to 78.3626%, is
still greater than that scored by the other classifiers. Using the frequency count, and keeping
in mind the size of the data, it seems that the Naïve Bayes classifier is better for this text
classification task, since J48 did not change and the Decision Table did not score any better
than Naïve Bayes.
In relation to QuranArabic.arff versions 4, 5, and 6, J48 did not improve its results
although different sets were used to test the classifier. In this approach too, J48 did not produce any
graph that shows the Yes-No class. Similarly, the Decision Table had the same results as J48. Yet
these classifiers scored the highest percentage, which is 113 correctly classified instances, equal to
53.0516%. This number refers to all 113 True positive instances and 0
True negative instances. On the other hand, the Naïve Bayes classifier scored the lowest
percentage, with 111 instances belonging to the True positive class. This score did not change
although different data sets were used in testing. Using the frequency count and resizing the data set
did not help in improving the results of the classification. All of the classifiers were able to detect the
set that was interesting but could not classify any of the non-interesting text. In order to improve these
results, the Meccan-Madinan feature was added, along with combining the frequencies into a single
count.
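Combining the frequencies into a single count can be sketched as summing, per verse, the occurrences of every keyword in the feature list, producing one numeric attribute instead of one attribute per word. The keyword list below is illustrative only, not the actual list derived from the Islamic scholars' research:

```java
import java.util.List;

// Sketch of combining per-word frequencies into a single count:
// one numeric attribute holds the total occurrences of all keywords
// in the verse, instead of one WEKA attribute per keyword.
// The keyword list used here is illustrative, not the project's list.
public class CombinedCount {
    static int combinedCount(String verse, List<String> keywords) {
        int total = 0;
        for (String kw : keywords) {
            int from = 0;
            while ((from = verse.indexOf(kw, from)) != -1) {
                total++;             // count this occurrence
                from += kw.length(); // continue scanning after the match
            }
        }
        return total;
    }
    public static void main(String[] args) {
        List<String> kws = List.of("day", "judgment");
        System.out.println(combinedCount("the day of judgment is a day", kws));
    }
}
```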
In relation to QuranArabic.arff versions 7, 8, and 9, J48 improved its results by
producing different output each time a new set was used. When the Meccan-Madinan feature was added, a graph
was provided to classify the verses of the Quran. However, going back to the table, we see that the
correctly classified instances are 400, which is equal to 77.9727%. These instances are all True
negative instances; there were no True positive instances. In addition, the Decision Table shows
results similar to the J48 classifier. This may be due to selecting a random set of verses that may
include mostly Meccan verses, but labelled No. The Naïve Bayes classifier scored the lowest result in
comparison to the other two mentioned. However, this classifier was able to detect some of the
True positive instances, although it was an insignificant number: 14 instances out of 113.
In relation to table 2, two versions of the arff files showed exactly the same results in all
classifiers. The third version of the arff file presented a smaller number of correctly classified
instances, yet better than the others. The highest score, by J48 and the Decision Table, is 132 correctly
classified instances, which is equal to 61.9718%. Naïve Bayes scored 130 correctly classified
instances, which is equal to 61.0329%. The 130 instances include 100 True positive instances
and 30 True negative instances. Additionally, the J-O feature was added along with the Meccan-Madinan
feature and combining the frequencies into a single count.
In relation to table 3, the J48 classifier gave one of the best results: 125 instances,
which is equal to 58.6854%. 91 instances were correctly classified as True positive and 34 as True
negative. The Decision Table classifier on this data set gave the highest score, 128
instances, which is equal to 60.0939%. This includes 92 True positive instances and 36
True negative instances. The Naïve Bayes classifier scored exactly the same percentage as J48.
However, when we look at the confusion matrix, the True positive instances using this classifier were
89, and 36 were True negative instances.
By comparing the results of the classification using the sets of size 100-113, the classification
task improved enormously, from 53.0516% to 61.9718%, by adding the Meccan-Madinan feature.
However, by adding the J-O feature the percentage dropped to 60.0939%. This is not a
big difference; the decrease is due to the small number of verses in the subset that
belong to the 30th section of the holy Quran, namely only 9 verses out of the 113.
The implementation of the classification on the hadith data set presented better and different
results than those from the holy Quran data set. In relation to table 7, J48 provided different results
and graphs for every version of the arff files used in this implementation. This classifier scored one of
the highest results, taking into account that the only feature used was the frequency count. It
scored 225 correctly classified instances, which is equal to 79.7872%. However, the confusion
matrix shows that most of the instances are True negative, and only 31 of the 86 positive instances are True
positive. The Decision Table classifier scored 231 correctly classified instances, which is equal
to 81.9149%, even higher than J48. This classifier was able to classify 39 True positive
instances. Naïve Bayes scored up to 232 instances, which is equal to 82.2695%, and includes 40
True positive instances and 192 True negative instances.
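Accuracy alone hides the weak minority-class performance noted above; per-class recall and precision make it visible. A minimal sketch, using the Naïve Bayes hadith counts (40 True positives out of 86 interesting hadith; the 4 False positives are inferred from the 192 True negatives out of 196 non-interesting instances, an assumption):

```java
// Sketch: recall and precision for the interesting ("Yes") class,
// using the Naive Bayes hadith counts above (TP=40, FN=46, FP=4).
// FN and FP are inferred from the class totals, not quoted directly.
public class ClassMetrics {
    static double recall(int tp, int fn)    { return 100.0 * tp / (tp + fn); }
    static double precision(int tp, int fp) { return 100.0 * tp / (tp + fp); }
    public static void main(String[] args) {
        System.out.printf("recall=%.2f%% precision=%.2f%%%n",
                          recall(40, 46), precision(40, 4));
    }
}
```

Despite an 82.2695% accuracy, recall on the interesting class is only about 46.5%, which is why the resizing and feature-combination experiments below matter.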
Figure 5: Visualization of J48 classifier for QuranArabic10.arff
Figure 6: Visualization of J48 classifier for QuranArabic15.arff
Figure 7: Visualization of J48 classifier for HadithArabic7.arff
However, in relation to the 4th table, when the data set was resized the percentage of correctly
classified instances decreased. The J48 classifier scored 135 instances, equal to 74.1758%, which includes 37 True
positive and 98 True negative instances. However, another version scored 39 True positive
instances using the same classifier. The Decision Table classifier scored 134 instances, equal to 73.6264%. Similar to J48, it
scored 37 True positive instances and 97 True negative instances. The Naïve Bayes
classifier scored the highest results compared to the other classifiers used, reaching 138
correctly classified instances, which is equal to 78.8242%, including 41 True positive instances and 97
True negative instances. Moreover, in relation to the 5th table, after combining the frequency counts
into one single count, the results improved to 83.6879%, although the word list did not change. This
percentage includes 44 True positive instances and 192 True negative instances. Other versions of
the arff file that implemented this approach did not go below 80.8511%, which is better than the score
when each word frequency was counted separately. However, this may be due to selecting
random sets that may include none of the words that helped in the classification. Resizing the
data set to 100-86 did not produce any changes that could be considered useful.
5.3 Future Work
This project used two data sets to implement the classification task and test possible features
that could improve the classifier in text processing and classification. There are a number
of possible directions for future work on this task. One example that can be
implemented on the holy Quran data set is oversampling the subset that was used, increasing the
number of verses in the text file that relate to the hereafter concept. In relation to the hadith data set,
oversampling is also a possible idea. This can be implemented by including more hadith in the subset,
using other sources than the one used in this project. In addition, sub-sampling is possible in this data
set. Additionally, a new data set could be used for text classification on the hereafter subject,
since the Day of Judgment is a pillar of faith in Islam. An example could be a book about
the pillars of faith, which would include a section on the belief in the Day of Judgment. This section could
be used as a subset in testing the classifier and the features.
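The simplest form of oversampling can be sketched as cycling through the minority ("Yes") lines and duplicating them until a target size is reached; the future-work idea above is the stronger variant of adding genuinely new verses or hadith from other sources rather than duplicates:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of oversampling the minority "Yes" class: duplicate its
// instances round-robin until the class reaches a target size.
// This only balances the counts; the future-work suggestion above
// is to add genuinely new instances from other sources instead.
public class Oversample {
    static List<String> oversample(List<String> minority, int targetSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; out.size() < targetSize; i++) {
            out.add(minority.get(i % minority.size()));
        }
        return out;
    }
    public static void main(String[] args) {
        System.out.println(oversample(List.of("verse-a", "verse-b"), 5));
    }
}
```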
In relation to the previous suggestions for improving the data sets used in classification,
combining the holy Quran subset and the hadith subset is an alternative to oversampling. Combining
subsets from these main sources of Islam could help in classifying any data set gathered from
any source. This may also help in improving the results when a high-standard subset
is provided to supervised learning algorithms that process text and implement the classification
task.
All the previous examples concern different data sets that can be used with the hereafter
classifier. However, new subjects could be used in the process of text classification, using different
features that may include types other than word frequency counts, for example, the
length of the verse that holds a subject. In relation to text classification and detecting
suspicious text, future implementations may include a combination of different types of data sets, with
the classifier and the defined features then tested on them. Given this direction, and in relation to this
project, it may be implemented by combining the holy Quran data set and the hadith data set and
selecting some features that would help in the classification of both data sets.
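The verse-length feature suggested above could be as simple as a token count added as one more numeric attribute; a minimal sketch, noting that whitespace tokenisation is an assumption here and would need care with Arabic script:

```java
// Sketch of a verse-length feature: the number of whitespace-separated
// tokens in a verse, usable as an extra numeric attribute alongside
// the frequency counts. Whitespace tokenisation is an assumption and
// would need adapting for Arabic diacritics and punctuation.
public class VerseLength {
    static int verseLength(String verse) {
        String t = verse.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }
    public static void main(String[] args) {
        System.out.println(verseLength("In the name of God"));
    }
}
```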
Bibliography
1. Zhai, ChengXiang. Statistical Language Models. [Online] 2009. [Cited: 2 4 2011.] http://0-
www.morganclaypool.com.wam.leeds.ac.uk/doi/pdf/10.2200/S00158ED1V01Y200811HLT001.
2. Atwell, Eric and Sharaf, Abdul Baquee. Is Machine Learning useful for Qur'anic Studies? School
of Computing, University of Leeds. http://www.comp.leeds.ac.uk/eric/sharaf10jqsDraft.pdf.
3. Jackson, Peter and Moulinier, Isabelle. Natural Language Processing for online applications:
Text Retrieval, Extraction and categorization. s.l. : John Benjamins Publishing Company, 2002.
4. Liddy, Elizabeth D. Text Mining. [Online] 2000. [Cited: 20 02 2010.]
http://onlinelibrary.wiley.com/doi/10.1002/bult.184/pdf. 1550-8366.
5. Jbara, Khitam. Knowledge Discovery in Al-Hadith Using Text Classification. 2010, Vol. 6, 11.
http://www.jofamericanscience.org/journals/am-sci/am0611/48_3679am0611_409_419.pdf.
6. Al-Kabi, G Kanaan, et al. Statistical Classifier of the holy Quran Verses (Fatiha and Yaseen
Chapters). 2005, Vol. 5, 3, pp. 580-583. http://docsdrive.com/pdfs/ansinet/jas/2005/580-583.pdf.
7. S. Al-Harbi, A. Almuhareb, A. Al-Thubaity,M. S. Khorsheed, A. Al-Rajeh. Automatic Arabic
Text Classification. King Abdulaziz City for Science and Technology. Riyadh : s.n.
http://lexicometrica.univ-paris3.fr/jadt/jadt2008/pdf/harbi-almuhareb-thubaity-khorsheed-rajeh.pdf.
8. Zhu, X. and Goldberg, A.B. Introduction to Semi-Supervised Learning. s.l. : Morgan & Claypool,
2009.
9. Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski. Data mining: a knowledge discovery
approach. s.l. : Springer US, 2007.
10. The K-Means Clustering Machine Learning Algorithm. puremango. [Online] [Cited: 27 03 2011.]
http://www.puremango.co.uk/2010/01/k-means-clustering-machine-learning/.
11. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to
Information Retrieval. s.l. : Cambridge University Press, 2008. http://nlp.stanford.edu/IR-
book/pdf/17hier.pdf.
12. McCarthy, Diana. Word Sense Disambiguation: An Overview. 2009, Vol. 3, 2, pp. 537-558.
http://onlinelibrary.wiley.com/doi/10.1111/j.1749-818X.2009.00131.x/pdf.
13. Bramer, Max A. Principles of data mining Undergraduate topics in computer science. s.l. :
Springer, 2007.
14. Abney, Steven. Semisupervised Learning for Computational Linguistics. [ed.] David Madigan et
al. 2008, Vol. 34, 3, pp. 449--452. http://www.hlt.utdallas.edu/~vince/papers/abney.pdf.
15. Zhu, Xiaojin. Semi-Supervised Learning with Graphs. [Online] 2005. [Cited: 27 03 2001.]
http://www.lti.cs.cmu.edu/Research/Thesis/XiaojinZhu05.pdf.
16. Aphinyanaphongs, Yin and Aliferis, Constantin. Text Categorization Models for Identifying
Unproven Cancer Treatments on the Web. Vanderbilt University. 2007. http://www.dsl-
lab.org/Publications/Aphinyanaphongs_2007a.pdf.
17. Ian H. Witten, Eibe Frank. Data mining: practical machine learning tools and techniques. 2.
s.l. : Morgan Kaufmann, 2005.
18. Dimov, Rossen. Weka: Practical machine learning tools and techniques with Java
implementations. 2007. http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf.
19. Al-Farsi, May. Mayana2011. Blog Post. [Online] 2011. http://mayana2011.blogspot.com/.
20. Schütze, Christopher D. Manning and Hinrich. Foundations of statistical natural language
processing. s.l. : MIT Press, 1999.
21. Abbas, Noorhan and Atwell, Eric. quranytopics.appspot. Qurany. [Online] University of Leeds.
[Cited: 15 03 2011.] http://quranytopics.appspot.com/.
22. alqran wa Aolomoh. islamiyyat. [Online] 04 01 2009. [Cited: 25 04 2011.]
http://www.islamiyyat.com/alqranwa3olomoh/2009-03-26-15-21-49/106----1.html.
23. Sahih Muslim. al-eman. [Online] [Cited: 4 04 2011.] http://www.al-eman.com/library/book/book-
view.htm?id=1#s1.
24. Siddiqui, Abdul Hamid. English - Sahih Muslim. quaran truth. [Online] [Cited: 4 04 2011.]
http://www.quarantruth.com/PDFS/Hadith/English%20-%20Sahih%20Muslim.pdf.
25. qurancomplex. [Online] [Cited: 18 04 2011.] http://www.qurancomplex.org/default.asp?l=eng.
26. McKinlay, Andrew . ML classifier for small minority class. s.l. : [email protected], 4th April
2011.
27. Bilal M. Zahran, Ghassan Kanaan. Text Feature Selection using Particle Swarm Optimization
Algorithm. 2009, Vol. World Applied Sciences Journal 7. www.idosi.org/wasj/wasj7(c&it)/10.pdf .
28. Bharati, Akshar, Chaitanya, Vineet and Sangal, Rajeev. Natural Language Processing: A
Paninian Perspective.
29. Dukes, Kais, et al. Online Visualization of Traditional Quranic Grammar using Dependency
Graphs. School of Computing, University of Leeds. http://www.comp.leeds.ac.uk/scsams/qcorpus-
fal2010.pdf.
30. Atwell, Eric and Sharaf, Abdul Baquee. Creating a Gold Standard Corpus for Related Texts.
School of Computing, University of Leeds.
http://www.comp.leeds.ac.uk/eric/cl2011/sharaf11clAbstract.docx.
31. Juola, Patrick. Authorship Attribution. s.l. : Now Publishers Inc, 2008.
32. D. J. Hand, Heikki Mannila, Padhraic Smyth. Principles of data mining. s.l. : MIT Press, 2001.
Appendix A
Personal Reflection
As a computer science student, I selected this project because I was interested in the natural language
processing area. However I did not have any pre-knowledge of a real life application that it solved,
apart from a module that I took in my final year. This module helped me in understanding what
natural language is, and this project helped me in implementing my understanding. In addition, I had
to look for previous examples that were already available to understand what had already been
implemented. There were examples of projects implemented in the School of Computing at the
University of Leeds, and other examples from research at other universities that were accessible
by looking at their publications over the internet. Then I started working on my project in order to
implement the solution that would help in text processing and classification, based on what I have
read and understood from my background reading. I had many questions for my supervisor in order to
confirm my understanding of an idea I had, and this made me more curious about this research area. I
hope I can use my knowledge and recall all that I have done in this project in my professional life.
From the beginning of the project it was essential to keep track of the time and work that should be
completed. To keep track of my work I created a blog that included some information that I gathered
from the papers I read. However it was not used much when the implementation was done. Obviously,
there were times when I was late and behind schedule, especially when I had to find a new data set
and test the classifier on it. I had to look for a data set that would include the hereafter subject and
then create the data set text file and subset manually, which included reading instructions to
understand the books from where I retrieved the data. However, I managed to get back on track by
making every effort to catch up when setting up the data and designing a Java program that would
help in formatting it. To overcome this problem I encourage further students to work on their projects
with a good management of time. Overall this project helped me in understanding how text can be
useful in many different applications in real life. For example, through text processing we can have
translations into many languages, and be able to recognize books and to whom they belong from the
style. Classifying text can help in many aspects of the world, and in this project I provide a
contribution to help in the project "Detecting Terrorist Activities: Making Sense", currently operating
at the University of Leeds. Much of the research that has been done at the university into the holy
Quran is used in helping Islamic scholars to retrieve whatever information is needed from the holy
Quran. Similarly, new research on hadith is carried out, using natural language processing
techniques to classify the sayings of the prophet, which can be used as sources in other Islamic
studies. I recommend that future students pick a project related to natural language
processing which they feel comfortable working with, since it is helpful in many applications
in the real world.
Appendix B
Materials used in the Project
There are many text files that were used in order to implement the classification task. This material is
provided on a CD attached to the submitted copy. The CD is included because some of
the material used runs to more than 300 lines, so it was best to write it to a CD and attach it
to the report. The CD is divided as follows:
Quran data set Folder:
This folder is divided into two folders depending on the corpora language. Both folders include all
files that are used in the implementation as follows:
English version
1. allQuran.txt: text file provided by Claire Brierley.
2. csubsetQuran.txt: text file provided by Claire Brierley.
3. cnoarff.java: used to write the noQuran.arff file.
4. allQuran.java: used to write the allQuran.arff file and combine noQuran.arff with
allQuran.arff.
5. allQuran.arff: includes all lines of verses labelled Yes in arff format.
6. noQuran.arff: includes all lines of verses labelled No in arff format.
7. complement.txt: all verses that are not in csubsetQuran.txt.
Arabic version
1. ArabicQuran.java: used in creating the QuranArabic.arff files.
2. allQuranArabic.txt: text file retrieved by downloading it from the internet and used
in ArabicQuran.java.
3. subsetQuranArabic.txt: text file retrieved using allQuranArabic.txt and
csubsetQuran.txt, and used in ArabicQuran.java.
4. ComplementArabicQuran.txt: text file retrieved from ArabicQuran.java.
5. meccan-madinan.arff: used in ArabicQuran.java to include Meccan-Madinan feature.
6. QuranArabic#.arff: text file retrieved from ArabicQuran.java (#: refers to the version
number of the arff file used in the implementation which is 1-15).
7. RandomSubsetArabicQuran#.txt: text file retrieved from ArabicQuran.java (#: refers
to the version number of the random set that corresponds to the arff file used in the
implementation which is 1-15).
Hadith data set Folder:
This folder includes all files that are used in the implementation as follows:
1. SetupData.java: used in formatting the hadith data set.
2. ArabicHadith.java: used in creating the HadithArabic.arff files.
3. SahihMuslim.txt: text file used in SetupData.java.
4. subsetHadith.txt: text file used in SetupData.java.
5. AllHadithArabic.txt: text file retrieved from SetupData.java and used in ArabicHadith.java.
6. SubsetHadithArabic.txt: text file retrieved from SetupData.java and used in
ArabicHadith.java.
7. ComplementArabicHadith.txt: text file retrieved from ArabicHadith.java.
8. HadithArabic#.arff: text file retrieved from ArabicHadith.java (#: refers to the version
number of the arff file used in the implementation which is 1-12).
9. RandomSubsetArabicHadith#.txt: text file retrieved from ArabicHadith.java (#: refers to the
version number of the random set that corresponds to the arff file used in the implementation,
which is 1-12).
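For readers without the CD, the arff files listed above follow WEKA's standard format. A minimal sketch of how a program such as ArabicQuran.java or ArabicHadith.java might emit one; the relation and attribute names here are illustrative, not the project's actual ones:

```java
import java.util.List;

// Sketch of emitting a WEKA arff file like those listed above.
// Each row holds one combined frequency count and a Yes/No label.
// Relation and attribute names are illustrative assumptions.
public class ArffWriterSketch {
    static String toArff(String relation, List<int[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute count numeric\n");
        sb.append("@attribute class {Yes,No}\n\n");
        sb.append("@data\n");
        for (int[] r : rows) {                // r[0] = count, r[1] = 1 for Yes
            sb.append(r[0]).append(',')
              .append(r[1] == 1 ? "Yes" : "No").append('\n');
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.print(toArff("QuranArabic",
                List.of(new int[]{3, 1}, new int[]{0, 0})));
    }
}
```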