The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student):
Detecting Arabic Text
May Ali AL-Farsi
Computer Science (BSc)
2010/2011
Summary
This report investigates machine learning algorithms that are used in text retrieval. It aims to
implement classification to categorize verses of the Quran and sayings in the Hadith. The data used
in this task had to be reformatted into corpus form before classification. The classification was
performed using two corpora: the first includes all the text of the holy Quran verses or of the
prophet's sayings, and the second includes only the text that holds the interesting verses or sayings.
Since this project classifies Arabic text, these two data sets were selected for their availability in
that language. Another reason is that this project uses a supervised method, which requires prior
knowledge of the data set in order to label it. Both data sets were pre-classified: the holy Quran in a
previous text classification implementation, and the Hadith by Islamic scholars in their books, such
as Sahih Muslim. The classification task was to divide the text into two classes, interesting and
non-interesting, based on a defined subject. The subject of hereafter concepts was selected to
classify verses of the Holy Quran or sayings of the prophet as interesting. Predefined features,
selected on the basis of Islamic scholars' research, were used to help in the classification tasks.
WEKA was used to implement the classification task after retrieving the required information.
Acknowledgements
In the beginning I would like to thank my supervisor, Eric Atwell, whose help and
supervision led me throughout the project. I would also like to thank Katja Markert for her
useful feedback on my mid-report.
Finally, I would like to thank all PhD students at the University of Leeds who supported me in any
respect during the completion of the project.
Contents
Chapter 1 ............................................................................................................................................. 1
Introduction ............................................................................................................................................. 1
1.1 Overview ....................................................................................................................................... 1
1.2 Aim ............................................................................................................................................... 2
1.3 Objectives ..................................................................................................................................... 2
1.4 Minimum Requirements ............................................................................................................... 3
Chapter 2 ................................................................................................................................................. 4
Background Research ............................................................................................................................. 4
2.1 Machine Learning ......................................................................................................................... 4
2.2 Natural Language Processing........................................................................................................ 4
2.3 Text Classification ........................................................................................................................ 5
2.3.1 Text Classification methods ................................................................................................... 6
2.3.2 Classification Examples ......................................................................................................... 8
2.4 WEKA......................................................................................................................................... 11
2.4.1 Graphical interface & Performance ..................................................................................... 11
2.4.2 Data format: ARFF file and Processing ............................................................................... 12
Chapter 3 ............................................................................................................................................... 13
Project Plan ........................................................................................................................................... 13
3.1 Procedures and Deliverables ....................................................................................................... 13
3.2 Schedule ...................................................................................................................................... 14
3.3 Methodology ............................................................................................................................... 14
Chapter 4 ............................................................................................................................................... 16
Design and Implementation .................................................................................................................. 16
4.1 Classification Overview .............................................................................................................. 16
4.2 Features ....................................................................................................................................... 16
4.3 Approaches ................................................................................................................................. 17
4.3.1 The holy Quran data set ....................................................................................................... 19
4.3.2 The hadith data set ............................................................................................................... 23
Chapter 5 ............................................................................................................................................... 27
Results and Evaluation .......................................................................................................................... 27
5.1 WEKA Results ............................................................................................................................ 27
5.1.1 The holy Quran data set results ............................................................................................ 27
5.1.2 The hadith data set results .................................................................................................... 32
5.2 Evaluation of the Model .............................................................................................................. 36
5.2.1 Evaluation on data sets ......................................................................................................... 36
5.2.2 Evaluation of feature ............................................................................................................ 36
5.3 Analysis....................................................................................................................................... 37
5.4 Future Work ................................................................................................ 40
Bibliography ......................................................................................................................................... 42
Appendix A ........................................................................................................................................... 44
Personal Reflection ............................................................................................................................... 44
Appendix B ........................................................................................................................................... 46
Materials used in the Project ................................................................................................................. 46
Chapter 1
Introduction
1.1 Overview
This project classified interesting and non-interesting text by designing a system that retrieves
information useful in the implementation of a classifier, and then attempts a number of approaches to
evaluate the classifier's accuracy. In order to retrieve the required information, features that
specify the interesting subject in the text were defined. According to ChengXiang Zhai [1], one way of
representing the data used in text mining is topic model labelling, implemented through supervised
methods for text classification. This project labels the data in several different ways, which will be
explained later in the design section.
The data sets selected for text classification in this project are the holy Quran and, as an
additional data set, the Hadith. Since this project classifies Arabic text, these two data sets were
available in that language. Another reason for selecting them is that this project uses a supervised
method, which requires prior knowledge of the data set: instances are first labelled with the classes
they belong to, and supervised algorithms are then used to train and test the classification. Both
data sets are pre-classified, the holy Quran in a previous text classification implementation and the
Hadith by Islamic scholars in their books, such as Sahih Bukhari and Sahih Muslim. Both data sets
hold the hereafter concepts, so this subject was selected to classify verses of the Holy Quran or
sayings of the prophet as interesting. The first implementation was on the holy Quran, and the
classification was tested on the English data set. The English version of the holy Quran was provided
by Claire Brierley, a researcher at the University of Leeds, while the Arabic version was downloaded
from the internet and structured for use in the implementation. The Arabic version of the subset was
created manually using the provided English subset and the downloaded full Arabic version of the holy
Quran. In addition, the Hadith was added as a new data set to implement the classifier and test
whether it behaved as expected. This data set was created manually from Sahih Muslim to include the
sayings of the prophet Muhammad (May peace be upon him) and was later used in the implementation.
Creating the Hadith data set was challenging, since it had never been created before and it was
essential to reformat it into a corpus that could be used in the implementation. The main challenge,
however, was improving the result of the classification when the data was skewed.
The outline of the report follows the procedures that were carried out to understand the problem
of text classification and to implement a solution. The following chapter is the background review,
which represents the first attempt to understand the problem from previous work and implementations.
The third chapter describes the procedures and deliverables, the schedule, and the methodology applied
in this project. The fourth chapter covers design and implementation: it includes an overview of the
classification task, a description of the features defined to aid the classification, and the
approaches that were considered together with the classification results from WEKA. The final chapter
is the evaluation, which includes an evaluation of the data sets and the selected features, an
analysis of the results, and future work. In addition, Appendix A contains a personal reflection, and
Appendix B describes the contents of the appendix CD.
1.2 Aim
The aim of the project is to classify the verses of the Arabic version of the Holy Quran into
two classes: interesting and non-interesting. This is implemented using supervised learning
algorithms, training sets, and the Quran corpus. The system builds classifications based on predefined
features in an open-source software tool called WEKA, which helps in analysing the performance of the
defined features.
1.3 Objectives
The main objectives of the project are to:
1. Provide a good background review of text classification and machine learning algorithms.
2. Build a system that classifies data sets into interesting and non-interesting types based on
features defined by the user to describe the interesting set. In this case the interesting set is
the verses of the holy Quran that relate to the hereafter concept.
3. Research how to define beneficial features in order to improve the text classification. In this
project the defined features must help in identifying the verses that concern the hereafter; this
may include signs of the hereafter, names of the hereafter, and the rewards in the hereafter.
4. Generate class labels for all verses in the holy Quran, ready for training and testing.
5. Try 10-fold cross-validation as a testing method.
6. Consider WEKA's classification options applied to the designed .arff file.
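Objective 5 calls for 10-fold cross-validation. As a rough sketch of the idea (illustrative only; the project itself would rely on WEKA's built-in cross-validation rather than code like this), the following Java fragment splits instance indices into ten disjoint folds:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of splitting instance indices into k disjoint
    folds for cross-validation. */
public class CrossValidation {

    /** Returns 'folds' disjoint index lists that together cover 0..n-1.
        Indices are dealt round-robin; a real split would shuffle first. */
    public static List<List<Integer>> foldIndices(int n, int folds) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < folds; f++) result.add(new ArrayList<>());
        for (int i = 0; i < n; i++) result.get(i % folds).add(i);
        return result;
    }

    public static void main(String[] args) {
        // Each fold serves once as the test set; the other nine train.
        List<List<Integer>> folds = foldIndices(100, 10);
        System.out.println(folds.size() + " folds, first fold size " + folds.get(0).size());
    }
}
```

Each fold in turn serves as the test set while the remaining nine are used for training, so every instance is tested exactly once.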
1.4 Minimum Requirements
The minimum requirements are:
1. Understand how categorizing should be done in this project and how to build an accurate
classifier.
2. Build a Java program that helps in retrieving the required information from the holy Quran and
generates an .arff file for use in classification tools.
3. Train and test the data set.
4. Implement the classifier and evaluate the results.
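The .arff file mentioned in requirement 2 is WEKA's plain-text data format: a header declaring attributes, followed by a @data section of comma-separated rows. The sketch below shows one possible way such a file could be generated; the relation name, attribute names and keyword-count features are illustrative assumptions, not the project's actual design.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of building WEKA .arff text for verse classification. */
public class ArffWriterSketch {

    public static String toArff(List<String> verses, List<String> labels,
                                List<String> keywords) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation quran_verses\n\n");
        // One numeric attribute per keyword, plus the class attribute.
        for (String kw : keywords)
            sb.append("@attribute ").append(kw).append("_count numeric\n");
        sb.append("@attribute class {interesting,non-interesting}\n\n@data\n");
        for (int i = 0; i < verses.size(); i++) {
            for (String kw : keywords)
                sb.append(countOccurrences(verses.get(i), kw)).append(",");
            sb.append(labels.get(i)).append("\n");
        }
        return sb.toString();
    }

    static int countOccurrences(String text, String word) {
        int count = 0, idx = 0;
        while ((idx = text.indexOf(word, idx)) != -1) { count++; idx += word.length(); }
        return count;
    }

    public static void main(String[] args) {
        System.out.print(toArff(Arrays.asList("paradise and paradise"),
                Arrays.asList("interesting"), Arrays.asList("paradise")));
    }
}
```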
Chapter 2
Background Research
2.1 Machine Learning
Machine learning is a scientific field within artificial intelligence. It enables computers to find
patterns in, classify, or cluster a given set of text. Classification is one of the important tasks
that can be accomplished in machine learning via natural language processing. Computers are trained to
classify instances from given attributes, and experts then evaluate the algorithm to check whether it
meets the requirements and to report the percentage accuracy of the outcome [2]. Natural language
processing tasks can be implemented using machine learning methods.
2.2 Natural Language Processing
Natural Language Processing (NLP) can be defined as a computational approach to analysing
text, based on a set of theories and technologies. The term NLP is normally used to describe the
function of software or hardware components in a computer system which analyse or synthesize spoken or
written language. The description "natural" is meant to distinguish human speech and writing from more
formal languages such as mathematical or logical notations, or computer languages such as Java and
C++ [3]. To achieve its tasks NLP uses linguistic tools that help in text mining and information
retrieval, and it includes many important techniques for applying and extracting knowledge
automatically from text. Documents are split into paragraphs, then into sentences, which are
eventually split into words. These words are tagged by part of speech, grammatical analysis and other
features before the sentence is parsed. Parsing therefore builds on sentence delimiters, tokenizers,
stemmers and part-of-speech taggers. In general, sentence boundaries can be detected accurately by
stating delimiters that rely on regular expressions or punctuation signs, which can disambiguate, for
example, the end of a sentence; this type of analysis can also rely on training corpora, or use richer
evidence such as part-of-speech frequencies. Furthermore, tokenization can be used to disambiguate
punctuation characters, since tokenizers are lexical analysers that divide a stream of characters into
meaningful units called tokens. Parsing cannot proceed in the absence of lexical analysis, so stemmers
are also required: stemmers are morphological analysers that identify alternative terms for a word
using its root form. Part-of-speech taggers build upon tokenizers and sentence delimiters to label
every word with its proper tag, such as noun, verb or adjective. Finally, parsing may be accomplished
by addressing a simplified variant of the full parsing problem, which helps in extracting interesting
parts of the text, or it can be done with respect to a grammar, which is a set of rules stating which
combinations of parts of speech generate well-formed phrase and term structure [3].
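As a minimal illustration of the sentence-delimiting and tokenizing steps described above, the following Java sketch uses regular expressions. A real pipeline would also have to disambiguate abbreviations, and Arabic text would need fully Unicode-aware handling; this sketch ignores both.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal regex-based sentence delimiting and tokenization. */
public class SimplePipeline {

    /** Split on sentence-final punctuation followed by whitespace. */
    public static List<String> sentences(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    /** Tokenize a sentence into lower-case word tokens. */
    public static List<String> tokens(String sentence) {
        return Arrays.asList(sentence.toLowerCase()
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(sentences("One sentence. Another one!"));
        System.out.println(tokens("Hello, world!"));
    }
}
```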
2.3 Text Classification
Text mining is the practice of analysing text and applying specific algorithms to it to detect
and extract useful content. It uses techniques primarily developed in the fields of information
retrieval, statistics, and machine learning, and is achieved in three main stages. The first is the
text preparation stage, in which the text is selected and managed, producing training sets created
manually by AI experts. The second is the text processing stage, which applies text mining algorithms
to the prepared data sets. At this stage a fully featured natural language processing system defines
rules, attributes and features of the provided data to be used in designated algorithms or techniques
such as decision trees [4]. NLP systems usually convert the input to an internal representation that
interacts with external language resources such as dictionaries, in order to produce a useful analysis
and annotation of the input text. This can be used in many types of application, such as automatic
question answering, text summarization, machine translation into another language, analysis of
customer preferences, and automated tagging of internet advertising [2]. The third is the text
analysis stage, which consists of the evaluation and demonstration of the assumptions made by the
experts at the beginning; the extracted text is passed to text tools to visualise and eventually
present a detailed analysis [4]. According to Khitam Jbara [5], automatic text classification is an
essential research subject in text classification research, owing to the large number of digital
documents in use. In addition, according to Al-Kabi et al. [6], Automatic Text Categorization (ATC)
refers to producing software that uses predefined categories to handle "unseen" text files.
Text classification has been the subject of much research in different languages around the
world. Projects on techniques to classify texts in languages such as English and various European and
Asian languages are widespread. This project is an example of one such effort: classifying Arabic
text. Automatic Arabic text classification is mostly implemented in the same sequence: first, compile
the text documents into a corpus, optionally labelling it; secondly, select the most suitable features
for classification; and thirdly, select the classification algorithms to be trained and tested [7].
2.3.1 Text Classification methods
Data are divided into two main types based on their attributes. First, data in which the class
attributes are known in advance and intended to be used are called labelled data. This type of data
uses supervised learning methods for data mining; if the attribute to predict is categorical the task
is classification, while predicting a numerical attribute is called regression. Secondly, data for
which no class attributes are available are called unlabelled data, from which the aim is to extract
information. This type of data uses unsupervised learning methods. In addition, semi-supervised
learning is another machine learning method used in text classification, and is a combination of the
two previously mentioned methods.
2.3.1.1 Unsupervised Learning
Unsupervised learning uses training samples that include a number of instances, but these
instances do not direct the results of the training [8]. Clustering is considered an alternative term
for unsupervised learning: it can be the objective in the implementation of some solutions to a
problem, and it can be hierarchical or model-based [9]. Since information extraction from unlabelled
data is done with unsupervised methods, it is important to understand that clustering groups objects
that are similar to each other into a cluster. One example is k-means clustering, considered one of
the simplest unsupervised learning algorithms. It identifies groups of similar instances in a data
set, without labelling or guidance, by defining points in a dimensional space, where the number of
attributes defines the space's dimensions [10]. Another example is hierarchical clustering, which can
be top-down or bottom-up. According to Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze [11], the bottom-up type is more commonly used in information retrieval, and it can be
visualised in a dendrogram. The top-down method, on the other hand, repeatedly splits a cluster into
points until each point stands as a cluster by itself.
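The k-means procedure described above, assigning each point to its nearest centroid and recomputing centroids as cluster means, can be sketched in one dimension as follows. The data and starting centroids are invented for illustration.

```java
import java.util.Arrays;

/** Minimal 1-D k-means sketch: assign each point to the nearest
    centroid, recompute centroids as cluster means, repeat. */
public class KMeans1D {

    public static double[] cluster(double[] points, double[] initial, int iters) {
        double[] centroids = initial.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                // Find the nearest centroid for this point.
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            // Move each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] c = cluster(new double[]{1, 2, 10, 11}, new double[]{0, 12}, 5);
        System.out.println(Arrays.toString(c));
    }
}
```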
2.3.1.2 Supervised Learning
Supervised learning uses a corpus to build and train a system based on specific features. A
system of this type determines in advance which text content should be considered and what types of
information to use from that content, and it determines the available methods of combining the
contextual evidence from the feature values in the training data [12]. There are many approaches to
supervised machine learning, such as decision trees, the Naïve Bayes classifier and the
nearest-neighbour algorithm [13]. According to [3], classification using supervised machine learning
methods is achieved if the following conditions are fulfilled:
1. Pre-classification of the data that will be analysed.
2. "In the simplest case, these classes should be disjoint."
3. If the data cannot be split into classes, then convert the classification into n corresponding
sub-problems, where each sub-problem classifies data into those that belong to the corresponding
category and those that do not {Yes, No}.
One example of a supervised classification method is the Naïve Bayes classifier, which uses
probability theory to reach the possible classifications. Bayes' theorem was constructed by Thomas
Bayes, the first mathematician to use probability in this way [13]. Naïve Bayes looks at the
distribution of terms, either with respect to their frequencies or with respect to their presence or
absence: it considers the probability of a certain feature given the data set and classifies
accordingly, by calculating a function over the frequencies of the terms' occurrences [3].
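The Naïve Bayes calculation described above, a class prior combined with per-class word probabilities, can be sketched as follows. This is a toy two-class version with Laplace smoothing, for illustration only; it is not the project's actual classifier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy two-class Naïve Bayes over word frequencies with Laplace smoothing. */
public class TinyNaiveBayes {
    private final List<Map<String, Integer>> wordCounts =
            Arrays.<Map<String, Integer>>asList(new HashMap<>(), new HashMap<>());
    private final int[] totalWords = new int[2];
    private final int[] docCounts = new int[2];
    private final Set<String> vocab = new HashSet<>();

    /** Count the words of one training document under its class label. */
    public void train(String doc, int label) {
        docCounts[label]++;
        for (String w : doc.toLowerCase().split("\\s+")) {
            wordCounts.get(label).merge(w, 1, Integer::sum);
            totalWords[label]++;
            vocab.add(w);
        }
    }

    /** Return the class (0 or 1) with the higher log-posterior. */
    public int classify(String doc) {
        double[] logProb = new double[2];
        int totalDocs = docCounts[0] + docCounts[1];
        for (int c = 0; c < 2; c++) {
            // Smoothed class prior plus smoothed per-word likelihoods.
            logProb[c] = Math.log((docCounts[c] + 1.0) / (totalDocs + 2.0));
            for (String w : doc.toLowerCase().split("\\s+"))
                logProb[c] += Math.log((wordCounts.get(c).getOrDefault(w, 0) + 1.0)
                        / (totalWords[c] + vocab.size()));
        }
        return logProb[1] > logProb[0] ? 1 : 0;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("paradise reward hereafter", 1);
        nb.train("trade market money", 0);
        System.out.println(nb.classify("reward paradise"));
    }
}
```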
Another way to apply supervised learning is to build a tree that contains features from the data
sets. The root of the tree is the categorisation used to differentiate the classes; the nodes are
decision points that test a feature, and a branch corresponds to a value of the test's result. If the
tree is built from a pre-classified training set, this forms an inductive learning technique.
According to [3], "the decision tree method works best with large data sets; training sets that are
too small will lead to overfitting. The data must be in a regular attribute-value format. Thus each
datum must be capable of being characterized in terms of a fixed set of attributes and their values,
whether symbolic, ordinal or continuous. Continuous values can be tested by thresholding." Assuming
that they are applicable, decision tree methods can have a number of advantages over more conventional
statistical methods [3]:
1. They make no assumptions about the distribution of attributes' values.
2. They do not assume conditional independence of attributes.
2.3.1.3 Semi-Supervised Learning
Another method of machine learning is semi-supervised learning. This method is essentially a
combination of supervised and unsupervised learning: the data set includes labelled data along with
unlabelled data. It is claimed that semi-supervised learning yields higher accuracies than the other
two methods. According to
Steven Abney [14], there are six different implementations of semi-supervised learning methods, which
are self-training, agreement-based methods (the co-training algorithm), clustering algorithms,
boundary-oriented methods, label propagation in graphs, and spectral methods. According to [15], the
self-training method has been used in text categorisation by training a classifier on labelled data
and then applying it to unlabelled data. Co-training is another way of implementation, accomplished by
splitting the features into subsets and using two classifiers. The two classifiers are trained on the
labelled data, each on one of the subsets, and then each classifier is used on the unlabelled data.
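The self-training loop can be loosely illustrated as follows. Here a seed keyword "classifier" (an invented stand-in for a real trained model) labels documents in the unlabelled pool, and the words of documents it labels positive are absorbed as new keywords for the next round; this is only a sketch of the idea, not a full self-training implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Loose illustration of one self-training round with a keyword
    "classifier" standing in for a real trained model. */
public class SelfTrainingSketch {

    public static Set<String> expandKeywords(Set<String> seed, List<String> pool) {
        Set<String> keywords = new HashSet<>(seed);
        for (String doc : pool) {
            String[] words = doc.toLowerCase().split("\\s+");
            // "Classify" the document: positive if any keyword appears.
            boolean positive = false;
            for (String w : words) if (keywords.contains(w)) positive = true;
            // Absorb the document's vocabulary if it was labelled positive.
            if (positive) keywords.addAll(Arrays.asList(words));
        }
        return keywords;
    }

    public static void main(String[] args) {
        Set<String> seed = new HashSet<>(Arrays.asList("paradise"));
        System.out.println(expandKeywords(seed,
                Arrays.asList("paradise reward", "market trade")));
    }
}
```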
2.3.2 Classification Examples
Islamic scholars have shown interest in classifying the verses of the Quran for a number of
reasons. For example, classification of verses helps in understanding the context of the chapters,
analysing the evolution and wisdom behind some Islamic rulings in the Quran, and understanding the
history of the Islamic nation and the biography of the Prophet Muhammad (May peace be upon him) [2].
In addition, according to Khitam Jbara [5], "Al hadith is the saying of Prophet Muhammad (May peace
be upon him) and the second religious source for all Muslims". It is important to be able to classify
the hadith into subjects that serve as classes in research. Two examples are discussed in this
section: Quranic classification and hadith classification. Both required a predefined corpus to train
and test the classifier. A corpus is a natural source of language information, since it illustrates
linguistic knowledge; corpora are used as data analysis tools to determine patterns and for other
language processing tasks, and a corpus may be created manually to serve the aims and objectives of a
study. Text mining procedures can be carried out on large data sets such as the Quran corpus, using
words and interesting features defined by the user to retrieve information. The use of a corpus helps
to reduce the time and effort of classifying manually, and most text classification examples use
corpus-based research. The following examples explain how each corpus was created and the methods and
algorithms that were implemented to achieve the classification.
Quranic classification
According to Al-Kabi, G. Kanaan et al. [6], the objective of the study was to classify the
verses (Ayat) of The Opening (Fatiha) and Yaseen (Ya-seen) chapters according to Islamic scholars,
using a linear classification function. This was accomplished by building a system intended to
categorize the different verses in each chapter. The system was designed in Microsoft Visual Basic
because it supports Unicode for the Arabic language. The implementation began with the selection of
the chapter and verse to classify; a list of words was generated as features, their occurrences
counted, and a check made of which subject they related to in order to create a class for that
subject. A corpus was created to build a list of subjects, generated manually for the selected
chapters. Secondly, the verses were normalised by removing punctuation and parsing the verses into
tokens. Categorisation into classes was made according to the subjects created by the system; a
function was then defined to represent the percentage match for a specified subject, and the
highest-scoring classifications were recorded. This system scored 91% accuracy in classifying the
different verses.
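The keyword-counting approach just described, scoring a verse against each subject's word list and keeping the highest-scoring subject, can be sketched as follows. The subjects and word lists here are invented placeholders, and ties are broken arbitrarily.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of subject scoring by keyword counting: the subject whose
    word list matches the most verse tokens wins. */
public class SubjectScorer {

    public static String classify(String verse, Map<String, List<String>> subjects) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, List<String>> e : subjects.entrySet()) {
            int score = 0;
            for (String token : verse.toLowerCase().split("\\s+"))
                if (e.getValue().contains(token)) score++;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subjects = new HashMap<>();
        subjects.put("hereafter", Arrays.asList("paradise", "reward"));
        subjects.put("worship", Arrays.asList("prayer", "fasting"));
        System.out.println(classify("the reward of paradise", subjects));
    }
}
```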
In addition, research at the University of Leeds was conducted to classify the Quran verses
into Meccan and Madinan. In this research the classification was based on the migration of the
Prophet Muhammad (May peace be upon him). The algorithm relied on the pre-knowledge of Quran scholars,
who had already defined some of the chapters as Meccan or Madinan. The accuracy of the algorithm
depended on the feature set of keywords, the availability of a training set from the chapters that had
been classified earlier, and the use of the developed Quranic corpus. In contrast to the previous
study, this research defined 14 features obtained from scholars. These features were converted into
countable keywords and their frequencies obtained. After defining these features, the classification
was carried out using the open source software WEKA [2].
Hadith classification
Another example of text classification is hadith classification. According to Khitam Jbara
[5], the motivation for the study was the importance of the hadith, and of its correct classification, to
Islamic studies. The hadith text set was obtained from Sahih al-Bukhari, the hadith book used by
most research involving hadith. The research classified the hadith into the subjects that the prophet
had talked about at that time, resulting in thirteen classes. The system the scholars proposed included
four main procedures: processing; training, which included the features selected to help the
classification; classification, which included a method to classify the hadith; and finally the analysis
of the classification results.
The scholars had to create a hadith corpus to carry out their classification. To obtain the
training set, the part of each hadith that names the individuals who transmitted it from the prophet
was first removed. Secondly, the hadith were tokenised into words and punctuation marks were
removed. In addition, words such as pronouns and prepositions, together with names of people or
places, were removed from the hadith set, and what remained was treated as features. The last step
in constructing the corpus was stripping prefixes and suffixes from the features and eliminating
words that did not make sense after stemming. The scholars eventually had 19 features that were
weighted and used to classify the hadith. Supervised classification was used, as they created a
training text file from which to extract the features. Three classification methods were implemented:
AL-Kabi's method, Word Based Classification (WBC), and Stem Expansion Classification (SEC).
SEC achieved the best results of all the methods that were implemented.
Medical classification
According to Aphinyanaphongs and Aliferis [16], there were many publications on the
internet promoting "unproven treatments" and inaccurate medications, while cancer patients are in a
vulnerable condition. In order to overcome these problems, research was carried out to identify the
web pages that made such claims. One of the claims addressed was that people who are not real
physicians gave medical advice and treatments; moreover, some cancer patients had made online
purchases of abnormal medication. The set used to retrieve the information was gathered from
unproven treatments identified by the Quackwatch website. To create the corpus the scholars
selected eight unproven
treatments, to name a few: "Cure for all Cancers", "Metabolic Therapy", "Cellular Health", and
"Insulin Potentiation Therapy". The web pages were identified by appending the words cancer and
treatment and retrieving the top results from a Google query. The web pages were labelled as
positive or negative: pages that include unproven claims were labelled positive, the others negative.
Eventually the corpus included 191 web pages labelled positive and 98 labelled negative. The web
pages were converted to lists of words by removing script tags and
replacing punctuation with spaces to obtain words. The words were then stemmed, and words that
appeared in fewer than three web pages were removed. A number of algorithms were implemented
and compared with each other. One of the algorithms counted the frequencies of the terms and
applied a user-defined threshold.
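The frequency-with-threshold algorithm just mentioned can be sketched as a simple filter. This is an illustration only; the class and method names, the sample terms, and the threshold value are invented for the sketch and do not come from the paper:

```java
import java.util.*;

public class ThresholdFilter {
    // Count how often each term occurs, then keep only terms whose
    // frequency reaches a user-defined threshold.
    static Map<String, Integer> frequentTerms(List<String> terms, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : terms) counts.merge(t, 1, Integer::sum);
        counts.values().removeIf(c -> c < threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cancer", "cure", "cancer", "cure", "cancer", "therapy");
        System.out.println(frequentTerms(terms, 2)); // prints {cancer=3, cure=2}
    }
}
```

A TreeMap is used here only so the printed output has a stable order.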
2.4 WEKA
WEKA was developed at the University of Waikato in New Zealand; the name stands for
Waikato Environment for Knowledge Analysis. It is open source software written in Java and
released under the terms of the GNU General Public License. WEKA runs on any platform or
operating system. It offers an easy interface to many different learning algorithms, along with
methods for pre- and post-processing and for evaluating the results of learning schemes on any given
dataset [17].
2.4.1 Graphical interface & Performance
WEKA has three graphical interfaces. The first, the Explorer, gives the user access to all of
WEKA's facilities through menu selection and form filling. The second is the Knowledge Flow
interface, which lets the user design configurations for streamed data processing: the user specifies a
data stream by connecting components representing data sources, pre-processing tools, learning
algorithms, evaluation methods, and visualisation modules. The third interface, the Experimenter, is
designed to help the user answer basic practical questions that arise when applying classification and
regression techniques.
Figure 1: WEKA Explorer
WEKA provides a variety of learning algorithms that the user can easily apply to a dataset.
It also includes a selection of tools for transforming datasets, such as algorithms for discretisation.
The user can apply any method to a dataset by loading the dataset into a learning algorithm and
analysing the results. WEKA comprises methods for the standard data mining problems, such as
regression, classification and clustering. One way of using WEKA is to apply a learning method to a
dataset and analyse its output to learn about the data. Another is to use models that have already
been learnt to produce predictions on new instances. A third is to apply several different learning
methods and compare their performance in order to choose one for prediction [17].
2.4.2 Data format: ARFF file and Processing
The dataset is loaded into the WEKA Explorer in a number of formats, such as spreadsheets
and database formats, but the built-in format is the ARFF file. This file has three parts:
1. Relation name: the first line in the file should be a relation name, starting with @relation,
and can be any meaningful name [2].
2. Attribute list: each attribute starts with the @attribute keyword, followed by a name defined
by the user and the type of the attribute [2].
3. Data set: finally, the WEKA ARFF file expects the data itself, introduced by @data. Each
line is a row of comma-separated attribute values [2].
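As an illustration, a minimal ARFF file in the spirit of this project might look like the following; the relation name, attribute names and data rows here are hypothetical, not taken from the project's actual file:

```
@relation quran-hereafter

@attribute id string
@attribute hereafter numeric
@attribute judgment numeric
@attribute class {Yes, No}

@data
"1|4",0,1,No
"2|16",0,0,No
```

Each @data row supplies one value per declared attribute, in declaration order, with the class label last.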
After the data is loaded into the Explorer, the attributes defined in the .arff file appear in the
Preprocess tab. In addition to the Preprocess tab, the Explorer provides a Classify tab in which the
user can build and train a classifier; many classification options are provided, along with a description
and the parameters that can be set. Clustering can be implemented using the Cluster tab. Both the
Classify and Cluster tabs provide options for testing and training, and the results are displayed in the
output panel. In addition, WEKA provides an attribute selection facility, accessed from the "Select
attributes" tab, which helps in choosing smaller sets of attributes that are most suitable for
classification and clustering. The last tab in the Explorer is the Visualize tab, which displays a 2D
distribution of the data [18].
Chapter 3
Project Plan
3.1 Procedures and Deliverables
This project was divided into a number of procedures to achieve the stated aim and
requirements. To keep track of progress, a blog [19] was created to record all the information taken
from the papers used in the background reading chapter, the steps carried out to define the features,
and the design and implementation of the classifier. Understanding the problem and the requirements
was the starting point, and a workable schedule was then planned. In the process of implementation
there were a number of Java programs and reports to deliver. The procedure is explained in the
following steps:
1. Understand the natural language processing algorithms that help in text classification, as
described in the background research, to gain a better understanding of the problem and the
possible ways of implementing solutions.
2. Classification implementation on English Quran based on predefined features that will
characterise the interesting verses.
3. Design a Java program that creates the complement set that is used as training data.
(Deliverables: Java program, complement set text file, WEKA arff file).
4. Use the deliverables from the designed Java code in WEKA to test the classifier.
5. Write up a mid-project report, including the introduction, background research, and the plan
schedule (Deliverables: mid-term report).
6. Classification implementation on Arabic Quran based on predefined features that will
characterise the interesting verses.
7. Design a Java program that creates the complement set and the random subsets that are used
as training data. (Deliverables: Java program, complement set text file, random subset,
WEKA arff file).
8. Use the deliverables from the designed Java code in WEKA to test the classifier.
9. A demonstration was organized and presented to the assessor and supervisor to show the
work that was accomplished.
10. Create Hadith corpus as additional data set.
11. Classification implementation on Arabic Hadith based on predefined features that will
characterise the interesting sayings.
12. Design a Java program that creates the complement set that is used as training data.
(Deliverables: Java program, complement set text file, random subset, WEKA arff file).
13. Use the deliverables from the designed Java code in WEKA to test the classifier.
14. Evaluation of the system implemented.
15. Write-up of the final report (Deliverables: The final report).
3.2 Schedule
At the beginning of the project, a suitable schedule was planned. The schedule was organised
to fit all requirements in the time provided. It was obvious that some adjustments were necessary even
though an effort was made to keep to the timescale. One reason for the adjustments was the inclusion
of a presentation that wasn't previously planned for. In addition, the background reading took more
time than expected, both to get hold of recent papers published after the year 2000 and to find papers
related specifically to the project, since text classification is a broad area of study. It was also
necessary to get
hold of other examples, besides the holy Quran, that contain the hereafter concept and would help in
testing the performance of the classifier on different texts. One example of such a data set is the
hadith: Sahih Muslim was used to create a data set with a similar format to the holy Quran data set.
The design of this data set, which was created manually, took some time since no hadith corpus was
available over the internet. The data set was then used to test the classification on this text and to see
whether the selected features were appropriate for this classification.
3.3 Methodology
Information retrieval and extraction are techniques used when information is required from some
type of document for use in a wider application. These techniques are an important field within
Natural Language Processing (NLP). Information retrieval can operate on text, speech, video, and
images. Text categorization is an important part of NLP that uses information retrieval and extraction
algorithms to design and implement solutions. Text
categorization, or classification, may for example involve applying a language identifier to a
document of unknown origin, which helps in identifying the region from which it came [20]. "Text
classification" is the term used by computing scholars, but linguistics scholars tend to call the task
stylometry when it concerns classifying a document by its author's style. An example of NLP
classification that may interest linguists is determining whether a newly discovered poem was written
by Shakespeare or by another author. In addition, text classification can apply information extraction
by highlighting important, interesting, or suspicious text to a user, using features defined by the user.
Detecting suspicious text is a research topic carried out at the University of Leeds as part of the
"Making Sense"
project. This project could be useful in detecting "suspicious" texts in a large corpus of surveillance
and intercepted texts from terrorist suspects, by implementing an algorithm that classifies different
data sets into suspicious and non-suspicious.
The classification in this project was processed in a number of steps to achieve the
requirements and evaluate the language model that was built. In order to accomplish this, two Java
programs were designed and run on the English data set of the holy Quran. However, changes were
made later to achieve the same results on different Arabic data sets including the Arabic version of the
holy Quran. The other Arabic data set that was used in implementation is the hadith data set after
completing the main steps of designing the data set. The reason for the changes is that when the
classification was implemented on the English sets of the holy Quran it did not perform correctly and
it was obvious also that it would not perform correctly on the Arabic set of the holy Quran or Hadith.
An example of the changes is a Java program that extracts features from the random subset selected
from the complement text file, counts the frequencies of these features line by line, and outputs an
.arff file. The changes that were made will be explained in full in the design and
implementation part. In order to try to increase the accuracy of the results a number of features were
considered. The word frequency count feature was carried out on the holy Quran data set and the
Hadith data set. This feature was considered in two different ways that will be explained in the next
chapter. The other feature that was added in the stage of designing the solution to the classification
problem was the Meccan and Madinan features that would define the verses. This can be used as an
extra feature in which a verse that was classified to be interesting should be contained in a Meccan
chapter. Another feature that was considered is the J-O feature that was implemented on the holy
Quran data set, in which a verse that was classified to be interesting should be contained in the 30th
section of the holy Quran, chapters 78 to 114. Since the testing and training sets were available for
both data sets, the 10-fold cross-validation provided by WEKA was an option for evaluating some of
the results. One option for performing the classification was decision trees; another was Naïve Bayes
classifiers.
Chapter 4
Design and Implementation
4.1 Classification Overview
The classification is based on a designed Java program that produces the .arff file used in the
implementation to classify Arabic text files, collected from different sources, into interesting and
non-interesting. Although the designed systems were initially implemented on the English version of
the holy Quran, they were finally implemented on the Arabic version, along
with other examples that may include the hereafter concept. The holy Quran data set was provided by
the university, including the subset that contains the verses on the hereafter subject. As additional
work, a hadith data set was created manually; it was selected because it was accessible, in Arabic,
and holds the hereafter concept that was chosen to define the interesting
text. Since the interesting text must be on the hereafter concept a number of features were defined in
the Java program to retrieve from a verse. However, when the classifier was designed and
implemented on the English version of the holy Quran, the skewed data problem appeared. The cause
of this problem was the imbalance between the non-interesting data, about 6000 verses labelled No,
and the interesting data, about 100 verses labelled Yes. As a result, under-sampling the data set was
one option for overcoming the skewed data problem. The approaches section describes the
procedures that were carried out on the English text files, along with the results that revealed the
problem. It also covers the procedures carried out on the Arabic text files: descriptions of the data
sets that were obtained or created, descriptions of the designed Java programs, procedures to
overcome the skewed data problem, and the results of the implementation in WEKA.
4.2 Features
The main features that classify the verses are a list of words that relate to the topic a user is
looking for. Since this classification concerns the hereafter concept, the Qurany Explorer was
consulted: it defines the hereafter as a sub-subject under the faith category. This website is a corpus
that includes an ontology of the subjects mentioned in the holy Quran, and was built by Noorhan
Abbas, a PhD student at the University of Leeds. The Qurany Explorer's hierarchy of concepts was
obtained from 'Mushaf Al Tajweed', which includes an ontology of almost 1200 subjects mentioned
in the holy Quran. Qurany Explorer was used to define the features. According to [21], the
words selected could relate to belief in the hereafter, the proofs of the hereafter, the signs which
precede it, the names of the hereafter, the resurrection, the reward, and many other sub-concepts
defined in the ontology. The words selected for this project are from the names-of-the-hereafter
sub-subject. The list of features is:
The Hereafter "الآخرة".
Day of Resurrection "الحاقة".
The Hour of Resurrection "الساعة".
Rise from the Dead "البعث".
The Judgment Day "الدين".
Day of Resurrection "القيامة".
An additional feature that was considered was the Meccan/Madinan feature that would
describe the verses. According to [2], Meccan chapters give more emphasis to end-of-days topics.
This feature was added to improve the classification on the holy Quran data set. Since this
classification was implemented at the University at chapter level, it was important to include a
method in the designed Java code to restructure the meccan-madinan.arff file. This file has three
parts. First, the directive part @relation, named Mecca-Madinan. Second, the @attribute part, which
includes the list of features that defines the classes Meccan (K) and Madinan (D). Finally, the @data
part, which includes the data set; in this part every line corresponds to a chapter in the holy Quran,
resulting in 114 lines, each ending with its class label. This file was used in the
Java code that was designed for the Arabic version of the holy Quran, to add the Meccan and Madinan
features to my .arff file. Another feature that was added to help to improve the results of the
classification task is checking if a verse of a chapter belongs to the 30th section of the holy Quran.
According to the islamiyyat website [22], the 30th section of the Quran talks about the hereafter. As
a result, a feature was added that labels all verses in chapters belonging to this section as Judgement
(J), and all other verses as Other (O). Chapter 78 is the first chapter in the 30th section and chapter
114 is the last.
Moreover, another word was added to the word list to count its frequency of use in the text
classification in the hadith data set. This is the Day of Judgment, but with an alternative term in
Arabic; this term is “الحساب”.
4.3 Approaches
The first step towards the implementation of the hereafter classification on the holy Quran
data set was on the English set. The data set was provided by Claire Brierley, a researcher at the
University of Leeds. The allQuran.txt file contains all 6236 verses of the holy Quran;
csubsetQuran.txt contains all the verses that concern the hereafter concept, namely 113 verses.
The required .arff file to be used in WEKA was retrieved by designing two Java programs. The first
Java program of the design is cnoarff.java which delivers the complement set in a text file. This text
file was created by comparing the verses in the all Quran text file and the subset Quran text file. If the
verse did not appear in the subset, it was written to the complement set. The number of verses that
was eventually written to the complement text file is 6123. The second part of the first Java program
creates the .arff file for the complement set. This file is called noQuran.arff file which contains all
verses that do not talk about the hereafter subject. The second Java program of the design is
allQuran.java which delivers the allQuran.arff that contains the combination of the verses of all
Quran that are labelled Yes, or No. The difference between the two .arff files that result from the two
programs is that the second output will contain the header of an .arff file. The header will include the
relation, attributes and their type, and the class. The allQuran.arff file was used in WEKA to classify
the verses into interesting or non-interesting based on the hereafter subject. The step that followed the
design of the Java program, and retrieving information that was required for the .arff file was training
the classifier on the data and testing it. The allQuran.arff file was opened in WEKA for processing.
Using 10-fold cross-validation to train and test the classifier divides the data set into 10 folds; this
can be changed to any other number in the Folds text field. Precision and recall show the positive and
negative statistics of the classification. The confusion matrix shows the instances that are assigned to
each class in the processed data against the instances as classified by WEKA. The following
formulae are
the equations which are used to get the results of the classification.
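The formulae themselves did not survive into this copy of the text; the standard definitions, which WEKA's output follows, are given below, where TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives respectively:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```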
The methods that were selected to test the classifier's accuracy are Decision Table, Naïve
Bayes and J48. In the confusion matrix, WEKA classifies the instances into the defined classes
{Yes, No}: 'a' denotes the Yes class and 'b' the No class. The 'a' column shows how many instances
WEKA classified as 'a' (Yes); the rows give the real number of instances defined as belonging to 'a'
(Yes) or 'b' (No).
The results show that no instances were correctly classified into the Yes class by any of the methods.
The results are summarised in the following table:
allQuran.arff                        Decision Table      Naïve Bayes        J48
Correctly Classified Instances       6123 = 98.1879%     6120 = 98.1398%    6123 = 98.1879%
Incorrectly Classified Instances      113 =  1.8121%      116 =  1.8602%     113 =  1.8121%
Precision                            0.964               0.964              0.964
Recall                               0.982               0.981              0.982

Confusion matrix (Decision Table):
   a    b   <-- classified as
   0  113 | a = Yes
   0 6123 | b = No

Confusion matrix (Naïve Bayes):
   a    b   <-- classified as
   0  113 | a = Yes
   3 6120 | b = No

Confusion matrix (J48):
   a    b   <-- classified as
   0  113 | a = Yes
   0 6123 | b = No
Table 1: Results in WEKA using about 6000 lines of No and 113 of Yes.
When these results were analysed it was obvious that the data was skewed since the data that
was labelled No was larger than the subset that was labelled Yes. At this point new approaches were
defined in order to improve the results of the classification. There were different approaches to test the
classifier that would be useful to use later in broader researches, such as detecting suspicious text. For
example, sub-sampling the large data set, which was defined in the complement set and labelled No,
into random subsets of different sizes. This section describes the procedures used to design and
implement the classification on two different data sets that hold the hereafter concept, using the same
features explained above. The first is the Arabic text of the holy Quran, which will be explained in
the following section; the second is the Arabic text of hadith, whose approaches will be explained in
section 4.3.2.
4.3.1 The holy Quran data set
Description
The first implementation of the classification was on the verses of the holy Quran and was
based on the frequency of defined features as has been mentioned previously. The implementation of
the solution was accomplished by firstly defining the features that the classification is based on. These
features can be a list of words which will be extracted to retrieve their frequencies in an .arff file. The
list of features that were defined will be extracted from the Arabic set allQuranArabic.txt,
RandomSubsetArabic.txt, subsetQuranArabic.txt and meccan-madinan.arff. The allQuranArabic.txt
file contains all 6236 verses of the holy Quran. subsetQuranArabic.txt contains all the verses that
concern the hereafter concept, namely 113 verses, and was defined by Claire Brierley, a researcher at
the University of Leeds. RandomSubsetArabic.txt contains 400 verses selected randomly from the
complement set (later changed to 100 verses). These
files contain the verses of the Quran in a certain format. This format is shown in Figure 2. The first
number is the chapter number, the second number is the verse number and the string is the verse.
In addition to this approach to the implementation of the classification, combining an existing
classification could be used to increase the accuracy of the classification. This can be carried out by
using the Meccan and Madinan classification that was implemented at the University of Leeds. To
accomplish this, the Java code that was designed should add another feature to the selected verses that
will produce the final .arff file. This feature will be added after retrieving the frequencies of the
features and will describe the verses if they were Meccan or Madinan by reading the first number in
the line which refers to the chapter number. The reason for using the chapter number to describe the
verse is that the classification of the holy Quran into Meccan or Madinan was on the chapter level.
Secondly, the frequencies and the Meccan-Madinan description obtained from the designed systems
are used in the implementation in WEKA. The implementation of this classification first classifies
based on the frequencies to obtain the interesting set, and then adds the Meccan-Madinan feature.
This step includes training the system to classify based on a predefined set of text. Finally, the
percentage of correctly classified verses is presented.
Design
A Java program was designed in order to obtain an .arff file that is used later in the
implementation of the classifier for the first approach. As has been explained previously, the first step
was to design two Java programs that used the English corpus of the holy Quran. However, when it
was time to use the Arabic corpus, a new Java program working as a full system was preferred. The
new Java program is called ArabicQuran.java. The first method defined in the
program is the main method in which the method processFile is called up. In the method processFile
input and output files are declared by firstly defining allQuranArabic.txt and subsetQuranArabic.txt as
inputs, and it uses the readFile method to read them. The output files are ComplementArabic.txt,
RandomSubsetArabic.txt and QuranArabic.arff. These files are created using a BufferedWriter for
each output.
1|1|String
Figure 2: Arabic Quran dataset format
The first part of the Java program obtains the complement set. This is accomplished by first reading
the input files and storing them in ArrayLists: two ArrayLists are created, containing the strings from
allQuranArabic.txt and subsetQuranArabic.txt. The complement set is built into an array list by a
designed method called GetComplement. The ReadFile method creates these ArrayLists using a
FileInputStream and an InputStreamReader that read the files with
UTF8 encoding. There was a problem while running the program: after the conversion to UTF8, the
array list would store an extra character at the beginning of every text file created, which then failed
to match the chapter and verse numbers stored in the array list. To overcome this problem, a loop
was created to skip this character if it existed and move on to the next character in the array list,
which is the chapter number. The GetComplement method returns an ArrayList and takes the
allQuranArabic and subsetArabic ArrayLists, together with a Writer for output. It loops over the
input text files and compares the IDs of the verses using the GetId method, which will be explained
later. If a verse line is in the allQuran text file but not in the subset text file, it is written to the
complement array list. The GetId method obtains the first numbers of each line, separated by "|", for
example "##|##".
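The complement-set step can be sketched as follows; this is a minimal illustration under the report's "chapter|verse|text" line format, not the actual ArabicQuran.java code, and the class name is invented for the sketch:

```java
import java.util.*;

public class ComplementBuilder {
    // Extract the "chapter|verse" ID from a line formatted "chapter|verse|text".
    static String getId(String line) {
        int first = line.indexOf('|');
        int second = line.indexOf('|', first + 1);
        return line.substring(0, second);
    }

    // Verses present in the full set but absent from the subset form the complement.
    static List<String> getComplement(List<String> all, List<String> subset) {
        Set<String> subsetIds = new HashSet<>();
        for (String s : subset) subsetIds.add(getId(s));
        List<String> complement = new ArrayList<>();
        for (String verse : all) {
            if (!subsetIds.contains(getId(verse))) complement.add(verse);
        }
        return complement;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("1|1|verse one", "1|2|verse two", "2|1|verse three");
        List<String> subset = Arrays.asList("1|2|verse two");
        System.out.println(getComplement(all, subset)); // [1|1|verse one, 2|1|verse three]
    }
}
```

Comparing IDs rather than whole lines mirrors the report's use of GetId for the match.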
The second part of the Java program obtains the random subset, which requires the
complement set. The random subset is built by a method called CreateRandomSubset. This method
takes the complement set, an integer defined beforehand as the limit on the number of lines in the
random subset (the value of 400 verses, later changed to 100), and writes an output. The selected
verses are inserted in the order of their position in allQuranArabic.txt. This method uses the
Randomvalue method, which declares the size of the random set; it is implemented using
mathematical functions to select the required number of lines, initially 400, from the complement
set.
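One way to sketch this order-preserving random sub-sampling is shown below. The class and method names are illustrative, and a fixed seed is used only to make the example reproducible; the report's actual Randomvalue method may work differently:

```java
import java.util.*;

public class RandomSubset {
    // Pick `limit` distinct line indices at random, then emit those lines in
    // their original order, mirroring how the report keeps verses in
    // allQuranArabic.txt order.
    static List<String> createRandomSubset(List<String> complement, int limit, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < complement.size(); i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<Integer> chosen = new ArrayList<>(indices.subList(0, limit));
        Collections.sort(chosen);              // restore original ordering
        List<String> subset = new ArrayList<>();
        for (int i : chosen) subset.add(complement.get(i));
        return subset;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println(createRandomSubset(lines, 3, 42L));
    }
}
```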
This program is mainly about creating the .arff file that will later be used in WEKA to
classify the verses into interesting and non-interesting. To accomplish this, the header of the .arff
file is written using the writeFile method, which takes in the string to be written and the file it
should be written to. In addition, two other methods help in creating the .arff file. First, ProcessLine
obtains the attributes' frequencies by tokenizing the lines using a StringTokenizer and applying
regular-expression matching so that an attribute is counted even when it carries a prefix or a suffix.
Second, the GetEndLine method writes Yes or No at the end of every line of the .arff file: if the line
occurs in the random subset it writes No, and if it occurs in the subset it writes Yes. Figure 3 is a
simple diagram that helps in understanding the flow of the Java program. The output line looks like
this: "1|4",0,0,0,0,1,0,NO.
Additional Method
Additional methods were added later to improve the classification. These methods can easily be
removed by omitting some lines, which changes the output. Two new methods were added to this
Java code to add the Meccan-Madinan feature. The first, TrimMeccan, removes the header part
from the meccan-madinan.arff file: it loops through the lines until the first character of a line is 0,
then stops and returns the remaining lines. The .arff file is structured to include a header, which is
removed, and a data part of 114 lines, one per chapter. The second method, GetLetter, uses the
trimmed meccan-madinan.arff file to retrieve the feature. The letter K was assigned to Meccan
chapters and D to Madinan chapters. The method reads the first number of each selected line for the
QuranArabic.arff file, looks at the line in that position in the meccan-madinan.arff file, and takes the
feature from it. The output lines following @data in the .arff file will look like the following:
"1|4",0,0,0,0,1,0,K,NO
In addition, ProcessLineSingleCount was added to combine the frequencies into one count for the list
of word features that is defined. The output lines following @data in the .arff file will look like the
following:
"2|16",0,D,NO
Figure 3: Diagram of the sequence of the methods that were created in the Java program
In order to add the J-O feature, the GetSecondLetter method was added. This method reads the
first number of the line and, depending on it, adds the letter J for verses that belong to a
chapter in the 30th section of the Quran, and the letter O for all the verses that do not. The
output lines following @data in the .arff file will look like the following:
"2|25",0,D,O,NO
Implementation
The implementation of the solution was carried out in WEKA using different data set sizes and
features, with different classifiers used to train and test. As has been mentioned previously,
the three classification methods that were selected to analyse and evaluate the classifier are
Decision Table, Naïve Bayes and J48. The first implementation was on QuranArabic.arff in three
different versions of random sets. 10-fold cross-validation was used to train and test the data
on a number of .arff files. The arff files consist of the frequencies of the defined features in
400 lines of verses labelled No and 113 lines of verses labelled Yes, later changed to 100 lines
labelled No and 113 lines labelled Yes. In addition, the Meccan-Madinan feature was included and
the classifier was tested on the word frequencies together with this feature, again defining arff
files with 400 random lines of verses labelled No and 113 labelled Yes, then changed to 100 lines
labelled No and 113 labelled Yes. The final results of the WEKA classifiers on this data set were
obtained from 100 lines of verses labelled No and 113 labelled Yes, including the Meccan-Madinan
feature, with the frequency count combined into one and the Judgement-Other feature added. The
results of processing the arff files using the Java programs described previously will be
provided in the following chapter.
4.3.2 The hadith data set
Description
The second implementation of the classification was on hadith and was based on the
frequency of defined features, as mentioned previously. The hadith data set was created manually.
The design of the data sets was based on the al-eman website [23]. The hadith data set was created
from Sahih Muslim, which was mentioned in the classification overview. The subset data was created
using the web-based search tool on this website. According to Abdul Hamid Siddiqui [24], a saying
or deed of the prophet Muhammad (May peace be upon him) is called a hadith. These sayings are the
second source, after the holy Quran, that Muslims refer to on the subjects of life and Islamic
laws. Imam Muslim is an important and famous scholar in Islam, and Sahih Muslim is a book that
includes the collection of hadith made by this scholar. The book, in most English translations, is
divided into 43 sections containing all the hadith; however, it originally had 57 sections that
include up to 7500 sayings in the Arabic version of Sahih Muslim. The data set that was designed
to implement the classification for this project used the Arabic version. The book includes all
sayings that were transmitted by different chains of narrators; as a result, if a saying had
another chain of narrators it was counted as a different saying. These repeated sayings were
removed from the designed data set, since the important part of the hadith was the saying itself
and not the narrators. In addition, the subset was created using the Ibn Kathir Tafsir book that
was provided online by the Quran complex website [25]. The hadiths were gathered from this
electronic version of the book by firstly going through the verses that were in the Quran data set
and looking for the corresponding sayings of the prophet that were available in Sahih Muslim.
Secondly, the al-eman website that was previously mentioned was used in order to keep track of the
reference numbers of the sayings gathered from Ibn Kathir and to retrieve the complement set. The
final Arabic version of the hadith data set used in the classification is SahihMuslim.txt, which
includes 7748 hadith, and subsetHadithArabic.txt, which in turn includes 86 hadith that hold the
hereafter concept.
The implementation of the solution was accomplished by firstly creating the data sets that
were required for training and testing, as explained previously. Secondly, the features that the
classification is based on were defined. These features can be a list of words whose frequencies
will be extracted into an .arff file. The defined list of features will be extracted from the
Arabic sets AllHadithArabic.txt, RandomSubsetArabicHadith.txt and subsetHadithArabic.txt. The
AllHadithArabic.txt file contains all the quotes that were attributed to the prophet Muhammad
(May peace be upon him), which number about 7500. subsetHadithArabic.txt contains all the sayings
(86) that concern the hereafter concept. RandomSubsetArabicHadith.txt contains 400 sayings that
were selected randomly from the complement set to begin with, later changed to 100 sayings. These
files contain the sayings in the format shown in Figure 4: the number represents the hadith
number in Sahih Muslim and the string is the hadith. To begin the implementation of this
classification, the system had to be trained and tested, which was done using 10-fold
cross-validation. Finally, the percentage of correctly classified hadith will be presented.
Design
A Java program was designed in order to obtain an .arff file that is used later in the
implementation of the classifier to test on different data sets that would hold the hereafter
concept. In order to accomplish this, the first step is to design the data set. The second step
is designing the Java program that would help in retrieving the information required for the
implementation. Two new Java programs were designed; the first is SetupData.java and the second
is ArabicHadith.java. A description of these programs will be provided in this section.

1|“String”
Figure 4: Arabic Hadith dataset Format
SetupData.java description
The first method defined in the program is the main method, in which the method processFile is
called. In processFile, input and output files are declared: SahihMuslim.txt and subsetHadith.txt
are defined as inputs, and the readFile method is used to read them. The output files are
AllHadithArabic.txt and SubsetHadithArabic.txt; these files are created using a BufferedWriter
for each output. Firstly, the Java program reads the input files and stores them in ArrayLists.
Two ArrayLists are created: one holds the strings from SahihMuslim.txt, which produce
AllHadithArabic.txt, and one holds the subset, which produces SubsetHadithArabic.txt.
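Reading the files with an explicit UTF-8 decoder, as the program's ReadFile method does, can be sketched as follows. The class and method names here are hypothetical, but the FileInputStream/InputStreamReader pattern matches the description above; relying on the platform default charset instead is what typically corrupts Arabic text.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of ReadFile: open a file explicitly as UTF-8 so Arabic text
// survives the read, and store the lines in an ArrayList.
public class ReadFileSketch {

    static List<String> readFile(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file containing Arabic text, then read it back.
        java.nio.file.Path tmp = java.nio.file.Files.createTempFile("hadith", ".txt");
        java.nio.file.Files.write(tmp,
                "1|\u0642\u0627\u0644".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFile(tmp.toString()).size()); // prints 1
    }
}
```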
The method ReadFile reads the files into the ArrayLists using FileInputStream and an
InputStreamReader that reads the files as UTF-8 encoding. As before, a problem with storing the
lines that are in UTF-8 encoding was present and was addressed. The method that follows ReadFile
is setupFile. This method splits each line of hadith after the first bar that appears in the line
and saves the strings that are between quotation marks. As has been explained previously, the
sayings of the prophet are the strings between quotation marks; the program stores the strings
that hold the odd positions after splitting, as explained in Figure 5. The final method in this
program is writeFile, which prints the lines to the files.
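The split-after-the-first-bar, keep-the-quoted-string behaviour of setupFile can be sketched like this. It is an assumed reconstruction: the real program iterates over the odd-numbered positions of the split (there may be several quoted spans per line), while this sketch takes just the first quoted string, and the class name is invented.

```java
// Sketch of setupFile: keep the hadith number before the first bar, and
// extract the saying, i.e. the text between quotation marks in the rest
// of the line. Splitting on the quote character puts quoted text at the
// odd positions of the resulting array.
public class SetupFileSketch {

    // Input format (Figure 4): number|"saying ..."
    static String extractSaying(String line) {
        String afterBar = line.substring(line.indexOf('|') + 1);
        String[] parts = afterBar.split("\"");
        // parts[0] is anything before the first quote; parts[1] is the saying.
        return parts.length > 1 ? parts[1] : "";
    }

    public static void main(String[] args) {
        System.out.println(extractSaying("1|\"a saying of the prophet\""));
        // prints: a saying of the prophet
    }
}
```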
ArabicHadith.java description
The second Java program is ArabicHadith.java. Its main method calls the method processFile. In
processFile, input and output files are declared: AllHadithArabic.txt and SubsetHadithArabic.txt
are defined as inputs, and the readFile method is used to read them. The output files are
ComplementArabicHadith.txt, RandomSubsetArabicHadith.txt and HadithArabic.arff. This Java program
is designed in the same way as ArabicQuran.java. As a result, the first part of the program
obtains the complement set; this is accomplished using the GetComplemet, ReadFile and GetId
methods. The second part of the program obtains the random subset, which requires the complement
set. The random subset is likewise created using the methods CreateRandomSubset and Randomvalue.
This program is mainly about creating the .arff file that will be used later in WEKA to
classify the sayings into interesting and non-interesting. To accomplish this, the header of the
.arff file is written using the writeFile method. The ProcessLine method gets the attributes'
frequencies by tokenizing the lines using StringTokenizer and applying regular expression
matching, so that an attribute is counted even when it carries a prefix or a suffix. In addition,
the GetEndLine method is used to write Yes or No at the end of every line of the arff file. This
is done by checking the line: if it occurs in the random subset then it writes No, and if it
occurs in the subset then it writes Yes. The output line looks like this:
"108",0,0,0,0,1,0,YES
Additional methods were added later to improve the classification; they can be removed easily by
omitting some lines, and the output will change accordingly. ProcessLineSingleCount was added to
this Java code to combine the frequencies into one count for the defined list of word features.
The output lines following @data in the .arff file will look like the following:
"108",1,YES
Implementation
The implementation of the classification task on the hadith dataset was done in WEKA using
different data set sizes and features. As has been mentioned previously, the three classification
methods that were selected to analyse and evaluate the classifier are Decision Table, Naïve Bayes
and J48. The first implementation was on a HadithArabic.arff that consists of the frequencies of
the defined features in 200 lines of sayings labelled No and 86 lines of sayings labelled Yes,
later changed to 100 lines labelled No and 86 lines labelled Yes. The results of processing the
arff files using the Java programs described previously will be provided in the following
chapter.
Chapter 5
Results and Evaluation
5.1 WEKA Results
5.1.1 The holy Quran data set results
The results of the implementation are shown in Tables 2 and 3 below. Table 2 shows the
results of the WEKA classifiers for 100 lines of No and 113 of Yes that include the {K, D} feature
and one frequency count. Three different versions were used for training and testing, which are
numbered 10, 11 and 12. Table 3 shows the results of the WEKA classifiers for 100-113 lines that
include the Meccan-Madinan feature, with the frequency count combined into one and the
Judgement-Other feature added. Similarly, three different versions were used for training and
testing, which are numbered 13, 14 and 15. However, the first implementation on this data set was
on the dataset which was resized to
400 lines of the data labelled No and 113 labelled Yes using the frequency count feature with three
different versions of the .arff files. The results of the classification for the first example,
QuranArabic1.arff, in WEKA using Decision Table, Naïve Bayes and J48 are as follows: the best
result was 402 correctly classified instances, which is equal to 78.3626%; that was scored by the
Naïve Bayes classifier. Precision is 0.743 and Recall is 0.784. The confusion matrix was the
following:
a b <-- classified as
7 106 | a = YES
5 395 | b = NO
The second example is QuranArabic2.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 405, which is equal to 78.9474%; that
was scored by the Decision Table and Naïve Bayes classifiers. Precision is 0.787 and Recall is
0.789. The confusion matrix was the following:
a b <-- classified as
7 106 | a = YES
2 398 | b = NO
The final example is QuranArabic3.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 403, which is equal to 78.5575%; that
was scored by the Naïve Bayes classifier too. Precision is 0.755 and Recall is 0.786. The
confusion matrix was the following:
a b <-- classified as
7 106 | a = YES
4 396 | b = NO
All three versions of the arff files scored the same lowest result using J48: 400 correctly
classified instances = 77.9727%. Precision is 0.608 and Recall is 0.78.
The results of the classification in WEKA when the dataset was resized to 100 lines of the data
labelled No and 113 labelled Yes, using the frequency count with three different versions of the
.arff files, were exactly the same. The best results were by the Decision Table and J48
classifiers: the correctly classified instances are 113, which is equal to 53.0516%, and the
incorrectly classified instances are 100, which is equal to 46.9484%. Precision is 0.281 and
Recall is 0.531. The confusion matrix was the following:
a b <-- classified as
113 0 | a = YES
100 0 | b = NO
Clearly these were not reliable results, and resizing did not improve the classification results
using the frequency count. In order to improve the results, the data sets were resized back to
400 lines of the data labelled No and 113 labelled Yes; however, the Meccan-Madinan feature was
added to this version of the arff files and the frequency count was changed to one combined count
for all the words that were defined previously. The first example is QuranArabic7.arff, using
Decision Table, Naïve Bayes and J48, and the best results are as follows: correctly classified
instances are 400, which is equal to 77.9727%; that was scored by the Decision Table and J48
classifiers. Precision is 0.608 and Recall is 0.78. The confusion matrix was the following:
a b <-- classified as
0 113 | a = YES
0 400 | b = NO
However, the Naïve Bayes classifier produced better results in classifying some True positive
instances. The confusion matrix produced by Naïve Bayes on this version of the arff file was as
follows:
a b <-- classified as
11 102 | a = YES
14 386 | b = NO
The second example is QuranArabic8.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 397, which is equal to 77.3879%; that
was scored by the Naïve Bayes classifier, with the other classifiers lower by one instance.
Precision is 0.719 and Recall is 0.774. The confusion matrix was the following:
a b <-- classified as
14 99 | a = YES
17 383 | b = NO
The final example is QuranArabic9.arff, using Decision Table, Naïve Bayes and J48, and the best
results are as follows: correctly classified instances are 400, which is equal to 77.9727%; that
was scored by the Decision Table and J48 classifiers. Precision is 0.608 and Recall is 0.78. The
confusion matrix was the following:
a b <-- classified as
0 113 | a = YES
0 400 | b = NO
Again, the Naïve Bayes classifier produced better results in classifying some True positive
instances. The confusion matrix produced by Naïve Bayes on this version of the arff file was as
follows:
a b <-- classified as
10 103 | a = YES
15 385 | b = NO
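WEKA reports Precision and Recall as weighted averages over the two classes (each class's value weighted by the class size). As a check, the 0.743/0.784 figures reported for QuranArabic1.arff can be recomputed from its confusion matrix. This sketch is not WEKA code, just the weighted-average formula written out:

```java
// Weighted-average precision and recall for a 2x2 confusion matrix, in the
// style WEKA reports them. m[0] = {TP, FN} for YES; m[1] = {FP, TN} for NO.
public class WeightedMetricsSketch {

    static double weightedPrecision(int[][] m) {
        double total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        // Per-class precision: of everything classified YES (or NO), how
        // much was actually YES (or NO)?
        double pYes = (m[0][0] + m[1][0]) == 0 ? 0 : (double) m[0][0] / (m[0][0] + m[1][0]);
        double pNo  = (m[0][1] + m[1][1]) == 0 ? 0 : (double) m[1][1] / (m[0][1] + m[1][1]);
        // Weight each class's precision by that class's size.
        return ((m[0][0] + m[0][1]) * pYes + (m[1][0] + m[1][1]) * pNo) / total;
    }

    static double weightedRecall(int[][] m) {
        double total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        // Weighted recall reduces to overall accuracy for two classes.
        return (m[0][0] + m[1][1]) / total;
    }

    public static void main(String[] args) {
        // QuranArabic1.arff, Naive Bayes: YES row (7, 106), NO row (5, 395).
        int[][] m = { {7, 106}, {5, 395} };
        System.out.printf("precision %.3f recall %.3f%n",
                weightedPrecision(m), weightedRecall(m));
        // prints: precision 0.743 recall 0.784
    }
}
```

The recomputed values match the reported 0.743 and 0.784, which confirms how these columns should be read throughout the tables.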
Table 2: Results in WEKA using 100 lines of No and 113 of Yes (including the {K, D} feature and
one frequency count). Confusion matrices are given as the YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
QuranArabic10.arff
  Correctly Classified Instances   126 = 59.1549%    126 = 59.1549%    126 = 59.1549%
  Incorrectly Classified Instances 87 = 40.8451%     87 = 40.8451%     87 = 40.8451%
  Precision                        0.609             0.618             0.609
  Recall                           0.592             0.592             0.592
  Confusion Matrix                 97 16 / 71 29     100 13 / 74 26    97 16 / 71 29
QuranArabic11.arff
  Correctly Classified Instances   132 = 61.9718%    130 = 61.0329%    132 = 61.9718%
  Incorrectly Classified Instances 81 = 38.0282%     83 = 38.9671%     81 = 38.0282%
  Precision                        0.64              0.64              0.64
  Recall                           0.62              0.61              0.62
  Confusion Matrix                 97 16 / 65 35     100 13 / 70 30    97 16 / 65 35
QuranArabic12.arff
  Correctly Classified Instances   124 = 58.216%     124 = 58.216%     124 = 58.216%
  Incorrectly Classified Instances 89 = 41.784%      89 = 41.784%      89 = 41.784%
  Precision                        0.597             0.606             0.597
  Recall                           0.582             0.582             0.582
  Confusion Matrix                 97 16 / 73 27     100 13 / 76 24    97 16 / 73 27
Table 3: Results in WEKA using 100 lines of No and 113 of Yes (including the {K, D} feature, the
{J, O} feature and one frequency count). Confusion matrices are given as the YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
QuranArabic13.arff
  Correctly Classified Instances   122 = 57.277%     115 = 53.9906%    123 = 57.7465%
  Incorrectly Classified Instances 91 = 42.723%      98 = 46.0094%     90 = 42.2535%
  Precision                        0.583             0.534             0.592
  Recall                           0.573             0.54              0.577
  Confusion Matrix                 96 17 / 74 26     83 30 / 68 32     97 16 / 74 26
QuranArabic14.arff
  Correctly Classified Instances   128 = 60.0939%    125 = 58.6854%    125 = 58.6854%
  Incorrectly Classified Instances 85 = 39.9061%     88 = 41.3146%     88 = 41.3146%
  Precision                        0.609             0.59              0.593
  Recall                           0.601             0.587             0.587
  Confusion Matrix                 92 21 / 64 36     89 24 / 64 36     91 22 / 66 34
QuranArabic15.arff
  Correctly Classified Instances   111 = 52.1127%    116 = 54.4601%    108 = 50.7042%
  Incorrectly Classified Instances 102 = 47.8873%    97 = 45.5399%     105 = 49.2958%
  Precision                        0.506             0.541             0.484
  Recall                           0.521             0.545             0.507
  Confusion Matrix                 91 22 / 80 20     92 21 / 76 24     90 23 / 82 18
5.1.2 The hadith data set results
This section will provide the results of processing the data sets using different classifiers in
WEKA. The results of the implementation are shown in the tables below. Table 4 shows the results
of the WEKA classifiers for 200-86 lines; three different versions of HadithArabic.arff files
were used for training and testing, numbered 1, 2 and 3. Similarly, Table 5 shows the results of
the WEKA classifiers for 100-86 lines, with three different versions used for classifier training
and testing, numbered 4, 5 and 6 in the table. When the word frequencies were combined, the
results did not change much. The results of the classification when the dataset was resized to
200 lines of the data labelled No and 86 labelled Yes, using one count of the frequency with
three different versions of the .arff files, were different for each version of the data set
used, whichever classifier was applied.
The first example is HadithArabic7.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 236, which is equal to 83.6879%;
incorrectly classified instances are 46, which is equal to 16.3121%. Precision is 0.838 and
Recall is 0.837. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
8 192 | b = NO
The second example is HadithArabic8.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 228, which is equal to 80.8511%;
incorrectly classified instances are 54, which is equal to 19.1489%. Precision is 0.801 and
Recall is 0.809. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
16 184 | b = NO
The final example is HadithArabic9.arff, using Decision Table, Naïve Bayes and J48, and the
results are as follows: correctly classified instances are 230, which is equal to 81.5603%;
incorrectly classified instances are 52, which is equal to 18.4397%. Precision is 0.809 and
Recall is 0.816. The confusion matrix was the following:
a b <-- classified as
44 38 | a = YES
14 186 | b = NO
The results of the classification in WEKA when the dataset was resized to 100 lines of the data
labelled No and 86 labelled Yes, using the one combined count of the frequency with three
different versions of the .arff files, were exactly the same. The correctly classified instances
are 139, which is equal to 76.3736%, and the incorrectly classified instances are 43, which is
equal to 23.6264%. Precision is 0.797 and Recall is 0.764. The confusion matrix was the
following:
a b <-- classified as
44 38 | a = YES
5 95 | b = NO
Table 4: Results in WEKA using 200 lines of No and 86 of Yes. Confusion matrices are given as the
YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
HadithArabic1.arff
  Correctly Classified Instances   226 = 80.1418%    227 = 80.4965%    219 = 77.6596%
  Incorrectly Classified Instances 56 = 19.8582%     55 = 19.5035%     63 = 22.3404%
  Precision                        0.795             0.798             0.768
  Recall                           0.801             0.805             0.777
  Confusion Matrix                 39 43 / 13 187    41 41 / 14 186    31 51 / 12 188
HadithArabic2.arff
  Correctly Classified Instances   231 = 81.9149%    232 = 82.2695%    225 = 79.7872%
  Incorrectly Classified Instances 51 = 18.0851%     50 = 17.7305%     57 = 20.2128%
  Precision                        0.821             0.824             0.805
  Recall                           0.819             0.823             0.798
  Confusion Matrix                 39 43 / 8 192     40 42 / 8 192     31 51 / 6 194
HadithArabic3.arff
  Correctly Classified Instances   224 = 79.4326%    209 = 74.1135%    216 = 76.5957%
  Incorrectly Classified Instances 58 = 20.5674%     73 = 25.8865%     66 = 23.4043%
  Precision                        0.785             0.735             0.753
  Recall                           0.794             0.741             0.766
  Confusion Matrix                 39 43 / 15 185    15 67 / 6 194     30 52 / 14 186
Table 5: Results in WEKA using 100 lines of No and 86 of Yes. Confusion matrices are given as the
YES row / the NO row.

                                   Decision Table    Naïve Bayes       J48
HadithArabic4.arff
  Correctly Classified Instances   134 = 73.6264%    138 = 75.8242%    135 = 74.1758%
  Incorrectly Classified Instances 48 = 26.3736%     44 = 24.1758%     47 = 25.8242%
  Precision                        0.792             0.806             0.804
  Recall                           0.736             0.758             0.742
  Confusion Matrix                 37 45 / 3 97      41 41 / 3 97      37 45 / 2 98
HadithArabic5.arff
  Correctly Classified Instances   127 = 69.7802%    134 = 73.6264%    133 = 73.0769%
  Incorrectly Classified Instances 55 = 30.2198%     48 = 26.3736%     49 = 26.9231%
  Precision                        0.75              0.772             0.767
  Recall                           0.698             0.736             0.731
  Confusion Matrix                 32 50 / 5 95      40 42 / 6 94      39 43 / 6 94
HadithArabic6.arff
  Correctly Classified Instances   134 = 73.6264%    134 = 73.6264%    130 = 71.4286%
  Incorrectly Classified Instances 48 = 26.3736%     48 = 26.3736%     52 = 28.5714%
  Precision                        0.778             0.772             0.762
  Recall                           0.736             0.736             0.714
  Confusion Matrix                 39 43 / 5 95      40 42 / 6 94      35 47 / 5 95
5.2 Evaluation of the Model
As has been mentioned previously, the procedures to accomplish the text classification
problem were obtaining a dependable data set and selecting features that would assist the
classifier. By applying a number of Java programs, the required data files to be used in the
classification were retrieved. Using the different types of classifiers provided in WEKA, the
results allowed the evaluation of the accuracy of the classifier and of the defined features.
Improvements to the defined features were necessary; these improvements will be explained in the
following section.
5.2.1 Evaluation on data sets
Two different data sets were processed in this project in order to implement the classification
task. The first data set is the English version of the holy Quran data set that was provided by a
researcher at the University of Leeds. However, since this project was to process Arabic text,
the data set was converted to Arabic: the full Arabic Quran data set was downloaded from the
internet and the subset was retrieved using the English subset that was provided.
Additionally, the hadith data set was created manually using the website that was mentioned in
the previous chapter. This design step was not expected and took time to complete, since there
has been no recent natural language processing research on hadith, other than the example given
in the background reading, and the corpus that was used is not available. The data set was first
gathered from the website and then processed into a certain format to retrieve the .arff file.
The subset was first gathered from different sources and then formatted for use in the
classification task. To perform the classification task on this data set, a Java program was
created to help improve the format of the data.
Two other Java programs were designed to retrieve the .arff file that was used in WEKA to
implement the classification. Some issues appeared when the implementation was carried out on the
English data set. The problem was skewed data, a result of the large data set labelled No
compared to the small subset that was all labelled Yes. According to Andrew McKinley [26],
"Subsampling the whole data set (i.e. both test and train) so the classes are more balanced will
help a lot." As a result, the arff files that were used in the classification of the Arabic
version of the holy Quran and the Hadith were resized.
5.2.2 Evaluation of features
Feature selection was an important procedure in this project, helping to design the solution and
to implement a suitable classification model. The features were selected based on my
pre-knowledge of the holy Quran and with the help of Quran Explorer. In the beginning, the
feature was defined to be a list of highly frequent words in the subset Quran data set. Key words
for the hereafter and Day of Judgment were defined in order to retrieve their frequencies from
the holy Quran data set. Since this subject is categorized into many sections, the selected
category was names of the Day of Judgment. As has been mentioned in the previous chapter, the
features that were selected are The Hereafter, Day of Resurrection, Rise from the Dead and The
Judgment Day. However, there were more terms that were not used, and a number of terms in the
English version could be mapped to different terms in the Arabic version; for example,
Resurrection can be mapped to at least three different terms in Arabic. This was one of the
things that changed the classification result, since the word list used was smaller than in the
English version. Many other words could be added to the word list to improve the classification
results, such as words that describe the hereafter, like heaven and hell, and the signs which
lead to the hereafter. Such a list of words could improve the classification results because they
are semantically related to the Day of Judgment. In relation to the results of the project
implementation, each of the defined words whose frequencies were retrieved stands for one word;
these were later combined into a single frequency count. In addition to this improvement on the
holy Quran, the Meccan and Madinan feature was added along with the combined frequencies, and the
results improved. Moreover, adding the J-O feature was another option for testing the classifier
results with this additional feature.
The results of the features on the hadith data set were different from those on the Quran data
set. Although the features were the same, except for the additional word that was added to the
frequency count feature, the results improved. These results will be explained in the following
section, where an analysis will describe and discuss the results tables that were given in the
previous chapter.
5.3 Analysis
In the beginning, the classification model was implemented on the English data set, and from the
results of the WEKA classifiers represented in Table 1 it was obvious that the data was skewed.
The next step was to resize the data set used in the classification, using the Arabic data set.
According to Bilal M. Zahran and Ghassan Kanaan [27], there are some difficulties in classifying
Arabic text. For example, the Arabic language has different syntax and semantics from other
languages. Comparing English to Arabic, Arabic letters have many written forms, and the
punctuation associated with them changes the meaning. Moreover, comparing Arabic roots to English
roots is complex, and so natural language processing for Arabic differs from that for English. In
relation to the implementation section in the previous chapter, different approaches were used in
order to test and perform the classification task. Using the Arabic data set, resizing the data
was an additional option when looking for different classification results.
In relation to QuranArabic.arff versions number (1, 2 and 3): the J48 classifier did not produce
any useful results, because changing the version of the dataset that was used did not have any
effect on the results, and J48 did not produce any graph that showed the Yes-No class. The
Decision Table did not show much difference in the classification results; however, the highest
result using this classifier was 405 correctly classified instances, which is equal to 78.9474%.
This number of instances includes 7 of the True positive and 398 of the True negative instances.
Naïve Bayes also scored the highest percentage of 405 correctly classified instances, equal to
78.9474%, and its lowest percentage of 402 correctly classified instances, equal to 78.3626%, is
still greater than that scored by the other classifiers. Using the frequency count, and keeping
in mind the size of the data, it seems that the Naïve Bayes classifier is better for this text
classification task, since J48 did not change and the Decision Table did not score any better
than Naïve Bayes.
In relation to QuranArabic.arff versions 4, 5, and 6, J48 did not improve its results
although different sets were used to test the classifier. In this approach too, J48 did not produce any
graph that shows the Yes-No class. Similarly, the Decision Table had the same results as J48. Yet
these classifiers scored the highest percentage, which is 113 correctly classified instances, equal to
53.0516%. This number refers to all 113 True positive instances and 0
True negative instances. On the other hand, the Naïve Bayes classifier scored the lowest
percentage, with 111 instances belonging to the True positive class. This score did not change
although different data sets were used in testing. Using the frequency count and resizing the data set
did not help in improving the results of the classification. All of the classifiers were able to detect the
set that was interesting but could not classify any of the non-interesting text. In order to improve these
results, the Meccan-Madinan feature was added, along with combining the frequencies into a single
count.
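Combining the frequencies into a single count can be sketched as summing, per verse, the occurrences of every keyword in the feature list, producing one numeric attribute instead of one attribute per word. The keyword list below is illustrative only, not the actual list derived from the Islamic scholars' research:

```java
import java.util.List;

// Sketch of combining per-word frequencies into a single count:
// one numeric attribute holds the total occurrences of all keywords
// in the verse, instead of one WEKA attribute per keyword.
// The keyword list used here is illustrative, not the project's list.
public class CombinedCount {
    static int combinedCount(String verse, List<String> keywords) {
        int total = 0;
        for (String kw : keywords) {
            int from = 0;
            while ((from = verse.indexOf(kw, from)) != -1) {
                total++;             // count this occurrence
                from += kw.length(); // continue scanning after the match
            }
        }
        return total;
    }
    public static void main(String[] args) {
        List<String> kws = List.of("day", "judgment");
        System.out.println(combinedCount("the day of judgment is a day", kws));
    }
}
```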
In relation to QuranArabic.arff versions 7, 8, and 9, J48 improved its results by
producing different output each time a new set was used. When the Meccan-Madinan feature was added, a graph
was provided to classify the verses of the Quran. However, going back to the table, we see that the
correctly classified instances are 400, which is equal to 77.9727%. These instances are all True
negative instances; there were no True positive instances. In addition, the Decision Table shows
results similar to the J48 classifier. This may be due to selecting a random set of verses that may
include mostly Meccan verses, but labelled No. The Naïve Bayes classifier scored the lowest result in
comparison to the other two mentioned. However, this classifier was able to detect some of the
True positive instances, although it was an insignificant number: 14 instances out of 113.
In relation to table 2, two versions of the arff files showed exactly the same results in all
classifiers. The third version of the arff file presented a smaller number of correctly classified
instances, yet better than the others. The highest score, by J48 and the Decision Table, is 132 correctly
classified instances, which is equal to 61.9718%. Naïve Bayes scored 130 correctly classified
instances, which is equal to 61.0329%. The 130 instances include 100 True positive instances
and 30 True negative instances. Additionally, the J-O feature was added along with the Meccan-Madinan
feature and combining the frequencies into a single count.
In relation to table 3, the J48 classifier gave one of the best results: 125 instances,
which is equal to 58.6854%. 91 instances were correctly classified as True positive and 34 as True
negative. The Decision Table classifier on this data set gave the highest score, 128
instances, which is equal to 60.0939%. This includes 92 True positive instances and 36
True negative instances. The Naïve Bayes classifier scored exactly the same percentage as J48.
However, when we look at the confusion matrix, the True positive instances using this classifier were
89, and 36 were True negative instances.
By comparing the results of the classification using the sets of size 100-113, the classification
task improved enormously, from 53.0516% to 61.9718%, by adding the Meccan-Madinan feature.
However, by adding the J-O feature the percentage dropped to 60.0939%. This is not a
big difference; the decrease is due to the small number of verses in the subset that
belong to the 30th section of the holy Quran, namely only 9 verses out of the 113.
The implementation of the classification on the hadith data set presented better and different
results than those from the holy Quran data set. In relation to table 7, J48 provided different results
and graphs for every version of the arff files used in this implementation. This classifier scored one of
the highest results, taking into account that the only feature used was the frequency count. It
scored 225 correctly classified instances, which is equal to 79.7872%. However, the confusion
matrix shows that most of the instances are True negative, and only 31 of the 86 positive instances are True
positive. The Decision Table classifier scored 231 correctly classified instances, which is equal
to 81.9149%, even higher than J48. This classifier was able to classify 39 True positive
instances. Naïve Bayes scored up to 232 instances, which is equal to 82.2695%, and includes 40
True positive instances and 192 True negative instances.
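Accuracy alone hides the weak minority-class performance noted above; per-class recall and precision make it visible. A minimal sketch, using the Naïve Bayes hadith counts (40 True positives out of 86 interesting hadith; the 4 False positives are inferred from the 192 True negatives out of 196 non-interesting instances, an assumption):

```java
// Sketch: recall and precision for the interesting ("Yes") class,
// using the Naive Bayes hadith counts above (TP=40, FN=46, FP=4).
// FN and FP are inferred from the class totals, not quoted directly.
public class ClassMetrics {
    static double recall(int tp, int fn)    { return 100.0 * tp / (tp + fn); }
    static double precision(int tp, int fp) { return 100.0 * tp / (tp + fp); }
    public static void main(String[] args) {
        System.out.printf("recall=%.2f%% precision=%.2f%%%n",
                          recall(40, 46), precision(40, 4));
    }
}
```

Despite an 82.2695% accuracy, recall on the interesting class is only about 46.5%, which is why the resizing and feature-combination experiments below matter.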
Figure 5: Visualization of J48 classifier for QuranArabic10.arff
Figure 6: Visualization of J48 classifier for QuranArabic15.arff
Figure 7: Visualization of J48 classifier for HadithArabic7.arff
However, in relation to the 4th table, when the data set was resized the percentage of correctly
classified instances decreased. The J48 classifier scored 135 instances, equal to 74.1758%, which includes 37 True
positive and 98 True negative instances. However, another version scored 39 True positive
instances using the same classifier. The Decision Table classifier scored 134 instances, equal to 73.6264%. Similar to J48, it
scored 37 True positive instances and 97 True negative instances. The Naïve Bayes
classifier scored the highest results compared to the other classifiers used, reaching 138
correctly classified instances, which is equal to 78.8242%, including 41 True positive instances and 97
True negative instances. Moreover, in relation to the 5th table, after combining the frequency counts
into one single count, the results improved to 83.6879%, although the word list did not change. This
percentage includes 44 True positive instances and 192 True negative instances. Other versions of
the arff file that implemented this approach did not go below 80.8511%, which is better than the score
when each word frequency was counted separately. However, this may be due to selecting
random sets that may include none of the words that helped in the classification. Resizing the
data set to 100-86 did not produce any changes that could be considered useful.
5.3 Future Work
This project used two data sets to implement the classification task and test possible features
that could improve the classifier in text processing and classification. There are a number
of possible directions for future work on this task. One example that can be
implemented on the holy Quran data set is oversampling the subset that was used, increasing the
number of verses in the text file that relate to the hereafter concept. In relation to the hadith data set,
oversampling is also a possible idea. This can be implemented by including more hadith in the subset,
using other sources than the one used in this project. In addition, sub-sampling is possible in this data
set. Additionally, a new data set could be used for text classification on the hereafter subject,
since the Day of Judgment is a pillar of faith in Islam. An example could be a book about
the pillars of faith, which would include a section on the belief in the Day of Judgment. This section could
be used as a subset in testing the classifier and the features.
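The simplest form of oversampling can be sketched as cycling through the minority ("Yes") lines and duplicating them until a target size is reached; the future-work idea above is the stronger variant of adding genuinely new verses or hadith from other sources rather than duplicates:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of oversampling the minority "Yes" class: duplicate its
// instances round-robin until the class reaches a target size.
// This only balances the counts; the future-work suggestion above
// is to add genuinely new instances from other sources instead.
public class Oversample {
    static List<String> oversample(List<String> minority, int targetSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; out.size() < targetSize; i++) {
            out.add(minority.get(i % minority.size()));
        }
        return out;
    }
    public static void main(String[] args) {
        System.out.println(oversample(List.of("verse-a", "verse-b"), 5));
    }
}
```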
In relation to the previous suggestions for improving the data sets used in classification,
combining the holy Quran subset and the hadith subset is an alternative to oversampling. Combining
subsets from these main sources of Islam could help in classifying any data set gathered from
any source. This may also help in improving the results when a high-standard subset
is provided to supervised learning algorithms that process text and implement the classification
task.
All the previous examples concern different data sets that can be used with the hereafter
classifier. However, new subjects could be used in the process of text classification, using different
features that may include types other than word frequency counts, for example, the
length of the verse that holds a subject. In relation to text classification and detecting
suspicious text, future implementations may include a combination of different types of data sets, with
the classifier and the defined features then tested on them. Given this direction, and in relation to this
project, it may be implemented by combining the holy Quran data set and the hadith data set and
selecting some features that would help in the classification of both data sets.
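The verse-length feature suggested above could be as simple as a token count added as one more numeric attribute; a minimal sketch, noting that whitespace tokenisation is an assumption here and would need care with Arabic script:

```java
// Sketch of a verse-length feature: the number of whitespace-separated
// tokens in a verse, usable as an extra numeric attribute alongside
// the frequency counts. Whitespace tokenisation is an assumption and
// would need adapting for Arabic diacritics and punctuation.
public class VerseLength {
    static int verseLength(String verse) {
        String t = verse.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }
    public static void main(String[] args) {
        System.out.println(verseLength("In the name of God"));
    }
}
```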
Bibliography
1. Zhai, ChengXiang. Statistical Language Models. [Online] 2009. [Cited: 2 4 2011.] http://0-
www.morganclaypool.com.wam.leeds.ac.uk/doi/pdf/10.2200/S00158ED1V01Y200811HLT001.
2. Atwell, Eric and Sharaf, Abdul Baquee. Is Machine Learning useful for Qur'anic Studies? School
of Computing, University of Leeds. http://www.comp.leeds.ac.uk/eric/sharaf10jqsDraft.pdf.
3. Jackson, Peter and Moulinier, Isabelle. Natural Language Processing for online applications:
Text Retrieval, Extraction and categorization. s.l. : John Benjamins Publishing Company, 2002.
4. Liddy, Elizabeth D. Text Mining. [Online] 2000. [Cited: 20 02 2010.]
http://onlinelibrary.wiley.com/doi/10.1002/bult.184/pdf. 1550-8366.
5. Jbara, Khitam. Knowledge Discovery in Al-Hadith Using Text Classification. 2010, Vol. 6, 11.
http://www.jofamericanscience.org/journals/am-sci/am0611/48_3679am0611_409_419.pdf.
6. Al-Kabi, G Kanaan, et al. Statistical Classifier of the holy Quran Verses (Fatiha and Yaseen
Chapters). 2005, Vol. 5, 3, pp. 580-583. http://docsdrive.com/pdfs/ansinet/jas/2005/580-583.pdf.
7. S. Al-Harbi, A. Almuhareb, A. Al-Thubaity,M. S. Khorsheed, A. Al-Rajeh. Automatic Arabic
Text Classification. King Abdulaziz City for Science and Technology. Riyadh : s.n.
http://lexicometrica.univ-paris3.fr/jadt/jadt2008/pdf/harbi-almuhareb-thubaity-khorsheed-rajeh.pdf.
8. Zhu, X. and Goldberg, A.B. Introduction to Semi-Supervised Learning. s.l. : Morgan & Claypool,
2009.
9. Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski. Data mining: a knowledge discovery
approach. s.l. : Springer US, 2007.
10. The K-Means Clustering Machine Learning Algorithm. puremango. [Online] [Cited: 27 03 2011.]
http://www.puremango.co.uk/2010/01/k-means-clustering-machine-learning/.
11. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to
Information Retrieval. s.l. : Cambridge University Press, 2008. http://nlp.stanford.edu/IR-
book/pdf/17hier.pdf.
12. McCarthy, Diana. Word Sense Disambiguation: An Overview. 2009, Vol. 3, 2, pp. 537-558.
http://onlinelibrary.wiley.com/doi/10.1111/j.1749-818X.2009.00131.x/pdf.
13. Bramer, Max A. Principles of data mining Undergraduate topics in computer science. s.l. :
Springer, 2007.
14. Abney, Steven. Semisupervised Learning for Computational Linguistics. [ed.] David Madigan et
al. 2008, Vol. 34, 3, pp. 449--452. http://www.hlt.utdallas.edu/~vince/papers/abney.pdf.
15. Zhu, Xiaojin. Semi-Supervised Learning with Graphs. [Online] 2005. [Cited: 27 03 2001.]
http://www.lti.cs.cmu.edu/Research/Thesis/XiaojinZhu05.pdf.
16. Aphinyanaphongs, Yin and Aliferis, Constantin. Text Categorization Models for Identifying
Unproven Cancer Treatments on the Web. Vanderbilt University. 2007. http://www.dsl-
lab.org/Publications/Aphinyanaphongs_2007a.pdf.
17. Ian H. Witten, Eibe Frank. Data mining: practical machine learning tools and techniques. 2.
s.l. : Morgan Kaufmann, 2005.
18. Dimov, Rossen. Weka: Practical machine learning tools and techniques with Java
implementations. 2007. http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf.
19. Al-Farsi, May. Mayana2011. Blog Post. [Online] 2011. http://mayana2011.blogspot.com/.
20. Schütze, Christopher D. Manning and Hinrich. Foundations of statistical natural language
processing. s.l. : MIT Press, 1999.
21. Abbas, Noorhan and Atwell, Eric. quranytopics.appspot. Qurany. [Online] University of Leeds.
[Cited: 15 03 2011.] http://quranytopics.appspot.com/.
22. alqran wa Aolomoh. islamiyyat. [Online] 04 01 2009. [Cited: 25 04 2011.]
http://www.islamiyyat.com/alqranwa3olomoh/2009-03-26-15-21-49/106----1.html.
23. Sahih Muslim. al-eman. [Online] [Cited: 4 04 2011.] http://www.al-eman.com/library/book/book-
view.htm?id=1#s1.
24. Siddiqui, Abdul Hamid. English - Sahih Muslim. quaran truth. [Online] [Cited: 4 04 2011.]
http://www.quarantruth.com/PDFS/Hadith/English%20-%20Sahih%20Muslim.pdf.
25. qurancomplex. [Online] [Cited: 18 04 2011.] http://www.qurancomplex.org/default.asp?l=eng.
26. McKinlay, Andrew . ML classifier for small minority class. s.l. : [email protected], 4th April
2011.
27. Bilal M. Zahran, Ghassan Kanaan. Text Feature Selection using Particle Swarm Optimization
Algorithm. 2009, Vol. World Applied Sciences Journal 7. www.idosi.org/wasj/wasj7(c&it)/10.pdf .
28. Bharati, Akshar, Chaitanya, Vineet and Sangal, Rajeev. Natural Language Processing: A
Paninian Perspective.
29. Dukes, Kais, et al. Online Visualization of Traditional Quranic Grammar using Dependency
Graphs. School of Computing, University of Leeds. http://www.comp.leeds.ac.uk/scsams/qcorpus-
fal2010.pdf.
30. Atwell, Eric and Sharaf, Abdul Baquee. Creating a Gold Standard Corpus for Related Texts.
School of Computing, University of Leeds.
http://www.comp.leeds.ac.uk/eric/cl2011/sharaf11clAbstract.docx.
31. Juola, Patrick. Authorship Attribution. s.l. : Now Publishers Inc, 2008.
32. D. J. Hand, Heikki Mannila, Padhraic Smyth. Principles of data mining. s.l. : MIT Press, 2001.
Appendix A
Personal Reflection
As a computer science student, I selected this project because I was interested in the natural language
processing area. However I did not have any pre-knowledge of a real life application that it solved,
apart from a module that I took in my final year. This module helped me in understanding what
natural language is, and this project helped me in implementing my understanding. In addition, I had
to look for previous examples that were already available to understand what had already been
implemented. There were examples of projects implemented in the School of Computing at the
University of Leeds, and other examples from research at other universities that were accessible
by looking at their publications over the internet. Then I started working on my project in order to
implement the solution that would help in text processing and classification, based on what I have
read and understood from my background reading. I had many questions for my supervisor in order to
confirm my understanding of an idea I had, and this made me more curious about this research area. I
hope I can use my knowledge and recall all that I have done in this project in my professional life.
From the beginning of the project it was essential to keep track of the time and work that should be
completed. To keep track of my work I created a blog that included some information that I gathered
from the papers I read. However it was not used much when the implementation was done. Obviously,
there were times when I was late and behind schedule, especially when I had to find a new data set
and test the classifier on it. I had to look for a data set that would include the hereafter subject and
then create the data set text file and subset manually, which included reading instructions to
understand the books from where I retrieved the data. However, I managed to get back on track by
making every effort to catch up when setting up the data and designing a Java program that would
help in formatting it. To overcome this problem I encourage further students to work on their projects
with a good management of time. Overall this project helped me in understanding how text can be
useful in many different applications in real life. For example, through text processing we can have
translations into many languages, and be able to recognize books and to whom they belong from the
style. Classifying text can help in many aspects of the world, and in this project I provide a
contribution to help in the project "Detecting Terrorist Activities: Making Sense", currently operating
at the University of Leeds. Much of the research that has been done at the university into the holy
Quran is used in helping Islamic scholars to retrieve whatever information is needed from the holy
Quran. Similarly, new research on hadith is carried out, using natural language processing
techniques to classify the sayings of the prophet, which can be used as sources in other Islamic
studies. I recommend that future students pick a project related to natural language
processing which they feel comfortable working with, since it is helpful in many applications
in the real world.
Appendix B
Materials used in the Project
There are many text files that were used in order to implement the classification task. This material is
provided on a CD attached to the submitted copy. The CD is included because some of
the material used runs to more than 300 lines, so it was best to write it to a CD and attach it
to the report. The CD is divided as follows:
Quran data set Folder:
This folder is divided into two folders depending on the corpora language. Both folders include all
files that are used in the implementation as follows:
English version
1. allQuran.txt: text file provided by Claire Brierley.
2. csubsetQuran.txt: text file provided by Claire Brierley.
3. cnoarff.java: used to write the noQuran.arff file.
4. allQuran.java: used to write the allQuran.arff file and combine noQuran.arff with
allQuran.arff.
5. allQuran.arff: includes all lines of verses labelled Yes in arff format.
6. noQuran.arff: includes all lines of verses labelled No in arff format.
7. complement.txt: all verses that are not in csubsetQuran.txt.
Arabic version
1. ArabicQuran.java: used in creating the QuranArabic.arff files.
2. allQuranArabic.txt: text file retrieved by downloading it from the internet and used
in ArabicQuran.java.
3. subsetQuranArabic.txt: text file retrieved using allQuranArabic.txt and
csubsetQuran.txt, and used in ArabicQuran.java.
4. ComplementArabicQuran.txt: text file retrieved from ArabicQuran.java.
5. meccan-madinan.arff: used in ArabicQuran.java to include Meccan-Madinan feature.
6. QuranArabic#.arff: text file retrieved from ArabicQuran.java (#: refers to the version
number of the arff file used in the implementation which is 1-15).
7. RandomSubsetArabicQuran#.txt: text file retrieved from ArabicQuran.java (#: refers
to the version number of the random set that corresponds to the arff file used in the
implementation which is 1-15).
Hadith data set Folder:
This folder includes all files that are used in the implementation as follows:
1. SetupData.java: used in formatting the hadith data set.
2. ArabicHadith.java: used in creating the HadithArabic.arff files.
3. SahihMuslim.txt: text file used in SetupData.java.
4. subsetHadith.txt: text file used in SetupData.java.
5. AllHadithArabic.txt: text file retrieved from SetupData.java and used in ArabicHadith.java.
6. SubsetHadithArabic.txt: text file retrieved from SetupData.java and used in
ArabicHadith.java.
7. ComplementArabicHadith.txt: text file retrieved from ArabicHadith.java.
8. HadithArabic#.arff: text file retrieved from ArabicHadith.java (#: refers to the version
number of the arff file used in the implementation which is 1-12).
9. RandomSubsetArabicHadith#.txt: text file retrieved from ArabicHadith.java (#: refers to the
version number of the random set that corresponds to the arff file used in the implementation,
which is 1-12).
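For readers without the CD, the arff files listed above follow WEKA's standard format. A minimal sketch of how a program such as ArabicQuran.java or ArabicHadith.java might emit one; the relation and attribute names here are illustrative, not the project's actual ones:

```java
import java.util.List;

// Sketch of emitting a WEKA arff file like those listed above.
// Each row holds one combined frequency count and a Yes/No label.
// Relation and attribute names are illustrative assumptions.
public class ArffWriterSketch {
    static String toArff(String relation, List<int[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute count numeric\n");
        sb.append("@attribute class {Yes,No}\n\n");
        sb.append("@data\n");
        for (int[] r : rows) {                // r[0] = count, r[1] = 1 for Yes
            sb.append(r[0]).append(',')
              .append(r[1] == 1 ? "Yes" : "No").append('\n');
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.print(toArff("QuranArabic",
                List.of(new int[]{3, 1}, new int[]{0, 0})));
    }
}
```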