The Effectiveness of a Jawi Stemmer for Retrieving Relevant Malay Documents in Jawi Characters

SULIANA SULAIMAN, Sultan Idris Education University, Malaysia
KHAIRUDDIN OMAR, NAZLIA OMAR, MOHD ZAMRI MURAH, and HAMDAN ABDUL RAHMAN, Universiti Kebangsaan Malaysia

The Malay language has two types of writing script, known as Rumi and Jawi. Most previous stemmer results have been reported on Malay Rumi characters, and only a few have tested Jawi characters. In this article, a new Jawi stemmer is proposed and tested for document retrieval. A total of 36 queries and datasets from the transliterated Jawi Quran were used. The experiment shows that the mean average precision for “stemmed Jawi” documents is 8.43%, while the mean average precision for “nonstemmed Jawi” documents is 5.14%. A paired sample t-test showed that the use of “stemmed Jawi” documents increased the precision in document retrieval. Further experiments were performed to examine the precision of the relevant documents that were retrieved at various cutoff points for all 36 queries. The results for the “stemmed Jawi” documents differed significantly from those for the “nonstemmed Jawi” documents starting at a cutoff of 40. This result shows the usefulness of a Jawi stemmer for retrieving relevant documents in the Jawi script.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing: Language models; Language parsing and understanding; Text analysis; H.3.4 [Information Storage and Retrieval]: Systems and Software: Performance evaluation (efficiency and effectiveness); H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing: Linguistic

General Terms: Languages, Performance

Additional Key Words and Phrases: Jawi stemmer, Malay stemmer, Jawi document retrieval, stemming

ACM Reference Format:
Sulaiman, S., Omar, K., Omar, N., Murah, M. Z., and Rahman, H. A. 2014. The effectiveness of a Jawi stemmer for retrieving relevant Malay documents in Jawi characters. ACM Trans. Asian Lang. Inform. Process. 13, 2, Article 6 (June 2014), 21 pages.
DOI:http://dx.doi.org/10.1145/2540988

1. INTRODUCTION

Stemming in Malay is more complex than in English. The Malay language has two different types of script: the Jawi script and the Rumi script. Jawi is an Arabic-script-based orthography, and Rumi is a Roman-based script. Jawi is read from right to left and has different forms of characters. For example, the word “king” in Malay can be written as “ ” in Jawi or “Raja” in Rumi. The Jawi script was used as early as 674 [Nasruddin et al. 2008]. It is also used as a writing system in the Malay Archipelago.

Jawi has also been used as an art form in Islamic calligraphy. This type of calligraphy can be seen in architecture, where walls are decorated using the Jawi script. Historically, the Jawi script was used for writing on inscribed stones and wood [Moain 1992]. The Jawi script was used as an official written script for communication between the Malay king and the British king. The Jawi script was very important at that time; for example, Sultan Muzaffar Syah (King of Perak) embossed his name on the Perak currency using the Jawi script during his rule [Yatim 1990]. The Jawi script has been used in approximately 15,000 manuscripts that are kept in the British Library [Nasruddin et al. 2008].

Authors’ addresses: S. Suliana (corresponding author), Faculty of Art, Computing and Creative Industry, Sultan Idris Education University, Tanjung Malim, Perak Darul Ridzuan 35900, Malaysia; email: [email protected]; K. Omar, N. Omar, M. Z. Murah, and H. A. Rahman, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2014 ACM 1530-0226/2014/06-ART6 $15.00
DOI:http://dx.doi.org/10.1145/2540988

ACM Transactions on Asian Language Information Processing, Vol. 13, No. 2, Article 6, Publication date: June 2014.

After many years, the Jawi script was replaced by the Rumi script, which is still in use today. The official use of the Rumi script started in the early 20th century. The Rumi script is a romanized transliteration of the Jawi script. During that period the Malaysian government announced that the official language of Malaysia would use the Rumi script and that all documents must be written in this script.

The motivation for developing a Jawi stemmer is that Jawi is not just a writing script. Jawi has been used since the 14th century, when most Malays wrote their manuscripts in Jawi. Ding Choo Ming emphasized the importance of preserving these manuscripts and making them accessible to scholars, as they are critical to the study of Malay literature and culture [Ming 1986]. In the 19th century many of these manuscripts were copied by the Europeans [Ming 1986]. To this day, the National Library of Malaysia and other agencies have been involved in efforts to preserve these manuscripts and prevent them from being destroyed. One such effort has been to digitize the manuscripts; a Jawi stemmer is therefore beneficial, especially in helping to search for the appropriate word or term in the digitized manuscripts. It can also be used as one of the components for transliterating the Jawi script into the Rumi script [Ghani et al. 2009; Yon Hendri 2009]. The Malaysian Ministry of Education is taking serious measures to preserve the Jawi script, and one of its aims is to ensure that primary-school children can read Jawi.

The difference between Jawi and Rumi can be seen in the characters, the vowels, the spelling method, and the loan words. Even though the two scripts are different from each other, the language is purely Malay. Vowels in the Rumi script represent six different sounds: [a], [e], [i], [o], [u], and [ə]. However, in the Jawi script, these six vowel sounds are represented by three characters: alif, wau, and ya. This difference is one of the reasons why the Rumi and Jawi scripts are distinct even though they represent the same language. Figure 1 shows vowel representation in the Jawi and Rumi scripts for the Malay language.

Fig. 1. Vowel representation in the Jawi and Rumi scripts for the Malay language.

Malay words consist of one or more syllables. A syllable is a sound of a vowel that is created when we pronounce the word. For example, single-syllable words include <ru>, <cap>, and <bah>; two-syllable words include <bulan>, which is a combination of two syllables, <bu> + <lan>; three-syllable words are a combination of three single syllables, such as <utara> = <u> + <ta> + <ra>; four-syllable words are a combination of four single syllables, such as <sementara> = <se> + <men> + <ta> + <ra>; five-syllable words are a combination of five syllables, such as <universiti> = <u> + <ni> + <ver> + <si> + <ti>; and six-syllable words are a combination of six syllables, such as <keanak-anakan> = <ke> + <a> + <nak> + <a> + <na> + <kan>. These syllables can be divided into open and closed syllables, based on specific patterns. Open syllables have three patterns: the Vowel pattern (Vp), which is composed of only one vowel sound; the Consonant Vowel pattern (CVp), composed of one consonant and one vowel sound; and the Consonant Diphthong pattern (CDp), composed of a consonant sound followed by a diphthong. Closed syllables, in contrast, are syllables that end with a consonant character. Closed syllables have two patterns: the Vowel Consonant pattern (VCp), which is composed of one vowel and one consonant, and the Consonant Vowel Consonant pattern (CVCp), which contains a consonant followed by a vowel followed by a consonant. Most pure Malay words are disyllabic.
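The five syllable patterns described above can be illustrated with a minimal Python sketch. It works on Rumi syllables only, and the vowel set, diphthong list, and consonant digraph list are assumptions made for this illustration; the function names `to_units` and `classify` are hypothetical and are not part of the stemmer described in this article.

```python
# Assumed Rumi inventories for illustration only.
VOWELS = set("aeiou")
DIPHTHONGS = ("ai", "au", "oi")
DIGRAPHS = ("ng", "ny", "sy", "kh", "gh")  # two letters, one consonant sound

# Pattern string -> (pattern name used in the text, open/closed)
PATTERNS = {
    "V": ("Vp", "open"),
    "CV": ("CVp", "open"),
    "CD": ("CDp", "open"),
    "VC": ("VCp", "closed"),
    "CVC": ("CVCp", "closed"),
}


def to_units(syllable):
    """Map a Rumi syllable to a string of C (consonant), V (vowel), D (diphthong)."""
    s = syllable.lower()
    units = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIPHTHONGS:
            units.append("D")
            i += 2
        elif pair in DIGRAPHS:
            units.append("C")
            i += 2
        elif s[i] in VOWELS:
            units.append("V")
            i += 1
        else:
            units.append("C")
            i += 1
    return "".join(units)


def classify(syllable):
    """Return (pattern name, 'open' or 'closed'), or (None, None) if no pattern fits."""
    return PATTERNS.get(to_units(syllable), (None, None))


for syl in ("u", "bu", "lan", "an", "lai"):
    print(syl, classify(syl))
```

The sketch classifies one syllable at a time; splitting a whole word into syllables is a separate problem that it does not attempt.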

The aim of this article is to investigate whether the use of a Jawi stemmer can increase the precision and recall in Jawi document retrieval. An experiment was performed in which the effect of a stemmer was computed using the Mean Average Precision (MAP) between stemmed Jawi documents and nonstemmed Jawi documents (no stemmer was used), to find out whether the stemmer had a positive effect on the precision and recall. Next, statistical testing was performed to be certain that there was a significant difference between the stemmed Jawi documents’ MAP and the nonstemmed Jawi documents’ MAP.
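As a reminder of how MAP is usually computed, the following Python sketch uses the standard definition: average precision is the mean of the precision values at each rank where a relevant document appears, and MAP is the mean of the average precision over all queries. This section does not spell out the formula, so the definition, the function names, and the document identifiers below are assumptions for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list (document ids assumed unique)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical two-query example.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```

Running the same function over the stemmed and nonstemmed result lists for each query gives the two sets of per-query scores that a paired significance test compares.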

This article is divided into six sections. Section 2 presents an overview of related studies. Section 3 describes the Jawi stemmer. Section 4 describes the test collection. Section 5 presents the experiments and results. Finally, Section 6 presents our conclusions and possible directions for future research.

2. RELATED STUDIES

A stemmer is also beneficial for transliteration. Roslan [2009] suggested a new method for transliterating the Jawi script to the Rumi script using a rule-based system. The affix and the root word must be separated by a stemmer to make the transliteration process easy and fast [Roslan 2009; YonHendri 2009].

Stemming was developed to reduce morphological variants of root words [Hull 1996]. A stemmer is used to increase recall and precision in some languages, such as English [Harman 1991], Swedish [Carlberger et al. 2001], Bengali [Islam et al. 2007], Dutch [Kraaij and Pohlmann 1996], and French [Savoy 1999]. Abdullah [2006] and Ahmad [1995] studied the effect of stemmers on Malay documents. Their studies showed that search engines retrieve more relevant results from stemmed documents than from nonstemmed documents. However, the experiments were performed only on Rumi Malay documents, and no results were reported on the effect of stemmers for Jawi documents.

In order to produce the most accurately stemmed words, techniques such as rule-based methods, n-grams, supervised learning, and dictionaries have been used. Porter [1980] introduced a suffix-stripping stemmer for English. Its algorithm is short and fast. Complex suffixes can be removed by the stemmer in simple steps. Whether a suffix is removed depends on the part of the word that remains. For this reason, the length of the word is important [Porter 1980]. To be certain that the stemmer could be used to improve retrieval performance, the stemmer was tested using the Cranfield 200 collection [Cleverdon et al. 1966]. The results showed that the stemmer could improve retrieval compared with the program used in Cambridge since 1971 [Porter 1980]. Frakes and Baeza-Yates [1992] reported from a previous study that it was still unclear whether stemming is useful for certain languages. Harman [1991] demonstrated that there was no significant improvement among the S-stemmer, the Porter stemmer, and the Lovins stemmer. Flores et al. [2010] evaluated whether the best stemmers for Portuguese also achieve the best retrieval effectiveness. Their results showed that a relationship exists but that it is not as strong as previously estimated [Flores et al. 2010].

A Malay stemmer was employed in the Malay-English Terminology Retrieval System by Sembok et al. [2003]. Using the Malay stemmer, Malay science terminology could be retrieved. There are many types of stemmers, such as rule based, n-gram, and suffix stripping. Abu Bakar [1999] proposed a conflation method using a combination of the n-gram string and the RAO stemming algorithm and showed an improvement in retrieval effectiveness on Malay documents. Adriani et al. [2007] developed a confix-stripping approach to stem Indonesian derived words. The inflectional and derivational suffixes were removed, followed by derivational prefixes; then, recoding was applied, if possible. Prefix ambiguities were treated using a rule-based method. If the prefix met the requirement of a rule, the result was returned; otherwise, the rule was skipped. As for rule precedence, the prefix was removed before the suffix when certain prefix-suffix pairs were encountered (be-..-lah, be-..-an, me-..-i, di-..-i, pe-..-i, and te-..-i), in order to address common ambiguities. However, in rare instances, the suffix was removed before the prefix. Hyphenated words were treated using explicit lookup lists. The stemmer was tested in two experiments: one to determine how good the stemmer is, and one to determine how stemming affects information retrieval from Indonesian text. The results showed that, even though the confix-stripping approach gave the best results, it still could not solve all of the stemming problems because ambiguity is inherent in human languages. The results also showed that stemming does not significantly help retrieval performance on the Indonesian collection.

According to Malacon [2004], a rule-based approach is a necessary tool for processing Malay documents. The author proposed a rule-based approach to analyse affixed words in Malay [Malacon 2004]. The morpho-graphemic problem was solved by affirming that modifications only affected the form of the base. It could also be solved by hashing the word using segmentation rules and applying morpho-graphemic rules to build the citation form of the base. Searching for affixes from two directions (left and right) makes it possible to identify a circumfix and thus to produce a correct segmentation. The results show that the accuracy of the Malay morphology analysis was between 92% and 94% for Malay text and was 89% for Indonesian text.

Othman [1993] developed the first Malay stemmer algorithm and used a dictionary to stem Malay derived words into their root words. The dictionary was divided into 26 different files. Another 26 files were used to help search for the roots in the dictionary. The first Malay stemmer was developed using a rule-based approach. A total of 121 rules were used to ensure that the stemmer could stem as well as possible. These rules were arranged and applied in alphabetical order to be certain that the computer program was flexible in accepting changes to its morphological rules.

Derived words were checked using the morphological rules. The rules removed the affix through binary searching, and the stemmed word was then checked against a dictionary. According to Othman [1993], affix precedence must follow the order circumfix, prefix, suffix, and infix. A root word was deemed valid when the stemmed word was found in the dictionary; otherwise, a spelling exception was applied. The spelling exception changed the first character of the stemmed word into its corresponding character, based on a rule. The result was then checked against the dictionary. If the word was found in the dictionary, it was output as a root word; otherwise, processing continued with the next rule until the end of the rule list was reached. If the derived word failed to match any of the rules, then the word itself was returned as a root word. No results were reported on retrieval performance.
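The control flow described above (strip an affix, check the dictionary, try a spelling exception, then move on to the next rule) can be sketched in a few lines of Python. This is only an illustration of that flow under stated assumptions: the tiny Rumi dictionary, the four rules, and the recoding table are hypothetical toy data, not the 121 rules or the dictionary files of Othman [1993], and the name `stem` is chosen for this sketch.

```python
# Hypothetical toy data, for illustration only.
ROOTS = {"ajar", "makan", "pukul", "tulis"}
RULES = [("ber", ""), ("di", "kan"), ("me", ""), ("", "an")]  # (prefix, suffix)
RECODE = {"n": "t", "m": "p"}  # spelling exception: first letter -> replacement


def stem(word, rules=RULES, roots=ROOTS, recode=RECODE):
    """Apply rules in list order; return the first dictionary-validated root."""
    for prefix, suffix in rules:
        if not (word.startswith(prefix) and word.endswith(suffix)):
            continue
        if len(word) <= len(prefix) + len(suffix):
            continue  # nothing would be left after stripping
        candidate = word[len(prefix):len(word) - len(suffix)]
        if candidate in roots:
            return candidate
        # Spelling exception: recode the first character and check again.
        first = recode.get(candidate[0])
        if first is not None and first + candidate[1:] in roots:
            return first + candidate[1:]
    # No rule produced a dictionary word: return the input as the root.
    return word


for w in ("menulis", "memukul", "diajarkan", "makanan", "rumah"):
    print(w, "->", stem(w))
```

Because every candidate is validated against the dictionary, the quality of the result depends as much on the dictionary and the rule order as on the rules themselves.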

Ahmad [1995] developed a Rule-Application-Order (RAO) stemmer based on Othman [1993]. Ahmad et al. [1996] performed several experiments to determine the best affix precedence in the Malay stemmer. In the RAO stemmer, each input was checked against a dictionary to confirm whether the word was a root word. Orderings of the three most frequent affix types (i.e., prefix, suffix, and circumfix) were compared, with the infix placed at the end of each ordering. The experiment found that the best order for stemming affixes is the following: prefix, circumfix, suffix, and infix. A recoding rule was applied to words that did not match the root word dictionary. Most of these cases involved a prefix or a circumfix. The first letter of the stemmed word was replaced with another character based on the recoding rule, and the result was again checked against the root word dictionary. This is the same process used by Othman [1993], except for the rule precedence and the type of dictionary that was used. The RAO stemmer was tested with 736 distinct words and produced an accuracy of 98.4%. Ahmad [1995] also tested the RAO stemmer to gauge the retrieval effectiveness of the stemming algorithm. The experiment showed that there was a significant performance difference between an RAO-stemmed document and a nonstemmed document at the 10% level but not at the 5% level.

Abdullah et al. [2009] proposed a Rule-Frequency-Order (RFO) stemmer for Malay based on the RAO stemmer. Errors from the RAO stemmer were examined to increase the stemmer performance. The dictionary and rules from Ahmad’s set A were used [Ahmad 1995]. These rules were sorted in decreasing order according to their frequency. For the test collection, the first two chapters of the Quran were used. To enhance Ahmad’s stemmer [1995], another eight affixes were added to set A, and several modifications were made to improve the spelling variation rule. Several new words were added to the root word dictionary, which brought it to a total of 22,433 entries. We can conclude from these results that the list of rules, the spelling variation, and the root word dictionary affected the performance of the stemming algorithm. The RFO stemmer produced fewer errors than Ahmad’s stemmer [1995]. Abdullah [2006] also tested the RFO stemmer to investigate whether its use improved the retrieval effectiveness. The results show that there was a significant performance difference between an RFO-stemmed document and a nonstemmed document at the 5% level.

There are many tests that can be used in statistical testing. Therefore, the test should be chosen to reflect the objective of the experiment. Ahmad [1995] and Abdullah [2006] used significance tests to compare the performance of the stemmer on a stemmed document and a nonstemmed document. Smucker et al. [2007] emphasized that a statistical significance test was better because it allows the researcher to detect significant improvements, even when the difference is small. The authors performed a study to identify the best statistical significance test to use for information retrieval evaluation [Smucker et al. 2007]. Their results showed that the bootstrap test and the Student’s t-test produced comparable significance values, meaning that these tests would produce practically the same p-values for the same experiment. No practical difference was detected between them. However, the authors reported that, using the same dataset, the Wilcoxon signed rank test and the sign test obtained different p-values, which means that these two tests could reduce the capability of detecting significance and provide a false result [Smucker et al. 2007]. For this article a paired sample t-test was chosen for measuring the significance of the difference between the means.
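For reference, the paired sample t statistic is the mean of the per-query differences divided by its standard error, with n - 1 degrees of freedom. The Python sketch below computes it with the standard library only; the per-query scores are hypothetical values invented for this illustration, not results from this article, and the function name `paired_t` is an assumption.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples.

    Assumes equal-length lists with at least two pairs and
    differences that are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical average precision per query, same queries in both lists.
stemmed = [0.10, 0.08, 0.12, 0.05]
nonstemmed = [0.06, 0.05, 0.07, 0.04]
t, df = paired_t(stemmed, nonstemmed)
print(f"t = {t:.3f}, df = {df}")
```

The p-value is then read from the t distribution with the returned degrees of freedom, which the sketch leaves to a statistics table or package.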

3. JAWI STEMMER

3.1. Spelling in Jawi

There are special rules that must be followed for spelling in Jawi. These rules apply to both root words and derived words. The first rule is the DERLUNG rule. This rule applies to disyllabic words that use [a + a] as the vowels in the first and second syllables. The vowel alif can be present in both syllables on one condition: the first character of the second syllable is one of { , , , , }. Otherwise, alif is present only in the first syllable [Rahman 1999].

The second rule is the KAFGA rule. This rule applies when the first character in the second syllable belongs to or . If this requirement is met, then a vowel is present in the first syllable; otherwise, a vowel is present in both syllables. The third rule is the slide HAMZA rule. This rule is implemented when the first character in the second syllable represents the vowel [i] or [u]. If this construct occurs, then the vowel is present in both syllables, and is inserted between those vowels. Otherwise, the vowel is present in both syllables.

The fourth rule is the distinctive ALIF rule. This rule is used to differentiate the pronunciation of a word. For example, the word can be read as <buaya> or <buai>. To distinguish these two words, a distinctive ALIF is used; hence, the word <buaya> is spelled and not . The last rule is the SEKEDI rule. Here, the character is used as a diacritic on the to differentiate between the words <daun> (leaf) and <diawan> (in the cloud).

3.2. Affixes in Jawi

Affixation can be described as a morphological process in which a base is expanded by one or more affixes. A base can be a free or compound root morpheme, a complex, a duplication, or a compound form. There are four forms of affixes, known as the prefix, suffix, circumfix, and infix. Prefixes are found at the beginning of a word, suffixes are found at the end of a word, infixes are inserted within a word, and circumfixes are found at both the beginning and the end of a word.

There can be up to three layers of affixation. One- and two-layer constructions are common in Malay. Examples of one-layer and two-layer constructions are <seorang> (to be alone) and <keseorangan> (loneliness). A three-layer construction is rare in Malay: for example, <berkeseorangan> (to suffer loneliness) [Hassan 1974].

A prefix can appear as {- , - , - , - , - , - , - , and - }. Some of these prefixes have variants. The most common suffixes used in Malay are { -, -, -, -, -, -, -, -, -, -, -, and - }. A circumfix is a combination of a prefix and a suffix on one base word, which together construct a derived word. Table I and Table II show examples of prefix and circumfix variants.

Basically, when an affix is added to the root word to form a derivative word, the spelling of the root word is preserved. However, under several conditions, the spelling of the root word is changed by the affix. This can be seen in the case of the prefixes - <se>, - <ke>, and - <di>. If the first character of the root word is <alif>, then the prefixes - <se>, - <ke>, and - <di> are spelled as - <se>, - <ke>, and - <di>. This arrangement is different from Rumi spelling. In Rumi, when the prefixes se-, ke-, and di- are added to a root word that starts with <a>, no extra character is added to the prefix. For example, when the prefix “di-” is added to the root word ambil, it produces the word “diambil”. However, in Jawi, this word is spelled ‘ ’ <diambil>. The character is added after the prefix as an extra character.

Table I. Example of Prefix Variants

Prefix    Variant
-         No
-         No
-         No
-         -
-         -
-         -
-         - , - , - , -
-         - , - , - , -

Table II. Example of Circumfix Variants

Circumfixes    Variant
-..-           -..-
-..-           -..-
-..-           -..-
-..-           -..- , -..- , -..- , -..-
-..-           -..- , -..- , -..- , -..-

Another difficulty related to Jawi prefixes is that the first character of the root word can be eliminated to obtain the correct derivative word. For example, when the prefix - <mem> is added to the root word <fokus>, the character is eliminated to form the derivative word <memfokus>. This procedure occurs in several prefix cases. There are many Arabic loan words in Malay. Most of the loan words start with the character <alif>. When a prefix is added to such a loan word, the character <alif> is replaced with the character <ya> or <wau>.

A suffix can be inserted based on the end sound of the root word. A root word that ends with the sound [a] is represented using the character alif and is spelled - <an>. Nevertheless, when the [a] sound is not represented by the character <alif>, this suffix must be spelled together with , such as <tajaan> = <an> + <taj>. Several derivative words are based on repeated root words. For example, <kebudak-budakan> is derived from the word <budak>.

3.3. Rumi Deaffixation Rule

We propose a Jawi stemmer for the Malay language to stem Jawi derived words into their root words. The frameworks of Ahmad [1995] and Abdullah [2006] were used as a baseline for this Jawi stemmer. A book authored by Rahman [1999] was used to understand Jawi spelling methods and how affixes are added to the root word to create a derived word.

Before the rules for the Jawi stemmer were created, we tested the rules that were used in the Rule Application Order (RAO) proposed by Ahmad and the Rule Frequency Order (RFO) proposed by Abdullah to investigate whether these rules were compatible with the Jawi script. The experiment was performed using 104 unique derivative words taken from online newspapers (Berita Harian¹ and Utusan Melayu²). The entire set of rules was transliterated directly into Jawi script using TERUJA [2010]. This data was used because no other online Jawi documents were available at that time. Table III shows the result of this experiment, and Table IV shows details of the errors.

Table III. The Result of Tested Rule

Types of Error & Accuracy    RAO      RFO
Overstemming                 2        2
Understemming                1        1
Spelling Exception           1        1
Unchanged                    30       29
Others                       -        -
Accuracy (%)                 67.30    68.27

In this experiment, most of the errors produced by the rules of Ahmad [1995] and Abdullah [2006] fell into the unchanged category. The RAO produced 34 errors compared with 33 errors for the RFO. This happened because an additional rule was developed in the RFO to overcome some of the errors in the RAO. In this case, the word ‘ ’ (result) was stemmed correctly using the RFO, which reduced the error count of the RFO. The unchanged error occurred because the rule is not appropriate for stemming a Jawi word. In Rumi, the suffix -an is spelled according to only one rule; however, spelling the same suffix in Jawi involves three different rules. This experiment shows that the rules produced for the Rumi script are not sufficient to stem the Jawi script. To obtain a correct root word, specific rules for the Jawi stemmer were developed based on the conditions given next.
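The accuracy figures in Table III follow from the error counts and the 104 test words, as the short Python check below shows. The only assumption is that accuracy is the share of the 104 words that were not counted as an error of any type; the variable and function names are chosen for this sketch.

```python
TOTAL_WORDS = 104  # unique derivative words used in the experiment

# Error counts as listed in Table III.
ERRORS = {
    "RAO": {"Overstemming": 2, "Understemming": 1, "Spelling Exception": 1, "Unchanged": 30},
    "RFO": {"Overstemming": 2, "Understemming": 1, "Spelling Exception": 1, "Unchanged": 29},
}


def accuracy(error_counts, total=TOTAL_WORDS):
    """Percentage of words that were stemmed without any listed error."""
    return 100.0 * (total - sum(error_counts.values())) / total


for name, counts in ERRORS.items():
    print(name, sum(counts.values()), "errors,", f"{accuracy(counts):.2f}%")
```

With 34 errors the RAO rules give 70/104, about 67.31% (listed as 67.30 in Table III, which appears to be truncated rather than rounded), and with 33 errors the RFO rules give 71/104, about 68.27%.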

3.4. Jawi Deaffixation Rule

There are two main components in the Jawi stemmer. The first component is the deaffixation rule, and the second component is the Spelling Error Detector Rule (SEDR). The deaffixation rule includes affixation and spelling variation rules, while the SEDR is used to check the spelling of the stemmed word. A stemmed word with the correct spelling is output as a root word.

Malay contains four different types of affixation. For the prefix rule, several rules must be emphasized to avoid errors such as understemming, overstemming, and spelling exceptions. These rules include the prefixes meN- and peN-. These two prefixes have the variants {- , - , - , - , - , - , - , - , - , and - }. If the first two characters of the word match one of the prefix variants, then the prefix must be removed, subject to the conditions of the prefix rule (meN- and peN-) shown in Table V.

Another important rule for the prefix is {- , - and - }. These prefixes have one variant, which is {- , - and - }. The pattern of the word was checked to avoid overstemming. Consonant characters are replaced with a “C”, while vowels are replaced with a “V”. Because many pure Malay words are disyllabic, we implemented the pattern of disyllables into this prefix rule. Table VI shows the conditions of prefixes for {- , - and - }.

Suffixes were eliminated cautiously because the use of suffixes in Jawi is more complicated than in Rumi. The exclusion of the suffix - must depend on the original root word’s spelling. The characters before the - were examined to prevent understemming errors. Table VII shows the conditions for the suffix -.

1 http://www.bharian.com.my/
2 http://www.utusan.com.my/

ACM Transactions on Asian Language Information Processing, Vol. 13, No. 2, Article 6, Publication date: June 2014.

Table IV. Details of the Error

Derived word (gloss)    RAO: type of error     RFO: type of error
(market)                Unchanged              Unchanged
(downfall)              Unchanged              Unchanged
(turbulent)             Overstemming           Overstemming
(analysts)              Understemming          Understemming
(close)                 Unchanged              Unchanged
(increase)              Unchanged              Unchanged
(following)             Unchanged              Unchanged
(enhancement)           Unchanged              Unchanged
(food)                  Unchanged              Unchanged
(tension)               Unchanged              Unchanged
(continued)             Unchanged              Unchanged
(forecast)              Unchanged              Unchanged
(prolonged)             Unchanged              Unchanged
(supply)                Unchanged              Unchanged
(incident)              Unchanged              Unchanged
(expectation)           Unchanged              Unchanged
(about)                 Unchanged              Unchanged
(benefit)               Overstemming           Overstemming
(picture)               Unchanged              Unchanged
(result)                Spelling Exception     Spelling Exception
(possibility)           Unchanged              Unchanged
(planning)              Unchanged              Unchanged
(address)               Unchanged              Unchanged
(pressure)              Unchanged              Unchanged
(ingredient)            Unchanged              Unchanged
(business)              Unchanged              Unchanged
(result)                Unchanged              -
(financial)             Unchanged              Unchanged
(funded)                Unchanged              Unchanged
(through)               Unchanged              Unchanged
(expenses)              Unchanged              Unchanged
(income)                Unchanged              Unchanged
(old)                   Unchanged              Unchanged
(recommendation)        Unchanged              Unchanged

The deaffixation rule for the circumfixes is a combination of the prefix and suffix rules. Affixes are stemmed from the beginning and the end of the words. Therefore, the rule for prefix recoding and suffixes must be applied to the circumfix rule. Another condition is when there is more than one affix present at the same time. For example, the word <mempelbagaikan> contains a combination of the prefixes - , - and a suffix -. To stem this type of word, we must stem the first prefix and examine the first character after the second prefix (as shown in Table II).

Table V. Conditions of the Prefix Rule (meN- and peN-)

Prefix: - / -

Rule:
If Length > 3 & 2nd char = & 3rd char = || ||
  Then remove the 1st & 2nd char
Else
  If 3rd char != || || ,
    Then remove the 1st & 2nd char & add at beginning of the word.
  End if
End if
Example: (chooser) - (choose); (killer) - (kill)

Rule:
If Length > 3 & 2nd char = & 3rd char = || || || || || ||
  Then remove the 1st & 2nd char.
Else
  If the 3rd char != || || || || || || ,
    Then remove the 1st & 2nd char & add at beginning of the word.
  End if
End if
Example: (writer) - (write); (tailor) - (sew)

Rule:
If Length > 3 & 2nd char = & 3rd char = & 4th char =
  Then remove the 1st & 2nd char & add at the beginning of the word.
Else
  Remove the 1st & 2nd char & add at beginning of the word.
Example: (production) - (out); (remembering) - (remember)

Rule:
If Length > 3 & 2nd char = 3rd char = || || || ||g|| ||
  Then remove the 1st & 2nd char.
Else
  Remove the 1st & 2nd char & add at beginning of the word.
Example: (connecter) - (connect)

Rule:
If Length > 3 & 2nd char =
  Then remove the 1st & 2nd char & add at beginning of the word.
Else
  Remove the 1st char
Example: (copying) - (copy); (singing) - (sing)

Table VI. The Conditions of Prefixes for {- ,- ,- }

Prefix: - ,- ,-

Rule:
If Length > 3 & 2nd char =
  If word pattern = CCCVCV
    Then remove the 1st & 2nd char.
  End if
End if
Example: (has leg) - (leg)

Rule:
If Length > 3 & 2nd char = & 3rd char = ||
  Then remove the 1st char
Else
  Remove the 1st & 2nd char
End if
Example: (to feel) - (feel); (together) - (same)

The second process involved in the Jawi stemmer is the Spelling Error Detector Rule (SEDR). The SEDR was developed as a substitute for the root word dictionary. The root word dictionary was used by Ahmad [1995] and Abdullah et al. [2009] in their stemmers. After affixations are eliminated from the derived words, the stemmed words must be checked with the SEDR to be certain that the stemmed word was spelled correctly. The SEDR rule was constructed based on the word patterns and spelling methods for Jawi.

Table VII. The Conditions of the Suffix -.

Suffix: -

Rule:
If Length > 3 & 2nd last char = & 2nd char = | | & 4th last char = ||
  Then remove the last three characters
Else
  Remove the last two characters
End if
Example: (opening) - (open); (renting) - (rental)

Rule:
If Length > 3 & 2nd last char = & 2nd char = | | & 4th last char = || || || ||
  Then remove the last two characters
Else
  Remove the last three characters
End if
Example: (subsistence) - (subsistence)

Rule:
If Length > 3 & 2nd last char = & 3rd last char =
  Then remove the last two characters
Else
  Remove the last character.
End if
Example: (helping) - (help)

Table VIII. Details of Malay Disyllable Combinations

Syllable combination                  Pattern      Example in Rumi
Open syllable + open syllable         v + v        i (v) + a (v) = ia (it)
                                      v + cv       i (v) + tu (cv) = itu (that)
                                      cv + v       du (cv) + a (v) = dua (two)
                                      cv + cv      bo (cv) + la (cv) = bola (ball)
                                      v + cd       a (v) + bai (cd) = abai (ignore)
                                      cv + cd      ba (cv) + loi (cd) = baloi
Closed syllable + open syllable       vc + cv      an (vc) + da (cv) = anda (you)
                                      cvc + cv     ban (cvc) + tu (cv) = bantu (help)
                                      vc + cd      an (vc) + dai (cd) = andai (if)
                                      cvc + cd     ran (cvc) + tau (cd) = rantau (corner)
Open syllable + closed syllable       v + vc       a (v) + ur (vc) = aur (bamboo)
                                      v + cvc      i (v) + kan (cvc) = ikan (fish)
                                      cv + vc      ma (cv) + in (vc) = main (play)
                                      cv + cvc     se (cv) + pit (cvc) = sepit (clip)
Closed syllable + closed syllable     vc + cvc     in (vc) + tan (cvc) = intan (diamond)
                                      cvc + cvc    sun (cvc) + tik (cvc) = suntik (inject)
                                      cd + cvc     tau (cd) + lan (cvc) = taulan (friend)

c = consonant; v = vowel; d = diphthong

Words in Jawi are created from a syllable or a combination of syllables. The difference between Jawi and Rumi can be seen in their use of vowels in each syllable. Each syllable in a Rumi word has a vowel, which is not the case in Jawi: there are cases in which a syllable is spelled using consonants only, without any vowel character. There are four types of disyllable combinations in Malay local words. Table VIII shows the details of Malay disyllable combinations.
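The syllable combinations of Table VIII can be checked mechanically. The following Python sketch is illustrative only and is not the authors' implementation: it works on Rumi syllables that have already been separated (as in the table's examples), assumes that ai, au, and oi at the end of a syllable form a diphthong, and uses hypothetical names (`syllable_pattern`, `classify_disyllable`, `TABLE_VIII`).

```python
VOWELS = set("aeiou")
DIPHTHONGS = ("ai", "au", "oi")  # assumption: the diphthongs seen in the Table VIII examples

# (first pattern, second pattern) -> combination type, mirroring Table VIII.
# "intan" is read here as in (vc) + tan (cvc).
TABLE_VIII = {
    ("v", "v"): "open + open", ("v", "cv"): "open + open",
    ("cv", "v"): "open + open", ("cv", "cv"): "open + open",
    ("v", "cd"): "open + open", ("cv", "cd"): "open + open",
    ("vc", "cv"): "closed + open", ("cvc", "cv"): "closed + open",
    ("vc", "cd"): "closed + open", ("cvc", "cd"): "closed + open",
    ("v", "vc"): "open + closed", ("v", "cvc"): "open + closed",
    ("cv", "vc"): "open + closed", ("cv", "cvc"): "open + closed",
    ("vc", "cvc"): "closed + closed", ("cvc", "cvc"): "closed + closed",
    ("cd", "cvc"): "closed + closed",
}


def syllable_pattern(syllable):
    """Map one Rumi syllable to a c/v/d pattern, e.g. 'ban' -> 'cvc', 'tau' -> 'cd'."""
    s = syllable.lower()
    tail = ""
    if s.endswith(DIPHTHONGS):
        s, tail = s[:-2], "d"  # a syllable-final diphthong counts as one unit
    return "".join("v" if ch in VOWELS else "c" for ch in s) + tail


def classify_disyllable(first, second):
    """Return (pattern, combination type) or None when the pair is not in Table VIII."""
    key = (syllable_pattern(first), syllable_pattern(second))
    if key not in TABLE_VIII:
        return None
    return f"{key[0]} + {key[1]}", TABLE_VIII[key]


print(classify_disyllable("ban", "tu"))   # ('cvc + cv', 'closed + open')
print(classify_disyllable("ran", "tau"))  # ('cvc + cd', 'closed + open')
```

A lookup table is used instead of deriving open or closed status from the pattern, so that the sketch reproduces the table's own grouping rather than an independent linguistic analysis.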


Table IX. Summary of the Spelling Methods for the Vowel [a] at the End of a Word

Type of syllable for 2nd syllable: open syllable

Pattern: v
Condition: None
Symbolise:
Example: Dua / (two)

Pattern: cv
Condition: 1st syllable is open syllable and vowel at the 1st syllable is [e]
Symbolise:
Example: Kera / (monkey)

Pattern: cv
Condition: 1st syllable is open syllable and vowel at the 1st syllable is [a]. Plus the 1st character in the 2nd syllable is / / / /
Symbolise:
Example: Bawa / (take)

Pattern: cv
Condition: 1st syllable is open syllable and vowel at the 1st syllable is [a]. Plus the 1st character in the 2nd syllable not equal to / / / /
Symbolise: None
Example: Raja / (king)

Pattern: cv
Condition: 1st syllable is open syllable and vowel at the 1st syllable not equal to [a] or [e] and last character at the 2nd syllable is /
Symbolise: None
Example: Muka / (face)

Pattern: cv
Condition: 1st syllable is open syllable and vowel at the 1st syllable not equal to [a] or [e] and last character at the 2nd syllable not equal to /
Symbolise:
Example: Kerja / (work)

c = consonant; v = vowel

Vowels in Jawi are based on six different sounds, which are represented using three different characters. The six different sounds are [e], [a], [i], [e], [u], and [o]. These vowels can be symbolized as , and . Vowels and represent [e]. The SEDR rule was developed using this rule. The SEDR rule must follow the three conditions for the [e] pattern. Words that contain [e] at the beginning must be represented with and the [e] vowel must be used at an open syllable with the word pattern (v). Examples are emak / (mother) and enam / (six). The other condition is that the [e] vowel must be used at the beginning of a word for a closed syllable with the pattern (vc); examples are entah / (know), erti / (mean), and engkau / (you). There is no vowel used for [e] in the middle of a word for an open syllable. The pattern for this condition is (cv) and (cvc), which is used for a closed syllable. Examples are kena / (have) and sumber / (source). If [e] is present at the end of a word, then it is represented with the character (alif maqsura). This rule holds for the vowel [e] at the end of an open syllable, with the pattern (cv). For example, the word egoisme (egoism) is spelled .

Next, if the vowel [a] is used at the beginning of the word, for an open syllable with pattern (v) and a closed syllable with pattern (vc), the vowel [a] must be symbolized with the character , such as anak / (child). The vowel [a], for open syllables with pattern (cv) and closed syllables with pattern (cvc), is represented as . Examples are kami / (we) and sumpah / (curse). The use of the vowel [a] at the end of a word is more complicated than in Rumi. Table IX shows a summary of the spelling methods for the vowel [a] at the end of a word.

The vowels [i] and [e] are represented for open syllables (pattern: v) and closed syllables (pattern: vc), such as ikut / (follow). When these vowels are present in the middle of a word with an open syllable (pattern: cv/v) or a closed syllable (pattern: cvc), they will be represented as . An example is ribu / (thousand). However, when used with a closed syllable (pattern: vc) and when the first syllable is an open syllable, then the vowel should be represented as the characters , such as in buih / (bubble). If this vowel is present at the end of a word and the second syllable is an open syllable with the pattern (v) or (cv), then the vowel should be represented as . Examples are roti / (bread) and kari / (curry).

Fig. 2. The processes that are involved in the SEDR.

The vowels [u] or [o] are represented as for open syllables (pattern: v) and closed syllables (pattern: vc). If they appear in the middle of a word, then for open syllables (cv) and closed syllables (cvc), the vowels will be represented as a . If they are present at the end of a word, for open syllables, the vowel should be . However, if they are present for closed syllables (pattern: cv), the character will be .

The SEDR rule was developed using these conditions. After circumfixes were eliminated (the deaffixation rule), the stemmed word was checked using the SEDR rule to ensure that the spelling was correct; otherwise, the stemmer would proceed to the next step in the deaffixation process. The stemmer must then perform the prefix rule and check the stemmed word against the SEDR. This process continues until all of the prefix precedences have been checked. If no rule is detected, then the stemmer outputs the word as a root word. Figure 2 shows the processes that are involved in the SEDR.
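The stem-then-validate loop described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the rule functions and the spelling check are hypothetical Rumi stand-ins (`strip_prefix_ber`, `strip_suffix_an`, `toy_spelling_check`), since the real rules operate on Jawi characters and the real check is the SEDR.

```python
def stem_with_spelling_check(word, rule_groups, is_valid_spelling):
    """Try rule groups in precedence order; accept the first candidate that passes the check."""
    for group in rule_groups:          # e.g. circumfix rules first, then prefix rules
        for rule in group:
            candidate = rule(word)     # a rule returns None when it does not apply
            if candidate is not None and is_valid_spelling(candidate):
                return candidate
    return word                        # no rule produced a valid stem: output as root word


# Toy Rumi stand-ins (hypothetical, for illustration only).
def strip_prefix_ber(word):
    return word[3:] if word.startswith("ber") else None


def strip_suffix_an(word):
    return word[:-2] if word.endswith("an") else None


def toy_spelling_check(candidate):
    # Placeholder for the SEDR: accept only candidates of at least four letters.
    return len(candidate) >= 4


groups = [[strip_prefix_ber], [strip_suffix_an]]
print(stem_with_spelling_check("berkaki", groups, toy_spelling_check))  # kaki
print(stem_with_spelling_check("makanan", groups, toy_spelling_check))  # makan
print(stem_with_spelling_check("ikan", groups, toy_spelling_check))     # ikan
```

In the last call the suffix rule would produce "ik", which the placeholder check rejects, so the word is returned unchanged. This mirrors the role the text gives to the SEDR as a replacement for a root word dictionary.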

The consonant vowel (cv) pattern was used to transform the Jawi word into a consonant and vowel pattern. Using the identifying syllable pattern process, the potential pattern was examined to generate a possible syllable. Then, the syllable rule was implemented during the syllable rule process and, finally, the result was output. For example, to check the word (open), first we must make it an input for this process. Next, the stemmed word is transformed into a consonant vowel pattern. Here, characters other than , , and are transformed into c (as a consonant) and , , as v (as a vowel). For , this character is slightly special and is represented as two other characters, namely (c) and (v). Next, the syllable pattern for the stemmed word is identified. For example, the word (open) will generate a cvc pattern. This pattern is only generated for two conditions. The first condition occurs when the vowel [a] is at the end of the word, and the second condition occurs when the spelled word has only one syllable (a closed syllable). The identifying process is narrowed down by the syllable rule, and the result is output as suggested by the rule. In this example, both of the conditions fulfil the rule, but the first condition is output as the result because most pure Malay words are disyllabic. Therefore, the rule for one syllable is avoided.
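The transformation into a consonant vowel pattern can be sketched in a few lines of Python. This is an assumption-laden illustration: the three vowel characters are taken to be alif (U+0627), wau (U+0648), and ya (U+064A), the example word for "open" is assumed to be spelled ba, wau, kaf, and the special character that expands into a consonant plus a vowel is not modelled. The name `cv_pattern` is hypothetical.

```python
# Assumed Jawi vowel letters: alif, wau, ya.
JAWI_VOWEL_LETTERS = {"\u0627", "\u0648", "\u064A"}


def cv_pattern(word):
    """Map every assumed vowel letter to 'v' and every other character to 'c'."""
    return "".join("v" if ch in JAWI_VOWEL_LETTERS else "c" for ch in word)


# Assumed spelling of the word glossed as (open): ba (U+0628), wau (U+0648), kaf (U+06A9).
buka = "\u0628\u0648\u06A9"
print(cv_pattern(buka))  # cvc
```

The resulting cvc pattern agrees with the pattern the text reports for this word; deciding between the disyllabic reading with an unwritten final [a] and the one-syllable reading is left to the syllable rule described above.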

Affix precedence is important when developing a Malay stemmer. Othman [1993] suggested that the best way to stem an affix in Rumi is to begin with the circumfix followed by the prefix, the suffix, and the infix. An experiment by Ahmad [1995] showed that, to reduce RAO stemmer errors, the first affix that must be eliminated is the prefix, followed by the suffix, circumfix, and infix. An experiment was performed to identify the best affix precedence for Jawi. The test dataset comprised 1200 unique derived words in Jawi that were taken from online newspapers. Testing for affix precedence included six tests. Table X shows the six tests that were covered in this experiment.

Table X. List of the Six Tests Covered in this Experiment

Label   Affix precedence
T1      Prefix, circumfix, suffix, and infix
T2      Prefix, suffix, circumfix, and infix
T3      Circumfix, prefix, suffix, and infix
T4      Circumfix, suffix, prefix, and infix
T5      Suffix, prefix, circumfix, and infix
T6      Suffix, circumfix, prefix, and infix

Fig. 3. The results of the affix precedence tests for Malay.

The T1 sequence was suggested by Ahmad [1995], and the T3 sequence was suggested by Othman [1993]. The stemmer was run for all six tests. For the first test (T1), if the rule matched the derived word, then the result appeared as a root word. If no rule was found, then the circumfix rules were loaded and the stemmer attempted to find a rule based on the input. If the rule requirement was met, then the affix was stemmed and the stemmed word was output as a root word; otherwise, the suffix rule was loaded. If the rule requirement was again met, then the affix was stemmed and output as a root word. If no rule was found, then the infix rule was loaded and the word was stemmed based on that rule. The result was then output as a root word. This process was repeated using the prefix sequence in T2. This process was continued using the sequences of T3, T4, T5, and T6. The results for the first experiment are shown in Figure 3.
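The six sequences in Table X are exactly the orderings of the prefix, circumfix, and suffix groups with the infix always last, so a test harness can generate them rather than hard-code them. The Python sketch below is illustrative only; `precedence_tests` and `accuracy` are hypothetical names, and the stemmer passed to `accuracy` is a toy stand-in rather than the Jawi stemmer.

```python
from itertools import permutations

AFFIX_GROUPS = ["prefix", "circumfix", "suffix"]  # infix is always applied last


def precedence_tests():
    """Return {'T1': [...], ..., 'T6': [...]} in the same order as Table X."""
    return {
        f"T{i}": list(order) + ["infix"]
        for i, order in enumerate(permutations(AFFIX_GROUPS), start=1)
    }


def accuracy(stem, gold_pairs):
    """Percentage of (derived word, root word) pairs that the stemmer gets right."""
    correct = sum(1 for derived, root in gold_pairs if stem(derived) == root)
    return 100.0 * correct / len(gold_pairs)


tests = precedence_tests()
print(tests["T1"])  # ['prefix', 'circumfix', 'suffix', 'infix']
print(tests["T3"])  # ['circumfix', 'prefix', 'suffix', 'infix']
```

Each ordering would then be handed to the stemmer and scored with `accuracy` on the same list of derived words, which is how the per-test results in Figure 3 can be compared on equal terms.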

The results in Figure 3 show that the highest accuracy for affix precedence was in T3, which refers to the circumfix, prefix, suffix, and infix [Sulaiman et al. 2011]. However, this result differs from the results reported in Ahmad [1995].

4. TEST COLLECTIONS

This experiment was divided into two parts: the first part tested the accuracy of the Jawi stemmer and the second part tested whether the Jawi stemmer is useful for retrieving relevant Jawi documents. For the first experiment, 1200 unique words were derived from online newspapers and used for the experiment. Using the Transliteration Engine for Rumi to Jawi [TERUJA 2010], each word was transliterated into the Jawi script. Some minor errors made by TERUJA were corrected manually. The stopwords were eliminated using Abdullah’s [2006] stop-word list.

Fig. 4. Accuracy of the stemmers.

The second experiment was a test on the document retrieval. The Quran collection, as used by Ahmad [1995] and Abdullah [2006], was used as the corpus. Again, the TERUJA [2010] transliteration engine, with the help of a Jawi expert, was used to transliterate the Rumi corpus, and queries were made into the Jawi script. Verses of each chapter were divided into separate text files, formatted as .txt. For example, Chapter 1 had seven verses. Documents were created from each verse, which made seven documents for the first chapter. Unique names were assigned to each file. For example, document 13 in Chapter 1 was named S1A13, meaning S1 = chapter 1 and A13 = verse 13 (S1A13 = Chapter 1, verse 13). Table I shows the Quran’s chapters and the number of verses that are involved in the test collection.
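The one-document-per-verse layout and its naming scheme can be sketched as follows. This is a hedged Python illustration, not the authors' tooling: `split_chapter` is a hypothetical helper, the verse texts are placeholders, and the documents are kept in a dictionary instead of being written to .txt files.

```python
def split_chapter(chapter_no, verses):
    """Return {document name: verse text}, with names such as 'S1A3' (chapter 1, verse 3)."""
    return {
        f"S{chapter_no}A{verse_no}": text
        for verse_no, text in enumerate(verses, start=1)
    }


# Placeholder texts standing in for the seven verses of Chapter 1.
chapter_1 = [f"verse text {n}" for n in range(1, 8)]
documents = split_chapter(1, chapter_1)
print(len(documents))     # 7
print(sorted(documents))  # ['S1A1', 'S1A2', 'S1A3', 'S1A4', 'S1A5', 'S1A6', 'S1A7']
```

Encoding the chapter and verse in the file name means the relevance judgments can refer to documents by name alone, without a separate lookup table.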

The corpus included 6236 documents. These documents covered the 114 chapters of the Quran. The queries and relevant set used in this experiment were also based on Ahmad [1995] and Abdullah’s [2006] work. In this experiment, 36 queries were transliterated using TERUJA [2010].

5. EXPERIMENTS AND RESULTS

The first experiment was conducted to investigate the accuracy of the Jawi stemmer in producing the correct root word. A total of 1200 words were used, and the experiment was performed on three different stemmers. These three stemmers used the same deaffixation rule and SEDR rule. Stemmer A is a stemmer algorithm based on Ahmad [1995], stemmer B is a stemmer algorithm based on Abdullah [2006], and stemmer C is the Jawi stemmer algorithm. Figure 4 shows the results of this experiment and Table XI shows the errors of each stemmer.

From the previous result, the highest accuracy is produced by stemmer C, the Jawi stemmer. Each stemmer has a different precedence for the deaffixation rule. The first group is based on the Ahmad [1995] precedence rule and the second group follows the RFO (Rule Frequency Order), proposed by Abdullah [2006]. The Jawi stemmer produced more errors on overstemming because the stemmer tends to stem as many characters as it can; for example, between - and - , the - rule was instantiated first followed by - . This scenario is reversed for the first and second group. These two groups produced more errors on understemming because the SEDR rule cannot distinguish between the uses of [e] and [e] [Sulaiman et al. 2011]. Using the Jawi deaffixation rule and the SEDR rule, Stemmer A and Stemmer B achieved 79% and 81% accuracy, respectively. This experiment shows that the Jawi deaffixation rule is more suitable for stemming Jawi script than the deaffixation rules [Ahmad 1995; Abdullah 2006] that were developed for Rumi script.

Table XI. Error of Each Stemmer

Types of Error        Stemmer A    Stemmer B     Stemmer C
                      (Ahmad)      (Abdullah)    (Jawi Stemmer)
Understemming         147          138           34
Overstemming          30           29            61
Spelling Exception    21           31            40
Others                58           26            44
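The accuracies quoted for Stemmer A and Stemmer B can be cross-checked against the error counts in Table XI. The Python sketch below assumes that every erroneous word is counted once, in exactly one error category, out of the 1200 test words; that assumption and the names used are not from the original text.

```python
TOTAL_WORDS = 1200

# Error counts per stemmer, in the order: understemming, overstemming,
# spelling exception, others (taken from Table XI).
ERRORS = {
    "A (Ahmad)": [147, 30, 21, 58],
    "B (Abdullah)": [138, 29, 31, 26],
    "C (Jawi stemmer)": [34, 61, 40, 44],
}


def accuracy(error_counts, total=TOTAL_WORDS):
    """Accuracy in percent, assuming each error is one distinct misstemmed word."""
    return 100.0 * (total - sum(error_counts)) / total


for name, counts in ERRORS.items():
    print(f"Stemmer {name}: {sum(counts)} errors, {accuracy(counts):.1f}% accuracy")
```

Under this assumption the totals are 256, 224, and 179 errors, which reproduces the rounded 79% and 81% figures for Stemmers A and B and puts Stemmer C at roughly 85%; the values plotted in Figure 4 remain the authoritative ones.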

The stemmer was also tested for document retrieval purposes in terms of precision and recall. The same queries and datasets from Ahmad [1995] and Abdullah [2006] were transliterated to Jawi using TERUJA [2010]. The datasets contained 36 queries and 6236 documents from the Quran. The experiment was divided into two sets. The first set was “Stemmed Jawi”, and the second set was “Nonstemmed Jawi”. The relevance set from Ahmad was used. The relevance of each document to the 36 queries was determined manually by Ahmad. This was done by searching through the subject index, concordance, and glossary of the Quran as well as several Islamic books [Ahmad 1995]. According to the author, there are 3440 relevant documents for the 36 queries [Ahmad 1995]. The purpose of this experiment was to test whether the search engine could retrieve more relevant documents using a stemmed query and a stemmed Jawi document. The experiment was performed and the results are shown in Figure 5.

The interpolated average recall-precision graph was formed using average precision values at 11 recall cutoffs; the remaining values are interpolated. The interpolation was performed based on the following rule [Manning et al. 2008].

$$P(r) = \max_{r' \geq r} P(r')$$

From this rule, the interpolated precision at recall level r is the maximum known precision at any recall level r′ ≥ r. Based on this rule, the precision at recall 0 can be computed. In Figure 5, the precision curve declined as recall increased from 0.0 to 1, which implies that the search engine could find all of the relevant documents as it reached a recall of 1. The retrieved set also contained a substantial number of nonrelevant documents. The highest precision value was 50% for stemmed Jawi and was 28% for nonstemmed Jawi. The precision of the stemmed Jawi remained constant from a recall of 0.8 onwards, whereas the precision of the nonstemmed Jawi leveled off at a recall of 0.4 because, without the use of the stemmer, the search engine retrieved documents that contained the exact words of the query, thus cutting short the unrelated documents for the query. The graph shows that the stemmed Jawi performed much better than the nonstemmed Jawi at all the recall levels.
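The interpolation rule can be illustrated with a short Python sketch. It is a generic implementation of the max-over-higher-recall rule at the 11 standard recall levels, using made-up (recall, precision) points rather than data from this experiment; `interpolated_precision` is a hypothetical name.

```python
def interpolated_precision(points, levels=None):
    """points: list of (recall, precision) pairs observed for one ranked list.

    Returns {recall level: highest precision observed at any recall >= that level}.
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    result = {}
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-12]
        result[level] = max(candidates) if candidates else 0.0
    return result


# Made-up observations, for illustration only.
observed = [(0.25, 0.5), (0.5, 0.4), (0.75, 0.6), (1.0, 0.3)]
curve = interpolated_precision(observed)
print(curve[0.0])  # 0.6
print(curve[0.8])  # 0.3
```

Because each level takes the maximum over all higher recall values, the interpolated curve never increases as recall grows, and a value at recall 0 is defined even though no document is retrieved at that point.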

Next, the Mean Average Precision (MAP) value was calculated for each test. No interpolation process was involved in this calculation. The results show that the MAP values for the stemmed Jawi and nonstemmed Jawi were 8.432 and 5.137, respectively. There was a difference between these values because the search engine tended to retrieve more relevant documents for stemmed Jawi than for nonstemmed Jawi. Tables XII and XIII show the results of this statistical testing.

Fig. 5. Interpolated average recall-precision graph.

Table XII. Paired Sample Statistics

Pair 1              Mean     N     Std. Deviation    Std. Error Mean
Stemmed Jawi        8.432    36    10.150            1.690
Non-Stemmed Jawi    5.137    36    7.450             1.240

Table XIII. Paired Sample Test

Pair 1: Stemmed Jawi - Non-Stemmed Jawi

Paired differences
  Mean                                         3.295
  Std. Deviation                               7.171
  Std. Error Mean                              1.195
  95% Confidence Interval of the Difference    Lower .869, Upper 5.721
t                                              2.757
Df                                             35
Sig. (2-tailed)                                .009
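For reference, non-interpolated MAP is usually computed as the mean, over queries, of the average of the precision values at the ranks where relevant documents appear. The Python sketch below shows this standard definition with made-up rankings; the function names are hypothetical and the exact evaluation tool used in the experiment is not stated in the text.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank over the ranks where a relevant document is retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)  # relevant documents never retrieved contribute zero


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up rankings, for illustration only.
query_1 = (["S1A1", "S2A5", "S1A3"], {"S1A1", "S1A3"})
query_2 = (["S3A2", "S3A9"], {"S3A9"})
print(round(average_precision(*query_1), 4))              # 0.8333
print(round(mean_average_precision([query_1, query_2]), 4))  # 0.6667
```

Multiplying the result by 100 gives a percentage comparable in form to the MAP values reported above.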

A paired sample t-test was conducted to compare MAP values for stemmed Jawi and nonstemmed Jawi. There was a significant difference in the scores for the stemmed Jawi (M = 8.432, SD = 10.15) and nonstemmed Jawi (M = 5.137, SD = 7.45) conditions; t(35) = 2.757, p = 0.009. These results suggest that the use of stemmed Jawi documents increased the precision in the document retrieval. Another experiment was also performed using the Indri [Strohman et al. 2005] search engine to examine the precision of the relevant documents that were retrieved at various cutoff points for all 36 queries. The cutoffs were defined at the positions 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, and 3000. There were no changes after the 3000 cutoff point. The list produced by the search engine was ranked based on probability, using what is commonly called a likelihood model. Table XIV shows the paired sample t-test results for each cutoff point.

Table XIV. Paired Sample T-Test

Rank cut-off     Stemmed Jawi              Non-Stemmed Jawi          Sig.
point            Mean     Std. Deviation   Mean     Std. Deviation   (2-tailed)
Cut-off 10       3.716    7.636            2.014    5.074            .134
Cut-off 20       4.684    8.732            2.455    5.588            .071
Cut-off 30       5.098    8.717            2.689    5.824            .055
Cut-off 40       5.585    8.780            2.949    5.801            .029
Cut-off 50       5.990    8.790            3.220    5.827            .022
Cut-off 60       6.336    8.941            3.462    5.942            .017
Cut-off 70       6.635    9.158            3.651    6.044            .014
Cut-off 80       6.930    9.290            3.777    6.134            .010
Cut-off 90       7.269    9.759            3.881    6.244            .007
Cut-off 100      7.475    9.976            3.968    6.326            .006
Cut-off 200      7.815    9.868            4.432    6.532            .007
Cut-off 300      8.002    9.930            4.629    6.623            .008
Cut-off 400      8.173    10.044           4.780    6.823            .007
Cut-off 500      8.146    10.143           4.826    6.872            .009
Cut-off 600      8.285    10.071           4.863    6.883            .007
Cut-off 700      8.322    10.071           4.881    6.890            .007
Cut-off 800      8.317    10.077           4.884    6.889            .007
Cut-off 900      8.320    10.077           4.892    6.893            .007
Cut-off 1000     8.334    10.087           4.897    6.897            .007
Cut-off 2000     8.412    10.150           4.926    6.945            .006
Cut-off 3000     8.432    10.150           4.926    6.945            .006

Fig. 6. Performance comparison for the interpolated average recall-precision graph between the Jawi stemmer and the Malay stemmer (RAO and RFO).
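The paired-sample statistic reported in Table XIII can be re-derived from its own summary values, since for a paired t-test the statistic is the mean difference divided by its standard error. The Python sketch below uses only the mean difference (3.295), the standard deviation of the differences (7.171), and the number of queries (36); `paired_t_from_summary` is a hypothetical helper, and the p-value is not recomputed here.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n):
    """Return (standard error, t statistic, degrees of freedom) for a paired t-test."""
    standard_error = sd_diff / math.sqrt(n)
    t_statistic = mean_diff / standard_error
    return standard_error, t_statistic, n - 1


se, t, df = paired_t_from_summary(mean_diff=3.295, sd_diff=7.171, n=36)
print(f"SE = {se:.3f}, t({df}) = {t:.3f}")  # SE = 1.195, t(35) = 2.757
```

The recomputed standard error and t value agree with the published table to three decimal places, which confirms that the test was run on the 36 per-query differences rather than on two independent samples.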

The hypotheses of this test are described as follows.

H0 = There was no significant difference in the means of the stemmed Jawi and nonstemmed Jawi accuracies.

H1 = There was a significant difference in the means of the stemmed Jawi and nonstemmed Jawi accuracies.


Table XIV shows that there was a difference between the means for the stemmed Jawi and nonstemmed Jawi documents. From a cutoff of 40 and above, the null hypothesis was rejected and the alternative hypothesis was accepted. At a cutoff of 40, this result indicates that there was a significant difference in the means of the stemmed Jawi (M = 5.585, SD = 8.780) and the nonstemmed Jawi (M = 2.949, SD = 5.801) accuracies; t(35) = 2.280, p = 0.029.

Sembok [2005] and Abdullah [2006] have conducted experiments to compare the retrieval effectiveness of using the conflation method. RAO is a Rumi stemmer proposed by Ahmad et al. [1996], and the RFO stemmer was proposed by Abdullah [2006]. The RAO and RFO stemmers represent the Rumi stemmer. Figure 6 shows the performance comparison for the average recall-precision graph between the Jawi stemmer and the Malay stemmer (RAO and RFO).

Figure 6 shows the average recall-precision values for all of the 36 queries for the Jawi stemmer and the Rumi stemmer. From that figure, it can be observed that the Jawi stemmer performs better than the others.

6. CONCLUSIONS

In this article, we proposed a Jawi stemmer to stem derived Jawi words into their root words. The deaffixation rule used by Ahmad [1995] and Abdullah [2006] was not efficient enough to stem Jawi-derived words. This result arises because spelling in Jawi involves many rules. The accuracy of the Ahmad stemmer [1995] and the Abdullah stemmer [2006], when tested on the Jawi script, is 67.3% and 68.3%, respectively. To create a good stemmer, we produced a new Jawi deaffixation rule to stem Jawi-derived words. Here, the deaffixation rule used in Ahmad [1995] and Abdullah [2006] was replaced by the Jawi deaffixation rule. The accuracy of these two stemmers increased to 79% for that used in Ahmad [1995] and 81% for that used in Abdullah [2006]. The Jawi stemmer shows a high accuracy compared to the others. This stemmer was also evaluated using document retrieval. This evaluation method was chosen because we wanted to investigate the effect of the Jawi stemmer on increasing the precision and recall. Most of the stemmers have been proven to help search engines retrieve relevant documents, but some have no effect on document retrieval. Even though Jawi represents the Malay language in the same way as Rumi, their spelling methods are totally different. There is a significant difference in retrieval effectiveness (measured in MAP) between the stemmed Jawi documents and the nonstemmed Jawi documents.

Detailed experiments were conducted: nonstemmed Jawi stabilized at a recall of 0.4 because, without using the stemmer, the search engine retrieved documents that contained the exact words of the query, thus cutting short the unrelated documents for the query. This trend made the nonstemmed Jawi documents level off at earlier cutoff points compared to the stemmed Jawi documents. However, the stemmed Jawi retrieved all of the documents that contained the stemmed word in the query. As a result, it retrieved more nonrelevant documents than the nonstemmed Jawi, but was still able to attain a higher precision. The Jawi stemmer can be used to increase document retrieval with a MAP value of 8.432%, and the paired sample t-test showed that there was a significant difference between the stemmed Jawi documents and the nonstemmed Jawi documents. For the precision at each document ranking, the stemmed Jawi showed a significant difference at a cutoff of 40 and above. These results can be used to answer the question of whether the Jawi stemmer can be used to increase document retrieval. The peak F-measures of the three stemmers are RAO = 0.20, RFO = 0.21, and Jawi stemmer = 0.26 at recall 20%. The evaluation can also be achieved using the Paice method to analyze the errors that are produced by the stemmer [Paice 1994].

ACKNOWLEDGMENTS

The authors would like to thank Tn Haji Hamdan Abdul Rahman for checking the spelling of all data and queries.

REFERENCES

Abdullah, M. T. 2006. Monolingual and cross-language information retrieval approaches for Malay andEnglish language documents. Ph.D Thesis. Universiti Putra Malaysia.

Abdullah, M. T., Ahmad, F., Sembok T. M. T. 2009. Rules frequency order stemmer for Malay language. Int.J. Comput. Sci. Netw. Secur. 9, 2, 433–438.

Abu Bakar, Z. 1999. Evaluation of retrieval effectiveness of n-gram string similarity matching on Malaydocuments. Tech. rep., Universiti Kebangsaan Malaysia. Bangi.

Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S. M., and Williams, H. 2007. Stemming Indonesia: A confix-stripping approach. ACM Trans. Asian Lang. Inf. Process. 6, 4, 13–30.

Ahmad, F. 1995. A Malay language document retrieval system: An experimental approach and analysis.Ph.D. thesis, Universiti Kebangsaan Malaysia. 1–248.

Ahmad, F., Yusoff, M., and Sembok, T. M. T. 1996 Experiments with a stemming algorithm for Malay words.J. Amer. Soc. Inf. Sci. 47, 12, 909–918.

Carlberger, J., Dalianis, H., Hassel, M., and Knutsson, O. 2001. Improving precision in information retrievalfor Swedish using stemming. In Proceedings of the 13th Nordic Computational Linguistics Conference(NODALIDA’01) 1–5.

Cleverdon, C. W., Mills, J., and Keen, M. 1966. Factors determining the performance of indexing systems.Tech. rep., College of Aeronautics, University of Michigan, MI.

Flores, F. N., Moreira, V. P., and Heuser, C. A. 2010. Assessing the impact of stemming accurancy on infor-mation retrieval. In Proceedings of the 9th International Conference on Computational Processing of thePortuguese Language (PROPAR’10). 10–20.

Frakes, W. and Baeza-Yates, R. 1992. Information retrieval: Data Structures and Algorithms. Prentice-Hall.Ghani, R. A., Zakaria, M. S., and Omar, K. 2009. Jawi-Malay transliteration. In Proceedings of the Interna-

tional Conference on Electrical Engineering and Informatics (ICEEI’09). 154–157.Harman, D. 1991. How effective is suffixing. J. Amer. Soc. Inf. Sci. 42, 1, 7–15.Hassan, A. 1974. The Morphology of Malay. Dewan Bahasa Dan Pustaka, Kementerian Pelajaran Malaysia.Hull, D. 1996. Stemming algorithms: A case study for detailed evaluation. J. Amer. Soc. Inf. Sci. 47, 1, 70–84.Islam, M. Z., Uddin, M. N., and Khan, M. 2007. A light weight stemmer for Bengali and its use in spelling

checker. In Proceedings of 1st International Conference on Digital Communications and ComputerApplications (DCCA’07).

Kraaij, W. and Pohlmann, R. 1996. Viewing stemming as recall enhancement. In Proceedings of the 19thAnnual International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR’96). 40–48.

Malacon, B. R. 2004. Computational analysis of affixed words in Malay language. In Proceedings of the 8th International Symposium on Malay/Indonesian Linguistics (ISMIL’04).

Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press, UK.

Ming, D. C. 1986. Access to Malay manuscripts. In Proceedings of the 32nd International Congress for Asian and North African Studies.

Moain, A. J. 1992. Sejarah Tulisan Jawi. Jurnal Bahasa 35, 11. 101–1012.

Nasruddin, M. F., Omar, K., Zakaria, M. S., and Liong, C.-Y. 2008. Handwritten cursive Jawi character recognition: A survey. In Proceedings of the 5th IEEE International Conference on Computer Graphics, Imaging and Visualisation (CGIV’08). 247–249.

Othman, A. 1993. Pengakar perkataan Melayu dan sistem capaian dokumen. Tech. rep., Universiti Kebangsaan.

Paice, C. D. 1994. An evaluation method for stemming algorithms. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’94). 42–50.

Porter, M. F. 1980. An algorithm for suffix stripping. Program. Electron. Libr. Inf. Syst. 14, 3, 130–137.

Rahman, H. A. 1999. Panduan Menulis dan Mengeja Jawi. Dewan Bahasa dan Pustaka, Kuala Lumpur.

Roslan, G. 2009. Jawi-Malay transliteration. In Proceedings of the International Conference on Electrical Engineering and Informatics. 154–157.

ACM Transactions on Asian Language Information Processing, Vol. 13, No. 2, Article 6, Publication date: June 2014.


Savoy, J. 1999. A stemming procedure and stopword list for general French corpora. J. Amer. Soc. Inf. Sci. 50, 10, 944–952.

Sembok, T. M. T., Palasundram, K., Ali, N. M., Yahya, A., and Wook, T. S. M. T. 2003. Istilah sains: A Malay-English terminology retrieval system experiment using stemming and n-grams approach in Malay words. In Proceedings of the 6th International Conference on Asian Digital Libraries: Technology and Management of Indigenous Knowledge for Global Access (ICADL’03). Lecture Notes in Computer Science, Vol. 2911, Springer, 173–177.

Sembok, T. M. T. 2005. Word stemming algorithm and retrieval effectiveness in Malay and Arabic documents retrieval systems. World Acad. Sci. Engin. Techno. 2911, 95–97, 173–177.

Smucker, M. D., Allan, J., and Carterette, B. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM’07). 623–632.

Smucker, M. D., Allan, J., and Carterette, B. 2009. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). 630–631.

Strohman, T., Metzler, D., Turtle, H., and Croft, W. B. 2005. Indri: A language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligence Analysis.

Sulaiman, S., Omar, K., Nazlia, O., Murah, M. Z., and Abdul Rahman, H. 2011. A Malay stemmer for Jawi character. In Proceedings of the 24th Australasian Joint Conference on Artificial Intelligence (AI’11). 668–676.

Teruja. 2010. Transliteration engine for Rumi to Jawi. http://www.jawi.ukm.my/.

Yatim, O. M. 1990. Epigrafi Islam terawal di Nusantara. Dewan Bahasa dan Pustaka.

Yonhendri. 2009. Transliterasi rumi ke jawi berasaskan petua. Master's thesis, Universiti Kebangsaan Malaysia.

Received March 2013; revised August 2013; accepted October 2013


