
Arabic Ontology Lack, Reasons and Solutions

Date post: 04-Feb-2023
Upload: alquds


Samah S. Kareem (Alquds University, [email protected])

Badie Sartawi (Alquds University, [email protected])

Abstract: Today ontologies have become common on the World Wide Web, ranging from large taxonomies that categorize Web sites to categorizations of products for sale and their features. In general, an ontology defines a common vocabulary, e.g., for companies that need to share information in a domain. An ontology provides machine-interpretable definitions of basic concepts and the relations among them. This technology mostly supports languages written in Latin-family scripts. Arabic is still not well supported, even though it is one of the most widely spoken languages in the world, with over 200 million speakers in twenty-four countries. The need for information in Arabic is high, yet there are few semantic systems with which to test whether the tools handle Arabic characters correctly. This paper discusses the reasons for this gap and proposes solutions.

Keywords: Ontology, Semantic Web, Morphological, MSA, Classical Arabic

Introduction

Arabic is a Semitic language that differs from Indo-European languages syntactically, morphologically, and semantically. The term 'classical Arabic' refers to the standard form of the language used in all writing and heard on television, radio, and in public speeches and religious sermons. The writing system of Arabic has twenty-five consonants and three long vowels, which are written from right to left and take different shapes according to their position in the word. In addition to the long vowels, Arabic has short vowels. Short vowels are not part of the alphabet; they are written as diacritics above or below a consonant to give it its intended sound and hence give the word its intended meaning. This and other properties contribute to the lack of an Arabic Semantic Web, which depends fundamentally on ontologies (collections of classes and properties, along with axioms relating them together). [5]

Research on the Semantic Web has led to the standardization of specific web ontology languages. An ontology language is a means to specify at an abstract level, that is, at a conceptual level, what is necessarily true in the domain of interest. More precisely, an ontology language should be able to express constraints, which declare what must hold in any possible concrete instantiation of the domain. [7]
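To make the idea of constraints over concrete instantiations more tangible, here is a minimal sketch in Python. It is not an ontology language and not any tool's API: the class `TinyOntology`, the property `occursDuring`, and the class names are all hypothetical, chosen only for illustration. It also treats domain and range axioms as checks on given facts, which is a simplification; languages such as OWL use them to infer types instead.

```python
from dataclasses import dataclass, field


@dataclass
class TinyOntology:
    """A toy ontology: a class hierarchy plus domain/range axioms (illustrative only)."""
    subclass_of: dict = field(default_factory=dict)  # class -> parent class
    domain: dict = field(default_factory=dict)       # property -> class required of the subject
    value_range: dict = field(default_factory=dict)  # property -> class required of the object

    def is_a(self, cls, ancestor):
        """True if cls equals ancestor or is a (transitive) subclass of it."""
        while cls is not None:
            if cls == ancestor:
                return True
            cls = self.subclass_of.get(cls)
        return False

    def violations(self, types, facts):
        """Return the facts (subject, property, object) that break an axiom."""
        problems = []
        for subj, prop, obj in facts:
            if prop in self.domain and not self.is_a(types.get(subj), self.domain[prop]):
                problems.append((subj, prop, obj, "domain"))
            if prop in self.value_range and not self.is_a(types.get(obj), self.value_range[prop]):
                problems.append((subj, prop, obj, "range"))
        return problems


# Hypothetical axioms: only events can occur during something, and only during time intervals.
onto = TinyOntology(
    subclass_of={"Month": "TimeInterval", "Meal": "Event"},
    domain={"occursDuring": "Event"},
    value_range={"occursDuring": "TimeInterval"},
)
types = {"Iftar": "Meal", "Ramadan": "Month"}

good = [("Iftar", "occursDuring", "Ramadan")]
bad = [("Ramadan", "occursDuring", "Iftar")]
```

Under these assumptions, `onto.violations(types, good)` is empty, while the reversed fact in `bad` is reported twice, once for the domain axiom and once for the range axiom.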

The paper is organized as follows: Section 2 discusses right-to-left languages and the Semantic Web; Section 3 explains how Arabic differs from other languages; Section 4 presents the challenges of building an Arabic ontology; Section 5 reviews Arabic support in ontology tools; Section 6 presents the suggested solutions; the paper closes with the conclusion and future work.

Right-to-Left Languages and the Semantic Web

Arabic script accounts for 8.9% of the world's languages, as shown in Table 1. Whatever the exact numbers, this high share, when compared with that of other languages such as Hebrew, raises the following question: why is there SW research for the Hebrew language while we do not have the same applications for Arabic? Moreover, why do we find Farsi SW applications, given that Farsi belongs to the same script family, and do not find the same applications for the Arabic script? Studies of language distribution over Web content show that English is the prevailing language for documents, with a 56% presence share [1], which leads most SW developers to design their tools and applications with Latin-based (i.e., left-to-right) scripts in mind. Nevertheless, there are emerging attempts to process right-to-left scripts of Middle Eastern languages, and some applications have been built to bring SW technologies to right-to-left scripts such as Farsi and Hebrew. Hasti is an automatic ontology building system (Shamsfard and Barforoush, 2002) that extracts lexical and ontological knowledge from Persian (Farsi) texts. The system starts from a small ontology kernel and automatically constructs the ontology through an understanding of the text. Its results have demonstrated success in handling the Farsi language compared to other ontology learning algorithms (Shamsfard and Barforoush, 2004). In the case of Hebrew, Klamma et al. (2002) transcribed a tractate from an original printed edition of the Babylonian Talmud into a structured electronic version, thus providing much easier ways of studying the text with hypertextual annotations and dynamic, knowledge-level-dependent features.

Arabic is a Semitic language. Like that of Hebrew, the orthography of Arabic has two main parts: characters that represent the consonant sounds, and diacritical marks that represent the short vowels and cause variations in pronunciation [10]. The diacritics are written above and below the characters. In addition, much research has been done to improve Arabic Information Retrieval (IR) by deploying techniques such as computational morphology to improve recall and precision (Hmeidi et al. 1997; Abu-Salem et al. 1999; Alsamara et al. 2003; Zitouni et al. 2005). Unfortunately, most research in Arabic IR has paid little attention to the problem of searching and retrieving vowelized text. Most published works even suggest removing the diacritics in a preprocessing step and unifying the content of the inverted list (Buckwalter 2002). This is because the majority of the text available on the web is unvowelized (or uses few diacritical marks). In addition, typing and matching words with diacritics is cumbersome.
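The preprocessing step mentioned above, removing diacritics before indexing and matching, can be sketched as follows. This is a minimal illustration and not the procedure of any cited work: the function names are hypothetical, and the set of marks is limited to the eight basic Arabic short-vowel and related signs (U+064B to U+0652), which is an assumption about what a real system would strip.

```python
# The eight basic Arabic diacritics: fathatan, dammatan, kasratan,
# fatha, damma, kasra, shadda, sukun (U+064B .. U+0652).
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Drop the short-vowel marks, keeping the consonant skeleton."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def loose_match(query, word):
    """Compare two words while ignoring diacritics on both sides."""
    return strip_diacritics(query) == strip_diacritics(word)


# kitabun ('a book'), fully vocalized: kaf+kasra, teh+fatha, alef, beh+dammatan
VOCALIZED = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
# The same word as it usually appears on the web, without diacritics
PLAIN = "\u0643\u062A\u0627\u0628"
```

With these definitions, `strip_diacritics(VOCALIZED)` equals `PLAIN`, so an unvowelized query finds the vowelized form. The cost is that words differing only in their short vowels collapse into one index entry, which is exactly the ambiguity the paper describes.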

Table 1: Share of the world's languages by script

How does Arabic differ from other languages?

Arabic exhibits a variety of linguistic phenomena unseen in English. Crucially, the conventional orthographic form of MSA text is unvocalized, a property that results in a deficient graphical representation. For humans, this characteristic can impede the acquisition of literacy. How do the additional ambiguities caused by devocalization affect statistical learning? How should the absence of vowels and syntactic markers influence annotation choices and grammar development?

Arabic is a morphologically rich language with a root-and-pattern system similar to that of other Semitic languages. The basic word order is VSO, but SVO, VOS, and VO configurations are also possible. Nouns and verbs are created by selecting a consonantal root (usually triliteral or quadriliteral), which bears the semantic core, and adding affixes and diacritics. Particles are uninflected. Diacritics can also be used to specify grammatical relations such as case and gender, but they are not present in unvocalized text, which is the standard form of, e.g., news media documents. [3]
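The root-and-pattern system can be illustrated with a small sketch. It works on Latin transliteration rather than Arabic script, and the pattern notation (digits as slots for root consonants) and the function name `apply_pattern` are assumptions made for this example, not a standard formalism.

```python
def apply_pattern(root, pattern):
    """Fill the digit slots 1..n of a pattern with the consonants of a root.

    Every non-digit character of the pattern (vowels, affixes) is copied as-is.
    """
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)


# The triliteral root k-t-b carries the semantic core 'writing'.
KTB = ("k", "t", "b")

# Hypothetical pattern strings in transliteration.
PATTERNS = {
    "1a2a3a": "he wrote",
    "1i2ā3": "book",
    "1ā2i3": "writer",
    "ma12a3": "office, desk",
}

derived = {apply_pattern(KTB, p): gloss for p, gloss in PATTERNS.items()}
```

Here `derived` maps `kataba`, `kitāb`, `kātib`, and `maktab` to their glosses. Note that without the vowels the first three collapse to the same skeleton `ktb`, which shows why unvocalized text is ambiguous for automatic processing.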

Arabic is written with certain cliticized prepositions, pronouns, and connectives attached to adjacent words. Arabic letters must be joined cursively, depending on the context in which they occur. These issues used to pose a problem when computers were limited to the ASCII character set, but with the introduction of Unicode the character set is handled better. However, in many instances computers still need to be Arabic-enabled in order to display Arabic fonts correctly. [2]

Arabic has a relatively free word order. Moreover, besides the regular sentence structure of verb, subject, and object, Arabic has a predicational sentence structure consisting of a subject phrase and a predicate phrase, with no verb or copula.

Arabic is rooted in Classical (Quranic) Arabic, but over the centuries the language has developed into what is now accepted as MSA. MSA is a simplified form of Classical Arabic and follows its grammar. The main differences between the two are that MSA has a larger (more modern) vocabulary and does not use some of the more complicated grammatical forms found in Classical Arabic. For example, short vowels are omitted in MSA, so the letters of Arabic text are written without diacritic signs. [5]

The Arabic language is written from right to left. It has 28 letters, some of which have one form (like "د"), while others have two forms ("س"; "سـ"), three forms ("ه"; "ـھـ"; "ھـ"), or four forms ("ع"; "ـع"; "ـعـ"; "عـ") [14]. Arabic words are generally classified into three main categories [19]: noun, verb, and particle. [4]
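The positional forms of a letter can be inspected with Python's standard `unicodedata` module. The sketch below uses the four legacy presentation-form code points of the letter ain (U+FEC9 to U+FECC); modern text normally stores only the base letter U+0639 and leaves the choice of shape to the rendering engine, so this is an illustration of the shapes rather than a description of how text should be encoded. The helper names are hypothetical.

```python
import unicodedata

# Presentation forms of the letter ain: isolated, final, initial, medial.
AIN_FORMS = ["\uFEC9", "\uFECA", "\uFECB", "\uFECC"]
AIN = "\u0639"  # the base letter as stored in ordinary Unicode text


def form_name(ch):
    """Extract the positional form from the Unicode character name."""
    # e.g. 'ARABIC LETTER AIN INITIAL FORM' -> 'INITIAL'
    return unicodedata.name(ch).rsplit(" ", 2)[-2]


def base_letter(ch):
    """Fold a presentation form back to its base letter (NFKC normalization)."""
    return unicodedata.normalize("NFKC", ch)


forms = [form_name(ch) for ch in AIN_FORMS]
folded = {base_letter(ch) for ch in AIN_FORMS}
```

All four shapes fold back to the single base letter, which is one reason tools that compare strings should normalize Arabic text before matching.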

Arabic text is also characterised by inconsistent and irregular use of punctuation marks. Punctuation marks were introduced into the Arabic writing system rather recently; they are not as essential to meaning, nor is their use as closely regulated, as in English.

There is no capitalization in Arabic, which makes it hard to identify proper names, acronyms, and abbreviations.

Arabic is a pro-drop language. The subject can be omitted, leaving any syntactic parser with the challenge of deciding whether or not there is an omitted pronoun in the subject position.

There are many concepts that exist only in Arab-Islamic culture, such as:

o Ramadan: The ninth month of the Islamic calendar.

o Suhur: A light meal eaten before starting a new day of Ramadan.

o Mausaharati: A man who beats a drum in the streets (before dawn) to wake people up to have suhur before they start a new day of fasting.

o Iftar: A meal at the end of each day of Ramadan, at sunset.

o DhuAlHijjah: Dhu al-Hijja is the twelfth and final month in the Islamic calendar.

o IhramClothing: Special Muslim clothing worn during Pilgrimage ceremonies.

o IhramPeriod: Special Muslim practices, including the type of clothing, hair cutting/shaving, and behaviour prior to and during Pilgrimage ceremonies.

o Umrah: A pilgrimage to Mecca performed by Muslims that can be undertaken at any time of the year.

o Zakat: The third of the Five Pillars of Islam; it refers to spending at least 2.5% of one's wealth each year for the poor or needy.

o EidAlFitr: A socioreligious event in which Muslims celebrate the end of fasting at the end of the holy month of Ramadan.

o Udhiyah Ritual: A ritual in which a lamb is killed as a sacrifice on the day of the Greater Eid (Eid al-Adha).

o Many other concepts can be found at (http://sigmakee.cvs.sourceforge.net/viewvc/sigmakee/KBs/ArabicCulture.kif)

Challenges of building Arabic Ontology

Our preliminary research in the field of the SW revealed a number of issues that suggest reasons for the lack of Arabic research, namely the lack of technology support and of adequate resources.

1) Lack of Arabic support in existing Semantic Web tools. A specific problem with SW tools that process Arabic text concerns encoding. Several encodings of Arabic script exist on the Web; the dominant ones include UTF-8, Windows-1256, and ISO-8859-6. Most SW tools were built using Java, which supports internationalization. There is therefore a strong need to consolidate the different Arabic encodings, or simply to adhere to one encoding scheme (Unicode) when representing Arabic text. Typical SW developers' tools use Unicode throughout (Carroll, 2005); hence this might solve part of the support problem. [7]

2) Lack of Arabic Semantic Web applications. Further evidence of the lack of Arabic in the Semantic Web world is a recent statistic from the OntoSelect ontology library, which shows that 49% of the ontologies in the library are written in English. This problem can be attributed to the lack of tools and software development environments that process Arabic script in all steps of the semantic annotation process. [2]

3) Limited support for Arabic research in the field of Semantic Web technologies. Most SW research results from investment by both grant bodies and academic research centers. SW tools such as Protégé, GATE, and Jena, to name just a few, are all products of successful investment in the SW field. In the particular case of Arabic, the limited research can be attributed to the lack of adequate resources in terms of skills, funding, and interest in this emerging field of Web research. The allocation of research funding, the provision of resources, and interest from a committed community of practice are essential if we are to overcome this problem. [7]
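The encoding consolidation called for in item 1 can be sketched with Python's built-in codecs. The helper name `to_utf8` is hypothetical, and the sketch assumes the source encoding of each document is already known (for example from an HTTP header or a meta tag); detecting an undeclared encoding is a separate and harder problem that is not addressed here.

```python
def to_utf8(raw, declared_encoding):
    """Decode bytes using the declared legacy encoding and re-encode as UTF-8."""
    return raw.decode(declared_encoding).encode("utf-8")


# The word kitab ('book') without diacritics: kaf, teh, alef, beh.
WORD = "\u0643\u062A\u0627\u0628"

# The same word as it might arrive from three differently encoded pages.
incoming = {
    "utf-8": WORD.encode("utf-8"),
    "windows-1256": WORD.encode("windows-1256"),
    "iso-8859-6": WORD.encode("iso-8859-6"),
}

# After consolidation every copy is the same UTF-8 byte sequence.
consolidated = {to_utf8(raw, enc) for enc, raw in incoming.items()}
```

The two legacy encodings use one byte per letter while UTF-8 uses two, so the raw byte sequences differ even though the text is identical; after conversion `consolidated` contains a single value, which is what a downstream SW tool needs in order to compare strings reliably.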

In terms of financial resources, a major concern, particularly for communities with a small user base such as Arabic users, is the cost of SW application development and maintenance. In some well-structured areas, such as scientific, commercial, and government applications, the potential benefits in productivity gain and profit will outweigh the cost of developing and maintaining an ontology or application. The cost, in terms of the time and effort required, will decrease as the user base increases. [3]

Arabic Support in Ontology Tools

A series of studies has been conducted to evaluate the available Arabic frameworks and systems. These include, but are not limited to, the workshop on Arabic Language Resources and Evaluation (LREC 2002), a special session on Arabic processing at Traitement Automatique du Langage Naturel (TALN 2004), the Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004), and the NEMLAR Arabic Language Resources and Tools Conference in Cairo, Egypt (2004). The following results and discussion concern tools such as Protégé, Jena, Sesame, and KAON2. All of them showed weak support for the Arabic language, though some fared better than others.

Protégé can create and display an ontology in Arabic at a basic level; the Jambalaya plug-in successfully solves the problem of displaying Arabic text, as shown in Figure 1. Protégé is a suitable tool for building and managing conceptual terminology in an ontology [8]. It uses the RDF standard, which in turn uses the UTF-8 encoding, an encoding that is compatible with null-terminated strings. However, it may display numeric literals instead of Arabic characters. WordNet integration is available only for English; such lexical resources make connections between concepts faster to establish and widen accessibility. Once lexical resources such as synsets are sufficiently connected, the available resources are more likely to give relevant results [8].

The Protégé query tools support Arabic text queries; however, without handling of diacritics or stemming, Arabic is not supported as well as English text. Jena can also build RDF/OWL files in Arabic. Many APIs integrate with the Jena query engine for English language processing, but nothing is yet available to support Arabic, so Jena can be queried only by the exact Arabic word. Sesame, on the other hand, uses numeric literals to store Arabic characters, but it is unable to read or query an Arabic ontology. KAON2 does not support Arabic at all, although it already uses UTF-8 encoding.

None of the evaluated systems supports Arabic language processing or diacritics, and certain conditions must be met before they can. Most studies have focused on tools that break down and filter the semantics needed to retrieve the required data [Hammo, 2009], without any attention to the problem of searching and retrieving diacritized text [9].
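The paper does not say what form the "numeric literals" take. One plausible reading, offered here purely as an assumption, is that Arabic characters are written out as numeric character references of the kind used in XML serializations. Under that reading, the following sketch shows how such references arise and how the original text can be recovered; the function names are hypothetical and no ontology tool's API is involved.

```python
import html


def to_numeric_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_numeric_refs(text):
    """Turn character references back into the characters they stand for."""
    return html.unescape(text)


# kitab ('book'): kaf, teh, alef, beh
WORD = "\u0643\u062A\u0627\u0628"

stored = to_numeric_refs(WORD)        # what a tool might store or display
restored = from_numeric_refs(stored)  # what the user expects to see
```

Here `stored` is the ASCII string `&#1603;&#1578;&#1575;&#1576;`. A query for the Arabic word will not match it unless the tool decodes the references first, which would be consistent with the behaviour reported for Sesame above.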

Figure 1: Arabic ontology in Protégé

Table 2: Arabic support in ontology tools, by standard

| Tool | Creator | Functionality | RDF | RDFS | OWL | SPARQL |
|------|---------|---------------|-----|------|-----|--------|
| Protégé | Stanford Center for Biomedical Informatics Research | Ontology editor and knowledge base framework for ontology manipulation & query | Support | Support | Limited Support | Limited Support |
| Jena | Hewlett-Packard Development Company | Framework for ontology manipulation and query | Support | Support | Support | Limited Support |
| Sesame | Aduna in cooperation with NLnet Foundation | Framework for storage, inference and querying of RDF data | Limited Support | Limited Support | Limited Support | No Support |
| KAON2 | Research Center for Information Technologies | Suite of ontology management (create, manipulate, infer) tools | No Support | No Support | No Support | (not given) |

The Suggested Solutions

Form an Arab Higher Committee for an Arabic ontology (SUMO), composed of university teachers and researchers who specialize in Arabic language processing and encoding, to encourage all Arab researchers to develop ontologies for their domains.

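Build Arabic text rendering, because without it Arabic text in any font is unreadable. This is shown in Figure 1 for the example in (1): the word كِتَابٌ /kitābun/ 'a book' is built from the sequence of letters and diacritics k, i, t, ā, b, un, which a renderer must join into a single connected form.

All systems need to be integrated with an Arabic renderer.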

Build an Arabic reasoner so that ontologies built in Arabic script remain readable, taking into account that Arabic is written from right to left.

An Arabic ontology library is needed, at the same level as (http://www.daml.org/ontologies/), to share a common understanding of the structure of information among people and software agents.

Every Arabic ontology must have multilingual classes so that it can be integrated with existing international ontologies, remain readable, and enable reuse of domain knowledge.

All Arabic ontologies must be as simple as their developers can make them, so that non-Arabic speakers can understand them (with some explanation of specifically Arab-Islamic entities).

Domain assumptions should be made explicit by separating domain knowledge from operational knowledge, which can be done in cooperation with morphology specialists.
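The suggestion of multilingual classes can be sketched as a data structure in which one language-independent identifier carries labels in several languages. The class `Concept`, the identifier `ex:Ramadan`, and the fallback policy are all hypothetical choices for this illustration; real ontology languages express the same idea with language-tagged labels.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept with a language-independent identifier and per-language labels."""
    concept_id: str
    labels: dict = field(default_factory=dict)  # language tag -> label

    def label(self, lang, fallback="en"):
        """Return the label in the requested language, else the fallback, else the id."""
        return self.labels.get(lang) or self.labels.get(fallback) or self.concept_id


# Hypothetical entry: the identifier is shared, only the labels differ by language.
ramadan = Concept(
    concept_id="ex:Ramadan",
    labels={
        "ar": "\u0631\u0645\u0636\u0627\u0646",  # the Arabic word for Ramadan
        "en": "Ramadan",
    },
)

arabic_label = ramadan.label("ar")
french_label = ramadan.label("fr")  # no French label stored, falls back to English
```

Because both labels hang off the same `concept_id`, an Arabic ontology built this way can be aligned with an existing international ontology by identifier, while Arabic and non-Arabic readers each see a label they can understand.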

CONCLUSION

Finally, the creation and design of an ontology is not a goal in itself. An ontology, or a knowledge base built from an ontology, is developed to be used by other programs and applications. Moreover, all ontology trees are connected through reference numbers that have the same value for the same concept, even across languages. Without Arabic ontologies, this will create a large gap between the Arab world and others, because in the near future the whole web will be a semantic web and all computer applications will depend on ontologies. We, as Arabs, must therefore be ready with our community ontologies so that we can integrate with future software and keep up with rapid global development in the field of technology.

REFERENCES

[1] Arabic WordNet: Current State and Future Extensions. Horacio Rodríguez, David Farwell, Javi Farreres, Manuel Bertran, Musa Alkhalifa, M. Antonia Martí, William Black, Sabri Elkateb, James Kirk, Adam Pease, Piek Vossen, and Christiane Fellbaum.

[2] Better Arabic Parsing: Baselines, Evaluations, and Analysis. Spence Green and Christopher D. Manning, Computer Science Department, Stanford University.

[3] Rule-based Approach in Arabic Natural Language Processing. Khaled Shaalan.

[4] Building a Formal Arabic Ontology. Mustafa Jarrar.

[5] Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Mohammed A. Attia.

[6] A Survey of Arabic Language Support in Semantic Web. Majdi Beseiso, Abdulrahim Ahmad, Roslan Ismail.

[7] The Arabic Language and the Semantic Web: Challenges and Opportunities. Hend S. Al-Khalifa, Areej S. Al-Wabil.

[8] http://www.fileformat.info/info/unicode/utf8.htm

[9] Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. B. Hammo, 2009.

[10] Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus. Bassam Hammo, Azzam Sleit, Mahmoud El-Haj.

