Arabic Ontology Lack, Reasons and Solutions
Samah S. Kareem, Badie Sartawi
Alquds University, Alquds University [email protected]
Abstract: Today ontologies have become common on the World Wide Web. They range from large taxonomies categorizing Web sites to categorizations of products for sale and their features. In general, an ontology defines a common vocabulary, e.g. for companies that need to share information in a domain. An ontology provides machine-interpretable definitions of basic concepts and the relations among them. This technology mostly supports languages written in Latin-family scripts. Arabic is still not well supported, even though it is one of the most widely spoken languages in the world, with over 200 million speakers, and is used in twenty four countries. The need for information in Arabic is high, but there are few semantic systems with which to test whether Arabic characters are handled correctly by the available tools. This paper discusses the reasons for this lack and proposes solutions.
Keywords: Ontology, Semantic Web, Morphological, MSA, Classical Arabic
Introduction
Arabic is a Semitic language which differs from Indo-European languages syntactically, morphologically, and semantically. The term ‘classical Arabic’ refers to the standard form of the language used in all writing and heard on television, radio and in public speeches and religious sermons. The writing system of Arabic has twenty five consonants and three long vowels that are written from right to left and take different shapes according to their position in the word. In addition to the long vowels, Arabic has short vowels. Short vowels are not part of the alphabet but are written as diacritics above or under a consonant to give it its desired sound and hence give a word its desired meaning. This and other reasons contribute to the lack of an Arabic Semantic Web, which depends fundamentally on ontologies (collections of classes and properties, along with axioms relating them together). [5]
Research in the Semantic Web has led to the standardization of specific web ontology languages. An ontology language is a means to specify at an abstract level, that is, at a conceptual level, what is necessarily true in the domain of interest. More precisely, an ontology language should be able to express constraints, which declare what should necessarily hold in any possible concrete instantiation of the domain. [7]
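As a rough illustration of what "expressing constraints" means, the following Python sketch models a toy ontology as a set of classes, one property with a domain and a range, and a check that a concrete instantiation respects them. All names (Book, Person, hasAuthor and the instance identifiers) are hypothetical and are not taken from any ontology discussed in this paper; real ontology languages such as OWL state such constraints declaratively.

```python
# Toy ontology: classes, one property, and a domain/range constraint.
# Every class, property, and instance name here is hypothetical.

classes = {"Book", "Person"}

# property name -> (domain class, range class)
properties = {"hasAuthor": ("Book", "Person")}

# instance name -> class it belongs to
instances = {"kitab1": "Book", "author1": "Person", "kitab2": "Book"}

# Every instance must belong to a declared class.
assert set(instances.values()) <= classes

def violations(triples):
    """Return the triples that break a domain or range constraint."""
    bad = []
    for subject, prop, obj in triples:
        domain, range_ = properties[prop]
        if instances.get(subject) != domain or instances.get(obj) != range_:
            bad.append((subject, prop, obj))
    return bad

facts = [
    ("kitab1", "hasAuthor", "author1"),  # Book -> Person: allowed
    ("kitab1", "hasAuthor", "kitab2"),   # Book -> Book: range violated
]
print(violations(facts))
```

Only the second fact is reported, because its object is not a Person.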
The paper is organized as follows: section 2 presents right-to-left languages and the Semantic Web, section 3 presents how Arabic differs from other languages, section 4 presents the challenges of building an Arabic ontology, section 5 presents Arabic support in ontology tools, section 6 presents the suggested solutions, and finally we give the conclusion and future work.
Right-to-Left Languages and the Semantic Web
Arabic script constitutes 8.9% of the world’s languages, as shown in Table 1. Whatever the numbers, it is critical to understand that this high percentage of the Arabic language, when compared to other languages such as Hebrew, raises the following question: why is there SW research in the Hebrew language and yet we do not have the same applications for the Arabic language? Moreover, why do we find Farsi SW applications, as Farsi is from the same script family, and we do not find the same applications for the Arabic script? Studies on language distribution over Web content show that English is the prevailing language for documents, with a 56% presence share [1], which leads most SW developers to design their tools and applications with Latin-based scripts in mind (i.e., left-to-right scripts). Nevertheless, there are some emerging attempts to process right-to-left scripts from Middle Eastern languages. Some applications have been built to facilitate SW technologies for right-to-left scripts such as Farsi and Hebrew. Hasti is an automatic ontology building system (Shamsfard and Barforoush, 2002). It extracts lexical and ontological knowledge from Persian (Farsi) texts. The system starts from a small ontology kernel and automatically constructs the ontology through an understanding of the text. The results of the system have proved its success in handling the Farsi language compared to other ontology learning algorithms (Shamsfard and Barforoush, 2004). In the case of Hebrew, Klamma et al. (2002) developed a Talmudic tractate that transcribes an original printed edition of The Babylonian Talmud into a structured electronic version, thus providing much easier ways of studying the text with hypertextual annotations and dynamic knowledge level-dependent features.
Arabic is a Semitic language. Like Hebrew, the orthography of the Arabic language has two main parts: characters that represent the consonant sounds, and diacritical marks that represent the short vowels and cause variations in pronunciation. [10] The diacritics are written above and below the characters. In addition, much research has been done to improve Arabic Information Retrieval (IR) by deploying techniques and methodologies such as computational morphology to improve recall and precision (Hmeidi et al. 1997; Abu-Salem et al. 1999; Alsamara et al. 2003; Zitouni et al. 2005). Unfortunately, most research in the field of Arabic IR did not pay much attention to the problem of searching and retrieving vowelized text. Most published works even suggested removing the diacritics in the preprocessing step and unifying the content of the inverted list (Buckwalter 2002). This is because the majority of the text available on the web is not vowelized (or uses few diacritical marks). In addition, typing and matching words with diacritics is cumbersome.
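The preprocessing step mentioned above, removing diacritics before indexing, can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any cited work; the set of code points treated as diacritics (U+064B to U+0652 plus the superscript alef U+0670) is an assumption made for this sketch.

```python
import re

# Arabic short-vowel and related marks, U+064B..U+0652, plus the
# superscript alef U+0670. This range is an assumption of the sketch.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving the consonantal skeleton."""
    return DIACRITICS.sub("", text)

# kaf + kasra, ta + fatha, alef, ba + dammatan: the vowelized word kitabun
vowelized = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
plain = strip_diacritics(vowelized)

print(len(vowelized), len(plain))            # 7 4
print(plain == "\u0643\u062A\u0627\u0628")   # True
```

The trade-off is the one described in the text: the stripped form matches the unvowelized text that dominates the Web, but distinctions carried only by the short vowels are lost.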
Table 1: The world’s languages script percentage
How does Arabic differ from other languages?
Arabic contains a variety of linguistic phenomena unseen in English. Crucially, the conventional orthographic form of MSA text is unvocalized, a property that results in a deficient graphical representation. For humans, this characteristic can impede the acquisition of literacy. How do additional ambiguities caused by devocalization affect statistical learning? How should the absence of vowels and syntactic markers influence annotation choices and grammar development?
Arabic is a morphologically rich language with a root-and-pattern system similar to other Semitic languages. The basic word order is VSO, but SVO, VOS, and VO configurations are also possible. Nouns and verbs are created by selecting a consonantal root (usually triliteral or quadriliteral), which bears the semantic core, and adding affixes and diacritics. Particles are uninflected. Diacritics can also be used to specify grammatical relations such as case and gender. But diacritics are not present in unvocalized text, which is the standard form of, e.g., news media documents. [3]
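The root-and-pattern mechanism described above can be made concrete with a small Python sketch. It works on Latin transliteration rather than Arabic script, and the pattern notation (digits 1, 2, 3 standing for the root consonants) is a convention invented for this example, not a standard one.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the slots 1, 2, 3 of a pattern with the consonants of a root."""
    consonants = root.split("-")
    out = []
    for ch in pattern:
        if ch.isdigit():
            # A digit names a root consonant (1-based).
            out.append(consonants[int(ch) - 1])
        else:
            # Anything else is a vowel or affix supplied by the pattern.
            out.append(ch)
    return "".join(out)

root = "k-t-b"  # triliteral root whose semantic core is "writing"
examples = [
    ("1a2a3a", "he wrote"),
    ("1i2ā3", "book"),
    ("1ā2i3", "writer"),
    ("ma12ū3", "written"),
]
for pattern, gloss in examples:
    print(apply_pattern(root, pattern), "-", gloss)
```

The four outputs (kataba, kitāb, kātib, maktūb) share the same three consonants, which is why an unvocalized spelling that records little more than those consonants can be ambiguous between several words.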
Arabic is written with certain cliticized prepositions, pronouns, and connectives connected to adjacent words. Arabic letters need to be connected together in a cursive way depending on the context in which they occur. These issues used to pose a problem when computers were limited to use of the ASCII system, but with the introduction of the Unicode system there is better handling of the character set. However, in many instances, computers still need to be Arabic-enabled in order to view Arabic fonts correctly. [2]
Arabic has a relatively free word order. Moreover, besides the regular sentence structure of verb, subject and object, Arabic has a predicational sentence structure of a subject phrase and a predicate phrase, with no verb or copula.
Arabic is rooted in Classical or Quranic Arabic, but over the centuries the language has developed into what is now accepted as MSA. MSA is a simplified form of Classical Arabic and follows its grammar. The main differences between Classical Arabic and MSA are that MSA has a larger (more modern) vocabulary and does not use some of the more complicated forms of grammar found in Classical Arabic. For example, short vowels are omitted in MSA, so that the letters of Arabic text are written without diacritic signs. [5]
The Arabic language is written from right to left. It has 28 letters, some of which have one form (like "د"), while others have two forms ("س"; "سـ"), three forms ("ه"; "ھـ"; "ـھـ"), or four forms ("ع"; "ـع"; "ـعـ"; "عـ") [14]. Arabic words are generally classified into three main categories [19]: noun, verb, and particle. [4]
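The positional letter forms mentioned above are visible in Unicode itself. The sketch below is an illustration using Python's standard `unicodedata` module. It takes the four presentation forms of the letter ain from the Arabic Presentation Forms-B block and shows that compatibility normalization (NFKC) maps each of them back to the single base letter U+0639. Text is normally stored with the base letter only and the shaping is left to the renderer, so the presentation forms are listed explicitly here for demonstration.

```python
import unicodedata

# The four presentation forms of the letter ain
# (Arabic Presentation Forms-B block).
AIN_FORMS = ["\uFEC9", "\uFECA", "\uFECB", "\uFECC"]

for form in AIN_FORMS:
    # NFKC folds a presentation form back to its base letter.
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
```

A tool that compares strings without such normalization may treat the same letter in different positions as different characters.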
Arabic text is also characterised by the inconsistent and irregular use of punctuation marks. Punctuation marks were introduced rather recently into the Arabic writing system; they are not as essential to meaning, nor is their use as closely regulated, as is the case with English.
There is no capitalization in Arabic, which makes it hard to identify proper names, acronyms, and abbreviations.
Arabic is a pro-drop language. The subject can be omitted, leaving any syntactic parser with the challenge of deciding whether or not there is an omitted pronoun in the subject position.
There are many concepts which exist only in the Arab Islamic culture, like:
o Ramadan: The ninth month of the Islamic calendar.
o Suhur: A light meal before starting a new day of Ramadan.
o Mausaharati: A man who beats a drum in the streets (before dawn) to wake people up to have suhur before they start a new day of fasting.
o Iftar: A meal at the end of each day of Ramadan, at sunset.
o DhuAlHijjah: Dhu al-Hijja is the twelfth and final month in the Islamic calendar.
o IhramClothing: Special Muslim clothing, worn during Pilgrimage ceremonies.
o IhramPeriod: Special Muslim practices including the type of clothing, hair cutting/shaving and behaviour prior to and during Pilgrimage ceremonies.
o Umrah: A pilgrimage to Mecca performed by Muslims that can be undertaken at any time of the year.
o Zakat: The third of the Five Pillars of Islam; it refers to spending at least 2.5% of one's wealth each year for the poor or needy.
o EidAlFitr: Socio-religious event in which Muslims celebrate the end of fasting at the end of the holy month of Ramadan.
o Udhiyah Ritual: A ritual in which a lamb is killed as a sacrifice on the day of the Greater Eid (Eid Aladha).
o Many other concepts can be found at (http://sigmakee.cvs.sourceforge.net/viewvc/sigmakee/KBs/ArabicCulture.kif)
Challenges of building Arabic Ontology
Our preliminary research in the field of the SW revealed a number of issues that suggest reasons for the lack of Arabic research, namely the lack of technology support and adequate resources.
1) Lack of Arabic support in existing Semantic Web tools. A specific problem with SW tools processing Arabic text concerns encoding. Different encodings of Arabic script exist on the Web; dominant encodings include UTF-8, Windows-1256 and ISO-8859-6. Most of the SW tools were built using Java, which supports internationalization. Even so, there is a strong need to consolidate the different Arabic encodings or simply adhere to one encoding scheme (Unicode) when representing Arabic text. Typical SW developers’ tools use Unicode throughout (Carroll, 2005); hence this might solve part of the support problem. [7]
2) Lack of Arabic Semantic Web applications. Further evidence of the lack of Arabic in the Semantic Web world is a recent statistic provided by the OntoSelect ontology library, which shows that 49% of the ontologies in the library are created in English. This problem could be attributed to the lack of tools and software development environments that process Arabic script in all steps of the semantic annotation process. [2]
3) Limited support for Arabic research in the field of Semantic Web technologies. Most of the SW research is a result of investment from both grant bodies and academic research centers. SW tools such as Protégé, GATE, and Jena, to name just a few, are all products of successful investment in the SW field. For the particular case of Arabic, the limited research problem can be attributed to the lack of adequate resources in terms of skills, funding and interest in this emerging field of Web research. The allocation of research funding, the provision of resources, and interest from a committed practice community are essential if we are to overcome this problem. [7]
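The encoding issue raised in item 1 above can be illustrated with a short Python sketch. It encodes the same Arabic word in the three encodings named there and converts each byte sequence to UTF-8. The codec names (`utf-8`, `cp1256`, `iso8859_6`) are Python's names for these encodings, and the helper `to_utf8` is a hypothetical function written for this example. The sketch assumes the source encoding is known; detecting an undeclared encoding is a separate problem that it does not address.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode bytes in a declared encoding and re-encode them as UTF-8."""
    return raw.decode(declared_encoding).encode("utf-8")

# kaf, ta, alef, ba: the unvowelized word for "book"
word = "\u0643\u062A\u0627\u0628"

# Python codec names for UTF-8, Windows-1256 and ISO-8859-6.
for codec in ("utf-8", "cp1256", "iso8859_6"):
    raw = word.encode(codec)
    same = to_utf8(raw, codec) == word.encode("utf-8")
    print(codec, raw.hex(), same)
```

The byte sequences differ between encodings, so a tool that assumes the wrong one will show unreadable characters; once everything is converted to a single Unicode encoding the three inputs become identical.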
In terms of financial resources, a major concern, particularly for communities with a small user base such as Arabic users, is the cost of SW application development and maintenance. In some well structured areas such as scientific, commercial and government applications, the potential benefits in productivity gain and profit will outweigh the cost of developing and maintaining an ontology/application. The cost, in terms of time and effort required, will decrease as the user base increases. [3]
Arabic Support in Ontology Tools
A series of studies were conducted to evaluate the available Arabic frameworks and systems. These include, but are not limited to, the workshop on Arabic Language Resources and Evaluation (LREC 2002), a special session on Arabic processing at Traitement Automatique des Langues Naturelles (TALN 2004), the Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004), and the NEMLAR Arabic Language Resources and Tools Conference in Cairo, Egypt (2004). The following results and discussion cover tools such as Protégé, Jena, Sesame, and KAON2; all of them showed weak support for the Arabic language, though some fared better than others. Protégé can create and display an ontology in Arabic, and the Jambalaya plug-in successfully solves the problem of displaying Arabic text, as shown in Figure 1. Protégé is a suitable tool for building and managing conceptual terminology in an ontology [8]. The system uses the RDF standard, which in turn uses the UTF-8 encoding, which is compatible with null-terminated strings. However, it may display numeral literals instead of Arabic characters. WordNet is available only for English, which points to the role of lexical resources in making connectivity faster and accessibility wider. Once lexical resources such as synsets are sufficiently connected, the available resources are more likely to give relevant results [8].
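The "numeral literals" mentioned above are not specified further in the studies summarized here. One plausible reading, and it is only an assumption of this sketch, is that Arabic characters appear as XML/HTML numeric character references such as `&#1603;`. Under that assumption, the Python standard library can convert in both directions; `to_numeric_refs` is a hypothetical helper written for this example.

```python
import html

# Assumption: "numeral literals" are decimal numeric character references.
escaped = "&#1603;&#1578;&#1575;&#1576;"

# html.unescape turns the references back into Arabic characters.
decoded = html.unescape(escaped)
print(decoded == "\u0643\u062A\u0627\u0628")  # True

def to_numeric_refs(text: str) -> str:
    """Encode every non-ASCII character as a decimal character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

print(to_numeric_refs(decoded) == escaped)    # True
```

If a tool stores or displays text in this escaped form, the data is intact but unreadable to a user until it is decoded.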
Protégé query tools support Arabic text queries; however, without diacritics handling or stemming, Arabic is not supported as well as English text. The Jena system can also build RDF/OWL files in Arabic. Many APIs can integrate with the Jena query engine for English language processing, but nothing is available as yet to support Arabic, so Jena can be queried only by exact Arabic word. Sesame, on the other hand, uses numeral literals to store Arabic characters but is unable to read or query an Arabic ontology. KAON2 does not support Arabic at all, although UTF-8 encoding is already being used. None of the evaluated systems supports Arabic language processing or diacritics, and certain conditions must be met before this goal can be attained. Most of the studies focused on tools to break down and filter the semantics needed to retrieve the required data [Hammo, 2009], without any attention to the problem of searching and retrieving diacritized text [9].
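The limitation of exact-word querying can be illustrated without any of the evaluated tools. The Python sketch below builds a lookup index over class labels keyed by a normalized form, so that a diacritized query still finds an undiacritized label. The class identifiers, the labels and the `lookup` function are all hypothetical, and dropping every nonspacing mark (Unicode category Mn) is a simplifying assumption; this is not how Protégé, Jena, Sesame or KAON2 behave.

```python
import unicodedata

def normalize(label: str) -> str:
    """Drop nonspacing marks (category Mn), which include Arabic diacritics."""
    return "".join(c for c in label if unicodedata.category(c) != "Mn")

# Hypothetical ontology: class identifier -> undiacritized Arabic label.
labels = {
    "ex:Book": "\u0643\u062A\u0627\u0628",    # kaf ta alef ba
    "ex:Writer": "\u0643\u0627\u062A\u0628",  # kaf alef ta ba
}

# Index keyed by the normalized label instead of the raw string.
index = {normalize(label): cls for cls, label in labels.items()}

def lookup(query: str):
    """Return the class whose label matches the query, ignoring diacritics."""
    return index.get(normalize(query))

diacritized_query = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"  # kitabun
print(lookup(diacritized_query))   # ex:Book
print(lookup("\u0643\u0627\u062A\u0628"))  # ex:Writer
```

An exact string comparison would return nothing for the diacritized query, which is the behaviour reported above for the evaluated systems.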
Figure 1: Arabic Ontology in Protégé
Table 2: Arabic Support in Ontology Tools

Tool | Creator | Functionality | Standard: Arabic support
Protégé | Stanford Center for Biomedical Informatics Research | Ontology editor and knowledge base framework for ontology manipulation & query | RDF: Support; RDFS: Support; OWL: Limited Support; SPARQL: Limited Support
Jena | Hewlett-Packard Development Company | Framework for ontology manipulation and query | RDF: Support; RDFS: Support; OWL: Support; SPARQL: Limited Support
Sesame | Aduna in cooperation with NLnet Foundation | Framework for storage, inference and querying of RDF data | RDF: Limited Support; RDFS: Limited Support; OWL: Limited Support; SPARQL: No Support
KAON2 | Research Center for Information Technologies | Suite of ontology management (Create, Manipulate, Infer) tools | RDF: No Support; RDFS: No Support; OWL: No Support
The Suggested Solutions
Form an Arab Higher Committee to build an Arabic ontology (SUMO), composed of university teachers and researchers who specialize in Arabic language processing and encoding, to encourage all Arab researchers to develop their domain ontologies.
Build Arabic text rendering, since without it Arabic fonts are unreadable. This is shown in Figure 1 for the word in (1).
(1) كِتَابٌ ⇐ كِ تَ ا بٌ (un b ā t i k, read right to left) /kitābun/ ‘a book’
All systems need to be integrated with an Arabic renderer.
Build an Arabic reasoner so that ontologies built in Arabic script are readable, taking into account that the Arabic writing direction is from right to left.
An Arabic ontology library is needed at the same level as (http://www.daml.org/ontologies/), to share a common understanding of the information structure among people or software agents.
All Arabic ontologies must have multi-language classes, so that they can be integrated with existing international ontologies, remain readable, and enable reuse of domain knowledge.
All Arabic ontologies must be as simple as developers can make them, so that non-Arabic speakers can understand them (with some explanation of special Arabic Islamic entities).
Make domain assumptions explicit by separating domain knowledge from operational knowledge, which can be done in cooperation with morphology specialists.
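The suggestion above that Arabic ontologies carry multi-language classes can be sketched as follows. The Python snippet keeps one class identifier with labels in several languages and serializes it as a Turtle fragment with language-tagged `rdfs:label` values. The identifier `ex:Iftar`, the dictionary layout and the `to_turtle` helper are hypothetical, and prefix declarations are omitted, so the output is a fragment, not a complete document.

```python
# Hypothetical multilingual class: one identifier, one label per language.
concept = {
    "id": "ex:Iftar",
    "labels": {
        "ar": "\u0625\u0641\u0637\u0627\u0631",  # Arabic label for Iftar
        "en": "Iftar",
    },
}

def to_turtle(c: dict) -> str:
    """Serialize the class as a Turtle fragment with language-tagged labels."""
    header = f'{c["id"]} a owl:Class ;'
    labels = [
        f'    rdfs:label "{text}"@{lang}'
        for lang, text in sorted(c["labels"].items())
    ]
    # Labels are separated by " ;" and the statement ends with " ."
    return header + "\n" + " ;\n".join(labels) + " ."

print(to_turtle(concept))
```

Because the identifier is shared, an application that only understands English labels can still recognise that it is dealing with the same concept as one that uses the Arabic label.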
CONCLUSION
Finally, the creation and design of an ontology is not a goal in itself. An ontology, or a knowledge base built from an ontology as data, is developed to be used by other programs and applications. All ontology trees are connected together by reference numbers that have the same value for the same concept even across different languages. Without Arabic ontologies, this will in the future create a large gap between the Arab world and others, since in the near future the whole Web will be a Semantic Web and all computer applications will depend on ontologies. As Arabs, we must therefore be ready with our own community ontologies, so that we can integrate with future software and keep up with the rapid global development in the field of technology.
REFERENCES
[1] Arabic WordNet: Current State and Future Extensions. Horacio Rodríguez, David Farwell, Javi Farreres, Manuel Bertran, Musa Alkhalifa, M. Antonia Martí, William Black, Sabri Elkateb, James Kirk, Adam Pease, Piek Vossen, and Christiane Fellbaum.
[2] Better Arabic Parsing: Baselines, Evaluations, and Analysis. Spence Green and Christopher D. Manning, Computer Science Department, Stanford University.
[3] Rule-based Approach in Arabic Natural Language Processing. Khaled Shaalan.
[4] Building a Formal Arabic Ontology. Mustafa Jarrar.
[5] Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Mohammed A. Attia.
[6] A Survey of Arabic Language Support in Semantic Web. Majdi Beseiso, Abdulrahim Ahmad, Roslan Ismail.
[7] The Arabic Language and the Semantic Web: Challenges and Opportunities. Hend S. Al-Khalifa, Areej S. Al-Wabil.
[8] http://www.fileformat.info/info/unicode/utf8.htm
[9] Hammo, B. 2009. Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents.
[10] Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus. Bassam Hammo, Azzam Sleit, Mahmoud El-Haj.