Word normalization in Indian languages
by
Prasad Pingali, Vasudeva Varma
in
the proceedings of the 4th International Conference on Natural Language Processing (ICON 2005), December 2005.
Report No: IIIT/TR/2008/81
Centre for Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
June 2008
Word normalization in Indian languages
Prasad Pingali, Vasudeva Varma
Language Technologies Research Centre
IIIT, Hyderabad, India
[email protected] [email protected]
Abstract
Indian language words face spelling standardization issues, resulting in multiple spelling variants for the same word. The major reasons for this phenomenon are the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from foreign languages, and the phonetic variety in the Indian language alphabets. Such spelling variations make web Information Retrieval difficult for Indian languages, since finding relevant documents requires more than an exact string match. In this paper we examine the characteristics of such word spelling variations and explore how to model them computationally. In our evaluation we compare a set of language-specific rules with several approximate string matching algorithms.
1 Problem statement
India is rich in languages, boasting not only the indigenous Dravidian and Indo-Aryan tongues, but also the absorption of Middle Eastern and European influences. This richness is also evident in the written form of the languages. A remarkable feature of the alphabets of India is the manner in which they are organised: according to phonetic principles, unlike the Roman alphabet, whose sequence of letters is arbitrary. This richness has also led to a set of problems over time. The variety in the alphabet, the different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Some of these variations can be treated as writing errors, while others are too widely used to be called errors. In this paper we consider all types of spelling variations of a word in the language.
This study of Indian language words is part of a web search engine project for Indian languages. Real web data can be very problematic, yet Information Retrieval and web search systems rarely mention the problems of real-world web data explicitly. When comparing strings from the web, they assume the data to be homogeneous and comparable across sources. In practice, real web data contains many string variations that need to be handled. Such variations occur far more often in Indian languages, for several reasons. Those we could identify are the phonetic nature of Indian languages, the larger size of the alphabet, the lack of standardization in the use of that alphabet, words entering from foreign languages such as English and Persian, and, last but not least, variations in the transliteration of proper names. To quantify these issues, we randomly picked 10 Hindi and 10 Telugu news articles and manually counted the proper names and the words borrowed from English in them. On average, 5.19% of the words in the Hindi documents and 4.8% of the words in the Telugu documents were proper names, while 5.73% of the words in the Hindi documents and 6.9% of the words in the Telugu documents were borrowed from English. Therefore, apart from native Indian language words, we should also be able to handle proper names and English words transliterated into Indian languages, since they form a substantial percentage of words. To give an idea of the data problem, the following words were found on various websites.
अगँरजेी, अगँरजेी, अगँेजी, अगँेजी, अगंरेजी, अगंरेजी, अगंेजी, अगंजेी
अनतरराषर ीय, अनतरराषर ीय, अनतरारिषर य, अनतरारषर ीय, अतंरराषर ीय, अतंरारषर ीय, अतंरराषर ीय, अनतरािषर य, अनतराषर ीय
Empirically, there is a lot of disagreement among website authors about the spelling of words. We found that 65,774 out of 278,529 words had variations, and that these 65,774 words belong to 28,038 words. Therefore about 23.61% of the Indian language words had at least one variant, and a word with variations has about 2.34 spellings on average. The more websites we studied, the greater the disagreement. This phenomenon was observed in other Indian languages as well, such as Telugu, Tamil and Bengali. Given such a large percentage of words, it becomes important to study the characteristics of such spelling variations and to see whether we can model them computationally.
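The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below uses only the three counts quoted above; the variable names and the reading of each count are our assumptions, not the authors' terminology.

```python
# Counts quoted in the text; the variable names are illustrative assumptions.
total_words = 278_529    # words examined
variant_words = 65_774   # words that had spelling variations
base_words = 28_038      # words to which those variants belong

# Share of words that have at least one variant spelling.
share_with_variants = variant_words / total_words

# Average number of spellings per word that has variants.
avg_spellings_per_word = variant_words / base_words

print(f"{share_with_variants:.2%}")
print(f"{avg_spellings_per_word:.2f}")
```

The first ratio reproduces the 23.61% figure. The second comes out slightly above 2.34, which agrees with the reported "about 2.34" up to rounding.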
We propose two solutions to this problem and compare them. One solution is to devise a set of language-specific rules that handle such variations, which could give more precise results. However, such a solution does not scale to new languages, since a separate program has to be written for each Indian language. The other solution is to use approximate string matching algorithms. Such algorithms are easily extensible to other languages but may not perform as well as language-specific rules in terms of precision.
2 Rule based algorithm
In this section we discuss an algorithm that uses a set of language-specific rules, taking Hindi as an example. The algorithm normalizes words by mapping the alphabet L of the given language onto another alphabet L', where L' ⊂ L. Before discussing the actual rules, we introduce the chandrabindu, bindu, nukta, halanth, maatra and chandra of the Hindi alphabet, which the rules refer to. A chandrabindu is a half-moon with a dot, which has the function of vowel nasalization. A bindu (also called anusvar) is a dot written on top of consonants, which achieves consonant nasalization. A nukta is a dot under a consonant, which produces sounds mostly used in words of Persian and Arabic origin. A halanth is a consonant reducer: it suppresses the inherent vowel of the consonant. A maatra is a vowel character that occurs in combination with a consonant. A chandra is a special character that achieves vowel rounding, such as the sound of 'o' in the word 'documentary'. The following rules are applied to words before two words are compared, in order to achieve normalization.
if found | map to | Examples
chandrabindu | bindu | अगँजे, अगंजे
consonant + nukta | corresponding consonant | अगंजे, अगंेज
consonant + halanth | corresponding consonant | अगँरेज, अगँजे
longer vowel maatra | equivalent shorter vowel maatra | अनतरारिषर य, अनतरारषर ीय
character + chandra | corresponding character | डॉकयमुटेर ी, डाकयमुटेर ी
Table 1: Rules applied to achieve normalization in Hindi.
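The rules of Table 1 can be sketched as a character-level mapping over Unicode Devanagari code points. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `normalize_hindi` is hypothetical, the long-to-short maatra rule is limited to the i and u vowel signs, and the chandra rule is read as mapping the "candra o" signs to their plain "aa" counterparts, following the example in the table.

```python
import unicodedata

CHANDRABINDU = "\u0901"
BINDU = "\u0902"       # anusvar
NUKTA = "\u093C"
HALANTH = "\u094D"

# Assumed reading of Table 1 as a translation table (None deletes the sign).
_RULES = str.maketrans({
    CHANDRABINDU: BINDU,   # chandrabindu -> bindu
    NUKTA: None,           # consonant + nukta -> consonant
    HALANTH: None,         # consonant + halanth -> consonant
    "\u0940": "\u093F",    # long ii maatra -> short i maatra
    "\u0942": "\u0941",    # long uu maatra -> short u maatra
    "\u0949": "\u093E",    # maatra with chandra (candra o) -> aa maatra
    "\u0911": "\u0906",    # independent vowel with chandra -> aa
})


def normalize_hindi(word: str) -> str:
    """Map a Hindi word onto the reduced alphabet L' before comparison."""
    # NFD splits precomposed nukta consonants (e.g. U+095B) into
    # base consonant + nukta, so the nukta rule sees them as well.
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.translate(_RULES)


# Two spellings of the Hindi word for "English", written with escapes.
variant_a = "\u0905\u0901\u0917\u094D\u0930\u0947\u091C\u093C\u0940"
variant_b = "\u0905\u0902\u0917\u0930\u0947\u091C\u0940"
print(normalize_hindi(variant_a) == normalize_hindi(variant_b))
```

Two words are then treated as the same entry when their normalized forms are equal, so the comparison itself stays an exact string match.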
In addition to these basic rules, we also tried using unaspirated consonants in place of their respective aspirated ones. We found that this operation yielded little gain in recall and lowered precision. Therefore we dropped this feature from our algorithm.
3 Approximate string matching algorithms
We used a set of approximate string matching algorithms from the secondstring project (found at http://secondstring.sourceforge.net) to evaluate to what extent they help solve the problem of normalizing Indian language words. We briefly discuss each of these algorithms in this section before proceeding to the experimental results. Approximate string matching algorithms decide whether two given strings are equal by using a distance function between the two strings. Distance functions map a pair of strings $s$ and $t$ to a real number $r$, where a smaller value of $r$ indicates greater similarity between $s$ and $t$. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural.

One important class of distance functions are edit distances, in which the distance is the cost of the best sequence of edit operations that converts $s$ to $t$. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost. We consider two edit-distance functions. The simple Levenshtein distance assigns a unit cost to all edit operations. As an example of a more complex, well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, scaled to the interval [0,1].

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings $s = a_1 \ldots a_K$ and $t = b_1 \ldots b_L$, define a character $a_i$ in $s$ to be common with $t$ if there is a $b_j = a_i$ in $t$ such that $i - H \le j \le i + H$, where $H = \min(|s|, |t|)/2$. Let $s' = a'_1 \ldots a'_{K'}$ be the characters in $s$ which are common with $t$ (in the same order they appear in $s$) and let $t' = b'_1 \ldots b'_{L'}$ be analogous; now define a transposition for $s', t'$ to be a position $i$ such that $a'_i \ne b'_i$. Let $T_{s',t'}$ be half the number of transpositions for $s'$ and $t'$. The Jaro similarity metric for $s$ and $t$ is

$$\mathrm{Jaro}(s, t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|}\right)$$

where $|s'|$ and $|t'|$ are the numbers of common characters in $s$ and $t$ respectively.

A variant of this metric due to Winkler (1999) also uses the length $P$ of the longest common prefix of $s$ and $t$. Letting $P' = \min(P, 4)$, we define

$$\text{Jaro-Winkler}(s, t) = \mathrm{Jaro}(s, t) + \frac{P'}{10}\,\bigl(1 - \mathrm{Jaro}(s, t)\bigr)$$
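The unit-cost edit distance and the Jaro family described in this section can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the secondstring code: the window H is taken as the integer part of min(|s|,|t|)/2, each character of t is allowed to match at most once, and the prefix bonus is capped at four characters as in Winkler's usual formulation.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; larger means more similar."""
    if not s or not t:
        return 0.0
    h = min(len(s), len(t)) // 2  # assumption: integer window size
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    for i, a in enumerate(s):
        # Look for an unused, equal character of t inside the window.
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not t_hit[j] and t[j] == a:
                s_hit[i] = t_hit[j] = True
                break
    s_common = [c for c, hit in zip(s, s_hit) if hit]
    t_common = [c for c, hit in zip(t, t_hit) if hit]
    m = len(s_common)
    if m == 0:
        return 0.0
    # T is half the number of positions where the common sequences differ.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3


def jaro_winkler(s: str, t: str) -> float:
    """Jaro similarity boosted by the common prefix (capped at 4)."""
    prefix = 0
    for a, b in zip(s, t):
        if a != b:
            break
        prefix += 1
    base = jaro(s, t)
    return base + (min(prefix, 4) / 10) * (1 - base)
```

Note that `levenshtein` returns a distance while `jaro` and `jaro_winkler` return similarities, so a shared thresholding step would first convert one into the other, for example by using one minus the similarity as a distance.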
4 Experiments
We picked 350 words that have spelling variations from the total set of words in the web search engine index. We selected these words in such a way that the frequency of each of the variations is above a threshold value. We define the experimental task as identifying 'matching words' from the list of given words. A word pair is said to be a matching pair if both words refer to the same entity. With the words thus pre-classified into clusters, we applied various approximate string matching algorithms from the secondstring project along with our own language-specific rules. Since most approximate string matching algorithms depend on a distance threshold, for an arbitrary distance threshold θ we predict "same entity" for all words A, B such that dist(A,B) < θ, where dist is the distance function. Otherwise we predict the two words A, B to be "different". We then create plots as shown below by varying θ from −∞ to +∞.
Figure 1: Comparative analysis of various approximate string matching algorithms, with Recall on the x-axis and Precision on the y-axis.
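The threshold sweep behind such a precision-recall plot can be sketched as follows. This is a toy illustration, not the experimental setup of the paper: the helper name `precision_recall_curve`, the four English words and their cluster labels are hypothetical, the distance is a stand-in built on Python's `difflib`, and precision is taken to be 1.0 by convention when no pair is predicted.

```python
from difflib import SequenceMatcher
from itertools import combinations


def precision_recall_curve(words, cluster_of, dist, thresholds):
    """Return (theta, precision, recall) for each distance threshold."""
    pairs = list(combinations(words, 2))
    # Gold standard: pairs of words that were placed in the same cluster.
    gold = {p for p in pairs if cluster_of[p[0]] == cluster_of[p[1]]}
    curve = []
    for theta in thresholds:
        # Predict "same entity" whenever dist(A, B) < theta.
        predicted = {p for p in pairs if dist(*p) < theta}
        hits = len(predicted & gold)
        precision = hits / len(predicted) if predicted else 1.0
        recall = hits / len(gold) if gold else 1.0
        curve.append((theta, precision, recall))
    return curve


def toy_distance(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]; any distance function fits here."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Hypothetical pre-classified words: cluster ids mark the same entity.
clusters = {"color": 0, "colour": 0, "center": 1, "centre": 1}
thetas = [float("-inf"), 0.1, 0.2, 0.4, 0.8, float("inf")]
for theta, precision, recall in precision_recall_curve(
        list(clusters), clusters, toy_distance, thetas):
    print(theta, round(precision, 2), round(recall, 2))
```

At θ = −∞ no pair is predicted, so recall is zero; at θ = +∞ every pair is predicted, so recall is one and precision falls to the share of gold pairs among all pairs. Plotting the intermediate points gives one curve per algorithm.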
As shown in Figure 1, the Indian Language Normalizer algorithm, which is the set of language-specific rules, performs very well in terms of precision when compared to the approximate string matching algorithms. Here we have compared the rules with the character-based Jaccard algorithm, Dirichlet Mixture modeling, and the Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, Needleman-Wunsch and Smith-Waterman algorithms.
References
[Cohen, W. W., Pradeep Ravikumar, Stephen E. Fienberg, 2003]. A Comparison of String Distance Metrics for Name-Matching Tasks. American Association of Artificial Intelligence 2003.
[Durban R, Eddy S R, Krogh A, Mitchison G 1998]. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
[Jaro, M. A. 1989]. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414-420.
[Jaro, M. A. 1995]. Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491-498.
[Monge, A., and Elkan, C. 1996]. The field-matching problem: algorithm and applications. Second International Conference on KDD.
[Monge, A., and Elkan, C. 1997]. An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD 1997 workshop on data mining and knowledge discovery.
[Ristad, E. S., and Yianilos, P. N. 1998]. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522-532.
[Winkler, W. E. 1999]. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04.