The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach

Mohammad Kasra Habib
Stuttgart University, ISTE/Empirical Software Engineering
E-mail: [email protected]

Abstract—Over recent years, many research papers and studies have been published on the development of effective approaches that benefit from the large amount of user-generated content and build intelligent predictive models on top of it. This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content. Unfortunately, there is still inadequate research on exploiting machine learning approaches to classify/cluster Persian text. Further, analyzing Persian text suffers from a lack of resources, specifically datasets and text manipulation tools. Since the syntax and semantics of the Persian language differ from English and other languages, the resources available for those languages are not instantly usable for Persian. In addition, recognition of nouns and pronouns, part-of-speech tagging, finding word boundaries, stemming, and character manipulation for the Persian language are still unsolved issues that require further study. Therefore, efforts have been made in this research to address some of these challenges. The presented approach uses a machine-translated dataset to conduct sentiment analysis for the Persian language. Finally, the dataset is evaluated with different classifiers and feature engineering approaches. The results of the experiments show promising state-of-the-art performance in contrast to the previous efforts; the best classifier was the Support Vector Machine, which achieved a precision of 91.22%, recall of 91.71%, and F1 score of 91.46%.

Index Terms—Machine Learning, User-generated Content, Sentiment Analysis, Feature Engineering, Support Vector Machine (SVM), Logistic Regression (LR), Random Forest Classifier (RND), Linear Discriminant Analysis (LDA), Naive Bayes, K-Means and Ensemble Learning

1 INTRODUCTION

Recently, structured and unstructured user-generated content throughout the internet has increased dramatically. Unstructured data can be easily perceived and analyzed by humans but is very hard for machines to understand. Likewise, extracting what other people think out of the generated content is an important task for decision-making [1] in business, politics, and beyond.

Utilizing and analyzing raw user-generated content and deriving models from it makes it more valuable, as The Guardian reports [2]: "Derivatives of data (from user-generated content), which includes predictive models, or clusters of the population in psychological groupings, can be highly valuable to companies involved in micro-targeting advertisements to voters"; political involvement, however, is beyond the scope of this research.

All types of data (i.e., images, text, or videos) created by the users of a system or service on the internet are called user-generated content [3], [4]. Users speak different languages, and textual content can therefore be generated with different syntax and semantics. A large amount of research has been conducted on analyzing languages such as English [5] and is still actively continuing.

arXiv:2101.08087v1 [cs.CL] 20 Jan 2021


Moreover, this research targets the analysis of Persian user-generated textual content and the challenges that come with it; regrettably, there is an inadequate amount of research on exploiting machine learning approaches for the Persian language. On the other hand, analyzing Persian textual content also suffers from a lack of resources [6], [7], such as datasets and text manipulation tools.

Since the syntax and semantics of Persian differ from languages such as English, recognition of nouns and pronouns, part-of-speech tagging, finding word boundaries, stemming, and character manipulation are also different and remain unsolved issues that require further study.

Therefore, efforts have been made in this research to address the main challenges. The presented approach conducts a case study of sentiment analysis for the Persian language. The results of this empirical approach show promising state-of-the-art performance in contrast to the previous efforts. In particular, this research makes the following contributions:

1) A dataset
2) In-depth identification of the main challenges in applying machine learning to Persian text
3) A state-of-the-art performance for Persian text classification
4) A demonstration of the inadequacy of current tools for preprocessing Persian text
5) Evidence that classical machine learning approaches can offer similar or sometimes even better performance than neural networks in the presence of an ideal dataset

2 RELATED WORKS

Although the applicability of machine learning to natural language processing has been extensively studied for some languages such as English, other languages still lack such studies.

As related work, this study picks the publications most relevant to its case study for comparison, since some efforts have also been devoted to other applications of machine learning to Persian.

Elham et al. [5] question whether they can automatically analyze the sentiment of individual tweets in Persian. Their goal is to determine how the sentiment of individual tweets changes over time with respect to a number of trending political topics. They summarize the challenges of their work in three points:

1) lack of a sentiment lexicon and part-of-speech taggers,
2) frequent use of colloquial words,
3) unique orthography and morphology characteristics.

For this work, they collected over 1 million tweets in the political domain in the Persian language, with an annotated dataset of over 3,000 tweets. They deployed Naive Bayes and Support Vector Machines. Based on their findings, SVM outperformed Naive Bayes, with an average accuracy of 56% and accuracies as high as 70%.

Ehsan Basiri et al. [8] address the problems that come with sentiment analysis and build three new resources: SPerSent, which contains customers' comments from the Web; CNRC, a lexicon corpus; and a new stop-word list.

Finally, they evaluate the resources with Naive Bayes. They conclude that "the performance, with regard to all evaluation measures, are better when CNRC is used as the lexicon for labeling the SPerSent". The best observed precision, recall, and F1 score are 92%, 87%, and 89%, respectively.

An important step for any machine learning related task is feature engineering. The following papers took a step forward to investigate the impact of different feature engineering methodologies for Persian text.


Ayoub Bagheri et al. [9] investigated four feature selection approaches for sentiment classification: Document Frequency, Term Frequency Variance, Mutual Information, and Modified Mutual Information. Next, Naive Bayes is fit to evaluate the features' performance. The highest score is attained with the Modified Mutual Information features. The final precision, recall, and F1 score are 90.72%, 85.26%, and 87.84%, respectively.

Kia Dashtipour et al. [10] propose a novel sentiment analysis framework for the Persian language. Different feature engineering methods and their combinations are evaluated. As a result, the combination of unigrams, bigrams, and trigrams presented the best performance, achieving an accuracy of 88.36%.

Besides traditional machine learning algorithms for natural language processing, one can apply neural networks to Persian text.

Kia Dashtipour et al. [11] exploit deep learning for Persian sentiment classification. They compare a state-of-the-art shallow MLP-based machine learning model with deep autoencoders and deep CNNs. The proposed CNN model presents better performance than the MLP and the autoencoders, with an achieved precision, recall, and F1 score of 84%, 83%, and 83%, respectively.

Behnam Roshanfekr et al. [12] study neural networks for sentiment analysis of Persian text. They conclude that deep learning models outperform the other models, with a precision of 59.1%, recall of 52.2%, and F1 score of 55.4%.

3 THE CHALLENGES OF PROCESSING PERSIAN USER-GENERATED TEXTUAL CONTENT

Before discussing the challenges that come with Persian user-generated textual content, it is better to establish a basic understanding of the language.

Farsi-e-Dari1 is known as Dari in Afghanistan, where it is one of two official languages [14]; as Farsi in Iran, where it is the only official language [15]; as Tajiki in Tajikistan, where it is the only official language [16]; and as Persian in English; all of these names refer to Farsi-e-Dari [17]. Each of these names (Dari, Farsi, or Tajiki) refers to a different accent of Persian, and it is important to note that its vocabulary is shaped by its environment and broader culture. Still, a purer (less influenced) form is spoken by Afghans than by other speakers [17].

This beautiful language belongs to the Indo-European language family [17]. Historically, the extent of this language ranged from the borders of India in the east and Russia in the north to the southern shores of the Persian Gulf, Egypt, and the Mediterranean in the west [17]. Currently, Persian is also understood in parts of Armenia, Azerbaijan, India, Iraq, Kazakhstan, Pakistan, Turkmenistan, Uzbekistan, China, and Turkey [18]. Persian originates from Greater Khorasan [19], of which Afghanistan's major current Persian-speaking territories formed the major portion.

3.1 Challenges in Adopting Tools Built for English and Arabic Language Processing

Over recent years, plenty of text processing tools have been built for English [5]. One might think that exploiting them for Persian would be an advantage. Unfortunately, these tools are not adoptable due to the variance in grammar, syntax, and semantics:

• one can notice the syntax as the biggest difference, e.g., Persian is written right-to-left, whereas English is left-to-right;
• next, parts-of-speech tagging is another;
• and ambiguity in word morphology and character manipulation is another barrier to be considered.

1. Dari means "Darbari" (which in English means the language of the royal court) [13].


On the other hand, applying tools built for Arabic appears to be a good option, since Persian adopts the Arabic character set and adds four more characters to it [20]. However, although both languages might look similar when it comes to writing, they are two different languages, and their syntax and semantics differ.

Moreover, the Persian vocabulary has been influenced by Arabic grammar. For example, words with an Arabic root take irregular plural forms, while Persian uses a suffix to build plural forms [15]. It should also not be forgotten that the Arabic language suffers from a lack of tools and research just like Persian. Besides, this influence does not imply that tools built for Arabic text processing are instantly usable for Persian; rather, such borrowings add to the complexity of the language, which will be discussed in the forthcoming sections.

All in all, the differences between Dari, Arabic, and English make it necessary to engineer or develop new tools from scratch.

3.2 Challenges in Persian Text Processing

The written structure of Persian itself is more complex than that of languages like English. For instance, the appearance of homographic words (words that look alike but have different meanings2) and the use of both irregular (Arabic-derived) and suffix-based plural forms need to be addressed [7], [21].

Moreover, there are many suffixes, prefixes, pronouns, and other parts that can be written separately or connected, all of which are open to further research [15]. However, the research conducted so far on applying machine learning to Persian text is not adequate.

The main challenges in exploiting machine learning models to process Persian text can be summarized as a lack of resources, ambiguities in character manipulation, morphology, identifying word boundaries, and syntax analysis.

2. The word "شانه" —/Shana/ can mean "Shoulder" or "Comb" depending on its usage in the sentence; there are many examples of this condition.

3.2.1 The Challenges of Character Encoding

One can frequently think that Persian is a variant of the Arabic language. It should be explicit that Persian and Arabic are two distinct languages; they even belong to different language families. Nevertheless, they naturally share some similarities because Persian adopted the Arabic alphabet.

Fundamentally, computers deal with numbers. They store letters and other characters by assigning a number to each of them [22]. Before the invention of Unicode, there were hundreds of such systems [22], and finding a unique way to represent information was relatively difficult [23]. The introduction of Unicode is an effort toward software internationalization, especially on the Web. The Unicode system is designed to assign one unique code to each character, even if the character is used in multiple languages [24]. Persian script is written with Arabic characters (the 〈U+0600—U+06FF〉 block) plus some extra and modified characters [22], [24]. The current Unicode framework for Persian is insufficient [25].

It is important to know that the design principle in Unicode is that the encoded units are characters, not glyphs [24]. In Persian or Arabic, a character can take four different shapes (glyphs) depending on its position in the sequence (Table 1).

TABLE 1: Persian character shapes

It is noteworthy that for each of the four visual forms (glyphs) of a character there is only one single code. Therefore, an algorithm is in charge of handling the four visual forms of a character in a sequence [26], [27]. This algorithm attaches special characters such as the Zero Width Joiner (ZWJ), the Zero Width Non-Joiner (ZWNJ), and the Right-to-Left Override (RLO) [25].

Take ZWNJ for instance. Using it after a code means that the character before the ZWNJ must appear in one of its final forms (glyphs), while a character after the ZWNJ is forced to appear in one of its initial forms (glyphs); similarly, characters after an RLO should be rendered as strong right-to-left characters [23]. There are also other standards proposed for character representation, e.g., ISIRI 6219:2002 (common in Iran). Despite these standards, Persian keyboard layouts use different codes, and many Persian users do not use the same encoding standards [23]. In addition, using different encodings paves the way for more challenges.

Furthermore, if one is asked to write the plural form of "بخش" (which means "Section" in English), the suffix "ها" —/ha/ will be added at the end of the word. Therefore, the plural form is represented differently depending on the standard used, as follows:

(1) "بخش ها" = "بخش" + SPACE + "ها"
(2) "بخش‌ها" = "بخش" + ZWNJ + "ها"
(3) "بخشها" = "بخش" + "ها"

Moreover, such examples in a corpus affect the measured precision and recall for classification or clustering, because the same feature can be represented as two or more different vectors when word frequencies are calculated [23].
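As a small illustration of this effect, the Python sketch below (illustrative only; the three strings are the renderings of the plural of "بخش" discussed above, written with their Unicode escapes) shows how a simple frequency-based vectorizer treats the three renderings as unrelated features:

    from sklearn.feature_extraction.text import CountVectorizer

    # Three renderings of the plural of "bakhsh" (Section): space, ZWNJ, attached.
    with_space = "\u0628\u062e\u0634 \u0647\u0627"
    with_zwnj = "\u0628\u062e\u0634\u200c\u0647\u0627"
    attached = "\u0628\u062e\u0634\u0647\u0627"

    vectorizer = CountVectorizer(token_pattern=r"[^\s]+")
    vectorizer.fit([with_space, with_zwnj, attached])

    # The same word ends up as several unrelated vocabulary entries, so its
    # frequency mass is split across different feature dimensions.
    print(vectorizer.get_feature_names_out())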

3.2.2 Ambiguities in Character Manipulation

Starting Persian text analytics means tackling a lot of challenges. It is quite possible to input Arabic characters instead of the standard Persian ones [28]. A common mistake is using the Arabic "ي" instead of the Persian "ی", or the Arabic "ك" instead of the Persian "ک". This mix-up causes problems when one looks up words in dictionaries or calculates word frequencies, because the strings are encoded differently. Likewise, if one inputs such a combination of mixed Unicode characters into the Google search engine, one can end up with different results, since pages are ranked based on different words (the same-looking word with different underlying Unicode code points is treated as a different word) [23].

Another source of the same problem is short vowels. A short vowel in Persian transcriptions never appears alone [15]. If one is used, it is coded independently [23], which can also raise the problem of identically appearing words with different Unicode representations.

Moreover, another problem of this kind can happen with the TATWEEL character, a visual character which lets Persian and Arabic words appear in different widths [29]. To tackle this challenge, [23] proposes building a standardized procedure, such as using a mapping between Persian and Arabic characters.
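A minimal sketch of such a normalization step is shown below. The mapping covers only the Yeh, Kaf, and TATWEEL cases mentioned in this subsection; a real normalizer would need a much larger table:

    # Map commonly mixed-in Arabic code points to their Persian counterparts
    # and strip the purely visual TATWEEL character.
    ARABIC_TO_PERSIAN = {
        "\u064a": "\u06cc",  # Arabic Yeh -> Persian Yeh
        "\u0643": "\u06a9",  # Arabic Kaf -> Persian Keheh
        "\u0640": "",        # TATWEEL (kashida) carries no lexical meaning
    }

    def normalize_chars(text):
        return text.translate(str.maketrans(ARABIC_TO_PERSIAN))

    # Words typed with Arabic Yeh/Kaf now hash to the same string as the
    # standard Persian spelling, so look-ups and frequency counts agree.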

3.2.3 Ambiguity at Words’ BoundaryTokenization as part of preprocessing fortext classification or clustering directly effectson the performance of machine learningalgorithm. To convert documents into tokensone should simply find the word boundaries.

Tokenization of Persian documents is challenging due to the different usage of delimiters; for example, Persian compound and light words are written in delimited form with ZWNJ. Alternatively, one can use the space character to form these words, a convention which is not respected by users, not even by official organizations [23]. Hence, defining the space or ZWNJ as the boundary is not adequate for tokenization.
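To make the boundary problem concrete, the short sketch below (hypothetical input strings; the same compound is assumed to be written once with ZWNJ and once with a plain space) shows how a naive whitespace tokenizer produces different token counts for the same word:

    import re

    # The same compound written once with ZWNJ (U+200C) and once with a space.
    compound_zwnj = "\u0628\u062e\u0634\u200c\u0647\u0627"
    compound_space = "\u0628\u062e\u0634 \u0647\u0627"

    def naive_tokenize(text):
        # ZWNJ is not whitespace, so splitting on whitespace keeps it inside a token.
        return re.split(r"\s+", text.strip())

    print(len(naive_tokenize(compound_zwnj)))   # 1 token
    print(len(naive_tokenize(compound_space)))  # 2 tokens for the same word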

Basically, one can think of using the four visual forms (beginning, middle, end, stand-alone) as word boundaries, where the final form is a strong indication of the end of a word. However, [30], [31] show that this technique is not applicable with the Unicode system.

3.2.4 Ambiguity in Morphology

Morphological ambiguities can arise for two reasons [23]: (1) homograph words, and (2) word boundaries.

Take (1) for instance. The word "مهر" can have different pronunciations and meanings depending on short vowels that do not appear in the written text: with one set of short vowels the word "مهر" means Love, with another it means Seal, and it is also used to indicate Mahr3.

For (2), remember the example from Section 3.2.1, where one word can be treated as 3 (or even more) different words due to Unicode mix-ups.

Therefore, similar problems can arise when lexical elements such as prepositions, postpositions, or conjunctions appear separately or attached [23], [35]. A solution to this challenge would be to follow the official Persian orthography, which recommends writing them separately, but this is hard to guarantee. Therefore, a more promising solution is to build a text normalizer.

3.2.5 Ambiguity While Detecting Proper Nouns

The importance of part-of-speech tagging in text analytics is obvious. However, when tagging nouns and pronouns, Arabic script does not enjoy capitalization as English does, and this characteristic is inherited by Persian through its adoption of the Arabic characters.

To solve this challenge, [36] offers some heuristics to distinguish proper nouns from common nouns.

3. In Islam, Mahr is an arbitrary payment, in the form of money or possessions, paid by the groom to the bride at the time of marriage to appreciate her [32], [33], [34].

3.2.6 Ambiguity in Syntax Analysis

Another important ambiguity arises when one wants to construct possessives [23]. To construct possessives, a short vowel /e/ is used, which does not appear in the written text. [23] recommends adding this short vowel in written scripts; however, as discussed in subsection 3.2.2, adding this short vowel can cause challenges for character manipulation.

In essence, the problems that may occur in Persian text analytics can be summarized as inconsistency in character representation and a special orthography.

Furthermore, if one does not consider the mentioned challenges while applying machine learning to Persian transcripts, it is quite possible to obtain unsatisfactory results that are far from the expected performance. To remove the ambiguities, [7] proposes using a combination of orthography rules and the standards defined in [23]. Handling the challenges inherent in Persian scripts is the major obstacle of this research.

3.2.7 FEnglish

Another challenge in applying machine learning to Persian transcripts is that one can use the English alphabet to write the pronunciation of Persian words. This style of writing is called FEnglish and has become very common on social networking platforms. There are some tools4 to convert these written pronunciations from the English to the Persian alphabet. This conversion is insufficient for textual analysis since it is just a mapping between English and Persian characters. For example, two or more characters can generate almost the same phoneme (e.g., "e" and "a") and can be used to write the same Persian word. Since FEnglish is not an official writing standard, the spelling of a word can differ from one user to another.

4. The following links are tools for mapping FEnglish to Persian text: http://www.dictionary-farsi.com/pinglish.asp, http://syavash.com/portal/pinglish2farsi/convertor-en, https://lingojam.com/FarsitoFinglish


Therefore, this can lead to a high-dimensional feature space. This is still considered an open challenge and needs further attention.
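As a small illustration of this ambiguity (the character map and spellings below are hypothetical, not taken from the tools cited above), two FEnglish spellings of the same word map to different character sequences and therefore different features:

    # Hypothetical one-to-one FEnglish-to-Persian character map, in the spirit
    # of the simple conversion tools referenced above.
    FENGLISH_MAP = {
        "s": "\u0633", "l": "\u0644", "m": "\u0645",
        "a": "\u0627", "e": "\u0647",  # "a" and "e" can stand for similar phonemes
    }

    def fenglish_to_persian(word):
        return "".join(FENGLISH_MAP.get(ch, ch) for ch in word)

    # Two plausible FEnglish spellings of the same word yield different strings,
    # so a bag-of-words model counts them as two distinct features.
    print(fenglish_to_persian("salam") == fenglish_to_persian("salaam"))  # False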

From the aforementioned sections, it is now clear that Persian has a complex orthographic structure. This complexity adds further challenges on top of those generally present in user-generated content on social networking platforms. Therefore, to achieve better performance, one must consider these challenges.

4 EXPERIMENTS AND RESULTS

This research uses a machine-translated dataset, i.e., the observations were originally written in English, each being a movie review from Rotten Tomatoes. The dataset was originally collected by Pang and Lee [37] for their work on sentiment treebanks, and was likewise utilized by Socher et al. [38], who used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This research considers only the positive and negative observations. Therefore, the dataset shrinks to 16,278 instances, of which 3,256 are kept for testing.

4.1 Preprocessing and Feature Engineering

Each instance from the culled dataset undergoes a four-step preprocessing: unnecessary characters are removed; the text is normalized and stemmed; TF-IDF is applied to re-weight tokens so that very frequent ones do not dominate; and PCA is used for dimensionality reduction, keeping 0.99 of the explained variance ratio, to speed up the training process.
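A minimal sketch of the cleaning step, assuming the Hazm toolkit that Section 5 reports was used for stemming also handles normalization and tokenization (the regular expression for "unnecessary characters" is only an example):

    import re
    from hazm import Normalizer, Stemmer, word_tokenize

    normalizer = Normalizer()
    stemmer = Stemmer()

    def preprocess(text):
        # 1) drop characters outside the Arabic/Persian block (example pattern only)
        text = re.sub(r"[^\u0600-\u06FF\u200c\s]", " ", text)
        # 2) normalize encoding, spacing and ZWNJ usage
        text = normalizer.normalize(text)
        # 3) stem each token and re-join
        return " ".join(stemmer.stem(tok) for tok in word_tokenize(text))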

Once the preprocessing step is complete, feature vectors are built based on word n-gram (where n = 1, 2, and 3) and character n-gram (n = 1) feature extraction methods. Word n-gram models are known for preserving the order in which words appear in a document, and if one needs to capture deeper structure, i.e., morphological makeup [39], [40], [41], one should go with a character n-gram model.
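A sketch of how these feature sets could be produced with scikit-learn (the parameter values beyond the n-gram ranges and the 0.99 variance ratio are assumptions):

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizers = {
        "word_1gram": TfidfVectorizer(analyzer="word", ngram_range=(1, 1)),
        "word_2gram": TfidfVectorizer(analyzer="word", ngram_range=(2, 2)),
        "word_3gram": TfidfVectorizer(analyzer="word", ngram_range=(3, 3)),
        "char_1gram": TfidfVectorizer(analyzer="char", ngram_range=(1, 1)),
    }

    def build_features(texts, vectorizer):
        # TF-IDF weighting, then PCA keeping 99% of the explained variance.
        tfidf = vectorizer.fit_transform(texts).toarray()
        return PCA(n_components=0.99).fit_transform(tfidf)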

Subsequently, Logistic Regression (LG) is fit to obtain a baseline. It can be inferred that the model trained with unigram features (word n-gram = 1) has a marginally higher performance than the rest (Table 2).

TABLE 2: Logistic Regression's performance measurement on different feature extraction methods

              Precision (%)  Recall (%)  F1 Score (%)
Word 1-gram   88.12          90.69       89.39
Word 2-gram   80.46          94.63       86.98
Word 3-gram   70.51          98.45       82.17
Char 1-gram   64.87          76.41       70.17

Since the model’s score comes from trainingset they hold suspect to be likely to overfit-ting. Additionally, unigram and bigram fea-tures have approximately similar performance.To make sure which feature set to select learn-ing curves for all features are plotted (Fig. 1).

Fig. 1: Learning Curves

From Fig. 1, it is easy to see that the word bigram and trigram features have high variance (overfitting): there is a gap between the training and validation curves, meaning the models perform considerably better on the training set than on the validation set. The model with character unigrams is prone to underfitting. Therefore, the winner is the word unigram model.
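The learning-curve diagnosis in Fig. 1 can be reproduced along the following lines (a sketch; the cross-validation setting and scoring metric are assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    def learning_curve_points(X, y):
        sizes, train_scores, val_scores = learning_curve(
            LogisticRegression(max_iter=1000), X, y,
            train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="f1")
        # A persistent gap between the two mean curves signals high variance
        # (overfitting); two low, converging curves signal underfitting.
        return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)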


Eventually, a Random Forest Classifier is trained to select features; each feature is selected based on its mean weight, where each node's weight is equal to the number of training samples associated with it [42]. It transpires that the feature selection step does not improve the performance, so it is dropped from the preprocessing pipeline.
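A sketch of this feature-selection step, assuming scikit-learn's SelectFromModel with the mean importance as the threshold (the paper describes the mechanism only at this level of detail):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Keep only features whose importance is at least the mean importance.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="mean")

    def select_features(X_train, y_train, X_test):
        X_train_sel = selector.fit_transform(X_train, y_train)
        return X_train_sel, selector.transform(X_test)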

Finally, KMeans is fitted as a preprocessing step to extract new features. To determine the best number of clusters, a Bayesian Gaussian Mixture model was used, which resulted in 37 clusters. Three new feature sets are built:

1) Distances: instances replaced with their distances to these 37 clusters;
2) Centers: instances replaced with their cluster centers;
3) Combined: the combination of the two previous feature sets and the word unigram features (Table 3).

TABLE 3: Logistic Regression's performance measurement on extracted features

                 Precision (%)  Recall (%)  F1 Score (%)
Distances        68.54          86.98       76.67
Cluster Centers  64.82          86.71       74.19
Combined         87.77          90.58       89.15

From Table 3, it can be concluded that the new features, and even their combination with the word unigrams, did not improve the performance, which is not surprising. Therefore, this research stays with the word unigram features.
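A minimal sketch of how the three cluster-based feature sets could be built with scikit-learn (the number of mixture components and the weight threshold are assumptions; the paper only reports that the procedure yielded 37 clusters):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import BayesianGaussianMixture

    def cluster_feature_sets(X, X_unigram):
        # A Bayesian Gaussian Mixture hints at how many components are really used;
        # for the data in this study it is reported to suggest 37 clusters.
        bgm = BayesianGaussianMixture(n_components=50, random_state=0).fit(X)
        n_clusters = int(np.sum(bgm.weights_ > 0.01))

        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
        distances = kmeans.transform(X)                        # 1) distances to centroids
        centers = kmeans.cluster_centers_[kmeans.labels_]      # 2) each instance's center
        combined = np.hstack([distances, centers, X_unigram])  # 3) plus word unigrams
        return distances, centers, combined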

4.2 Classifiers’ Performance StudyFrom the previous section, Logistic Regressionwith word unigram features showed a betterperformance. Hereabouts, this research willadvance with training more models and fine-tuning the models’ hyperparameters.

The chosen models are Logistic Regression (LG), SVM with a Stochastic Gradient Descent implementation (SGD SVM), Random Forest Classifier (RND), Linear Discriminant Analysis (LDA), and Multinomial Naive Bayes (MNB), which is suitable for classification with discrete features [42].

Once classifiers’ hyperparameter is fine-tuned with 10 fold of cross-validation (tosave time, LDA was applied 3 fold cross-validation), it resembles that SVM (with SGDimplementation and l2 regularization) shown apromising performance (Table 4).

Consequently, since positive instances are not scarce in the dataset and a balanced trade-off between precision and recall is required, this study presents ROC curves instead of precision-recall curves.

Fig. 2: Classifiers’ ROC Curves

Eventually, this research attempts ensemble learning; Soft Voting, Pasting, and AdaBoost classification models are studied.

First, all the previous algorithms are included in the Voting Classifier, except that LDA is replaced by a Gaussian Naive Bayes (GNB), since LDA was considered to hurt the performance. Second, an ensemble of Voting Classifiers is fitted; each is trained on 200 instances randomly sampled from the training set without replacement (Pasting). Finally, 100 Decision Tree Classifiers (with max depth = 1) are trained to perform AdaBoost (Table 5).
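The three ensembles could be set up roughly as follows (a sketch; the individual estimators' hyperparameters are assumed to be the tuned ones from above, and the SGD member is given a probabilistic loss here only because soft voting requires probability estimates):

    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier, VotingClassifier)
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import GaussianNB, MultinomialNB
    from sklearn.tree import DecisionTreeClassifier

    # 1) Soft voting over the previous classifiers, with GNB in place of LDA.
    voting = VotingClassifier(
        estimators=[("lg", LogisticRegression(max_iter=1000)),
                    ("svm", SGDClassifier(loss="log_loss")),  # probabilistic output
                    ("rnd", RandomForestClassifier()),
                    ("gnb", GaussianNB()),
                    ("mnb", MultinomialNB())],
        voting="soft")

    # 2) Pasting: many voting classifiers, each trained on 200 instances
    #    sampled from the training set without replacement.
    pasting = BaggingClassifier(voting, n_estimators=50,
                                max_samples=200, bootstrap=False)

    # 3) AdaBoost over 100 decision stumps (max depth = 1).
    adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                  n_estimators=100)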

Usually, training ensemble models provides better performance than training a single classifier. Nevertheless, even with ensemble learning, better performance is not always guaranteed.

TABLE 4: Classifiers' performance measurement

      Train Score                                            Test Score
      Precision (%)  Recall (%)  F1 Score (%)  ROC AUC (%)   Precision (%)  Recall (%)  F1 Score (%)  ROC AUC (%)
SVM   90.01          90.49       90.25         95.30         91.22          91.71       91.46         95.69
LG    88.12          90.69       89.39         94.68         89.06          92.20       90.60         95.11
RND   86.79          83.76       85.25         91.34         89.09          84.45       86.71         92.64
LDA   83.46          86.79       85.09         87.82         86.70          90.09       88.36         91.72
MNB   84.76          93.85       89.07         95.11         86.51          94.47       90.32         95.62

From Fig. 2, it is evident that SVM, LG, and MNB are the strongest classifiers and offer comparable performance. Thus, they dominate the Voting Classifier's decisions.

TABLE 5: Ensemble Learning's performance comparison

      Precision (%)    Recall (%)       F1 Score (%)     ROC AUC (%)
      Train   Test     Train   Test     Train   Test     Train   Test
VTN   87.58   89.29    93.59   94.42    90.48   91.79    95.87   96.47
PAS   73.12   72.00    97.08   96.97    83.42   82.64    92.69   92.10
ADB   74.55   72.40    90.54   89.82    81.77   80.17    87.61   85.22

Moreover, among the applied ensemble classifiers, the Voting Classifier's scores are comparable with those of the SVM (Fig. 3).

Fig. 3: ROC curves for SVM and Ensemble Learning

One can trade off precision against recall for the Voting Classifier to achieve scores nearly identical to those of the SVM (Fig. 3). Setting the decision threshold so that recall remains at 0.91 boosts the Voting Classifier's precision to 90%.
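This threshold adjustment can be sketched as follows (assuming the soft-voting classifier exposes predict_proba; the 0.91 recall target is the value quoted above):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def precision_at_recall(model, X_val, y_val, target_recall=0.91):
        scores = model.predict_proba(X_val)[:, 1]
        precision, recall, thresholds = precision_recall_curve(y_val, scores)
        # recall is non-increasing along thresholds; take the largest threshold
        # that still meets the recall target and report the precision there.
        keep = np.where(recall[:-1] >= target_recall)[0]
        idx = keep[-1]
        return precision[idx], thresholds[idx]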

Yet this is not enough; therefore, supplementary ROC curves and AUC values (with respect to the new threshold) are calculated for this classifier to make sure the model functions as well as the SVM.

Fig. 4: SVM and Tweaked Voting Classifier’s ROC curves

From Fig. 4, it is apparent that the curve has been engineered and the tweaked model is not as strong as the SVM. Moreover, weighing its performance against its complexity, it does not deserve to replace the SVM.

This research studied different feature engineering methods, classification algorithms, and ensemble learning. To sum up, the SVM with word unigram features outperformed the rest, with a balanced precision, recall, and F1 score of about 90% on the training set and 91% on the test set.

5 EVALUATION AND FUTURE WORK

The results from Section 4 show promising state-of-the-art performance in contrast to the previous efforts [5], [8], [9], [10], [11], [12] (Table 6).

TABLE 6: Performance measurement comparison among this work and the related works

References              Model     Precision (%)  Recall (%)  F1 Score (%)
This Study              SGD SVM   91.22          91.71       91.46
E. Basiri et al. [8]    NB5       92.00          87.00       89.00
A. Bagheri et al. [9]   NB        90.72          85.26       87.84
Kia D. et al. [11]      CNN6      84.00          83.00       83.00
Behnam R. et al. [12]   NN7       59.10          52.20       55.40

E. Basiri et al. [8] report a precision of 0.92, a slight difference of 0.0078 compared to this work. On the other hand, they report a lower recall of 0.87 (i.e., 13% of positive instances are not detected), compared to 0.91 in this work. A convenient way to compare two classifiers is to combine their precision and recall in one metric, the F1 score (their harmonic mean). Looking at the F1 scores of both studies, it is clear that this research outperforms E. Basiri et al. [8] by 2.46% while trading off precision against recall.
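For reference, the F1 score is the harmonic mean of the two: F1 = 2 * Precision * Recall / (Precision + Recall). With the 0.92 precision and 0.87 recall reported in [8], this gives roughly 0.89, against 0.9146 in this work.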

In contrast, Elham et al. [5] and Kia Dashtipour et al. [10] provide only an accuracy score instead of precision, recall, and F1 score, which is not the preferred evaluation metric here. Still, this work exhibits a higher accuracy (93%) than those efforts.

Finally, this research assumes that better performance can be reached if the following recommendations are satisfied:

1) The existence of an ideal dataset rather than this machine-translated one; mistakes in translation were observed in the dataset, as machine translation itself requires further study [43].

2) The existence of sophisticated tools for preprocessing; this research applied Hazm for stemming. Although Hazm is the state-of-the-art preprocessing tool for the Persian language, it needs further improvement. For instance, the word "آرام" (which in English means "Quiet") was wrongly stemmed to "آرا" (which in English means "Vote"), and the word "بی‌نظیر" was mis-tokenized into two separate tokens.

5. Naive Bayes
6. Convolutional Neural Networks
7. Neural Networks

From the conducted research it is clear that handling Persian text is a challenging task. Potential future research is to work on building rich resources and efficient preprocessing tools. The current solutions are based on traditional machine learning approaches; conducting sentiment classification of Persian text with deep learning will also be interesting future work.

6 CONCLUSION

This work started by addressing the open challenges of applying machine learning to Persian user-generated textual content. Though there is plenty of support for English, adapting those resources is unfortunately not a solution due to the complexity of the syntactic and semantic structure of the Persian language. Several efforts have been made to develop preprocessing tools and to employ machine learning to classify Persian sentiment, but they are not adequate; therefore, in-depth studies are needed.

First, this study applied a four-step preprocessing, and feature vectors were constructed based on word and character n-gram techniques. Later, Random Forests were utilized for feature selection, and KMeans was applied to create three new feature sets. From the results, it was apparent that word unigram features without feature selection outperformed the rest. Second, five classifiers (Support Vector Machines, Logistic Regression, Random Forest Classifier, Linear Discriminant Analysis, and Multinomial Naive Bayes) and three ensemble learning methods (Voting, Pasting, and AdaBoost) were trained and evaluated.


Finally, word unigram features and the SVM with the stochastic gradient descent implementation outperformed the rest, with an achieved precision, recall, and F1 score of about 90% on the training set and 91% on the test set.

REFERENCES

[1] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Found. Trends Inf. Retr., vol. 2, no. 1-2, pp. 1–135, Jan. 2008. [Online]. Available: http://dx.doi.org/10.1561/1500000011

[2] P. Lewis, D. Pegg, and A. Hern, "Cambridge analytica kept facebook data models through us election," 2018. [Online]. Available: https://www.theguardian.com/uk-news/2018/may/06/cambridge-analytica-kept-facebook-data-models-through-us-election

[3] M.-F. Moens, J. Li, and T.-S. Chua, Mining user generated content. CRC Press, 2014.

[4] R.-H. Chen and S.-C. Chang, "Modeling content and membership growth dynamics of user-generated content sharing networks with two case studies," IEEE Access, vol. 6, pp. 4779–4796, 2018.

[5] E. Vaziripour, C. G. Giraud-Carrier, and D. Zappala, "Analyzing the political sentiment of tweets in farsi," in ICWSM, 2016, pp. 699–702.

[6] B. Sarrafzadeh, N. Yakovets, N. Cercone, and A. An, "Cross-lingual word sense disambiguation for languages with scarce resources," in Canadian Conference on Artificial Intelligence. Springer, 2011, pp. 347–358.

[7] M. Shamsfard, "Challenges and open problems in persian text processing," Proceedings of LTC, vol. 11, 2011.

[8] M. E. Basiri and A. Kabiri, "Sentence-level sentiment analysis in persian," in 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA). IEEE, 2017, pp. 84–89.

[9] A. Bagheri and M. Saraee, "Persian sentiment analyzer: A framework based on a novel feature selection method," arXiv preprint arXiv:1412.8079, 2014.

[10] K. Dashtipour, M. Gogate, A. Adeel, A. Hussain, A. Alqarafi, and T. Durrani, "A comparative study of persian sentiment analysis based on different feature combinations," in International Conference in Communications, Signal Processing, and Systems. Springer, 2017, pp. 2288–2294.

[11] K. Dashtipour, M. Gogate, A. Adeel, C. Ieracitano, H. Larijani, and A. Hussain, "Exploiting deep learning for persian sentiment analysis," in International Conference on Brain Inspired Cognitive Systems. Springer, 2018, pp. 597–604.

[12] B. Roshanfekr, S. Khadivi, and M. Rahmati, "Sentiment analysis using deep learning on persian texts," in 2017 Iranian Conference on Electrical Engineering (ICEE). IEEE, 2017, pp. 1503–1508.

[13] C. Nolle-Karimi. (2018, jun) No differences between farsi and dari. [Online]. Available: https://derstandard.at/1308680777512/Landessprache-als-Politikum-Keine-Unterschiede-zwischen-Farsi-und-Dari

[14] CIA. (2018, jun) The world fact book. [Online]. Available: https://www.cia.gov/library/publications/the-world-factbook/geos/af.html

[15] M. Zanjani, A. Baraani-Dastjerdi, E. Asgarian, A. Shahriyari, and A. Akhavan Kharazian, "A new experience in persian text clustering using farsnet ontology," vol. 31, pp. 315–330, 01 2015.

[16] CIA. (2018, jun) The world fact book. [Online]. Available: https://www.cia.gov/library/publications/the-world-factbook/geos/ti.html

[17] B. Spooner et al., "Persian, farsi, dari, tajiki: Language names and language policies," Language Policy and Language Conflict in Afghanistan and Its Neighbors: The Changing Politics of Language Choice, pp. 89–120, 2012.

[18] BBC. (2018, jun) A guide to persian. [Online]. Available: http://www.bbc.co.uk/languages/other/persian/guide/facts.shtml

[19] H. W. Alikuzai, From Aryana-Khorasan to Afghanistan: Afghanistan History in 25 Volumes. Trafford Publishing, 2011.

[20] M. H. Shirali-Shahreza and M. Shirali-Shahreza, "Arabic/persian text steganography utilizing similar letters with different codes," The Arabian Journal For Science And Engineering, vol. 35, no. 1b, 2010.

[21] B. Baluch, "Persian orthography and its relation to literacy," Handbook of orthography and literacy, pp. 365–376, 2006.

[22] "Unicode consortium." [Online]. Available: https://unicode.org/

[23] B. QasemiZadeh, S. Rahimi, and M. S. Ghalati, "Challenges in persian electronic text analysis," arXiv preprint arXiv:1404.4740, 2014.

[24] B. Esfahbod, "Persian computing with unicode," in 25th Internationalization and Unicode Conference, Washington, DC, 2004.

[25] B. QasemiZadeh, "Transcription of the persian language in the electronic format."

[26] A. Odeh and K. Elleithy, "Steganography in arabic text using zero width and kashidha letters," International Journal of Computer Science & Information Technology, vol. 4, no. 3, p. 1, 2012.

[27] H. M. S. Alshahrani and G. Weir, "Hybrid arabic text steganography," International Journal of Computer and Information Technology, vol. 6, no. 6, pp. 329–338, 2017.

[28] B. Qasemizadeh, "Farsi e-orthography: An example of e-orthography concept," in Improving Non-English Web Searching (iNEWS07) SIGIR07 Workshop, 2007, pp. 62–64.

[29] R. Ibrahim, Z. Eviatar, and J. Aharon-Peretz, "The characteristics of arabic orthography slow its processing," Neuropsychology, vol. 16, no. 3, p. 322, 2002.

[30] K. Megerdoomian and R. Zajac, Processing Persian text: Tokenization in the Shiraz project. Computing Research Laboratory, New Mexico State University, 2000.

[31] M. Hassel and N. Mazdak, "Farsisum: a persian text summarizer," in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Association for Computational Linguistics, 2004, pp. 82–84.

[32] N. B. Oman, "Bargaining in the shadow of god's law: Islamic mahr contracts and the perils of legal specialization," Wake Forest L. Rev., vol. 45, p. 579, 2010.

[33] L. E. Blenkhorn, "Islamic marriage contracts in american courts: interpreting mahr agreements as prenuptials and their effect on muslim women," S. Cal. L. Rev., vol. 76, p. 189, 2002.

[34] R. Freeland, "The islamic institution of mahr and american law," Gonz. J. Int'l L., vol. 4, p. 31, 2000.

[35] J. R. Perry, "Language reform in turkey and iran," International Journal of Middle East Studies, vol. 17, no. 3, pp. 295–311, 1985.

[36] M. Steinbach, G. Karypis, V. Kumar et al., "A comparison of document clustering techniques," in KDD workshop on text mining, vol. 400, no. 1. Boston, 2000, pp. 525–526.


[37] B. Pang and L. Lee, "Exploiting class relationships for sentiment categorization with respect to rating scales," in Proceedings of ACL'05, 2005.

[38] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.

[39] A. Kulmizev, B. Blankers, J. Bjerva, M. Nissim, G. van Noord, B. Plank, and M. Wieling, "The power of character n-grams in native language identification," in Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 2017, pp. 382–389.

[40] G. W. Lesher, B. J. Moulton, D. J. Higginbotham et al., "Effects of ngram order and training text size on word prediction," in Proceedings of the RESNA'99 Annual Conference. Citeseer, 1999, pp. 52–54.

[41] M. K. Habib, "Machine learning-based text classification and clustering: The challenges of user-generated content," Master's thesis, 09 2018.

[42] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, "API design for machine learning software: experiences from the scikit-learn project," in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.

[43] J. Slocum, "A survey of machine translation: its history, current status, and future prospects," Computational linguistics, vol. 11, no. 1, pp. 1–17, 1985.

