Data Mining final report

Hierarchical detection of words, especially new words, in Chinese

Chengyuan Zhuang
University of North Texas, Denton, TX

[email protected]

Abstract

Word segmentation is the first step for almost all Natural Language Processing (NLP) tasks in Chinese, since there is no white space between Chinese characters. It typically relies on an annotated corpus and performs poorly on out-of-vocabulary (OOV) words (e.g., domain-specific terms, names, and newly coined words). We present a hierarchical statistical method that automatically detects Chinese words, especially new words, and we show that the method is effective without any human labor.

Introduction

Word segmentation is an important preliminary step for many downstream analyses in Chinese NLP. This is because Chinese, unlike English, has no white space between characters to mark clear word boundaries. The quality of word segmentation therefore strongly influences the performance of any further analysis.

Word segmenters are usually trained with machine learning techniques, typically by learning from an annotated corpus. But there is a contradiction here: we want word segmentation to separate all words from their surrounding characters, including out-of-vocabulary (OOV) words (e.g., domain-specific terms, names, and newly coined words); however, such a segmenter effectively requires these OOV words to be in-dictionary words in order to work correctly. The result, unsurprisingly, is poor performance when OOV words appear. Studies (Sproat and Emerson, 2003; Chen, 2003) have shown that more than 60% of word segmentation errors result from OOV words. Statistics show that more than 1000 new Chinese words appear every year (Thesaurus Research Center, 2003), and the number is exploding in the internet era: new names, new events, domain terms, and newly coined words emerge every day. It is therefore extremely important to detect these OOV words and add them to the dictionary, preferably with little or no human labor, given this endless workload.

Using statistical information to let computers automatically detect words, especially OOV words, is one possible method, and that is exactly what we do in this project. The dataset we chose consists of NBA web pages from a major Chinese website, www.ifeng.com. We collected over 600 news reports from its NBA news channel. The raw data is plain HTML files. After crawling all these pages, we processed the HTML files to extract their text parts, which contain the punctuation and Chinese characters we are interested in. We then replaced punctuation with white space, leaving Chinese character strings in which to detect possible words.


The dataset contains over 240,000 Chinese characters in total, an average of about 400 characters per article. Our goal is to detect bi-gram and tri-gram Chinese words (within the time limit of this project; longer n-grams bring more complexity and require much more computation time). Using white space as boundaries, for each string between two spaces we generate all possible bi-gram, tri-gram, and quadri-gram word candidates. We then count the frequency of each bi-gram, tri-gram, and quadri-gram and compute the probability of each item.

Our plan is to examine the two parts of each word candidate and, based on the theory of statistical independence, decide whether the two parts (viewed as two events) are closely related or independent. With this method we are able to detect over 900 bi-gram Chinese words and over 100 tri-gram Chinese words, a large portion of which are OOV words (domain-specific terms, recently emerging person names, and newly coined words); we detect almost all of them.

Related Work

Chinese word segmentation is challenging partly because it is often difficult to define what constitutes a word in Chinese. At different granularities, fine-grained or coarse-grained, there are different segmentations that both make sense. In computer applications we are more concerned with segmentation units than with words: while words are supposed to be unambiguous and static linguistic entities, segmentation units are expected to vary from application to application. In fact, different Chinese NLP-enabled applications may have requirements that call for different granularities of word segmentation. For example, automatic speech recognition (ASR) systems prefer "longer words" to achieve higher accuracy, whereas information retrieval (IR) systems prefer "shorter words" to obtain higher recall rates (Wu 2003).

Therefore, we do not assume that there exists a universal, application-independent word segmentation standard. Instead, we argue for the existence of multiple segmentation standards, each for a specific application.

Various methods have been proposed in previous studies to address the word segmentation problem. Noting that linguistic information, syntactic information in particular, can help identify words, [Gan, 1995] and [Wu and Jiang, 1998] treated word segmentation as inseparable from Chinese sentence understanding as a whole. As a result, the success of the word segmentation task is tied to the success of the sentence understanding task, which is just as difficult as the word segmentation problem, if not more so.

Most word segmentation systems reported in previous studies are stand-alone systems, and they fall into three main categories, depending on whether they use statistical information and electronic dictionaries: purely statistical approaches [Sproat and Shih, 1990; Sun, Shen, and Tsou, 1998; Ge, Pratt, and Smyth, 1999; Peng and Schuurmans, 2001], non-statistical dictionary-based approaches [Liang, 1993; Gu and Mao, 1994], and combined statistical and dictionary-based approaches [Sproat et al., 1996]. More recent work on Chinese word segmentation also includes supervised machine-learning approaches [Palmer, 1997; Hockenmaier and Brew, 1998; Xue, 2001].

1. Dictionary-based approaches

Given an input character string, only words that are stored in the dictionary can be identified. One of the most popular methods is Maximum Matching (MM), usually augmented with heuristics to deal with ambiguities in segmentation. Papers that use this method or minor variants include (Chen et al. 1999; Nie, Jin and Hannan 1994), among others. The performance of these methods thus depends to a large degree on the coverage of the dictionary, which unfortunately may never be complete because new words appear constantly. Therefore, in addition to the dictionary, many systems also contain special components for unknown word identification.

(1) The Maximum Matching Algorithm and Its Variants

Different studies differ in their ambiguity resolution algorithms. A very simple one which has been demonstrated to be very effective is the maximum matching algorithm (Chen & Liu, 1992). Maximum matching can take many forms.

Simple maximum matching. The basic form resolves the ambiguity of a single word (Yi-Ru Li, personal communication, January 14, 1995). For example, suppose C1, C2, ..., Cn represent the characters in a string. We start at the beginning of the string and want to know where the words are. We first search the lexicon to see if _C1_ is a one-character word, then search _C1C2_ to see if it is a two-character word, and so on, until the combination is longer than the longest word in the lexicon. The most plausible word is the longest match. We take this word and then continue the process from the next character until the last word of the string is identified.
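To make the procedure concrete, here is a minimal Python sketch of simple (forward) maximum matching; the toy lexicon and the single-character fallback are illustrative assumptions rather than part of Chen and Liu's description.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches, falling back to one character."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]                      # fallback: single character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy example with an English-letter "lexicon" standing in for Chinese words.
lexicon = {"ab", "abc", "cd"}
print(forward_maximum_matching("abcd", lexicon))   # ['abc', 'd']
```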

Complex maximum matching. Another variant of maximum matching, due to Chen and Liu (1992), is more complex than the basic form. Their maximum matching rule says that the most plausible segmentation is the three-word chunk with maximum total length. Again, we start at the beginning of a string and want to know where the words are. If the segmentation is ambiguous (e.g., _C1_ is a word, but _C1C2_ is also a word, and so on), then we look ahead two more words to find all possible three-word chunks beginning with _C1_ or _C1C2_; among these chunks, the one with the greatest total length is chosen, and its first word is taken as the next segmented word.

The scanning direction is also very important: different scanning directions can produce different results. There are three variants:

Maximum obverse matching. Matching from left to right of a sentence; this was the first variant developed.

Maximum reverse matching. Matching from right to left of a sentence; it is usually reported to produce better results than obverse matching on Chinese text.

Maximum bi-directional matching. Matching from both ends of a sentence. The two single directions each have limitations; bi-directional matching combines the advantages of both policies and is more comprehensive.
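To illustrate the effect of scanning direction, below is a hedged sketch of maximum reverse matching; it mirrors the forward version above and again uses an illustrative toy lexicon.

```python
def backward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy right-to-left segmentation: repeatedly take the longest
    lexicon entry ending at the current position."""
    words = []
    j = len(text)
    while j > 0:
        match = text[j - 1:j]                # fallback: single character
        for length in range(min(max_word_len, j), 1, -1):
            candidate = text[j - length:j]
            if candidate in lexicon:
                match = candidate
                break
        words.append(match)
        j -= len(match)
    return list(reversed(words))

lexicon = {"ab", "abc", "cd"}
print(backward_maximum_matching("abcd", lexicon))  # ['ab', 'cd']
```

Note that the same toy string segments differently in the two directions ("abc"/"d" versus "ab"/"cd"), which is exactly why bi-directional matching compares the two results.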


(2) Minimum word count matching

The right segmentation chunks closely related nearby characters together, which reduces the final word count of a sentence. This means that minimum word count matching, which minimizes the number of words in a sentence, has a high probability of producing the right segmentation. It is a simple but powerful strategy, as sketched below.
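A minimal sketch of this strategy, assuming a dictionary is available: a dynamic program that, among all segmentations, picks one with the fewest words. Treating unknown single characters as one-character words is an illustrative assumption made so a solution always exists.

```python
def min_word_count_segmentation(text, lexicon, max_word_len=4):
    """Dynamic programming: best[i] = fewest words needed to segment text[:i]."""
    n = len(text)
    best = [0] + [n + 1] * n       # word counts (n + 1 acts as "infinity")
    back = [0] * (n + 1)           # back[i] = start index of the last word
    for i in range(1, n + 1):
        for length in range(1, min(max_word_len, i) + 1):
            piece = text[i - length:i]
            if length == 1 or piece in lexicon:
                if best[i - length] + 1 < best[i]:
                    best[i] = best[i - length] + 1
                    back[i] = i - length
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

lexicon = {"ab", "cd"}
print(min_word_count_segmentation("abcd", lexicon))  # ['ab', 'cd'] (2 words, not 4)
```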

2. Supervised machine-learning approaches

These approaches were developed to address the drawbacks of dictionary-based approaches. Using a tagged corpus in which word boundaries are explicitly marked with special annotations, machine learning algorithms build statistical models based on features of the characters surrounding the boundaries.

In these approaches, the word segmentation problem is formulated as a classification task in which each character in the text string is predicted to be a member of one of two classes: the beginning of a word (labeled "B") or an intra-word character (labeled "I"). Another way is to classify characters into three classes: the beginning of a word ("B"), the middle of a word ("M"), and the end of a word ("E"). Many algorithms can make this binary or ternary boundary decision: Naive Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), and Conditional Random Field (CRF).
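For illustration, the sketch below converts an already-segmented sentence into the per-character labels just described; labeling a single-character word as "B" is an assumption made here for simplicity.

```python
def boundary_labels(words, scheme="BME"):
    """Turn a segmented sentence (a list of words) into per-character labels.
    scheme="BI":  B = word-initial character, I = any other character.
    scheme="BME": B = word-initial, M = word-internal, E = word-final.
    A single-character word is labeled "B" here by convention."""
    labels = []
    for word in words:
        for pos in range(len(word)):
            if pos == 0:
                labels.append("B")
            elif scheme == "BI":
                labels.append("I")
            elif pos == len(word) - 1:
                labels.append("E")
            else:
                labels.append("M")
    return labels

# Toy example: a "sentence" of three words with lengths 2, 1, and 3.
print(boundary_labels(["ab", "c", "def"]))          # ['B','E','B','B','M','E']
print(boundary_labels(["ab", "c", "def"], "BI"))    # ['B','I','B','B','I','I']
```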

Kawtrakul et al. (1997) proposed a language modeling technique based on a tri-gram Markov model to select the optimal segmentation path. Meknavin et al. (1997) constructed a machine learning model using Part-Of-Speech (POS) features.

More recent work by Kruengkrai et al. (2006) used the Conditional Random Field (CRF) algorithm to train a word segmentation model. The CRF is a relatively recent approach which has been shown to perform better than other machine learning algorithms for labeling and segmenting sequence data. That work focused mainly on solving the ambiguity problem in word segmentation; two path selection schemes, based on confidence estimation and on Viterbi decoding, were proposed. The feature set used in their model required POS tagging information, so if the POS tagging is inaccurate, the performance of the word segmentation can be affected. In contrast, other work on Thai segmentation constructed the feature set from the character types of the n-gram characters surrounding the word boundary; those experiments showed that character types in the Thai language provide enough information to classify a character as word-beginning or word-ending.

Recent work on Chinese word segmentation has also used the transformation-based error-driven algorithm [Brill, 1993] and achieved various degrees of success [Palmer, 1997; Hockenmaier and Brew, 1998; Xue, 2001]. The transformation-based error-driven algorithm is a supervised machine-learning routine first proposed by [Brill, 1993] and initially used for POS tagging and parsing.

Although actual implementations may differ slightly, in general transformation-based error-driven approaches try to learn a set of n-gram rules from a training corpus and apply them to segment new text. The input to the learning routine is a (manually or automatically) segmented corpus and its unsegmented (or undersegmented) counterpart. At each iteration the learning algorithm compares the segmented corpus with the undersegmented dummy corpus and finds the rule that would achieve the maximum gain if applied; the rule with the maximum gain is the one that makes the dummy corpus most like the reference corpus, as quantified by an evaluation function. The rules are instantiations of a set of pre-defined templates. After the rule with the maximum gain is found, it is applied to the dummy corpus, which then better resembles the reference corpus. This process is repeated until the maximum gain drops below a pre-defined threshold, indicating that further training would no longer bring significant improvement. The output of the training process is a ranked set of rules instantiating the predefined templates; these rules are then used to segment new text. Like statistical approaches, this approach provides a trainable way to learn rules from a corpus and is not labor-intensive. The drawback is that, compared with statistical approaches, it is not very efficient.

3. Unsupervised machine-learning approaches

Both dictionary-based approaches and supervised machine-learning approaches have limitations: they do fine with in-dictionary or annotated words but perform badly on OOV words (domain-specific terms, recently emerging person names, and newly coined words). Although some supervised approaches can incorporate syntactic information (such as POS tags), the flexibility of Chinese grammar makes exploiting it about as difficult as the word segmentation problem itself, if not more so. Besides, both kinds of approach require a huge amount of human labor, considering the variety of domains and the speed at which new words are produced every day.

Unsupervised machine learning can address this problem. Sproat and Shih (1993) proposed a method using neither a lexicon nor a segmented corpus: for the input texts, simply group character pairs with a high mutual information value into words. Although this strategy is very simple and has many limitations (e.g., it can only handle bi-character words), its key characteristic is that it is fully automatic: the mutual information between characters can be estimated directly from a raw Chinese corpus.

Although this method is not accurate and generates more bi-character words than real ones, it is fast and powerful at detection, especially of OOV words, without requiring any sophisticated syntactic or grammatical knowledge, which machines tend to get wrong.

Our approach

Our approach is unsupervised, based on the theory of statistical independence. For two nearby Chinese characters we compute P(AB) / [ P(A) * P(B) ] to measure the closeness of the two characters; unlike mutual information, we do not apply a log function. If this value surpasses a certain threshold, the pair becomes a possible word candidate, and we further check the freeness on both sides of the candidate: a word should connect freely to various other words.


The difference between our approach and the dictionary-based and supervised machine-learning approaches is that ours requires neither an existing dictionary nor a labeled corpus. Those approaches can only recognize words that are in the dictionary or were seen in training, and they perform badly on unseen words. A Conditional Random Field (CRF) can incorporate some syntactic knowledge, but that helps little given the flexibility of Chinese syntax. Previous mutual-information-based work faces the problem of over-generating unreasonable bi-character words and lacks an effective way to remove them; it also was not extended to tri-character words, which we do in this project.

Methodology and Results

Our approach is based on the theory of statistical independence. In probability theory, to say that two events are independent (statistically independent) means that the occurrence of one does not affect the probability of the other. The formal definition is: two events A and B are independent if and only if their joint probability equals the product of their probabilities, P(AB) = P(A) * P(B).

The other side of this theory is: if two events A and B are not independent (e.g., they are closely related), then their joint probability will not equal the product of their separate probabilities; it will typically be significantly higher.

This theory is quite powerful for detecting Chinese words. Two nearby characters A and B can be seen as two events: if they are totally independent (i.e., they do not form a word), we can expect P(AB) ≈ P(A) * P(B); if A and B form a word, they are closely related and no longer independent, and we can expect P(AB) >> P(A) * P(B). This is the criterion for detecting possible word candidates, and real words will always meet this criterion.

For the dataset, we crawled over 600 NBA news reports from a major Chinese website. We processed the HTML files to extract the Chinese text and replaced punctuation with white space. Then, using white space as boundaries, for each string between two spaces we generate all possible bi-gram, tri-gram, and quadri-gram word candidates. We count the frequency of each bi-gram, tri-gram, and quadri-gram, and also compute the probability of each item. With these probabilities, we can detect possible candidates, as sketched below.
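A minimal sketch of this counting step, assuming the cleaned articles are available as strings in which punctuation has already been replaced by spaces (function and variable names are illustrative; normalizing each n-gram order by its own total count is one possible choice, not necessarily the one used in the report):

```python
from collections import Counter

def count_ngrams(cleaned_texts, max_n=4):
    """Count character uni-grams up to quadri-grams over all space-separated
    runs of characters, then convert counts to probabilities."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for text in cleaned_texts:
        for run in text.split():              # runs between white space
            for n in range(1, max_n + 1):
                for i in range(len(run) - n + 1):
                    counts[n][run[i:i + n]] += 1
    totals = {n: sum(c.values()) for n, c in counts.items()}
    probs = {n: {g: c / totals[n] for g, c in counts[n].items()}
             for n in counts if totals[n] > 0}
    return counts, probs

# cleaned_texts would be the ~600 processed NBA articles; here a toy stand-in.
counts, probs = count_ngrams(["abab ab", "abc"])
print(counts[2]["ab"], probs[2]["ab"])
```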

For each bi-character candidate AB we compute its probability P(AB) as well as the probability of each character, P(A) and P(B), and define P(likelihood) = P(AB) / [ P(A) * P(B) ]. If characters A and B are totally independent, P(likelihood) should be about 1. The higher P(likelihood) is above 1, the less likely A and B are to be independent: they do not occur together by chance, but have a close relationship.
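A hedged sketch of this score, reusing counter and probability dictionaries like those produced above (the absence of smoothing and the minimum count of 10 mirror the simple setting described in this report, but the exact code is illustrative):

```python
def likelihood(bigram, unigram_probs, bigram_probs):
    """P(likelihood) = P(AB) / (P(A) * P(B)); roughly 1 when the two characters
    co-occur only by chance, and much larger when they form a word."""
    a, b = bigram[0], bigram[1]
    p_ab = bigram_probs.get(bigram, 0.0)
    p_a = unigram_probs.get(a, 0.0)
    p_b = unigram_probs.get(b, 0.0)
    if p_a == 0.0 or p_b == 0.0:
        return 0.0
    return p_ab / (p_a * p_b)

def rank_bigrams(counts, probs, min_count=10):
    """Score every observed bi-character candidate and sort by the score."""
    scored = {g: likelihood(g, probs[1], probs[2])
              for g, c in counts[2].items() if c >= min_count}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```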

our dataset, typically, character probability is 1 / 2000 ~ 1 / 9000. From our experience, whenP(likelihood) is above 100 ( one hundred times than totally independent), then A, B is closeenough to be a word. However, a lot of this situation could be an unreasonable word, so youneed a reliable way to remove these unreasonable ones.


Besides high closeness, a word has another property: outside freeness. That is, a word can freely connect to many other words on both sides. So, on either side, we count the total number of distinct Chinese characters a word candidate can connect to; if this number surpasses a certain threshold (e.g., 20), the candidate is quite likely to be a real word. Even punctuation or white space can be taken into account: a Chinese word cannot contain punctuation or white space, so these are strong segmentation marks, and if they appear on one side more than a certain number of times, that side is free enough.
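A minimal sketch of this freeness check over the cleaned texts; treating either enough distinct neighboring characters or enough adjacent boundaries (white space, i.e. former punctuation, or the edge of a character run) as sufficient for one side is one reading of the text, not a definitive implementation.

```python
def freeness(cleaned_texts, candidate):
    """For each side of the candidate, count distinct neighboring characters
    and how often the neighbor is a boundary (start/end of a run)."""
    left_chars, right_chars = set(), set()
    left_boundary = right_boundary = 0
    k = len(candidate)
    for text in cleaned_texts:
        for run in text.split():
            start = run.find(candidate)
            while start != -1:
                if start == 0:
                    left_boundary += 1
                else:
                    left_chars.add(run[start - 1])
                end = start + k
                if end == len(run):
                    right_boundary += 1
                else:
                    right_chars.add(run[end])
                start = run.find(candidate, start + 1)
    return len(left_chars), left_boundary, len(right_chars), right_boundary

def is_free(cleaned_texts, candidate, min_distinct=10, min_boundary=5):
    """Both sides must be 'free': many distinct neighbors or many boundaries."""
    lc, lb, rc, rb = freeness(cleaned_texts, candidate)
    return (lc >= min_distinct or lb >= min_boundary) and \
           (rc >= min_distinct or rb >= min_boundary)
```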

Another important criterion for selecting words in our hierarchical approach is the conditional ("given") probability, the probability of one event given that another event has already happened. The formulas are P(A | B) = P(AB) / P(B) and P(B | A) = P(AB) / P(A), and we define P(given) = max[ P(A | B), P(B | A) ]. The larger P(given) is, the stronger the inter-connection. We compute this value for each word candidate.
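A short illustrative sketch of this score (the zero guards are an added assumption):

```python
def given_probability(bigram, unigram_probs, bigram_probs):
    """P(given) = max(P(A|B), P(B|A)) = max(P(AB)/P(B), P(AB)/P(A))."""
    a, b = bigram[0], bigram[1]
    p_ab = bigram_probs.get(bigram, 0.0)
    p_a = unigram_probs.get(a, 0.0)
    p_b = unigram_probs.get(b, 0.0)
    p_a_given_b = p_ab / p_b if p_b > 0 else 0.0
    p_b_given_a = p_ab / p_a if p_a > 0 else 0.0
    return max(p_a_given_b, p_b_given_a)
```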

Based on P(likelihood) and freeness, and using all the n-gram probabilities computed above, we compute the P(likelihood) value for each bi-character candidate and sort the candidates in descending order. P(likelihood) can be as low as 20. Candidates whose freeness on either side falls below the thresholds are removed; for our small dataset we set the threshold on punctuation or white-space neighbors to 5 and the threshold on distinct connected characters to 10. This is already enough to produce over 900 bi-character words. With top P(likelihood) (over 100) and top P(given) (over 0.5), we produce high-quality bi-character words with strong confidence; with lower P(likelihood) and lower P(given), we generate additional bi-character words with weaker confidence.

Previous mutual-information-based work could not handle tri-character words; here we extend the bi-character approach to them. We view a tri-character word ABC as two parts (just like a bi-character word), giving two possible splits: AB + C or A + BC. Analogously, we compute P1(likelihood) = P(ABC) / [ P(AB) * P(C) ] and P2(likelihood) = P(ABC) / [ P(A) * P(BC) ], define P(likelihood) = min[ P1(likelihood), P2(likelihood) ], compute it for each tri-character candidate, and sort in descending order, requiring it to surpass a certain threshold (here we use 100). Each tri-character candidate must also meet the freeness requirement on both sides; as in the bi-character case, we set the punctuation/white-space threshold to 5 and the distinct-character threshold to 10. This is already enough to produce over 100 tri-character words.
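A hedged sketch of the tri-character score, reusing probability dictionaries like those above:

```python
def tri_likelihood(trigram, unigram_probs, bigram_probs, trigram_probs):
    """P(likelihood) for ABC = min( P(ABC)/(P(AB)*P(C)), P(ABC)/(P(A)*P(BC)) ),
    i.e. the candidate must beat independence under both possible splits."""
    p_abc = trigram_probs.get(trigram, 0.0)
    scores = []
    for left, right in ((trigram[:2], trigram[2:]), (trigram[:1], trigram[1:])):
        p_left = (bigram_probs if len(left) == 2 else unigram_probs).get(left, 0.0)
        p_right = (bigram_probs if len(right) == 2 else unigram_probs).get(right, 0.0)
        if p_left == 0.0 or p_right == 0.0:
            return 0.0
        scores.append(p_abc / (p_left * p_right))
    return min(scores)
```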

Tri-character words are actually far more complex than bi-character words: even when P(likelihood) is high, the candidate could be two closely related words next to each other rather than one single word. To solve this problem we provide an effective solution. Based on the bi-character words already found (we can even lower the P(likelihood) threshold to 10 to cover more possible words), we split the tri-character candidates into two groups: candidates in the first group contain no bi-character word already found (the no-doubt group); candidates in the second group contain at least one bi-character word already found (the with-doubt group). For the no-doubt group, a P(likelihood) threshold of 100 is good enough; the with-doubt group must pass a stricter test, a P(likelihood) threshold of 100 together with P(given) over 0.5.
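A sketch of this hierarchical filtering step under the thresholds stated here (the results below raise the with-doubt threshold to 200); the input format, a list of (candidate, P(likelihood), P(given)) tuples, is an assumption:

```python
def select_tri_words(tri_candidates, bi_words_low_threshold,
                     pure_threshold=100.0,
                     doubt_threshold=100.0, doubt_given=0.5):
    """Split tri-character candidates into a no-doubt group (containing no
    known bi-character word) and a with-doubt group (containing at least one),
    then apply the stricter test to the with-doubt group."""
    accepted = []
    for cand, p_like, p_given in tri_candidates:
        contains_bi = any(cand[i:i + 2] in bi_words_low_threshold
                          for i in range(len(cand) - 1))
        if not contains_bi:
            if p_like >= pure_threshold:
                accepted.append(cand)
        elif p_like >= doubt_threshold and p_given >= doubt_given:
            accepted.append(cand)
    return accepted

# tri_candidates: (candidate, P(likelihood), P(given)) tuples;
# bi_words_low_threshold: bi-character words found with a threshold as low as 10.
```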


With our hierarchical approach, we can produce bi-character and tri-character words ranked from high confidence to low confidence. The top 20 generated bi-character words are:

[Table: top 20 generated bi-character words]

And the top 20 generated tri-character words are:

[Table: top 20 generated tri-character words]

Our approach even handles English abbreviations, as the tables show: JR (a person's name), MVP (most valuable player), and GDP (the three core players' names together).

In total we generated over 900 bi-character words and over 100 tri-character words; the counts would be lower if higher confidence were required. The P(likelihood) threshold for high-confidence bi-character words is 100, which produces 594 words; the threshold can be lowered to 20 if a few mistakes are acceptable, which produces another 397 words.


Tri-character word detection is based on the bi-character candidates obtained with a P(likelihood) threshold as low as 10. If a tri-character candidate does not contain any of them, it is pure; otherwise it is impure. This is our hierarchical way of making further detections based on previous results. The P(likelihood) threshold for high-confidence tri-character words is 100; among the pure tri-character candidates this produces 98 words. For the impure tri-character candidates, the P(likelihood) threshold is raised to 200, together with P(given) over 0.5, which aims to filter out bi-character words that merely correlate with another character but do not form a tri-character word (hence the higher threshold). This produces another 19 words.

Compared with our approach, dictionary-based and supervised machine-learning approaches do poorly on unseen words, either breaking them apart or merging them with other characters. A Conditional Random Field (CRF) can incorporate some syntactic knowledge, but that helps little given the flexibility of Chinese syntax. Previous mutual-information-based work lacks an effective way to deal with over-generated, unreasonable bi-character words, and it was not extended to tri-character words, which we did here.

We also used a famous, high-quality open-source Chinese word segmentation toolkit, jieba, to generate words from the same NBA texts, to see how well it performs on domain terms. It generates 2,500 tri-character words and 10,000 bi-character words, many more than we do (we removed digits as well as candidates appearing fewer than 10 times before checking freeness). However, its results contain many unreasonable words (words wrongly cut through and joined to half of another word), and this happens for a high percentage of its output. It also does not perform well on OOV words.

Conclusion


We present a hierarchical, statistics-based approach to detect words, especially out-of-vocabulary (OOV) words (e.g., domain-specific terms, names, and newly coined words). It is powerful and accurate. It requires neither a Chinese dictionary nor an annotated corpus, nor any complex syntactic or grammatical knowledge. With purely statistical computation we obtain high word coverage and accuracy. The hierarchical idea is that we separate the output words by confidence: high-probability words come first.

Compared with a famous, high-quality open-source toolkit, the quality of our words is high. The toolkit's results contain many unreasonable words (words wrongly cut through and joined to half of another word), and this happens for a high percentage of its output; it also does not perform well on OOV words.

With our approach, a trade-off between coverage and confidence can be made according to demand. Our approach is also domain-adaptive: it is easy to apply to different domains without any changes.

From this project we gained a deep understanding of the theory of independence and of correlation detection, similar to the classic beer-and-diapers association example. We also gained a deep understanding of conditional ("given") probability, the probability of one event when another event has already happened. This project was challenging and fun, and the technique is useful in quite a lot of applications.

Future Work

Future work could incorporate syntactic knowledge and a high-quality dictionary to help make decisions. An annotated corpus could also be considered. With good design, these sources of information would complement each other and give better results. Given more time, we would also like to extend our approach to quadri-character and longer words.


References

Chen, Aitao. 2003. Chinese word segmentation using minimal linguistic knowledge. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17, SIGHAN '03, pages 148-151, Stroudsburg, PA, USA. Association for Computational Linguistics.

Wu, Andi. 2003. Customizable segmentation of morphologically derived words in Chinese. International Journal of Computational Linguistics and Chinese Language Processing, 8(1):1-27.

Gan, Kok-Wee, 1995. Integrating Word Boundary Disambiguation with Sentence Understanding. Ph.D. thesis, National University of Singapore.

Wu, Andi and Zixin Jiang, 1998. "Word segmentation in sentence analysis". In Proceedings of the 1998 International Conference on Chinese Information Processing. Beijing, China.

Sproat, R. and C. L. Shih, 1990. "A statistical method for finding word boundaries in Chinese text". Computer Processing of Chinese and Oriental Languages, 4(4):336-351.

Sun, Maosong, Dayang Shen, and Benjamin K. Tsou, 1998. "Chinese word segmentation without using lexicon and hand-crafted training data". In Proceedings of COLING-ACL '98.

Ge, Xianping, Wanda Pratt, and Padhraic Smyth, 1999. "Discovering Chinese words from unsegmented text". In SIGIR '99.

Peng, Fuchun and Dale Schuurmans, 2001. "Self-supervised Chinese word segmentation". In F. Hoffman et al. (eds.), Advances in Intelligent Data Analysis, Proceedings of the Fourth International Conference (IDA-01). Heidelberg: Springer-Verlag.

Liang, Nanyuan, 1993. "Shumian hanyu zidong fenci xitong CDWS" (A written-Chinese automatic word segmentation system, CDWS). Journal of Chinese Information Processing, 1(1):44-52.

Gu, Ping and Yuhang Mao, 1994. "Hanyu zidong fenci de jinlin pipei suanfa jiqi zai QHFY hanying jiqi fanyi xitong zhong de shixian" (The adjacent matching algorithm of Chinese automatic word segmentation and its implementation in the QHFY Chinese-English machine translation system). In International Conference on Chinese Computing. Singapore.

Sproat, R., Chilin Shih, William Gale, and Nancy Chang, 1996. "A stochastic finite-state word-segmentation algorithm for Chinese". Computational Linguistics, 22(3):377-404.

Palmer, David, 1997. "A trainable rule-based algorithm for word segmentation". In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Madrid, Spain.

Hockenmaier, Julia and Chris Brew, 1998. "Error-driven segmentation of Chinese". Communications of COLIPS, 1(1):69-84.

Xue, Nianwen, 2001. Defining and Automatically Identifying Words in Chinese. Ph.D. thesis, University of Delaware.

Chen, S. F. and J. Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13:359-394, October.

Nie, Jian-Yun, Wanying Jin and Marie-Louise Hannan. 1994. A hybrid approach to unknown word detection and segmentation of Chinese. In International Conference on Chinese Computing, pp. 326-335. Singapore.

Chen, K. J., & Liu, S. H. (1992). Word identification for Mandarin Chinese sentences. Proceedings of the Fifteenth International Conference on Computational Linguistics, Nantes: COLING-92.

Kawtrakul, A. and C. Thumkanon, "A Statistical Approach to Thai Morphological Analyzer," Proc. of the 5th Workshop on Very Large Corpora, pp. 289-286, 1997.

Meknavin, S., P. Charoenpornsawat, and B. Kijsirikul, "Feature-Based Thai Word Segmentation," Proc. of NLPRS '97, pp. 289-296, 1997.

Kruengkrai, C. and H. Isahara, "A Conditional Random Field Framework for Thai Morphological Analysis," Proc. of the Fifth Int. Conf. on Language Resources and Evaluation (LREC-2006), 2006.

Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.

Sproat, R. and C. L. Shih, 1993. "A Statistical Method for Finding Word Boundaries in Chinese Text". Computer Processing of Chinese and Oriental Languages, No. 4.

