
ACL-IJCNLP 2015

Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

July 30, 2015
Beijing, China


©2015 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-941643-62-4



Preface

The ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction was held on July 30, 2015 in Beijing, China as part of the 53rd annual meeting of the ACL and the 7th International Joint Conference on Natural Language Processing.

The workshop’s goal was to bring together researchers addressing a wide range of questions pertaining to the keyphrase extraction task, as well as domain-specific applications involving the use of keyphrases.

The workshop program included two invited talks by researchers who are experts in the fields of data mining, information retrieval, and natural language processing: Prof. Min-Yen Kan from the National University of Singapore and Dr. Preslav Nakov, senior scientist at the Qatar Computing Research Institute.

After a rigorous review process, two long papers and three short papers were selected by the Program Committee for inclusion in the workshop proceedings. We hope that these papers summarize novel findings related to keyphrase extraction and spark further research interest in this exciting topic.

We thank the authors, invited speakers, program committee members, and participants for sharing their research ideas and valuable time to be part of ACL Keyphrase!

–Organizers: Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, C. Lee Giles



Organizers:
Sujatha Das Gollapalli, Institute for Infocomm Research, A*STAR, Singapore
Cornelia Caragea, University of North Texas, USA
Xiaoli Li, Institute for Infocomm Research, A*STAR, Singapore
C. Lee Giles, The Pennsylvania State University, USA

Program Committee:

Marina Danilevsky, IBM Almaden Research Center
Fei Liu, Carnegie Mellon University
Doina Caragea, Kansas State University
Rada Mihalcea, University of Michigan
Shibamouli Lahiri, University of Michigan
Saurabh Kataria, Palo Alto Research Center
Ani Nenkova, University of Pennsylvania
Kazi Hasan, IBM
Yang Song, Microsoft Research
Olena Medelyan, Entopix
Min-Yen Kan, National University of Singapore
Feifan Liu, Nuance Inc.
Niket Tandon, Max-Planck-Institut für Informatik
Preslav Nakov, Qatar Computing Research Institute
Fang Yuan, Institute for Infocomm Research, A*STAR
Pucktada Treeratpituk, Ministry of Science and Technology, Thailand
Madian Khabsa, The Pennsylvania State University

Invited Speakers:

Min-Yen Kan, National University of Singapore
Preslav Nakov, Qatar Computing Research Institute



Table of Contents

Keywords, Phrases, Clauses and Sentences: Topicality, Indicativeness and Informativeness at Scales
Invited talk by Min-Yen Kan . . . . . . . . . . . . 1

Technical Term Extraction Using Measures of Neology
Christopher Norman and Akiko Aizawa . . . . . . . . . . . . 2

Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch and Iryna Gurevych . . . . . . . . . . . . 10

The Web as an Implicit Training Set: Application to Noun Compounds Syntax and Semantics
Invited talk by Preslav Nakov . . . . . . . . . . . . 18

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming
Florian Boudin . . . . . . . . . . . . 19

TwittDict: Extracting Social Oriented Keyphrase Semantics from Twitter
Suppawong Tuarob, Wanghuan Chu, Dong Chen and Conrad Tucker . . . . . . . . . . . . 25

Identification and Classification of Emotional Key Phrases from Psychological Texts
Apurba Paul and Dipankar Das . . . . . . . . . . . . 32



Workshop Program

Thursday, July 30, 2015

9.10-9.30 Opening Remarks

9.30-10.30 Keywords, Phrases, Clauses and Sentences: Topicality, Indicativeness and Informativeness at Scales
Invited Talk by Min-Yen Kan

10.30-11.00 Coffee Break

11.00-11.30 Technical Term Extraction Using Measures of Neology
Christopher Norman and Akiko Aizawa

11.30-12.00 Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs, Pedro Bispo Santos, Torsten Zesch and Iryna Gurevych

12.00-14.00 Lunch

14.00-15.00 The Web as an Implicit Training Set: Application to Noun Compounds Syntax and Semantics
Invited Talk by Preslav Nakov

15.00-15.30 Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming
Florian Boudin

15.30-16.00 Coffee Break

16.00-16.30 TwittDict: Extracting Social Oriented Keyphrase Semantics from Twitter
Suppawong Tuarob, Wanghuan Chu, Dong Chen and Conrad Tucker

16.30-17.00 Identification and Classification of Emotional Key Phrases from Psychological Texts
Apurba Paul and Dipankar Das

17.00-17.30 Closing Remarks and Discussion



Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, page 1, Beijing, China, July 30, 2015. ©2015 Association for Computational Linguistics

(Invited Talk) Keywords, Phrases, Clauses and Sentences: Topicality, Indicativeness and Informativeness at Scales

Min-Yen Kan
National University of Singapore
[email protected]

About the Speaker:
Min-Yen Kan (BS; MS; PhD Columbia University) is an associate professor at the National University of Singapore. He is a senior member of the ACM and a member of the IEEE. Currently, he is an associate editor for the journal “Information Retrieval” and is the Editor for the ACL Anthology, the computational linguistics community’s largest archive of published research. His research interests include digital libraries and applied natural language processing. Specific projects include work in the areas of scientific discourse analysis, full-text literature mining, machine translation, and applied text summarization. More information about him and his group can be found at the WING homepage: http://wing.comp.nus.edu.sg/.



Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 2–9, Beijing, China, July 30, 2015. ©2015 Association for Computational Linguistics

Technical Term Extraction Using Measures of Neology

Christopher Norman
Royal Institute of Technology
The University of Tokyo
[email protected]

Akiko Aizawa
National Institute of Informatics
The University of Tokyo
[email protected]

Abstract

This study aims to show that frequency of occurrence over time for technical terms and keyphrases differs from that of general language terms, in the sense that technical terms and keyphrases show a strong tendency to be recent coinage, and that this difference can be exploited for the automatic identification and extraction of technical terms and keyphrases. To this end, we propose two features extracted from temporally labelled datasets designed to capture surface level n-gram neology. Our analysis shows that these features, calculated over consecutive bigrams, are highly indicative of technical terms and keyphrases, which suggests that both technical terms and keyphrases are strongly biased to be surface level neologisms. Finally, we evaluate the proposed features on a gold-standard dataset for technical term extraction and show that the proposed features are comparable or superior to a number of features commonly used for technical term extraction.

1 Introduction

Keyphrases are terms assigned to documents, conventionally by their authors, that are intended chiefly as an aid in searching large collections of documents, as well as to give a brief overview of a document’s contents. Technical terms are words or phrases that hold a specific meaning in specific domains or communities. Keyphrases are closely related to technical terms in the sense that the keyphrases assigned to a document are generally selected from the terminology of the document’s domain. Keyphrases and technical terms show considerable conceptual overlap, and by extension, so do keyphrase and technical term extraction. As a consequence, these two are closely related research topics. In this study we will see technical term extraction and keyphrase extraction as distinct but related. We will take the view that the technical terms in a scientific article are likely candidates to be keyphrases for the document and, consequently, that technical term extraction methods might also be useful in keyphrase extraction.

We will show that features that capture the neology of term candidates can be used to extract technical terms, and that the basic assumptions that enable this extraction also hold true for keyphrases.

This paper is organized as follows: We first discuss how technical terms and keyphrases differ from general language terms in terms of neology. We then define features that capture this difference and analyze these features statistically using the SemEval-2010 dataset (Kim et al., 2010) and a gold standard for technical term extraction derived from the same dataset (Chaimongkol and Aizawa, 2013). Our analysis shows that the proposed features reliably separate positive from negative examples, both of technical terms and of keyphrases. Furthermore, the histograms for the proposed features are very similar when calculated for technical terms and keyphrases, suggesting that technical terms and keyphrases have very similar neological properties. Finally, we demonstrate that this statistical bias can be used to reliably extract technical terms in a gold-standard dataset, and that the proposed features are comparable or superior to other features used in technical term extraction, with an F-score of 0.509 as compared to 0.593, 0.367, 0.361, and 0.204 for affix patterns, tf-idf, word shape, and POS tags respectively.

We argue that, given the high performance of the proposed features on technical term extraction, and given that we can show that the statistical properties that enable us to use them to extract technical terms also extend to keyphrases, the proposed features should also be useful in keyphrase extraction.



2 Related works

Most technical term extraction systems work fairly similarly to keyphrase extraction systems, using an initial n-gram or POS tag-based filtering to identify term candidates, then proceeding to narrow this list down using machine learning algorithms on various kinds of document statistics such as term frequency or the DICE coefficient (Justeson and Katz, 1995; Frantzi et al., 2000; Pinnis et al., 2012). For an in-depth summary of the state of the art in technical term extraction, we refer to Vivaldi and Rodríguez (2007). For a summary of the state of the art in keyphrase extraction, which largely follows the same pattern, we refer to Hasan and Ng (2014). The main difference in implementation might simply come down to a choice in top-level machine learning approach: in technical term extraction it makes sense to view the problem as a binary classification problem, whereas in keyphrase extraction it makes more sense to see the problem as a ranking problem.

Approaches based on frequency statistics extracted from the documents themselves are, however, not without their drawbacks. To begin with, for document statistics to be meaningful we will need a dataset that is large enough, consisting of documents that are large enough individually. We might also encounter problems if the documents are too large, because then the statistics might be drowned out by noise in the data (Hasan and Ng, 2014). We should also be careful about the topical composition of the dataset – if the dataset only contains documents from a single domain, then we will have to approach the problem very differently than if the dataset contains documents from multiple domains. Preferably, we want methods that do not make these kinds of assumptions about the dataset, methods that can be applied to documents of any size, and to document collections of any size or of any topical composition. In the best of worlds, we want methods that can be applied to document collections consisting of a single document, or even a single sentence.

One way to go beyond simple document statistics is to use external, pregenerated resources. To give some examples of this, Medelyan and Witten (2006) use a pregenerated domain thesaurus to conflate equivalent terms and to select candidates that are thematically related to each other, Hulth et al. (2006) use a pregenerated domain ontology to select candidates whose synonyms, hypernyms, and hyponyms also appear in the text, and Lopez and Romary (2010) use a terminology database as one way to measure the salience of term candidates for keyphrase extraction in scientific articles. Medelyan et al. (2009) use a somewhat more indirect external resource by taking the frequency with which a term candidate appears in Wikipedia links, divided by the frequency with which it appears in Wikipedia documents. The idea behind using external resources is that human annotators generally perform better than automatic systems, and resources produced by human beings are thus much more reliable than automatic methods, even if the resources themselves are only obliquely related to keyphrases.

However, depending on the speed with which the terminology of a subject field changes, any previously generated resource might become outdated very quickly. In a subject field such as law, where the terminology changes only imperceptibly over time (Lemmens, 2011), this is unlikely to be an issue, but in a quickly changing subject field such as information science, where the terminology has been reported to change by as much as 4% per year (Harris, 1979), it is likely that pregenerated resources will lag behind recent terminological developments. One selling point of the automatic extraction of keyphrases or technical terms is that automatic methods are able to respond to changes in the terminology of a subject field with the same speed that the terminology changes, but if we rely on pregenerated resources then we forsake this advantage, since these are unlikely to include terminology that has only recently appeared in the subject field.

3 Theoretical basis

In this paper we will examine the use of external corpora in order to track the frequency of occurrence of n-grams over time, and use measures of neology as a way to extract technical terms. We are not aware of any previous attempts to use neology as a feature for technical term extraction, keyphrase extraction, or other kinds of natural language processing tasks. We will use the Google Ngrams dataset (Lin et al., 2012), where this information is already extracted. Although we use this dataset, mainly because of its convenience in our initial investigation, there is nothing keeping us from using other corpora consisting of raw documents, such as Pubmed. In particular, this would allow us to obtain more recent data than the Google Ngrams dataset, which only contains frequencies of occurrence from before 2008.

Figure 1: Example timelines (frequency of occurrence) for three technical terms and one non-technical term. The frequencies of occurrence for each timeline have been normalized to sum to one to fit all graphs in the same diagram.

In order to develop some intuition about how technical terms are adopted, let us look at the timelines for some technical terms in the Google Ngrams dataset (Figure 1). To begin with, all the technical terms (graph problem, motor control, and jet engine) are relatively recent coinage, and none of them were in use in the 19th century. In all three cases, there is some point in time at which the term gained momentum and began to surge in frequency. This characteristic is fairly typical of technical terms, although we can of course find general language terms that exhibit the same pattern of adoption. By contrast, the non-technical term (clear water) has been in use throughout the 19th and 20th centuries. Unlike technical terms, general language terms do not have generally observable characteristics, and the shapes of their timelines vary greatly from case to case. The defining characteristic instead seems to be one of contrast: general language terms seldom have the steep curves that we can observe for the technical terms here.

We will formalize this difference and examine it statistically in the later parts of this study.

Of course, our ability to find neologisms by examining the frequency of occurrence of their surface forms necessitates that technical terms generally do not share surface forms with general language terms. If such is the case, then the general language senses of the terms are likely to drown out the technical term senses. For instance, consider a term like worm in computer security. The vast majority of the occurrences of the unigram worm in the Google Ngrams dataset are likely to be of the biological variety, and it is consequently impossible to tell from the Google Ngrams dataset alone that the computer security term only appeared in the later half of the 20th century1. Fortunately, the case where technical terms coincide with general language terms is rare, at least when considering terms composed of multiple words.

The overall recency of coinage of technical terms depends on the subject field – the majority of the terminology in e.g. computer science consists of terms whose surface forms were introduced no earlier than the middle of the 20th century, whereas subject fields such as mathematics and physics include terminology coined in the 19th century or earlier. Consequently, if we plot the frequency of usage of a neologism over time we would expect to see a curve similar to those in Figure 1, but we should expect that the curves may be shifted to the left or to the right, largely depending on the subject field.

Why are technical terms so often neologisms? It turns out that general language, which is what is ordinarily studied in linguistics, and special language, which is what we actually encounter in the documents commonly used for keyphrase extraction, differ quite substantially in linguistic aspects (Sager et al., 1980). One difference is that surface level neologisms are seldom created in general language. Rather, the creation of new surface forms generally occurs in special language, from which a term might later be transferred to general language (Sager et al., 1980, p. 287). Consequently, terms that have appeared recently, given some specific point in time, are likely to be domain-specific at that point. The more time that has passed since the adoption of a term, the more likely it is that the term has been adopted into general language.

4 Measures of neology

We have noted that the shapes of the timelines seem to indicate whether a given term is recent coinage, but in order to use these as input to machine learning algorithms, we need to distill the high dimensional data into low-dimensional features that retain the neological information.

1 However, the term computer worm is a surface level neology. We might thus observe that new senses of the unigram worm appeared in the later half of the 20th century by examining bigrams.



What we want to extract is of course not necessarily the shape of the timelines, but whether the occurrences of the n-grams predominantly fall on the far right side of the time axis. In other words, we want to determine if the timeline is mainly concentrated on the right side. This is simple to do using statistical measures such as the mean and the standard deviation of the curves.

Let $f^i_y$ denote the frequency of an n-gram $i$ in year $y$. Then

$$p_i(y) = \frac{f^i_y}{\sum_y f^i_y}$$

constitutes a probability density function, with the expected value:

$$\mu_i = \sum_y p_i(y) \cdot y = \frac{\sum_y f^i_y \cdot y}{\sum_y f^i_y}$$

How this “mean” should be interpreted might not be completely intuitively obvious, but for our purposes here it is enough to note that $\mu_i$ indicates where the curve is mainly concentrated. If the curve is concentrated around higher values of $y$ then we have, by definition, a surface level neologism.

We can take the standard deviation of $p_i$ in the same way:

$$\sigma^2_i = \sum_y p_i(y) \cdot (y - \mu_i)^2 = \frac{\sum_y f^i_y \cdot (y - \mu_i)^2}{\sum_y f^i_y}$$

The standard deviation $\sigma_i$ then yields a measure of how much the probability density is concentrated around $\mu_i$, in other words, how “steep” the probability density is. A low standard deviation consequently indicates that the term has been adopted or abandoned rapidly. Low standard deviation should thus in general imply either surface level neologisms or fads. If we are only interested in how quickly a term has been adopted and not how quickly it might have been abandoned, then we can take a one-sided “standard deviation” by separating out the $y$ for which $y < \mu_i$, but this does not seem to make much difference for the sake of the separability of technical terms. Those terms that have been adopted quickly also appear to be likely to quickly fall into relative disuse.
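To make this concrete, here is a minimal sketch (ours, not the authors’ code) of how $\mu_i$ and $\sigma_i$ can be computed in Python from a Google Ngrams-style timeline; the yearly counts below are made up for illustration:

```python
def neology_features(counts):
    """Compute the timeline mean and standard deviation defined above.

    counts: dict mapping year -> frequency of the n-gram in that year,
            e.g. extracted from the Google Ngrams dataset.
    Returns (mu, sigma), or None if the n-gram never occurs.
    """
    total = sum(counts.values())
    if total == 0:
        return None  # feature is missing for this n-gram
    # Expected value: mu_i = sum_y p_i(y) * y
    mu = sum(freq * year for year, freq in counts.items()) / total
    # Variance: sigma_i^2 = sum_y p_i(y) * (y - mu_i)^2
    var = sum(freq * (year - mu) ** 2 for year, freq in counts.items()) / total
    return mu, var ** 0.5

# A term whose mass sits in recent years gets a high mean and a low
# standard deviation, i.e. it looks like a surface level neologism.
mu, sigma = neology_features({1980: 2, 1990: 40, 2000: 120, 2008: 150})
print(round(mu, 1), round(sigma, 1))  # 2002.4 6.3
```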

We should hasten to point out that there do not seem to exist any theoretical reasons to use the mean and standard deviation in this way. For instance, using the peak of the curve (i.e. the mode of the distribution) might have more intuitive appeal, since this should correspond to the point in time at which the term was in its most widespread use. However, the Google Ngrams dataset is often plagued by severe noise, in particular for less commonly used n-grams such as technical terms, and the peaks of the timelines are thus likely to be spurious. The mean and the standard deviation may be crude measures of shape, but they have the advantage of being robust against noise, and can generally be used with good results even for very noisy timelines.

Other intuitively appealing features, such as the first order derivatives of the timelines or the skewness of $p_i$, have turned out not to be very useful, presumably because of the noise.

5 Statistical properties of technical terms and keyphrases

In this section, we examine statistically the features we propose, and show that both technical terms and keyphrases are strongly biased towards certain values of our features. We also underline the relationship between technical terms and keyphrases by showing that these are very similar in terms of neology.

To analyze keyphrases we will use the SemEval-2010 dataset (Kim et al., 2010), one of the most commonly used gold standards for keyphrase extraction. To analyze technical terms we will use a dataset consisting of the abstracts from the SemEval-2010 dataset, manually annotated by two annotators such that all the technical term spans have been labeled (Chaimongkol and Aizawa, 2013). Since this dataset was constructed from the abstracts of the SemEval-2010 dataset, we assume that these datasets are similar enough that analyzing and comparing their statistical properties is meaningful.

For the analysis in the following section, we construct three classes of data:

• We extract the n-grams that are part of the spans labeled as technical terms in the Chaimongkol-Aizawa dataset to obtain one set of positive examples of technical terms.

• We extract the constituent n-grams from the gold-standard keyphrases of the SemEval-2010 training dataset (using the combined set) to obtain one set of positive examples of keyphrases.

• We extract the n-grams that are not part of the spans labeled as technical terms in the Chaimongkol-Aizawa dataset to obtain one set of negative examples of both technical terms and keyphrases.

Figure 2: Class separation between technical terms, keyphrases, and background terms over µi (left) and σi (right), when considering unigrams (top) and bigrams (bottom). Histogram bin size was set to 2 in all cases. The resulting histograms have been smoothed using Matlab’s default settings (moving average with span 5) and normalized to sum to one.

We could of course extract negative examples of keyphrases from the SemEval-2010 dataset, but it is not so clear that these would all unquestionably be negative examples of keyphrases. Given that inter-annotator agreement is generally very low for keyphrases, this set might well contain terms that could reasonably be considered keyphrases by other annotators. We here assume that the negative examples of technical terms also constitute negative examples of keyphrases, and that this set is less likely to contain borderline cases of keyphrases. Only using three classes of data also simplifies both the exposition and the processing.

For these three classes of n-grams, we extract the corresponding timelines from the Google Ngrams dataset over the period 1800–2008, and we calculate the means µ and standard deviations σ from these as defined in the preceding section. To analyze the features µ and σ we plot the histograms of the values for the technical term n-grams and the keyphrase n-grams versus the background n-grams (Figure 2).

In order for the features to be useful for either technical term extraction or keyphrase extraction, we would like to see as little overlap as possible between the histograms of the positive and the negative examples. This seems to hold true for the bigrams, but not for the unigrams. In the unigram case we can see only a weak tendential difference in the histogram densities of the positive and negative examples.

We should point out that the mode of the histogram densities for the negative examples falls very close to the mean and standard deviation of a uniform distribution over the period 1800–2008. These would occur around 1904 and 60.48 respectively.

In all cases, the histograms for the technical terms and the keyphrases are very similar. This might not seem very surprising given that all the data is derived from the SemEval-2010 dataset, but it bears mentioning that these were annotated by different people and, more importantly, using very different annotation criteria.

We omit trigrams and higher order n-grams from consideration.

It is unlikely that higher order n-grams would help in keyphrase and technical term extraction, because the Google Ngrams dataset excludes any n-gram that has a total frequency of occurrence less than 50. This means that less frequently used n-grams, such as technical terms, as well as higher order n-grams, are likely to be missing (see Table 1). This problem is not very severe for unigrams and bigrams, but technical term trigrams suffer from a data sparsity problem severe enough to essentially render them useless. Part of this problem might be due to the large mismatch between the dataset used for evaluation and the Google Ngram dataset used to identify neology. If we were to use an external dataset from a domain more similar to the evaluation set, then we would expect to find a greater portion of the n-grams in the external dataset.

          Technical terms   Background
Unigram   97.5 %            100 %
Bigram    85.0 %            97.0 %
Trigram   25.5 %            71.0 %

Table 1: The ratio of n-grams in each class in the Chaimongkol-Aizawa dataset that occur in the Google Ngrams dataset. The percentages have been generated by choosing a random sample of 200 unigrams of each class, 200 bigrams of each class, etc., and checking whether the n-gram occurs in the Google Ngrams dataset using the web interface.

6 Evaluation on technical term extraction

In order to demonstrate that neology, as characterized by the features µ and σ, can be used to automatically extract technical terms, we implement a simple technical term extractor using these as features. We generally follow the approach taken by Chaimongkol and Aizawa (2013) and implement a conditional random field model to BIO-tag the dataset. The major difference in implementation is that we use the neology features extracted from Google Ngrams in the term extractor, and that we do not use features based on clustering.

For the sake of our CRF model, bigrams and unigrams are sufficient, since what we want to do is to obtain features corresponding to each node (i.e. to each unigram) and features corresponding to the links between the nodes (i.e. to each bigram). We might in theory achieve better performance with higher order n-grams, but in reality the results would be severely hampered by the sparsity problems for higher order n-grams.

Similarly to Chaimongkol and Aizawa, we implement a CRF model using the freely available state-of-the-art CRF framework CRFSuite2, using five different feature sets:

1. POS TAGS using the Stanford POS tagger3.

2. WORD SHAPE features extracted similarly to Chaimongkol and Aizawa. These include binary features such as whether the current token is capitalized, uppercased, or alphanumeric.

3. AFFIXES of length up to 4 characters extracted for all tokens. In other words, for the token carbonization we would extract carb-, car-, ca-, c-, -tion, -ion, -on, and -n (see the sketch after this list).

4. TF-IDF for each unigram and bigram in the dataset.

5. NEOLOGY based features, in other words the mean and standard deviation of the Google Ngrams timeline as described in Section 4.
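As an illustration of feature set 3, a minimal sketch (ours, not the authors’ code) of extracting affixes of length up to 4 characters for a token:

```python
def affix_features(token, max_len=4):
    """Prefixes and suffixes of length 1..max_len, as in feature set 3
    (carbonization -> carb-, car-, ca-, c-, -tion, -ion, -on, -n)."""
    feats = []
    for n in range(1, max_len + 1):
        if n < len(token):  # skip affixes covering the whole token
            feats.append("prefix=" + token[:n] + "-")
            feats.append("suffix=-" + token[-n:])
    return feats

print(affix_features("carbonization"))
# ['prefix=c-', 'suffix=-n', 'prefix=ca-', 'suffix=-on',
#  'prefix=car-', 'suffix=-ion', 'prefix=carb-', 'suffix=-tion']
```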

The mutual information between neighboring tokens in the dataset has also been tried, but this turned out not to have any perceptible effect on the results.

Because CRFSuite cannot handle continuous features, such as tf-idf, µ, or σ, we had to resort to discretizing these by binning. Appropriate bin sizes were established experimentally.
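A minimal sketch (ours) of such a discretization; the bin widths below are made-up placeholders, since the paper only states that bin sizes were tuned experimentally:

```python
def bin_feature(name, value, bin_width):
    """Turn a continuous feature into a categorical one for CRFSuite,
    e.g. mu = 2002.4 with bin_width = 10 becomes 'mu_bin=200'."""
    if value is None:
        return name + "_bin=missing"  # n-gram absent from Google Ngrams
    return "%s_bin=%d" % (name, int(value // bin_width))

print(bin_feature("mu", 2002.4, 10))  # mu_bin=200
print(bin_feature("sigma", 6.3, 5))   # sigma_bin=1
```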

We apply the system to the labeled dataset, where we attempt to classify each token as a positive or negative example: positive examples are those that are part of a technical term compound, and negative those that are part of the background. We use the full dataset and evaluate using 10-fold cross-validation.

Using all features, the system achieves an F-score around 0.7 for the technical term tokens, and an F-score around 0.9 for the non-technical term tokens.

To compare the different features with each other, we evaluate their performance individually (Table 2). The best performing feature turns out to be the affixes, although our neology features are quite comparable in performance. Neology performs better than all other features except affixes.

2 http://www.chokkan.org/software/crfsuite/

3 http://nlp.stanford.edu/software/tagger.shtml


Page 18: Proceedings of the ACL 2015 Workshop on Novel ... · University of Singapore and Dr. Preslav Nakov, senior scientist at Qatar Computing Research Institute. After a rigorous review

             Technical terms         Non-technical terms
             P      R      F1        P      R      F1
POS tags     0.734  0.118  0.204     0.835  0.991  0.906
Word shape   0.659  0.248  0.361     0.853  0.971  0.909
Affixes      0.673  0.530  0.593     0.900  0.943  0.921
tf-idf       0.600  0.244  0.367     0.852  0.964  0.904
Neology      0.637  0.423  0.509     0.881  0.947  0.913

Table 2: Term extractor performance in terms of correctly labeled tokens. Here, the system is only using a single feature class in each trial in order to compare the relative performance of each feature class.

              Technical terms         Non-technical terms
              P      R      F1        P      R      F1
All features  0.728  0.673  0.700     0.929  0.944  0.936
− POS tags    0.728  0.656  0.690     0.925  0.946  0.935
− Word shape  0.719  0.671  0.694     0.928  0.942  0.935
− Affixes     0.691  0.634  0.661     0.920  0.937  0.929
− tf-idf      0.717  0.643  0.678     0.923  0.944  0.933
− Neology     0.715  0.647  0.679     0.923  0.943  0.933

Table 3: Term extractor performance in terms of correctly labeled tokens. Here, the system is using all but one feature class in each trial in order to compare the relative performance drop when each feature class is removed from the classifier.

It might be mentioned that these results are calculated over all tokens, even those where the neology features are missing because the corresponding n-grams do not occur in the Google Ngrams dataset. It is likely that the performance of the neology features would be higher if these were excluded from consideration. This might seem like cheating, but we should consider what would happen if we were to use another dataset with greater coverage, or some future improved version of the Google Ngrams dataset with greater coverage.

We also perform an ablation experiment to see how much the performance drops when excluding individual feature classes (Table 3). Similarly, the biggest drop occurs when excluding affixes. In this case, however, the differences between the different features are quite modest, which seems to imply that each single feature does not contain much information that is not also contained in the other features.

Compared to tf-idf, the neology feature is often able to correctly identify technical term spans containing terms which are also frequent in the remainder of the document collection, such as technical terms containing words like: network, computer, function, algorithm, complexity, data, server, model, or vector. It is much less easy to summarize where neology works well compared to the other features besides tf-idf.

Neology features are much less effective when the technical terms coincide with general language terms, for instance worm, precision, or MAP. This is generally only a problem in the unigram case, and bigrams such as computer worm or average precision generally do not have this problem.

7 Discussion

In this paper we have shown that technical terms tend to be recently coined, and that this statistical tendency is strong enough that it allows us to extract technical terms with reasonable accuracy. We have also shown that this statistical feature of technical terms seems to hold true for keyphrases as well, and we therefore maintain that it is reasonable that similar features might also be useful in keyphrase extraction. We should not expect equally high performance in keyphrase extraction, however, since in keyphrase extraction we are not only interested in whether the output keyphrases are terms in the relevant domain, but also in whether they are significant in the document under consideration. What we do suggest is that neology can be useful for keyphrase extraction when used in concert with other features, such as tf-idf, that indicate significance or topicality.

The extraction of either technical terms or keyphrases fundamentally depends upon the assumption that these are biased in certain ways. For instance, a common assumption taken in keyphrase extraction is that keyphrases are biased to occur more frequently at certain positions in the document. Another common assumption is that technical terms and keyphrases are biased to occur with different frequencies in certain communities, or that the contexts in which keyphrases and technical terms appear differ between different communities. Similarly to the position and community bias of keyphrases, we suggest that keyphrases also have a time bias: the keyphrases of a document are skewed to be overrepresented in the contemporary and subsequent literature, but likely to be absent or severely underrepresented in the precedent literature.

The very high values of the means and the very low values of the standard deviations observed for the technical terms in Section 5 suggest that the majority of the technical term bigrams studied in this paper come from terms that only appeared after 1950. This might be explained by the fact that the datasets we use here are derived from the SemEval-2010 dataset, which is strongly biased towards computer science literature. It seems reasonable that the separation between the classes should generally be stronger in subject fields where the terminology tends to be very recent coinage than in fields with more mature terminology. If this is true, then the approach we propose here should work well for subject fields where the terminology is rapidly changing, and where the need for automatic extraction methods is arguably the greatest.

Acknowledgements

This work was supported by the Grant-in-Aid for Scientific Research (B) (15H02754) of the Japan Society for the Promotion of Science (JSPS).

References

Panot Chaimongkol and Akiko Aizawa. 2013. Utilizing LDA Clustering for Technical Term Extraction. Proceedings of the Nineteenth Annual Meeting of the Association for Natural Language Processing, Nagoya, pages 686–689.

Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. 2000. Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries, 3:115–130.

Jessica Harris. 1979. Terminology change: Effect on index vocabularies. Information Processing & Management, 15(2):77–88.

Kazi S. Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. Proceedings of ACL, pages 1262–1273.

Anette Hulth, Jussi Karlgren, and Anna Jonsson. 2006. Automatic keyword extraction using domain knowledge. Computational Linguistics and Intelligent Text Processing, pages 472–482.

John S. Justeson and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26.

Koen Lemmens. 2011. The slow dynamics of legal language: Festina lente? Terminology, 17:74–93.

Yuri Lin, Jean-Baptiste Michel, Erez L. Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 169–174.

Patrice Lopez and Laurent Romary. 2010. HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID. Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248–251.

Olena Medelyan and Ian H. Witten. 2006. Thesaurus based automatic keyphrase indexing. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 6–7.

Olena Medelyan, Eibe Frank, and Ian H. Witten. 2009. Human-competitive tagging using automatic keyphrase extraction. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 3:1318–1327.

Marcis Pinnis, Nikola Ljubesic, Dan Stefanescu, Inguna Skadina, Marko Tadic, and Tatiana Gornostay. 2012. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. Proceedings of the 10th Conference on Terminology and Knowledge Engineering, pages 193–208.

Juan C. Sager, David Dungworth, and Peter F. McDonald. 1980. English special languages: principles and practice in science and technology. John Benjamins Publishing Company.

Jorge Vivaldi and Horacio Rodríguez. 2007. Evaluation of terms and term extraction systems: a practical approach. Terminology, 13:225–248.



Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 10–17, Beijing, China, July 30, 2015. ©2015 Association for Computational Linguistics

Counting What Counts: Decompounding for Keyphrase Extraction

Nicolai Erbs†‡, Pedro Bispo Santos†, Torsten Zesch§, Iryna Gurevych†‡

† UKP Lab, Technische Universität Darmstadt
‡ UKP Lab, German Institute for Educational Research
§ Language Technology Lab, University of Duisburg-Essen

http://www.ukp.tu-darmstadt.de

Abstract

A core assumption of keyphrase extraction is that a concept is more important if it is mentioned more often in a document. Especially in languages like German that form large noun compounds, frequency counts might be misleading, as concepts “hidden” in compounds are not counted. We hypothesize that using decompounding before counting term frequencies may lead to better keyphrase extraction. We identified two effects of decompounding: (i) enhanced frequency counts, and (ii) more keyphrase candidates. We created two German evaluation datasets to test our hypothesis and analyzed the effect of additional decompounding for keyphrase extraction.

1 Introduction

Most approaches for the automatic extraction of keyphrases are based on the assumption that the more frequently a term or phrase is mentioned, the more important it is. Consequently, most extraction algorithms apply some kind of normalization, e.g. lemmatization or noun chunking (Hulth, 2003; Mihalcea and Tarau, 2004), in order to arrive at accurate counts. However, especially in Germanic languages, the frequent use of noun compounds has an adverse effect on the reliability of frequency counts. Consider for example a German document that talks about Lehrer (Engl.: teacher) without ever mentioning the word “Lehrer” at all, because it is always part of compounds like Deutschlehrer (Engl.: German teacher) or Gymnasiallehrer (Engl.: grammar school teacher). Thus, we argue that the problem can be solved by splitting noun compounds into meaningful parts, i.e. by performing decompounding. Figure 1 gives an example of decompounding in German: the compound Deutschlehrer consists of the parts Deutsch (Engl.: German) and Lehrer (Engl.: teacher).

Figure 1: Decompounding of the German term Deutschlehrer (Engl.: German teacher) into Deutsch and Lehrer.

In this paper, we propose a comprehensive decompounding architecture and analyze the performance of four state-of-the-art algorithms. We then perform experiments on three German datasets, of which two have been created particularly for these experiments, in order to analyze the impact of decompounding on standard keyphrase extraction approaches. Decompounding has previously been used successfully in other applications, e.g. in machine translation (Koehn and Knight, 2003), information retrieval (Hollink et al., 2004; Alfonseca et al., 2008b; Alfonseca et al., 2008a), speech recognition (Ordelman, 2003), and word prediction (Baroni et al., 2002). Hasan and Ng (2014) have shown that infrequency errors are a major cause of lower keyphrase extraction results. To the best of our knowledge, we are the first to examine the influence of decompounding on keyphrase extraction.

2 Decompounding

Decompounding is usually performed in two steps: (i) a splitting algorithm creates candidates, and (ii) a ranking function decides which candidates are best suited for splitting the compound. For example, Aktionsplan has two splitting candidates: Aktion(s)+plan (Engl.: action plan) and Akt+ion(s)+plan (Engl.: nude ion plan).1 After generating the candidates, the ranking function assigns a score to each splitting candidate, including the original compound. We will now take a closer look at possible splitting algorithms and ranking functions.

1 The additional ‘s’ is a linking morpheme (Langer, 1998).

2.1 Splitting algorithms

Left-to-Right grows a window over the input from left to right. When a word from a dictionary is found, a split is generated. The algorithm is then applied recursively to the rest of the input.
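A minimal sketch (ours, not the authors’ code) of this Left-to-Right strategy; the dictionary is a toy placeholder, and linking morphemes such as the ‘s’ in Aktionsplan are ignored for brevity:

```python
def left_to_right_split(word, dictionary, min_len=3):
    """Grow a window from the left; when a dictionary word is found,
    split it off and recurse on the remainder."""
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        if word[:i] in dictionary:
            rest = left_to_right_split(word[i:], dictionary, min_len)
            if rest:
                return [word[:i]] + rest
    # No split found: keep the word whole if it is a dictionary word.
    return [word] if word in dictionary else None

dictionary = {"deutsch", "lehrer", "deutschlehrer"}
print(left_to_right_split("Deutschlehrer", dictionary))  # ['deutsch', 'lehrer']
```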

JWord Splitter2 performs a dictionary look-up from left to right, but continues this process if the remainder of the word is not in the dictionary; once the remainder is found, it creates a split and stops. Banana Splitter3 searches for the word from the right to the left, and if there is more than one possibility, the one with the longest split on the right side is taken as candidate. Data Driven counts, for every position in the input, the number of words in a dictionary which contain a split at this position as prefix or suffix. A split is made at the position with the largest difference between prefix and suffix counts (Larson et al., 2000). ASV Toolbox4 uses a trained Compact Patricia Tree to recursively split parts from the beginning and end of the word (Biemann et al., 2008). Unlike the other algorithms, it generates only a single split candidate at each recursive step. For that reason, it does not need a ranker. It is also the only supervised approach tested (it uses lists of existing compounds).

2 github.com/danielnaber/jwordsplitter
3 niels.drni.de/s9y/pages/bananasplit.html
4 wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/

2.2 Ranking functions

As stated earlier, the ranking functions are as important as the splitting algorithms, since a ranking function is responsible for assigning scores to each possible decompounding candidate. For the ranking functions, Alfonseca et al. (2008b) use a geometric mean of unigram frequencies (Equation 1) and a mutual information function (Equation 2).

$$r_{\mathrm{Freq}}() = \left( \prod_i^N f(w_i) \right)^{\frac{1}{N}} \qquad (1)$$

$$r_{\mathrm{M.I.}}() = \begin{cases} -f(c)\,\log f(c) & \text{if } N = 1 \\ \dfrac{1}{N-1} \displaystyle\sum_i^{N-1} \log \dfrac{\mathrm{bigr}(w_i, w_{i+1})}{f(w_i)\, f(w_{i+1})} & \text{otherwise} \end{cases} \qquad (2)$$

In these equations, N is the number of fragments the candidate has, w is a fragment, f(w) is the relative unigram frequency of that fragment, bigr(w_i, w_j) is the relative bigram frequency of the fragments w_i and w_j, and c is the compound itself, without being split.

Splitter          Ranker   P_comp   R_comp   P_split
Left-to-right     Freq.    .64      .58      .71
                  M.I.     .26      .08      .33
JWord Splitter    Freq.    .67      .63      .79
                  M.I.     .59      .20      .73
Banana Splitter   Freq.    .70      .40      .83
                  M.I.     .66      .16      .81
Data Driven       Freq.    .49      .18      .70
                  M.I.     .40      .04      .58
ASV ToolBox       —        .80      .75      .87

Table 1: Evaluation results of state-of-the-art decompounding systems.
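A minimal sketch (ours) of the geometric-mean ranker in Equation 1; the relative frequencies below are made up, whereas in the paper they come from the Web1T corpus:

```python
import math

def rank_freq(fragments, unigram_freq):
    """Geometric mean of relative unigram frequencies (Equation 1)."""
    freqs = [unigram_freq.get(w, 0.0) for w in fragments]
    if 0.0 in freqs:
        return 0.0  # an unseen fragment rules the candidate out
    return math.exp(sum(math.log(f) for f in freqs) / len(freqs))

# The sensible split outranks the nonsensical one.
f = {"aktion": 1e-5, "plan": 2e-5, "akt": 3e-6, "ion": 4e-6}
print(rank_freq(["aktion", "plan"], f) > rank_freq(["akt", "ion", "plan"], f))  # True
```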

2.3 Decompounding experiments

We use the corpus created by Marek (2006) as a gold standard to evaluate the performance of the decompounding methods. This corpus contains a list of 158,653 compounds, stating how each compound should be decompounded. The compounds were obtained from the issues 01/2000 to 13/2004 of the German computer magazine c’t5 in a semi-automatic approach. Human annotators reviewed the list to identify and correct possible errors. For calculating the required frequencies, we use the Web1T corpus6 (Brants and Franz, 2006).

Koehn and Knight (2003) use a modified version of precision and recall for evaluating decompounding performance. Following Santos (2014), we decided to apply these metrics for measuring the splitting algorithms’ and ranking functions’ performance. The following counts were used for evaluating the experiments on the compound level: correct split (cs), a split fragment which was correctly identified, and wrong split (ws), a split fragment which was wrongly identified. P_comp and R_comp evaluate decompounding on the level of compounds, and we propose to use

$$P_{\mathrm{split}} = \frac{cs}{cs + ws}$$

to evaluate on the level of splits.
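For instance, a minimal sketch (ours, simplifying the matching of split fragments to set membership) of P_split over a list of predictions:

```python
def split_precision(predicted, gold):
    """P_split = cs / (cs + ws): correctly identified split fragments
    over all split fragments produced by the system."""
    cs = ws = 0
    for pred_frags, gold_frags in zip(predicted, gold):
        for frag in pred_frags:
            if frag in gold_frags:
                cs += 1
            else:
                ws += 1
    return cs / (cs + ws)

pred = [["aktion", "plan"], ["aktions", "plan"]]
gold = [["aktion", "plan"], ["aktion", "plan"]]
print(split_precision(pred, gold))  # 0.75: 3 of 4 fragments correct
```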

As we focus in this work on the influence of decompounding on improving the accuracy of frequency counts, P_split is the best metric in our case. We can see in Table 1 that the ASV Toolbox splitting algorithm is the best performing system with respect to P_split. Thus, we select it as the decompounding algorithm in our keyphrase extraction experiments described in the next section.

5 www.heise.de/ct/
6 German version (see https://catalog.ldc.upenn.edu/LDC2009T25).

Dataset              peDOCS   MedForum   Pythag.
Number of doc.       2,644    102        60
∅ doc. length        14,016   135        277
Median doc. length   809      104        68
# keyphrases         30,051   853        622
∅ keys / doc.        11.37    8.41       10.37
∅ tokens / key       1.15     1.07       1.30
∅ characters / key   13.27    10.28      12.22

Table 2: Corpus statistics of datasets.

3 Experiments

3.1 Datasets

For our evaluation, we could not rely on English datasets, as English has only very little compounding and thus the expected effect of decompounding is small. German is a good choice, as it is infamous for its heavy compounding, e.g. the well-known Donaudampfschifffahrtskapitän (Engl.: captain of a steam ship on the river Danube). For German keyphrase extraction, we can use the peDOCS dataset described in Erbs et al. (2013), and we created two additional datasets consisting of summaries of lesson transcripts (Pythagoras) and posts from a medical forum (MedForum). Table 2 summarizes their characteristics.

peDOCS consists of peer-reviewed articles, dissertations, and books from the educational domain published by researchers. The gold standard for this dataset was compiled by professional indexers and should thus be of high quality. We present two novel keyphrase datasets consisting of German texts. MedForum is composed of posts from a medical forum.7 To our knowledge, it is the first dataset with keyphrase annotations from user-generated data in German. Two German annotators with university degrees identified a set of keyphrases for every document and, following Nguyen and Kan (2007), the union of both sets are the final gold keyphrases. The Pythagoras dataset contains summaries of lesson transcripts compiled in the Pythagoras project.8 Two annotators identified keyphrases after a training phase with discussion of three documents. As in the MedForum dataset, the gold standard consists of the union of lemmatized keyphrases by both annotators. All datasets contain an unranked list of keyphrases.

7 www.medizin-forum.de/
8 www.dipf.de/en/research/projects/pythagoras

The peDOCS dataset is by far the largest of the sets, since it has been created over the course of several years. MedForum and Pythagoras contain fewer documents, but each document is annotated by a fixed pair of human annotators. The average number of keyphrases is highest for peDOCS and lowest for MedForum. The length of the document also influences the number of keyphrases, as short documents have fewer keyphrase candidates. Keyphrases in all three datasets are on average very short. Figure 1 shows a rather specific keyphrase which, however, consists of only one token. We believe that keyphrase extraction approaches benefit more from decompounding in the case of short documents. Longer documents provide more statistical data, which reduces the need for the additional statistical data obtained with decompounding.

3.2 Experimental Setup

For preprocessing, we rely on components from the DKPro Core framework (Eckart de Castilho and Gurevych, 2014) and on DKPro Lab (de Castilho and Gurevych, 2011) for building experimental pipelines. We use the Stanford Segmenter9 for tokenization, and TreeTagger (Schmid, 1994; Schmid, 1995) for lemmatization and part-of-speech tagging. Finally, we perform stopword removal and decompounding as described in Section 2. It should be noted that in most preprocessing pipelines, decompounding should be the last step, as it heavily influences POS-tagging. We extract all lemmas in the document as keyphrase candidates and rank them according to basic ranking approaches based on frequency counts and the position in the document. We do not use more sophisticated extraction approaches, as we want to examine the influence of decompounding as directly as possible. However, it has been shown that frequency-based heuristics are a very strong baseline (Zesch and Gurevych, 2009), and even supervised keyphrase extraction methods such as KEA (Witten et al., 1999) use term frequency and position as the most important features and will be heavily influenced by decompounding.

9 nlp.stanford.edu/software/segmenter.shtml

We evaluate the following ranking methods: tf-idf_constant ranks candidates according to their term frequency f(t, d) in the document. tf-idf decreases the impact of words that occur in most documents. The term frequency count is normalized with the inverse document frequency in the test collection (Salton and Buckley, 1988).

$$\text{tf-idf} = f(t, d) \cdot \log \frac{|D|}{|d \in D : t \in d|} \qquad (3)$$

In this formula, |D| is the number of documents and |d ∈ D : t ∈ d| is the number of documents mentioning term t. As some document collections may be too small to allow computing reliable frequency estimates, we also evaluated tf-idf_web. Here, the document frequency is approximated by the frequency counts from the Web1T corpus. Finally, we take the position of a candidate as a baseline: the closer the keyword is to the beginning of the text, the higher it is ranked. This is not dependent on frequency counts, but decompounding can still have an influence if a compound that appears early in the document is split into parts that are now also possible keyphrase candidates. We test each of the ranking methods with (w) and without (w/o) decompounding.
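A minimal sketch (ours) of the tf-idf ranking in Equation 3 over lemmatized documents; with decompounding, the lemma lists would additionally contain the compound parts:

```python
import math
from collections import Counter

def tfidf_ranking(doc, collection):
    """Rank the lemmas of one document by Equation 3:
    tf-idf = f(t, d) * log(|D| / |{d in D : t in d}|)."""
    tf = Counter(doc)
    df = Counter(t for d in collection for t in set(d))
    scores = {t: tf[t] * math.log(len(collection) / df[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)

docs = [["deutsch", "lehrer", "deutschlehrer", "lehrer"],
        ["schule", "lehrer"],
        ["plan", "aktion"]]
print(tfidf_ranking(docs[0], docs))  # e.g. ['deutsch', 'deutschlehrer', 'lehrer']
```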

3.3 Evaluation metrics

For the keyphrase experiments, we compare results in terms of precision and recall of the top-5 keyphrases (P@5), Mean Average Precision (MAP), and R-precision (R-p).10 MAP is the average precision of extracted keyphrases from rank 1 to the number of extracted keyphrases, which can be much higher than ten. R-precision11 is the ratio of true positives in the set of extracted keyphrases when exactly as many keyphrases are extracted as there are gold keyphrases.12
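A minimal sketch (ours) of P@5 and R-precision as just described:

```python
def precision_at_k(ranked, gold, k=5):
    """P@k: fraction of the top-k extracted keyphrases found in the gold set."""
    return sum(1 for t in ranked[:k] if t in gold) / k

def r_precision(ranked, gold):
    """R-precision: precision when exactly as many keyphrases are
    extracted as there are gold keyphrases."""
    r = len(gold)
    return sum(1 for t in ranked[:r] if t in gold) / r

gold = {"lehrer", "schule", "unterricht"}
ranked = ["lehrer", "plan", "schule", "aktion", "tag", "unterricht"]
print(precision_at_k(ranked, gold))  # 0.4 (2 of the top 5)
print(r_precision(ranked, gold))     # 0.666... (2 of the top 3)
```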

4 Results and discussion

In order to assess the influence of decompounding on keyphrase extraction, we evaluate the selected extraction approaches with (w/) and without (w/o) decompounding. The final evaluation results are influenced by two factors:

10 Using the top-5 keyphrases best reflects the average number of keyphrases in our evaluation datasets and is common practice in related work (Kim et al., 2013).

11 This measure is commonly used in information retrieval and was first used for keyphrase identification by Zesch and Gurevych (2009).

12 Refer to Buckley and Voorhees (2000) for an overview of evaluation measures and their characteristics.

Method            ΔP@5   ΔR@5   ΔR-p.  ΔMAP
Position          .000   .000   .000   .000
tf-idf_constant   .039   .030   .022   .012
tf-idf            .031   .024   .025   .015
tf-idf_web        .035   .021   .024   .012

Table 3: Difference of results with decompounding on the MedForum dataset.

Enhanced frequency counts: As discussed before, the frequency counts become more accurate, which should lead to higher-quality keyphrases being extracted. This affects frequency-based rankings.

More keyphrase candidates: The number of keyphrase candidates might increase, as some of the parts created by decompounding may not have been mentioned in the document before. This is the special case of an enhanced frequency count going up from 0 to 1.

We perform experiments to investigate the influence of both effects: first, the enhanced frequency counts, and second, the newly introduced keyphrase candidates.

4.1 Enhanced frequency counts

In order to isolate this effect, we limit the list of keyphrase candidates to those that are already present in the document without decompounding. We selected the MedForum dataset for this analysis because it contains many compounds and has the shortest documents, which we believe makes it best suited for an additional decompounding step.

Table 3 shows the improvements of evaluation results for keyphrase extraction approaches on the MedForum dataset. The improvement is measured as the difference in evaluation metrics between using extraction approaches with decompounding and not using any decompounding. The table does not show absolute numbers but the increase in performance; absolute values are not comparable to other experimental settings, because all gold keyphrases that do not appear in the text as lemmas are disregarded. We can thus analyze the effect of enhanced frequency counts in isolation. Results show that our decompounding extension increases results for tf-idf_constant, tf-idf, and tf-idf_web on the MedForum dataset when considering only candidates that are extracted without decompounding. Decompounding does not affect results for the position baseline, as it is not based on frequency counting. For the frequency-based approaches, the effect is rather small in general, but consistent across all metrics and methods.


              Decompounding
Dataset        w/o     w/      Δ
peDOCS         .614    .632    .018
MedForum       .592    .631    .038
Pythagoras     .624    .625    .002

Table 4: Maximum recall for keyphrase extraction with and without decompounding on the three datasets.

The decompounding extension, however, has the additional effect of introducing further keyphrase candidates.

4.2 More keyphrase candidates

The second effect of decompounding is that new terms are introduced that cannot be found in the original document. Table 4 shows the maximum recall for lemmas with and without decompounding on all German datasets. The maximum recall is obtained by assuming that, given a list of candidates, the best possible set of keyphrases is extracted. Keyphrase extraction with decompounding increases the maximum recall on all datasets, by up to 3.8 percentage points. Note that the increase is due to more keyphrase candidates being extracted, which increases the importance of the final ranking. The increase is highest for MedForum and lowest for Pythagoras. Pythagoras comprises summaries of lesson transcripts for students in the ninth grade, so teachers are less likely to use complex words which need to be decompounded. The smaller increase for peDOCS compared to MedForum is due to the longer peDOCS documents: the longer a document is, the more likely a part of a compound also appears as an isolated token, which limits the increase in maximum recall. peDOCS nevertheless has a higher maximum recall than the collections with shorter documents, because documents with more tokens also have more candidates. MedForum comprises forum data, which contains both medical terms and informal descriptions of such terms. Furthermore, its gold keyphrases were assigned to assist others in searching. This leads to documents containing terms like Augenschmerzen (Engl.: eye pain) for which the gold keyphrase Auge (Engl.: eye) was assigned.
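The maximum-recall figures in Table 4 correspond to an oracle that picks the best possible keyphrases from the candidate list; a sketch of the computation:

    def maximum_recall(candidates, gold):
        # Share of gold keyphrases present among the candidates at all;
        # decompounding can only add candidates, so it can only raise
        # (never lower) this upper bound.
        return len(set(candidates) & set(gold)) / float(len(gold))

    # The gain reported in Table 4 is then
    # maximum_recall(lemmas + compound_parts, gold) - maximum_recall(lemmas, gold)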

4.3 Combined results

Previously, we analyzed the effects of decompounding in isolation; now we analyze the combination of enhanced frequency counts and more keyphrase candidates on the overall results. Table 5 shows the complete results for the German datasets and the described keyphrase extraction methods, with and without decompounding.

For the peDOCS dataset, we see a negative effect of decompounding. Only the position baseline and tf-idf_constant benefit from decompounding in terms of mean average precision (MAP), while they yield lower results in terms of the other evaluation metrics. The improvement of the position baseline in terms of MAP might be due to several correctly extracted keyphrases beyond the top-5 extracted keyphrases. We have previously discussed that peDOCS has on average the longest documents and most likely contains all gold keyphrases multiple times in the document text. For this reason, frequency-based approaches do not benefit from the additional frequency information obtained from compounds. Many compounds are composed of common words which already appear in the document. On the contrary, more common keyphrases are weighted higher, which hurts results in the case of peDOCS with its highly-specialized and longer keyphrases. Depending on the task, this might be undesired behavior.13

The only dataset for which decompounding yields higher results is the MedForum dataset. Results improve with decompounding for tf-idf_constant and tf-idf. As can be seen in Table 4, enhanced frequency counts improve results and yield a higher maximum recall. Contrary to the other tf-idf configurations, results for tf-idf_web decrease with decompounding. This leads to the observation that, besides the effects of enhanced ranking and more keyphrase candidates, a third effect influences the results of keyphrase extraction methods: the ranking of the additional keyphrase candidates obtained from decompounding. These candidates might appear infrequently in isolation and are ranked high if external document frequencies (df values) are used. Compound parts which do not appear in isolation14 (and are hence no good keyphrases) are ranked high in the case of tf-idf_web because their document frequency from the web is very low. In the case of classic tf-idf they are ranked low, because they are normalized with document frequencies from a corpus to which decompounding has been applied.

13 When searching for documents, highly-specialized keyphrases might be better suited, while common keyphrases might be better suited for clustering documents.

14 The verb begießen (Engl.: to water) can be split into the verb gießen (Engl.: to pour) and the prefix be, which does not appear as an isolated word.


                              Precision@5           Recall@5              R-precision           MAP
Dataset     Method            w/o    w/     Δ       w/o    w/     Δ       w/o    w/     Δ       w/o    w/     Δ

peDOCS      Upper bound       .856   .864   .012    .393   .403   .010    .614   .632   .018    .614   .632   .018
            Position          .096   .068   -.028   .042   .030   -.012   .092   .080   -.012   .083   .086   .003
            tf-idf_constant   .170   .160   -.010   .075   .070   -.004   .127   .125   -.002   .123   .123   .001
            tf-idf            .137   .117   -.020   .060   .051   -.009   .107   .088   -.019   .112   .099   -.014
            tf-idf_web        .188   .168   -.020   .083   .074   -.009   .139   .126   -.013   .139   .129   -.010

MedForum    Upper bound       .867   .890   .023    .397   .422   .025    .592   .631   .038    .592   .631   .038
            Position          .082   .073   -.010   .049   .043   -.006   .101   .090   -.011   .142   .130   -.012
            tf-idf_constant   .149   .161   .012    .089   .096   .007    .144   .145   .001    .165   .162   -.003
            tf-idf            .235   .282   .047    .140   .168   .028    .210   .234   .025    .203   .210   .007
            tf-idf_web        .231   .165   -.067   .138   .098   -.040   .223   .159   -.064   .206   .180   -.027

Pythagoras  Upper bound       .941   .942   .001    .344   .344   .001    .624   .625   .002    .624   .625   .002
            Position          .030   .023   -.007   .014   .011   -.003   .044   .022   -.022   .106   .075   -.031
            tf-idf_constant   .137   .087   -.050   .066   .042   -.024   .143   .103   -.040   .153   .121   -.032
            tf-idf            .150   .150   .000    .072   .072   .000    .113   .114   .001    .141   .136   -.005
            tf-idf_web        .187   .100   -.087   .090   .048   -.042   .205   .102   -.103   .191   .136   -.055

Table 5: Results for keyphrase extraction approaches without (w/o) and with (w/) decompounding.

In the case of tf-idf_web, no decompounding has been applied to the underlying web corpus. The effect of the poor ranking of newly introduced keyphrase candidates needs to be investigated further by conducting a manual analysis of the decompounding performance and the creation of non-words.

For the Pythagoras dataset, keyphrase extraction approaches yield similar results as for peDOCS. Decompounding decreases results; only the results for tf-idf remain stable. As seen earlier (Table 4), decompounding barely raises the maximum recall (only by .002). As before in the case of the MedForum dataset, tf-idf_web is influenced negatively by the decompounding extension: results decrease by .103 in terms of R-precision, a reduction of more than 50%. The ranking is hurt by the many keyphrase candidates that appear only as parts of compounds; they are ranked high because they infrequently appear as separate words. Considering the characteristics of keyphrases in Pythagoras, we see that they are rather long, with 12.22 characters per keyphrase on average. This leads to the observation that the style of the keyphrases has an effect on the applicability of decompounding: datasets with more specific keyphrases are less likely to benefit from it.

5 Conclusions and future work

We presented a decompounding extension for keyphrase extraction. We created two new datasets to analyze its effects and showed that decompounding has the potential to increase results for keyphrase extraction on shorter German documents. We identified two effects of decompounding relevant for keyphrase extraction: (i) enhanced frequency counts, and (ii) more keyphrase candidates. We find that the first effect slightly increases results when updating the term frequencies, while including the second effect in the evaluation reduces results for two of the three datasets. We thus conclude that the effect of decompounding for keyphrase extraction requires further analysis, but it may be a useful feature for supervised systems (Berend and Farkas, 2010).

In the future, we propose to further analyze the characteristics of good keyphrases and whether they are often compounds. We see potential in better decompounding approaches, as any improvement on this task may have positive effects on keyphrase extraction. We would also like to investigate other effects that make tasks like keyphrase extraction especially hard. Named entity disambiguation might improve results further, as some concepts are mentioned frequently in a text but each time with a different surface form. We make our experimental framework available to the community to foster future research.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, by the Klaus Tschira Foundation under project No. 00.133.2008, and by the German Institute for Educational Research (DIPF). We thank the anonymous reviewers for their helpful comments.


References

Enrique Alfonseca, Slaven Bilac, and Stefan Pharies. 2008a. Decompounding Query Keywords from Compounding Languages. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, pages 253–256, Stroudsburg, PA, USA. Association for Computational Linguistics.

Enrique Alfonseca, Slaven Bilac, and Stefan Pharies. 2008b. German Decompounding in a Difficult Corpus. In Computational Linguistics and Intelligent Text Processing, volume 4919 of Lecture Notes in Computer Science, pages 128–139. Springer Berlin Heidelberg.

Marco Baroni, Johannes Matiasek, and H. Trost. 2002. Predicting the Components of German Nominal Compounds. In ECAI, pages 1–12.

Gabor Berend and Richard Farkas. 2010. SZTERGAK: Feature Engineering for Keyphrase Extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 186–189, Stroudsburg, PA, USA.

Chris Biemann, Uwe Quasthoff, Gerhard Heyer, and Florian Holz. 2008. ASV Toolbox: a Modular Collection of Language Exploration Tools. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 1760–1767, Paris. European Language Resources Association (ELRA).

Thorsten Brants and Alex Franz. 2006. Web 1T 5-Gram Version 1. Linguistic Data Consortium, Philadelphia.

Chris Buckley and Ellen M. Voorhees. 2000. Evaluating Evaluation Measure Stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '00, pages 33–40, New York, New York, USA.

Richard Eckart de Castilho and Iryna Gurevych. 2011. A Lightweight Framework for Reproducible Parameter Sweeping in Information Retrieval. In Proceedings of the 2011 Workshop on Data Infrastructures for Supporting Information Retrieval Evaluation, DESIRE '11, pages 7–10, New York, NY, USA. ACM.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A Broad-coverage Collection of Portable NLP Components for Building Shareable Analysis Pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT at COLING 2014, pages 1–11.

Nicolai Erbs, Iryna Gurevych, and Marc Rittberger. 2013. Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment. D-Lib Magazine, 19(9/10):1–16.

Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of the Association for Computational Linguistics (ACL), Baltimore, Maryland. Association for Computational Linguistics.

Vera Hollink, Jaap Kamps, Christof Monz, and Maarten de Rijke. 2004. Monolingual Document Retrieval for European Languages. Information Retrieval, 7(1/2):33–52.

Anette Hulth. 2003. Improved Automatic Keyword Extraction given more Linguistic Knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 216–223.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2013. Automatic Keyphrase Extraction from Scientific Articles. Language Resources and Evaluation, 47:723–742.

Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound Splitting. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics - Volume 1, EACL '03, pages 187–193, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stefan Langer. 1998. Zur Morphologie und Semantik von Nominalkomposita. In Tagungsband der 4. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS), pages 83–97.

Martha Larson, Daniel Willett, Joachim Koehler, and Gerhard Rigoll. 2000. Compound Splitting and Lexical Unit Recombination for Improved Performance of a Speech Recognition System for German Parliamentary Speeches. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP), pages 945–948.

Torsten Marek. 2006. Analysis of German Compounds using Weighted Finite State Transducers. Bachelor thesis, University of Tübingen.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Proceedings of Empirical Methods for Natural Language Processing, pages 404–411.

Thuy Dung Nguyen and Min-Yen Kan. 2007. Keyphrase Extraction in Scientific Publications. In Proceedings of the International Conference on Asian Digital Libraries, volume 4822 of Lecture Notes in Computer Science, pages 317–326.

R. J. F. Ordelman. 2003. Dutch Speech Recognition in Multimedia Information Retrieval. Ph.D. thesis, University of Twente, Enschede, October.

Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5):513–523.


Pedro Bispo Santos. 2014. Using Compound Lists for German Decompounding in a Back-off Scenario. In Workshop on Computational, Cognitive, and Linguistic Approaches to the Analysis of Complex Words and Collocations (CCLCC 2014), pages 51–55.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT-Workshop, volume 21, pages 1–9.

Ian H. Witten, Gordon W. Paynter, and Eibe Frank. 1999. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of the 4th ACM Conference on Digital Libraries, pages 254–255.

Torsten Zesch and Iryna Gurevych. 2009. Approximate Matching for Evaluating Keyphrase Extraction. In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing, pages 484–489.


(Invited Talk) The Web as an Implicit Training Set: Application to Noun Compounds Syntax and Semantics

Preslav Nakov
Qatar Computing Research Institute

[email protected]

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on sub-problems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field.

Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should be only used as a baseline.

In this talk, I will reveal some of the hidden potential of the Web that lies beyond the n-gram, with focus on the syntax and semantics of English noun compounds. First, I will present a highly accurate lightly supervised approach based on surface markers and linguistically-motivated paraphrases that yields state-of-the-art results for noun compound bracketing: e.g., "[[liver cell] antibody]" is left-bracketed, while "[liver [cell line]]" is right-bracketed. Second, I will present a simple unsupervised method for mining implicit predicates that can characterize the semantic relations holding between the nouns in noun compounds, e.g., "malaria mosquito" is a "mosquito that carries/spreads/causes/transmits/brings/infects with/... malaria". Finally, I will show how these ideas can be used to improve statistical machine translation.

About the Speaker:
Preslav Nakov is a Senior Scientist at the Qatar Computing Research Institute (QCRI). He received his Ph.D. in Computer Science from the University of California at Berkeley in 2007 (supported by a Fulbright grant and a UC Berkeley fellowship). Before joining QCRI, Preslav was a Research Fellow at the National University of Singapore. He has also spent a few months at the Bulgarian Academy of Sciences and Sofia University, where he was an honorary lecturer. Preslav's research interests include lexical semantics (in particular, multi-word expressions, noun compound syntax and semantics, and semantic relation extraction), machine translation, the Web as a corpus, and biomedical text processing.

Preslav has been involved in many activities related to lexical semantics. He is a member of the SIGLEX board, he is co-chairing SemEval'2014, SemEval'2015, and SemEval'2016, and he has co-organized several SemEval tasks, e.g., on the semantics of noun compounds, on semantic relation extraction, on sentiment analysis on Twitter, and on community question answering. He co-chaired MWE in 2009 and 2010, as well as other semantics workshops such as RELMS, and he was an area chair of *SEM'2013. He was also a guest co-editor for the 2013 special issue of the journal of Natural Language Engineering on the syntax and semantics of noun compounds, and he is currently a guest co-editor of a special issue of LRE on SemEval-2014 and Beyond. In 2013, he published a Morgan & Claypool book on semantic relation extraction; he has given a tutorial on the same topic at RANLP'2013, and he is giving a similar one at EMNLP'2015.


Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming

Florian Boudin
LINA - UMR CNRS 6241, Université de Nantes, France

[email protected]

Abstract

We introduce a global inference model for keyphrase extraction that reduces over-generation errors by weighting sets of keyphrase candidates according to their component words. Our model can be applied on top of any supervised or unsupervised word weighting function. Experimental results show a substantial improvement over commonly used word-based ranking approaches.

1 Introduction

Keyphrases are words or phrases that capture the main topics discussed in a document. Automatically extracted keyphrases have been found to be useful for many natural language processing and information retrieval tasks, such as summarization (Litvak and Last, 2008), opinion mining (Berend, 2011) or text categorization (Hulth and Megyesi, 2006). Despite considerable research effort, the automatic extraction of keyphrases that match those of human experts remains challenging (Kim et al., 2010).

Recent work has shown that most errors made by state-of-the-art keyphrase extraction systems are due to over-generation (Hasan and Ng, 2014). Over-generation errors occur when a system correctly outputs a keyphrase because it contains an important word, but at the same time erroneously predicts other keyphrase candidates as keyphrases because they contain the same word. One reason these errors are frequent is that many unsupervised systems rank candidates according to the weights of their component words, e.g. (Wan and Xiao, 2008a; Liu et al., 2009), and many supervised systems use unigrams as features, e.g. (Turney, 2000; Nguyen and Luong, 2010).

While weighting words instead of phrases may seem rather blunt, it offers several advantages. In practice, words are usually much easier to extract, match and weight, especially for short documents where many phrases may not be statistically frequent (Liu et al., 2011).

Selecting keyphrase candidates according to their component words may also turn out to be useful for reducing over-generation errors if one can ensure that the importance of each word is counted only once in the set of extracted keyphrases. To do so, keyphrases should be extracted as a set rather than independently. Finding the optimal set of keyphrases is a combinatorial optimisation problem, and can be formulated as an integer linear program (ILP) which can be solved exactly using off-the-shelf solvers.

In this work, we propose an ILP formulation for keyphrase extraction that can be applied on top of any word weighting scheme. Through experiments carried out on the SemEval dataset (Kim et al., 2010), we show that our model increases the performance of both supervised and unsupervised word weighting keyphrase extraction methods.

The rest of this paper is organized as follows. In Section 2, we describe our ILP model for keyphrase extraction. Our experiments are presented in Section 3. In Section 4, we briefly review previous work, and we conclude in Section 5.

2 Method

Our global inference model for keyphrase extraction consists of three steps. First, keyphrase candidates are extracted from the document using heuristic rules. Second, words are weighted using either supervised or unsupervised methods. Third, finding the optimal subset of keyphrase candidates is cast as an ILP and solved using an off-the-shelf solver.

2.1 Keyphrase candidate selection

Candidate selection is the task of identifying the words or phrases that have properties similar to those of manually assigned keyphrases.


First, we apply the following pre-processing steps to the document: sentence segmentation1, word tokenization2 and Part-Of-Speech (POS) tagging3.

Following previous work (Wan and Xiao, 2008a; Bougouin et al., 2013), we use the sequences of nouns and adjectives as keyphrase candidates. Candidates that have less than three characters, that contain only adjectives, or that contain stop-words4 are filtered out. These heuristic rules are designed to avoid spurious instances and keep the number of candidates to a minimum (Hasan and Ng, 2014). All words are stemmed using Porter's stemmer (Porter, 1980).
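A sketch of these heuristics in Python using NLTK's Porter stemmer; the exact way maximal noun/adjective runs are segmented is an assumption:

    from nltk.stem import PorterStemmer

    def select_candidates(tagged_tokens, stoplist):
        # tagged_tokens: list of (token, POS) pairs for one sentence.
        stemmer = PorterStemmer()
        runs, current = [], []
        for token, pos in tagged_tokens:
            if pos.startswith(('NN', 'JJ')):       # keep nouns and adjectives
                current.append((token.lower(), pos))
            elif current:
                runs.append(current)
                current = []
        if current:
            runs.append(current)

        candidates = []
        for run in runs:
            words = [w for w, _ in run]
            if (len(' '.join(words)) >= 3                        # length filter
                    and any(p.startswith('NN') for _, p in run)  # not only adjectives
                    and not any(w in stoplist for w in words)):  # no stop-words
                candidates.append(' '.join(stemmer.stem(w) for w in words))
        return candidates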

2.2 Word weighting functions

The performance of our model depends on how word weights are estimated. Here, we experiment with three methods for assigning importance weights to words. The first two are unsupervised weighting functions, namely TF×IDF (Sparck Jones, 1972) and TextRank (Mihalcea and Tarau, 2004), which have been extensively used in prior work (Hasan and Ng, 2010). We also apply a supervised model for predicting word importance based on (Hong and Nenkova, 2014).

2.2.1 TF×IDF

The weight of each word t is estimated using its frequency tf(t, d) in the document d and how many other documents include t (inverse document frequency), and is defined as:

    TF×IDF(t, d) = tf(t, d) × log(D / D_t)

where D is the total number of documents and D_t is the number of documents containing t.

2.2.2 TextRank

A co-occurrence graph is first built from the document, in which nodes are words and edges represent the number of times two words co-occur in the same sentence. TextRank (Mihalcea and Tarau, 2004), a graph-based ranking algorithm, is then used to compute the importance weight of each word. Let d be a damping factor5; the TextRank score S(V_i) of a node V_i is initialized to a default value and computed iteratively until convergence using the following equation:

1 We use the Punkt Sentence Tokenizer from NLTK.
2 We use the Penn Treebank Tokenizer from NLTK.
3 We use the Stanford Part-Of-Speech Tagger (Toutanova et al., 2003).
4 We use the English stop-list from NLTK.
5 We set d to 0.85 as in (Mihalcea and Tarau, 2004).

    S(V_i) = (1 − d) + d × Σ_{V_j ∈ N(V_i)} ( w_ji × S(V_j) / Σ_{V_k ∈ N(V_j)} w_jk )

where N(V_i) is the set of nodes connected to V_i and w_ji is the weight of the edge between nodes V_j and V_i.

TextRank implements the concept of "voting": a word is important if it is highly connected to other words and if it is connected to important words.
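A compact sketch of this computation; the fixed number of iterations stands in for the convergence test:

    def textrank(neighbors, weight, d=0.85, iterations=50):
        # neighbors[v]: set of nodes connected to v;
        # weight[(u, v)]: co-occurrence count of u and v (symmetric).
        scores = {v: 1.0 for v in neighbors}
        for _ in range(iterations):
            updated = {}
            for v in neighbors:
                total = 0.0
                for u in neighbors[v]:
                    out = sum(weight[(u, w)] for w in neighbors[u])
                    total += weight[(u, v)] * scores[u] / out
                updated[v] = (1 - d) + d * total
            scores = updated
        return scores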

2.2.3 Logistic regression

We train a logistic regression model6 for assigning importance weights to words in the document, based on (Hong and Nenkova, 2014). Reference keyphrases in the training data are used to generate positive and negative examples: for a word in the document (restricted to adjectives and nouns), we assign label 1 if the word appears in the corresponding reference keyphrases, otherwise we assign 0. We use the relative position of the first occurrence, the presence in the first sentence and the TF×IDF weight as features. These features have been extensively used in supervised keyphrase extraction approaches and have been shown to perform consistently well (Hasan and Ng, 2014).
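A minimal sketch with toy feature rows; the real feature extraction, following Hong and Nenkova (2014), is assumed to happen elsewhere, and the values below are illustrative only:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per noun/adjective: [relative position of first occurrence,
    # in-first-sentence flag, TF×IDF weight]; label 1 if the word appears
    # in a reference keyphrase.
    X_train = np.array([[0.02, 1, 4.1],
                        [0.75, 0, 0.3],
                        [0.10, 1, 2.7],
                        [0.90, 0, 0.8]])
    y_train = np.array([1, 0, 1, 0])

    clf = LogisticRegression().fit(X_train, y_train)

    # At test time, the positive-class probability serves as the word's
    # importance weight.
    word_weight = clf.predict_proba(np.array([[0.05, 1, 3.2]]))[:, 1]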

2.3 ILP model definition

Our model is an adaptation of the concept-based ILP model for summarization introduced by Gillick and Favre (2009), in which sentence selection is cast as an instance of the budgeted maximum coverage problem7. The key assumption of our model is that the value of a set of keyphrase candidates is defined as the sum of the weights of the unique words it contains. That way, a set of candidates only benefits from including each word once. Words are thus assumed to be independent, that is, the value of including a word is not affected by the presence of any other word in the set of keyphrases.

Formally, let w_i be the weight of word i, x_i and c_j two binary variables indicating the presence of word i and candidate j in the set of extracted keyphrases, Occ_ij an indicator of the occurrence of word i in candidate j, and N the maximum number of extracted keyphrases. Our model is described as:

6 We use the Logistic Regression classifier from scikit-learn with default parameters.
7 Given a collection S of sets with associated costs and a budget L, find a subset S′ ⊆ S such that the total cost of the sets in S′ does not exceed L, and the total weight of the elements covered by S′ is maximized (Khuller et al., 1999).



    max   Σ_i w_i x_i                             (1)

    s.t.  Σ_j c_j ≤ N                             (2)
          c_j Occ_ij ≤ x_i        ∀i, j           (3)
          Σ_j c_j Occ_ij ≥ x_i    ∀i              (4)
          x_i ∈ {0, 1}            ∀i
          c_j ∈ {0, 1}            ∀j

The constraints formalized in Equations 3 and 4 ensure the consistency of the solution: selecting a candidate leads to the selection of all the words it contains, and selecting a word is only possible if it is present in at least one selected candidate.

By summing over word weights, this model overly favors long candidates. Indeed, given two keyphrase candidates, one being included in the other (e.g. uddi registries and multiple uddi registries), this model always selects the longest one, as its contribution to the objective function is larger. To correct this bias, a regularization term is added to the objective function:

    max   Σ_i w_i x_i  −  λ Σ_j (l_j − 1) / (1 + substr_j) × c_j    (5)

where l_j is the size, in words, of candidate j, and substr_j is the number of times candidate j occurs as a substring of the other candidates. This regularization penalizes candidates that are composed of more than one word, and it is dampened for candidates that occur frequently as substrings in other candidates. Here, we assume that among multiple candidates of the same size, the one that is less frequent in the document should be stressed first.

The resulting ILP is then solved exactly using an off-the-shelf solver8. The solving process takes less than a second per document on average. The N candidate keyphrases returned by the solver are selected as keyphrases.

8 We use GLPK, http://www.gnu.org/software/glpk/
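For illustration, the whole model fits in a few lines with the PuLP modelling library and its bundled CBC solver (the paper uses GLPK); candidate extraction and word weighting are assumed to be done beforehand, and this is a sketch rather than the authors' implementation:

    import pulp

    def extract_keyphrases(candidates, word_weights, N=10, lam=0.3):
        # candidates: list of (phrase, word_list) pairs;
        # word_weights: dict mapping each word to its importance weight.
        words = sorted({w for _, ws in candidates for w in ws})
        widx = {w: i for i, w in enumerate(words)}
        phrases = [p for p, _ in candidates]
        # substr_j: times candidate j occurs as a substring of the others
        substr = [sum(p in q for q in phrases if q != p) for p in phrases]

        prob = pulp.LpProblem("keyphrases", pulp.LpMaximize)
        x = pulp.LpVariable.dicts("x", range(len(words)), cat="Binary")
        c = pulp.LpVariable.dicts("c", range(len(candidates)), cat="Binary")

        # Objective (5): unique-word coverage minus the length regularizer
        prob += (pulp.lpSum(word_weights[w] * x[widx[w]] for w in words)
                 - lam * pulp.lpSum((len(ws) - 1) / (1.0 + substr[j]) * c[j]
                                    for j, (_, ws) in enumerate(candidates)))
        prob += pulp.lpSum(c[j] for j in c) <= N                       # (2)
        for j, (_, ws) in enumerate(candidates):
            for w in set(ws):
                prob += c[j] <= x[widx[w]]                             # (3)
        for w in words:
            prob += pulp.lpSum(c[j] for j, (_, ws) in enumerate(candidates)
                               if w in ws) >= x[widx[w]]               # (4)

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [phrases[j] for j in c if c[j].value() == 1]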

3 Experiments

3.1 Experimental settings

We carry out our experiments on the SemEval dataset (Kim et al., 2010), which is composed of scientific articles collected from the ACM Digital Library. The dataset is divided into training (144 documents) and test (100 documents) sets. We use the set of combined author- and reader-assigned keyphrases as reference keyphrases.

We follow the common practice (Kim et al., 2010) and evaluate the performance of our method in terms of precision (P), recall (R) and f-measure (F) at the top N keyphrases9. Extracted and reference keyphrases are stemmed to reduce the number of mismatches.

For each word weighting function, namely TF×IDF, TextRank and Logistic regression, we compare the performance of our ILP model (hereafter ilp) with that of two word-based weighting baselines. The first baseline (hereafter sum) simply ranks keyphrase candidates according to the sum of the weights of their component words, as in (Wan and Xiao, 2008b; Wan and Xiao, 2008a). The second baseline (hereafter norm) scores keyphrase candidates by computing the sum of the weights of their component words normalized by their length, as in (Boudin, 2013).

As a post-processing step, we remove redundant keyphrases from the ranked lists generated by both baselines. A keyphrase is considered redundant if it is included in another keyphrase that is ranked higher in the list.
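This filter amounts to a single pass over the ranked list (a sketch; containment is assumed to be tested on the stemmed phrase strings):

    def remove_redundant(ranked):
        # Drop any keyphrase that is included in a higher-ranked one.
        kept = []
        for phrase in ranked:
            if not any(phrase in better for better in kept):
                kept.append(phrase)
        return kept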

IDF weights are computed on the training set. The regularization parameter λ is set, for all experiments, to the value that achieves the best performance on the training set, that is 0.3 for TF×IDF, 0.4 for TextRank and 1.2 for Logistic regression.

3.2 Results

The performance of our model on top of the different word weighting functions is shown in Table 1. Overall, our model consistently improves performance over the baselines. We observe that the results for sum are very low: summing the word weights favors long candidates and is prone to over-generation errors, as illustrated by the example in Table 2.

9 Scores are computed using the evaluation script provided by the SemEval organizers.


                                 Top-5 candidates       Top-10 candidates
Weighting            + Ranking   P      R      F        P      R      F

TF×IDF               + sum       5.6    1.9    2.8      5.3    3.5    4.2
                     + norm      19.2   6.7    9.9      15.1   10.6   12.3
                     + ilp       25.4   9.1    13.3†    17.5   12.4   14.4†

TextRank             + sum       4.5    1.6    2.3      4.0    2.8    3.3
                     + norm      18.8   6.6    9.6      14.5   10.1   11.8
                     + ilp       22.6   8.0    11.7†    17.4   12.2   14.2†

Logistic regression  + sum       4.2    1.5    2.2      4.7    3.4    3.9
                     + norm      23.8   8.3    12.2     18.9   13.3   15.5
                     + ilp       29.4   10.4   15.3†    19.8   14.1   16.3

Table 1: Comparison of TF×IDF, TextRank and Logistic regression for different ranking strategies when extracting a maximum of 5 and 10 keyphrases. Results are expressed as a percentage of precision (P), recall (R) and f-measure (F). † indicates significance at the 0.05 level using Student's t-test.

Normalizing the candidate scores by their lengths (norm) produces shorter candidates but does not limit the number of over-generation errors: as we can see from the example in Table 2, 9 out of 10 extracted keyphrases contain the word nugget. Our ILP model removes these redundant keyphrases by controlling the impact of each word on the set of extracted keyphrases. The resulting set of keyphrases is more diverse and thus increases the coverage of the topics addressed in the document.

Note that the reported results are not on par with keyphrase extraction systems that use ad-hoc pre-processing, involve structural features and leverage external resources. Rather, our goal in this work is to demonstrate a simple and intuitive model for reducing over-generation errors.

4 Related Work

In recent years, keyphrase extraction has attracted considerable attention and many different approaches have been proposed. Generally speaking, keyphrase extraction methods can be divided into two main categories: supervised and unsupervised approaches.

Supervised approaches treat keyphrase extraction as a binary classification task, where each phrase is labeled as keyphrase or non-keyphrase (Witten et al., 1999; Turney, 2000; Kim and Kan, 2009; Lopez and Romary, 2010). Unsupervised approaches usually rank phrases by importance and select the top-ranked ones as keyphrases.

TF×IDF + sum (P = 0.1)
advertis bid; certain advertis budget; keyword bid; convex hull landscap; budget optim bid; uniform bid strategi; advertis slot; advertis campaign; ward advertis; searchbas advertis

TF×IDF + norm (P = 0.2)
advertis; advertis bid; keyword; keyword bid; landscap; advertis slot; advertis campaign; ward advertis; searchbas advertis; advertis random

TF×IDF + ilp (P = 0.4)
click; advertis; uniform bid; landscap; auction; convex hull; keyword; budget optim; single-bid strategi; queri

Table 2: Example of the top-10 extracted keyphrases for the document J-3 of the SemEval dataset. Keyphrases are stemmed; those that match reference keyphrases are marked in bold.

Methods for ranking phrases include graph-based ranking (Mihalcea and Tarau, 2004; Wan and Xiao, 2008a; Wan and Xiao, 2008b; Bougouin et al., 2013; Boudin, 2013), topic-based clustering (Liu et al., 2009; Liu et al., 2010; Bougouin et al., 2013), statistical models (Paukkeri and Honkela, 2010; El-Beltagy and Rafea, 2010) and language modeling (Tomokiyo and Hurst, 2003).

The work of Ding et al. (2011) is perhaps the closest to our present work. They proposed an ILP formulation of the keyphrase extraction problem


that combines TF×IDF and position features in an objective function subject to constraints of coherence and coverage. In their model, coherence is measured by Mutual Information and coverage is estimated using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Their work differs from ours in that (1) it is phrase-based and thus does not penalize redundant keyphrases, and (2) it requires estimating a large number of hyper-parameters, which makes it difficult to generalize.

5 Conclusion and Future Work

In this paper, we proposed an ILP formulation for keyphrase extraction that reduces over-generation errors by weighting keyphrase candidates as a set rather than independently. In our model, keyphrases are selected according to their component words, and the weight of each unique word is counted only once. Experiments show a substantial improvement over commonly used word-based ranking approaches using either supervised or unsupervised weighting schemes.

In future work, we intend to extend our model to include word relatedness through the use of association measures. By doing so, we expect to better differentiate semantically related keyphrase candidates according to the association strength between their component words.

Acknowledgments

This work was partially supported by the GOLEM project (grant of CNRS PEPS FaSciDo 2015, http://boudinfl.github.io/GOLEM/). We thank the anonymous reviewers, Adrien Bougouin and Evgeny Gurevsky for their insightful comments.

References

Gabor Berend. 2011. Opinion Expression Mining by Exploiting Keyphrase Extraction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1162–1170, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March.

Florian Boudin. 2013. A Comparison of Centrality Measures for Graph-based Keyphrase Extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 834–838, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

Adrien Bougouin, Florian Boudin, and Beatrice Daille. 2013. TopicRank: Graph-based Topic Ranking for Keyphrase Extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 543–551, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

Zhuoye Ding, Qi Zhang, and Xuanjing Huang. 2011. Keyphrase Extraction from Online News Using Binary Integer Programming. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 165–173, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Samhaa R. El-Beltagy and Ahmed Rafea. 2010. KP-Miner: Participation in SemEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 190–193, Uppsala, Sweden, July. Association for Computational Linguistics.

Dan Gillick and Benoit Favre. 2009. A Scalable Global Model for Summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 10–18, Boulder, Colorado, June. Association for Computational Linguistics.

Kazi Saidul Hasan and Vincent Ng. 2010. Conundrums in Unsupervised Keyphrase Extraction: Making Sense of the State-of-the-Art. In Coling 2010: Posters, pages 365–373, Beijing, China, August. Coling 2010 Organizing Committee.

Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1262–1273, Baltimore, Maryland, June. Association for Computational Linguistics.

Kai Hong and Ani Nenkova. 2014. Improving the Estimation of Word Importance for News Multi-Document Summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 712–721, Gothenburg, Sweden, April. Association for Computational Linguistics.

Anette Hulth and Beata B. Megyesi. 2006. A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 537–544, Sydney, Australia, July. Association for Computational Linguistics.


Samir Khuller, Anna Moss, and Joseph (Seffi) Naor. 1999. The Budgeted Maximum Coverage Problem. Information Processing Letters, 70(1):39–45.

Su Nam Kim and Min-Yen Kan. 2009. Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, pages 9–16, Singapore, August. Association for Computational Linguistics.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26, Uppsala, Sweden, July. Association for Computational Linguistics.

Marina Litvak and Mark Last. 2008. Graph-based Keyword Extraction for Single-Document Summarization. In Coling 2008: Proceedings of the Workshop Multi-source Multilingual Information Extraction and Summarization, pages 17–24, Manchester, UK, August. Coling 2008 Organizing Committee.

Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. 2009. Clustering to Find Exemplar Terms for Keyphrase Extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 257–266, Singapore, August. Association for Computational Linguistics.

Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. 2010. Automatic Keyphrase Extraction via Topic Decomposition. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 366–376, Cambridge, MA, October. Association for Computational Linguistics.

Zhiyuan Liu, Xinxiong Chen, Yabin Zheng, and Maosong Sun. 2011. Automatic Keyphrase Extraction by Bridging Vocabulary Gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 135–144, Portland, Oregon, USA, June. Association for Computational Linguistics.

Patrice Lopez and Laurent Romary. 2010. HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248–251, Uppsala, Sweden, July. Association for Computational Linguistics.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July. Association for Computational Linguistics.

Thuy Dung Nguyen and Minh-Thang Luong. 2010. WINGNUS: Keyphrase Extraction Utilizing Document Logical Structure. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 166–169, Uppsala, Sweden, July. Association for Computational Linguistics.

Mari-Sanna Paukkeri and Timo Honkela. 2010. Likey: Unsupervised Language-Independent Keyphrase Extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 162–165, Uppsala, Sweden, July. Association for Computational Linguistics.

Martin F. Porter. 1980. An Algorithm for Suffix Stripping. Program, 14(3):130–137.

Karen Sparck Jones. 1972. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28:11–21.

Takashi Tomokiyo and Matthew Hurst. 2003. A Language Model Approach to Keyphrase Extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, MWE '03, pages 33–40, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technology - Volume 1 (NAACL), pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Peter D. Turney. 2000. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2(4):303–336.

Xiaojun Wan and Jianguo Xiao. 2008a. CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 969–976, Manchester, UK, August. Coling 2008 Organizing Committee.

Xiaojun Wan and Jianguo Xiao. 2008b. Single Document Keyphrase Extraction Using Neighborhood Knowledge. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI'08, pages 855–860. AAAI Press.

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99, pages 254–255, New York, NY, USA. ACM.


TwittDict: Extracting Social Oriented Keyphrase Semantics from Twitter

Suppawong Tuarob†, Wanghuan Chu‡, Dong Chen§, and Conrad S Tucker♯

†Faculty of Information and Communication Technology, Mahidol University, Thailand
‡Department of Statistics, §Information Sciences and Technology, ♯Industrial and Manufacturing Engineering, Pennsylvania State University, USA
[email protected], {wxc228,duc196,ctucker4}@psu.edu

Abstract

Social media not only carries information that is up-to-date, but also bears the wisdom of the crowd. In social media, new words are developed every day, including slang, combinations of existing terms, entity names, etc. These terms are initially used in small communities, but can later grow popular and become new standards. The ability to recognize the existence and understand the meanings of these terms early can prove crucial, especially for emergence detection applications. We present ongoing research that investigates the use of topical analysis to extract the semantics of terms in social media. In particular, the proposed method extracts semantically related words associated with a target word from a corpus of tweets. We provide preliminary, anecdotal results comprising the semantic extraction of five different keywords.

1 Introduction

Multiple applications built upon social media data have emerged and recently gained attention from a wide range of research fields. For example, public surveillance systems have shown success in employing Twitter data to detect the emergence of diseases (Tuarob et al., 2013b; Tuarob et al., 2014), emergency needs during natural disasters (Caragea et al., 2011), and even changes in product trends (Tuarob and Tucker, 2015c; Tuarob and Tucker, 2015a). Regardless of such appealing applications, tremendous challenges exist in employing traditional natural language processing techniques to handle social media data. Most of the issues with social media involve language creativity and noise, such as non-standard terms or symbolic expressions, caused by the users.

Languages in social media evolve rapidly, as users have the freedom to express their opinions in colloquial, everyday language. Some social media services such as Twitter limit the length of each message, which further challenges users to express their complete thoughts in a compressed manner, resulting in creativity that would be considered noise by most traditional NLP techniques. This language evolution can be classified into two categories: grammatical alteration and word distortion. Grammatical alteration involves incomplete sentences (e.g. 'Dance Practice All Day Hit Up The iPhone4 (:'), omitting words or parts of words (e.g. '[Does] Anyone have suggestions for [an] iPhone 4 mic?'), and developing new terms (e.g. 'I totally fricken agree!'). Word distortion involves modifying existing terms to deviate from their original meanings or to encode a phrase into a single word, such as looooooove (much love) and lol (laugh out loud). Besides language evolution, noise is also considered a norm in social media; sources of such noise include the use of symbolic representations (e.g. ':)') and typographical errors (both intentional and unintentional). Both language evolution and noise produce non-standard terms, i.e. words not defined in a standard dictionary. Moreover, non-standard terms may refer to proper nouns, or entity names, e.g. Xbox, Microsoft, and Peking. These non-standard terms pose challenges to existing semantic interpretation techniques, especially those dependent on dictionary look-up of terms.

Text normalization techniques such as those utilizing noisy channel models (Cook and Stevenson, 2009; Xue et al., 2011) rely on the assumption that a non-standard term has an equivalent standard form (e.g. love ⇒ looooove). With such an assumption, the algorithms aim to reverse the transformation process and seek the original form of a non-standard term.


These algorithms, however, fail if a term is newly developed and does not have a counterpart standard form (for example, 'swine flu', 'linsanity', 'Tweeps', etc.).

In particular, we present TwittDict, a model for semantic exploration of unknown terms in social media. The method first identifies the different topics discussed in a social media corpus. It assumes that a given term is associated with one or more topics, which then allows the mapping between such a term and relevant topically represented terms. Though multiple works have shown success at semantic annotation of unknown terms, these works target the domain of traditional documents, where noise and language evolution are not taken into account. A preliminary case study that uses Twitter data to extract semantically relevant terms for a set of five chosen target terms is presented.

2 Background and Related Work

Use of social media, such as collaboratively edited knowledge databases (Wikipedia1), blogs and microblogs (Biyani et al., 2014), content communities (YouTube2), and social networking sites (Facebook3) (Kaplan and Haenlein, 2009), has grown at a prodigious rate. According to Nielsen's report4, the total amount of time spent by the U.S. population on social media in 2012 was 520.1 billion minutes, a 21% increase from the previous year. This results in the creation and diffusion of a huge amount of information on social media every day, including news, knowledge, opinions, and emotions. Different groups use social media for different reasons. For instance, companies can use social media to gather customer feedback and conduct market research and reputation management. Governmental organizations can spread news and gather public opinions. Meanwhile, the wealth of information on social media contributes to the collective wisdom and can be used to predict real-world outcomes such as stock prices (Bollen et al., 2011), flu trends (Lampos et al., 2010), and product sales (Tuarob and Tucker, 2013). To realize the potential of social media, the first step is to select relevant information, which requires an understanding of language evolution on social media.

1 http://www.wikipedia.org/
2 https://www.youtube.com/
3 https://www.facebook.com/
4 http://www.nielsensocial.com/

One aspect of such evolution is the creation and use of new terms aimed at describing timely events or new social phenomena. Many of these terms are too new to be indexed by standard dictionaries or Wikipedia, and the results returned by popular search engines like Google5 can be obscure and unstructured. Therefore, we seek to use social community knowledge to extract term semantics, which provides a better understanding of this language evolution.

2.0.1 Semantic Discovery of Terms

Weischedel et al. (1993) had success in employing probabilistic models to discover unknown terms and annotate them with parts of speech. Daniel et al. (1999) proposed a named entity recognition (NER) algorithm which categorizes a proper noun into one of three predefined categories: Location, Person, and Organization. Besides Daniel et al.'s work, other NER algorithms such as (Chieu and Ng, 2002) achieved similar goals. These solutions rely on the assumption that a proper noun must fall into one of the predefined categories, while it is ubiquitous to see new categories of terms emerge from social media. Moreover, these algorithms require the data to adhere to standard English grammar, a requirement that is hardly satisfied in social media. Fellbaum described WordNet6, a lexical database for English vocabulary that provides a set of synonyms (synset) for a given word. However, such a database is constructed manually and only contains standard dictionary words, while our solution is fully automatic and can be applied to standard and non-standard terms that appear in social media.

2.0.2 Quantifying Unknown Terms in Social Media Data

Dealing with non-standard terms can be cumbersome. Dictionary-based approaches tend to fail when facing such unknown terms, since these terms simply do not exist in the dictionary and cannot be looked up. Cook and Stevenson (2009) identified 10 different ways in which a term can be distorted in mobile text messaging, and proposed an unsupervised noisy channel model to translate a non-standard term into its standard version. Xue et al. (2011) proposed a similar channel-based model to translate a non-standard term into its standard form in the Twitter domain.

5 https://www.google.com/
6 http://wordnet.princeton.edu/


These algorithms assume that an unknown term can be mapped one-to-one to its standard form. Unfortunately, newly generated terms naturally found in social media violate this assumption, simply because such terms are newly developed and hence have no standard forms. These newly developed terms include social slang, trending words, and names of entities.

Lund and Burgess (1996) attempted to explore the semantics of terms by generating a term co-occurrence network, in which a term is annotated with its highly related terms based on the distances in the network. Though their algorithm treats a document as a bag of words (and hence does not rely on sentence structure), it produces meaningful results only when the data is high-dimensional and dense, since such properties yield strong and meaningful co-occurrence relationships. However, each message in social media is usually a short text, resulting in high-dimensional but sparse data. Such data sparsity weakens the co-occurrence relationships.

3 Methodology

Topic models (Blei and Lafferty, 2009) are powerful tools for studying latent patterns in text. The semantics of an unknown word are highly related to the topics associated with the text that contains it; the identified topics can thus be considered representatives of those semantics. While one document might have only a limited number of topics associated with it, a collection of a large number of documents containing the unknown term can provide a more thorough and comprehensive understanding. Therefore, topic models can be applied to extract the semantics of unknown terms, given a large enough collection of documents. Social media such as Twitter usually adopt newly developed terms at a very fast rate. Social media users tweet about topics related to the unknown terms based on their subjective understanding, and different tweets may convey different meanings of a single term. While a single tweet lacks the information to provide the full semantics of the term, a collection of all the tweets containing the term gives a much larger and clearer picture of those semantics. Therefore, topic models can be applied to social media to extract word semantics in terms of collective wisdom and social knowledge.

In this study, we choose Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model topical variation due to its flexibility and the richness of its results. We use Twitter data as a case study; hence the name TwittDict. Note that our algorithm can also be applied to other social media, such as Facebook and Google+, as long as the medium of communication is textual and community structures exist. In this section, we first briefly review our problem and introduce the LDA model, and then discuss how we filter related tweets and how we apply LDA to extract word semantics.

3.1 Problem Definition

Given a query word, TwittDict outputs a list of related words associated with it. The output words are ranked according to their relevance to the input term. Specifically, let D_t = {d_1, d_2, ..., d_n} be the set of tweets, where each tweet d_i in D_t is a bag of words, W the vocabulary extracted from D_t, and w_t the query word. The proposed algorithm aims to output a ranked list of K words which are semantically relevant to w_t. For example, given the word 'Linsanity', the proposed algorithm would return the ranked list of semantically relevant words {basketball, player, insanity, scholarship} (with K = 4) as the output.

3.2 Latent Dirichlet Allocation

In text mining, Latent Dirichlet Allocation (LDA) is a generative model that allows a document to be represented by a mixture of topics. The basic intuition of LDA for topic modeling is that an author has a set of topics in mind when writing a document, where a topic is defined as a distribution over terms. The author then chooses a set of terms from the topics to compose the document. Under this assumption, the whole document can be represented as a mixture of different topics, and LDA serves as a means to trace back the topics in the author's mind before the document is written. Mathematically, the LDA model is described as follows:

$$P(w_i \mid d) = \sum_{j=1}^{|Z|} P(w_i \mid z_i = j) \cdot P(z_i = j \mid d) \qquad (1)$$

Here, P(w_i|d) is the probability of term w_i appearing in document d, z_i is the latent (hidden) topic, and |Z| is the number of topics, which needs to be predetermined. P(w_i|z_i = j) is the probability of term w_i being in topic j, and P(z_i = j|d) is the probability of picking a term from topic j in document d.

Essentially, the aim of the LDA model is to find P(z|d), the topic distribution of document d, with each topic described by a distribution P(w|z) over all terms.

After the topics are modeled, we can assign a distribution of topics to a given document using a technique called inference. A document can then be represented by a vector of numbers, each of which represents the probability of the document belonging to a topic:

$$\mathrm{Infer}(d, Z) = \langle z_1, z_2, \ldots, z_Q \rangle; \quad |Z| = Q,$$

where Z is the set of topics, d is a document, and z_i is the probability of document d falling into topic i. We use the Latent Dirichlet Allocation algorithm to generate topics in our model since it allows a topic to be represented by a distribution over terms, enabling the method to propagate relevance from the target term to the underlying terms that compose the relevant topics.
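To make the modeling and inference steps concrete, the following is a minimal sketch (our own illustration, not the authors' code) of training a topic model and inferring P(z|d) for a document. We substitute the gensim library for the Mallet implementation used in Section 5; the toy documents and all variable names are assumptions.

```python
# Sketch: train LDA on tweet documents and infer P(z|d) using gensim
# (a stand-in for the Mallet implementation the paper actually uses).
from gensim import corpora, models

docs = [["linsanity", "basketball", "player"],
        ["linsanity", "insanity", "knicks", "basketball"]]   # toy tweets

dictionary = corpora.Dictionary(docs)                # word <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

# The paper uses 100 topics and 1,000 Gibbs-sampling iterations; a small
# configuration is used here so the toy example runs quickly.
lda = models.LdaModel(bow_corpus, id2word=dictionary,
                      num_topics=10, passes=10)

# Inference: the topic distribution <z_1, ..., z_Q> for one document.
theta = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(theta)   # list of (topic_id, P(z|d)) pairs
```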

3.3 Data Preprocessing

Twitter data is collected using the Twitter API. The textual information in each tweet is first lowercased; then usernames, stopwords, punctuation, numbers, and URLs are removed. When using the wealth of information on Twitter to understand an unknown term, the first step is to filter in tweets that are related to that term. The most intuitive collection consists of all the tweets that contain the target word, treating each single tweet as a document; we call this the basis setting. However, Twitter messages have some special characteristics that we want to exploit for modifications and improvements. First, there is limited information within each tweet because of the 140-character restriction, and the average tweet length is even smaller. This is quite different from traditional uses of LDA, where input documents are rich (e.g., research articles or newspapers) and the generated topics are hence quite intuitive and meaningful. Second, tweets contain other information such as retweets (RT), replies (@username) and hashtags (#), which can be used more appropriately instead of simply being deleted or treated as plain words. To overcome these drawbacks and make better use of Twitter features, we improve the basis setting by expanding the collection of tweets using replies and hashtags. A reply is a tweet that starts with @username and comments on another tweet. For a tweet that contains the unknown term, its reply tweets comment on the same or related topics. Although these replies might not contain the target word, it is reasonable to assume that they carry semantics similar to the original tweet and thus provide additional information. Therefore, we expand the collection of tweets by combining all reply tweets with the original tweet that contains the target term. Hashtags can also be used to find related tweets. People place the hashtag symbol # before a relevant keyword or unspaced phrase in a tweet to facilitate automatic categorization and search. These hashtags can be viewed as topical markers, indicating the context or the core idea of the tweet. Tweets with the same hashtag share similar topics. Therefore, we use the hashtags in the basis tweets to find all other tweets that contain at least one of these hashtags, which further enriches the information in the collection.
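The sketch below is our own reading of the preprocessing and expansion just described; the stopword list is a toy placeholder, and the hashtag-expansion rule is a simplified interpretation (reply threading is omitted, since it requires the Twitter API's reply metadata).

```python
# Sketch: clean raw tweets and build the expanded collection for a
# target word (basis setting + hashtag expansion).
import re
import string

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # toy list

def clean(tweet_text):
    """Lowercase; strip usernames, URLs, numbers, punctuation, stopwords."""
    text = tweet_text.lower()
    text = re.sub(r"@\w+|https?://\S+|\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

def hashtags(tweet_text):
    return set(re.findall(r"#(\w+)", tweet_text.lower()))

def collect(tweets, target):
    """Basis tweets contain the target word; expansion adds tweets
    sharing at least one hashtag with a basis tweet."""
    basis = [t for t in tweets if target in clean(t)]
    tags = set().union(*(hashtags(t) for t in basis)) if basis else set()
    expanded = [t for t in tweets
                if target in clean(t) or (tags & hashtags(t))]
    return [clean(t) for t in expanded]   # documents for the topic model
```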

3.4 Retrieving Related Words

Mathematically, given a target document corpus D_t = ⟨d_1, d_2, ..., d_n⟩ (as described in Section 3.3), vocabulary W = ⟨w_1, w_2, ..., w_m⟩, and target word w_t, our algorithm outputs a ranked list W*_K = ⟨w_1, w_2, ..., w_K⟩, where w_i ∈ W, of K words relevant to w_t.

Our algorithm comprises two main steps:

1. P(w|w_t, W, D_t), the likelihood of the word w being relevant to the target word w_t, is computed for each w ∈ W.

2. Return the top K words ranked by this likelihood.

In general, P(w|w_t, W, D_t) is computed as a weighted average of the posterior probability P(w|z), with weights derived from the documents in D_t, where Z is the set of topics:

$$P(w \mid w_t, W, D_t) = \sum_{z \in Z} P(z \mid D_t) \cdot P(w \mid z), \qquad (2)$$

where P(w|z) is the posterior probability of the word w being in topic z, computed as in Equation 1, and P(z|D_t) serves as the weight of topic z, computed by averaging the topic probability P(z|d) across all documents in D_t:

$$P(z \mid D_t) = \frac{1}{|D_t|} \sum_{d \in D_t} P(z \mid d), \qquad (3)$$

where P(z|d) is computed based on Equation 1. Hence:

$$P(w \mid w_t, W, D_t) = \frac{1}{|D_t|} \sum_{z \in Z} \sum_{d \in D_t} P(z \mid d) \cdot P(w \mid z) \qquad (4)$$
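Once the LDA posteriors are available, Equation 4 reduces to a weighted matrix product. The sketch below is our own illustration; doc_topics (holding P(z|d) for each tweet in D_t) and topic_words (holding P(w|z)) are assumed to come from the fitted model.

```python
# Sketch: rank vocabulary words by P(w | w_t, W, D_t) as in Eq. (4).
import numpy as np

def rank_related_words(doc_topics, topic_words, vocab, k=10):
    """doc_topics: (n_docs, n_topics) array of P(z|d);
    topic_words: (n_topics, n_words) array of P(w|z)."""
    p_z = doc_topics.mean(axis=0)        # P(z|D_t), Eq. (3)
    p_w = p_z @ topic_words              # sum over z of P(z|D_t) * P(w|z)
    top = np.argsort(p_w)[::-1][:k]      # indices of the K best words
    return [(vocab[i], float(p_w[i])) for i in top]
```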

4 Evaluation

TwittDict is evaluated against a baseline which utilizes variants of word co-occurrence to retrieve relevant keywords. Church and Hanks (1990) had success in using mutual information to extract semantically related terms. Furthermore, Tuarob and Tucker (2015b) used a word co-occurrence network to explicate implicit semantics in product-related tweets. Here, the word co-occurrence network is constructed from the tweet corpus. The co-occurrence network is an undirected graph where each node is a distinct word and each edge weight represents the frequency of co-occurrence. The edge weights can be used directly to compute P(x, y), where x and y are co-occurring words. Given a target word w_t, a corpus of tweets T, and vocabulary W = ⟨w_1, w_2, ..., w_m⟩, the baseline algorithm outputs a ranked list W^B_K = ⟨w_1, w_2, ..., w_K⟩, where w_i ∈ W, of K words relevant to w_t. The algorithm assigns a co-occurrence based score to each word and ranks the words by this score. In this work, we experiment with three variations of co-occurrence based scores: Mutual Information (MI), Co-Frequency (CoF), and Co-Frequency Inverse Document Frequency (CoF-IDF):

$$\mathrm{Score}_{MI}(w_t, w) = \log_2 \frac{P(w_t, w)}{P(w_t) \cdot P(w)} \qquad (5)$$

$$\mathrm{Score}_{CoF}(w_t, w) = P(w_t, w) \qquad (6)$$

$$\mathrm{Score}_{CoF\text{-}IDF}(w_t, w) = P(w_t, w) \cdot \mathrm{IDF}(w, T) \qquad (7)$$
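The sketch below (our own) computes all three baseline scores from tweet-level co-occurrence counts; the paper does not spell out the probability estimation, so maximum-likelihood estimates over the tweet corpus T are an assumption.

```python
# Sketch: MI, CoF, and CoF-IDF scores from a tweet corpus.
import math
from collections import Counter
from itertools import combinations

def baseline_scores(tweets, target):
    """tweets: list of token lists. Returns per-word baseline scores."""
    n = len(tweets)
    df, cooc = Counter(), Counter()
    for toks in tweets:
        uniq = set(toks)
        df.update(uniq)                                    # document frequency
        cooc.update(frozenset(p) for p in combinations(uniq, 2))
    scores = {}
    for w in df:
        if w == target:
            continue
        p_xy = cooc[frozenset((target, w))] / n            # P(w_t, w)
        if p_xy == 0.0:
            continue
        mi = math.log2(p_xy / ((df[target] / n) * (df[w] / n)))   # Eq. (5)
        scores[w] = {"MI": mi,
                     "CoF": p_xy,                                  # Eq. (6)
                     "CoF-IDF": p_xy * math.log(n / df[w])}        # Eq. (7)
    return scores
```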

5 Preliminary Case Study

We experiment with our methodology on Twitter data and a set of manually selected words. Twitter data is used due to its ubiquity and public availability. Note that our methodology can be extended to other types of social media, such as Facebook and Google+, if the data is available.

5.1 Twitter Data

Twitter is a microblog service that allows its users to send and read text messages of up to 140 characters, known as tweets. The Twitter dataset used in this research comprises roughly 700 million tweets posted in the United States over a period of 19 months, from March 2011 to September 2012.

5.2 Anecdotal Results

A set of five target words (Obama, Pandora, Xbox, Glee, and Zombie) is used to test our proposed algorithm against the baseline with Co-Frequency scores. TwittDict employs the LDA implementation in Mallet7, with 100 topics, run for 1,000 iterations using Gibbs sampling. Due to limits on computational time, TwittDict currently only models topics from a tweet corpus collected in March 2011. For the baseline, we first index the whole tweet corpus using Apache Lucene8, then use the same library to compute word frequencies. Table 1 lists the results.

From the preliminary results, TwittDict is able to extract highly meaningful words related to the target words, while the baseline results contain a mixture of related and generally spurious words. Note that TwittDict only uses one month's worth (5.26%) of the available Twitter data, as opposed to the baseline, which uses the whole collection of tweets. We believe that, with more Twitter data, TwittDict could provide an even wider variety of semantically richer lexicons.

6 Conclusions and Future Work

By leveraging natural language processing techniques and specific features of social media, we have described our ongoing development of TwittDict, a system that identifies the social-oriented semantic meaning of unknown words. Such a system could prove useful as a building block for emergence detection systems, where early recognition of new terms/concepts is crucial. Through anecdotal results on Twitter data identifying the semantic meanings of five terms, we illustrated that our method not only achieves promising results, but also urges us to explore further improvements to our methods along with conducting rigorous user and automatic evaluations such as (Tuarob et al., 2013a; Tuarob et al., 2015).

7 http://mallet.cs.umass.edu/
8 http://lucene.apache.org/


Table 1: Preliminary results for the 5 test words using both the baseline (CoF scores) and TwittDict; each list gives the top 20 related words, rank 1 first.

Obama:
  Co-Freq:   president, vote, michelle, romney, barack, lol, don, america, love, speech, fuck, got, dnc, voting, people, good, campaign, years, win, osama
  TwittDict: president, libya, people, war, bush, barack, don, news, pres, time, gop, white, america, oil, world, tcot, japan, administration, house, gas

Pandora:
  Co-Freq:   flow, station, listening, radio, love, point, tonight, commercials, playing, lol, song, time, listen, songs, shit, night, music, jamming, got, sleep
  TwittDict: station, radio, listening, lol, music, playing, song, love, time, songs, good, shit, listen, day, today, play, work, flow, tonight, night

Xbox:
  Co-Freq:   live, play, playing, got, lol, time, need, game, add, don, kinect, buy, games, fuck, shit, wanna, day, love, controller, haha
  TwittDict: live, play, kinect, lol, playing, game, games, time, black, back, don, buy, good, day, ops, gamertag, follow, win, controller, love

Glee:
  Co-Freq:   watching, love, watch, tonight, episode, season, project, lol, time, omg, cast, wait, song, good, don, amazing, night, version, excited, week
  TwittDict: watching, tonight, episode, watch, love, song, time, good, show, lol, songs, don, night, cast, amazing, omg, week, version, awesome, music

Zombie:
  Co-Freq:   apocalypse, feel, lol, day, dead, mode, sleep, walking, time, today, movie, night, zombies, don, love, good, shit, rob, walkingdead, need
  TwittDict: apocalypse, lol, www, movie, today, feel, movies, love, time, back, zombies, band, dead, day, mode, horror, good, house, plays, atomic

There is plenty of room to improve TwittDict. In the current case study, we only used Twitter data from March 2011. This specific time period may bias the results; to avoid such bias, we need to test data from different times and geographical regions. This will shed light on how the meanings of a term evolve temporally and spatially. While conducting the small case study, we noticed that the results were highly dependent on the time period, as Twitter users usually tweet about current social phenomena; this change reflects the evolution of social events and community knowledge. We are considering giving users the freedom to specify the time period during which a term is defined. Furthermore, we would like to explore methods for user evaluation by recruiting human participants to give feedback about their experience. Real user experience is of great value for assessing whether and how community knowledge from social media truly helps users better understand unknown, emerging concepts. Finally, we would like to compare our method against well-established baselines such as (Turney et al., 2010) and (Mikolov et al., 2013).


References

Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211–231, February.

Prakhar Biyani, Cornelia Caragea, Prasenjit Mitra, and John Yen. 2014. Identifying emotional and informational support in online health communities.

David M. Blei and John Lafferty. 2009. Topic models. Text Mining: Classification, Clustering, and Applications, 10:71.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.

Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

C. Caragea, N. McNeese, A. Jaiswal, G. Traylor, H.W. Kim, P. Mitra, D. Wu, A.H. Tapia, L. Giles, B.J. Jansen, et al. 2011. Classifying text messages for the Haiti earthquake. In ISCRAM '11.

Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: a maximum entropy approach using global information. In COLING '02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, March.

Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC '09, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andreas M. Kaplan and Michael Haenlein. 2009. The fairyland of Second Life: Virtual social worlds and how to use them. Business Horizons, 52(6):563–572.

Vasileios Lampos, Tijl De Bie, and Nello Cristianini. 2010. Flu detector: tracking epidemics on Twitter. In Machine Learning and Knowledge Discovery in Databases, pages 599–602. Springer.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28(2):203–208.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Suppawong Tuarob and Conrad S. Tucker. 2013. Fad or here to stay: Predicting product market adoption and longevity using large scale, social media data. In ASME IDETC/CIE '13.

Suppawong Tuarob and Conrad S. Tucker. 2015a. Automated discovery of lead users and latent product features by mining large scale social media networks. Journal of Mechanical Design.

Suppawong Tuarob and Conrad S. Tucker. 2015b. A product feature inference model for mining implicit customer preferences within large scale social media networks. In ASME IDETC/CIE '15.

Suppawong Tuarob and Conrad S. Tucker. 2015c. Quantifying product favorability and extracting notable product features using large scale social media data. Journal of Computing and Information Science in Engineering.

Suppawong Tuarob, Line C. Pouchard, and C. Lee Giles. 2013a. Automatic tag recommendation for metadata annotation using probabilistic topic modeling. In JCDL '13, pages 239–248.

Suppawong Tuarob, Conrad S. Tucker, Marcel Salathe, and Nilam Ram. 2013b. Discovering health-related knowledge in social media using ensembles of heterogeneous features. In CIKM '13, pages 1685–1690. ACM.

Suppawong Tuarob, Conrad S. Tucker, Marcel Salathe, and Nilam Ram. 2014. An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages. Journal of Biomedical Informatics, 49:255–268.

Suppawong Tuarob, Line C. Pouchard, Prasenjit Mitra, and C. Lee Giles. 2015. A generalized topic modeling approach for automatic document annotation. International Journal on Digital Libraries, 16(2):111–128.

Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Ralph Weischedel, Richard Schwartz, Jeff Palmucci, Marie Meteer, and Lance Ramshaw. 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):361–382, June.

Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Proceedings of the AAAI Workshop on Analyzing Microtext, pages 74–79.


Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction, pages 32–38, Beijing, China, July 30, 2015. ©2015 Association for Computational Linguistics

Identification and Classification of Emotional Key Phrases from Psychological Texts

Apurba Paul
JIS College of Engineering
Kalyani, Nadia, West Bengal, India
[email protected]

Dipankar Das
Jadavpur University
188, Raja S.C. Mullick Road, Kolkata, West Bengal, India
[email protected]

Abstract

Emotion, a complex state of feeling, results in physical and psychological changes that influence human behavior. In order to extract emotional key phrases from psychological texts, we present here a phrase-level emotion identification and classification system. The system takes predefined emotional statements of seven basic emotion classes (anger, disgust, fear, guilt, joy, sadness and shame) as input and extracts seven types of emotional trigrams. The trigrams are represented as Context Vectors, and between each pair of Context Vectors an Affinity Score is calculated based on the law of gravitation with respect to different distance metrics (e.g., Chebyshev, Euclidean and Hamming). The words, Part-Of-Speech (POS) tags, TF-IDF scores and variance, along with the Affinity Score and ranked score of the vectors, were employed as important features in a supervised classification framework after a rigorous analysis. The comparative results for four different classifiers, e.g., NaiveBayes, J48, Decision Tree and BayesNet, show satisfactory performances.

1 Introduction

Human emotions are among the most complex and unique features to describe. If we ask someone about emotion, he or she will simply reply that it is a 'feeling'. The obvious question that then comes to mind is the definition of feeling. It is observed that such terms are difficult to define and even more difficult to understand completely. Ekman (1980) proposed six basic emotions (anger, disgust, fear, guilt, joy and sadness) that have a shared meaning at the level of facial expressions across cultures (Scherer, 1997; Scherer and Wallbott, 1994). Psychological texts contain a huge number of emotional words because psychology and emotions are intertwined, though they are different (Brahmachari et al., 2013). A phrase containing more than one word can represent emotion better than a single word. Thus, the identification and classification of emotional phrases from text have great importance in Natural Language Processing (NLP).

In the present work, we have extracted seven different types of emotional statements (anger, disgust, fear, guilt, joy, sadness and shame) from a psychological corpus. Each of the emotional statements was tokenized; the tokens were grouped into trigrams and considered as Context Vectors. These Context Vectors were POS tagged, and the corresponding TF and TF-IDF scores were measured to decide whether to treat them as important features or not. In addition, Affinity Scores were calculated for each pair of Context Vectors based on different distance metrics (Chebyshev, Euclidean and Hamming). These features led us to apply different classification methods, namely NaiveBayes, J48, Decision Tree and BayesNet, after which the results were compared.

The route map for this paper is Related Work (Section 2) and the Data Preprocessing Framework (Section 3), followed by Feature Analysis and the Classification Framework (Section 4) and Result Analysis (Section 5), along with the improvement due to ranking. Finally, we conclude the discussion (Section 6).

2 Related Work

Strapparava and Valitutti (2004) developed WORDNET-AFFECT, a lexical resource that assigns one or more affective labels, such as emotion, mood, trait, cognitive state, physical state, behavior, attitude and sensation, to a number of WORDNET synsets.


A detailed annotation scheme that identifies key components and properties of opinions and emotions in language has been described in (Wiebe et al., 2005). The authors in (Kobayashi et al., 2004) also developed an opinion lexicon out of their annotated corpora. Takamura et al. (2005) extracted the semantic orientation of words according to the spin model, where the semantic orientation of words propagates in two possible directions, like electrons. Esuli and Sebastiani's (2006) approach to developing SentiWordNet is an adaptation of synset classification based on training ternary classifiers for deciding positive and negative (P-N) polarity, each of the ternary classifiers being generated using semi-supervised rules.

On the other hand, Mohammad and Turney (2010) performed an extensive analysis of annotations to better understand the distribution of emotions evoked by terms of different parts of speech. The authors in (Das and Bandyopadhyay, 2009, 2010) created emotion lexicons and systems for the Bengali language. The development of SenticNet (Cambria et al., 2010) later inspired (Poria et al., 2013), whose authors developed an enriched SenticNet with affective information by assigning emotion labels. Similarly, ConceptNet1 is a multilingual knowledge base representing the words and phrases that people use and the common-sense relationships between them.

Balahur and Hermida (2012) showed that the task of emotion detection from texts such as those in the ISEAR corpus (where little or no lexical clue of affect is present) can best be tackled using approaches based on commonsense knowledge. In this sense, EmotiNet, apart from being a precise resource for classifying emotions in such examples, has the advantage of being extendable with external sources, thus increasing the recall of methods employing it. Patra et al. (2013) adopted the Potts model for probability modeling of a lexical network constructed by connecting each pair of words in which one of the two words appears in the gloss of the other.

In contrast to the previous approaches, the present task comprises classifying emotional phrases by forming Context Vectors and experimenting with simple features such as POS, TF-IDF and Affinity Scores; the computation of similarities based on different distance metrics then helps in making decisions to correctly classify the emotional phrases.

1 http://conceptnet5.media.mit.edu/

3 Data Preprocessing Framework

3.1 Corpus Preparation

The emotional statements were collected from the ISEAR (International Survey on Emotion Antecedents and Reactions) database. Each of the emotion classes contains the emotional statements given by the respondents as answers to some predefined questions. Student respondents, both psychologists and non-psychologists, were asked to report situations in which they had experienced all of the 7 major emotions (anger, disgust, fear, guilt, joy, sadness, shame). The final data set contains reports of 3000 respondents from 37 countries. The statements were split into sentences and tokenized into words; the statistics are presented in Table 1. It is found that 1096 statements belong to each of the anger, disgust, sadness and shame classes, whereas the fear, guilt and joy classes contain 1095, 1093 and 1094 statements, respectively. Since each statement may contain multiple sentences, after sentence tokenization it is observed that the anger and fear classes contain the maximum number of sentences. Similarly, the anger class contains the maximum number of tokenized words.

Emotion   Total Statements   Total Sentences   Total Tokenized Words
Anger     1096               1760              24,301
Disgust   1096               1607              20,871
Fear      1095               1760              22,912
Guilt     1093               1718              22,430
Joy       1094               1554              18,851
Sadness   1096               1606              19,480
Shame     1096               1609              20,948
Total     7,666              11,614            149,793

Table 1: Corpus statistics

The tokenized words were grouped to form trigrams in order to grasp the roles of the previous and next tokens with respect to the target token. Thus, each of the trigrams was considered as a Context Window (CW) from which to acquire the emotional phrases. The updated version of the standard word lists of WordNet Affect (Strapparava and Valitutti, 2004) was collected, and it is observed that a total of 2,958 affect words are present.

It is considered that, in each of the Context Windows, the first word is a non-affect word, the second an affect word, and the third a non-affect word (<NAW1>, <AW>, <NAW2>). It is observed from the CW statistics in Table 2 that the anger class contains the maximum number of trigrams (20,785) and the joy class the minimum (15,743), whereas the fear class contains the maximum number of trigrams that follow the CW pattern (1,573). A few example CW patterns following (<NAW1>, <AW>, <NAW2>) are "advices, about, problems" (anger), "already, frightened, us" (fear), "always, joyous, one" (joy), "acted, cruelly, to" (disgust), "adolescent, guilt, growing" (guilt), "always, sad, for" (sadness), and "and, sorry, just" (shame).
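The pattern extraction just described can be sketched as follows (our own illustration; the toy affect-word set stands in for the 2,958-word WordNet Affect lists).

```python
# Sketch: extract Context Windows (CWs), i.e. trigrams whose middle
# token is an affect word and whose outer tokens are not.
AFFECT_WORDS = {"angry", "frightened", "joyous", "cruelly", "sad", "sorry"}

def extract_cws(tokens, affect_words=AFFECT_WORDS):
    cws = []
    for i in range(1, len(tokens) - 1):
        naw1, aw, naw2 = tokens[i - 1], tokens[i], tokens[i + 1]
        if (aw in affect_words
                and naw1 not in affect_words
                and naw2 not in affect_words):
            cws.append((naw1, aw, naw2))
    return cws

print(extract_cws(["he", "was", "always", "angry", "about", "it"]))
# -> [('always', 'angry', 'about')]
```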

It was observed that stop words are mostly present in the <NAW1, AW, NAW2> pattern, where similar and dissimilar NAWs appear before and after the corresponding AWs. In the case of fear, a total of 979 stop words were found in the NAW1 position and 935 stop words in the NAW2 position. It is also observed that, for fear, the occurrence of similar NAWs before and after the CWs is only 22, in contrast to 1,551 dissimilar occurrences. Table 3 presents the statistics of similar and dissimilar NAWs along with their appearances as stop words.

3.2 Context Vector Formation

In order to identify whether the Context Windows (CWs) play any significant role in classifying emotions, we have mapped the Context Windows into a vector space by representing them as vectors. We then find the semantic relation or similarity between a pair of vectors using an Affinity Score, which takes different distance metrics into consideration. Since a CW follows the pattern (NAW1, AW, NAW2), the vector for each Context Window of each emotion class was formed based on the following formula,

$$\mathrm{Vectorization}(CW) = \left\langle \frac{\#NAW_1}{T}, \; \frac{\#AW}{T}, \; \frac{\#NAW_2}{T} \right\rangle$$

where:
T = total count of CWs in an emotion class
#NAW1 = total occurrences of a non-affect word in the NAW1 position
#NAW2 = total occurrences of a non-affect word in the NAW2 position
#AW = total occurrences of an affect word in the AW position.

It was found that, in the case of the anger emotion, a CW identified as (always, angry, about) corresponds to the vector ⟨0.29, 10.69, 1.47⟩.
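A literal reading of the vectorization formula is sketched below (our own illustration). Note that the example vector above implies some additional scaling in the authors' implementation, since a count divided by the class total could not exceed 1; the sketch follows the formula as stated.

```python
# Sketch: vectorize a CW as <#NAW1/T, #AW/T, #NAW2/T>, with counts
# taken position-wise over all CWs of one emotion class.
from collections import Counter

def vectorize(cw, class_cws):
    T = len(class_cws)                              # total CWs in the class
    naw1_counts = Counter(c[0] for c in class_cws)  # occurrences in NAW1 slot
    aw_counts   = Counter(c[1] for c in class_cws)  # occurrences in AW slot
    naw2_counts = Counter(c[2] for c in class_cws)  # occurrences in NAW2 slot
    naw1, aw, naw2 = cw
    return (naw1_counts[naw1] / T, aw_counts[aw] / T, naw2_counts[naw2] / T)
```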

Emotion   Total Trigrams   Trigrams following the <NAW1, AW, NAW2> pattern (CWs)
Anger     20785            1356
Disgust   17661            1283
Fear      19392            1573
Guilt     18997            1298
Joy       15743            1179
Sadness   16270            1210
Shame     17731            1058

Table 2: Trigram and affect word statistics

Emotion   NAW1 as stop word in CW   NAW2 as stop word in CW   Similar NAW before/after CW   Dissimilar NAW before/after CW
Anger     825                       871                       26                            1330
Disgust   696                       763                       11                            1272
Fear      979                       935                       22                            1551
Guilt     695                       874                       18                            1280
Joy       734                       674                       11                            1168
Sadness   733                       753                       22                            1188
Shame     604                       647                       16                            1042

NAW1 = Non-Affect Word 1; AW = Affect Word; NAW2 = Non-Affect Word 2

Table 3: Statistics for similar and dissimilar NAW patterns and stop words

3.3 Affinity Score Calculation

We assume that each of the Context Vectors in an emotion class is represented in the vector space at a specific distance from the others. Thus, some affinity or similarity must exist between each pair of Context Vectors. An Affinity Score was calculated for each pair of Context Vectors (p_u, q_v), where u = {1, 2, 3, ..., n} and v = {1, 2, 3, ..., n} for the n vectors of each emotion class. The final Score is

calculated using the following gravitational formula, as described in (Poria et al., 2013):

$$\mathrm{Score}(p, q) = \frac{p \cdot q}{\mathrm{dist}(p, q)^2}$$

The Score of any two context vectors p and q of an emotion class is the dot product of the vectors divided by the square of the distance (dist) between p and q. This score was inspired by Newton's law of gravitation and reflects the affinity between the two context vectors: a higher score implies higher affinity between p and q.

However, apart from the score values, we also calculated the median, standard deviation and interquartile range (iqr), and only those context windows whose iqr values were greater than a cutoff value selected during the experiments were considered.

3.4 Affinity Scores using Distance Metrics

In the vector space, it is necessary to calculate how close the context vectors are to one another in order to conduct better classification into their respective emotion classes. The Score values were calculated for all the emotion classes with respect to different distance metrics (dist), viz. Chebyshev, Euclidean and Hamming. The distance was calculated for each context vector with respect to all the vectors of the same emotion class. The distance formulas are given below:

a. Chebyshev distance: Cd(x, y) = max_i |x_i − y_i|, where x and y are two vectors.
b. Euclidean distance: Ed(x, y) = ||x − y||_2 for vectors x and y.
c. Hamming distance: Hd(x, y) = (c_01 + c_10) / n, where c_ij is the number of positions k < n at which x[k] = i and y[k] = j for boolean vectors x and y. The Hamming distance denotes the proportion of disagreeing components in x and y.
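Putting the gravitational score together with the three metrics, a minimal sketch (ours, using SciPy's stock distance functions as stand-ins for the formulas above) looks as follows.

```python
# Sketch: Affinity Score between two context vectors under the three
# distance metrics. SciPy's hamming() compares components for exact
# equality, matching the boolean definition given above.
import numpy as np
from scipy.spatial.distance import chebyshev, euclidean, hamming

def affinity(p, q, dist_fn):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d = dist_fn(p, q)
    return np.dot(p, q) / d**2 if d > 0 else float("inf")

p, q = [0.29, 10.69, 1.47], [0.12, 8.40, 0.95]   # toy context vectors
for name, fn in [("Chebyshev", chebyshev),
                 ("Euclidean", euclidean),
                 ("Hamming", hamming)]:
    print(name, affinity(p, q, fn))
```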

4 Feature Selection and Analysis

It is observed that feature selection always plays an important role in building a good pattern classifier. Thus, we have employed different classifiers, viz. BayesNet, J48, NaiveBayesSimple and DecisionTree, as provided in the WEKA tool. Based on the previous analysis, the following features were selected for developing the classification framework.

1. Affinity Scores based on Cd, Ed and Hd

2. Context Window (CW)
3. POS Tagged Context Window (PTCW)
4. POS Tagged Window (PTW)
5. TF and TF-IDF of CW
6. Variance and Standard Deviation of CW
7. Ranking Score of CW

4.1 POS Tagged Context Windows and Windows (PTCW and PTW)

The sentences were POS tagged using the Stanford POS Tagger, and the POS tagged Context Windows were extracted and termed PTCWs. Similarly, the POS tag sequence of each PTCW was extracted and named a POS Tagged Window (PTW). It is observed that the "fear" emotion class has the maximum number of CWs and unique PTCWs, whereas the "anger" class contains the maximum number of unique PTWs. Figure 1 below presents the counts of CWs, unique PTCWs and unique PTWs. It was noticed that the total number of CWs is 8,967, the total number of unique PTCWs is 7,609, and that of unique PTWs is 3,117. Naturally, the number of PTCWs is less than that of CWs and the number of PTWs less than that of PTCWs, because of the uniqueness of PTCWs and PTWs. Figure 2 shows the total counts of CWs, PTCWs and PTWs. Some sample PTW patterns occurring with the maximum frequencies in three emotion classes are "VBD/RB_JJ_IN" (anger), "NN/VBD_VBN_NN" (disgust) and "VBD_VBN/JJ_IN/NN" (fear).
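The derivation of PTCWs and PTWs can be sketched as below; the paper uses the Stanford POS Tagger, for which NLTK's default tagger is substituted here purely for illustration.

```python
# Sketch: derive the PTCW and PTW for one Context Window.
import nltk  # assumes the 'averaged_perceptron_tagger' data is downloaded

def ptcw_and_ptw(cw):
    tagged = nltk.pos_tag(list(cw))           # [(word, tag), ...]
    ptcw = [f"{w}/{t}" for w, t in tagged]    # POS Tagged Context Window
    ptw = "_".join(t for _, t in tagged)      # POS Tagged Window (tags only)
    return ptcw, ptw

print(ptcw_and_ptw(("always", "angry", "about")))
# e.g. (['always/RB', 'angry/JJ', 'about/IN'], 'RB_JJ_IN')
```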

Figure 1: Counts of CWs, PTCWs and PTWs for the seven emotion classes


Figure 2: Total counts of CWs, PTCWs and PTWs

4.2 TF and TF-IDF Measure

The Term Frequencies (TFs) and Inverse Document Frequencies (IDFs) of the CWs for each emotion class were calculated. In order to identify different ranges of the TF and TF-IDF scores, the minimum and maximum values of TF and the variance of TF were calculated for each emotion class. It was observed that guilt has the maximum scores for Max_TF and variance, whereas emotions like anger and disgust have the lowest Max_TF scores, as shown in Figure 3. Similarly, the minimum, maximum and variance of the TF-IDF values were calculated for each emotion class separately. Again, it is found that the guilt emotion has the highest Max_TF-IDF and the disgust emotion the lowest, as shown in Figure 4.

The TF and TF-IDF scores were computed not only for the Context Windows (CWs) but also for the POS Tagged Context Windows (PTCWs) and POS Tagged Windows (PTWs) with respect to each emotion, and similar results were observed. Variance, or the second moment about the mean, is a measure of the variability (spread or dispersion) of data: a large variance indicates that the data are spread out, while a small variance indicates that they are clustered closely around the mean. The variance of TF-IDF for guilt is 0.0000456874. A few slight differences were found in the PTW results when calculating Max_TF, Min_TF and variance, as shown in Figure 3. It was observed that the fear emotion has the highest Max_TF and anger the lowest, whereas the variance of TF for guilt is 0.0002435522. Similarly, Figure 4 shows that fear has the highest Max_TF-IDF and anger the lowest, and the variance of TF-IDF for fear is 0.000922226.
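These statistics can be reproduced along the lines of the sketch below (our own illustration); the paper does not fully specify the counting unit, so treating each whole CW as a term and each emotion class as one document is an assumption.

```python
# Sketch: TF, TF-IDF, and their min/max/variance per emotion class.
import math
from collections import Counter
from statistics import pvariance

def tf_idf_stats(class_to_cws):
    """class_to_cws: dict mapping emotion -> list of CW tuples."""
    n_classes = len(class_to_cws)
    # Document frequency: in how many classes does each CW appear?
    df = Counter(cw for cws in class_to_cws.values() for cw in set(cws))
    stats = {}
    for emotion, cws in class_to_cws.items():
        tf = Counter(cws)
        tfidf = [(count / len(cws)) * math.log(n_classes / df[cw])
                 for cw, count in tf.items()]
        stats[emotion] = {"min": min(tfidf), "max": max(tfidf),
                          "variance": pvariance(tfidf)}
    return stats
```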

Figure 3: Variance, Max_TF and Min_TF of CW, PTCW and PTW

Figure 4: Variance, Max_TF-IDF and Min_TF-IDF of CW, PTCW and PTW

4.3 Ranking Score of CW

It was found that some of the Context Windows appear more than once in the same emotion class. Thus, duplicates were removed, and a ranking score was calculated for each of the context windows. Each word in a context window was looked up in the SentiWordNet lexicon and, if found, its positive and/or negative scores were considered. The sum of the absolute scores of all the words in a Context Window is returned, and the returned scores were sorted so that each context window obtains a rank within its corresponding emotion class.

All the ranks were calculated for each emotion class successively. This rank is useful for finding the important emotional phrases from the list of CWs. Some examples from the list of the top 12 important context windows according to their rank are "much anger when" (anger), "whom love after" (happy) and "felt sad about" (sadness).
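A sketch of this ranking using NLTK's SentiWordNet interface follows (our own illustration; taking the first synset for each word is an assumption, as the paper does not specify sense selection).

```python
# Sketch: rank CWs by the summed absolute SentiWordNet scores of their
# words (assumes the 'sentiwordnet' and 'wordnet' corpora are
# downloaded via nltk.download).
from nltk.corpus import sentiwordnet as swn

def cw_score(cw):
    total = 0.0
    for word in cw:
        synsets = list(swn.senti_synsets(word))
        if synsets:                      # word found in SentiWordNet
            s = synsets[0]               # first sense (assumption)
            total += abs(s.pos_score()) + abs(s.neg_score())
    return total

def rank_cws(cws):
    """Highest-scoring CW gets rank 1."""
    return sorted(set(cws), key=cw_score, reverse=True)
```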


5 Result Analysis

The accuracies of the classifiers were obtained using user-defined test data and 10-fold cross validation. It is observed that, when the Euclidean distance was considered, the BayesNet classifier gives 100% accuracy on the test data and 97.91% accuracy under 10-fold cross validation. The J48 classifier achieves 77% accuracy on the test data and 83.54% under 10-fold cross validation, whereas the NaiveBayesSimple classifier obtains 92.30% accuracy on the test data and 27.07% under 10-fold cross validation. For NaiveBayesSimple with 10-fold cross validation, the average Recall, Precision and F-measure values are 0.271, 0.272 and 0.264, respectively. The DecisionTree classifier obtains 98.30% and 98.10% accuracies on the test data and under 10-fold cross validation, respectively. The comparative results are shown in Figure 5; overall, the BayesNet classifier achieves the best results on the score data prepared using the Euclidean distance. When the Hamming distance was considered, BayesNet achieved 99.30% accuracy on the test data and 96.92% under 10-fold cross validation, while the J48 and NaiveBayesSimple classifiers produce 93.05% and 85.41% accuracies on the test data and 87.95% and 39.50% under 10-fold cross validation, respectively.

From Figure 6, it is observed that the DecisionTree classifier produces the best accuracy on the score data computed using the Hamming distance. When the score values are computed using the Chebyshev distance, the BayesNet classifier obtains 100% accuracy on the test data and 97.57% under 10-fold cross validation. Similarly, J48 achieves 84.82% accuracy on the test data and 82.75% under 10-fold cross validation, whereas NaiveBayes and DecisionTable achieve 80% and 29.85%, and 98.62% and 96.93% accuracies on the test data and under 10-fold cross validation, respectively.

Based on Figure 7, it has to be mentioned that the DecisionTree classifier performs better than all the other classifiers and achieves the best result on the affinity score data prepared using the Chebyshev distance.

Figure 5: Classification Results on Test data and 10-fold cross validation using Euclidean distance (Ed)

Figure 6: Classification Results on Test data and 10-fold cross validation using Hamming distance (Hd)

Figure 7: Classification Results on Test data and 10-fold cross validation using Chebyshev distance (Cd)

6 Conclusions and Future Work

In this paper, vector formation was carried out for each of the Context Windows, and TF and TF-IDF measures were calculated. The affinity score, computed from the distance values, was inspired by Newton's law of gravitation. To classify the CWs, the BayesNet, J48, NaiveBayesSimple and DecisionTable classifiers were employed.

In future, we would like to incorporate more lexicons to identify and classify emotional expressions. Moreover, we are planning to include an associative learning process to identify important rules for classification.


References

Balahur, A. and Hermida, J. 2012. Extending the EmotiNet Knowledge Base to Improve the Automatic Detection of Implicitly Expressed Emotions from Text. In LREC 2012, pp. 1207–1214.

Das, D. and Bandyopadhyay, S. 2009. Word to Sentence Level Emotion Tagging for Bengali Blogs. In ACL-IJCNLP 2009 (Short Paper), pp. 149–152.

Das, D. and Bandyopadhyay, S. 2010. Developing Bengali WordNet Affect for Analyzing Emotion. In ICCPOL-2010, pp. 35–40.

Ekman, P. 1993. Facial expression and emotion. American Psychologist, vol. 48(4), 384–392.

Erik Cambria, Robert Speer, Catherine Havasi, and Amir Hussain. 2010. SenticNet: A Publicly Available Semantic Resource for Opinion Mining.

Kobayashi, N., K. Inui, Y. Matsumoto, K. Tateishi, and T. Fukushima. 2004. Collecting evaluative expressions for opinion extraction. IJCNLP.

Mohammad, S. and Turney, P. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California.

Patra, B., Takamura, H., Das, D., Okumura, M., and Bandyopadhyay, S. 2013. Construction of Emotional Lexicon Using Potts Model. In IJCNLP 2013, pp. 674–679.

Poria, S., Gelbukh, A., Hussain, A., Howard, N., Das, D., and Bandyopadhyay, S. 2013. Enhanced SenticNet with Affective Labels for Concept-Based Opinion Mining. IEEE Intelligent Systems, vol. 28, no. 2, pp. 31–38.

Scherer, K. R. and Wallbott, H. G. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66, 310–328.

Scherer, K. R. 1997. Profiles of emotion-antecedent appraisal: testing theoretical predictions across cultures. Cognition and Emotion, 11, 113–150.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2008. SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining.

Strapparava, C. and Valitutti, A. 2004. WordNet-Affect: an affective extension of WordNet. In 4th LREC, pp. 1083–1086.

Takamura, Hiroya, Takashi Inui, and Manabu Okumura. 2005. Extracting semantic orientations of words using spin model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 133–140.

Wiebe, J., Wilson, T. and Cardie, C. 2005. Annotating expressions of opinions and emotions in language. LRE, vol. 39(2-3), pp. 165–210.

http://wordnet.princeton.edu
http://www.cs.waikato.ac.nz/ml/weka/
http://emotion-research.net/toolbox/toolboxdatabase.2006-10-13.2581092615
http://www.affective-sciences.org/researchmaterial


Author Index

Aizawa, Akiko, 2

Boudin, Florian, 19

Chen, Dong, 25
Chu, Wanghuan, 25

Das, Dipankar, 32

Erbs, Nicolai, 10

Gurevych, Iryna, 10

Kan, Min-Yen, 1

Nakov, Preslav, 18
Norman, Christopher, 2

Paul, Apurba, 32

Santos, Pedro Bispo, 10

Tuarob, Suppawong, 25
Tucker, Conrad, 25

Zesch, Torsten, 10


