
5-grigorova 1 12 - Acad · 2013. 6. 11. · Bulgarian authors: Dimitar Dimov, Dimitar Talev and...

Date post: 19-Mar-2021

information technologies and control, 1 2012, p. 31

The examples below represent the main groups of impersonal sentences in Bulgarian:

a) Sentences with an impersonal verb (Ex. 6 a). Verbs from this category cannot be part of finite constructs; they are always impersonal.

b) Sentences with a verb that can be used both as finite and as impersonal (Ex. 6 b, c).

c) Sentences with a copula and a predicative word (Ex. 6 d).

III Related Research

The distribution of zero pronouns (ZPs) has been investigated in other pro-drop languages: Spanish [12], Portuguese [9] and Romanian [6]. An algorithm for ZP resolution in Spanish can be found in [10]. Its authors apply the idea of constraints and preferences; the same idea lies at the root of Mitkov’s knowledge-poor pronoun resolution approach [7]. Detection of impersonal clauses, which can improve and complement the algorithm for Spanish, is discussed in [12].

Although anaphora resolution has attracted the attention of many researchers and many approaches have been developed [7], we found only one work dealing with this subject for Bulgarian [16]. That paper presents an anaphora resolver which adapts Mitkov’s knowledge-poor pronoun resolution approach to Bulgarian. It resolves only third-person personal pronouns and does not address zero pronoun resolution. Our first study of this problem is presented in [3, 4], where an algorithm for zero pronoun resolution based on constraints and preferences is discussed. The algorithm takes into account some features of Bulgarian; for instance, a noun phrase (NP) can be lexically realized by an adjective with a definite article. More rules for the identification of impersonal clauses were added in [4]. One of the goals of the present study is to improve the zero pronoun resolution algorithm with new, typically Bulgarian heuristic criteria.

IV Z-corpora - Description and Annotation Criteria

Annotated corpora play an important role in most natural language processing applications. Our immediate use of such corpora is to observe patterns and deduce rules for a rule-based anaphora resolver. Later the same corpora will be used for machine learning methods.

We had access to the existing annotated corpora described in [14], created in the Linguistic Modeling Department at the Bulgarian Academy of Sciences (BAS). These language resources and tools are presented in [17]. Although the existing corpora are a valuable resource and every word is marked up with detailed linguistic information, we decided to create our own corpora specifically for the purposes of zero pronominal anaphora. The main features which make the existing corpora unsuitable for our goals are the following:

• Co-referential relations are marked up only within a single sentence.

Inter-sentential anaphora is not a rare phenomenon: 28% of the ZPs with a lexical antecedent in our annotated corpora are inter-sentential.

• Impersonal verbs are marked up, but impersonal clauses are not.

Impersonal constructs can be expressed by finite verbs in Bulgarian. In the existing corpora the verb in such a clause is marked up as finite, although the clause is in fact impersonal.

• Verb phrases with a modal verb are considered as consisting of two verbs. The second verb is marked as having an omitted subject.

In our opinion, this unnaturally increases the number of zero pronouns. Verb phrases with a modal verb are a specific case of the compound verb predicate. Such a predicate “expresses unified process of the action”5 [11], and the subject of the first verb unconditionally coincides with the subject of the second.

• In the existing corpora clauses with verb zero anaphora are also marked as having an omitted subject.

Our goal is to recover the missing pronoun only when the verb is present, not when it is omitted. A specific case of verb zero anaphora is the omission of the copula. When the compound noun predicate consists of a copula plus a past participle, the past participle is used as an adjective [1]. If there is more than one adjective, the copula is usually used only once, before the first one. We do not consider the remaining participles as verb phrases with ZPs (as our colleagues did) but as adjectives, and we do not mark them up as having ZPs.

Our final task is to create an application which will recover the missing pronouns in unrestricted texts of different genres. Accordingly, the corpora consist of full and partial texts retrieved from the web and from digitized books, encompassing several genres: legal, literary, news and encyclopedic. The Bulgarian Constitution and the beginning of the Labour Code represent the legal genre. The literary genre contains texts only by Bulgarian authors: Dimitar Dimov, Dimitar Talev and Zlatko Enev, an author of children’s books. The texts in the news genre were extracted from articles in web newspapers at the end of 2011; texts with direct speech were avoided. The encyclopedic genre includes texts from computer, historical and medical literature taken from the web portal BooksBg.org. The corpora contain 1029 zero pronouns, more or less evenly distributed across these genres. Direct speech is not annotated.

Annotation criteria are an important issue in corpus annotation. Different schemes for annotating anaphora are discussed in [2]. Our annotation scheme is similar to those in [6, 9, 12], with some differences and additions. The authors of those papers classify clauses as main, subordinate, coordinate and juxtaposed. We classify clauses as main or subordinate, but we also include the type of the sentence as an annotation criterion. The type is one of the following: simple, compound, complex, complex-compound [1, 11]. Before every ZP we put information about: the omitted pronoun, its antecedent (the head noun in the NP), its dependency head (the clause verb on which the ZP depends), the relation (anaphora/cataphora), the type of the sentence and the type of the clause. The antecedent to which the

5 Translated into English by the author.
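The annotation fields listed in this section (omitted pronoun, antecedent head noun, dependency head, relation, sentence type, clause type) could be held in a small record type. The following Python sketch is only an illustration: the class and field names are hypothetical, the sample values are invented, and using `None` for an exophoric antecedent is an assumption rather than something the annotation scheme states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    ANAPHORA = "anaphora"
    CATAPHORA = "cataphora"


class SentenceType(Enum):
    SIMPLE = "simple"
    COMPOUND = "compound"
    COMPLEX = "complex"
    COMPLEX_COMPOUND = "complex-compound"


class ClauseType(Enum):
    MAIN = "main"
    SUBORDINATE = "subordinate"


@dataclass(frozen=True)
class ZeroPronounAnnotation:
    """One annotated zero pronoun (hypothetical record layout)."""
    omitted_pronoun: str        # the pronoun that would fill the gap
    antecedent: Optional[str]   # head noun of the antecedent NP; None = exophoric (assumption)
    dependency_head: str        # the clause verb on which the ZP depends
    relation: Relation
    sentence_type: SentenceType
    clause_type: ClauseType


# Invented sample values, for illustration only.
example = ZeroPronounAnnotation(
    omitted_pronoun="той",      # "he"
    antecedent="авторът",       # "the author"
    dependency_head="пише",     # "writes"
    relation=Relation.ANAPHORA,
    sentence_type=SentenceType.COMPOUND,
    clause_type=ClauseType.MAIN,
)
print(example)
```

Keeping the sentence and clause types as enumerations makes it straightforward to count ZPs per category later, as the quantitative tables do.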


Table 2. ZP and impersonal clauses as a percentage of the total number of clauses

Corpus         Clauses with ZP (%)   Impersonal clauses (%)
Legal          26.45                 0.32
Literary       26.92                 2.73
News           27.27                 7.14
Encyclopedic   27.40                 8.85

Table 3. Distribution of anaphoric and cataphoric clauses

Corpus         Clauses with ZP   Anaphoric   Cataphoric
Legal          251               251         0
Literary       266               256         10
News           252               247         5
Encyclopedic   260               257         3
Total          1029              1011        18

Table 4. Distribution of lexical and exophoric antecedent

Corpus         Lexical ant.; Percentage   Exophoric ant.; Percentage
Legal          251; 100%                  0; 0%
Literary       257; 96.62%                9; 3.38%
News           145; 57.54%                107; 42.46%
Encyclopedic   157; 60.38%                103; 39.62%

Table 5. Distribution of ZPs by type of the sentence

                                     Complex               Complex-compound
Corpus         Simple   Compound   Main   Subordinated   Main   Subordinated   Total
Legal          49       90         2      29             9      72             251
Literary       7        45         13     60             44     97             266
News           32       33         32     97             13     45             252
Encyclopedic   9        49         24     63             19     96             260
Total          97       217        71     249            85     310            1029

Table 6. ZPs in main and subordinated clauses

Corpus         Main   Subordinated   Proportion in percentage
Legal          150    101            59.76 / 40.24
Literary       109    157            40.98 / 59.02
News           110    142            43.65 / 56.35
Encyclopedic   101    159            38.85 / 61.15
Total          470    559            Avg. 45.81 / 54.19


phenomenon - only 18 cataphoric clauses to 1011 anaphoric. This is on average 1.74% of the anaphora phenomenon, with a standard deviation of 1.58, i.e. the distribution across the genres is non-uniform. Our observations show that cataphoric clauses are part of the author’s style: nine out of the ten cataphoric clauses in the literary genre belong to one of the three authors, Dimitar Dimov.
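The spread quoted above can be checked from the Table 3 counts. The sketch below takes the per-genre share of cataphoric clauses among clauses with a ZP and uses the sample standard deviation; both choices are assumptions about how the figures were obtained, made here because they appear to reproduce the reported 1.58. The pooled share, 18 of 1029, comes out at about 1.75%, close to the reported 1.74%.

```python
from statistics import stdev

# (clauses with ZP, cataphoric clauses) per genre, taken from Table 3
counts = {
    "Legal": (251, 0),
    "Literary": (266, 10),
    "News": (252, 5),
    "Encyclopedic": (260, 3),
}

# Share of cataphoric clauses per genre, in percent
shares = {genre: 100.0 * cat / total for genre, (total, cat) in counts.items()}

# Pooled share over all genres, in percent
all_clauses = sum(total for total, _ in counts.values())
all_cataphoric = sum(cat for _, cat in counts.values())
pooled = 100.0 * all_cataphoric / all_clauses

# Sample standard deviation of the four per-genre shares (assumed estimator)
spread = stdev(list(shares.values()))

for genre, share in shares.items():
    print(f"{genre:<13} {share:5.2f}%")
print(f"pooled share: {pooled:.2f}%  sample st. dev.: {spread:.2f}")
```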

Another section of the data presents the distribution of lexical and exophoric antecedents (table 4). Again, the genres diverge considerably. Exophoric antecedents are absent in the legal genre and account for only 3.38% in the literary genre, but for 42.46% in the news. Analysis of the texts shows that in the news and in the encyclopedic texts the authors very often express their own opinion and address the readers without using personal pronouns. Definite-personal clauses make up 83.56% of the clauses with an exophoric antecedent. Another literary technique in the encyclopedic and news genres is the use of indefinite-personal (10.96%) and generalized-personal clauses (5.48%).

The next aspect of the study concerns the type of the sentence in which the zero pronouns occur. Table 5 gives detailed information about the kinds of sentences which include zero pronouns.

A compound sentence consists of independent clauses, whereas complex and complex-compound sentences have independent (main) clauses and subordinate clause(s). A complex sentence has one main clause and at least one subordinate clause. Independent clauses are connected by coordinating conjunctions; subordinated clauses, by subordinating conjunctions. In complex-compound sentences some of the clauses are connected as independent clauses, while others are connected as subordinated clauses [11]. Table 6 shows how many ZPs occur in main and how many in subordinated clauses.

Literary, news and encyclopedic texts have more ZPs in subordinated clauses, in contrast to the legal genre, where the proportion is reversed. Authors of literature, news and encyclopedic books use more narrative and descriptive sentences. On the one hand, these sentences are very often complex and complex-compound; on the other hand, they omit pronouns to avoid redundancy.

Table 7 covers the next aspect of the study: the distance between the anaphor and the antecedent. The table shows that this distance is longest in the legal genre.
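The main/subordinated counts of Table 6 follow from the per-sentence-type counts of Table 5 if ZPs in simple and compound sentences are counted as ZPs in main clauses. That reading is an assumption, although it matches the published Table 6 figures. A minimal Python sketch, with hypothetical names:

```python
# Per genre, from Table 5: simple, compound, complex main, complex subordinated,
# complex-compound main, complex-compound subordinated
table5 = {
    "Legal": (49, 90, 2, 29, 9, 72),
    "Literary": (7, 45, 13, 60, 44, 97),
    "News": (32, 33, 32, 97, 13, 45),
    "Encyclopedic": (9, 49, 24, 63, 19, 96),
}


def main_vs_subordinated(row):
    """Collapse one Table 5 row into (main, subordinated) ZP counts."""
    simple, compound, cx_main, cx_sub, cc_main, cc_sub = row
    # Assumption: clauses of simple and compound sentences count as main clauses.
    main = simple + compound + cx_main + cc_main
    subordinated = cx_sub + cc_sub
    return main, subordinated


table6 = {genre: main_vs_subordinated(row) for genre, row in table5.items()}

for genre, (main, sub) in table6.items():
    total = main + sub
    print(f"{genre:<13} {main:>4} {sub:>4}   "
          f"{100 * main / total:.2f} / {100 * sub / total:.2f}")
```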

In order to be more precise, we calculated not only the average distance (as a number of sentences) but also the standard deviation. The legal genre has the highest standard deviation, 2.96. The literary genre has a standard deviation of 1.14, the news 0.70, and the encyclopedic 0.55. The tendency is the same when the distance is measured in the number of clauses. We were also interested in the most frequently occurring value in the data, i.e. the mode. The results show that the antecedent is most often in the same sentence as the anaphor, in the previous clause. The usual position of the anaphor is next to the dependent verb. The distance to the verb increases when a conjunction, an adverb, a negative particle or a combination of them precedes the verb. The quantity with the greatest dispersion of values is “the distance to the antecedent”, measured in words. Often this distance is 2, 6 or 7 words, but we have an example of a distance of 163 words in the literary genre.

The final aspect of this study is the syntactic position of the antecedent. Data from the corpora show that, of 809 anaphoric clauses with lexical antecedents, in 741 (91.59%) the antecedents are subjects of some previous clause and in 68 (8.41%) they have some other syntactic role: direct object - 28 (3.46%), indirect object - 21 (2.60%), uncoordinated attribute - 16 (1.98%) and adjunct phrase - 3 (0.37%).

VI. Qualitative Analysis

The parser is based on a bottom-up strategy and context-free grammars with extensions. It is implemented in Java. The extension allows a meta-symbol, which can be attached to the right of any symbol in any production, to define the number of possible occurrences of that symbol. The possible meta-symbols are: “?” - the symbol can occur zero or one time; “*” - the symbol can occur zero or more times; “+” - the symbol can occur one or more times. If no meta-symbol is attached to a symbol, it must occur exactly once. These extensions reduce the number of productions which constitute the grammar: we do not need a separate rule for each possible position of the words in the clause. For instance, the production in Ex. 7 means that the clause must contain a verb phrase (VP), and that any number of phrases of any kind, including none, may occur before and after this VP.
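The effect of the occurrence meta-symbols can be illustrated with a small matcher that checks whether a sequence of already recognized constituents fits the right-hand side of one production. This is a Python sketch for illustration only, not the Java parser described here: the function name, the data layout and the simple backtracking strategy are all assumptions.

```python
from typing import List, Optional, Tuple

# One right-hand-side item: (symbol, meta-symbol), meta-symbol in {"", "?", "*", "+"}
Item = Tuple[str, str]

# Allowed number of occurrences per meta-symbol: (minimum, maximum or None for unbounded)
BOUNDS: dict = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def matches(rhs: List[Item], labels: List[str]) -> bool:
    """Return True if the whole label sequence fits the production's right-hand side."""
    if not rhs:
        return not labels  # nothing left to match: succeed only if input is consumed
    (symbol, meta), rest = rhs[0], rhs[1:]
    low, high = BOUNDS[meta]
    # Count how many leading labels equal the current symbol.
    run = 0
    while run < len(labels) and labels[run] == symbol:
        run += 1
    if high is not None:
        run = min(run, high)
    # Try every permitted repetition count, longest first, backtracking on failure.
    for count in range(run, low - 1, -1):
        if matches(rest, labels[count:]):
            return True
    return False


# The production of Ex. 7: Clause -> Phrase* VP Phrase*
clause: List[Item] = [("Phrase", "*"), ("VP", ""), ("Phrase", "*")]

print(matches(clause, ["VP"]))                              # a bare verb phrase
print(matches(clause, ["Phrase", "Phrase", "VP", "Phrase"]))  # phrases on both sides
print(matches(clause, ["Phrase"]))                          # no verb phrase at all
```

With the meta-symbols, the single production covers every arrangement of phrases around the verb phrase, which is the reduction in grammar size that the paragraph describes.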

Ex. 7 Clause → Phrase* VP Phrase*

Because we do not have at our disposal a morphology

Table 7. Antecedent distance and dependent verb distance

Corpus         Distance to antecedent,   Distance to antecedent,   Distance to antecedent,   Distance to dependent verb,
               avg. of sentences         avg. of clauses           avg. of words             avg. of words
Legal          1.67                      3.97                      25.75                     1.36
Literary       0.40                      2.10                      11.87                     1.60
News           0.40                      1.75                      9.39                      1.54
Encyclopedic   0.25                      1.66                      12.18                     1.45
Average        0.67                      2.36                      14.75                     1.49
