The IPP effect in Afrikaans: a corpus analysis

Date post: 25-Jan-2023
Liesbeth Augustinus (1), Peter Dirix (1,2)

(1) Centre for Computational Linguistics, University of Leuven
(2) Nuance Communications, Inc.

{liesbeth,peter}@ccl.kuleuven.be

ABSTRACT

Compared to well-resourced languages such as English and Dutch, NLP tools for linguistic analysis in Afrikaans are still not abundant. In order to facilitate corpus-based linguistic research for Afrikaans, we are creating a treebank based on the Taalkommissie corpus. We adapted a tokenizer and a shallow parser, while using a TnT tagger to do part-of-speech annotation. A first linguistic phenomenon we are investigating is the occurrence of infinitivus pro participio (IPP) in Afrikaans. IPP refers to constructions with a perfect auxiliary, in which an infinitive appears instead of the expected past participle. The phenomenon has been studied extensively in Dutch and German, but studies on Afrikaans IPP triggers are sparse. In contrast to the former two languages, it is often mentioned in the literature that in Afrikaans, IPP occurs optionally. We want to check this statement by means of a corpus analysis.

KEYWORDS: Afrikaans, tokenizer, parser, chunker, corpus search tool, IPP.

1 Introduction

Afrikaans is a West Germanic language spoken as a first language by about 7 million people in South Africa and Namibia and by many millions more as a second language. It can be considered a daughter language of Dutch, as it originates in 17th-century Dutch dialects, brought to southern Africa by settlers from the Netherlands. Although there are some influences from Malay, Portuguese, Bantu, and Khoisan languages, Dutch and Afrikaans are still more or less mutually comprehensible. One of the main features of Afrikaans is a simplification of Dutch morphology, e.g. dropping the nominal gender distinction and only keeping two verb forms for all but the most common verbs (present/infinitive and past participle).

In recent years, several NLP tools were created for Afrikaans, cf. Grover et al. (2011) for an overview of the available tools. Compared to those for well-resourced languages such as English and Dutch, however, the tools available for Afrikaans seem to perform less well.

The purpose of our research is twofold. As a starting point, we describe the NLP tools that were used to process and query the data, as well as the first step towards the creation of a treebank based on an Afrikaans text corpus (the Taalkommissie corpus¹) (cf. section 2).

In the second part of this paper we investigate whether and how the tools and resources that are currently available can be used as a means for descriptive linguistics. As a case study, we will look for the occurrence of infinitivus pro participio, a.k.a. the IPP effect, in the Taalkommissie corpus. In this linguistic study we compare the IPP phenomenon as it is described in the literature (cf. section 3) to its occurrence in the data (cf. sections 4 and 5).

Besides improving the performance of the existing annotation tools, we intend to include the parsed corpus in a user-friendly query engine in order to facilitate corpus-based linguistic research for Afrikaans (cf. section 6).

2 Tools

In order to investigate the linguistic case study described in sections 3 to 5, we automatically annotated the Taalkommissie corpus. This section describes the tools used to annotate and query the corpus. We adapted a tokenizer and a shallow parser, while using a TnT tagger (Brants, 2000) trained on Afrikaans to do part-of-speech (PoS) annotation. We furthermore added a search engine to facilitate corpus exploitation.

2.1 Tokenizer

The Dutch tokenizer (Dirix et al., 2005) used in the METIS-II project is rule-based. It uses regular expressions which model the finite-state characteristics of tokenization, and it gives only one tokenization per sentence, so the output does not contain any ambiguities. The tokenizer basically splits on white space and detaches punctuation marks from the adjacent words. We adapted the Dutch rules to Afrikaans, in particular the rules that deal with abbreviations that include a period and the rules that deal with words containing apostrophes (e.g. the indefinite article ’n).
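The behaviour described above (split on white space, detach punctuation, protect abbreviations with a period and the article ’n) can be illustrated with a minimal Python sketch. It is not the METIS-II tokenizer: it uses plain string operations instead of the original regular expressions, and the abbreviation list and the names `ABBREVIATIONS`, `INDEFINITE_ARTICLE` and `tokenize` are assumptions made for illustration.

```python
# Illustrative abbreviation list (assumption); the actual rule set is not given in the paper.
ABBREVIATIONS = {"bv.", "mnr.", "dr.", "prof."}
# The indefinite article with a straight or a curly apostrophe.
INDEFINITE_ARTICLE = {"'n", "\u2019n"}
LEADING = "\"'(["
TRAILING = ".,;:!?\"')]"


def tokenize(sentence):
    """Split on white space and detach punctuation; exactly one tokenization per sentence."""
    tokens = []
    for chunk in sentence.split():
        # Keep the article 'n intact instead of detaching its apostrophe.
        if chunk.lower() in INDEFINITE_ARTICLE:
            tokens.append(chunk)
            continue
        leading = []
        while chunk and chunk[0] in LEADING:
            leading.append(chunk[0])
            chunk = chunk[1:]
        trailing = []
        # Stop stripping as soon as what remains is a known abbreviation.
        while chunk and chunk[-1] in TRAILING and chunk.lower() not in ABBREVIATIONS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens
```

Because every chunk is processed by one deterministic rule sequence, the sketch, like the tokenizer described above, never returns alternative tokenizations.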

¹ Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns (2011).

2.2 Tagger and tag set

We used the TnT Tagger (Brants, 2000), a Hidden Markov Model based n-gram tagger, which was trained by CTexT² to tag the corpus (further referred to as the CTexT tagger). The tag set consists of 139 different tags, based mainly on morphosyntactic features (Pilon, 2005).

In the case of verbs, which is the most relevant PoS for our research (cf. sections 3 and 5), a distinction is made between transitive and intransitive verbs, between separable and inseparable verbs, and also between main verbs, modal verbs, temporal auxiliaries and passivizing auxiliaries. Marked forms (ge-marking or simple past in the case of a few auxiliaries) and unmarked forms are also distinguished. Altogether, there are 17 verbal tags, as shown in Table 1.

Tag    Value
VTHOG  inseparable transitive main verb, unmarked
VVHOG  inseparable transitive main verb, marked
VTHOO  inseparable intransitive main verb, unmarked
VVHOO  inseparable intransitive main verb, marked
VTHOV  inseparable intransitive main verb requiring preposition, unmarked
VVHOK  inseparable intransitive main verb requiring preposition, marked
VTHOK  copula, unmarked
VVHOK  copula, marked
VTHSG  separable transitive main verb, unmarked, marked
VTHSO  separable intransitive main verb, unmarked
VTUOM  modal auxiliary, present
VVUOM  modal auxiliary, past
VTUOA  aspectual auxiliary, present
VVUOA  aspectual auxiliary, past
VTUOP  passive auxiliary, present
VVUOA  passive auxiliary, past
VUOT   temporal auxiliary

Table 1: Verbal tags in the CTexT tagger.

The author of the tagger claims an accuracy of 85.87% on a small data set, which is rather low compared to state-of-the-art PoS taggers for well-resourced languages.³ Although there are some other taggers trained for Afrikaans, they did not seem to meet our research goals. For example, Schlünz (2010) reports an accuracy of 94.64%, but with a tag set reduced to only 17 different tags. The TiMBL-based tagger for Afrikaans (Puttkammer, 2006) is not usable for our purpose, because it mainly identifies different categories of named entities instead of the regular PoS tags.

² Centre for Text Technology, North-West University, Potchefstroom, South Africa.

³ The low accuracy is probably due to the fact that the tagger is trained on only 20,000 tokens.

2.3 Parser

We aim to create a treebank for Afrikaans. As a starting point for syntactic annotations, we used ShaRPa, a shallow rule-based parser (Vandeghinste, 2008) coming with grammars for English and Dutch. In order to parse the Taalkommissie corpus, we created an Afrikaans grammar. The different steps can either be defined as context-free grammars, using the PoS tags as preterminals, or as Perl subroutines, defined in a Perl module. Note that the grammars are not automatically processed in a recursive mode. The module allows the application of rules which cannot be formulated in the context-free grammar formalism. An option file defines the application order of the different grammars and subroutines. Both grammars and subroutines can be applied more than once.
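The idea of an ordered, non-recursive cascade of grammars can be illustrated with a small Python sketch. ShaRPa itself is written in Perl and its rule and option file formats are not shown in this paper, so everything here is an assumption for illustration: the functions `apply_rule` and `run_cascade`, the toy grammars, and the use of a plain list as a stand-in for the option file. The labels (PDVEB, NSE, P00B, NSE0, NP) are borrowed from the tagger and parser output quoted in this section.

```python
def apply_rule(nodes, label, pattern):
    """One left-to-right pass: replace each match of `pattern` (a list of labels) by a new node."""
    out, i, n = [], 0, len(pattern)
    while i < len(nodes):
        window = nodes[i:i + n]
        if [node[0] for node in window] == pattern:
            out.append((label, window))  # new phrase node with its children
            i += n
        else:
            out.append(nodes[i])
            i += 1
    return out


def run_cascade(tagged, grammars, order):
    """Apply the named grammars in the given order; nothing is re-applied automatically."""
    # Leaves are (PoS tag, token); phrase nodes are (label, list of children).
    nodes = [(tag, token) for token, tag in tagged]
    for name in order:  # `order` plays the role of the option file (assumption)
        for label, pattern in grammars[name]:
            nodes = apply_rule(nodes, label, pattern)
    return nodes


# Toy grammars (hypothetical), each a list of (phrase label, right-hand side).
GRAMMARS = {
    "noun": [("NSE0", ["NSE"])],
    "pron": [("P00B", ["PDVEB"])],
    "np": [("NP", ["P00B", "NSE0"])],
}
TAGGED = [("haar", "PDVEB"), ("handpalms", "NSE")]
```

With the order `["noun", "pron", "np"]` the two tokens end up in a single NP node. With `["np", "noun", "pron"]` the NP rule fires too early and finds nothing, which is why a grammar may have to be listed more than once, e.g. `["np", "noun", "pron", "np"]`.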

Since this is a shallow parser, there is not much depth in the resulting parse tree. It uses the tagged corpus as input, and returns the parsed structure, with marked NPs, PPs, verb groups (VG), and some APs and VPs. The head of a phrase is also marked (/M). Each phrase is presented on one line. Each line is divided into three columns: the phrase tokens, the phrase name (assigned by ShaRPa), and the phrase structure, representing the parse building history (containing the PoS tags assigned by the tagger).

An example parse for the sentence Dis haar handpalms wat begin sweet het, besef sy. (It is the palms of her hands that had started to sweat, she noticed.):

<s>
dis              NP    NP[NSE0[NSE]]
haar handpalms   NP    NP[P00B[PDVEB] NSE0[NSE]/M]
wat              PB    PB
begin sweet het  VG    VG[VP[VTHOO] VP[VVHOO] VUOT/M]
,                ZM    ZM
besef            NP    NP[NA]
sy               P00B  P00B[PDHEB]
.                ZE    ZE
</s>

Note that dis, a shortened form of dit is (it is), is mistagged as the far less frequent homograph noun (a formal word for ‘table’) and that besef, which can be both a verb (to notice) and a noun (notion), is mistagged as a noun.
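A reader of this three-column output could be sketched in Python as follows. The column separator is not specified in the paper; the sketch assumes a tab, and the names `LEAF` and `parse_line` are hypothetical. PoS tags are recovered from the structure column as the labels that are not followed by an opening bracket, skipping the head marker /M.

```python
import re

# A leaf is a label that is not followed by "[" and is not the "M" of a "/M" head marker.
LEAF = re.compile(r"(?<![A-Z0-9/])[A-Z0-9]+(?![A-Z0-9\[])")


def parse_line(line, sep="\t"):
    """Split one output line into its three columns (tab separator is an assumption)."""
    tokens, phrase, structure = line.rstrip("\n").split(sep)
    return {
        "tokens": tokens.split(),
        "phrase": phrase,
        "structure": structure,
        "pos_tags": LEAF.findall(structure),
        "has_head": "/M" in structure,
    }


example = parse_line("begin sweet het\tVG\tVG[VP[VTHOO] VP[VVHOO] VUOT/M]")
print(example["tokens"], example["phrase"], example["pos_tags"])
```

For the verb group of the example sentence this recovers the tag sequence VTHOO, VVHOO, VUOT, i.e. the same information as the flat tag string.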

The verb groups, however, do not give more information than the sequence of tags. As our shallow parser currently does not identify discontinuous verb groups, we will need to introduce a full parse in order to be able to use this information. The quality of the tagging also influences the quality of the parse, so we need to improve the tagger results in order to achieve better results.

2.4 Corpus search tool

In order to look for linguistic constructions in the Taalkommissie corpus, we have created a corpus search tool.

The preprocessing consisted of tokenizing and tagging the corpus with the tokenizer and PoS tagger described in sections 2.1 and 2.2 respectively. Next, we assigned a unique identifier to each sentence. Then we stored the complete corpus in a PostgreSQL database.⁴ For each sentence, we included the following information in the database:

ID | sentence | PoS string | token-PoS string

Since the Taalkommissie corpus is rather large, we used the (built-in) B-tree indexing to speed up corpus search.
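The four-field record and the index can be illustrated with a small, self-contained Python sketch. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL (SQLite indexes are also B-trees). The table name, column names, index name, the helper `sentences_with_pos` and the single toy row are assumptions; the actual schema is not given in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sentences (
           id               TEXT PRIMARY KEY,
           sentence         TEXT NOT NULL,
           pos_string       TEXT NOT NULL,
           token_pos_string TEXT NOT NULL
       )"""
)
# B-tree index on the PoS string (hypothetical index name).
conn.execute("CREATE INDEX idx_pos_string ON sentences (pos_string)")

# Toy row: only the three-word phrase used as an example query in section 2.4.
conn.execute(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    ("toy-1", "het kom kuier", "VUOT VTUOA VTHOG",
     "het[VUOT] kom[VTUOA] kuier[VTHOG]"),
)


def sentences_with_pos(connection, pos_string):
    """Exact-match lookup on the PoS string, the kind of lookup a B-tree index serves well."""
    rows = connection.execute(
        "SELECT id, sentence FROM sentences WHERE pos_string = ?", (pos_string,)
    )
    return rows.fetchall()
```

A B-tree index supports equality and prefix lookups efficiently; how the actual tool combines the index with pattern queries inside a sentence is not described here, so that part is left open.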

In order to facilitate querying the corpus, we added a search interface on top of the database. The interface is a combination of PHP scripts and HTML, resulting in a web-based search tool which allows users to query the corpus without any local installation of corpora and/or software.

⁴ http://www.postgresql.org

As input, the user provides a query which could be a string of tokens, e.g. het kom kuier (lit. ‘have come visit’), a string of PoS tags, e.g. [VUOT] [VTUOA] [VTHOG] (base form of temporal auxiliary, aspectual verb, main verb), or a combination of both tokens and tags, e.g. het[VUOT] kom[VTUOA] kuier[VTHOG]. Note that PoS tags should be put between square brackets. It is furthermore possible to use a wildcard for the PoS tags. For example, if one wants to look for any verb form, [V*] can be used; if one wants to differentiate between base forms and inflected verb forms, [VT*] and [VV*] can be used respectively.
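One way to interpret such queries is to translate them into a regular expression over the token-PoS string. The Python sketch below does this under stated assumptions: the token-PoS string is taken to have the form token[TAG] separated by spaces, and the function name `compile_query` is hypothetical. The actual tool is implemented in PHP and its internals are not described in this paper.

```python
import re


def compile_query(query):
    """Translate a query such as 'het[VUOT] [VT*] kuier' into a compiled regular expression."""
    parts = []
    for item in query.split():
        m = re.fullmatch(r"([^\[\]]*)(?:\[([^\[\]]+)\])?", item)
        if m is None or not (m.group(1) or m.group(2)):
            raise ValueError(f"malformed query item: {item!r}")
        token, tag = m.group(1), m.group(2)
        # A missing token or tag matches anything; '*' in a tag is a wildcard.
        token_re = re.escape(token) if token else r"[^\s\[\]]+"
        tag_re = re.escape(tag).replace(r"\*", r"[^\]]*") if tag else r"[^\]]+"
        parts.append(rf"{token_re}\[{tag_re}\]")
    # Anchor on white space so that a match always covers whole token[TAG] items.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


# Token-PoS string built from the tagger output of the example sentence in section 2.3.
SAMPLE = ("dis[NSE] haar[PDVEB] handpalms[NSE] wat[PB] begin[VTHOO] "
          "sweet[VVHOO] het[VUOT] ,[ZM] besef[NA] sy[PDHEB] .[ZE]")

match = compile_query("[V*] [V*] het[VUOT]").search(SAMPLE)
print(match.group(0) if match else "no match")
```

Token-only, tag-only and mixed items all go through the same code path, so a query such as `begin sweet` and a query such as `[VT*] [VV*]` can be freely combined.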

Furthermore, there is an option to include some context before and after the matching sentences. This might be useful to disambiguate homonyms in the case of short sentences, or if one is interested in discourse phenomena.

After querying the corpus, the results are presented to the user (see screenshot in Figure 1). At the top of the page, the search instruction is repeated. Below the query, a list of matching sentences is displayed. The constructions matching the query are highlighted in each sentence. It is also possible to view/save the results as plain text (with and without PoS tags).

Figure 1: Corpus search tool interface

At the bottom of the page, a grid with the corpus results is printed. It indicates how many hits in how many matching sentences were found. Furthermore, the ratio (matching sentences/sentences in the corpus) is given.

At the moment, it is not possible to query the corpus partially. It might be interesting to look into specific parts of the corpus (e.g. newspaper texts only), but unfortunately the corpus layout did not allow us to divide the corpus along those lines.

3 Infinitivus pro participio

3.1 IPP in double infinitive constructions

Infinitivus pro participio (IPP) or Ersatzinfinitiv is a linguistic phenomenon occurring in a subset of the West Germanic languages, such as Dutch, German, and Afrikaans. IPP refers to constructions with a perfect auxiliary, in which an infinitive appears instead of the expected past participle. In Afrikaans, one expects the temporal auxiliary for the perfect tense to select a past participle, marked in various ways, most generally by a prefix ge- and sometimes an ending (usually either -d/-t or -en), cf. gebly in example (1a).⁵ However, when a verb occurring in the perfect tense selects another verb, it commonly occurs as an infinitive, cf. bly in example (1b), instead of the expected past participle, as illustrated in example (1c).⁶

(1) (a) Hy  het        stil    gebly.
        he  have:PRES  silent  stay:PP
        ‘He remained silent.’

    (b) Hy  het        bly       praat.
        he  have:PRES  stay:INF  talk:INF
        ‘He kept on talking.’

    (c) Hy  het        gebly    praat.
        he  have:PRES  stay:PP  talk:INF
        ‘He kept on talking.’

While Dutch and German grammars mention general types of verbs (e.g. modal verbs) for which IPP is either required or optional, none of our Afrikaans sources do. Nevertheless, Ponelis (1979), Zwart (2007), and De Vos (2001) report that the IPP effect appears optionally in Afrikaans. This contrasts with Dutch and German, as in those languages the IPP phenomenon is obligatory for certain verbs, see amongst others Haeseryn et al. (1997) and Dudenredaktion (2006). Donaldson (1993) mentions, however, that IPP is triggered in most cases, such as in example (1b). Constructions with a past participle such as (1c) do occur, but Donaldson considers them non-standard Afrikaans. A construction similar to (1c) is not possible in Dutch, as the cognate verb blijven (stay) always triggers IPP.

De Vos (2001) also reports that some of the IPP triggers, esp. laat (let), tend to passivize fairly productively (2). This phenomenon is ungrammatical in Dutch and German.

(2) Hierdie  huis   is       deur  my  oom    (ge)laat     bou.
    This     house  be:PRES  by    my  uncle  let:PRES/PP  build:PRES
    ‘My uncle had this house built.’

3.2 IPP in progressive constructions

Apart from double infinitive constructions, there is a second construction in which IPP can be triggered. Afrikaans has a serialization pattern using the conjunction en (and) in order to express the continuous or progressive aspect of the verb, as in example (3a). Such constructions also exist in English (e.g. He sits and reads), but not in Dutch or German. In the perfect of this construction, the first main verb has optional ge-marking, so it optionally triggers IPP, while the second main verb always occurs in the infinitive, as shown for the verb staan (stand) in examples (3b-c). Both forms are considered standard Afrikaans by Ponelis (1979), Zwart (2007), Donaldson (1993), and Verdoolaege and Van Keymeulen (2010).

⁵ Some verbs have no ge-prefix though, so the past participle might actually be the same as the infinitive, e.g. bestuur (drive), begin (start, begin).

⁶ Note that both examples (1b) and (1c) are grammatical in Afrikaans.

(3) (a) Ons  staan       stil   en   luister.
        we   stand:PRES  still  and  listen:PRES
        ‘We are standing and listening.’

    (b) Ons  het        stil   staan      en   luister.
        we   have:PRES  still  stand:INF  and  listen:INF
        ‘We were standing and listening.’

    (c) Ons  het        stil   gestaan   en   luister.
        we   have:PRES  still  stand:PP  and  listen:INF
        ‘We were standing and listening.’

De Vos (2001) reports that, although speaker judgments might vary, it is generally difficult to passivize indirect linking verbs (4), while Breed (2012) considers them grammatical.

(4) Die  appel  word         deur  hom  gesit   en   eet.
    The  apple  become:PRES  by    him  sit:PP  and  eat:PRES
    ‘The apple was being eaten by him.’

This construction is also impossible in both Dutch and German.

4 Hypothesis, data, and methodology

Based on the literature, the hypothesis is that, in contrast to Dutch and German, IPP occurs optionally in Afrikaans. We will test the hypothesis through a corpus-based study, using a PoS-tagged version of the Taalkommissie corpus.⁷ The corpus, which is compiled by the Afrikaans language committee of the South African Academy for Science and Arts, contains about 58 million words of formal, written Afrikaans. It comprises many different text types, including newspaper articles, magazines, Bible texts, scientific articles, and study guides.

In order to query the corpus, we have created a corpus search tool (cf. section 2.4), which enables us to look for IPP constructions and their counterexamples with a past participle. We aim to find out whether IPP is actually optional or required in both double infinitive constructions and progressive constructions. Furthermore, we will investigate which verbs occur as IPP triggers in Afrikaans. The results of the corpus study are presented in section 5.

5 Results and discussion

⁷ Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns (2011).

5.1 IPP in double infinitive constructions

In order to retrieve IPP in double infinitive constructions and counterexamples with past participles in the Taalkommissie corpus, we extracted all combinations in which the verb form het (have) was followed or preceded by two verbs.⁸ In addition, we also looked at sequences in which one other word occurs between het and the two other verbs. Although it is possible that more than one word occurs between het and the verbal group, we limited our research to constructions with zero or one word between het and the two verb forms.⁹ This resulted in 9,880 hits, which were manually checked and categorized. We threw out the false positives due to wrong tagging and cases that did not involve main verbs that trigger an infinitive. We also ignored the modal verbs kan (can), mag (may) and moet (must), as in those cases it is often hard to distinguish the matrix verb from the embedded verb.

⁸ We used the query het[VUOT] [VT*] [VT*] to retrieve double infinitive constructions. Discontinuous constructions as well as counterexamples were found using variations of this query.

⁹ Since we only have a ‘flat’ corpus, it is hard to retrieve discontinuous structures. Using a treebank should solve this problem.
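The extraction step described in this section (het followed or preceded by two verbs, with zero or one intervening word) could be sketched in Python as a window scan over a tagged sentence. The exact query variations used for the study are not listed, so the function `find_candidates`, its `max_gap` parameter and the choice to accept any tag starting with V are assumptions. The candidates returned would still need the manual checking described above.

```python
def is_verb(tag):
    """Any verbal tag, i.e. the equivalent of the wildcard [V*]."""
    return tag.startswith("V")


def find_candidates(tagged, max_gap=1):
    """Return (index of het, index of first verb) pairs for het plus two adjacent verbs."""
    hits = []
    for i, (token, tag) in enumerate(tagged):
        if token.lower() != "het" or tag != "VUOT":
            continue
        for gap in range(max_gap + 1):
            # Pattern: het (+ gap words) V V
            j = i + 1 + gap
            if j + 1 < len(tagged) and is_verb(tagged[j][1]) and is_verb(tagged[j + 1][1]):
                hits.append((i, j))
            # Pattern: V V (+ gap words) het
            k = i - gap - 2
            if k >= 0 and is_verb(tagged[k][1]) and is_verb(tagged[k + 1][1]):
                hits.append((i, k))
    return hits


# Tagger output of the example sentence in section 2.3 (including its tagging errors).
SENTENCE = [("dis", "NSE"), ("haar", "PDVEB"), ("handpalms", "NSE"), ("wat", "PB"),
            ("begin", "VTHOO"), ("sweet", "VVHOO"), ("het", "VUOT"), (",", "ZM"),
            ("besef", "NA"), ("sy", "PDHEB"), (".", "ZE")]
print(find_candidates(SENTENCE))
```

On a flat tag sequence this only scales to small gaps, which matches the restriction to zero or one intervening word noted above.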

We retained 5,679 matches for the infinitive selecting verbs, of which 5,616 occur as IPP triggers (98.89% of the constructions under consideration). The results are shown in Table 2.

Verb       IPP     No IPP  Two PPs  Total   % IPP   Translation
aanhou     45      6       0        51      88.24   keep on
begin      1,454   1       0        1,455   99.93   begin
bly        270     0       1        271     99.63   stay
doen       1       0       0        1       100.00  do, make
durf       35      1       0        36      97.22   dare
gaan       853     0       0        853     100.00  go
help       110     8       0        118     93.22   help
hoor       4       0       0        4       100.00  hear
kom        645     5       12       662     97.43   come
laat       1,458   2       0        1,460   99.86   let
leer       26      7       0        33      78.79   learn/teach
loop       0       1       0        1       0.00    walk, run
maak       1       5       0        6       16.67   make, do
ophou      16      6       0        22      72.73   stop, end
probeer    564     1       0        565     99.82   try
sien       130     5       2        137     94.89   see
wil        4       0       0        4       100.00  want
TOTAL      5,616   48      15       5,679   98.89

Table 2: IPP in double infinitive constructions.
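The totals and percentages in Table 2 can be recomputed from the per-verb counts. The short Python sketch below does so; the counts are copied from the table, while the names `TABLE2` and `ipp_share` are introduced here for illustration only.

```python
# (IPP, No IPP, Two PPs) per verb, copied from Table 2.
TABLE2 = {
    "aanhou": (45, 6, 0), "begin": (1454, 1, 0), "bly": (270, 0, 1),
    "doen": (1, 0, 0), "durf": (35, 1, 0), "gaan": (853, 0, 0),
    "help": (110, 8, 0), "hoor": (4, 0, 0), "kom": (645, 5, 12),
    "laat": (1458, 2, 0), "leer": (26, 7, 0), "loop": (0, 1, 0),
    "maak": (1, 5, 0), "ophou": (16, 6, 0), "probeer": (564, 1, 0),
    "sien": (130, 5, 2), "wil": (4, 0, 0),
}


def ipp_share(ipp, no_ipp, two_pps):
    """Percentage of constructions in which the selecting verb appears as an infinitive."""
    return round(100 * ipp / (ipp + no_ipp + two_pps), 2)


# Column sums: IPP, No IPP, Two PPs.
totals = tuple(sum(column) for column in zip(*TABLE2.values()))
print(totals, sum(totals), ipp_share(*totals))
```

The column sums give 5,616 IPP cases out of 5,679 constructions, and the two non-IPP columns together account for the remaining 63 sentences.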

Although some verbs are used rather infrequently in this construction, it is clear that in most of the cases, IPP is actually applied. Only for maak and the separable verbs aanhou and ophou, we see a slightly higher percentage of cases that do not have IPP. Verbs like begin, bly, durf, gaan, help, hoor, kom, laat, probeer, and sien seem to require IPP, cf. examples (5) and (7a), while we could consider it optional at least for leer, cf. example (6). We also see a few cases, esp. for kom, in which both the main verb and the verb triggered by it appear as past participles, cf. example (7b). This is not allowed by any of the Afrikaans grammars we consulted (cf. section 3.1). In general, we can conclude that there is a clear tendency for infinitive-selecting verbs to trigger IPP. We have only found 63 sentences in which the selecting verb receives ge-marking, which might explain why Donaldson considers such constructions substandard. De Vos (2001) links the optionality to the level of formality.

(5) My  maag     het        begin      draai.
    my  stomach  have:PRES  begin:INF  turn:INF
    ‘My stomach has started to turn.’ [TKK, a00-2487]

(6) (a) Hoe  ek  leer       lees      het,       weet  ek  nie.
        how  I   learn:INF  read:INF  have:PRES  know  I   not
        ‘I do not know how I have learned to read.’ [TKK, a21-26482]

    (b) Dink   terug  hoe  jy   geleer    bestuur    het        (...)
        think  back   how  you  learn:PP  drive:INF  have:PRES
        ‘Think about the time you learned to drive (...)’ [TKK, a16-20128]

(7) (a) (...) Ons  het        kom       kuier.
              we   have:PRES  come:INF  visit:INF
        ‘(...) We came to visit.’ [TKK, a26-9964]

    (b) ’n  Vragmotor  wat    in  die  teenoorgestelde  rigting    aangery           gekom    het        (...)
        a   lorry      which  in  the  opposite         direction  drive-towards:PP  come:PP  have:PRES
        ‘A lorry which came from the opposite direction (...)’ [TKK, a44-12672]

As De Vos (2001) claimed, we found some passivized constructions with these selecting verbs (see Table 3), but they are far less frequent than the active variant. We investigated both the present form with word and the perfect form with is. Of the selecting verbs used in the passive, laat is by far the most frequent. There is only one counterexample using a past participle instead of the IPP construction.

Verb     IPP present  No IPP present  IPP perfect  No IPP perfect  Total  % IPP   Translation
begin    2            0               5            0               7      100.00  begin
help     0            0               1            0               1      100.00  help
laat     11           1               43           0               55     98.18   let
probeer  8            0               3            0               11     100.00  try
TOTAL    21           1               52           0               74     98.65

Table 3: IPP in passive double infinitive constructions.

5.2 IPP in progressive constructions

In a second test, we looked at IPP triggers in the serialized form of progressive constructions.¹⁰ We again selected cases with het, but now with the conjunction en (and) between the two content verbs. This resulted in 1,743 hits, which were again categorized manually. We only retained 244 positive examples, of which 50.82% appeared as IPP triggers. The results are shown in Table 4.

It is clear that the construction as such is only frequent using lê, sit and staan as IPP triggers. IPP occurs in slightly less than half of the cases for sit and staan, so we can agree with the grammars that IPP is optionally triggered in progressive constructions, cf. example (8). For lê however, there seems to be a clear preference for the IPP construction. The progressive also occurs a few times with loop, but in that case the past participle seems to be preferred. We encounter again a few cases of two past participles, cf. example (9b). Similar to the constructions with double participles in section 5.1, such constructions seem less preferred. If we compare the results with the frequencies of a verb being the trigger for the progressive construction in this corpus (Breed, 2012), we see that verbs that frequently occur in the progressive (sit, staan, and lê) are more likely to apply IPP than loop, which is less likely to occur in this construction.

¹⁰ We used the query het[VUOT] [VT*] en[KN] [VT*] to retrieve IPP in progressive constructions. Discontinuous constructions as well as counterexamples were found using variations of this query.

Verb       IPP     No IPP  Two PPs  Total   % IPP   Translation
bly        0       0       1        1       0.00    stay, remain
bystaan    0       1       0        1       0.00    stand near
kom        0       0       1        1       0.00    come
lê         30      5       0        35      85.71   lie
loop       1       4       1        6       16.67   walk, run
rondstaan  0       1       0        1       0.00    stand around
sit        48      58      1        107     44.86   sit
staan      45      47      0        92      48.91   stand
TOTAL      124     116     4        244     50.82

Table 4: IPP in progressive/continuous constructions.

Note that most of the verbs that use this construction do not occur in the double infinitive construction (cf. Table 2), while their Dutch cognates do (e.g. Afrikaans lê vs. Dutch liggen (to lie)). We can conclude that both constructions are in general mutually exclusive.

(8) (a) (...) waar   hy  die  spul   onder  ’n  soetdoring        sit      en   dophou     het        (...)
              where  he  the  stuff  under  a   sweet thorn tree  sit:INF  and  watch:INF  have:PRES
        ‘(...) where he was watching the stuff under a sweet thorn tree (...)’ [TKK, a25-14908]

    (b) Ek  het        daar   gesit   en   wag       op   Brett  (...)
        I   have:PRES  there  sit:PP  and  wait:INF  for  Brett
        ‘I was waiting there for Brett (...)’ [TKK, a34-8014]

(9) (a) Hy  vertel     hoe  hy  (...)  vir  die  hysbak  staan      en   wag       het        (...)
        he  tell:PRES  how  he         for  the  lift    stand:INF  and  wait:INF  have:PRES
        ‘He tells how he (...) waited in front of the lift (...)’ [TKK, a34-1063]

    (b) (...) ’n  paar    meter  van   waar   ek  nog    so  rustig  gestaan   en   gesels    het.
              a   couple  metre  from  where  I   still  so  quiet   stand:PP  and  chat:INF  have:PRES
        ‘(...) a couple of metres from where I was chatting (...)’ [TKK, a34-1086]

According to Breed (2012) passive constructions with indirect linking verbs are possible, but she was, like us, not able to find any examples in the Taalkommissie corpus.

6 Conclusions and future work

The case study on IPP triggers in Afrikaans shows that a corpus-based study can shed new light on the descriptive research of a linguistic phenomenon. Based on the literature, we assumed that IPP is optionally triggered in Afrikaans (both in double infinitive constructions and in progressive constructions). The corpus results, however, reveal that infinitive-selecting verbs in double infinitive constructions trigger IPP in almost 99% of the constructions under investigation. The results of the progressive constructions are more consistent with the current literature, since the IPP phenomenon optionally occurs in such constructions (i.e. in ca. 50% of the cases). Moreover, we can conclude that verbs that occur as IPP triggers in the double infinitive construction do not occur as IPP triggers in the progressive construction and vice versa.

Although we obtained some nice results from the present study, we had to do a lot of (semi-)manual filtering of the results. In order to reduce such tasks, as well as to improve the quality of the annotated data, we will improve the output of the annotation tools in future research. As the CTexT tagger still contains a lot of errors which could be corrected by a simple rule-based extension, we will create a rule-based tag corrector based on the Brill tagger (Brill, 1992).

We also want to extend the parser in order to have different options from the current shallow parsing (including updating and improving the current grammars) to a full parse tree. The parser will then be used to find constructions with more tokens intervening between the relevant items (i.e. between the auxiliary het and the infinitives and past participles in our case study). We will need to adapt the search tool to be able to search for chunks as well. Of course, this includes dealing with issues like efficient querying, indexing, and the representation of the trees in the tool. Besides improving existing tools, we will run a lemmatizer on the data, in order to include lemmas in the search tool as well.

Finally, all of this will be integrated in an Afrikaans equivalent of GrETEL (Augustinus et al., 2012), a query engine in which linguists can use a natural language example as a starting point for searching a treebank with limited knowledge about tree representations and formal query languages.

Using all these tools, we want to further investigate the IPP effect in Afrikaans. For example, it would be interesting to investigate whether the number of tokens occurring between the auxiliary and the other verb(s) has an influence on the construction used. Those results can also be useful for a cross-linguistic comparison with similar work in Dutch and German.

Acknowledgments

We wish to thank the people of the Taalkommissie and CTexT for providing us with the Taalkommissie corpus and the PoS tagger.

References

Augustinus, L., Vandeghinste, V., and Van Eynde, F. (2012). Example-Based Treebank Querying. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul.

Brants, T. (2000). TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), pages 224–231, Seattle.

Breed, A. (2012). Die grammatikalisering van aspek in Afrikaans: semantiese studie van perifrastiese progressiewe konstruksies. PhD thesis, North-West University, Potchefstroom.

Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing (ANLC ’92), pages 152–155, Stroudsburg, PA.

De Vos, M. (2001). Afrikaans Verb Clusters: A Functional-Head Analysis. Master’s thesis, University of Tromsø, Tromsø.

Dirix, P., Vandeghinste, V., and Schuurman, I. (2005). METIS-II: Example-based machine translation using monolingual corpora – System description. In Proceedings of MT Summit X, Workshop on Example-Based Machine Translation, pages 43–50, Phuket.

Donaldson, B. C. (1993). A Grammar of Afrikaans. Mouton de Gruyter, Berlin/New York.

Dudenredaktion (2006). DUDEN. Die Grammatik. Unentbehrlich für richtiges Deutsch. Dudenverlag, Mannheim/Leipzig/Vienna/Zürich.

Grover, A. S., van Huyssteen, G. B., and Pretorius, M. W. (2011). A Technology Audit: The State of Human Language Technologies (HLT) R&D in South Africa. In Proceedings of PICMET’11: Technology Management In The Energy-Smart World (PICMET), pages 1693–1706.

Haeseryn, W., Romijn, K., Geerts, G., de Rooij, J., and van den Toorn, M. (1997). Algemene Nederlandse Spraakkunst. Martinus Nijhoff/Wolters Plantyn, Groningen/Deurne, second edition.

Pilon, S. (2005). Outomatiese Afrikaanse woordsoortetikettering. Master’s thesis, North-West University, Potchefstroom.

Ponelis, F. A. (1979). Afrikaanse Sintaksis. J.L. van Schaik, Pretoria.

Puttkammer, M. J. (2006). Outomatiese Afrikaanse tekseenheididentifisering. Master’s thesis, North-West University, Potchefstroom.

Schlünz, G. I. (2010). The effects of part-of-speech tagging on text-to-speech synthesis for resource-scarce languages. Master’s thesis, North-West University, Potchefstroom.

Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns (2011). Taalkommissiekorpus 1.1. CTexT, North-West University, Potchefstroom.

Vandeghinste, V. (2008). A Hybrid Modular Machine Translation System. PhD thesis, University of Leuven.

Verdoolaege, A. and Van Keymeulen, J. (2010). Grammatica van het Afrikaans. Academia Press, Ghent.

Zwart, J.-W. (2007). Some notes on the origin and distribution of the IPP-effect. Groninger Arbeiten zur Germanistischen Linguistik, 45:77–99.

