Hunlex manual

Date post: 02-Jun-2018
Upload: gabor-recski
  • 8/10/2019 Hunlex manual

    1/58

    HunLex
    Morphological resource specification framework and precompilation tool

    Reference Manual

    Edition draft for release candidate for pre-beta version 0.1

    Viktor Tron
    IGK, Language Technology and Cognitive Systems, Universities of Edinburgh & Saarbrücken.
    MOKK Lab, Budapest Institute of Technology, Budapest.
    [email protected]

    This file documents the HunLex morphological resource specification framework and precompilation tool (HunLex). It corresponds to release 0.1 of the Hunlex distribution.

    More information about Hunlex can be found at the MOKK Lab homepage, http://lab.mokk.bme.hu.


    Table of Contents

    1 Introduction
      1.1 Hunlex: A Short Description
      1.2 Motivation
      1.3 Configurable Compilations
    2 License
    3 Authors, Contact, Bugs
      3.1 License? What license?
      3.2 Submitting a Bug Report
      3.3 Requesting a New Feature
      3.4 Praises
      3.5 Contribution
      3.6 Reference
      3.7 Authors
      3.8 Contact
    4 Installation
      4.1 Download
      4.2 Supported Platforms
      4.3 Prerequisites
      4.4 Install
      4.5 Uninstall and Reinstall
      4.6 Installed Files
    5 Bootstrapping
    6 Toplevel Control
      6.1 Verbosity and Debugging
      6.2 Storing your Settings
      6.3 Targets
        6.3.1 Resource Compilation Targets
        6.3.2 Special Targets
        6.3.3 Test Targets
      6.4 Options
        6.4.1 Executable Path Options
        6.4.2 Verbosity and Debug Options
        6.4.3 Input File Options
        6.4.4 Output File Options
        6.4.5 Resource Compilation Options
    7 Description Language
      7.1 Morphs
        7.1.1 Morph Preamble and Variants
        7.1.2 Blocks
      7.2 Macros
      7.3 Metadata
    8 Files
      8.1 Input Resources
        8.1.1 Primary Resources
          8.1.1.1 Lexicon
          8.1.1.2 Grammar
        8.1.2 Configuration Files
        8.1.3 Morpheme Configuration File
        8.1.4 Feature Configuration File
        8.1.5 Usage Configuration
      8.2 Output Resources
    9 Command-line Control
    10 Levels
      10.1 Levels and Affix Rules
      10.2 Levels and Stems
      10.3 Levels and Ordering
      10.4 Manipulating Levels with Options
        10.4.1 Levels and Generation
        10.4.2 Levels and No Clusters
        10.4.3 Levels and Steps of Affix Stripping
      10.5 Levels and Optimizing Performance
    11 Tags
      11.1 Merging Tags
      11.2 Feature Structures
    12 Flags
      12.1 Two Forms of Flags
      12.2 Flaggable Characters
      12.3 Limit on the Number of Flags
      12.4 Special Flags
    13 Troubleshooting
      13.1 Installation Problems
      13.2 Problems running hunlex
      13.3 Resource Compilation Problems
      13.4 Grammar Problems
    14 Related Software and Resources
      14.1 Software that can use the output of Hunlex as input
        14.1.1 Huntools
        14.1.2 Myspell
        14.1.3 Jmorph
        14.1.4 Ispell
      14.2 Available resources
        14.2.1 The Hungarian Morphdb Project
        14.2.2 The English Morphdb Project
      14.3 Hunlex's relatives
        14.3.1 XFST, TWOLC, LEXC
    Variables and Options Index
    Description Language Index
    Files Index
    Concept Index
    Frequently Asked Questions


    1 Introduction

    This document presents the HunLex morphological resource specification framework and precompilation tool, which is being developed as part of the HunTools Natural Language Processing Toolkit of the Budapest Institute of Technology Media Education and Research Center (http://lab.mokk.bme.hu).

    1.1 Hunlex: A Short Description

    HunLex offers a description language, i.e., a formalism for specifying a base lexicon and morphological rules which describe a language's morphology. This description, stored in textual format, serves as your primary resource: it represents your knowledge about the morphology and lexicon of the language in question.

    Now, providing a resource-specification language is rather useless in itself. Hunlex is able to process these primary resources and create the type of resources that are used by real-time word-level analysis tools. Since you create these from your primary resources, you might call them secondary resources. They provide the language-specific knowledge to a variety of word-level analysis tools.

    At present, most importantly, Hunlex provides the language-specific resources for the HunTools word-level analysis toolkit (see Section 14.1.1 [Huntools], page 46). This package contains the MorphBase library of word-analysis routines such as a spell-checker, a stemmer, and a morphological analyzer/generator, together with their standalone executable wrappers. Therefore, your single Hunlex description of your favourite language will enable you to perform spell-checking, stemming, and morphological analysis for that language, which is more than useful.

    In addition to the HunTools routines, other software that uses ispell-type resources will be able to use Hunlex's output. Among these are myspell, an open-source spell-checker (also used in Open Office, http://www.openoffice.org, see Section 14.1.2 [Myspell], page 46), and jmorph, a superfast Java morphological analyzer (see Section 14.1.3 [Jmorph], page 46).

    This document describes how you can create your primary resources and what you can (make Hunlex) do with them.

    Note: This document is not intended to describe how to use any of these real-time tools or what they are good for. See the above links to learn more about them.

    In particular, this document provides you with:

    1. The compulsory tedium about Chapter 2 [License], page 4, Section 3.7 [Authors], page 6, Section 3.8 [Contact], page 6, Section 3.2 [Submitting a Bug Report], page 5, etc. See Chapter 3 [About], page 5.

    2. The indispensable but trivial Installation notes, see Chapter 4 [Installation], page 7.

    3. A bit about Chapter 5 [Bootstrapping], page 10 your way as a Hunlex user.

    4. The detailed exposition of the syntax and semantics of the resource specification language (see Chapter 7 [Description Language], page 24);

    TODO: not yet

    5. The description of the toplevel control of the hunlex resource compiler (see Chapter 6 [Toplevel Control], page 12) detailing all the options and parameters. The direct command-line interface is also described there.


    6. Some hints on Chapter 13 [Troubleshooting], page 45.

    7. Information about Chapter 14 [Related Software and Resources], page 46.

    8. As well as a lot of advanced issues, like Chapter 12 [Flags], page 42, Chapter 10 [Levels], page 36, Chapter 11 [Tags], page 40, and the list and format of Chapter 8 [Files], page 30.

    1.2 Motivation

    The motivation behind HunLex came from two opposing types of requirements that lexical resources are supposed to fulfill:

    (i) scalability, maintainability, extensibility; and

    (ii) optimized format for the application.

    The constraints in (i) favour one central, redundancy-free, abstract, but transparent specification, while the ones in (ii) require possibly multiple application-specific, potentially redundant, optimized formats.

    In order to reconcile these two opposing requirements, HunLex introduces an offline layer into the word-analysis workflow, which mediates between two levels of resources:

    1. a central database conforming to (i) (also called primary resource or input resource),

    2. various application-specific formats conforming to (ii) (also called secondary or output resource).

    The primary resources are supposed to be reasonably designed to ease human maintenance, and the secondary ones are supposed to optimize very different things, ranging from file size, performance with the tool that uses them, coverage, and robustness to verbosity and normative strictness, depending on who uses them for what purpose.

    HunLex is used to compile the primary resources into a particular application-specific format (see Section 8.2 [Output Resources], page 33). This resource compilation phase is an offline process which is highly configurable so that users can fine-tune the output resources according to their needs.

    By introducing this layer of offline resource compilation, maintenance, extendability, and portability of lexical resources become possible without compromising your performance on specific word-analysis tasks.

    Providing the environment for a sensible primary resource specification framework and managing the offline precompilation process are the raison d'être behind Hunlex.

    1.3 Configurable Compilations

    Configuration allows you to adjust the compilation of resources along various dimensions:

    1. choice of output format that suits the algorithm (spell-checking, stemming, morphological analysis, generation, synthesis),

    2. selection of morphemes to be included in the resource

    3. grouping of morphemes to be stripped in one step as an affix cluster (with one rule application)

    4. selection of morphophonological features that are to be observed or ignored

    5. depth of recursive rule application


    6. selection of registers, degree of normativity, etc., based on usage qualifiers in the database

    7. selection of output morphological annotation and configurable tag information


    2 License

    Hunlex is free software.

    It is licensed under LGPL, which roughly means the following.

    There are no restrictions on downloading it other than your bandwidth and our slothful ways of making things available.

    There are no restrictions on use either, other than its deficiencies, clumsy features and outrageous bugs. However, this can be amended, because there are no restrictions on modifying it either. See also Section 3.5 [Contribution], page 5.

    Freedom of use implies that any resources that you created or compiled with the mediation of Hunlex are yours, and you hold the right to distribute them in any way. Consider telling us about this great news, see Section 3.8 [Contact], page 6.

    What is more, there are no restrictions on redistributing this software or any modifiedversion of it.

    For some legalese telling you the same, read the License at http://creativecommons.org/licenses/LGPL/2.1/.

    Todo: Shall we not include the License?

    3 Authors, Contact, Bugs

    3.1 License? What license?

    See Chapter 2 [License], page 4.

    3.2 Submitting a Bug Report

    If you find a bug or an undesirable feature or anything that is worth a couple of lines ranting at the authors, please go ahead and send a bug report on the MOKK Lab bugzilla page at http://lab.mokk.bme.hu or send a mail to me (see Section 3.8 [Contact], page 6).

    3.3 Requesting a New Feature

    So you are using hunlex and find yourself realizing that you desperately need a certain feature which happens not to be implemented. Go ahead and request it from the authors (see Section 3.8 [Contact], page 6) or sit silently and hope!

    3.4 Praises

    So you found hunlex cool and/or useful and would like the authors to hear about that. How nice is that! See Section 3.8 [Contact], page 6.

    3.5 Contribution

    Hunlex is an open-source development, so developers are welcome to contribute to make it better in any imaginable way. Contact us (see Section 3.8 [Contact], page 6) to work out the details of how and what you would want to contribute to Hunlex.

    3.6 Reference

    For the context of the whole huntools kit, use

    @InProceedings{szoszablya_saltmil:04,
      author = {L\'aszl\'o N\'emeth and Viktor Tr\'on
                and P\'eter Hal\'acsy and Andr\'as Kornai
                and Andr\'as Rung and Istv\'an Szakad\'at},
      title = {Leveraging the open-source ispell codebase
               for minority language analysis},
      booktitle = {Proceedings of SALTMIL 2004},
      year = 2004,
      organization = {European Language Resources Association},
      url = {http://lab.mokk.bme.hu/}
    }

    A very brief intro to hunlex with a one-page English resume.

    @InProceedings{hunlex_mszny:04,
      author = {Tr\'on, Viktor},
      title = {HunLex - a description framework and
               resource compilation tool for morphological dictionaries},
      booktitle = {II. Magyar Sz\'am\'it\'og\'epes
                   Nyelv\'eszeti Konferencia},
      institution = {Szegedi Tudom\'anyegyetem},
      address = {Szeged, Hungary},
      year = 2004
    }

    These and other papers can be downloaded from the MOKK Lab publications page at http://lab.mokk.bme.hu

    3.7 Authors

    The author of hunlex and this document is Viktor Tron. He can be reached by mail at [email protected]

    Hopefully more can be found on MOKK Lab's pages at http://lab.mokk.bme.hu.

    3.8 Contact

    We can get in contact if you

    1. Mail to Viktor Tron [email protected]

    2. Join the forums on http://lab.mokk.bme.hu

    3. Submit a bug report (see Section 3.2 [Submitting a Bug Report], page 5) or feature request (see Section 3.3 [Requesting a New Feature], page 5).


    4 Installation

    So you want to install the hunlex toolkit (see Chapter 1 [Introduction], page 1) from the hunlex source distribution. This chapter describes what you can install with this distribution and how.

    4.1 Download

    The latest version of the hunlex source distribution is always available from the MOKK Lab website at http://lab.mokk.bme.hu or, if all else fails, by mailing to [email protected].

    4.2 Supported Platforms

    The hunlex executable in principle runs on any platform for which there is an ocaml compiler (see Section 4.3 [Prerequisites], page 7). This includes all Linuxes, unices, MS Windows, etc.

    Warning: This package has not been tested on platforms other than Linux.

    4.3 Prerequisites

    [Prerequisite] ocaml

    Hunlex is written in the ocaml programming language (http://www.ocaml.org/). OCaml compilers are extremely easy to install and are available for various platforms and downloadable in various package formats for free from http://caml.inria.fr/ocaml/distrib.html.

    You will need ocaml version >= 3.08 to compile hunlex.

    [Prerequisite] ocaml-make

    OCamlMakefile (i.e., ocaml-make) is needed for the installation of hunlex and is available from Markus Mottl's homepage at http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html#OCamlMakefile

    (I used version 6.19, writing on 8.1.2004).

    For OCamlMakefile you will need ocaml and GNU make (for ocaml-make version 6.19 you will need GNU make version >= 3.80).

    NB: Most probably earlier versions of ocaml-make and GNU make also work, but they have not been tested yet.

    You don't need anything else to use hunlex (but a little patience).
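    The version requirements above can be checked before you start building. What follows is only a minimal sketch and not part of the hunlex distribution: the helper name version_ge is made up for illustration, and it assumes a sort command that accepts the -s (stable) option, as the GNU, BSD and busybox versions do.

```shell
# version_ge A B: succeed if dotted version A is at least B.
# Hypothetical helper; compares the first three dot-separated fields numerically.
version_ge() {
    # B is listed first, so the stable sort keeps it on top when the versions tie.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -s -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$2" ]
}

# ocamlc -version prints the bare version number of the installed compiler.
if command -v ocamlc >/dev/null 2>&1; then
    ocaml_version=$(ocamlc -version)
    version_ge "$ocaml_version" 3.08 || echo "ocaml $ocaml_version is older than 3.08"
else
    echo "ocamlc not found"
fi

# The first line of GNU make's --version output reads "GNU Make X.Y[.Z]".
if command -v make >/dev/null 2>&1; then
    make_version=$(make --version 2>/dev/null | sed -n '1s/^GNU Make \([0-9.]*\).*/\1/p')
    if [ -z "$make_version" ]; then
        echo "make does not look like GNU make"
    elif ! version_ge "$make_version" 3.80; then
        echo "GNU make $make_version is older than 3.80"
    fi
else
    echo "make not found"
fi
```

    The script prints nothing when both tools are recent enough. It does not check OCamlMakefile itself, since how that file reports its version is not covered here.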

    4.4 Install

    Hunlex is installed in the good old way, i.e., by typing

    $ make && sudo make install

    in the toplevel directory of the unpacked distribution. Read no further if you know what I am talking about or if you trust some God.


    The hunlex distribution is available in a source tarball called hunlex.tgz. First you have to unpack it by typing

    $ tar xzvf hunlex.tgz

    Then, you enter the toplevel directory of the unpacked distribution with

    $ cd hunlex

    To compile it, simply type

    $ make

    in the toplevel directory of the distribution.

    To install it (on what gets installed, see Section 4.6 [Installed Files], page 9), type

    $ make install

    Well, by default this would want to install things under /usr/local, so you have to have admin permissions. If you are not root but you are in the sudoers file with the appropriate rights, you type:

    $ sudo make install

    You can change the location of the installation by changing the install prefix path with

    $ sudo make PREFIX=/my/favourite/path install

    Changing the location of installation for individual install targets is not recommended, but easy-peasy if you have a clue about make and Makefiles. To do this you have to change the relevant Makefiles in the subdirectories of the distribution. See Section 4.6 [Installed Files], page 9.

    If it works, great! Go ahead to Chapter 5 [Bootstrapping], page 10.

    If you have problems, double-check that you have the prerequisites (see Section 4.3 [Prerequisites], page 7). If you think you followed the instructions but still have problems, submit a bug report (see Section 3.2 [Submitting a Bug Report], page 5).

    If you are upgrading an earlier version of hunlex, you may want to uninstall the earlier one first (see Section 4.5 [Uninstall and Reinstall], page 8).

    4.5 Uninstall and Reinstall

    The install prefix is remembered in the source distribution in the file install_prefix. So after you cd into the toplevel directory of the distribution, you can uninstall hunlex by typing

    $ make uninstall

    You can reinstall it with

    $ make reinstall

    at any time if you make modifications to the code or compile options.

    Warning: Note that if you fiddle with changing the location of individual install targets, uninstall and reinstall will not work correctly.


    4.6 Installed Files

    The following files and directories are installed; paths are relative to the install prefix (see Section 4.4 [Install], page 7):

    bin/hunlex

    the executable, which can be run on the command line (see Chapter 9 [Command-line Control], page 35)

    lib/HunlexMakefile

    is the Makefile that defines the toplevel control of hunlex (see Chapter 6 [Toplevel Control], page 12). This file is to be include-ed into your local Makefile to give you a Makefile-style wrapper for calling hunlex (see Chapter 5 [Bootstrapping], page 10 and Chapter 6 [Toplevel Control], page 12).

    Note that HunlexMakefile will assume that the hunlex executable is found in your path. Make sure that install-prefix/bin is in the PATH (usually /usr/local/bin is in the PATH).

    share/doc/hunlex/

    is a directory containing hunlex documentation. Various documents in various formats are found under this directory, including a replica of this document.

    TODO: this is not yet the case

    man/hunlex.1

    is the hunlex man page, which describes the command-line use of hunlex (also see Chapter 9 [Command-line Control], page 35). Command-line use of hunlex is not the recommended way of using it for the general user. Instead, use hunlex through the toplevel control described in its own chapter (see Chapter 6 [Toplevel Control], page 12).

    Todo: there is no man page yet
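    The note above about the install prefix's bin directory having to be in the PATH can be checked from the shell. This is a small sketch under stated assumptions: on_path is a hypothetical helper name, and PREFIX stands for whatever install prefix you passed to make (the default, /usr/local, is assumed when it is unset).

```shell
# on_path DIR: succeed if DIR is one of the colon-separated entries of PATH.
# Hypothetical helper, not provided by hunlex.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# PREFIX is an assumption: the install prefix used at installation time.
PREFIX="${PREFIX:-/usr/local}"
if on_path "$PREFIX/bin"; then
    echo "$PREFIX/bin is in the PATH"
else
    echo "$PREFIX/bin is not in the PATH; consider: export PATH=\"$PREFIX/bin:\$PATH\""
fi
```

    Wrapping PATH in extra colons lets a single pattern match the directory whether it is the first, last, or a middle entry, without matching a longer path that merely starts with the same string.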


    5 Bootstrapping

    So you installed hunlex and it's running smoothly.

    This section leads you through the first steps and gives you hints on how to set out working with hunlex.

    Create your sandbox directory.

    Change to it.

    Create your own local Makefile. This will be your connection to the hunlex toplevel control. For your Makefile to understand hunlex's predefined toplevel targets (see Section 6.3 [Targets], page 13), you have to include (not insert) the hunlex systemwide Makefile. So you create a Makefile with the following content:

    -include /path/to/HunlexMakefile

    where /path/to/HunlexMakefile is the path to HunlexMakefile, which is supposed to be installed on your system (see Section 4.6 [Installed Files], page 9), by default under /usr/local/lib/HunlexMakefile.
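    The three steps above (create a sandbox, change to it, write the one-line Makefile) can be sketched in the shell as follows. The variable names SANDBOX and HUNLEX_MAKEFILE are made up for this illustration; the default path is the one given above and has to be adjusted if you installed under a different prefix.

```shell
# Hypothetical sandbox location; any empty directory will do.
SANDBOX="${SANDBOX:-$(mktemp -d)}"
# Default install location of HunlexMakefile (assumes the /usr/local prefix).
HUNLEX_MAKEFILE="${HUNLEX_MAKEFILE:-/usr/local/lib/HunlexMakefile}"

mkdir -p "$SANDBOX"
cd "$SANDBOX" || exit 1

# Write the one-line local Makefile that pulls in the toplevel control.
printf '%s %s\n' '-include' "$HUNLEX_MAKEFILE" > Makefile

# With the leading '-', GNU make does not complain about a missing include file,
# so a wrong path is worth catching here.
[ -f "$HUNLEX_MAKEFILE" ] || echo "warning: $HUNLEX_MAKEFILE not found" >&2

cat Makefile
```

    If the warning appears, the local Makefile is still written, but make will have no hunlex targets to work with until the path is corrected.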

    Now, you are ready to test things for yourself. In order to see if all is well, type

    $ make

    at your prompt in the same sandbox directory.

    In fact, you will always type the make command to control hunlex. If you don't give arguments to make, a so-called default action (target, see Section 6.3 [Targets], page 13) is assumed. The default target is resources, which creates the output resources according to the default settings (see Section 6.4 [Options], page 15).

    Toplevel control assumes by default that all its necessary resources are found in the current directory (see Section 6.4.3 [Input File Options], page 17). If this is not the case because the files do not exist, the compulsory ones are created and the compilation runs, creating the output resources.

    Of course, the missing files are created without contents, so your output resources will be empty as well. However, this vacuous run will test whether hunlex (and toplevel control) is working properly.

    Now if you list your directory, you should see:

    $ ls

    affix.aff grammar Makefile phono.conf

    dictionary.dic lexicon morph.conf usage.conf

    If this is not the case, see Chapter 13 [Troubleshooting], page 45.

    The meaning of these files in your directory is explained in detail in another chapter (see Chapter 8 [Files], page 30).

    If you type make (or the equivalent make resources) again, your resources will not be compiled again, since the input resources did not change. If you still want to compile your resources again, you type

    $ make new resources

    which forces toplevel to recompile although no input files changed (see Section 6.3.2 [Special Targets], page 14).


    Now.

    If you want to develop (toy around with) your own data and create resources, the next step is to fill in the input files. Read on to learn more about files (see Chapter 8 [Files], page 30) and then about the hunlex morphological resource specification language (see Chapter 7 [Description Language], page 24). Since you want to test your creation, you ultimately have to learn about toplevel control (see Chapter 6 [Toplevel Control], page 12) and gradually about the advanced issues in the chapters that follow these.

    If you already have your hunlex resources describing your favourite language ready and you want to compile specific output resources from them with hunlex, you had better read about toplevel control with special attention to the options (see Chapter 6 [Toplevel Control], page 12). If you want to fiddle around with more advanced optimization, such as levels and tags, you may end up having to read everything, sorry.


    6 Toplevel Control

    You typically want to use hunlex through its toplevel control interface. Toplevel control means that you invoke hunlex indirectly through a Makefile to compile your resources.

    We envisage typical users of hunlex developing their lexical resources in an input directory and occasionally dumping output resources for their analyser into specific target directories for various applications.

    If you don't like Makefiles or your system does not have make (how did you compile hunlex, then?), you will invoke hunlex from a shell and use it via the command-line interface. This is non-typical use and not recommended. The command-line interface, which is almost equivalent in functionality to the Makefile interface, is described only for completeness and for people developing alternative wrappers (see Chapter 9 [Command-line Control], page 35).

    In fact, you don't actually need to know much about make and Makefiles to use hunlex. Just follow the steps described in Chapter 5 [Bootstrapping], page 10. We assume that you have a project directory with a Makefile sitting in it in order to try out what is described here.

    This chapter is more like a reference manual that details what you can do with your resources and how you can do it through the Makefile interface. What the resources are and how you can develop your own is described in other chapters (see Chapter 8 [Files], page 30 and Chapter 7 [Description Language], page 24).

    6.1 Verbosity and Debugging

    First of all, you need to know how to make your compilation process more verbose.

    In order to see what the toplevel Makefile wrapper is doing, you have to unset the QUIET option. For instance, typing

    $ make QUIET= new resources

    will tell you what the Makefile is doing, i.e., what programs it invokes, etc. Unless you are debugging the toplevel control interface of hunlex, you don't want the toplevel to be verbose about what it is doing. So just don't do this.

    What you want instead is to make the resource compilation process more verbose, probably because you want to debug your grammar or want hunlex to give you hints about what went wrong with your resource compilation.

    Verbosity of the hunlex resource compilation can be set with the DEBUG_LEVEL option. Typing

    $ make DEBUG_LEVEL=1

    in your sandbox (with empty primary resources, see Chapter 5 [Bootstrapping], page 10) will give you something like this:

    Reading morpheme declarations and levels...0 morphemes declared.

    Reading phono features...0 phono features declared.

    Reading usage qualifiers...0 usage qualifiers declared.

    Parsing the grammar...ok

    Parsing the lexicon and performing closure on levels... 0 entries read.

    Dynamically allocating flags; dumping affix file...ok

    Dumping precompiled stems to dictionary file...ok

    0.00user 0.00system 0:00.02elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k

    0inputs+0outputs (0major+329minor)pagefaults 0swaps

    The first couple of lines give you information about the stages of compilation and are described elsewhere.

    The enigmatic last two lines give you information about the time it took hunlex to compile your resources. If you are not interested in this information, you can unset the TIME option, which is described with the other verbosity and debug options later in this chapter. Typing, say,

    $ make TIME= new resources

    will not measure and display the duration of compiling.

    6.2 Storing your Settings

    Your favourite settings can be remembered by adding them to your local Makefile in a rather obvious way. Let us assume you want your DEBUG_LEVEL to be set to 1 by default and also that you couldn't care less about the time of compilation. In this case you want to have the following in your Makefile:

    DEBUG_LEVEL=1

    TIME=

    You can also define your default target (see Section 6.3 [Targets], page 13), i.e., the task that make will carry out if you invoke it without an explicit target. For instance, if you always want to recompile your resources each time you invoke make, irrespective of whether your primary resources and/or compile configurations changed, you can add the following line at the top of the file:

    default: new resources

    Now, your Makefile looks something like this:

    # comments are introduced by a #

    # my favourite target

    default: new resources

    # my favourite settings

    DEBUG_LEVEL=1

    TIME=

    -include /path/to/HunlexMakefile

    6.3 Targets

    The functionality of hunlex is accessed through targets. Targets are arguments of the make command, which reads your local Makefile and ultimately consults the system-wide hunlex toplevel Makefile called HunlexMakefile (see Section 4.6 [Installed Files], page 9).

    Usually, you will control hunlex through make by typing:

    make options target

    where options is a sequence of variable assignments which set the options described below (see Section 6.4 [Options], page 15) and where target is a sequence of targets. For more on variables and targets you may consult the manual of make.

    The available toplevel targets are detailed below:

    6.3.1 Resource Compilation Targets

    [Resource Compilation Target] resources
    compiles the output resources given the input resources and configuration files. The necessary file locations and options are defined by the relevant variables described below (see Section 6.4.3 [Input File Options], page 17). This target creates the dictionary and the affix files (by default dictionary.dic and affix.aff, see Section 8.2 [Output Resources], page 33).

    [Resource Compilation Target] generate
    by setting MIN_LEVEL to a big number, this target generates resources that contain all words of the language precompiled into the dictionary. The stems of the dictionary without their output annotation (see Annotation) are found in the file wordlist.

    6.3.2 Special Targets

    [Special Target] new
    pretends that the base resources have changed. You need this target if you want to recompile the resources although no primary resource has changed. This might happen because you are using a different configuration option. (If the base resources are unchanged, no compilation would take place, so you have to force it with new; see make.)

    make MIN_LEVEL=3 new resources

    [Special Target] clean
    removes all intermediate temporary files, so that only the lexicon, the grammar, the configuration files, and the output resources (affix and dictionary) remain.

    Todo: This is not implemented yet.

    [Special Target] distclean
    removes all non-primary resources, so that only the lexicon, the grammar, and the configuration files remain.

    6.3.3 Test Targets

    Additional targets for testing are available. These all presuppose that the huntools package (see Section 14.1.1 [Huntools], page 46) is installed and that the executable hunmorph is found in the path. An alternative hunmorph can be used by setting the HUNMORPH option (see Section 6.4.1 [Executable Path Options], page 15).

    [Test Target] test
    tests the resources by making hunmorph read them (dic and aff files) and analyze the contents of the file that is the value of TEST (see Section 6.4.3 [Input File Options], page 17). TEST is by default set to the standard input, so after saying

    $ make test

    you have to type in words in the terminal window (exiting with C-d).

    If you want to test by analyzing a file, you have to set the value of TEST.

    $ make TEST=my/favourite/testfile test

    Test outputs go to stdout, so just redirect them to a file:

    $ make TEST=my/favourite/testfile test > test.out 2> test.log

    [Test Target] testwordlist
    runs hunmorph on the wordlist file (see Section 6.3.1 [Resource Compilation Targets], page 14, generate) and writes the result to the standard output (so you may want to redirect the result to a file).

    [Test Target] realtest
    puts hunlex and the analyzer to the test by creating the resources according to the settings of your Makefile and then running hunmorph on the whole generated wordlist.

    Warning: Note that this target first generates all words and then creates the resources again. Running this on huge databases is probably not a good idea.

    The way you want to test a bigger database instead is by creating a set of words that your ideal analyzer has to recognize or correctly analyze and testing on that (with test). Realtest is just a quick and dirty shorthand for toy databases to check if everybody is with us.

    6.4 Options

    Options of the toplevel are in effect Makefile variables that can be set at the user's will.

    (All the command-line options of hunlex can be accessed through the toplevel options, which are passed to hunlex to regulate the compilation process. The documentation of command-line options is found in Chapter 9 [Command-line Control], page 35, but only for the record. All hunlex toplevel options are written in all capital letters (LEXICON), while the command-line options begin with a dash and are written in all small letters, but otherwise they are the same (-lexicon).)
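
The naming convention just described can be written down as a one-line mapping. The following is only a sketch in Python, not part of hunlex: the helper name to_cli_flag is made up for illustration, and the mapping is only meaningful for options that are actually passed through to the executable (toplevel-only settings such as QUIET or TIME have no command-line twin).

```python
def to_cli_flag(option: str) -> str:
    """Map a toplevel option name (e.g. LEXICON) to its command-line twin.

    Hypothetical helper: it only encodes the convention stated in the
    manual (capitals at the toplevel, dash plus small letters on the
    command line) and knows nothing about which options really exist.
    """
    if not option or option != option.upper():
        raise ValueError(f"toplevel options are written in capitals: {option!r}")
    return "-" + option.lower()


print(to_cli_flag("LEXICON"))  # -lexicon
print(to_cli_flag("FS_INFO"))  # -fs_info
```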

    All options can be set or reset in your local Makefile (and remembered, see Section 6.2 [Storing your Settings], page 13). These will override the system default. Both the system default and your local default can be overridden by direct command-line variable assignments passed to make, such as the ones shown in this document:

    $ make QUIET= DEBUG_LEVEL=3 OUTPUTDIR=/my/favourite/outputdir
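
The order of precedence (command line over local Makefile over system default) can be modelled in a few lines. This is a minimal Python sketch of the lookup order only; it does not call make, the function name effective_options is invented for the example, and just three of the documented defaults are listed.

```python
from collections import ChainMap

# A few system defaults as documented in this chapter (not the full list).
SYSTEM_DEFAULTS = {"DEBUG_LEVEL": "0", "OUTPUTDIR": ".", "TIME": "time"}


def effective_options(local_makefile, command_line):
    """Return the settings that win for each option.

    Lookup order: command-line assignments first, then the local
    Makefile, then the system defaults. Illustrative model only.
    """
    return dict(ChainMap(command_line, local_makefile, SYSTEM_DEFAULTS))


local = {"DEBUG_LEVEL": "1", "TIME": ""}  # as in "Storing your Settings"
cli = {"DEBUG_LEVEL": "3", "OUTPUTDIR": "/my/favourite/outputdir"}
options = effective_options(local, cli)
print(options["DEBUG_LEVEL"])  # 3
```

Here TIME stays unset because the local Makefile empties it and the command line does not mention it.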

    Listed and explained below are all the hunlex options (all public Makefile variables) that the toplevel control provides for the user to manipulate.

    When you see something like variable (value), it means the default value of the variable variable is value.

    6.4.1 Executable Path Options

    [Option] HUNLEX (hunlex)
    The hunlex executable is by default assumed to be found in the path under the name hunlex. By default, installation installs hunlex into /usr/local/bin (see Chapter 4 [Installation], page 7). If you want to use (i) an alternative version of hunlex that is not the one found in the path, or (ii) an uninstalled version of hunlex, or (iii) an installed version whose location you don't want to include in your path, then you should set which hunlex to use with this variable.

    HUNLEX=/my/favourite/version/of/hunlex

    [Option] HUNMORPH (hunmorph)
    You need the executable hunmorph from the Huntools package (see Section 14.1.1 [Huntools], page 46) only for testing. If you don't want to test with direct analysis (you just want to compile the resources), you don't need to bother.

    When used, however, the hunmorph executable is assumed to be found in the path under the name hunmorph. If this is not the case, update your path or provide the path to hunmorph with the line

    HUNMORPH=/my/favourite/version/of/hunmorph

    6.4.2 Verbosity and Debug Options

    [Option] QUIET (@ = quiet)
    Quiet mode is set by default, which means that the workings of the Makefile toplevel won't bore you to death. The compilation debug messages that hunlex blurps when running can still be displayed independently (see the DEBUG_LEVEL option below). The QUIET option only refers to what the toplevel wrapper invokes (this way of handling Makefile verbosity is an idea nicked from OCamlMakefile by Markus Mottl).

    [Option] DEBUG_LEVEL (0)
    sets the verbosity of hunlex itself. By default the debug level is set to 0. Debug messages are sensitive to the debug level in the range from 0 to 6-ish: the higher the number, the more verbose hunlex is about its doings.

    0 is non-verbose mode, which means that hunlex only displays (fatal) error messages. If you set DEBUG_LEVEL to, say, -1, even error messages will be suppressed (only an uncaught exception will be reported in case of fatal errors).

    It is typically a good idea to set DEBUG_LEVEL to 2 or 3 and request more if you really want to see what is happening.

    Caveat: In fact you won't understand the messages anyway, so the debug blurps just give you an idea of the context where something went wrong with your grammar/lexicon, etc.

    Todo: This shouldn't be so, and debug messages pertaining to grammar development should be self-evident or well designed and documented. Especially parsing errors and/or compile warnings about the grammar and lexicon should be clear.

    Usually you want to create a log by redirecting the debug output of make (standard error) with your debug messages to a file. This can be done, for instance, by

    $ make DEBUG_LEVEL=5 resources 2> log

    [Option] TIME (time)
    By default, every run of hunlex is timed with the Unix shell's time command, and the time it took to compile the resources is displayed. Surely, this is only interesting with big lexicons. If you (i) don't have a time command, (ii) have a different time command, or (iii) don't want time measured and displayed, just reset the TIME variable. The option can be unset by the line

    TIME=

    in your local Makefile.

    6.4.3 Input File Options

    The type and use of hunlex input resource files are described in detail elsewhere (see Section 8.1 [Input Resources], page 30). The options by which their locations can be (re)set are listed below:

    [Option] LEXICON (grammardir/lexicon)
    lexicon file

    [Option] GRAMMAR (grammardir/grammar)
    grammar file

    They can both be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable GRAMMARDIR:

    [Option] GRAMMARDIR (inputdir)
    the directory for the hunlex primary input resource files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.

    There are three further input resources which need to be present for a hunlex compilation. These are the compilation configuration files.

    [Option] USAGE (confdir/usage.conf)
    the usage configuration file (see Section 8.1.2 [Configuration Files], page 31)

    [Option] MORPH (confdir/morph.conf)
    the morph(eme) configuration file (see Section 8.1.2 [Configuration Files], page 31)

    [Option] PHONO (confdir/phono.conf)
    the configuration file for morphophonologic and morphoorthographic features (see Section 8.1.2 [Configuration Files], page 31)

    There are two optional configuration files, the signature and the flags file. By default, the options corresponding to these files are set to the empty string, which tells hunlex not to use feature structures (see Section 11.2 [Feature Structures], page 41) or custom output flags (see Chapter 12 [Flags], page 42).

    [Option] SIGNATURE ()
    The location of the signature file used to process and validate feature structures (see Section 11.2 [Feature Structures], page 41, and Section 8.1.2 [Configuration Files], page 31). If it is set to the empty string (the default), hunlex does not use feature structures.

    If you use this file, it makes sense to call it something like fs.conf or signature.conf and store it in confdir with your other configuration files, so the assignment

    SIGNATURE=$(CONFDIR)/fs.conf

    is an appropriate setting.

    [Option] FLAGS ()
    The location of the custom output flags file (see Section 8.1.2 [Configuration Files], page 31) used to decide which flags are used in the output resources (see Chapter 12 [Flags], page 42). If it is set to the empty string (default), hunlex will use a built-in flagset to determine flaggable characters (see Chapter 12 [Flags], page 42).

    If you use this file, it makes sense to call it something like flags.conf and store it in confdir with your other configuration files, so the assignment

    FLAGS=$(CONFDIR)/flags.conf

    is an appropriate setting.

    All configuration files can be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable CONFDIR:

    [Option] CONFDIR (inputdir)
    the directory for the hunlex compilation configuration files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.

    As explained, all input files can be set to alternative paths individually, or primary resources together and configuration files together. If all input resources (primary and configuration) are in the same directory, this directory path can also be set via the variable INPUTDIR:

    [Option] INPUTDIR (. = current directory)
    the directory for all hunlex input resource files, which is, by default, set to the current directory.
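
The way the input file defaults cascade (INPUTDIR feeds GRAMMARDIR and CONFDIR, which in turn feed the individual files) may be easier to see in a small model. The Python sketch below only restates the defaults listed in this section; resolve_inputs is a hypothetical name, and the optional SIGNATURE and FLAGS files are left out because they default to the empty string.

```python
import posixpath


def resolve_inputs(overrides=None):
    """Fill in the documented defaults for the input file options.

    Any option given in `overrides` is kept as is; everything else is
    derived from INPUTDIR, GRAMMARDIR and CONFDIR. Illustrative model,
    not the logic of the real toplevel Makefile.
    """
    opts = dict(overrides or {})
    opts.setdefault("INPUTDIR", ".")
    opts.setdefault("GRAMMARDIR", opts["INPUTDIR"])
    opts.setdefault("CONFDIR", opts["INPUTDIR"])
    for name, filename in (("LEXICON", "lexicon"), ("GRAMMAR", "grammar")):
        opts.setdefault(name, posixpath.join(opts["GRAMMARDIR"], filename))
    for name, filename in (
        ("USAGE", "usage.conf"),
        ("MORPH", "morph.conf"),
        ("PHONO", "phono.conf"),
    ):
        opts.setdefault(name, posixpath.join(opts["CONFDIR"], filename))
    return opts


paths = resolve_inputs({"INPUTDIR": "src", "CONFDIR": "conf"})
print(paths["LEXICON"])  # src/lexicon
print(paths["MORPH"])    # conf/morph.conf
```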

    A special test file is only used with the Test targets:

    [Option] TEST (/dev/stdin)
    The value of TEST is a file (well, a file descriptor, to be precise), the contents of which are tested whenever the toplevel test target is called (see Section 6.3.3 [Test Targets], page 14). By default it is set to the standard input, so testing with test will expect you to type in words in your terminal window.

    6.4.4 Output File Options

    Hunlex's output resources are the affix and the dictionary files (see Section 8.2 [Output Resources], page 33). The options by which their locations can be (re)set are listed below:

    [Option] AFF (outputdir/affix.aff)
    affix file

    [Option] DIC (outputdir/dictionary.dic)
    dictionary file

    [Option] WORDLIST (outputdir/wordlist)
    The wordlist generated by the generate target (see Section 6.3.1 [Resource Compilation Targets], page 14).

    where outputdir (the default directory of the files) is the value of the variable OUTPUTDIR:

    [Option] OUTPUTDIR (. = current directory)
    the directory for the hunlex output resource files, which is, by default, set to the current directory.

    As you can see, the default setting is that all input and output files are located in the current directory under their recommended canonical names. Putting the output resources in the same directory as the primary resources might not be a good idea if you want to compile various types of output resources.

    6.4.5 Resource Compilation Options

    [Option] DOUBLE_FLAGS ()
    if set, hunlex uses double flags (two-character flags) in the output resources (see Chapter 12 [Flags], page 42).

    The following two options regulate the level of morphemes. You will find more details about levels in a separate chapter (see Chapter 10 [Levels], page 36).

    [Option] MIN_LEVEL (1)
    Morphemes of level below MIN_LEVEL are treated as lexical, i.e., are precompiled with the appropriate stems into the dictionary file. By default, only morphemes of level 0 or below are precompiled into the dictionary.

    [Option] MAX_LEVEL (10000)
    Morphemes with levels higher than the value of MAX_LEVEL are, on the other hand, treated as being on the same (non-lexical) level. By default, only morphemes of level above 10000 are treated as having the same level.
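
Read together, the two thresholds split morpheme levels into three bands. The following Python sketch spells this out under the wording above; the function name and the three labels are made up for illustration, and the middle band ("own-level") is an inference, since the manual only describes the two extremes explicitly (see Chapter 10 for the real semantics).

```python
def level_treatment(level, min_level=1, max_level=10000):
    """Classify a morpheme level relative to MIN_LEVEL and MAX_LEVEL.

    "lexical"    : below MIN_LEVEL, precompiled into the dictionary
    "same-level" : above MAX_LEVEL, collapsed onto one non-lexical level
    "own-level"  : anything in between (assumed: keeps its own level)
    """
    if level < min_level:
        return "lexical"
    if level > max_level:
        return "same-level"
    return "own-level"


for level in (0, 1, 10001):
    print(level, level_treatment(level))

# With "make MIN_LEVEL=3 new resources", level 2 morphemes become lexical.
print(level_treatment(2, min_level=3))  # lexical
```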

    The options below regulate the format of output resources in detail:

    [Option] TAG_DELIM ()
    determines the delimiter hunlex puts between individual tags of affixes when tags are merged.

    This is interesting if you have a tagging scheme where each morpheme is tagged with a label like MORPH1, but in the output you want the labels clearly delimited, like:

    wordtoanalyze

    >lemma_MORPH1_MORPH2

    The above is possible if you set

    TAG_DELIM=_

    [Option] OUT_DELIM ( )
    [Option] OUT_DELIM_DIC ()
    set the delimiter to put between the fields of the affix and dictionary files, respectively. By default it is set to a single space for the affix file and to a tab in the dictionary file.

    NB: A tab might allow better postprocessing in the affix file and even allow spaces in the tags, which might be useful.

    At the time of writing, the huntools reader only allowed a TAB, not a space, as delimiter in the dictionary file, so change with caution.

    [Option] MODE (Analyzer)
    the major output mode regulates what information gets output in the affix and dictionary files and how affix entries are conflated.

    Warning: This option is not effective at the moment due to the lack of a clear functional specification, and it is also unclear how this option should interact with the option STEMINFO (below).

    Todo: Clarify this. See warning.

    The possible values at the moment are:

    Spellchecker

    Stemmer

    Analyzer

    NoMode

    all without effect (see warning).

    [Option] STEMINFO (LemmaWithTag)
    regulates what info the analyzer should output about a word.

    This option can take the following values:

    Tag only output the tag of a lexical stem (to output the pos tag of the stem)

    Lemma only output the lemma of a lexical stem (for stemmers doing lexical indexing)

    Stem output the stem allomorph of the stem (e.g., for counting stem variant occurrences?)

    LemmaWithTag output lemma with the tag (default, for morphological analysis)

    StemWithTag output stem (allomorph) with the tag (?)

    NoSteminfo no output for the dictionary (for spell-checker resources).
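
As a rough illustration of the six values, the sketch below builds the stem annotation for one dictionary entry. It is a Python model only: stem_annotation is a hypothetical helper, and the assumption that lemma or stem and tag are simply concatenated (as in go[VERB]) is taken from the STEM_GIVEN example later in this chapter rather than from a specification.

```python
def stem_annotation(steminfo, lemma, stem, tag):
    """Return what would be output about a lexical stem for a STEMINFO value.

    Assumption: "WithTag" means plain concatenation, e.g. go[VERB].
    """
    outputs = {
        "Tag": tag,
        "Lemma": lemma,
        "Stem": stem,
        "LemmaWithTag": lemma + tag,
        "StemWithTag": stem + tag,
        "NoSteminfo": "",
    }
    if steminfo not in outputs:
        raise ValueError(f"unknown STEMINFO value: {steminfo!r}")
    return outputs[steminfo]


# The stem variant "went" of the lemma "go", tagged [VERB]:
print(stem_annotation("LemmaWithTag", "go", "went", "[VERB]"))  # go[VERB]
print(stem_annotation("StemWithTag", "go", "went", "[VERB]"))   # went[VERB]
```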

    [Option] FS_INFO ()
    regulates whether feature structure annotations (see Section 11.2 [Feature Structures], page 41) should be output along with the normal (string type) tags (see Chapter 11 [Tags], page 40). This is extremely useful for debugging purposes. If the manually supplied tag chunks are supposed to yield well-formed feature structures in the output annotation of the analyzer, it is a good idea to check whether this is the case. If this option is set to -fs_info (the corresponding command-line option), the feature structures resulting from unification are output along with the tags in the dictionary and the affix file. Typically, this option is used with the generate target (see Section 6.3.1 [Resource Compilation Targets], page 14) and the second and the third columns of the dictionary file are compared (they are supposed to be identical).

    Todo: This process should be added to the set of toplevel test targets.

    The affix file specifies a lot of variables to be read by the morphbase routines. Some of these are metadata but some are crucial for suggestions and accent replacement for automatic error correction, see below.

    Warning: This is a disastrously underdeveloped part of hunlex and an outrageously ad hoc part of morphbase as well.

    Preambles can be added to hunlex output files. These are meant to be official comment headers about copyright information, etc.

    [Option] AFF_PREAMBLE ()
    [Option] DIC_PREAMBLE ()
    are files to be included as preambles in the affix and dictionary output resources, respectively. By default, they are unset, i.e., no preambles will be included in the output resources.

    NB: This feature is only available through toplevel control and will never be an integral part of the hunlex executable.

    [Option] CHAR_CONVERSION_TABLE ()
    [Option] REPLACEMENT_TABLE ()
    These are the character-conversion table and replacement table to be included in morphbase resources if alternatives (e.g., for spellchecking) or robust error correction are required (see Section 14.1.1 [Huntools], page 46). These features are documented in the huntools documentation (hopefully, but certainly not here; see Section 8.2 [Output Resources], page 33, and Section 14.1.1 [Huntools], page 46).

    NB: This feature of including these extra files in the affix file is only available through toplevel control and will never be an integral part of the hunlex executable.

    [Option] AFF_SET (ISO8859-2)
    Identifies the character set for the analyzer reading the affix file. By default, this is set to ISO8859-2, i.e., Eastern European. Maybe this is the hun in hunlex...

    [Option] AFF_SETTINGS (confdir/affix_vars.conf)
    is the file from which settings for some affix variables are read. If it doesn't exist, no affix variables other than the ones directly managed are dumped into the affix file.

    Todo: Need to sort these things out.

    Some affix file variables are managed by hunlex internally but dumped to the affix file by the toplevel routines.

    Todo: This is done at the moment by the toplevel Makefile, but should be integrated into the hunlex executable itself.

    [Option] AFF_FORBIDDENWORD (!)
    [Option] AFF_ONLYROOT (~)
    These two flags will be attached to (i) bound stems and (ii) affix entries which cannot be stripped first (i.e., suffixes which cannot end a word, see Chapter 12 [Flags], page 42).

    [Option] STEM_GIVEN ()
    If this flag is present, it indicates to the stemmer/analyzer whether the stem string is to be output as part of the annotation. For instance (if the STEM_GIVEN flag is x), the following dic file

    go/ [VERB]

    went/x go[VERB]

    will result in the following stemming:

    > go

    go[VERB]

    > went

    go[VERB]

    This makes for more compact dictionaries. What information one wants the stemmer and analyzer to output can be configured through hunlex options (see below).

    Todo: This flag is not implemented yet (since it is not implemented yet in morphbase, either), and probably will never be implemented, since the treatment of special flags shouldn't be user customizable beyond the choice of flaggable characters.

    Warning: Make sure the flags given here are consistent with the double flags option and the custom flags file (see the FLAGS variable above, and see Chapter 12 [Flags], page 42).

    These options are superfluous and should be automatically managed by hunlex, which would write them into the affix file. Very likely to be deprecated soon.

    Todo: This needs to be implemented.

    Warning: There are additional settings that are to be included in the affix file and are a crucial part of the resources (and partly should be set by hunlex itself), such as compound flags. I have no idea what to do with these at the moment. The ones I know of are listed here just for the record.

    Todo: This needs to be sorted out.

    Some of these data are actually global and could even go to the settings preamble (AFF_SETTINGS):

    NAME

    LANG

    HOME

    VERSION

    These ones should be dynamic metadata

    ??

    The ones below should clearly be controlled and output by hunlex itself (also ONLYROOT and FORBIDDENWORD, but they are handled by the toplevel, at least).

    Ones relating to compounding (compounding is handled very differently by myspell, morphbase and jmorph):

    COMPOUNDMIN

    COMPOUNDFLAG

    COMPOUNDWORD?

    COMPOUNDFORBIDFLAG

    COMPOUNDSYLLABLE

    SYLLABLENUM

    COMPOUNDFIRST

    COMPOUNDLAST

    Warning: Compounding is as yet unsupported by hunlex and should be worked on with high priority.

    I have really no idea about the following ones:

    TRY

    ACCENT

    CHECKNUM

    WORDCHARS

    HU_KOTOHANGZO

    7 Description Language

    This chapter is about the framework that allows you to describe the morphology and lexicon of a language. Below we specify the syntax and semantics of this description language. The files written in this language (the lexicon and grammar) are the primary resources of hunlex (see Section 8.1 [Input Resources], page 30) and the basis for all compiled output (how this works is described in another chapter, see Chapter 6 [Toplevel Control], page 12).

    There are three kinds of statement in this language:

    morph definition

    macro definition

    metadata definition

    Only the grammar file can contain macro definitions (see Section 7.2 [Macros], page 28) and metadata definitions (see Section 7.3 [Metadata], page 29), while both the lexicon and the grammar file can contain morph definitions, which describe morphological units (affix morphemes, lexemes and their paradigms). In this respect, the syntax of the lexicon and grammar files is identical and is therefore discussed together rather than separately (see Section 7.1 [Morphs], page 24), although the usefulness (and sometimes even the semantics) of certain expressions might be different in the lexicon and in the grammar.

    7.1 Morphs

    Morphs are the central entities in the description language. They stand for morphological units of any size and abstractness including affix morphemes, lexemes, paradigms, etc. and are not what linguists call morphs (i.e., a particular occurrence of one morpheme). Morphs are meant to describe an affix morpheme or a lexeme, but in fact, it is up to you what level of abstractness you find useful in your grammar, so you can have individual morphs describing each allomorph of a morpheme or each stem variant of a lexeme. But the point is that morphs support the description of variants or allomorphs. Anyway, a morph is basically a collection of rules, variants, etc. that somehow belong together. Ideally, a variant of an affix morpheme is actually an affix allomorph, a concrete affixation rule, while a variant of a lexeme is a stem variant or an exceptional form of the lexeme's paradigm.

    7.1.1 Morph Preamble and Variants

    [statement] (MORPH:) preamble, variant0, variant1, ... ;
    [preamble] morph-name block...
    [variant] block...

    A morph statement is introduced by an optional MORPH: keyword. It is a good idea to drop it and start the statement directly with the preamble (in fact, with the name of the morph), which is compulsory.

    A morph description has a preamble, i.e., a header describing the global properties of the morph, the properties which characterize all of its variants/allomorphs.

    After the preamble, one finds the variants one after the other. The preamble and the variants are delimited by commas.

    Finally, the morph definition like all other statements is closed by a semicolon.

    The preamble starts with the name of the morph. The name of the morph can be any arbitrary id, a mnemonic string that ideally uniquely identifies the morph. Referring to other morphs is important in describing how morphemes can be combined: in order for these references to be reliable, the names in the grammar are supposed to be unique. This is not important in the lexicon, where homophonous lemmas can have identical names (however, this is not recommended, since, in such a case, for instance, morphological synthesis would be unable to distinguish two senses, especially if they are of the same morphosyntactic category).

    The rest of the preamble as well as each individual variant is composed of blocks. Blocks are the ingredients of the description; they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.

    In sum, then, morphs have the following structure:

    [statement] (MORPH:) morph-name block ... (, block ... )* ;

    Blocks are explained in detail in the next subsection.

    7.1.2 Blocks

    Blocks are the ingredients of the description; they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.

    Blocks all have a leading keyword followed by some expressions (arguments) and last till the next keyword or the end of the variant:

    [block] KEYWORD argument...

    Blocks can come in any order within a variant and can be repeated any number of times. So writing

    KEYWORD: argument0 argument1 argument2 ...

    has the same effect as when it is written like

    KEYWORD: argument0 KEYWORD: argument1 KEYWORD: argument2 ...

    or even

    KEYWORD: argument0 SOME-OTHER-BLOCKS KEYWORD: argument1 SOME-OTHER-BLOCKS KEYWO

    or when it is included with a macro (see Section 7.2 [Macros], page 28).

    Certain blocks specify information in a cumulative way, so every time they are specified the information is added to the info specified so far. For instance, an IF block is cumulative: all the arguments of all the IF blocks of a variant cumulate to give the conditions of rule application, i.e., the rule applies only if all conditions on features are satisfied by the input (see IF block below).

    However, other blocks do not specify information that can be interpreted cumulatively, so it does not make sense to have more than one argument with them or specify them more than once for a variant. (They may, however, still be specified in the preamble and overridden in a variant, for instance.)

    In every case, out of contradictory information, the one given last has the last word, overriding previous ones.

    So if you write

    CLIP: 1 CLIP: 2

    it is the same as

    CLIP: 2
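
The difference between cumulative and overriding blocks can be made concrete with a toy model. The Python sketch below is not hunlex's parser: merge_blocks is a hypothetical helper that splits a variant on whitespace, treats every token ending in a colon as a keyword, and assumes that IF is the only cumulative keyword (the only one named as such so far).

```python
CUMULATIVE = {"IF"}  # assumption: only IF is documented as cumulative here


def merge_blocks(variant):
    """Merge the blocks of one variant into a keyword -> arguments map.

    Cumulative keywords collect every argument of every occurrence;
    for all other keywords the argument given last wins.
    """
    merged = {}
    keyword = None
    for token in variant.split():
        if token.endswith(":"):
            keyword = token[:-1]
            merged.setdefault(keyword, [])
        elif keyword is None:
            raise ValueError(f"argument before any keyword: {token!r}")
        elif keyword in CUMULATIVE:
            merged[keyword].append(token)
        else:
            merged[keyword] = [token]
    return merged


print(merge_blocks("CLIP: 1 CLIP: 2"))               # {'CLIP': ['2']}
print(merge_blocks("IF: f0 CLIP: 1 IF: f1 CLIP: 2"))  # {'IF': ['f0', 'f1'], 'CLIP': ['2']}
```

In this model, writing IF: f0 f1 and IF: f0 IF: f1 gives the same result, which mirrors the equivalence of repeated blocks stated above.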

    In what follows, blocks are listed and explained one by one.

    [block]DEFAULT feature ...

    Default morphs are used to assign features to inputs unspecified for some features. A morph with a default block just adds extra rules that leave alone inputs which are specified for any of the features to be defaulted. The variants of a morph having a default block in their preamble will assume that none of the features to be defaulted is present in the input.

    So

    morph DEFAULT: feature0 feature1 , MATCH: x OUT: feature0 ;

    is equivalent to

    morph , IF: !feature0 feature1 OUT: feature1 , IF: feature0 !feature1 OUT: feature0 , IF: feature0 feature1 OUT: feature0 feature1 , IF: !feature0 !feature1 MATCH: x OUT: feature0

    Filters typically want to pass on their whole input by default.
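    The equivalence above can be reproduced mechanically. The following Python sketch is only an illustration (expand_default is a hypothetical helper, not part of hunlex): it enumerates every presence/absence combination of the defaulted features, emits a pass-through variant when at least one feature is present, and keeps the real variant for the case where none is.

```python
from itertools import product

def expand_default(features, variant_blocks):
    """Spell out a DEFAULT preamble as explicit variants (illustrative only)."""
    variants = []
    for present in product([False, True], repeat=len(features)):
        # Condition string: plain name for a present feature, !name otherwise.
        cond = " ".join(f if on else "!" + f for f, on in zip(features, present))
        kept = [f for f, on in zip(features, present) if on]
        if kept:
            # Input already specified for some feature: pass it on unchanged.
            variants.append(f"IF: {cond} OUT: {' '.join(kept)}")
        else:
            # Input unspecified for all features: the defaulting variant applies.
            variants.append(f"IF: {cond} {variant_blocks}")
    return variants

rules = expand_default(["feature0", "feature1"], "MATCH: x OUT: feature0")
print("morph , " + " , ".join(rules))
```

    The variants come out in a different order than in the text above, which should not matter as long as the conditions are mutually exclusive.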

    [block]VARIANT variant

    This block defines the actual affix or lexis.

    The exact shape of variant determines what type of affix or lexis the variant describes:

    +aff describes a suffix: when the rule applies, aff is appended to the end of the input (after possibly clipping some characters).

    aff+ describes a prefix: when the rule applies, aff is prepended to the beginning of the input (after possibly clipping some characters).

    pref+suff describes a circumfix: when the rule applies, pref is prepended to the beginning of the input (after possibly clipping some characters) and suff is appended to the end of the input (after possibly clipping some characters).

    lexis defines a lexis. This is typically used in the lexicon and used as input to the rules. If the VARIANT keyword is left out, it has to come as the first block of the rule (after the comma closing the preamble or the preceding rule).

    If a lexis is used in the grammar, it is meant to stand for a suppletive form. Since it may well be a typo, a warning is given. We encourage the policy of putting suppletive paradigmatic exceptions as variants of the lexeme in the lexicon file, especially since matches are ineffective for lexis rules; conditions on the suppletion should therefore be expressed with features, which is much safer anyway.

    All lexis and affix strings can contain any character except whitespace, comma, semicolon, colon (??), exclamation mark, slash, tilde, plus sign and hash mark: [^# \t \n ; , \r! / + ~]

    Todo: there should be a way to allow escapes.

    Substitutions (which are a special kind of rule) are specified by REPLACE/WITH blocks.

    [block]CLIP integer

    This block specifies the number of characters that need to be clipped from one end of the input.

    It has no effect if the variant is a lexis or a substitution. So you don't use this block in the lexicon.

    If no CLIP block is given, no characters are clipped (the integer defaults to zero).
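    How the shape of a variant and the CLIP value interact can be sketched as follows. This is a minimal Python illustration under stated assumptions: classify and apply_affix are hypothetical names, clipping is taken from the end for suffixes and from the beginning for prefixes, and clipping is not modelled for circumfixes because the text does not say from which end it applies.

```python
def classify(variant):
    """Return (kind, prefix, suffix) for a VARIANT string such as '+aff'."""
    if "+" not in variant:
        return ("lexis", "", "")
    pref, suff = variant.split("+", 1)
    if pref and suff:
        return ("circumfix", pref, suff)
    if suff:
        return ("suffix", "", suff)
    return ("prefix", pref, "")

def apply_affix(word, variant, clip=0):
    """Apply an affix variant to word, clipping `clip` characters first."""
    kind, pref, suff = classify(variant)
    if kind == "suffix":
        return word[:len(word) - clip] + suff
    if kind == "prefix":
        return pref + word[clip:]
    if kind == "circumfix":
        # Assumption: clipping is left out for circumfixes in this sketch.
        return pref + word + suff
    return word  # CLIP has no effect on a lexis

print(apply_affix("alma", "+ák", clip=1))
```

    With CLIP left at its default of zero, apply_affix("alma", "+k") simply appends the affix.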

    [block]REPLACE pattern
    [block]WITH template

    These blocks specify a substitution.

    pattern is a hunlex regular expression.

    template is a replacement string which can contain special symbols \1, \2, etc., which reference the bracketed subpatterns in pattern.
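    The way a template refers back to bracketed subpatterns works like backreferences in most regular expression libraries. The sketch below uses Python's re module as a stand-in; hunlex regular expressions may differ in syntax, and the pattern and the function name substitute are made up for illustration.

```python
import re

def substitute(word, pattern, template):
    """Rewrite word with template if pattern matches; otherwise return it unchanged."""
    return re.sub(pattern, template, word, count=1)

# Hypothetical substitution: drop the o before a final r (bokor -> bokr).
pattern = r"^(.*)o(r)$"
template = r"\1\2"
print(substitute("bokor", pattern, template))
```

    Here \1 stands for whatever the first bracketed subpattern matched and \2 for the second.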

    [block]MATCH pattern

    Specifies a match condition on rule application. The rule only applies if the input matches pattern, which is a hunlex regular expression. So you don't use this in the lexicon.

    The matched expression defines a match at the edge of the word: the beginning for prefixes and the end for suffixes. You may include special symbols like ^ and $ to make this more explicit.

    Match blocks are non-cumulative, but circumfixes allow two matches (one beginning with a ^ and one ending in a $).
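    Edge anchoring can be pictured as follows. This is a Python sketch using the re module in place of hunlex regular expressions; edge_matches is a hypothetical name and only the prefix and suffix cases are covered.

```python
import re

def edge_matches(word, pattern, kind):
    """Check a MATCH-like condition at the edge that the affix attaches to."""
    if kind == "suffix":
        anchored = f"(?:{pattern})$"   # suffix rules look at the end of the word
    elif kind == "prefix":
        anchored = f"^(?:{pattern})"   # prefix rules look at the beginning
    else:
        raise ValueError("only 'prefix' and 'suffix' are handled in this sketch")
    return re.search(anchored, word) is not None

print(edge_matches("alma", "[aeiou]", "suffix"))
print(edge_matches("dal", "[aeiou]", "suffix"))
```

    Wrapping the pattern in a non-capturing group keeps an alternation such as a|x anchored as a whole.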

    [block]IF condition ...

    IF blocks specify the conditions of rule application. Conditions are either positive conditions (feature-name) or negative conditions (NOT feature-name).

    The rule only applies if the input has the positive features specified in the IF blocks and doesn't have the negative features specified in the IF blocks.

    IF blocks are therefore cumulative and the conditions are understood conjunctively.
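    The conjunctive reading of IF blocks can be sketched with sets. The Python below is only an illustration: the function names are made up, and negative conditions are written with a leading ! as in the DEFAULT example earlier in this section.

```python
def parse_conditions(*if_blocks):
    """Pool the arguments of all IF blocks into positive and negative sets."""
    positive, negative = set(), set()
    for block in if_blocks:
        for cond in block.split():
            if cond.startswith("!"):
                negative.add(cond[1:])
            else:
                positive.add(cond)
    return positive, negative

def applies(input_features, *if_blocks):
    """True if the input has every positive and no negative feature."""
    positive, negative = parse_conditions(*if_blocks)
    feats = set(input_features)
    return positive <= feats and not (negative & feats)

print(applies({"back", "plural"}, "back", "!front"))
```

    Because the conditions are pooled, one IF block with two arguments and two IF blocks with one argument each give the same result.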

    [block]OUT output ...

    Specifies the output conditions of the variant (affix rule or lexis). An output can be a feature or a morph.

    Features can be restricted to particular morphs.

    [block]TAG tag-string

    Specifies the output tag chunk associated with the variant.

    [block]USAGE usage-qualifier ...

    Specifies usage qualifiers describing the variant.

    Cumulative (conjunctive)

    [block]FILTER feature ...

    States that the morph in question is a filter, which defines fallback rules for lexical features.

    This means that the variants are meant to apply only if the input has none of the filtered features.

    Has no effect within individual variants or in the lexicon. Only relevant in a morph preamble in the grammar.

    Cumulative (conjunctive on the rule conditions)

    [block]KEEP feature ...

    Defines inheritance of features: a feature mentioned in the KEEP block is an output feature of the result of rule application if and only if the input has the feature (as long as the particular variant applies to the input).

    If output features and keep features overlap, output features are meant to override inheritance.

    Features which are restricted by the input condition (IF block) are inherited normally, but since they are known, they can also be mentioned in the OUT block for clarity.

    NB: The thingies following KEEP in a KEEP block are features. They cannot be macro names. Don't trick yourself by abbreviating a sequence of phonofeatures with a macro and then referring to that in a KEEP block. Don't forget that macros abbreviate (a series of) blocks, so clearly they can't be nested within a KEEP block.

    Cumulative
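    Feature inheritance through KEEP can be summed up in one set expression. The following Python sketch is an illustration only (output_features is a hypothetical name) and treats features as plain names that are either present or absent.

```python
def output_features(input_features, out, keep):
    """Features of the result: everything in OUT plus KEEP features the input had."""
    inherited = set(keep) & set(input_features)
    # OUT features are always present, so they win over inheritance on overlap.
    return set(out) | inherited

print(sorted(output_features({"back", "low"}, out={"plural"}, keep={"back", "round"})))
```

    A kept feature that the input lacks (round in the usage above) does not show up in the output.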

    [block]FREE bool

    Specifies if the rule application gives a full form. For bound stems or non-closing affixes, it has to be set to false.

    By default, variants in the lexicon are not free, whereas variants in the grammar are free. Todo: is this ok?

    [block]FS feature-structure

    Specifies the feature structure graph to be merged when the rule applies. feature-structure is a kr-style feature structure description string.

    [block]PASS bool

    7.2 Macros

    [expression]DEFINE macro-name blocks

    Defines a macro named macro-name. Any time macro-name is encountered after this definition, it is understood as if it said blocks. blocks is a sequence of any blocks, including (other) macro-names. A macro-name appearing elsewhere than in its definition has to be already defined.

    Todo: what if a macro-name is a declared morph-name? What if a macro-name is a declared feature?

    [expression]REGEXP regexp-name regexp

    Binds regexp-name to a hunlex regular expression, i.e., a regular expression that can contain regular expression macro-names in angle-brackets. regexp-name can be referenced within any regular expression later. An expression is resolved by replacing the substring <regexp-name> with the resolved regexp.

    This means that you have to escape your <-s and >-s if they do not delimit regexp-names.

    As said, you can define regexp-macros using other macros; only, at the time of using a regexp-name, it has to be defined already (the definition should be earlier in the file), so that it can be resolved at the time of reading the definition.
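    Resolution at definition time can be sketched like this. The Python below is not how hunlex implements it; RegexpTable, the name syntax accepted between angle brackets, and the backslash escape convention are all assumptions made for the illustration.

```python
import re

class RegexpTable:
    """Keeps already-resolved regexp macros, in order of definition."""

    def __init__(self):
        self.macros = {}

    def resolve(self, regexp):
        """Replace every unescaped <name> with its previously resolved regexp."""
        def expand(match):
            name = match.group(1)
            if name not in self.macros:
                raise KeyError(f"regexp-name {name!r} used before its definition")
            return self.macros[name]
        # (?<!\\) skips angle brackets that are escaped with a backslash.
        return re.sub(r"(?<!\\)<([A-Za-z_][\w-]*)>", expand, regexp)

    def define(self, name, regexp):
        # Resolving here means later definitions never need a second pass.
        self.macros[name] = self.resolve(regexp)

table = RegexpTable()
table.define("V", "[aeiou]")
table.define("C", "[^aeiou]")
table.define("CVC", "<C><V><C>")
print(table.macros["CVC"])
```

    Referring to a name that has not been defined yet raises an error in this sketch, mirroring the requirement that definitions come earlier in the file.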

    7.3 Metadata

    8 Files

    There are various files that hunlex processes. Input as well as output files are described in this chapter. The file names used in this section are just nicknames (which happen to be the default filenames assumed) and can be changed at will by setting toplevel options (see Section 6.4 [Options], page 15).

    8.1 Input Resources

    There are several types of files hunlex considers and they will all be discussed in turn.

    Lexicon and grammar are the two files which are considered the primary resources. These files contain the description of the language's morphology with all the rules for affixation, lexical entries, morphological output annotation (tags), etc.; see Section 8.1.1 [Primary Resources], page 30.

    Secondly, there are configuration files, which declare the morphemes and features that are considered active by hunlex for a particular compilation. By choosing and adjusting parameters of these features, one can manipulate under- and over-generation of the analyzer (see Section 6.4.5 [Resource Compilation Options], page 19) and, most importantly, regulate which affixes are merged together to yield the affix-cluster rules dumped into the affix file. The way affixes are merged is crucial for the efficiency of real-time analyzers (see Chapter 10 [Levels], page 36). These files are also described below (see Section 8.1.2 [Configuration Files], page 31).

    8.1.1 Primary Resources

    Primary resources are the files that you are supposed to develop, maintain and extend, and that describe your morphology (see Section 1.2 [Motivation], page 2). There are two primary resources: the grammar and the lexicon. These files are described below.

    8.1.1.1 Lexicon

    The lexicon file (the file name of which is lexicon by default, but can be set through options, see Section 6.4.3 [Input File Options], page 17) is the repository of lexical entries, containing information about:

    lemmas

    stem-allomorphs (variants) belonging to the lemma's paradigm

    suppletive forms expressing some paradigmatic slot of the lemma

    morphological output annotation (tag) of the lemma (and the variants)

    sense indices (arbitrary tag to distinguish identical lemmas)

    the morphosyntactic and morphophonological features which characterize variants (or lemmas), and which determine their morphological combinations (i.e., which rules apply to them and how).

    usage qualifiers of the variants (or lemmas), such as register, usage domain, normative status, formality, etc.

    The syntax of the lexicon file is basically the same as that of the grammar, except that it cannot contain macro definitions (see Section 7.2 [Macros], page 28). This syntax of describing morphology is explained in detail in another chapter (see Chapter 7 [Description Language], page 24).

    For examples of lexicons, have a look at the zillion examples in the Examples directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).

    8.1.1.2 Grammar

    The grammar file is the other primary resource and is also absolutely necessary to describe the morphology of your language. Its name is grammar by default but can be changed by setting toplevel options (see Section 6.4.3 [Input File Options], page 17). The grammar file specifies:

    affix morphemes

    affix-allomorphs (variants) belonging to the same morpheme

    morphological output annotation (tag) of the affix morpheme (and its variants)

    the morphosyntactic and morphophonological features which characterize variants (or morphemes) and which determine their morphological combinations (which rules apply to them and how).

    usage features of the variants (or affixes), such as register, usage domain, normative status, formality, etc.

    possibly special pseudo affixes, so called filters, which assign (default) features to variants based on their form (orthographic patterns) or other features.

    The syntax of the grammar file is the same as the one used for the lexicon except that the grammar file can contain macro definitions (see Section 7.2 [Macros], page 28). The syntax and semantics of this description language are explained in detail in another chapter, see Chapter 7 [Description Language], page 24.

    For examples of grammar files, have a look at the zillion examples in the Examples directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).

    8.1.2 Configuration Files

    Configuration files are the files which mediate between primary resources describing a language and a particular resource created for a particular method, routine or application.

    There are three configuration files which tell hunlex which units and features should be included into the output resource from among the ones mentioned in the primary resources. The units (morphemes, features) not declared in these configuration files are considered ineffective by hunlex while reading the primary resources.

    The format of these three configuration files is the same: each declares one unit per line (with some parameters) and accepts comments starting with # and lasting till the end of the line. They are discussed in turn below.
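    The shared line format is simple enough to read with a few lines of code. The sketch below is a Python illustration, not the hunlex parser: read_conf is a hypothetical name, fields are assumed to be separated by whitespace, and the sample entries are made up.

```python
def read_conf(text):
    """Return a list of (unit, parameters) pairs from a configuration file body."""
    entries = []
    for line in text.splitlines():
        # Everything from # to the end of the line is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, *params = line.split()
        entries.append((name, params))
    return entries

sample = """
# made-up entries
PLUR 2      # a unit with one parameter
INESSIVE
# SUPERESSIVE is commented out, so it stays undeclared
"""
print(read_conf(sample))
```

    A unit that is commented out simply does not appear in the result, which matches the rule that undeclared units are ineffective.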

    8.1.3 Morpheme Configuration File

    The morph.conf file is one of the compilation configuration files that determine how hunlex compiles its output resources (aff and dic, see Section 8.2 [Output Resources], page 33) from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary Resources], page 30).

    It declares the affix morphemes and the filters that are to be used from among the ones that are in the grammar.

    Warning: the affix morphemes not listed (or commented out) in this file are ineffective for the compilation (as if they were not in the grammar).

    Each line in this file contains the affix morpheme's name and optionally a second field, which gives the level of the morpheme. If no level is given, the affix is assumed to be of the maximum level (the value of the option MAX_LEVEL, see Section 6.4.5 [Resource Compilation Options], page 19). Very briefly, levels regulate which affixes will be merged with which other affixes to yield the affix clusters that are dumped as affix rules into the affix file. The odds and ends of levels are described in detail in another chapter (see Chapter 10 [Levels], page 36).

    For examples of the rather dull morph.conf files, browse the examples in the Examples directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).

    If you have a grammar and you want to declare all the (undeclared) morphs defined in it by including them in morph.conf, all you do is type

    make DEBUG_LEVEL=1 new resources 2>&1 | grep (morph skipped) | cut -d -f1 >

    in the directory where your local Makefile resides. This will append all the undeclared morphs (one per line) to the morph.conf file. Note that the morphs so declared will be of the maximum level (see above).

    8.1.4 Feature Configuration File

    The phono.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Section 8.2 [Output Resources], page 33) from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary Resources], page 30).

    The phono.conf file simply lists all the features that we want used from among the ones used in the grammar and the lexicon. Very briefly, features are attributes of affixes and lexical entries, the presence or absence of which can be a condition on applying an affix rule.

    Warning: Features used in the grammar but not mentioned (or commented out) in the phono.conf file will be ignored (as if they were never there) for the present compilation by hunlex when reading the primary resources.

    Warning: Features mentioned in phono.conf but never used in the grammar or the lexicon are allowed. They maybe should generate a warning, but they don't. This may cause a lot of trouble.

    So, phono.conf simply declares the features, one on each line, and allows the usual comments (with a #).
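    The effect of leaving a feature out of phono.conf can be pictured as a filter over rule conditions. This Python sketch is only an illustration of the "as if they were never there" behaviour; drop_undeclared is a hypothetical name and the ! prefix for negative conditions follows the DEFAULT example in Chapter 7.

```python
def drop_undeclared(conditions, declared):
    """Keep only the conditions whose feature is declared in phono.conf."""
    return [c for c in conditions if c.lstrip("!") in declared]

declared = {"back", "low"}            # made-up contents of a phono.conf
print(drop_undeclared(["back", "!low", "archaic"], declared))
```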

    For examples of phono.conf files, browse the examples in the Examples directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).

    If you have a grammar and you want to declare all the (undeclared) features referred to in conditions in the grammar by including them in phono.conf, all you do is type

    make DEBUG_LEVEL=1 new resources 2>&1 | grep (feature skipped) | cut -d -f1

    8.1.5 Usage Configuration

    The usage.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Section 8.2 [Output Resources], page 33) from the primary resources (lexicon and grammar, see Section 8.1.1 [Primary Resources], page 30).

    usage.conf in particular determines which usage qualifiers are allowed for the input units (lexical entries, affixes, filters and the variants thereof) that are included into the resource to be compiled. Units having a usage qualifier that is not listed in this file are ignored for the compilation (as if they were not there).

    NB: Usage qualifiers are not first class features. They cannot be negated or used as conditions on rule application. They are simply used to categorize rules (affixes and stems) in certain dimensions such as etymology, register, usage domain, normative status, formality, etc.

    In addition to declaring allowed usage qualifiers, this file has another function as well. Each line containing the usage qualifier may contain a second field, which is a tag associated with that usage feature. If this field is missing, the name of the usage qualifier string is assumed to be its tag. Usage qualifier tags can be output by the analyzer if they are compiled into the resources by hunlex.

    This can be configured with the output info option (see Section 6.4.5 [Resource Compilation Options], page 19).

    Warning: This option is not implemented yet.

    Todo: This is not implemented yet. I don't even know if this is fine like this. The problem is that they cannot really be just intermixed with the ordinary morphological tags.
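    The two roles of usage.conf (allowing qualifiers and naming their tags) can be sketched as follows. This is a Python illustration under assumptions: the function names and the sample qualifiers are made up, fields are taken to be whitespace-separated, and nothing here reflects how (or whether) hunlex outputs the tags.

```python
def read_usage_conf(text):
    """Map each allowed usage qualifier to its tag (the qualifier itself by default)."""
    tags = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split(None, 1)
        if not fields:
            continue
        name = fields[0]
        tags[name] = fields[1].strip() if len(fields) > 1 else name
    return tags

def is_included(unit_qualifiers, tags):
    """A unit is compiled only if all of its usage qualifiers are listed."""
    return all(q in tags for q in unit_qualifiers)

tags = read_usage_conf("archaic ARCH  # tag given explicitly\nformal\n# slang\n")
print(tags)
print(is_included(["formal", "slang"], tags))
```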

    Various dimensions of usage information can be made effective by introducing expressions with arbitrary leading keywords (see Chapter 7 [Description Language], page 24). Redefining each of the wanted usage dimensions in the parsing_common.ml file will result in making any one or more of them effective as usage qualifiers. The point is that you can keep a lot of information in the same lexical database. When the keywords it contains are hunlex-ineffective, the expressions they lead are simply ignored.

    Caveat: At the moment, for these alternatives, you have to recompile hunlex with the new keyword associations, see Chapter 7 [Description Language], page 24.

    Todo: This could be done online but has very low priority.

    For examples of usage.conf files, browse the examples in the Examples directory that comes with the distribution (see Section 4.6 [Installed Files], page 9).

    8.2 Output Resources

    The output of a hunlex resource compilation is an affix file and a dictionary file. In brief, the affix file contains the description of the affix (cluster) rules of the language we analyze, while the dictionary contains the stems the affix rules can apply to. They have more or less the same role as the grammar and lexicon files, the primary resources of hunlex (see Section 8.1.1 [Primary Resources], page 30). But the affix and dictionary files are resources that are used by real-time word-analysis routines (such as morphbase, myspell or jmorph, see Chapter 14 [Related Software and Resources], page 46). They share commonalities of format with minor idiosyncrasies, some of which are still changing.

    Hunlex reads a transparent, human-maintainable, non-redundant morphological grammar description with the lexicon of a language and creates affix and dictionary files tailored to your needs (see Chapter 1 [Introduction], page 1). The ultimate purpose of hunlex is that these output resource files could at last be considered a binary-like secondary (automatically compiled) format, not a primary (maintained) lexical resource.

    Therefore the technical specification of these output formats should only concern you here if you want to compile affix and dictionary files for your own (or modified versions of our own) word-analysis software which also reads the aff/dic files. In such a case, however, you know that format better than I do. All I can say is that the parameters along which the format can be manipulated are supposed to conform with the format of the software listed in Section 14.1 [Software that can use the output of Hunlex as input], page 46. If you develop some such stuff as well and would like your format to be supported, take a deep breath and consider requesting a feature from the authors (see Section 3.3 [Requesting a New Feature], page 5).

    In sum, the format of these output resource files is not detailed here. Anyway, they are (probably) well documented elsewhere (e.g., the myspell manual page). See especially the documentation of huntools and the morphbase library (see Section 14.1.1 [Huntools], page 46).

    9 Command-line Control

    This chapter is a verbatim include of the hunlex manpage. Command-line control is not the recommended interface for using hunlex; see toplevel control (Chapter 6 [Toplevel Control], page 12).

    removed so make doc would run

    10 Levels

    Levels index morphemes and are assigned to morphemes in the morph.conf file (see Section 8.1.3 [Morpheme Configuration File], page 31).

    Levels govern which affixes will be merged together into complex affixes (or affix clusters) and will constitute an affix rule (linguistically correctly, an affix-cluster rule) in the output affix file (see Section 8.2 [Output Resources], page 33). Affix rules in the affix file will be stripped from the analyzed words by the analysis routines in one step (i.e., by one rule-application).

    Levels, then, regulate the output resources of hunlex and have no role to play in how you design your grammars. There are no levels in the hunlex grammar and lexicon, the files which describe the morphology of the language (see Section 8.1.1 [Primary Resources], page 30). Levels make sense only in relation to the compilation process.

    This chapter describes why you would want levels, how you manipulate them and what consequences this has on analysis.

    10.1 Levels and Affix Rules

    Imagine a word has several affixes, like dalokban (= dal song + ok plural + ban inessive). Assume that your hunlex grammar correctly describes the plural and inessive morphemes and their combination rules. If you assign these morphemes to different levels, the output resource will contain affix rules expressing the morphemes separately. This means that these affixes are not stripped in one go by the analysis routines using the affix file as their resource.

    Some affixes, however, may need to be stripped as a cluster in one go, because some analysis algorithms do not allow any number of consecutive affix-stripping operations or because stripping them in one go is just better for your purposes (see Section 10.5 [Levels and Optimizing Performance], page 38). Therefore the separate affix rules in the input grammar should be merged when they are dumped by hunlex as rules into the affix file. Well, levels regulate which morphemes should be merged with which other morphemes. (To be more precise, they regulate which affix rules expressing which morphemes should be merged with which other affix rules expressing which other morphemes.)
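    As a rough picture of what merging means for the dalokban example, consider the Python sketch below. It is a deliberately simplified illustration and not the hunlex algorithm: merge_by_level, the morph names and the numeric levels are made up, and it only assumes that consecutive affixes sharing a level end up in one cluster while affixes on different levels stay separate. The actual role of level ordering is described later in this chapter.

```python
from itertools import groupby

def merge_by_level(chain, levels):
    """Group consecutive (morph, form) pairs of equal level into clusters."""
    rules = []
    for _, group in groupby(chain, key=lambda item: levels[item[0]]):
        group = list(group)
        names = "+".join(name for name, _ in group)
        form = "".join(surface for _, surface in group)
        rules.append((names, form))
    return rules

chain = [("PLUR", "ok"), ("INESSIVE", "ban")]   # the affixes of dal+ok+ban
print(merge_by_level(chain, {"PLUR": 2, "INESSIVE": 2}))
print(merge_by_level(chain, {"PLUR": 1, "INESSIVE": 2}))
```

    With equal levels the sketch yields a single cluster okban that would be stripped in one step; with different levels it yields two separate rules.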

    Since merged affix rules are highly redundant and tedious to maintain, one of the main purposes of hunlex is actually to allow for high flexibility in your choice of merging affixes to create resources optimized for your needs, while at the same time also allowing for a transparent and non-redundant description for easy maintenance and scalability (see Chapter 1 [Introduction], page 1).

    10.2 Levels and Stems

    Levels do not only regulate which affixes are compiled into one affix cluster (an affix rule in the output affix file, see Section 10.1 [Levels and Affix Rules], page 36). They also determine which stems are precompiled into the dictionary (see Section 8.2 [Output Resources], page 33). In particular, all affixes below a so called minimal lexical level (see Section 10.3 [Levels and Ordering], page 37) are precompiled with the stems of the lexicon into the output dictionary.
