
    Eugenio Picchi (Istituto di Linguistica Computazionale, CNR, Pisa, Italy), Carol Peters (Istituto di Elaborazione della Informazione, CNR, Pisa, Italy), Elisabetta Marinai (ACQUILEX Project, Istituto di Linguistica Computazionale, CNR, Pisa, Italy)

    The Pisa Lexicographic Workstation: The Bilingual Components

    ABSTRACT: The main components of the Pisa Lexicographic Workstation are a full text retrieval system and a lexical database system; each system incorporates procedures that have been implemented to meet the specific needs of the lexicographer. The paper describes the recent tailoring of existing modules and the development of new ones with bilingual lexicography in mind. The aim is to provide a flexible, user-friendly system that can be employed in all stages of dictionary compilation, from the acquisition of citation material to the formatting of the entry for printing.

    1. Introduction

    For some time now, a lexicographic workstation has been under development at the "Istituto di Linguistica Computazionale" in Pisa. The workstation provides a series of tools designed specifically for linguistic and lexicographic text processing tasks which can be used by the lexicographer to assist him/her in the various activities involved in the creation and revision of dictionaries. The main components of the workstation are the DBT (Data Base Testuale), a full text retrieval system that has been developed to query and analyse all kinds of texts and textual corpora (Picchi 1991), and a lexical database system that has been implemented to handle dictionary acquisition and processing activities; a morphological procedure is associated with the text and dictionary query systems. The lexicographer can use these two systems, the DBT and the LDB, to interrogate on-line text archives and electronic dictionaries and to retrieve and extract reference and citation material. The core module of the system is a procedure for on-line dictionary editing which includes functions for windowing into and copying data from the dictionary and text archives, and is integrated with a structured indexing procedure that can be used to query the dictionary in compilation in order to check the regularity and consistency of the input. The present paper describes the recent specialization of existing modules and the development of new ones to meet the specific needs of the bilingual lexicographer.


    278 EURALEX  '92 -  PROCEEDINGS

    2. The Bilingual Components

    The bilingual modules consist of an on-line bilingual entry editor, a bilingual lexical database query system, and a system for the automatic creation and retrieval of parallel concordances from bilingual text archives (DBT-Synchro). As the languages which are currently being considered for the bilingual components are Italian and English, we have also added a morphological procedure for English to complement the program already implemented for Italian. Figure 1 shows the global structure of the integrated mono-/bilingual lexicographic workstation.

    Figure 1: Integrated Monolingual and Bilingual Lexicographic Workstation

    2.1. Bilingual Dictionary Editor

    The bilingual editor is a specialized version of the on-line editor developed for monolingual dictionary compilation, providing functions to assist the bilingual lexicographer in creating or revising lexical entries. When the editor environment is entered, the user is presented with a basic entry template. This schema represents explicitly, in tagged fields, all the lexical information (phonetic, syntactic and semantic) which will be contained in the printed dictionary. The default entry template has been designed on the basis of experience acquired during the study of a standardized representation structure for dictionary entries in the ACQUILEX project (see Calzolari et al. 1990). The use of a uniform representation language ensures compatibility between entries, permits the exchange of data between different projects in a common format, and facilitates analyses over different dictionaries (see endnote 2). However, the lexicographer is free to modify the structure of the template proposed by the system to meet the particular characteristics of the dictionary or lexicon on which he is working.

    The source language lexicographer, whose task it is to analyse the headword and make a first proposal of sense divisions and example material, compiles his entries by filling in the appropriate data fields; they can then be saved on file or printed out and passed on to the target language compiler, who is responsible for transfer into L2 but who may also propose changes to the original analysis on the basis of L2 dependent factors. The final synthesis of the entry will be the result of the joint efforts of both lexicographers. The results of each stage can be saved separately so that a trace of the development of the entry, through the several stages of analysis, transfer and synthesis, is maintained. At any moment, the lexicographers can access both monolingual and bilingual text corpora and LDBs from within the editor environment, which includes functions to window into, query and copy information from these reference archives. The structure of the entry and the main functions available for bilingual dictionary editing can be seen in Figure 2, which shows a screen dump from a work session in which the Help function has been invoked.

    E. Picchi - Bilingual Dictionary Editor - Italian/English        CARRO

    Hdwd    1  carro                                  Help
    Pron    2  ['karro]                               '↑'      Previous field
    PoS     4  sm                                     '↓'      Next field
    NPoS    5  N                                      '"'      Duplicate field
    SenNo   6  (a)                                    'C'      Change field code
    Trans   7  cart, wagon;                           'D'      Delete field
    SICon   8  per carnevale                          'E'      Edit field
    Trans   9  float;                                 'I'      Insert new field
    Ex     10  mettere il =**= avanti ai buoi         'J'      Join two fields
    SemCd  11  fig                                    'R'      Restore a field
    ExTr   12  to put the cart before the horse.      'S'      Split text field
    SenNo  13  (b)                                    ALT+F8   Restore entry
    SubCd  14  Astron                                 ALT+F9   Copy entry
    Ex     15  il Gran/Piccolo C=**=                  ALT+F10  Delete entry
    ExTr   16  the Great/Little Bear.                 F3       Quit entry
    Mwd    17  =**= armato                            F4       Save entry
    SubCd  18  Mil                                    F5       Previous entry
    MwdTr  19  tank;                                  F6       Next entry
    Mwd    20  =**= attrezzi                          F7       Call MLDB
    SubCd  21  Aut                                    F8       Call DBT
    MwdTr  22  breakdown van;                         F9       Call DBTSynchro
    ↓ for more

    Select Function»

    Figure 2: Bilingual Dictionary Editor
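The tagged-field entry of Figure 2 can be pictured as an ordered list of (field code, text) pairs on which the editor's field functions operate. The following Python fragment is only an illustrative sketch under that assumption: the class and method names (`Field`, `Entry`, `insert`, `split`, `join`, `values`) are hypothetical and do not describe the workstation's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    tag: str   # field code as shown in Figure 2, e.g. "Hdwd", "Trans"
    text: str  # content typed in by the lexicographer


@dataclass
class Entry:
    """Hypothetical model of one dictionary entry: an ordered list of tagged fields."""
    fields: List[Field] = field(default_factory=list)

    def insert(self, index: int, tag: str, text: str) -> None:
        """Insert a new field before position `index` (0-based)."""
        self.fields.insert(index, Field(tag, text))

    def split(self, index: int, at: int) -> None:
        """Split one text field into two fields carrying the same tag."""
        f = self.fields[index]
        left, right = f.text[:at].strip(), f.text[at:].strip()
        self.fields[index:index + 1] = [Field(f.tag, left), Field(f.tag, right)]

    def join(self, index: int) -> None:
        """Join a field with the following one, keeping the first field's tag."""
        a, b = self.fields[index], self.fields[index + 1]
        self.fields[index:index + 2] = [Field(a.tag, f"{a.text} {b.text}")]

    def values(self, tag: str) -> List[str]:
        """Return the text of every field carrying `tag`, in entry order."""
        return [f.text for f in self.fields if f.tag == tag]


# Abridged version of the CARRO entry from Figure 2.
carro = Entry([
    Field("Hdwd", "carro"),
    Field("PoS", "sm"),
    Field("SenNo", "(a)"),
    Field("Trans", "cart, wagon;"),
    Field("SICon", "per carnevale"),
    Field("Trans", "float;"),
])
```

Keeping the fields in an ordered list rather than a dictionary matters here because the same field code (for instance Trans) may occur several times in one entry.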

    During a work session, parts of the dictionary being compiled can be extracted and printed or saved in separate files. In this way, particular subsets of the dictionary can be treated independently: a) a given subset of the lexicon; b) a given subset of the data fields; c) only entries satisfying certain constraints. These facilities can be used when certain data fields are to be compiled separately (e.g. the phonetic transcriptions) or when consistency checks are to be made throughout the dictionary (e.g. for semantic label attribution).

    The compiled lexical entries can be input to the parsing and indexing procedures of the bilingual lexical database system, and can then be checked interactively by the lexicographer using the LDB query system (see 2.3 below). Another procedure that can be used on the entries in LDB form to check the consistency of the data is the Reverse procedure, which operates on the data contained in the translation field of each entry in order to recognize certain incongruities between the two bilingual data sets. The procedure identifies (i) all cases where translations on one side of the dictionary are not listed as headwords on the other, and (ii) all cases where words appearing as translations of a lemma on one side of the dictionary, when listed as headwords on the other side, do not give the lemma as one of their possible translations. There are a number of reasons why symmetry is not always desirable, but the bilingual lexicographer should have access to such information during the construction of the dictionary. He can then decide whether an omission is deliberate or due to an oversight, and whether to make amends or not. The Reverse procedure can also be executed in a similar way on words appearing in the Example Translation field. In each case, the procedure is run automatically and the results are printed out for verification by the lexicographer.
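The two checks made by the Reverse procedure can be sketched as follows. This is a minimal illustration, assuming each side of the dictionary has already been reduced to a mapping from headword to the set of words in its translation fields; the function name `reverse_check` and the toy data are hypothetical and are not taken from the workstation or from any real dictionary.

```python
from typing import Dict, List, Set, Tuple

Side = Dict[str, Set[str]]  # headword -> words found in its translation fields


def reverse_check(side_a: Side, side_b: Side) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (case i, case ii) as lists of (lemma, translation) pairs.

    case i : the translation is not a headword on the other side
    case ii: the translation is a headword there, but does not give
             the lemma back as one of its translations
    """
    missing_headwords: List[Tuple[str, str]] = []
    missing_back_links: List[Tuple[str, str]] = []
    for lemma in sorted(side_a):
        for translation in sorted(side_a[lemma]):
            if translation not in side_b:
                missing_headwords.append((lemma, translation))
            elif lemma not in side_b[translation]:
                missing_back_links.append((lemma, translation))
    return missing_headwords, missing_back_links


# Invented toy data, for illustration only.
it_en: Side = {"carro": {"cart", "wagon", "float"}}
en_it: Side = {"cart": {"carro"}, "wagon": {"vagone"}}

no_headword, no_back_link = reverse_check(it_en, en_it)
```

The sketch only reports the asymmetries; as the text stresses, deciding whether each one is deliberate is left to the lexicographer.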

    The completed lexical entries can be printed out in various formats, or used as input for photocomposition systems which will produce the final version of the dictionary.

    2.2. Morphological Procedure for English

    Our morphological system consists of a language independent set of procedures which operate on a suitably encoded description of a language in order to recognise and produce words in that language. We adopt a two-level approach rather than a generative one (see endnote 3). The language description is formulated in two files: a lexicon file containing a list of base lemmas with associated morphosyntactic information and an inflection code, and a rule file containing the rules which specify the correspondences between underlying lexical items and surface forms. The program is reversible: the same lexicon and set of rules are used for recognition and for generation. Our guiding principle has been efficiency and convenience of implementation.

    This module has already been implemented for Italian (see Picchi and Calzolari 1990). For the English version we have derived our list of lemmas from the headwords of two computerized dictionaries. Each lexical entry has associated morphosyntactic information. A code has been assigned (semi)automatically to each lemma and is used to invoke the rule which determines its inflection. Rules have been written to cover all regular inflections and those irregular inflections which can be grouped together into classes. Highly irregular inflections are encoded singly. Information on irregular morphology and other phenomena such as gemination has been extracted from the computerized dictionaries; we use orthographic rather than phonological information as this information is more conveniently derived from our dictionaries.
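The idea of a lexicon of lemmas carrying inflection codes, with one rule set serving both generation and recognition, can be illustrated with a deliberately simplified sketch. It uses a plain orthographic suffix table, not the two-level rule formalism adopted in the system; the inflection codes (`N1`, `N2`, `V2`, `V3`), the rule encoding and the function names are all invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: lemma -> (part of speech, inflection code).
LEXICON: Dict[str, Tuple[str, str]] = {
    "cart": ("n", "N1"),
    "box": ("n", "N2"),
    "change": ("v", "V2"),
    "stop": ("v", "V3"),
}

# Hypothetical rules: inflection code -> list of
# (characters to strip from the lemma, double the final letter?, suffix).
RULES: Dict[str, List[Tuple[int, bool, str]]] = {
    "N1": [(0, False, ""), (0, False, "s")],
    "N2": [(0, False, ""), (0, False, "es")],
    # final -e is dropped before -ing
    "V2": [(0, False, ""), (0, False, "s"), (0, False, "d"), (1, False, "ing")],
    # gemination of the final consonant before -ed and -ing
    "V3": [(0, False, ""), (0, False, "s"), (0, True, "ed"), (0, True, "ing")],
}


def generate(lemma: str) -> List[str]:
    """Produce every surface form of a lemma from its inflection code."""
    _pos, code = LEXICON[lemma]
    forms = []
    for strip, double, suffix in RULES[code]:
        stem = lemma[:len(lemma) - strip]
        if double:
            stem += stem[-1]
        forms.append(stem + suffix)
    return forms


def analyse(form: str) -> List[str]:
    """Recognise a surface form by reusing the same lexicon and rules."""
    return sorted(lemma for lemma in LEXICON if form in generate(lemma))
```

Recognition here simply runs generation over the whole lexicon, which is adequate for a toy example; a production analyzer would index or compile the rules instead.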

    The morphological tool includes an on-line display and editor which can be used to view the generation of the word forms for any lemma in the lexicon and to add to or correct either the inflectional or morphosyntactic codes if necessary. The lexicographer can use the generator when querying text or dictionary archives to expand any given lemma by producing the set of all its forms; the whole paradigm for the lemma can then be searched by entering a single command. The complete morphological procedure (English/Italian analyzers and generators) is also an essential component of the bilingual text retrieval system.

    2.3. The Bilingual Lexical Database Query System

    A major component of the set of tools implemented with the bilingual lexicographer in mind is the bilingual lexical database system. The various stages in the design and development of this system have already been described elsewhere, and the bilingual LDB now forms part of a Multilingual Lexical Database System (MLDB) that is being implemented in the context of the ACQUILEX project (see Marinai et al. 1990).

    The bilingual LDB query system provides dynamic search procedures which permit the user to navigate through the dictionary data and within the different fields of the entry in order to access and retrieve information of interest in whatever part of the dictionary it is stored, specifying the language on which the query is to operate. In this way, much information which is normally "hidden" in the printed dictionary can be accessed and exploited. The query system supplies a series of functions which can be used to look up lexical items or combinations of items. The user can search given items or character strings, define search functions in which items or character strings are combined by AND, OR and NOT operators, retrieve all entries satisfying the search conditions, display, print or store on file all or a selected part of the results, and define restriction rules on the results of a previous search. Searches are made on an attribute-value basis and the results are given for each field in which the item is found. The lexicographer can use the query system on existing dictionaries maintained by the system or on the dictionary under construction, which can be input to the MLDB parsing and indexing procedures and structured as an on-line database.
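An attribute-value search with AND, OR and NOT operators, followed by a restriction on the results of a previous search, might look like the sketch below. It is an illustration only: the helper names (`has`, `AND`, `OR`, `NOT`, `search`) and the in-memory representation are assumptions, and the three entries are abridged from those displayed in Figure 3.

```python
import re
from typing import Callable, Dict, List, Tuple

Entry = List[Tuple[str, str]]     # ordered (field tag, text) pairs
Query = Callable[[Entry], bool]


def has(tag: str, item: str) -> Query:
    """True when `item` occurs as a whole word in a field carrying `tag`."""
    wanted = item.lower()

    def test(entry: Entry) -> bool:
        return any(t == tag and wanted in re.findall(r"\w+", text.lower())
                   for t, text in entry)
    return test


def AND(*queries: Query) -> Query:
    return lambda entry: all(q(entry) for q in queries)


def OR(*queries: Query) -> Query:
    return lambda entry: any(q(entry) for q in queries)


def NOT(query: Query) -> Query:
    return lambda entry: not query(entry)


def search(db: Dict[str, Entry], query: Query) -> List[str]:
    """Return the headwords of all entries satisfying the query."""
    return sorted(hw for hw, entry in db.items() if query(entry))


# Three entries abridged from Figure 3.
LDB: Dict[str, Entry] = {
    "chariot": [("Hdwd", "chariot"), ("PoS", "n"), ("Trans", "cocchio, carro.")],
    "hearse": [("Hdwd", "hearse"), ("PoS", "n"), ("Trans", "carro funebre.")],
    "tank": [("Ex", "fuel =**="), ("ExTr", "serbatoio del carburante."),
             ("SubCd", "Mil"), ("Trans", "carro armato.")],
}

hits = search(LDB, has("Trans", "carro"))
# Restriction rule applied to the results of the previous search.
previous = {hw: LDB[hw] for hw in hits}
non_military = search(previous, NOT(has("SubCd", "Mil")))
```

Representing each condition as a small function keeps the operators composable, so a restriction is just another query run over the earlier result set.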

    LDB (E. Picchi)    Collins Bilingual Dictionary (Normalized)
    Item: {I} CARRO    Frequency: 9

    1) CHARIOT  {Hdwd} chariot {PoS} n {NPoS} N {Trans} cocchio, carro.
    2) DUSTCART  {Hdwd} dustcart {PoS} n {NPoS} N {Trans} carro della nettezza urbana or delle immondizie.
    3) FLOAT  gen {Trans} galleggiante {m}; {SI} cork {Trans} sughero; {SI} in procession {Trans} carro; {SI} sum of money {Trans} somma.
    4) HEARSE  {Hdwd} hearse {PoS} n {NPoS} N {Trans} carro funebre.
    5) HORSEBOX  {Hdwd} horsebox {PoS} n {NPoS} N {Trans} carro or furgone {m} per il trasporto dei cavalli.
    6) TANK  {Ex} fuel =**= {ExTr} serbatoio del carburante. {SenNo} (b) {SubCd} Mil {Trans} carro armato.
    7) TRUCK  {PoS} n {NPoS} N {SenNo}


    Figure 3 gives just one example of how the LDB can be used to acquire information on a given item that is scattered throughout the dictionary and otherwise inaccessible. The Italian lemma CARRO, shown in Figure 2, was searched throughout the LDB which has been constructed for the Collins Concise English/Italian dictionary; the figure displays all those entries in which CARRO was found in the Translation field.

    A procedure has also been implemented to permit semi-automatic mapping between monolingual and bilingual LDBs in the workstation. This procedure provides the bilingual lexicographer with a useful tool that permits him to examine and compare the lexical information given for the same item in different source dictionaries (for a full description see Marinai et al., forthcoming).

    2.4. Bilingual Text Retrieval

    The most recent component to be added to the Lexicographic Workstation is a system for the automatic construction and retrieval of parallel contexts from bilingual text archives. The importance of large language reference corpora in monolingual lexicography is widely acknowledged and it has also been asserted that such corpora are useful for the bilingual lexicographer (see, for example, Atkins 1990). However, interest is now growing in the potential of bilingual corpora as valuable sources of documented evidence on the relationships between two languages. We feel that ideally the bilingual lexicographer should have access to both sources. The monolingual corpus will be used during the analysis stage, helping to achieve an accurate first breakdown of a given headword into senses and to retrieve valid examples of usage and collocations; the bilingual corpus will be employed in transfer, helping to find appropriate real world translations for each usage of the L1 word suggested by the source language compiler, reflecting parallelisms or differences in sense divisions. For example, a particular sense of a word in L1, established on the basis of corpus evidence, may well have no single equivalent or set of equivalents in L2 to cover the full scope of L1. The bilingual corpus can be used to study carefully how each use is rendered in order to group the TL equivalents. It may well provide evidence which suggests an adjustment of an L1 sense division to meet the demands of L2.

    Of course, this implies the availability of a high quality, sufficiently representative bilingual corpus. However, the construction of a text corpus implies a considerable investment of time and resources. Before any decisions are taken, the criteria to be adopted when assembling corpus material must be carefully evaluated. In particular, each type of text sample must be labelled, as source and target texts reveal different uses of language; it is claimed that a translation is never a true representation of the language in which it is written but rather reflects the relationships between the language of the target and that of the source. At Pisa, we have assembled a sample set of bilingual texts, selected to cover a number of different language varieties, ranging from scientific papers to poetry, from university text books to magazine articles. This set of texts was collected in the first place to provide a test-bed for our bilingual retrieval system but should also supply useful data to assist us in the definition of design criteria, which can then be used in a subsequent extension of these archives.

    So far, most of the bilingual concordancing tools implemented for the lexicographer use statistically based programs to align the texts at the sentence level. But, as stated by Church and Gale (1991), such sentence based concordance programs are not very good at showing what is not already known, as the user is required to supply the program with both a SL word and a TL candidate translation. These authors also describe a word-based concordance tool in which the possible translations for a given word are discovered from the corpus by the program, using a precomputed index indicating which words in one language correspond to which words in the other. We have adopted a different approach which depends on the use of external evidence (extracted from a computerized bilingual dictionary) to create direct links between parallel texts on the basis of translation equivalents. In this way, we exploit already known information (dictionary translations) to access "unknown" information (real world TL renderings).

    A preliminary version of our system, known as DBT-Synchro, is described in Marinai et al. (1991). It operates in two distinct steps. In the first, sets of bilingual texts are "synchronized" using morphological procedures and a bilingual electronic dictionary (based on the Collins LDB). Each word form in the text taken as the source is input to the morphological analyzer for L1 in order to identify its base lemma, which is then searched in the bilingual LDB. All translations read for this lemma are input to the morphological generator for L2, which produces all possible forms, and these are then searched over the relevant search zone in the target text. When one of the translation equivalent forms is found in the L2 text, a link is created between this form and its equivalent in the L1 text. These links are then stored with the texts in the bilingual archives to be used by the query system for the on-line construction of parallel contexts. "Wrong" links between falsely recognized translation equivalents, which disturb context calculation, are identified and eliminated by the query system, which then recalculates the parallel contexts on the basis of those links recognised as valid.
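The synchronization step described above (analyse the L1 form, look up the translations of its lemma, generate their L2 forms, search them in a zone of the target text, create a link) can be sketched in a few lines. The paper does not define how the "relevant search zone" is computed; the proportional projection with a fixed window used here is purely an assumption, as are the function name `synchronize`, its parameters and the toy lookup tables. The elimination of "wrong" links is not covered.

```python
from typing import Callable, Dict, List, Set, Tuple


def synchronize(src: List[str], tgt: List[str],
                analyse: Callable[[str], List[str]],
                translations: Dict[str, List[str]],
                generate: Callable[[str], List[str]],
                window: int = 5) -> List[Tuple[int, int]]:
    """Return links (source position, target position) between translation equivalents."""
    links: List[Tuple[int, int]] = []
    for i, form in enumerate(src):
        # Assumed search zone: position i projected proportionally onto
        # the target text, plus or minus `window` tokens.
        centre = round(i * (len(tgt) - 1) / max(len(src) - 1, 1))
        lo, hi = max(0, centre - window), min(len(tgt), centre + window + 1)
        # All L2 forms of all dictionary translations of the L1 lemma(s).
        candidates: Set[str] = set()
        for lemma in analyse(form):
            for equivalent in translations.get(lemma, ()):
                candidates.update(generate(equivalent))
        hits = [j for j in range(lo, hi) if tgt[j] in candidates]
        if hits:
            # Keep the hit nearest the centre of the zone (a simplification).
            links.append((i, min(hits, key=lambda j: abs(j - centre))))
    return links


# Toy stand-ins for the L1 analyzer, the bilingual LDB and the L2 generator.
ANALYSES = {"leaves": ["leaf", "leave"], "change": ["change"], "colour": ["colour"]}
TRANSLATIONS = {"leaf": ["foglia"], "change": ["cambiare", "mutare"], "colour": ["colore"]}
FORMS = {"foglia": ["foglia", "foglie"], "cambiare": ["cambiare", "cambiano"],
         "mutare": ["mutare", "mutano"], "colore": ["colore", "colori"]}

links = synchronize(
    "the leaves change colour".split(),
    "le foglie cambiano colore".split(),
    analyse=lambda form: ANALYSES.get(form, []),
    translations=TRANSLATIONS,
    generate=lambda lemma: FORMS.get(lemma, [lemma]),
    window=1,
)
```

Function words such as "the" receive no link in this toy run because the stand-in analyzer returns no lemma for them; only dictionary-attested equivalents found inside the zone are linked.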

    D.B.T. (Picchi)    Testini di prova della sincronizzazione    {E}CHANGE & {E}COLOUR

    1 {E} September days continue, followed by those splendid October days marked by golden sunset and skies which change colour from green to gold as in the Venetian paintings of Cima da Conegliano and Titian.    E-impressi.63
      {I} le giornate di sole della fine di settembre, e le splendide ottobrate dai tramonti dorati, dai cieli che trascolorano dal verde, all'oro, come quelli che si trovano nella pittura veneta da Cima di    I-impressi.65

    2 {E} the melancholy signs that the season is hurrying to an end. The corn begins to change colour in the fields. The leaves are partly green and partly the dry dirty gray colour    E-impressi.134
      {I} suoi segnali melanconici di stagione che corre verso la fine. Le piante del granoturco cominciano a cambiare colore nel campo. Le foglie sono mezzo verdi e mezzo secche, di un grigio sporco in    I-impressi.131

    3 {E} that winter is now just around the corner. The leaves on all the trees have started to change colour. Everything begins to turn yellow or red. Though it seems like a time of celebration    E-impressi.167
      {I} e che l'inverno è ormai dietro la porta. Le foglie di tutti gli alberi hanno cominciato a mutare colore. Tutto inizia a ingiallire o a rosseggiare, Pare un momento di festa e di trionfo    I-impressi.161

    Scelta N.Contesto>    F1 for help

    Figure 4: Querying the Bilingual Text Corpus


    The archives are considered to be symmetric; either of the two languages can be selected as L1. The lexicographer can either search for single word forms or, using the morphological generator, for all the forms of a given lemma. For each L1 word or combination of words queried by the user, the parallel L2 contexts are constructed and displayed on the screen. The word(s) for which the contexts are being created are highlighted and, where a direct link for the L1 form(s) being searched exists, the L2 matched word(s) will be highlighted in the same colour. Otherwise, the two directly linked forms which are closest to the point calculated as the middle of the L2 context will be highlighted in a different colour, as indicators of the position in the L2 context of the translation equivalent being searched. Other words which have been linked in the paired contexts can optionally be highlighted. Bilingual concordances of interest can be printed out or saved in a separate file for future reference. Figure 4 gives an extract of the results obtained from our test set of texts for a query requesting parallel contexts for the cooccurrence of CHANGE and COLOUR.
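When no direct link exists for the queried form, the choice of the two linked forms nearest the middle of the L2 context could be made along the following lines. How the middle is actually calculated is not stated in the paper; this sketch simply assumes the arithmetic midpoint of the context's token offsets, and the function name `indicator_links` is hypothetical.

```python
from typing import List


def indicator_links(linked_positions: List[int],
                    context_start: int, context_end: int,
                    k: int = 2) -> List[int]:
    """Pick the k linked L2 positions closest to the middle of the context.

    linked_positions: token offsets in the L2 text that carry a link
    context_start/context_end: inclusive token offsets of the L2 context
    """
    middle = (context_start + context_end) / 2  # assumed definition of "middle"
    inside = [p for p in linked_positions if context_start <= p <= context_end]
    # Sort by distance from the middle; ties go to the earlier position.
    nearest = sorted(inside, key=lambda p: (abs(p - middle), p))[:k]
    return sorted(nearest)


# Linked L2 tokens at offsets 9, 12, 15 and 19; the context spans offsets 8 to 20.
markers = indicator_links([9, 12, 15, 19], 8, 20)
```

If fewer than two linked forms fall inside the context, the sketch returns whatever is available rather than searching outside it.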

    3. Final Remarks

    We have given a brief description of the main computational tools which have been implemented in the Pisa Lexicographic Workstation in order to facilitate the task of the bilingual lexicographer. The entire system is implemented on personal computers running the MS-DOS operating system and is intended to run on a Local Area Network so that a team of lexicographers can work in unison, using the same tools and accessing the same reference data. At the same time, the procedures are easily transportable onto smaller desk-top systems for the lexicographer working at home. The system is menu-driven and context sensitive Helps are accessible at all times during a query session. Our main consideration has been to provide tools which are not only efficient but also user-friendly.

    Endnotes

    1 ACQUILEX is an ESPRIT Basic Research Action which is developing techniques and methodologies for utilising both monolingual and bilingual machine-readable dictionary sources to construct lexical components for NLP systems.

    2 The study made for ACQUILEX has also been used by the Text Encoding Initiative group studying dictionary representation as part of the general TEI programme to provide guidelines and standards for the representation and exchange of texts in machine-readable form using the SGML mark-up language (see Ide et al. 1991). We will consider incorporating the final recommendations of this group into our current data model.

    3 For a description of the two-level model versus generative phonology, see Antworth (1990, Introduction).


    Bibliography

    ANTWORTH, E.L. (1990): PC-KIMMO: A Two-level Processor for Morphological Analysis. Occasional Publications in Academic Computing, No. 16. Summer Institute of Linguistics, Dallas.

    ATKINS, B. (1990): "Corpus Lexicography: The Bilingual Dimension". In: Computational Lexicology and Lexicography. Special Issue dedicated to Bernard Quemada. I. Ed. by L. Cignoni and C. Peters. Linguistica Computazionale, Vol. VI.

    CALZOLARI, N., PETERS, C., ROVENTINI, A. (1990): Computational Model of the Dictionary Entry: Preliminary Report. ACQUILEX, Esprit BRA 3030. Six Month Deliverable. ILC-ACQ-1-90, Pisa.

    CHURCH, K., GALE, W. (1991): "Concordances for Parallel Text". In: Using Corpora. Proceedings of the 7th Annual Conference of the Centre for the New Oxford English Dictionary and Text Research, Oxford, UK.

    IDE, N., VERONIS, J., WARWICK-ARMSTRONG, S., CALZOLARI, N. (1991): Principles for Encoding Machine Readable Dictionaries, TEIWP, A15W6.

    MARINAI, E., PETERS, C., PICCHI, E. (1990): The Pisa Multilingual Lexical Database System. Esprit BRA 3030. Twelve Month Deliverable. ILC-ACQ-2-90, Pisa.

    MARINAI, E., PETERS, C., PICCHI, E.: "A prototype system for the semi-automatic sense linking and merging of mono- and bilingual LDBs". In: Research in Humanities Computing. Ed. by N. Ide and S. Hockey. OUP, Oxford (forthcoming).

    MARINAI, E., PETERS, C., PICCHI, E. (1991): "Bilingual Reference Corpora: A System for Parallel Text Retrieval". In: Using Corpora. Proceedings of the 7th Annual Conference of the Centre for the New Oxford English Dictionary and Text Research, Oxford, UK.

    PICCHI, E. (1991): "D.B.T.: A Textual Data Base System". In: Computational Lexicology and Lexicography. Special Issue dedicated to Bernard Quemada. II. Ed. by L. Cignoni and C. Peters. Linguistica Computazionale, Vol. VII.

    PICCHI, E., CALZOLARI, N. (1990): "Pisa Linguistic Database". In: Literary and Linguistic Computing 1988, Proceedings of ALLC Fifteenth International Conference. Ed. by Y. Choueka. Champion-Slatkine, Paris.

