NooJ international Conference, Komotini, May 2010. Portability of Armenian Corpus by NooJ. Anaid Donabedian & Victoria Khurshudian, Institut National des Langues et Civilisations Orientales (INALCO), Paris

NooJ international Conference, Komotini, May 2010

Portability of Armenian Corpus by NooJ

Anaid Donabedian & Victoria Khurshudian

Institut National des Langues et Civilisations Orientales (INALCO), Paris

Armenian: preliminaries

an Indo-European language

right-branching

of an accusative type

typically with an SOV structure and

predominantly with an agglutinative morphology

Historical Armenia

Republic of Armenia

Periodization

prealphabetical

alphabetical (405 A.D. to the present)

1. Old Armenian or Grabar (5th–11th centuries);

2. Middle Armenian (12th–16th centuries);

3. Modern Armenian (17th century to the present)

Western (based on the Constantinople dialect): dialects…

Eastern (based on the Ararat dialect): dialects…

Objective

Provide data compatibility and portability between NooJ and the Eastern Armenian National Corpus (EANC) platform

What is the Eastern Armenian National Corpus?

www.eanc.net

Corpus Technologies

Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov

Source texts

PARSER

Annotated texts

Annotation algorithm

Grammatical dictionary
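
The elements on this slide suggest a parser that combines source texts with a grammatical dictionary and an annotation algorithm to produce annotated texts. A minimal Python sketch of that kind of dictionary lookup follows. The dictionary entries, the tag names, and the `annotate` function are assumptions made for illustration and do not reflect the actual EANC parser or its data.

```python
import re

# Hypothetical toy grammatical dictionary: wordform -> list of (lemma, tags).
# Transliterated forms and tag names are illustrative, not EANC's real data.
GRAMMATICAL_DICTIONARY = {
    "tun": [("tun", "N,sg,nom")],
    "tner": [("tun", "N,pl,nom")],
    "gnum": [("gnal", "V,cvb,ipfv")],
}


def annotate(text):
    """Return one (token, analyses) pair per word; unknown words get []."""
    tokens = re.findall(r"\w+", text.lower())
    return [(tok, GRAMMATICAL_DICTIONARY.get(tok, [])) for tok in tokens]


annotated = annotate("Tner gnum xyz")
for token, analyses in annotated:
    print(token, analyses)
```

Storing a list of analyses per wordform leaves room for homonymous forms, which would simply carry more than one (lemma, tags) pair.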

EANC History

Moscow, Russia

March 2006: Project Launch

July 2007: 1st Release

May 2008: 2nd Release

March 2009: 3rd Release

Eastern Armenian National Corpus (EANC) is:

about 110 million tokens

morphological and other markup

English translations for frequent tokens

covers Standard Eastern Armenian (SEA) from the mid-19th century to the present

both written and oral discourse

full-text view for over 100 Armenian classic titles

open internet access

Written Discourse

over 106 mln. tokens

510 authors (1841-2009)

1039 fiction texts (including 206 translated texts)

7858 press issues

non-fiction (scientific and other) texts

Oral Discourse (3.5 mln. tokens)

Spontaneous discourse

Polylogues

Task-oriented discourse

TV-show transcripts

Movies …

☼ The entire EANC oral corpus has been recorded and transcribed by the project.

EANC Functionality

Search Functionality

Token queries

Context queries

Subcorpus selection

Search Functionality

Simple token queries:

• lexeme search

• wordform search

• gram search

• translation search

• lexeme + gram search

Search Functionality

Advanced options for token queries:

case-sensitivity

punctuation marks

position in the sentence

wildcard (*)

logical functions (e.g. ‘or’: |)

negated features

inclusion/exclusion of grammatical/lexical homonymy
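
Several of the token query options on this slide (case-sensitive wordform matching, the `*` wildcard, the `|` operator, and negated features) can be illustrated with a short Python sketch. The token records, field names, tag names, and the `matches` function are hypothetical and are not the EANC query engine.

```python
from fnmatch import fnmatchcase

# Hypothetical annotated tokens; field names and tags are illustrative only.
TOKENS = [
    {"wordform": "tun", "lexeme": "tun", "grams": {"N", "sg", "nom"}},
    {"wordform": "tner", "lexeme": "tun", "grams": {"N", "pl", "nom"}},
    {"wordform": "tan", "lexeme": "tun", "grams": {"N", "sg", "dat"}},
    {"wordform": "gnum", "lexeme": "gnal", "grams": {"V", "cvb"}},
]


def matches(token, wordform="*", required=(), excluded=()):
    """Case-sensitive match: '|' separates alternatives, '*' is a wildcard.

    `required` grams must all be present; `excluded` grams must all be absent.
    """
    patterns = wordform.split("|")
    if not any(fnmatchcase(token["wordform"], p) for p in patterns):
        return False
    grams = token["grams"]
    return set(required) <= grams and not (set(excluded) & grams)


# Wordforms starting with "t" or equal to "gnum", tagged N, but not plural.
hits = [
    t["wordform"]
    for t in TOKENS
    if matches(t, "t*|gnum", required={"N"}, excluded={"pl"})
]
print(hits)
```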

Search Functionality

Subcorpus selection by:

time

author(s) / title(s)

genres

types of texts (translated vs. original)

any combination of the above
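
Subcorpus selection by combined criteria amounts to filtering text metadata with several conditions at once. The sketch below assumes a hypothetical `TextMeta` record and placeholder titles and authors. It shows the idea only and makes no claim about how EANC stores or filters its metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    title: str
    author: str
    year: int
    genre: str
    translated: bool


# Hypothetical metadata records (placeholder titles and authors).
TEXTS = [
    TextMeta("Text A", "Author 1", 1858, "fiction", False),
    TextMeta("Text B", "Author 2", 1923, "press", False),
    TextMeta("Text C", "Author 1", 1960, "fiction", True),
]


def select_subcorpus(texts, years=None, authors=None, genres=None, translated=None):
    """Keep texts satisfying every filter that is not None (filters are ANDed)."""
    selected = []
    for t in texts:
        if years is not None and not (years[0] <= t.year <= years[1]):
            continue
        if authors is not None and t.author not in authors:
            continue
        if genres is not None and t.genre not in genres:
            continue
        if translated is not None and t.translated != translated:
            continue
        selected.append(t)
    return selected


# Original (non-translated) fiction written between 1841 and 1950.
subcorpus = select_subcorpus(
    TEXTS, years=(1841, 1950), genres={"fiction"}, translated=False
)
print([t.title for t in subcorpus])
```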

Search Functionality

Display options

context expanding

‘sort by’ (time, lexeme, wordform etc.)

Latin transliteration

glossed display

KWIC (keyword in context)
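
A KWIC display shows each hit centred between a fixed window of left and right context. A minimal Python sketch, using placeholder tokens and a hypothetical `kwic` function (not the EANC implementation):

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Placeholder token stream; "key" stands in for the searched wordform.
tokens = "a b key c d e key f".split()
concordance = kwic(tokens, "key", width=2)
for left, kw, right in concordance:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>10}  {kw}  {right}")
```

Expanding the context, as listed among the display options, corresponds here to calling `kwic` with a larger `width`.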

Transliterated samples:

Glossed samples:

KWIC samples:

Main Current Tasks:

Make the NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure

Make the EANC and NooJ Western Armenian platforms mutually portable

Achieve full mutual coverage of NooJ and EANC capabilities (e.g. the syntactic annotation available in NooJ)
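
Making the two annotation schemes compatible implies some mapping between dictionary entry formats. The sketch below parses a NooJ-style entry of the form `lemma,CATEGORY+Property+Key=Value` and renames its codes through a lookup table. The sample entry, the paradigm name `TUN`, the `TAG_MAP` table, and the target field names are all assumptions; the real EANC dictionary structure is not described in these slides, and NooJ's escaping rules for special characters are ignored here.

```python
# Hypothetical mapping from NooJ-style codes to target tag names (assumed).
TAG_MAP = {"N": "noun", "V": "verb", "Hum": "human"}


def parse_nooj_entry(line):
    """Parse 'lemma,CAT+Prop+Key=Value' into a plain dict (no escape handling)."""
    lemma, _, info = line.partition(",")
    category, *props = info.split("+")
    features, attributes = [], {}
    for prop in props:
        key, sep, value = prop.partition("=")
        if sep:
            attributes[key] = value   # e.g. FLX=<paradigm name>
        else:
            features.append(key)      # bare property such as Hum
    return {
        "lemma": lemma,
        "category": category,
        "features": features,
        "attributes": attributes,
    }


def to_target(entry):
    """Rename category and features via TAG_MAP; unmapped codes pass through."""
    return {
        "lexeme": entry["lemma"],
        "pos": TAG_MAP.get(entry["category"], entry["category"]),
        "grams": [TAG_MAP.get(f, f) for f in entry["features"]],
        "paradigm": entry["attributes"].get("FLX"),
    }


entry = parse_nooj_entry("tun,N+FLX=TUN+Conc")
record = to_target(entry)
print(record)
```

Passing unmapped codes through unchanged makes gaps in the mapping table visible in the output instead of silently dropping information.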
