Russian National Corpus

ruscorpora.ru

Ekaterina Rakhilina, Vladimir Plungian, Olga Lyashevskaya, Dmitry Sichinava

RNC Workshop, SCLC 2014

16 Feb 2014 Harvard University

Preliminary plan

• Russian National Corpus, Season 2014:

• hints and tricks

• new features and plans

• Corpus data for offline research

• Discussion

Your input is much appreciated!

Main participants

• V.V. Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow

• Yandex, Internet and technology company

Ilya Segalovich (1964-2013), chief technical officer of Yandex

RNC non-commercial partnership

• universities (Moscow, Saint Petersburg, Saratov, etc.)

• research institutes (IPPI RAN, ILI RAN)

• IT companies

• personal membership

You are welcome to share your corpus data through RNC!

New goals: Licensing issues and data distribution.

Corpora

Statistics & offline data

RusGram

Dictionaries

RUSCORPORA.RU family

• The main corpus of written Modern Russian (1700-present, 230 MW, i.e. million words)

• Newspapers & news (2000-present, 174 MW)

• Corpus of Russian poetry (10 MW)

• Spoken corpus (11 MW)

• Multimedia corpus (4 MW)

• Accentuated corpus (14 MW)

• Parallel corpora (54 MW)

• Syntactic treebank (0.7 MW)

• Corpus of Russian dialects

• Russian-for-Schools corpus

RUSCORPORA.RU - new corpora

• Diachronic corpora:

• Old Russian

• Church Slavonic

• Middle Russian

• Blogger corpus

• Learner corpora

The full body of search results is freely available online

... also in KWIC format

Sorting results

Saving results in Excel format

Customizing subcorpus

The main corpus:

Modern fiction of various genres

Modern drama

Memoirs and biographies

Journalism and literary criticism

Scientific, popular scientific and teaching texts

Religious and philosophical texts

Technical texts

Business and jurisprudence texts

Day-to-day life texts, including texts not intended for publication (letters, diaries, etc.)

Hints & tricks

• sorting: надо же было ...

• раз...ся (рас...ся)

• Мама мыла раму.

• hypocoristic personal names not ending in *чка, *нька

• use word-formation

• вс- prefix

• also with possible alternations

• also in second position

Recent news from the RNC

• Poetry: up to 1990-2002

• MURCO: Multimedia corpus (movies, talks, etc.)

types of speech situations (welcome, questioning, interview, dispute, quarrel, etc.)

gestures + gestures provided by speech

+ academic talks & discussions

+ Parallel Spoken Russian:

Gogol's Revizor on many stages (MultiParC)

• Diachronic evidence (Russian in the 12th-17th centuries)

• Parallel corpora

Corpus of Russian poetry

RUSCORPORA.RU - new corpora

• Diachronic corpora

• Old Russian & birch-bark letters

• Church Slavonic

• Middle Russian

• Slavic parallel corpora

• Blogger corpus

• Learner corpora

Old-Russian


RNC annotation: the main corpus

Four major annotation layers:

• meta-textual annotation: register/genre, author, creation date, size, etc.

• word-level morphosyntactic annotation: lemma, POS, inflectional categories, distorted or anomalous forms, etc.

• accentual annotation: normative place of accent, accentual shifts in fixed expressions

• lexico-semantic annotation: lexical classes of verbs, nouns, pronouns, adjectives and adverbs

+ new! word-formation annotation: prefixes, suffixes, roots

N-gram viewer

http://ruscorpora.ru/ngram.html

• word forms: Графики (Charts)

• cf. Google Books Ngram Viewer

• + wildcards: *сторонился

• year span by date of creation, not date of publication (cf. Google Books)

• smoothing (3 to 20 is recommended)

• lemmas, not words: Распределение по годам (Distribution by year; output page)

• Статистика по метаатрибутам (Statistics by meta-attributes)

Графики (Charts): сторонился, посторонился, *сторонился

Year: 1800-2010

Smoothing: 10
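The smoothing setting can be pictured as a centred moving average over yearly frequencies. The sketch below assumes that a smoothing value of n averages each year with up to n years on either side, which is the convention of the Google Books Ngram Viewer; whether the RNC viewer computes it in exactly this way is an assumption. The function name `smooth` and the sample figures are hypothetical.

```python
def smooth(freqs, n):
    """Centred moving average: each value is averaged with up to n neighbours on each side."""
    out = []
    for i in range(len(freqs)):
        # The window shrinks at the edges of the series instead of padding with zeros.
        window = freqs[max(0, i - n): i + n + 1]
        out.append(sum(window) / len(window))
    return out


# Hypothetical per-year frequencies (not real RNC data).
raw = [0.0, 3.0, 0.0, 3.0, 0.0]
smoothed = smooth(raw, 1)
print(smoothed)
```

A larger n flattens year-to-year noise at the cost of blurring short-lived peaks, which is one reason to compare several settings before reading a trend off the chart.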

Annotation mistakes and how to fix them

• Please tag mistakes if you come across them in the output data

Even more Russian corpora in cooperation with the RNC

• "Simple" Russian (HSE in Nizhny Novgorod)

• "we cannot ask 5-year-old children to read examples from the corpus" (NB students!)

• a subcorpus of short simple sentences, frequent words from the "lexical minimum"

• "Non-perfect" Russian

• Heritage language in Finland and USA (study of language interference)

• Russian as L2 in Daghestan and other parts of Russia

• Learner corpus of academic writing


Corpus of Academic Writing (Корпус академического письма)

http://web-corpora.net/RussianAcademCorpus/search/

Essays, drafts of term papers, and other academic texts written by students

• sociology, economics, politics, law, psychology, linguistics, management, etc.

• 1 MW available so far

Corpus of academic writing

• Three levels of mistake annotation

1) linguistic type (orthography, punctuation, lexical choice, grammatical choice & form, discourse-oriented)

2) weight (minor mistake, medium level, major/critical mistake)

3) interpretation: what is the cause? (misprint, wrong synonym, mix of constructions, etc.)
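The three annotation levels could be modelled as one record per marked span. This is only a sketch of such a record: the class and field names are hypothetical, and the values given to the sample annotation are chosen for illustration rather than taken from the corpus.

```python
from dataclasses import dataclass
from enum import Enum


class LinguisticType(Enum):
    # Level 1: the linguistic types listed on the slide.
    ORTHOGRAPHY = "orthography"
    PUNCTUATION = "punctuation"
    LEXICAL_CHOICE = "lexical choice"
    GRAMMAR = "grammatical choice & form"
    DISCOURSE = "discourse-oriented"


class Weight(Enum):
    # Level 2: severity of the mistake.
    MINOR = 1
    MEDIUM = 2
    MAJOR = 3


@dataclass
class MistakeAnnotation:
    span: str                        # the annotated stretch of text
    linguistic_type: LinguisticType
    weight: Weight
    cause: str                       # level 3: free-text interpretation


# Illustrative values only (hypothetical, not an actual corpus annotation).
ann = MistakeAnnotation(
    span="менее компактнее",
    linguistic_type=LinguisticType.GRAMMAR,
    weight=Weight.MEDIUM,
    cause="mix of constructions",
)
print(ann.linguistic_type.value, ann.weight.name, ann.cause)
```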

Heritage language

http://web-corpora.net/RussianLearnerCorpus/search

• National Heritage Language Resource Center (UCLA)

• Polinsky Lab at Harvard

• O. Kisselev, A. Alsufieva, I. Dubinina et al.

• E. Rakhilina and her research lab at HSE

Russian learner corpus

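Some examples

• Эти ноутбуки потребляли меньше энергии, но были менее компактнее по объему. ['These laptops consumed less energy but were less more-compact in volume.']

• И прибыль от разрушения гораздо более заметна и быстра, нежели чем от строительства. ['And the profit from destruction is far more noticeable and quick than from construction.']

• В русском языке семантический диапазон данного слова чрезвычайно широк, нежели в английском ['In Russian the semantic range of this word is extremely wide than in English']

(Academic Writing Corpus)

• В России человек больше (! чаще) считается расистом из за действий ['In Russia a person is more (! more often) considered a racist because of actions']

(Heritage Corpus)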


RusGram

Corpus-based Russian reference grammar

• traditional academic grammar (академическая грамматика)

• morphology (inflection)

• syntax

• + RNC-based statistics

• + lexical anchors in focus

• substandard Russian: negative evidence or "points of future development"?

rusgram.ru

Corpus-based dictionaries

• http://dict.ruslang.ru/

Frequency dictionary of Modern Russian

offline version available from my homepage

New grammatical dictionary

Russian idiomaticity in real usage (with frequencies):

Which adjectival intensifier can we use with nouns?

Which verb can we use with abstract nouns?

Framebank (the dictionary of argument-predicate constructions attested in the RNC)

offline release summer 2014

Corpus-based dictionaries

In progress: Grammatical forms of Russian lexemes

• Paradigms of verbs, nouns, adjectives

• Distribution by time & text registers

• Lexical classes: comparative study


Statistics & offline use

Overall idea: to show patterns in your output

• statistics

• visualization

But: the RNC corpus workbench is not adapted to work with a customized set of data

Step 1: N-grams

N-grams search (beta!)

2-, 3-, 4-, 5-word chains

• не до *

• потрясающе (*о | *е)

Most frequent N-grams: ЧАСТОТЫ (Frequencies)

In progress: search by lemma, morphology, semantics, word formation

In progress: explore time & text registers

+ in any subcorpus of your choice

In progress: search with distance between words (incl. repetitions)
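For offline work on n-gram lists, queries of the kind shown above (a fixed word, a `*` wildcard, or alternatives such as `*о | *е`) can be imitated with glob matching per position. The sketch below is not the RNC search engine: the helper names are hypothetical, the token list is a toy example, and it relies only on Python's standard `fnmatch` module.

```python
from collections import Counter
from fnmatch import fnmatchcase


def ngrams(tokens, n):
    """All contiguous n-word chains in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matches(gram, slots):
    """slots holds one list of alternative glob patterns per position."""
    return len(gram) == len(slots) and all(
        any(fnmatchcase(tok, pat) for pat in alts)
        for tok, alts in zip(gram, slots)
    )


# Toy token sequence (hypothetical, not corpus data).
tokens = "мне сейчас не до шуток и не до смеха".split()

# Query "не до *": two fixed words followed by any word.
query = [["не"], ["до"], ["*"]]
counts = Counter(g for g in ngrams(tokens, 3) if matches(g, query))
print(counts)

# Query "потрясающе (*о | *е)": the second word ends in -о or -е.
print(matches(("потрясающе", "красиво"), [["потрясающе"], ["*о", "*е"]]))
```

Keeping the alternatives as a list per slot means a query like `(*о | *е)` needs no special syntax beyond ordinary glob patterns.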

Offline data for advanced users & computational resources

NB! We are linguists, not lawyers: we cannot distribute texts

But: we can share annotations & statistics on this data

So far:

• ЧАСТОТЫ (Frequencies): 2-, 3-, 4-, 5-grams, http://ruscorpora.ru/corpora-freq.html

• 1 MW morphological standard (manually disambiguated, shuffled sentences)

Plans:

• N-grams for other corpora + annotated data

• POS annotations etc., e.g. V-S-S-CONJ-ADJ-S

studiorum.ruscorpora.ru

A companion web site to the RNC

• Corpus methods in linguistic research

• Corpus in teaching Russian as a second language

• Corpus in teaching linguistics, Russian stylistics, philology and social sciences

• Corpus in teaching Russian in school

• References (incl. PhD manuscripts and term papers)

• Corpus resources

• F.A.Q.

Discussion

Any questions?

comments?

complaints?

What would you like to see in the corpus?


Known issues

1. A bag of words

• Lemma: дуло 'muzzle'

• Gram: V

2. *базар* (разбазарить, разбазаривать, пробазарить, базарчик, Базаров)

• NB word-formation: just words in the dictionary

3. Search across sentence boundaries

4. Unbalanced portions of data across time

• который

• и, в, на, они

• не

Solution: TBA soon (annotated n-grams database search)

Thank you!

Спасибо!

http://ruscorpora.ru


Appendix: RNC annotation layers

• meta-text info

• morphology

• lexico-semantic classes


Subcorpora and meta-textual parameters

Morphological parsing

Zaliznjak's (1967, 1977) formal model of Russian inflection

A set of parsers based on the Grammatical dictionary

MYSTEM (Segalovich 2003) and DIALING (Sokirko 2004) morphological parsers in use

Lemma, POS and grammatical features:

Examples: взял ‘take.PAST’

<ana lex="взять" gr="act,indic,m,pf,praet,sg,tran,V"/>

жалеючи ‘pity.GER’

<ana lex="жалеть" gr="act,anom,ger,ipf,praes,tran,V"/>

Hypotheses for words not in the dictionary: "Рогочим"
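Annotations in this shape are easy to unpack for offline processing. The following is a minimal sketch, assuming each analysis is a self-closing `ana` element with `lex` and `gr` attributes delimited by straight double quotes, as in the examples above; the function name and the output layout are hypothetical.

```python
import re

# Assumed shape: <ana lex="LEMMA" gr="tag1,tag2,..."/>
ANA_RE = re.compile(r'<ana\s+lex="([^"]*)"\s+gr="([^"]*)"\s*/>')


def parse_ana(markup):
    """Return one dict per analysis: the lemma plus its list of grammatical tags."""
    return [
        {"lemma": lex, "tags": gr.split(",")}
        for lex, gr in ANA_RE.findall(markup)
    ]


sample = '<ana lex="взять" gr="act,indic,m,pf,praet,sg,tran,V"/>'
parsed = parse_ana(sample)
print(parsed[0]["lemma"], parsed[0]["tags"])
```

A word form with several competing analyses would simply yield several dicts, which is the situation the disambiguation step has to resolve.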

Morphosyntactic features

Morphosyntactic annotation & disambiguation

• A 6-million-word corpus of manually disambiguated texts

• Other texts are not disambiguated

• Automatic disambiguation techniques are being applied (trained on the manually disambiguated corpus and evaluated against it)

[Figure: the manually disambiguated corpus (6 million) and the non-disambiguated corpus]

Lexical taxonomy

• The traditions of Moscow lexical semantics (Apresjan 1974/1992, Mel’chuk 1996, Paducheva 1974, etc.)

• Dictionaries of lexical classes (Kuznecova 1982, Babenko 2001, Shvedova 2004, 2007)

• DB LEXICOGRAPHER: verbs and nouns (Kustova & Paducheva 1994, Kustova 2004, Paducheva 2004, Rakhilina 2000)

Main principles:

• Coarse-grained classification

• Well-known classes traditionally used in linguistic research

• The classification aims to explore the semantically motivated peculiarities of Russian grammar

• and allows various constructions to be identified in the text

RNC semantic database

Includes 6 independent classifications (some of them hierarchical):

• Category (prime lexical divisions that determine main semantic features: concrete, abstract, proper nouns; qualitative, relative, possessive adjectives);

• Taxonomy (e.g. luk ‘bow’: «weapon», radost’ ‘joy’: «emotion», bystryj ‘quick’: «speed», staryj ‘old’: «age»);

• Mereology (e.g. rukav ‘sleeve’: «parts of clothes», buket ‘bunch’: «sets and aggregates», kaplja ‘drop’: «quanta and portions of stuff»);

• Topology (e.g. kastrjulja ‘pot’: «container», nos ‘nose’: «juts», zmeja ‘snake’: «ropes»);

• Evaluation (e.g. blagouxanije ‘fragrance’: «positive», presmykat’sja ‘lick the boots’: «negative»);

• Derivational classes (e.g. knizhechka ‘little book’: «diminutives», sosnovyj ‘pine (adj.)’: «adjectives derived from nouns»).

Separate entry for each meaning of the word:

Pojas ‘belt’, ‘waist’, ‘zone’

1. Category: non-predicate ‘belt of a dress’

Taxonomy: accessory

Mereology: part(cloth)

Topology: stripe

2. Category: non-predicate ‘to bow from a waist’

Mereology: bodypart(human/animal)

3. Category: non-predicate ‘time zone’

Taxonomy: space

Topology: stripe
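An entry of this kind can be pictured as a list of senses, each carrying its own values for the independent classifications. The sketch below encodes the three senses exactly as listed above; the dictionary layout and the helper `senses_with` are hypothetical and do not reflect the actual storage format of the RNC semantic database.

```python
# One dict per sense; a classification that is not listed for a sense is simply absent.
pojas = [
    {"gloss": "belt of a dress", "category": "non-predicate",
     "taxonomy": "accessory", "mereology": "part(cloth)", "topology": "stripe"},
    {"gloss": "to bow from a waist", "category": "non-predicate",
     "mereology": "bodypart(human/animal)"},
    {"gloss": "time zone", "category": "non-predicate",
     "taxonomy": "space", "topology": "stripe"},
]


def senses_with(entries, **tags):
    """Glosses of the senses whose tags match every requested classification value."""
    return [
        e["gloss"] for e in entries
        if all(e.get(key) == value for key, value in tags.items())
    ]


print(senses_with(pojas, topology="stripe"))
print(senses_with(pojas, taxonomy="space", topology="stripe"))
```

Because a single tag can match more than one sense, a tag-based search over such entries still returns ambiguous hits until the senses are disambiguated in context.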

Lexico-semantic annotation

• All content words (nouns, verbs, adjectives, adverbs, pronouns, numerals) are automatically assigned semantic tagsets

• Currently more than 350 000 entries (ca. 100 000 lemmas) in the database

• Large-scale, word-by-word annotation

• Disambiguation is still needed