MULTILINGUAL CORPORA InterCorp. 90’s, north of Europe (Norway, Sweden) “In the last 10-15 years...

Date post: 17-Jan-2016
Upload: doreen-walton
Page 1: MULTILINGUAL CORPORA InterCorp. 90’s, north of Europe (Norway, Sweden) “In the last 10-15 years or so there has been a great deal of interest in the development.

MULTILINGUAL CORPORA

InterCorp


90’s, north of Europe (Norway, Sweden)

“In the last 10-15 years or so there has been a great deal of interest in the development and use of multilingual or parallel corpora. To begin with, we can define such corpora provisionally as collections of texts in two or more languages which are parallel in some way, either by being in a translation relationship or by being comparable in other respects, such as genre, time of publication, intended readership, and so on.” (Johansson 2007, 51)

Multilingual corpora: When and where?


Types of multilingual corpora

A. Source texts in one language and their translation to other languages - translation corpus according to Aijmer and Granger


B. Pairs or groups of monolingual corpora designed using the same “sampling frame”, e.g. the Lancaster Corpus of Mandarin Chinese (same sampling frame as LOB)

-parallel corpus according to Aijmer and Granger

-comparable according to McEnery and Wilson

The term parallel corpus is sometimes used for both A and B (Johansson, Barlow)


C. A combination of A and B, e.g. the English-Norwegian Parallel Corpus (ENPC)

The original texts are comparable (genre, number of words).

The translations go in both directions: a bidirectional translation corpus


Hasselgård's presentation at UCCTS 2010


“Because permission could not be obtained to distribute the English–Norwegian Parallel Corpus to any interested researcher, it is only available to researchers able to travel to the University of Oslo, where it was created and is now housed.” (Mayer)

https://www.hf.uio.no/ilos/english/services/omc/enpc/


Advantages

They give insights into the languages compared – insights that are likely to be unnoticed in studies of monolingual corpora. (Aijmer & Altenberg 1996: 12)

They can be used for a range of comparative purposes and increase our understanding of language-specific, typological and cultural differences, as well as of universal features. (Aijmer & Altenberg 1996: 12)

CONTRASTIVE LINGUISTICS

Contrastive analysis is the systematic comparison of two or more languages, with the aim of describing their similarities and differences. (Johansson 2007: 1)

“Vilém Mathesius, founder of the Linguistic Circle of Prague, spoke about analytical comparison, or linguistic characterology, as a way of determining the characteristics of languages and gaining a deeper insight into their specific features (Mathesius 1975).”


“Any cross-linguistic comparison presupposes that the compared items are in some sense similar or comparable. That is, to be able to say that certain categories in two languages are similar or different it is necessary that they have some common ground, or tertium comparationis. For lexis it is obvious that the compared items should express ‘the same thing’, i.e. have the same (or at least similar) meaning and pragmatic function (see James 1980: 90f.)” (Granger and Altenberg 2002, 15)

“In contrastive linguistics, the ‘assumption of comparability’ for specific pairs of categories is reflected in, and supported by, linguistic output” (frequent if languages are socio-culturally related), namely “(high quality) translations and parallel corpora based on such translations”, and “non-target-like language of second language learners (learner corpora)” (Gast 2012)

Gast, Volker. Contrastive Analysis: Theories and Methods
http://www.personal.uni-jena.de/~mu65qev/papdf/contr_ling_meth.pdf

Hasselgård, Hilde on Parallel corpora and contrastive studies (= keynote lecture delivered at the UCCTS 2010 conference)
http://www.lancaster.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/papers/Hasselgard.pdf


“in monolingual corpora we can easily study forms and formal patterns, but meanings are less accessible. One of the most fascinating aspects of multilingual corpora is that they can make meanings visible through translation” (Stig Johansson 2007: 57)


“Ambiguity and vagueness are revealed through translation patterns.” (Johansson 2007, 57)

What is the meaning of "apparent"?


“‘Mutual correspondence’ (MC) is a simple statistical measure of the frequency with which a pair of items from two languages are translated into each other in a bi-directional translation corpus (see Altenberg 1999).”

takže = (and) so?
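
The mutual correspondence measure quoted above can be tried out with a short calculation. The sketch below assumes the formula usually attributed to Altenberg (1999), MC = (At + Bt) × 100 / (As + Bs), where As and Bs are the frequencies of the two items in the source texts and At and Bt are the numbers of times each item is translated by the other. The function name and the counts are invented for illustration; they are not results from any corpus.

```python
def mutual_correspondence(a_as_b: int, b_as_a: int, a_source: int, b_source: int) -> float:
    """Return mutual correspondence as a percentage (0-100).

    a_as_b   -- times item A (language 1 originals) is translated by item B
    b_as_a   -- times item B (language 2 originals) is translated by item A
    a_source -- total occurrences of A in language 1 originals
    b_source -- total occurrences of B in language 2 originals
    """
    total_source = a_source + b_source
    if total_source == 0:
        raise ValueError("at least one item must occur in the source texts")
    return (a_as_b + b_as_a) * 100 / total_source


# Invented counts, for illustration only.
mc = mutual_correspondence(a_as_b=30, b_as_a=50, a_source=100, b_source=100)
print(f"MC = {mc:.1f} %")
```

A value of 100 % would mean the two items are always translated by each other; a value near 0 % would mean they hardly ever are.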


They can be used for a number of practical applications, e.g. in lexicography, language teaching, and translation. (Aijmer & Altenberg 1996: 12)

They illuminate differences between source texts and translations, and between native and non-native texts. (Aijmer & Altenberg 1996: 12)

S-universals: universal differences between translations and their source texts; claims about the way in which translators process the source text.

T-universals: differences between translations and non-translated texts written in the target language. T-universals are the modern equivalent to the criticisms of unnaturalness, of translationese, made in the pejorative approach.

(Chesterman 2003)


n-gram | corpus | abs. freq. (non-translations) | rel. freq. (non-translations) | abs. freq. (translations) | rel. freq. (translations)
je mi líto | Jerome | 109 | 4.1 | 214 | 8.0
ani v nejmenším | Jerome | 113 | 4.3 | 324 | 12.2
co do činění s | Jerome | 11 | 0.4 | 92 | 3.5

CHLUMSKÁ, L. – RICHTEROVÁ, O. (2014): Jak zkoumat překladovou češtinu. Výzkum simplifikace na korpusu Jerome. Korpus, gramatika, axiologie, 9, pp. 16–29.
https://ucnk.ff.cuni.cz/jerome.php
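
To see how strongly each n-gram in the table is over-represented in translated Czech, the relative frequencies can be turned into a simple ratio. This is only a sketch: it assumes both relative-frequency columns are normalised in the same way (the slide does not state the unit), and the variable names are made up here.

```python
# Relative frequencies copied from the table: (non-translations, translations).
rel_freq = {
    "je mi líto": (4.1, 8.0),
    "ani v nejmenším": (4.3, 12.2),
    "co do činění s": (0.4, 3.5),
}

# Ratio > 1 means the n-gram is relatively more frequent in translations.
ratios = {ngram: trans / non_trans for ngram, (non_trans, trans) in rel_freq.items()}

for ngram, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{ngram}: {ratio:.1f} times as frequent in translations")
```

A ratio alone says nothing about statistical significance; it is just a quick way to rank the n-grams.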


Limitations

Restricted to texts / text types that have been translated; scarcity of spoken data

Usually small: the size of the corpus restricts studies of less frequent lexical/grammatical constructions

Faulty and less successful translations

Usually word-class tagged, but not parsed (syntactically annotated), i.e. it is not possible to search for grammatical constructions, patterns of word order etc.

As with corpus linguistics in general:

Tagging errors

You can only search for something that is explicit in the text


Ways round the limitations

Identify typical (and searchable!) expressions of a grammatical construction, use a combination of word class tagging, and wildcards (CQL query)

In any case – a lot of work involved in tidying up the search results (precision).

Possibility of searching with regular expressions (next week)

Errors/idiosyncrasies in the translation: Weed out? Ignore translations that occur only once, or in only one text?

Supplement results of parallel corpus study with (larger) monolingual corpora.

Supplement corpus study with e.g. experimental data.
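
The idea of ignoring translations that occur only once, or in only one text, can be sketched as a small filter over exported concordance hits. Everything here is an assumption made for illustration: the (translation, text id) input format, the function name, the thresholds, and the invented hits.

```python
from collections import Counter, defaultdict


def filter_translations(hits, min_count=2, min_texts=2):
    """Keep translations that occur at least min_count times
    and in at least min_texts different texts.

    hits -- iterable of (translation, text_id) pairs
    """
    counts = Counter()
    texts = defaultdict(set)
    for translation, text_id in hits:
        counts[translation] += 1
        texts[translation].add(text_id)
    return {
        translation: count
        for translation, count in counts.items()
        if count >= min_count and len(texts[translation]) >= min_texts
    }


# Invented hits for one Czech item; not real corpus data.
hits = [
    ("so", "text1"), ("so", "text2"), ("so", "text2"),
    ("and so", "text1"), ("and so", "text1"),
    ("thus", "text3"),
]
print(filter_translations(hits))
```

Here "and so" is dropped because it comes from a single text and "thus" because it occurs only once. Whether such items are noise or genuine rare equivalents is still a judgment call.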


Parallel corpora for English and Czech? No comparable corpus of English and Czech exists. None of the corpora of the Czech National Corpus is similar in structure to the British National Corpus.

No equivalent to ENPC

Project KAČENKA at FF MU Brno http://www.phil.muni.cz/angl/kacenka2/

Parallel translation corpus: InterCorp


Structure of InterCorp http://ucnk.ff.cuni.cz/intercorp/

developed at the Institute of the Czech National Corpus, Philosophical Faculty, Charles University Prague

But our department participates.

A parallel corpus of Czech and 38 languages, with Czech as the pivot language (in the sense that every foreign-language text is ALIGNED with Czech, irrespective of which language is the source language)


Concordances (concordance lines, KWICs) in a parallel corpus


What kinds of texts does it contain?


CORE


COLLECTIONS

Project Syndicate
http://www.project-syndicate.org/


http://www.presseurop.eu/cs, http://www.voxeurop.eu/en

Acquis – legal documents of the EU

http://europa.eu/legislation_summaries/glossary/community_acquis_en.htm

Europarl
http://www.statmt.org/europarl/

Presseurop, Europarl, Project Syndicate: what is the source text?

“Until 2003, the texts were translated directly from the source languages into any of the target languages. From 2003 onwards, English has been used as a pivot language (Cartoni & Meyer 2012: 3), i.e., all languages were first translated into English and then into the relevant target language” (Gast 2014)


http://www.opensubtitles.org/cs


Searching InterCorp data
http://ucnk.ff.cuni.cz/intercorp/?lang=en

“Most likely, Park and NoSketch Engine will be discontinued already at the end of March 2015. We would like to use this opportunity to invite all users of InterCorp to migrate to KonText”


Query type


METADATA (data about data)


Limit your search for "so" to SUBTITLES where English is the SOURCE language


Search for "kind of" in original English texts of fiction


CREATE A SUBCORPUS USING METADATA


Search for "little girl", "honey", "wow" in English original texts of fiction and their Czech translations

Create a subcorpus of subtitles translated from English to Czech. How big is it?


What usually precedes "kind of" in your subcorpus of subtitles (English originals)?


Are there any Czech films? Create a subcorpus of Czech subtitles in original Czech films that allows searching in Czech. How big is it?

Find the particle prej/prý in it (as a WORD FORM)


What is the English equivalent of the adjective hloupý found in original Czech subtitles?


Find English translations of the particle prý used in original texts of fiction


In which text is prý used most frequently?

Question for thought: is this Kundera's idiolect?


Find prý in translations of English texts of fiction


You can also create a subcorpus of original Czech texts of fiction.


How about comparing English with Spanish?

How many original works of fiction are translated into Spanish? Which ones?


Representation of translations in the Czech National Corpus


In SYN2010, create a subcorpus of
A) Czech original texts
B) Czech translated from English

Hint: English as source language is marked as ENG, Czech as source language is not marked

Search for tokens of šálek čaje in A and B


Regular expression (from Glossary of Corpus Linguistics)

“A type of string that may include special characters that mean the regular expression as a whole will match with more than one string. For instance, the full stop . as a special character in a regular expression can represent any single letter. If so, the regular expression b.d would match with the strings bad, bbd, bcd, bdd, bed etc.”

“Regular expressions, sometimes known as patterns, are often used when searching a corpus. It is often easier to define a regular expression that matches the set of words in a search, than to search for each word individually and combine the results.”
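
The b.d example from the glossary can be checked outside the corpus interface. The sketch below uses Python's re module as a stand-in and assumes, as in a corpus word query, that the pattern has to match the whole token (hence fullmatch). The candidate list is invented; "b1d" is included to show that the dot matches any single character, not only a letter.

```python
import re

pattern = re.compile(r"b.d")

# Invented candidate tokens.
candidates = ["bad", "bbd", "bed", "bd", "bird", "b1d"]

# fullmatch: the pattern must cover the whole token, not just part of it.
matches = [token for token in candidates if pattern.fullmatch(token)]
print(matches)
```

"bd" fails because the dot needs exactly one character, and "bird" fails because it has two characters between b and d.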


. = any single character (p.s = pes, pás, pas)

* = any number of repetitions of the preceding character, the same as {0,} (ps*t = pt, pst, psst, pssst etc.)

+ = one or more repetitions of the preceding character, the same as {1,} (ps+t = pst, psst, pssst etc.)
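
The difference between * and + is easy to verify with the ps*t and ps+t examples. This is a minimal sketch in Python's re module, again assuming whole-token matching as in a word query; the helper name full_matches and the word lists are made up for the illustration.

```python
import re


def full_matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [word for word in words if re.fullmatch(pattern, word)]


words = ["pt", "pst", "psst", "pssst", "post"]

star = full_matches(r"ps*t", words)   # zero or more s
plus = full_matches(r"ps+t", words)   # at least one s
dot = full_matches(r"p.s", ["pes", "pas", "ps", "pass"])  # exactly one character

print(star)
print(plus)
print(dot)
```

Only "pt" separates the two quantifiers: it is accepted by ps*t but rejected by ps+t.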

Try it. Make a WORD QUERY for all words starting in L and ending in A


WHICH WORD FORMS?

Try a LEMMA search for m.*o as a noun


? = zero or one occurrence of the preceding character/expression, the same as {0,1}

interval {n,k} = n to k repetitions of the preceding character or expression; if k is omitted ({n,}), at least n repetitions; the form {n} means exactly n repetitions

[] = choice from a list (e.g. [Pp]řeklad, disku[sz]e)

| = choice between options (e.g. diskuse|diskuze); also an alternative, but between whole strings rather than single characters

() = any part of the expression can be grouped in parentheses to control the order in which it is evaluated

^ = as the first character inside [], excludes the characters listed in the brackets
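
Each of the operators above can be tried on a handful of strings. The sketch below uses Python's re module and whole-token matching as an approximation of a corpus word query; the helper name is made up, and the test words (including the non-word "diskuce" and the pattern with babička) are invented for illustration.

```python
import re


def full_matches(pattern: str, words: list[str]) -> list[str]:
    """Return the words that the pattern matches in their entirety."""
    return [word for word in words if re.fullmatch(pattern, word)]


spellings = ["diskuse", "diskuze", "diskuce"]

print(full_matches(r"colou?r", ["color", "colour", "colouur"]))       # ?  optional u
print(full_matches(r"ps{2,3}t", ["pst", "psst", "pssst", "psssst"]))  # {n,k}  2 to 3 times s
print(full_matches(r"disku[sz]e", spellings))                         # []  s or z
print(full_matches(r"diskuse|diskuze", spellings))                    # |  whole alternatives
print(full_matches(r"(pra)*babička", ["babička", "prababička", "praprababička"]))  # ()  group repeated
print(full_matches(r"disku[^s]e", spellings))                         # [^ ]  anything but s
```

Note that disku[sz]e and diskuse|diskuze select the same two spellings here, while disku[^s]e also lets the invented "diskuce" through.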


Look up diskuse and diskuze
- in the subtitles of Czech original films
- in the original works of Czech fiction

Check whether what you got is what you searched for


Find all tokens of ratata, ratatata... in all Czech texts in Intercorp (originals, translations from all languages). Where are they most frequent?


Pro rata temporis - http://en.wikipedia.org/wiki/Pro_rata

Exclude Acquis and take a look at the subtitles


Try j?est?l?i, again in the whole of Intercorp. What are you going to find?


kampak / kdepak / jestlipak / kdypak /kdopak/propak

[A-ž]+pak


GET RID OF OPAK
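
Why [A-ž]+pak also returns opak, and two ways of getting rid of it, can be sketched in Python. The word list is invented. The negative lookahead is a feature of Python's re module; whether the corpus tool's regular-expression dialect accepts it is not confirmed here, so the plain filtering step is shown as well.

```python
import re

# Invented word list; "pak" itself is not matched because + needs at least one letter before it.
words = ["kampak", "kdepak", "jestlipak", "kdypak", "kdopak", "copak", "opak", "naopak", "pak"]

suffix = re.compile(r"[A-ž]+pak")
hits = [word for word in words if suffix.fullmatch(word)]

# Option 1: remove the unwanted word in a separate step.
filtered = [word for word in hits if word != "opak"]

# Option 2: refuse tokens that are exactly "opak" with a negative lookahead.
no_opak = re.compile(r"(?!opak$)[A-ž]+pak")
lookahead_hits = [word for word in words if no_opak.fullmatch(word)]

print(hits)
print(filtered)
print(lookahead_hits)
```

Both options still keep "naopak", which is a reminder that the results need manual checking even after the obvious noise has been removed.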


Make a list of all prefixed verbs with the root lézt.

Find all tokens of pst (all possible numbers of the letter s)

Go back to English fiction originally written in English.
- find both spellings of the word color/colour
- find both spellings of the past tense of the verb travel
- find both spellings of the word gray/grey


What is the most frequent personal pronoun in the English subtitles (originals)?

1. English texts are annotated with TreeTagger
http://courses.washington.edu/hypertxt/csar-v02/penntable.html

2. There is no TAG query! You have to use a Corpus Query Language (CQL) query


English subtitles (English originals)


Corpus query language

Think of each text position (the thing between spaces) as

[attribute="value"]

attribute can be:

◦ word
◦ lemma
◦ tag

If a default attribute is set, you do not have to use the brackets


Quite followed by an adjective in English originals (fiction/subtitles)

[word="quite"] [tag="JJ"]

But since WORD is the default attribute, this is the same as "quite" [tag="JJ"]
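
What a query such as [word="quite"] [tag="JJ"] does can be imitated on a tiny hand-tagged sentence. This is only a toy model, not how the corpus manager is implemented: tokens are dictionaries of attributes, each query position is a dictionary of attribute-to-regex conditions, and every condition must match the whole attribute value. The function name, the sentence and its tags are made up for the illustration.

```python
import re


def find_matches(tokens, pattern):
    """Return the start indices where consecutive tokens satisfy the pattern.

    tokens  -- list of dicts, e.g. {"word": "quite", "lemma": "quite", "tag": "RB"}
    pattern -- list of dicts, one per text position, mapping attribute -> regex
    """
    starts = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(
            re.fullmatch(regex, token[attribute])
            for token, conditions in zip(window, pattern)
            for attribute, regex in conditions.items()
        ):
            starts.append(i)
    return starts


# Hand-tagged toy sentence (tags chosen for illustration).
tokens = [
    {"word": "It", "lemma": "it", "tag": "PP"},
    {"word": "was", "lemma": "be", "tag": "VBD"},
    {"word": "quite", "lemma": "quite", "tag": "RB"},
    {"word": "good", "lemma": "good", "tag": "JJ"},
]

# Counterpart of [word="quite"] [tag="JJ"]
print(find_matches(tokens, [{"word": "quite"}, {"tag": "JJ"}]))
```

Putting two conditions into one position, e.g. {"lemma": "be", "tag": "VB.*"}, corresponds to joining two attributes with & inside a single pair of square brackets.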


Translation of the copula feel followed by an adjective

[lemma="feel"][tag="JJ"]

or "feel"[tag="JJ"] (implicit attribute = lemma)

How about seem, smell, etc…


Find in English original texts tokens of the have construction analogous to

Danny, soft and drowsy, had him thinking of things other than showing her what a lazy employer could be like

[lemma="have"][tag="PP"][tag="VBG"]


Now think of an NP instead of a pronoun

[lemma="have"][]{0,2}[tag="N.*"][tag="VBG"]

Which state verbs are used most often in the progressive (be + VBG)?

[lemma="be"] [tag="VBG"]


Combination of two attributes in one text position: &

Which nouns end in –ie in subtitles where English is the source language:


What is the most frequent personal pronoun in original CZECH subtitles?

http://ucnk.ff.cuni.cz/doc/popis_znacek.pdf

Klikátko http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=cs

[tag="PP..1--..-..--.*"]= [tag="PP..1.*"]


Czech tags are incorporated into KonText...


Practise:

noun in the gen. sg.

adj. in the comparative degree, instrumental plural

punctuation

imperative of an imperfective verb, plural

Search for the adverb pěkně followed by an adjective in Czech translations of English subtitles


pěkně studená sprcha

Pivo?


Search for adjectives beginning in hyper in the Czech translations of English subtitles


How to search in texts of fiction from a certain date onwards? Insert "within"

Only gives you texts from 1950, and all types of text


https://podpora.korpus.cz/boards/8/topics/228

[word="query"] within <div pubyear>="1990" & group="Core" />

E.g. quite after 1990


YOU CAN ADD MORE RESTRICTIONS, e.g. source language srclang="en"

[word="quite"] within <div pubyear>="1990" & group="Core" & srclang="en" />
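
When several such restrictions have to be combined, it can be less error-prone to assemble the query string piece by piece. The helper below is a hypothetical convenience function, not part of KonText; it only reproduces the query text shown above, and it reads pubyear>="1990" as a "greater than or equal to" comparison, which is an assumption based on the "after 1990" example.

```python
def within_clause(structure, conditions):
    """Build a 'within <structure ... />' restriction.

    conditions -- list of (attribute, operator, value) triples
    """
    parts = [f'{attribute}{operator}"{value}"' for attribute, operator, value in conditions]
    return f"within <{structure} " + " & ".join(parts) + " />"


restrictions = [
    ("pubyear", ">=", "1990"),  # assumed reading: published in 1990 or later
    ("group", "=", "Core"),
    ("srclang", "=", "en"),
]

query = '[word="quite"] ' + within_clause("div", restrictions)
print(query)
```

The printed string can then be pasted into the CQL query field; the attribute names and values still have to match the corpus metadata exactly.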

