Date post: | 17-Jan-2016 |
Upload: | doreen-walton |
MULTILINGUAL CORPORA
InterCorp
Multilingual corpora: When and where?
1990s, northern Europe (Norway, Sweden)
“In the last 10-15 years or so there has been a great deal of interest in the development and use of multilingual or parallel corpora. To begin with, we can define such corpora provisionally as collections of texts in two or more languages which are parallel in some way, either by being in a translation relationship or by being comparable in other respects, such as genre, time of publication, intended readership, and so on.” (Johansson 2007, 51)
Types of multilingual corpora
A. Source texts in one language and their translations into other languages
- translation corpus according to Aijmer and Granger
B. Pairs or groups of monolingual corpora designed using the same “sampling frame”, e.g. the Lancaster Corpus of Mandarin Chinese (same sampling frame as LOB)
- parallel corpus according to Aijmer and Granger
- comparable corpus according to McEnery and Wilson
The term parallel corpus is sometimes used for both A and B (Johansson, Barlow)
C. A combination of A and B, e.g. the English-Norwegian Parallel Corpus (ENPC)
The original texts are comparable (genre, number of words).
The translations go in both directions: a bidirectional translation corpus.
Hasselgård's presentation at UCCTS 2010
“Because permission could not be obtained to distribute the English–Norwegian Parallel Corpus to any interested researcher, it is only available to researchers able to travel to the University of Oslo, where it was created and is now housed.” (Mayer)
https://www.hf.uio.no/ilos/english/services/omc/enpc/
Advantages
They give insights into the languages compared – insights that are likely to be unnoticed in studies of monolingual corpora. (Aijmer & Altenberg 1996: 12)
They can be used for a range of comparative purposes and increase our understanding of language-specific, typological and cultural differences, as well as of universal features. (Aijmer & Altenberg 1996: 12)
CONTRASTIVE LINGUISTICS
Contrastive analysis is the systematic comparison of two or more languages, with the aim of describing their similarities and differences. (Johansson 2007: 1)
“Vilém Mathesius, founder of the Linguistic Circle of Prague, spoke about analytical comparison, or linguistic characterology, as a way of determining the characteristics of languages and gaining a deeper insight into their specific features (Mathesius 1975).”
“Any cross-linguistic comparison presupposes that the compared items are in some sense similar or comparable. That is, to be able to say that certain categories in two languages are similar or different it is necessary that they have some common ground, or tertium comparationis. For lexis it is obvious that the compared items should express ‘the same thing’, i.e. have the same (or at least similar) meaning and pragmatic function (see James 1980: 90f.)” (Granger and Altenberg 2002, 15)
“In contrastive linguistics, the ‘assumption of comparability’ for specific pairs of categories is reflected in, and supported by, linguistic output” (frequent if languages are socio-culturally related), namely “(high quality) translations and parallel corpora based on such translations”, and “non-target-like language of second language learners (learner corpora)” (Gast 2012)
Gast, Volker. Contrastive Analysis: Theories and Methods. http://www.personal.uni-jena.de/~mu65qev/papdf/contr_ling_meth.pdf
Hasselgård, Hilde, on parallel corpora and contrastive studies: http://www.lancaster.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/papers/Hasselgard.pdf (keynote lecture delivered at the UCCTS 2010 conference)
“in monolingual corpora we can easily study forms and formal patterns, but meanings are less accessible. One of the most fascinating aspects of multilingual corpora is that they can make meanings visible through translation” (Stig Johansson 2007: 57)
“Ambiguity and vagueness are revealed through translation patterns.” (Johansson 2007, 57)
What is the meaning of apparent?
“‘Mutual correspondence’ (MC) is a simple statistical measure of the frequency with which a pair of items from two languages are translated into each other in a bi-directional translation corpus (see Altenberg 1999).”
takže = (and) so?
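The MC measure quoted above is usually given (following Altenberg 1999) as MC = (At + Bt) × 100 / (As + Bs), where At and Bt are the number of times items A and B are translated by each other, and As and Bs are their total occurrences in the source texts. A minimal Python sketch of that arithmetic follows; the function name and the counts for takže / so are invented for illustration and are not corpus results.

```python
def mutual_correspondence(a_translated, b_translated, a_source, b_source):
    """Return mutual correspondence (MC) as a percentage between 0 and 100.

    a_translated: times item A (e.g. Czech "takže") is translated by item B
    b_translated: times item B (e.g. English "so") is translated by item A
    a_source, b_source: total occurrences of A and B in the source texts
    """
    total_source = a_source + b_source
    if total_source == 0:
        raise ValueError("neither item occurs in the source texts")
    return (a_translated + b_translated) * 100 / total_source


# Hypothetical counts, for illustration only
mc = mutual_correspondence(a_translated=30, b_translated=50,
                           a_source=100, b_source=100)
print(f"MC = {mc:.1f} %")
```

An MC of 100 % would mean the two items are always translated by each other; the lower the value, the weaker the correspondence.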
They can be used for a number of practical applications, e.g. in lexicography, language teaching, and translation. (Aijmer & Altenberg 1996: 12)
They illuminate differences between source texts and translations, and between native and non-native texts. (Aijmer & Altenberg 1996: 12)
S-universals - universal differences between translations and their source texts; claims about the way in which translators process the source text.
T-universals - differences between translations and non-translated texts written in the target language. T-universals are the modern equivalent to the criticisms of unnaturalness, of translationese, made in the pejorative approach.
(CHESTERMAN, A. 2003)
n-gram | corpus | abs. freq. (non-translations) | rel. freq. (non-translations) | abs. freq. (translations) | rel. freq. (translations)
je mi líto | Jerome | 109 | 4.1 | 214 | 8.0
ani v nejmenším | Jerome | 113 | 4.3 | 324 | 12.2
co do činění s | Jerome | 11 | 0.4 | 92 | 3.5
CHLUMSKÁ, L. – RICHTEROVÁ, O. (2014): Jak zkoumat překladovou češtinu. Výzkum simplifikace na korpusu Jerome. Korpus, gramatika, axiologie, 9, pp. 16–29. https://ucnk.ff.cuni.cz/jerome.php
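To see how much more frequent each n-gram is in translations than in non-translations, the relative frequencies listed above can be turned into a simple ratio. The sketch below uses only the figures from the table; the variable names are mine, and the ratio is just one possible way of summarising the difference.

```python
# (relative frequency in non-translations, relative frequency in translations)
rel_freq = {
    "je mi líto": (4.1, 8.0),
    "ani v nejmenším": (4.3, 12.2),
    "co do činění s": (0.4, 3.5),
}

# Ratio > 1 means the n-gram is relatively more frequent in translated Czech
ratios = {
    ngram: round(translated / non_translated, 2)
    for ngram, (non_translated, translated) in rel_freq.items()
}

for ngram, ratio in ratios.items():
    print(f"{ngram}: {ratio} times as frequent in translations")
```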
Limitations
Restricted to texts / text types that have been translated; scarcity of spoken data
Usually small: the size of the corpus restricts studies of less frequent lexical/grammatical constructions
Faulty and less successful translations
Usually word-class tagged, but not parsed (syntactically annotated), i.e. it is not possible to search directly for grammatical constructions, patterns of word order, etc.
As with corpus linguistics in general:
Tagging errors
You can only search for something that is explicit in the text
Ways round the limitations
Identify typical (and searchable!) expressions of a grammatical construction, use a combination of word class tagging, and wildcards (CQL query)
In any case – a lot of work involved in tidying up the search results (precision).
Possibility of searching with regular expressions (next week)
Errors/idiosyncrasies in the translation: Weed out? Ignore translations that occur only once, or in only one text?
Supplement results of parallel corpus study with (larger) monolingual corpora.
Supplement corpus study with e.g. experimental data.
Parallel corpora for English and Czech?
No comparable corpus of English and Czech exists. None of the corpora in the Czech National Corpus is similar in structure to the British National Corpus.
No equivalent to ENPC
Project KAČENKA at FF MU Brno http://www.phil.muni.cz/angl/kacenka2/
Parallel translation corpus: InterCorp
Structure of InterCorp http://ucnk.ff.cuni.cz/intercorp/
Developed at the Institute of the Czech National Corpus, Philosophical Faculty, Charles University Prague; our department also participates.
A parallel corpus of Czech and 38 other languages, with Czech as the pivot language (in the sense that every foreign language text is ALIGNED with Czech, irrespective of which language is the source language)
Concordances (concordance lines, KWICs) in a parallel corpus
What kinds of texts does it contain?
CORE
COLLECTIONS
Project Syndicate http://www.project-syndicate.org/
http://www.presseurop.eu/cs, http://www.voxeurop.eu/en
Acquis – legal documents of the EU
http://europa.eu/legislation_summaries/glossary/community_acquis_en.htm
Europarl http://www.statmt.org/europarl/
Presseurop, Europarl, Project Syndicate – what is the source text?
“Until 2003, the texts were translated directly from the source languages into any of the target languages. From 2003 onwards, English has been used as a pivot language (Cartoni & Meyer 2012: 3), i.e., all languages were first translated into English and then into the relevant target language” (Gast 2014)
http://www.opensubtitles.org/cs
Searching InterCorp data http://ucnk.ff.cuni.cz/intercorp/?lang=en
“Most likely, Park and NoSketch Engine will be discontinued already at the end of March 2015. We would like to use this opportunity to invite all users of InterCorp to migrate to KonText”
Query type
METADATA (data about data)
Limit your search for so to SUBTITLES where English is the SOURCE language
Search for kind of in original English texts of fiction
CREATE A SUBCORPUS USING METADATA
Search for little girl, honey, wow in English original texts of fiction and their Czech translation
Create a subcorpus of subtitles translated from English to Czech. How big is it?
What usually precedes kind of in your subcorpus of subtitles (Engl. originals)?
Are there any Czech films? Create a subcorpus of Czech subtitles in original Czech films that allows searching in Czech. How big is it?
Find the particle prej/prý in it (as a WORD FORM)
What is the English equivalent of the adjective hloupý found in original Czech subtitles?
Find English translations of the particle prý used in original texts of fiction
In which text is prý used most frequently?
Question for thought: is this Kundera's idiolect?
Find prý in translations of English texts of fiction
You can also create a subcorpus of original Czech texts of fiction.
How about comparing English with Spanish?
How many original works of fiction are translated into Spanish? Which ones?
The share of translations in the Czech National Corpus
In SYN2010, create a subcorpus of
A) CZECH ORIGINAL TEXTS
B) CZECH translated from English
Hint: English as source language is marked as ENG, Czech as source language is not marked
Search for tokens of šálek čaje in A and B
Regular expression (from Glossary of Corpus Linguistics)
“A type of string that may include special characters that mean the regular expression as a whole will match with more than one string. For instance, the full stop . as a special character in a regular expression can represent any single letter. If so, the regular expression b.d would match with the strings bad, bbd, bcd, bdd, bed etc.”
“Regular expressions, sometimes known as patterns, are often used when searching a corpus. It is often easier to define a regular expression that matches the set of words in a search, than to search for each word individually and combine the results.”
. = any single character (p.s = pes, pás, pas)
* = any number of repetitions of the preceding character, the same as {0,} (ps*t = pt, pst, psst, pssst, etc.)
+ = any number of repetitions of the preceding character greater than 0, the same as {1,} (ps+t = pst, psst, pssst, etc.)
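The three operators above can be tried outside the corpus interface as well. Below is a minimal sketch using Python's standard `re` module; the word list is made up, and `re.fullmatch` is used on the assumption that a corpus query has to match the whole word form rather than a part of it.

```python
import re

# A small invented word list to test the patterns against
words = ["pes", "pás", "pas", "psi", "pt", "pst", "psst", "pssst", "past"]


def whole_word_matches(pattern, tokens):
    """Return the tokens that the pattern matches from first to last character."""
    rx = re.compile(pattern)
    return [token for token in tokens if rx.fullmatch(token)]


dot = whole_word_matches(r"p.s", words)    # . = exactly one arbitrary character
star = whole_word_matches(r"ps*t", words)  # * = zero or more "s"
plus = whole_word_matches(r"ps+t", words)  # + = one or more "s"

print(dot)
print(star)
print(plus)
```

Note that `ps*t` also accepts pt (no s at all), whereas `ps+t` does not.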
Try it. Make a WORD QUERY for all words starting in L and ending in A
WHICH WORD FORMS?
Try a LEMMA search for m.*o as a noun
? = zero or one occurrence of the preceding character/expression, the same as {0,1}
interval {n,k} = n to k repetitions of the preceding character or expression; if k is omitted, it means at least n repetitions; in the form {n}, it means exactly n repetitions
[] = a choice from a list of characters (e.g. [Pp]řeklad, disku[sz]e)
| = a choice between alternatives (e.g. diskuse|diskuze); also an alternative, though not between single characters but between whole strings
() = any part of the expression can be grouped in round brackets to control the order in which it is evaluated
^ = as the first character inside square brackets: any character except those listed in the brackets
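The remaining operators can be checked the same way. This is a small sketch with Python's `re` module and an invented token list (diskuce is included only as a deliberate non-match); whole-token matching is again assumed.

```python
import re

tokens = ["Překlad", "překlad", "diskuse", "diskuze", "diskuce",
          "pt", "pst", "psssst", "kdo", "kdy"]


def hits(pattern):
    """Return the tokens matched in full by the pattern."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


char_class = hits(r"[Pp]řeklad")        # [] = one character from the list
either_s_z = hits(r"disku[sz]e")        # s or z in the sixth position
alternation = hits(r"diskuse|diskuze")  # | = one whole string or the other
grouped = hits(r"disku(s|z)e")          # () limits the scope of |
optional = hits(r"ps?t")                # ? = zero or one "s"
interval = hits(r"ps{1,3}t")            # {1,3} = one to three "s"
negated = hits(r"kd[^o]")               # [^o] = any character except "o"

for name, result in [("char_class", char_class), ("alternation", alternation),
                     ("optional", optional), ("interval", interval),
                     ("negated", negated)]:
    print(name, result)
```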
Look up diskuse and diskuze
- in the subtitles of Czech original films
- in the original works of Czech fiction
Check whether what you got is what you searched for
Find all tokens of ratata, ratatata... in all Czech texts in Intercorp (originals, translations from all languages). Where are they most frequent?
Pro rata temporis - http://en.wikipedia.org/wiki/Pro_rata
Exclude Acquis and take a look at the subtitles
Try j?est?l?i, again in the whole of Intercorp. What are you going to find?
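Before running the query it may help to list every string the pattern can match. Since j, t and l are each optional, there are 2 × 2 × 2 = 8 candidates. A short sketch in Python (only the pattern comes from the slide; the rest is illustrative):

```python
import re
from itertools import product

pattern = r"j?est?l?i"

# Build every combination of present/absent j, t and l
forms = sorted(
    f"{j}es{t}{l}i"
    for j, t, l in product(["j", ""], ["t", ""], ["l", ""])
)

# Every generated form is matched in full by the pattern
all_match = all(re.fullmatch(pattern, form) for form in forms)

print(len(forms), forms)
```

Which of these forms are actually attested, and in which text types, is what the corpus search has to show.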
kampak / kdepak / jestlipak / kdypak /kdopak/propak
[A-ž]+pak
GET RID OF OPAK
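One way to get rid of opak is to add a negative condition on the same position. In CQL this could be written as [word="[A-ž]+pak" & word!="opak"], which assumes the interface accepts the != operator. The Python sketch below shows the same two-step logic on an invented token list.

```python
import re

tokens = ["kampak", "kdepak", "jestlipak", "opak", "Copak", "pak", "naopak", "pakt"]

# [A-ž] is a code-point range from "A" to "ž"; it covers Czech letters,
# but also a few non-letter characters that lie between "Z" and "a"
pak_forms = [t for t in tokens if re.fullmatch(r"[A-ž]+pak", t)]

# Second condition: the word form must not be exactly "opak"
without_opak = [t for t in pak_forms if t != "opak"]

print(pak_forms)
print(without_opak)
```

Note that naopak survives this filter; excluding it as well would need a further condition.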
Make a list of all prefixed verbs with the root lézt.
Find all tokens of pst (all possible numbers of the letter s)
Go back to English fiction originally written in English.
- find both spellings of the word color/colour
- find both spellings of the past tense of the verb travel
- find both spellings of the word gray/grey
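One possible set of patterns for the three spelling pairs, checked with Python's `re` module. These are suggestions, not the only solution; each uses an operator introduced earlier (? for an optional letter, [] for a choice of letters).

```python
import re

# pattern -> the spellings it is meant to cover
spelling_patterns = {
    r"colou?r": ["color", "colour"],
    r"travell?ed": ["traveled", "travelled"],
    r"gr[ae]y": ["gray", "grey"],
}

results = {
    pattern: [bool(re.fullmatch(pattern, word)) for word in words]
    for pattern, words in spelling_patterns.items()
}

for pattern, matched in results.items():
    print(pattern, matched)
```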
What is the most frequent personal pronoun in the English subtitles (originals)?
1. English texts are annotated with TreeTagger: http://courses.washington.edu/hypertxt/csar-v02/penntable.html
2. There is no TAG query! You have to use a Corpus Query Language (CQL) query
English subtitles (English originals)
Corpus query language
Think of each text position (the thing between spaces) as
[attribute="value"]
attribute can be:
◦ word
◦ lemma
◦ tag
If a default attribute is set, you do not have to use the brackets
Quite followed by an adjective in English originals (fiction/subtitles)
[word="quite"] [tag="JJ"]
But since WORD is the default attribute, this is the same as "quite" [tag="JJ"]
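To make the idea of "one bracket per text position" concrete, here is a toy matcher in Python. It is only a sketch of the logic, not how KonText works internally; the sample sentence and its tags are hand-assigned for illustration, and the function names are mine.

```python
import re

# Each text position carries several attributes (tags assigned by hand here)
sentence = [
    {"word": "It", "lemma": "it", "tag": "PP"},
    {"word": "was", "lemma": "be", "tag": "VBD"},
    {"word": "quite", "lemma": "quite", "tag": "RB"},
    {"word": "good", "lemma": "good", "tag": "JJ"},
    {"word": ".", "lemma": ".", "tag": "SENT"},
]


def position_matches(token, constraints):
    """True if every attribute pattern matches the whole attribute value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())


def find(query, tokens):
    """Return the start indices where consecutive positions satisfy the query."""
    n = len(query)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(position_matches(tokens[i + k], query[k]) for k in range(n))
    ]


# Counterpart of the CQL query [word="quite"] [tag="JJ"]
query = [{"word": "quite"}, {"tag": "JJ"}]
starts = find(query, sentence)
print(starts)
```

Putting two attributes into one dictionary, e.g. {"lemma": "be", "tag": "VB.*"}, corresponds to combining two conditions on a single position.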
Translation of the copula feel followed by an adjective
[lemma="feel"] [tag="JJ"]
or "feel" [tag="JJ"] (default attribute = lemma)
How about seem, smell, etc…
Find in English original texts tokens of the have construction analogous to
Danny, soft and drowsy, had him thinking of things other than showing her what a lazy employer could be like
[lemma="have"][tag="PP"][tag="VBG"]
Now think of an NP instead of a pronoun
[lemma="have"][]{0,2}[tag="N.*"][tag="VBG"]
Which state verbs are used in the progressive tense (be+VVG) most often?
[lemma="be"] [tag="VBG"]
Combination of two attributes in one text position: &
Which nouns end in –ie in subtitles where English is the source language:
What is the most frequent personal pronoun in original CZECH subtitles?
http://ucnk.ff.cuni.cz/doc/popis_znacek.pdf
Klikátko http://utkl.ff.cuni.cz/~skoumal/morfo/?lang=cs
[tag="PP..1--..-..--.*"] = [tag="PP..1.*"]
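Both patterns above pick out the same tags because, in a positional tag, the positions spelled out as dashes in the long version are filled with dashes anyway for this word class. A quick check in Python; the sample tag for the pronoun já is my assumption and should be verified against the tagset description linked above.

```python
import re

# Assumed positional tag for "já" (illustrative; check the tagset description)
tag = "PP-S1--1-------"

long_pattern = r"PP..1--..-..--.*"
short_pattern = r"PP..1.*"

long_ok = re.fullmatch(long_pattern, tag) is not None
short_ok = re.fullmatch(short_pattern, tag) is not None

# In both patterns the "1" sits in the fifth position of the tag
fifth_position = tag[4]

print(long_ok, short_ok, fifth_position)
```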
Czech tags are incorporated into KonText...
Practise:
noun in the gen. sg.
adj. in the comparative degree, instrumental plural
punctuation
imperative of an imperfective verb, plural
Search for the adverb pěkně followed by an adjective in Czech translations of English subtitles
pěkně studená sprcha
Pivo?
Search for adjectives beginning in hyper in the Czech translations of English subtitles
How do you search in fiction texts published from a certain date onwards? Insert “within”
Only gives you texts from 1950, and all types of text
https://podpora.korpus.cz/boards/8/topics/228
[word="query"] within <div pubyear>="1990" & group="Core" />
E.g. quite after 1990
YOU CAN ADD MORE RESTRICTIONS, e.g. source language srclang="en"
[word="quite"] within <div pubyear>="1990" & group="Core" & srclang="en" />