Building a parallel corpus for translation research
and much more
Ana Frankenberg-Garcia
The study of human translation
Traditionally not a hard science
Difficult to be systematic
With the advances of corpus linguistics,
things can change …
What is a corpus?
large
specific criteria
text-retrieval software
machine-readable
naturally occurring texts
Advantages of using corpora to study human translation
An enormous amount of translated texts
Systematic analyses
Quantifiable results
Corpora used in translation practice and research
1. Bilingual comparable corpora: Farmhouse holidays (EN) & Agroturismo (IT)
2. Monolingual comparable corpora: Translational English Corpus (EN)
3. Simple parallel corpora: Tectra (EN-GL)
4. Bidirectional parallel corpora: COMPARA (PT-EN and EN-PT)
Building parallel corpora: text selection
• Genre (scientific, imaginative, technical, etc.)
• Mode (oral? written?)
• Variety (standard? regional?)
• Time (contemporary? older?)
• Languages (which? just two or more?)
• Translations (professional? native speakers? different translators?)
• Simple or bidirectional?
Are there translations?
Building parallel corpora: example of interrelated factors
Languages: PT-EN
Direction: PT-EN or EN-PT, or PT-EN ↔ EN-PT
Genre: scientific, academic, tourism, literature, politics (EP), oral, popular
Building parallel corpora
Personal use: no copyright hassle
Shared use: copyright permissions, results verifiable, more users and uses
Building parallel corpora: copyright
• Two permissions, double the work
• Publishers, authors and translators generally don’t know what a corpus is
• Protect
• Advertise
Building parallel corpora: alignment
Text?
Paragraph?
Sentence?
Clause?
Word?
Which parts of ST and TT match?
Building parallel corpora: tags
Alignment tags
Optional tags, e.g. textual, grammatical, semantic
What do we want tags for? More pre-processing, less post-processing
<id=EBJT1 1845>Joe watched Robin climb into the trailer and man-handle the calves one by one towards the ramp, their winglike ears pierced with plastic identity tags.
<id=EBJT1 1845>Joe ficou a ver Robin subir para o atrelado e encaminhar as vitelas uma a uma para a rampa, com as suas orelhas, que faziam lembrar asas, furadas e umas etiquetas de plástico a identificá-las.
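The shared identifier in the two lines above is what links a source sentence to its translation. As a minimal sketch, assuming each line has the form `<id=TEXT UNIT>sentence` (a format inferred from this example only), source and target lines could be paired as follows. The function names are hypothetical and the sample sentences are shortened versions of the example.

```python
import re

# Assumed line format, modelled on the example above: "<id=TEXT UNIT>sentence"
ID_LINE = re.compile(r"^<id=(?P<text>\S+)\s+(?P<unit>\d+)>(?P<body>.*)$")


def parse_unit(line):
    """Split one tagged line into ((text_id, unit_number), sentence)."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"no alignment id found in: {line!r}")
    key = (match.group("text"), int(match.group("unit")))
    return key, match.group("body").strip()


def pair_units(source_lines, target_lines):
    """Pair source and target lines that carry the same alignment id."""
    targets = dict(parse_unit(line) for line in target_lines)
    pairs = []
    for line in source_lines:
        key, sentence = parse_unit(line)
        # A source unit with no matching target gets an empty translation.
        pairs.append((key, sentence, targets.get(key, "")))
    return pairs


# Shortened versions of the example pair, for illustration only.
source = ["<id=EBJT1 1845>Joe watched Robin climb into the trailer."]
target = ["<id=EBJT1 1845>Joe ficou a ver Robin subir para o atrelado."]
pairs = pair_units(source, target)
```

Keying on the identifier rather than on line order means the pairing still works if the two files are not in the same order.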
Our options for COMPARA
A bidirectional parallel corpus of English and Portuguese
Funding: Portuguese Government and European Union (FEDER and FSE), contract ref. POSC/339/1.3/C/NAC
Project leaders: Ana Frankenberg-Garcia & Diana Santos
Research assistants: Pedro Sousa, Rosário Silva & Susana Inácio
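Corpus structure
PT source texts ↔ EN translations (parallel)
EN source texts ↔ PT translations (parallel)
The two directions together: bi-directional
A source text (ST) can have more than one translation (TT): PT with EN1 and EN2, EN with PT1 and PT2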
Language varieties
PORTUGUESE: Portugal, Brazil, Angola, Mozambique
ENGLISH: UK, US, South Africa
Unbalanced distribution!
Publication dates
(Timeline figure showing the dates 1837, 2002, 1880, 1997, 1988 and 1914)
Genre
Published fiction
EXTENSIBLE to other genres
Portuguese authors
Portugal: Camilo Castelo Branco, Eça de Queirós, José Cardoso Pires, José Saramago, Jorge de Sena, Lídia Jorge, Mário de Carvalho, Sá Carneiro
Brazil: Aluísio Azevedo, Autran Dourado, Chico Buarque, Jô Soares, José de Alencar, Machado de Assis, Manuel Antônio de Almeida, Marcos Rey, Patrícia Melo, Paulo Coelho, Rubem Fonseca
Mozambique: Mia Couto
Angola: José Eduardo Agualusa
English authors
British Isles: David Lodge, Ian McEwan, Julian Barnes, Joseph Conrad, Joanna Trollope, Kazuo Ishiguro, Lewis Carroll, Mary Shelley, Oscar Wilde
United States: Henry James, Edgar Allan Poe, Richard Zimler
South Africa: Nadine Gordimer
Portuguese translators
Ana Maria Amador, Ana Falcão Bastos, Ana Luísa Faria, Aníbal Fernandes, Carlos Grifo Babo, Cristina Ferreira de Almeida, Cristina Rodriguez, Eduardo Guerra Carneiro, Fernanda Pinto Rodrigues, Geraldo Galvão Ferraz, Helena Cardoso, Januário Leite, José Viera Lima, J. Teixeira de Aguilar, Lídia Cavalcante-Luther, Lucinda Santos Silva, Luís Lobo, Manuel João Gomes, M. F. Gonçalves de Azevedo, Maria Carlota Pracana, Maria do Carmo Figueira, Mário Martins de Carvalho, Nina Videira, Paula Reis, Yolanda Artiaga.
English translators
Adria Frizzi, Alan Clarke, Alexis Levitin, Alice Clemente, Cliff Landers, David Brookshaw, David Rosenthal, Elizabeth Lowe, Ellen Watson, Helen Caldwell, Giovanni Pontiero, Graeme Mac Nicoll, Gregory Rabassa, Isabel Burton, John Gledson, John Parker, John Byrne, John Vetch, Margaret Jull Costa, Mary Fitton, Natália Costa, Peter Bush, Richard Zenith, Ronald W. Sousa.
Can any text be included in the corpus?
Only published source texts and translations
Only English translated directly from Portuguese and Portuguese translated directly from English
Only human translations!
Texts
72 source texts (extracts)
75 translations
Size
1,549,551 words in English
1,436,493 words in Portuguese
Possibly the largest existing edited parallel corpus
Interface
Free
Easy to use by people who have never heard of corpora before
Powerful and flexible tool for experienced corpus users
Results good for research and education
www.linguateca.pt/COMPARA/
Figure: distribution of “nodded” in source texts (ST) and translations (TT), per 100 K words (vertical axis from 0 to 14)
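Comparing source texts and translations only makes sense once raw counts are scaled to a common base, here 100 K words. The sketch below shows that normalisation. The function names are hypothetical and the token lists are toy data invented for illustration, not COMPARA figures.

```python
def per_100k(hits, corpus_size):
    """Normalised frequency: occurrences per 100,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 100_000 / corpus_size


def count_word(tokens, word):
    """Case-insensitive count of one word form in a token list."""
    return sum(1 for token in tokens if token.lower() == word)


# Toy data, invented for illustration only (not taken from the corpus).
source_tokens = "he nodded and she nodded back".split()
translated_tokens = "he agreed and she nodded".split()

st_rate = per_100k(count_word(source_tokens, "nodded"), len(source_tokens))
tt_rate = per_100k(count_word(translated_tokens, "nodded"), len(translated_tokens))
```

Because both rates use the same base, they can be compared directly even when the two subcorpora differ in size.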
Users and uses
Language learners and anyone working with PT-EN: bilingual dictionary with examples
Language teachers: exercises and tests
Translators: language equivalents
Translation lecturers: exercises & problems
Translation theorists: test translation hypotheses
Lexicographers: bilingual dictionaries
Computational linguists and language engineers: machine translation and other applications
Backstage options
Text tags
EBJB1.pt ele revelou-me o seu interesse por Gosse <tnote> Edmund William Gosse (1849-1928), crítico inglês </tnote> e pela sociedade literária inglesa dos finais do século passado.
EBDL2T1.en When we sat on the sofa together to watch <title>News at Ten</title>
EBDL1T1.pt passou-me uma receita de <named> Valium </named>
EBJB1.en the white bear, <foreign> thalassarctos maritimus </foreign>, is the aristocrat of bears...
EBDL1T1.pt acaba por se esquecer de ter medo, até que acaba por verificar que não há <emph> de que </emph> ter medo.
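Text tags of this kind can be listed or unwrapped with a few lines of code. The following is a minimal sketch that covers only the five tag names seen in the examples above and assumes tags are never nested. The function names are hypothetical.

```python
import re

# Tag names taken from the examples above; nesting is assumed not to occur.
TEXT_TAG = re.compile(r"<(tnote|title|named|foreign|emph)>(.*?)</\1>", re.DOTALL)


def extract_tags(text):
    """Return (tag_name, tagged_content) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TEXT_TAG.finditer(text)]


def unwrap_tags(text):
    """Remove the tag markup but keep the tagged content."""
    return TEXT_TAG.sub(lambda m: m.group(2).strip(), text)


line = "When we sat on the sofa together to watch <title>News at Ten</title>"
tags = extract_tags(line)
plain = unwrap_tags(line)
```

Whether translator's notes (`tnote`) should be kept or dropped when unwrapping depends on the study. This sketch keeps them.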
Alignment options and tags
1 alignment unit = 1 source-text sentence
(Diagram: ST sentences (S) aligned with TT units labelled S, S2, S(+S), S½ and Ø)
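One way to picture "1 alignment unit = 1 source-text sentence" is a record that holds exactly one source sentence and however many target sentences correspond to it. The sketch below is an assumption about how such units could be represented, not a description of the project's implementation. The class name, the ratio labels and the sample sentences are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignmentUnit:
    """One source-text sentence plus whatever renders it in the translation."""
    source: str
    targets: List[str] = field(default_factory=list)

    def ratio(self):
        """Label the unit by how many target sentences it contains."""
        return f"1:{len(self.targets)}"


# Invented sentences, for illustration only.
units = [
    AlignmentUnit("She left.", ["Ela saiu."]),
    AlignmentUnit("He stayed and waited.", ["Ele ficou.", "Esperou."]),
    AlignmentUnit("Really.", []),  # source sentence with no counterpart
]
ratios = [unit.ratio() for unit in units]
```

Anchoring every unit on the source sentence keeps the unit count equal to the number of source sentences, whatever the translator did with them.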
Grammar tags
Portuguese: PALAVRAS
Petrus/PROP pediu/V_fmc a/DETartd especialidade/N da/PRP+DETartd casa/N --/PU uma/DETarti paella/N valenciana/ADJ --/PU que/SPECrel comemos/V em/PRP silêncio/N ,/PU acompanhados/V apenas/ADV do/PRP+DETartd saboroso/ADJ vinho/N Rioja/PROP ./PU
[pos="V.*"] "silêncio"
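The query above reads as: a token whose part-of-speech tag matches `V.*`, immediately followed by the word "silêncio". As a rough sketch of what such a search does over `word/TAG` text, assuming the slash-separated format of the tagged sentence above, the hypothetical helpers below parse the tokens and look for a tag pattern followed by a word. This is an illustration in plain Python, not the query engine the corpus interface uses.

```python
import re


def parse_tagged(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, sep, tag = item.rpartition("/")
        if not sep:  # no slash: keep the item as a word with an empty tag
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs


def find_tag_then_word(tokens, tag_pattern, word):
    """Find a token whose tag fully matches tag_pattern, directly followed by word."""
    tag_re = re.compile(tag_pattern)
    hits = []
    for (w1, t1), (w2, _) in zip(tokens, tokens[1:]):
        if tag_re.fullmatch(t1) and w2 == word:
            hits.append((w1, w2))
    return hits


# Fragment of the tagged sentence above.
tokens = parse_tagged("que/SPECrel comemos/V em/PRP silêncio/N ,/PU")
verb_hits = find_tag_then_word(tokens, "V.*", "silêncio")
prep_hits = find_tag_then_word(tokens, "PRP.*", "silêncio")
```

In this fragment the verb "comemos" is separated from "silêncio" by the preposition "em", so the verb pattern finds nothing here while the preposition pattern finds "em silêncio".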
Grammar tags
English: CLAWS (coming soon)
Petrus/NP1 asked/VVD for/IF the/AT specialty/NN1 of/IO the/AT house/NN1 --a/AT1 Valencia/NP1 paella/NN1 --which/DDQ we/PPIS2 ate/VVD in/II silence/NN1 ./.
Other tags
People interested in creating specific tags for their research can do so, as long as they do the tag insertion and revision work
e.g. semantic tag for colour (Inácio et al. 2007)
I did, too -- changed over to the knitted tie at a <sem="cor"> red </sem> light.
Specific tag revision interface underway (Sousa, in preparation)
Research work
1. Observing source texts and translations
2. Contrasting Portuguese and English
3. Comparing translated and untranslated language
4. Examining the characteristics of translated texts
Studies unthinkable before corpora
Many other studies possible!
www.linguateca.pt/COMPARA/ComparaPublications.html