USP workshop
Using the Corpógrafo
Belinda Maia & Luís Sarmento
PoloFLUP
LINGUATECA
First steps
• Get a username and password
• You will receive one automatically
Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research is done ONLINE
• Each username/password = a separate space on our server
• At present, anyone can work with it using 10 MB of space for FREE
• BUT you get an empty space + tools + tutorial!
Help Files
• Introdução à utilização do Corpógrafo - um pequeno tutorial: a tutorial (to be translated into English) describing the whole process of terminology research using the Corpógrafo. Available in PDF.
• Corpógrafo Roadmap: in English and Portuguese, a panoramic view of the Corpógrafo and how it works. Available in PDF.
• The Corpógrafo in Easy Stages: in English and Portuguese, a user’s guide to the Corpógrafo and FAQ. Available in PDF.
• Also note: the entry page has a glossary of terms and instructions, PT > EN
File Manager
Area where each individual or group can:
– upload texts to their space on the server
– convert various text formats to .txt
– ‘clean’ them of unnecessary material
– check tokenization and sentence divisions
– register full information on source, domain and text type
– group, and re-group, texts into corpora
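To make the cleaning and checking steps above concrete, here is a minimal sketch of what whitespace cleaning, sentence division and tokenization can look like. It does not reproduce the Corpógrafo's own routines: the function names and the splitting rules are assumptions chosen for illustration.

```python
import re

def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines (illustrative cleaning only)."""
    lines = [re.sub(r"\s+", " ", line).strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)

def split_sentences(text: str) -> list[str]:
    """Naive sentence division: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence: str) -> list[str]:
    """Split into words (accented letters included) and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

if __name__ == "__main__":
    cleaned = clean_text("  Hello   world. \n\n  Second line!  ")
    for sentence in split_sentences(cleaned):
        print(tokenize(sentence))
```

A rule this simple will wrongly split after abbreviations such as "Dr.", which is one reason the sentence divisions need to be checked by hand.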
File Manager
1. Files
• > List Files on Server
• > Add Files
• > Add Files from URL (Experimental!)
2. Corpora
• > List Corpora
• > Compile New Corpus
EXTEX
• Tool for converting file formats to .txt at:
• http://poloclup.linguateca.pt/ferramentas
General corpus analysis
Corpora analysis area:
• Concordancing tools for regular expressions
– at sentence level
– KWIC concordancing
– Collocations
• N-gram tool
– Case-sensitive
– Alphabetical or frequency ordering
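As an illustration of KWIC (keyword in context) concordancing with a regular expression at sentence level, the sketch below returns the left context, the match and the right context for every hit. The function name `kwic` and the `width` parameter are hypothetical and are not taken from the Corpógrafo.

```python
import re

def kwic(sentences, pattern, width=30):
    """Return (left context, match, right context) for each regex hit, sentence by sentence."""
    rx = re.compile(pattern)  # case-sensitive unless the pattern says otherwise
    rows = []
    for sent in sentences:
        for m in rx.finditer(sent):
            left = sent[max(0, m.start() - width):m.start()]
            right = sent[m.end():m.end() + width]
            rows.append((left, m.group(0), right))
    return rows

if __name__ == "__main__":
    corpus = ["The term bank stores terms.", "No match here."]
    for left, hit, right in kwic(corpus, r"\bterms?\b", width=10):
        # Right-align the left context so the keyword lines up in a column.
        print(f"{left:>10} [{hit}] {right}")
```

Because contexts are cut inside each sentence, a hit near a sentence boundary simply gets a shorter context.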
Corpora + TDB
• Choose corpus
• Choose related TDB
= All terms, examples, definitions extracted from corpus (semi) automatically transferred to TDB
= All metadata on texts in corpus can be automatically transferred to TDB
Term extraction
• N-grams
– Unfiltered
– Filtered with restrictions on term in PT, EN, FR, IT, ES, DE
– Filtered with restrictions on term and context in PT, EN, FR, IT, ES, DE
– Singular + plural terms can be combined
– Existing terms in TDB need not appear
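The sketch below shows one possible way of filtering n-grams into term candidates. The actual restrictions applied by the Corpógrafo are not described here, so the stopword list, the rule that a candidate may not begin or end with a stopword, and all function names are assumptions. Combining singular and plural forms is not covered.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for English.
STOPWORDS_EN = {"the", "of", "a", "an", "in", "and", "to", "is"}

def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_terms(tokens, n, stopwords=STOPWORDS_EN, known_terms=frozenset()):
    """Count n-grams that pass the (assumed) restrictions on the term."""
    counts = Counter()
    for gram in ngrams(tokens, n):
        if gram[0].lower() in stopwords or gram[-1].lower() in stopwords:
            continue  # candidate may not start or end with a stopword
        if not all(tok.isalpha() for tok in gram):
            continue  # drop n-grams containing digits or punctuation
        if " ".join(gram) in known_terms:
            continue  # terms already in the database need not appear
        counts[gram] += 1
    return counts

def ordered(counts, by="frequency"):
    """Order results alphabetically or by descending frequency."""
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    toks = "the term bank of the term bank is a term bank".split()
    print(ordered(candidate_terms(toks, 2)))
```

Counting is case-sensitive, so "Corpus" and "corpus" are kept apart; only the stopword check is lowercased.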
Term selection from n-grams
• Consultation of list of n-grams
• Check term status of each n-gram via underlying concordances
• Check sources
• Send to TDB
Search for definition candidates
• Already possible via the TDB
• Under development
• Research area for Master's (Mestrado) dissertations and grant holders (bolseiros)
TDB - Terminology database
Databases are designed to be multilingual
– Terms listed alphabetically + language tag
– General data
– Morphological data
– Source metadata: authors, texts, etc.
– Definitions + search for candidates
– Translation equivalents
– Semantic relations
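One way to picture a multilingual term record with the fields listed above is the sketch below. It is not the Corpógrafo's actual schema: the class `TermEntry`, its field names and the helper `listing` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    """Hypothetical record for one term in one language."""
    term: str
    lang: str                                          # language tag, e.g. "PT" or "EN"
    general: dict = field(default_factory=dict)        # general data
    morphology: dict = field(default_factory=dict)     # morphological data
    sources: list = field(default_factory=list)        # source metadata: authors, texts
    definitions: list = field(default_factory=list)
    translations: dict = field(default_factory=dict)   # language tag -> list of equivalents
    relations: list = field(default_factory=list)      # (relation name, related term) pairs

def listing(entries):
    """Terms in alphabetical order (ignoring case), each followed by its language tag."""
    in_order = sorted(entries, key=lambda e: (e.term.casefold(), e.lang))
    return [f"{e.term} [{e.lang}]" for e in in_order]

if __name__ == "__main__":
    db = [
        TermEntry("corpus", "EN", translations={"PT": ["corpus"]}),
        TermEntry("base de dados", "PT", translations={"EN": ["database"]}),
    ]
    print(listing(db))
```

Keeping the language tag on each entry lets terms from several languages share one alphabetical list.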
Future developments – general policy
• General testing and improvement
• Development of new ideas or functions – matching researchers’ needs to what we are able to provide
• Coordination of individual corpus projects into bigger projects, when possible or necessary