Corpus studio Erwin Komen

Date post: 23-Jan-2018
Upload: clariah

CorpusStudio web application

Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International

1. Background
• Existing software:
  • CorpusStudio – Windows
  • Cesax – Windows
  • Successfully used in linguistic research
• Web application version?
  • Central location for corpora (‘last’ version)
  • Platform independent: MacOS/Linux/Windows
  • Fast parallel processing

2. Formats
• FoLiA xml
  • Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx xml
  • English historical + SLA
  • Caucasian: Chechen, Lak, Lezgi
  • Old Welsh
  • Dutch
• Additional formats
  • Convert via ‘Cesax’ (Alpino, Negra, …)
  • Add handler into CorpusStudio
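To make the FoLiA input format more concrete, here is a minimal Python sketch that reads word tokens from a FoLiA-style fragment. It is an illustration only, not CorpusStudio code: the sample document is hand-made, the function name `read_tokens` is hypothetical, and the sketch assumes the usual FoLiA layout of `<w>` elements holding a `<t>` text element and a `<pos>` element with a `class` attribute.

```python
import xml.etree.ElementTree as ET

# FoLiA's XML namespace. The fragment below is a hand-made illustration,
# not taken from Nederlab, CGN, or any other corpus.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Dit</t><pos class="VNW"/></w>
      <w xml:id="example.s.1.w.2"><t>werkt</t><pos class="WW"/></w>
    </s>
  </text>
</FoLiA>"""


def read_tokens(xml_text):
    """Return (word, part-of-speech class) pairs for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iterfind(".//f:w", FOLIA_NS):
        text = w.findtext("f:t", default="", namespaces=FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        # A word without a <pos> child gets None as its class.
        tokens.append((text, pos.get("class") if pos is not None else None))
    return tokens


print(read_tokens(SAMPLE))
```

A handler for another format would follow the same pattern with that format's element names.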

4. Defining queries
• Definition editor
  • Constants
  • Functions (XQuery)
• Query editor
  • Subcategorization (XQuery)
• Constructor editor
  • Execution order
  • Options (examples, output, complement)
• Result database feature editor
  • XQuery user functions calculate the features
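The role of the execution order and of the output and complement options can be sketched in Python as follows. This is a simplified model under stated assumptions, not the CorpusStudio implementation: plain Python predicates stand in for XQuery queries, sentences are plain strings, and the names `Step`, `run_constructor`, and the `"<step>/out"` and `"<step>/cmp"` labels are hypothetical. The assumption is that each step splits its input into matches (output) and non-matches (complement), and that a later step may take either as its input.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One line of a hypothetical constructor: a query plus its input source."""
    name: str
    query: Callable[[str], bool]   # stand-in for an XQuery query
    source: str = "corpus"         # "corpus", "<step>/out" or "<step>/cmp"


def run_constructor(steps: List[Step], corpus: List[str]) -> Dict[str, List[str]]:
    """Run the steps in order; each step splits its input into output and complement."""
    streams: Dict[str, List[str]] = {"corpus": list(corpus)}
    for step in steps:
        items = streams[step.source]
        streams[f"{step.name}/out"] = [s for s in items if step.query(s)]
        streams[f"{step.name}/cmp"] = [s for s in items if not step.query(s)]
    return streams


sentences = ["the dog barks", "a cat sleeps", "the cat sees the dog"]
steps = [
    Step("hasThe", lambda s: "the" in s.split()),
    Step("hasCat", lambda s: "cat" in s.split(), source="hasThe/out"),
]
result = run_constructor(steps, sentences)
print(result["hasCat/out"])   # sentences containing both "the" and "cat"
print(result["hasThe/cmp"])   # sentences without "the"
```

Because a step can only read streams that already exist, the order of the steps matters: moving `hasCat` before `hasThe` would raise a `KeyError` in this sketch.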

6. Availability
• CorpusStudio sources (build your own version)
  • https://github.com/ErwinKomen
• CLARIN-NL access
  • http://www.clarin.nl/node/2095

7. References
Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition). W3C Recommendation. <http://www.w3.org/XML/Query>.

van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3: 63-81.

Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Sandra Kübler, Petya Osenova & Martin Volk (eds), Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian AS.

[Figure: architecture diagram of the CorpusStudio web application. Labels recoverable from the diagram: User information; Project information; Meta Data Editor; Definition Editor; Query Editor; Constructor Editor; Input Selector; Database feature editor; Definitions; Queries; Corpus Research Project (.crpx); Search service: crpp; Query Executor; Database Creator; Output Monitor; Status; Documents (.xml); Results (.xml); Corpus Research Database (.xml); Result Grouping; Standard grouping (.json); Table Viewer; Result Viewer; Grouping Viewer; Corpus Viewer; Result database; Result dbase Viewer; Result dbase Editor. Connections between components are labelled xml or json.]

3. Corpus Research Projects
• All information for one research project
  • Meta information (author, dates, goal)
  • Input (language, corpus, filter)
  • All definition and query files used
  • Execution order
  • Optional: result database features
• Exchange
  • Upload/download
  • Compatible with Windows CorpusStudio
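The contents of a corpus research project listed above can be pictured as a single record. The Python sketch below is only a mental model: the class and field names are hypothetical, the example values are invented, and the JSON round trip merely illustrates upload/download exchange. The actual .crpx file layout is not described in this poster and is not reproduced here.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CorpusResearchProject:
    """Hypothetical in-memory model of one research project."""
    # Meta information
    author: str
    goal: str
    created: str
    # Input
    language: str
    corpus: str
    input_filter: Optional[str] = None
    # Definition and query files, and the order in which queries run
    definitions: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)
    execution_order: List[str] = field(default_factory=list)
    # Optional: result database features
    db_features: List[str] = field(default_factory=list)

    def missing_queries(self) -> List[str]:
        """Return names in execution_order that have no matching query."""
        return [q for q in self.execution_order if q not in self.queries]


project = CorpusResearchProject(
    author="A. Researcher",            # invented example values
    goal="Find fronted objects",
    created="2018-01-23",
    language="Dutch",
    corpus="example-corpus",
    definitions=["common"],
    queries=["mainClause", "frontedObject"],
    execution_order=["mainClause", "frontedObject"],
)

# Round trip through JSON to mimic exchanging a project between systems.
exchanged = CorpusResearchProject(**json.loads(json.dumps(asdict(project))))
print(exchanged == project)
print(exchanged.missing_queries())
```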

CorpusStudio components
• Meta Data Editor
• Definition Editor
• Input Selector
• Query Editor
• Constructor Editor
• Output Monitor
• Query Executor
• Result Viewer
• Corpus Viewer
• Database feature editor

5. Future
• Grouping editor
  • Group output over meta-data categories
  • User-definable (XQuery)
• Query/project wizard
  • Tabular input of principal components
  • Relations, names, feature calculations
• Result database editor
  • View and edit result database records
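Grouping output over meta-data categories can be illustrated with a small Python sketch. It assumes, purely for illustration, that each query hit carries a dictionary of metadata for the text it came from. The record layout, the category names, and the function name `group_hits` are hypothetical and do not come from CorpusStudio.

```python
from collections import Counter


def group_hits(hits, category):
    """Count hits per value of one metadata category."""
    return Counter(hit["meta"].get(category, "unknown") for hit in hits)


# Invented example hits; "period" and "genre" are illustrative categories.
hits = [
    {"text_id": "t1", "meta": {"period": "early", "genre": "letter"}},
    {"text_id": "t2", "meta": {"period": "early", "genre": "sermon"}},
    {"text_id": "t3", "meta": {"period": "late", "genre": "letter"}},
    {"text_id": "t4", "meta": {"genre": "letter"}},
]

by_period = group_hits(hits, "period")
by_genre = group_hits(hits, "genre")
print(dict(by_period))
print(dict(by_genre))
```

A user-definable grouping would replace the fixed category lookup with a function supplied by the user.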
