CLARIN-PL
CLARIN-PL – Language Technology Infrastructure Open for Users
Maciej Piasecki Wrocław University of Technology, G4.19 Research Group
Violetta Koseska-Toszewa Institute of Slavic Studies PAS
Krzysztof Marasek Polish-Japanese Academy of Information Technology
Adam Pawłowski Wrocław University
Piotr Pęzik University of Łódź
Adam Przepiórkowski Institute of Computer Science PAS
2015-09-16
CLARIN-PL: the Consortium
§ Wrocław University of Technology
  § G4.19 Language Technology and Computational Linguistics Research Group
§ Institute of Computer Science, Polish Academy of Sciences
§ Institute of Slavic Studies, Polish Academy of Sciences
§ Polish-Japanese Academy of Information Technology
  § Chair of Multimedia
§ University of Łódź
  § PELCRA group at the Chair of English Language and Applied Linguistics
§ Wrocław University
  § Institute of Library Studies and Scientific Information
CLARIN Annual Conference 2015, Wrocław, 2015-10-16
CLARIN-PL
Development Paradigm
§ Bi-directional approach
  § Technology-centred
    § CLARIN centre
    § Language Resources and Tools: publishing, linking, developing
  § User-centred
    § development of a set of research applications
§ Bottom-up
  § a collected offer approach
  § focus on accessibility, technical interoperability and processing chains
§ Top-down
  § following the user-centred design paradigm
  § research applications for H&SS are the starting point
CLARIN-PL: Pillars
§ CLARIN-PL Language Technology Centre (www.clarin-pl.eu)
  § the Polish node of the CLARIN distributed infrastructure
§ Complete set of the basic Language Resources & Tools for Polish
  § filling gaps in the set of basic Language Resources and Tools for Polish
§ Research applications for H&SS
  § first set for key users and selected sub-domains of H&SS
CLARIN-PL Language Technology Centre
§ Location: Wrocław University of Technology
  § based on a modified DSpace system from LINDAT (Czech CLARIN)
  § certified B-type centre
  § single sign-on based on the PIONIER.Id federation
§ Repository system for language resources
  § persistent identifiers for resources and tools
  § CMDI meta-data
  § interface for Federated Content Search
  § depositing services
§ Web Services for LRTs (REST and SOAP)
  § basic processing chain for Polish
  § prototype system for flexible composition of natural language processing chains
§ Web Applications for LRTs
§ Knowledge Sharing: expertise and support for the users
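The idea of flexibly composing processing chains can be illustrated with a minimal local sketch. It does not call the CLARIN-PL Web Services and does not reproduce their interface: the step functions `segment`, `tokenise` and `count_tokens`, the dictionary-based document format and the `run_chain` helper are all assumptions made only for this illustration.

```python
from typing import Callable, Dict, List

# Assumption: a document travels along the chain as a plain dictionary.
# The real services exchange their own formats; this is only a stand-in.
Doc = Dict[str, object]
Step = Callable[[Doc], Doc]


def segment(doc: Doc) -> Doc:
    """Toy sentence segmenter (hypothetical): splits the text on full stops."""
    text = str(doc["text"])
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return {**doc, "sentences": sentences}


def tokenise(doc: Doc) -> Doc:
    """Toy tokeniser (hypothetical): splits every sentence on whitespace."""
    tokens = [s.split() for s in doc["sentences"]]
    return {**doc, "tokens": tokens}


def count_tokens(doc: Doc) -> Doc:
    """Adds the total number of tokens to the document."""
    return {**doc, "n_tokens": sum(len(t) for t in doc["tokens"])}


def run_chain(steps: List[Step], doc: Doc) -> Doc:
    """Applies the steps in order; each step sees the output of the previous one."""
    for step in steps:
        doc = step(doc)
    return doc


# The chain is just a list, so steps can be reordered, dropped or replaced.
result = run_chain([segment, tokenise, count_tokens],
                   {"text": "Ala ma kota. Kot ma Alę."})
print(result["sentences"], result["n_tokens"])
```

Because every step has the same signature, a user-specific chain is defined by choosing which steps go into the list, which is the property a composition system of this kind relies on.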
Wrocław University of Technology Resources (selected)
§ plWordNet 3.0 emo
  § a comprehensive description of the Polish lexico-semantic system (~200 000 lemmas, ~280 000 senses)
  § the largest wordnet in the world, annotated with sentiment and basic emotions, manually mapped to Princeton WordNet
§ enWordNet 1.0: Princeton WordNet 3.1 expanded (+10 000 lemmas)
§ Korpus Politechniki Wrocławskiej (Wrocław University of Technology Corpus)
  § an open Polish corpus with rich annotation on several levels
§ Dictionary of multiword expressions described syntactically
§ NELexicon 2.0 – a very large lexicon of Polish Proper Names (2.5 million)
plWordNet 2.3 emo & enWordNet 0.1
plWordNet 2.3 emo http://plwordnet.pwr.edu.pl
Wrocław University of Technology Tools (selected)
§ Inforex – a web-based system for corpus annotation
§ MeWeX – a system for extraction of multiword expressions (collocations)
§ WoSeDon – Word Sense Disambiguation and sense-based statistical analysis
§ Information Extraction (Text mining)
  § recognition of Proper Names, anaphoric links, time expressions and spatial expressions
  § event recognition
§ Shallow semantic dependency parser
§ Extraction of the semantic-pragmatic information
  § keywords, text semantic relations and text summaries
Inforex – corpus annotation editor
Statistical analysis of word sense frequencies (WoSeDon)
Institute of Computer Science Resources (selected)
§ Resources
  § A large semantic valency lexicon for Polish predicative lexical units
  § Polfie – a formal syntactic-semantic grammar of Polish in the LFG formalism
  § Treebanks with different syntactic and semantic annotation
§ Systems for dictionaries
  § Slowal – an editor for the valency dictionary
  § Kuźnia – a system for editing morphological resources
  § Toposław – an editor for a dictionary of Multiword Expressions
§ Poliqarp 2.0
  § Search engine for very large, richly annotated corpora
  § Including treebanks and semantic annotation
  § Powerful language for specifying queries
Institute of Computer Science Tools (selected)
§ Tools adaptable to the domain and user needs:
  § Segmenter based on user-modified rules
  § Morfeusz 2.0 – an adaptable morphological analyser
  § Lemmatiser based on combining taggers
  § Extended Named Entity Recognition
  § Hybrid dependency parser based on combining taggers and dependency parsers
§ Deep parsers for Polish
  § Świgra – a syntactic parser based on a DCG grammar
  § Syntactic-semantic parser based on an LFG grammar
  § Syntactic-semantic parser based on Categorial Grammar
§ Terminology extraction from domain corpora
  § Statistical method combined with simple extraction rules
Walenty – a valency dictionary
POLFIE – an LFG grammar of Polish
Polish-Japanese Academy of Information Technology
§ Technology
  § System for long-term archiving based on a unique hardware and software solution
§ Resources (selected)
  § Transcribed speech database
§ Tools (selected)
  § Phonetic transcription of texts
  § Text-to-speech alignment
  § Speech segmentation
  § Recognition of speaker changes in speech
  § Recognition of events in speech
  § Searching for keywords in speech recordings
Services for speech (Polish-Japanese Academy of Information Technology)
Services for speech: integration with Praat (Polish-Japanese Academy of Information Technology)
University of Łódź
§ Resources
  § Parallel Polish-English corpus (expanded)
  § Conversational corpus (expanded)
    § Recorded in real-life situations
    § Described with meta-data
§ Tools
  § Paralela – a search engine for parallel corpora
  § Spokes – a search engine for
    § monolingual corpora
    § conversational corpora
    § corpora described with meta-data
  § Thematic classifier based on Wikipedia categories
    § Assigns semantic categories to texts
Spokes (University of Łódź) http://spokes.clarin-pl.eu
Paralela (University of Łódź) http://spokes.clarin-pl.eu
Wrocław University
§ ChronoPress – the Polish Chronological Corpus
  § Text samples from the years 1945-1954
  § 5760 samples per year, each sample 300 words
  § Described with meta-data
  § Statistical representation of the language changes
§ Tools
  § Lexical trend analysis
  § Calculation of descriptive parameters: average, correlation, cross-correlation, etc.
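The descriptive parameters named above can be sketched for two yearly frequency series. This is a minimal illustration, not the ChronoPress implementation: the series `word_a` and `word_b` are invented numbers (one value per year, 1945-1954), and the function names are assumptions.

```python
from math import sqrt
from typing import Sequence


def mean(xs: Sequence[float]) -> float:
    """Arithmetic average of a series."""
    return sum(xs) / len(xs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long series.

    Assumes both series have non-zero variance (otherwise this divides by zero).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cross_correlation(xs: Sequence[float], ys: Sequence[float], lag: int) -> float:
    """Correlation between xs[t] and ys[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(xs, ys)
    return pearson(xs[:-lag], ys[lag:])


# Invented yearly frequencies of two words; word_b repeats word_a one year later.
word_a = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2]
word_b = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]

print(mean(word_a))
print(pearson(word_a, word_b))
print(cross_correlation(word_a, word_b, lag=1))
```

In this toy data the correlation at a lag of one year is higher than at lag zero, which is the kind of delayed co-movement that cross-correlation is meant to reveal.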
ChronoPress – Polish Chronological Corpus (Wrocław University)
Institute of Slavic Studies Resources (selected)
§ Polish-Bulgarian-Russian text corpus
  § Contemporary texts
  § Manually aligned at the sentence level
  § A semantically annotated sub-corpus
§ Polish-Lithuanian text corpus
  § Contemporary texts
  § Manually aligned at the sentence level
Bi-directional - Top-down Part: First Applications
§ Approaching users
  § already active, interested, working on large textual and speech resources, …
  § covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  § matching the available language tools for Polish
  § the first set of several prototype applications illustrating the possibilities and facilitating identification of the needs
§ First applications
  § Spokes – searching corpora of conversational data
  § A system for collecting Polish text corpora from the Web
  § An open textometric and stylometric system focused on Polish
  § Semantic text classification for sociology
  § Literary Map
Open Textometric and Stylometric System
§ System designed for the characteristic features of Polish
§ Links together language tools and feature extraction with frameworks for stylometry and clustering, e.g. Stylo (Eder & Rybicki)
§ Enables the use of features defined on any level of the linguistic structure:
  § from the level of word forms
  § up to the level of the semantic-pragmatic structures
§ Available as a Web Application and a Web Service
§ Combines
  § the stylometry system
  § with a semantic classification and tagging system
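How word-form features can feed a stylometric comparison may be sketched as follows. This is an illustration under stated assumptions, not the system's method and not the Stylo API (Stylo is an R package): it uses Burrows's Delta, a distance commonly used in stylometry, with population standard deviation as a simplification, and the three token lists are invented toy data.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def most_frequent_words(texts: Dict[str, List[str]], n: int) -> List[str]:
    """The n most frequent word forms over all texts taken together."""
    total = Counter()
    for tokens in texts.values():
        total.update(tokens)
    return [word for word, _ in total.most_common(n)]


def z_scores(texts: Dict[str, List[str]], vocab: List[str]) -> Dict[str, List[float]]:
    """Relative frequency of each vocabulary word per text, standardised across texts."""
    freqs = {name: [Counter(tokens)[w] / len(tokens) for w in vocab]
             for name, tokens in texts.items()}
    scores: Dict[str, List[float]] = {name: [] for name in texts}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in texts]
        mu, sigma = mean(column), pstdev(column)
        for name in texts:
            # A word with identical frequency in all texts carries no signal.
            z = (freqs[name][i] - mu) / sigma if sigma > 0 else 0.0
            scores[name].append(z)
    return scores


def delta(a: List[float], b: List[float]) -> float:
    """Burrows's Delta: mean absolute difference of the z-scores of two texts."""
    return mean([abs(x - y) for x, y in zip(a, b)])


# Invented toy "texts", already tokenised into word forms.
texts = {
    "A": "i to i nie i to".split(),
    "B": "i to i nie i to".split(),
    "C": "nie nie to nie i nie".split(),
}
vocab = most_frequent_words(texts, 3)
scores = z_scores(texts, vocab)
print(delta(scores["A"], scores["B"]), delta(scores["A"], scores["C"]))
```

The resulting distances can be passed to any clustering method. Replacing word forms with lemmas, part-of-speech tags or semantic labels changes only the token lists, which is why features from any level of linguistic description fit the same scheme.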
Open Textometric and Stylometric System
Literary Map
§ Goal
  § Support for using maps in literary criticism
  § Tool for the identification of all geographical names in a literary text (or a corpus) and mapping them onto a geographical map
§ Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
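Tasks 1 to 3, with a simple count standing in for the statistical part of task 4, can be sketched with a toy gazetteer. Everything here is an assumption made for illustration: the `GAZETTEER` table, its approximate coordinates, the "first candidate wins" disambiguation rule and the function names are hypothetical, and the sketch matches only uninflected name forms, whereas a real system for Polish would need lemmatisation and named entity recognition.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Place(NamedTuple):
    name: str
    lat: float
    lon: float


# Toy gazetteer (hypothetical): surface form -> candidate referents,
# most likely first. Coordinates are approximate.
GAZETTEER: Dict[str, List[Place]] = {
    "Wrocław": [Place("Wrocław", 51.11, 17.03)],
    "Łódź": [Place("Łódź", 51.76, 19.46)],
    "Warszawa": [Place("Warszawa", 52.23, 21.01)],
}


def find_names(text: str) -> List[str]:
    """Task 1 (simplified): tokens that occur in the gazetteer."""
    tokens = [t.strip(".,;:!?") for t in text.split()]
    return [t for t in tokens if t in GAZETTEER]


def disambiguate(name: str) -> Place:
    """Task 2 (simplified): take the first, i.e. most likely, candidate."""
    return GAZETTEER[name][0]


def map_text(text: str) -> Dict[Place, int]:
    """Task 3 plus a mention count: geo-located places with their frequencies."""
    counts = Counter(find_names(text))
    return {disambiguate(name): n for name, n in counts.items()}


TEXT = "Wrocław i Łódź to miasta. Wrocław leży nad Odrą."
mapped = map_text(TEXT)
for place, n in mapped.items():
    print(place.name, place.lat, place.lon, n)
```

The output pairs each recognised place with coordinates and a mention count, which is the minimum a map layer needs in order to draw markers scaled by frequency.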
Literary Map
Workshops: Results and Requests
§ Training workshops for the CLARIN-PL centre and services
§ Results
  § strong interest in participation: three candidates for each place!
  § workshops: three cities, three days each, min. 25 hours
  § participants: more than 140 persons from different domains of H&SS: full professors (>5), professors, researchers, PhD students
§ Requests (on the basis of more than 30 questionnaires)
  § very warm reception
  § some criticism, but unexpectedly rare
  § many concrete suggestions concerning the services and applications
  § described research tasks, concrete needs, two proposals to organise domain-focused workshops
CLARIN-PL
Thank you very much for your attention! www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]