Post on 24-Feb-2016
www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at
NLP Interchange Format
José M. García
Outline
• What is NIF?
• Design requirements
• URI schemes
• NIF ontologies
• Use cases
• Relationship with ELRA
• Roadmap for NIF 2.0
• Conclusions
What is NIF?
• Natural Language Processing Interchange Format
• NIF is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
• Building blocks
– URI scheme for identifying elements in texts
– Ontology for describing common NLP terms
• Created and maintained by the AKSW group of the University of Leipzig, as part of the LOD2 EU project.
• Community project: http://persistence.uni-leipzig.org/nlp2rdf/
NIF design requirements
• Compatibility with RDF
• Coverage
• Structural Interoperability
• Conceptual Interoperability
• Granularity
• Provenance and Confidence
• Simplicity
• Scalability
URI schemes
• Text needs to be referenceable by URIs
• With URI references, text can be used as a resource in RDF statements
• NIF distinguishes:
– Documents
– Text of the document
– Substrings of the text
• A URI scheme is an algorithm to create IDs for text and substrings
• URI elements
– Document URI
– Separator
– Character indices
RFC 5147
• Canonical URI scheme for NIF is based on RFC 5147
• It standardizes fragment identifiers for text/plain media type
http://www.w3.org/DesignIssues/LinkedData.html
http://www.w3.org/DesignIssues/LinkedData.html#char=0,26610
http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218
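The URIs above combine the three URI elements (document URI, separator, character indices). A minimal Python sketch of that idea follows. The function names are hypothetical and only the `char=begin,end` range form shown on this slide is handled; RFC 5147 defines further forms (such as line-based fragments and integrity checks) that are left out here.

```python
import re
from typing import Tuple

# Only the "char=<begin>,<end>" range form used on the slide is covered.
CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)$")


def build_uri(document_uri: str, begin: int, end: int, separator: str = "#") -> str:
    """Create an ID for a substring: document URI + separator + character indices."""
    if begin < 0 or end < begin:
        raise ValueError("indices must satisfy 0 <= begin <= end")
    return f"{document_uri}{separator}char={begin},{end}"


def parse_uri(uri: str) -> Tuple[str, int, int]:
    """Split a substring URI back into (document URI, begin, end)."""
    document_uri, _, fragment = uri.partition("#")
    match = CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    return document_uri, int(match.group(1)), int(match.group(2))


def resolve(text: str, uri: str) -> str:
    """Return the substring of `text` that the URI points to."""
    _, begin, end = parse_uri(uri)
    return text[begin:end]
```

Offsets count positions between characters starting at 0, so they map directly onto Python slicing: `resolve("Hello NIF world", "http://example.org/doc.txt#char=6,9")` selects `NIF`.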
NIF Core Ontology
• Classes and properties to describe the relations between
– Documents
– Text
– Substrings
– Corresponding URI schemes
• Additional classes and properties (unstable/testing)
– More URI schemes
– Text structure (words, sentences, paragraphs…)
– Part of Speech (POS)
– Annotations with Stanbol
– Confidence
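To make the relation between text, substrings and URI schemes more concrete, here is a rough Python sketch that emits plain (subject, predicate, object) tuples for one substring. The namespace and the term names (`Context`, `RFC5147String`, `isString`, `referenceContext`, `beginIndex`, `endIndex`, `anchorOf`) are assumed from the NIF 2.0 Core ontology and should be checked against the published version. The helper name `describe_substring` is hypothetical, and literal datatypes are omitted.

```python
from typing import Any, List, Tuple

# Assumed NIF 2.0 Core namespace; verify against the published ontology.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = Tuple[str, str, Any]


def describe_substring(document_uri: str, text: str, begin: int, end: int) -> List[Triple]:
    """Describe a whole text and one of its substrings as RDF-like triples."""
    context = f"{document_uri}#char=0,{len(text)}"   # URI for the full text
    string = f"{document_uri}#char={begin},{end}"    # URI for the substring
    return [
        # The full text of the document
        (context, RDF_TYPE, NIF + "Context"),
        (context, NIF + "isString", text),
        (context, NIF + "beginIndex", 0),
        (context, NIF + "endIndex", len(text)),
        # The substring, typed with the URI scheme that produced its ID
        (string, RDF_TYPE, NIF + "RFC5147String"),
        (string, NIF + "referenceContext", context),
        (string, NIF + "beginIndex", begin),
        (string, NIF + "endIndex", end),
        (string, NIF + "anchorOf", text[begin:end]),
    ]
```

Plain tuples keep the sketch free of dependencies; an RDF library could serialize the same statements as Turtle or N-Triples.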
Workflows, Modularity and Extensibility of NIF
• Workflows for NLP integration
– Normalization
– Tokenization
– Merge RDF annotations
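A small Python sketch of how these three steps could fit together. It assumes that normalization means Unicode NFC, that tokenization is a plain whitespace split, and that each tool returns a set of (subject, predicate, object) tuples. The `ex:` predicates and all function names are hypothetical. The point is only that tools sharing one offset-based URI scheme produce annotations that merge by simple set union.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def normalize(text: str) -> str:
    """Normalization (assumed here to be Unicode NFC) so every tool sees the same offsets."""
    return unicodedata.normalize("NFC", text)


def tokenize(document_uri: str, text: str) -> List[str]:
    """Whitespace tokenization; each token gets an offset-based URI."""
    return [
        f"{document_uri}#char={m.start()},{m.end()}"
        for m in re.finditer(r"\S+", text)
    ]


def merge(*graphs: Iterable[Triple]) -> Set[Triple]:
    """Merge annotations from several tools: a set union of their triples."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged.update(graph)
    return merged


def by_subject(triples: Iterable[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Group merged annotations by the string URI they describe."""
    grouped: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for subject, predicate, obj in triples:
        grouped[subject].add((predicate, obj))
    return dict(grouped)
```

For example, a tagger emitting `(uri, "ex:pos", "NNP")` and an entity recognizer emitting `(uri, "ex:entity", "ex:City")` for the same token URI end up under one key after `by_subject(merge(a, b))`.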
• NIF ontology logical modules
– Terminological model
– Inference model
– Validation model
• Vocabulary modules
– FISE
– ITS
– OLiA
– NERD
– …
• Granularity profiles
ITS Use Case
• The Internationalization Tag Set 2.0 is a W3C working draft that is becoming a Recommendation.
• ITS standardizes HTML and XML attributes which can be used to annotate nodes with processing information for language service providers (i18n, l10n)
• ITS 2.0 RDF ontology was developed using NIF, including a round-trip conversion algorithm from ITS to NIF.
• NIF is expected to receive wide adoption by translation & language service providers
• The ITS 2.0 RDF ontology provides properties that can serve as best practices for NLP annotations.
OLiA Use Case
• The Ontologies of Linguistic Annotation provide stable identifiers for morpho-syntactic annotation tag sets, so that NLP tools can use these identifiers for better interoperability.
• OLiA provides Annotation Models and a Reference Model, comprising more than 110 OWL ontologies for over 34 tag sets in 69 languages
• Features
– Documentation
– Flexible Granularity
– Language Independence
• NIF provides two properties
– nif:oliaIndividual (links a nif:String to an OLiA Annotation Model)
– nif:oliaCategory (links to the Reference Model)
RDFaCE Use Case
• RDFa Content Editor is a rich text editor that supports WYSIWYM authoring including various views of the semantically enriched textual content.
• It combines results of different NLP APIs for automatic content annotation
– Problem: heterogeneous API access, URI generation and output data structures
– Solution: server-side proxy, with hard-coded input and connection for each API
• NIF simplified the integration, adding an interoperability layer
What is ELRA?
• European Language Resources Association
• http://www.elra.info
• Effort to make Language Resources (LR) available for language engineering and to evaluate language engineering technologies.
• LR marketplace
• Related organizations
– ELDA (ELRA’s operational body)
– LREC conferences
Relationship with NIF
• Different objectives
• Written LRs (esp. corpora) can be annotated with NIF for further interoperability and integration with NLP tools
• ADVANTAGE: Large test data collection to evaluate NLP tools
• DISADVANTAGE: Cost of LR (though there are free ones)
Roadmap for NIF 2.0
• Release of NIF 1.0
– DONE (Nov 2011)
• Release of NIF 2.0 Draft
– CURRENT effort on solving pending issues
– Adoption in ITS 2.0 W3C (soon-to-be) Recommendation
– NIF-Core ontology is becoming stable
– RLOG - an RDF Logging Ontology
– NIF Validator software available
• Release of NIF 2.0 Core
• Release of NIF 2.0 Extensions
– ITS ontology, PROV ontology, Lemon Ontology, NERD, UIMA, MARL opinion ontology…
Conclusions
• NIF allows NLP tools to be integrated using Linked Data
• Ongoing effort
• Many adopters and supporters
– LOD2 EU project
– Several W3C working groups
– Named Entity Recognition and Disambiguation (NERD)
– Ontologies of Linguistic Annotation (OLiA)
– …
• 27 different implementations and use cases
– Some available at http://persistence.uni-leipzig.org/nlp2rdf/
www.sti-innsbruck.at © Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at
Thanks for your attention
Questions?
References
1. http://persistence.uni-leipzig.org/nlp2rdf/
2. Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. Integrating NLP using Linked Data. In: 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia