ISOcat
Data Category Registry
Defining widely accepted linguistic concepts
Menzo Windhouwer
CLARIN-NL MD tutorial, 24-25 September 2009
ISOcat: a reference implementation
• ISO 12620:2009
– Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources
– ISO 12620:1999 was a fixed list of data categories; this revision provides a data model and management procedures
• ISO Technical Committee 37
– Terminology and other language and content resources
ISO 24613:2008 Lexical Markup Framework
[Diagram: LMF class model with the classes Lexicon, Lexical Entry, Form, Sense, Lemma and Word Form (multiplicities 0..* and 1..*), annotated with the data categories partOfSpeech, writtenForm, grammaticalNumber and lexicalType]
Data categories
• “result of the specification of a given data field” (ISO 12620:2009)
• data element concept (ISO 11179)
– “concept for which the definition, identification and conceptual domain are specified independently of any particular representation”
• complex data categories are data element concepts
Data category types
• complex:
– open: writtenForm (string)
– closed: grammaticalGender (string), with the values neuter, masculine and feminine
– constrained: string with the constraint .+@.+
• simple:
– neuter, masculine, feminine
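The three kinds of complex data category differ only in how their value domain is restricted. The sketch below illustrates that idea in Python under a deliberately simplified in-memory model. The class, its fields and the `accepts` method are illustrative assumptions, not part of ISO 12620 or the ISOcat API, and `emailAddress` is a hypothetical name for the constrained example, whose name did not survive on the slide.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ComplexDataCategory:
    """Illustrative model of a complex data category (not the DCR data model)."""
    identifier: str
    data_type: str = "string"
    # closed: the allowed values are simple data categories
    values: Optional[FrozenSet[str]] = None
    # constrained: the value must match this pattern as a whole
    constraint: Optional[str] = None

    @property
    def kind(self) -> str:
        if self.values is not None:
            return "closed"
        if self.constraint is not None:
            return "constrained"
        return "open"

    def accepts(self, value: str) -> bool:
        if self.values is not None:
            return value in self.values
        if self.constraint is not None:
            return re.fullmatch(self.constraint, value) is not None
        return True  # open: any value of the data type is allowed


written_form = ComplexDataCategory("writtenForm")
grammatical_gender = ComplexDataCategory(
    "grammaticalGender",
    values=frozenset({"neuter", "masculine", "feminine"}),
)
# Hypothetical identifier; only the constraint itself appears on the slide.
email_address = ComplexDataCategory("emailAddress", constraint=r".+@.+")
```

With these definitions, `grammatical_gender.accepts("neuter")` holds, while `email_address.accepts("isocat")` does not, because the value lacks the `@` required by the constraint.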
Data category relationships
• Value domain membership
• Subsumption relationships between simple data categories
• Relationships between complex data categories are not stored in the DCR
[Diagram: partOfSpeech (string) with the value pronoun, which subsumes personalpronoun]
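Value domain membership and subsumption can be pictured as two small lookup tables. The following Python sketch assumes, as the diagram suggests, that pronoun is in the value domain of partOfSpeech and that personalpronoun is subsumed by pronoun. The table names and both functions are hypothetical, and counting a subsumed value as a member of the value domain is a design choice of this sketch, not documented DCR behaviour.

```python
from typing import Dict, List, Set

# simple data category -> the simple data category that subsumes it
SUBSUMED_BY: Dict[str, str] = {"personalpronoun": "pronoun"}

# closed complex data category -> simple data categories in its value domain
VALUE_DOMAIN: Dict[str, Set[str]] = {"partOfSpeech": {"pronoun"}}


def subsumers(simple_dc: str) -> List[str]:
    """Return the chain of simple data categories that subsume simple_dc."""
    chain: List[str] = []
    seen = {simple_dc}
    current = simple_dc
    while current in SUBSUMED_BY:
        current = SUBSUMED_BY[current]
        if current in seen:  # guard against accidental cycles
            break
        seen.add(current)
        chain.append(current)
    return chain


def in_value_domain(complex_dc: str, simple_dc: str) -> bool:
    """True if simple_dc, or one of its subsumers, is in the value domain."""
    domain = VALUE_DOMAIN.get(complex_dc, set())
    return simple_dc in domain or any(s in domain for s in subsumers(simple_dc))
```

Here `in_value_domain("partOfSpeech", "personalpronoun")` succeeds only through the subsumption link to pronoun.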
Data category specification
• Administration Information Section
• Description Section
– Data Element Name
– Language Section
  • Name Section
• Conceptual Domain
• Linguistic Section
– Conceptual Domain
Mandatory:
1. A mnemonic identifier
2. An English definition
3. An English name
4. A conceptual domain
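The four mandatory parts lend themselves to a simple completeness check. This is a minimal Python sketch that assumes a specification is held as a flat dictionary; the key names are invented for illustration and do not correspond to DCIF element names.

```python
from typing import Dict, List

# Hypothetical keys standing in for the four mandatory parts listed on the slide.
MANDATORY = (
    "identifier",
    "english_definition",
    "english_name",
    "conceptual_domain",
)


def missing_mandatory(spec: Dict[str, str]) -> List[str]:
    """Return the mandatory fields that are absent or blank in spec."""
    return [field for field in MANDATORY if not spec.get(field, "").strip()]


draft = {
    "identifier": "partOfSpeech",
    "english_name": "part of speech",
    "english_definition": "",  # still to be written
    "conceptual_domain": "closed",
}
```

For the draft above, `missing_mandatory(draft)` reports only the empty English definition.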
Guidelines for data categories (I)
• Identifier:
– camel case and XML-valid element name (without a namespace)
  • valid: partOfSpeech
  • invalid: my:POS (namespace prefix), 123POS (starts with a digit)
• Data Element Name:
– language-independent name for the data category used in a specific application domain (specified in the source)
  • PoS in TBX
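The identifier guideline can be approximated with a regular expression. The Python sketch below assumes a strict ASCII reading of "camel case": a lowercase first letter followed by letters and digits only. Real XML names allow more characters, so this is a narrower check than XML validity, and the function name is hypothetical.

```python
import re

# Lowercase first letter, then ASCII letters and digits only.
# This excludes namespace prefixes (no ':') and names that start with a digit.
_CAMEL_CASE = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_identifier(identifier: str) -> bool:
    """Check an identifier against the simplified camel-case rule above."""
    return _CAMEL_CASE.fullmatch(identifier) is not None
```

Under this rule partOfSpeech passes, while my:POS and 123POS are rejected, matching the examples on the slide.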
Guidelines for data categories (II)
• Name Section in a Language Section
– legible name
  • ‘part of speech’ in the English language section
  • ‘partie du discours’ in the French language section
• Definition:
– intensional definitions (ISO 704)
– should consist of a single sentence fragment
• Source:
– add a source for any quoted material
Guidelines for data categories (III)
• Justification:
– a simple statement justifying the relevance of the data category to the field of language resources
– especially needed for standardization
Private versus standard
• The standard subset of data categories in the registry should be coherent
• This coherence is guarded by the Thematic Domain Groups and the DCR Board
• Standard data categories must meet more constraints than private ones:
– mandatory justification
– DC relations demand profile overlap
– …
Data Category Selections
• Anyone
1. can register with ISOcat
2. can create data categories
3. can create data category selections (DCSs)
4. can share DCSs
5. can make DCSs public
6. may submit DCSs for standardization
Profiles versus DCSs
• Profile membership is part of the DC specification
– the profile indicates the thematic domain of the DC
– the profile view in the UI is created by a query
– there is a limited number of profiles
• A DCS is a collection of DCs
– hand-picked by a user for a specific purpose
– can contain DCs from various profiles
– there can be an unlimited number of DCSs
• There isn’t (yet) a profile-specific view of a DCS
ISO standardization process
[Diagram: the standardization workflow. Actors: Submission group, Thematic Domain Group, Data Category Registry Board, Stewardship group, ISO. Stages: Evaluation, Validation, Publication]
Submission group
• The owner, possibly together with a group of users, who submits a DCS for standardization
• The data categories in the selection should already meet the stricter constraints for standardized data categories (as far as possible)
– justification
– profile(s)
– …
Thematic Domain Groups
TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification
• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of judges (assigned by SC P-members)
• A number of expert members (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by an SC
Harmonization
• When a DC belongs to multiple profiles belonging to different TDGs, harmonization may be needed
– one TDG becomes the owner of the DC
– judges from the other TDG(s) are involved in the evaluation process
Stewardship group
• Members of the TDG who will maintain the data category
• The TDG becomes the owner of a standardized data category
• Changes to the data category need to go through the standardization procedure (evaluation by the TDG, validation by the DCR Board)
Using data categories (I)
• Each data category has a Persistent Identifier (PID):
http://www.isocat.org/datcat/DC-1297
– once a data category has been created it can never be deleted, only deprecated or superseded
– the registration authority of ISO 12620 is obliged to keep these URLs working
Using data categories (II)
• This PID can be embedded in the schemata of linguistic resources:
– CMD
<CMD_Element name="Role" ValueScheme="string" ConceptLink="…/DC-1234">
– Relax NG
<rng:element name="gender" dcr:datcat="…/DC-1297">
– XML Schema, TEI ODD, TBX, RDF, XML, …
• DC Reference vocabulary:
– http://www.isocat.org/12620/
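Once PIDs are embedded in a schema, a tool can collect them with an ordinary XML parser. The Python sketch below reads a small Relax NG fragment and maps element names to their data category references. It assumes that the `dcr` prefix is bound to the DC Reference vocabulary URI given on the slide, and it writes the gender reference out in full using the PID shown earlier in this tutorial. The fragment and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumption: the dcr prefix is bound to the DC Reference vocabulary URI.
DCR_NS = "http://www.isocat.org/12620/"

# A made-up Relax NG fragment carrying one DC reference.
RNG_FRAGMENT = """\
<rng:grammar xmlns:rng="http://relaxng.org/ns/structure/1.0"
             xmlns:dcr="http://www.isocat.org/12620/">
  <rng:start>
    <rng:element name="gender"
                 dcr:datcat="http://www.isocat.org/datcat/DC-1297">
      <rng:text/>
    </rng:element>
  </rng:start>
</rng:grammar>
"""


def dc_references(xml_text: str) -> Dict[str, str]:
    """Map the name of each annotated element to its data category PID."""
    root = ET.fromstring(xml_text)
    attr = "{%s}datcat" % DCR_NS  # ElementTree's {namespace}local notation
    return {
        el.get("name"): el.get(attr)
        for el in root.iter()
        if attr in el.attrib
    }
```

Calling `dc_references(RNG_FRAGMENT)` yields a single entry that links the gender element to DC-1297.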
Using data categories (III)
• The full data category specification can be downloaded from ISOcat in the Data Category Interchange Format (DCIF)
– DCIF is based on a simplified version of the DCR data model, and leaves out some administrative information
– DCIF vocabulary:
  • http://www.isocat.org/12620/
Usage scenarios
• DC references only:
– find semantic overlap between two or more resources by comparing their DC references
• DC references and a schema/component registry:
– find interesting resources (or resource types) by comparing the DC references of schemas/components in the registry
• DC references and a network of registries:
– find directly or indirectly related resources via related DCs
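The first scenario, comparing DC references, amounts to set arithmetic on PIDs. A minimal Python sketch follows, assuming each resource has already been reduced to the set of PIDs its schema refers to. The two resources are invented, and DC-9001 and DC-9002 are placeholder numbers chosen for illustration, not known ISOcat entries.

```python
from typing import Set

BASE = "http://www.isocat.org/datcat/"

# Invented resources; DC-9001 and DC-9002 are placeholder numbers.
lexicon_refs: Set[str] = {BASE + "DC-1297", BASE + "DC-9001"}
corpus_refs: Set[str] = {BASE + "DC-1297", BASE + "DC-9002"}


def shared_dc_references(a: Set[str], b: Set[str]) -> Set[str]:
    """Data categories that both resources refer to."""
    return a & b


def overlap_ratio(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared references divided by all references."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The Jaccard index is only one possible overlap measure; it is used here because it stays between 0 and 1 regardless of how many references each resource has.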
Relation Registry
• ISOcat contains a ‘flat’ list of concepts
• The Relation Registry will support storing (user-specific) relations between these concepts
– is-a
– part-of
– equivalent-to
– related-to
– …
Will support:
1. Ontologies and taxonomies on top of data categories
2. Searches across related data categories
3. …
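Since the Relation Registry is described here only as planned, the following Python sketch is purely an illustration of the idea: typed relations stored on top of a flat list of concepts, plus a search that follows selected relation types. The class, its methods and the example data categories `myPronoun` and `determiner` are hypothetical, and following every relation in both directions is a simplification made for searching.

```python
from collections import defaultdict, deque
from typing import DefaultDict, List, Set, Tuple


class RelationRegistry:
    """Toy store of typed relations between data category identifiers."""

    def __init__(self) -> None:
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, source: str, relation: str, target: str) -> None:
        # Stored in both directions so a search can start from either end.
        self._edges[source].append((relation, target))
        self._edges[target].append((relation, source))

    def related(self, start: str, relations: Set[str]) -> Set[str]:
        """All data categories reachable from start via the given relation types."""
        seen = {start}
        queue = deque([start])
        while queue:  # breadth-first traversal
            node = queue.popleft()
            for relation, other in self._edges[node]:
                if relation in relations and other not in seen:
                    seen.add(other)
                    queue.append(other)
        seen.discard(start)
        return seen


registry = RelationRegistry()
registry.add("personalpronoun", "is-a", "pronoun")
registry.add("pronoun", "equivalent-to", "myPronoun")  # hypothetical private DC
registry.add("pronoun", "related-to", "determiner")    # hypothetical relation
```

A search from personalpronoun over the is-a and equivalent-to relations reaches pronoun and myPronoun but not determiner, which is linked only by related-to.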
Registry network
[Diagram: three layers. Linguistic resources: MPI archive, TDS database, resource. Data category registries: MPI DCR, ISO DCR. Relation registries: Typological Database System RR, MPI RR]
Status of ISOcat
• ISOcat is under active development:
– Now:
  • You can access public data categories and selections
  • You can create your own data categories and selections
  • You can share your data categories and selections with others (everyone, or a specified group)
– Future:
  • Some social features (forum to discuss specific data categories)
  • Cleanup of profiles by TDGs
  • Import external ‘data category’ sets, such as:
    – parts of the ISO Concept Database
    – Dublin Core
    – TEI
  • Standardization workflow
  • High availability (mirrors)
  • Relation registry
Thanks for your attention!
http://www.isocat.org/
isocat@mpi.nl
Menzo.Windhouwer@mpi.nl