Post on 31-Mar-2015
ISOcat Data Model: Workflow & Guidelines
Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
Marc.kemps-snijders@mpi.nl, sellenwright@gmail.com, menzo.windhouwer@mpi.nl
NEERI, Helsinki
Standards Workshop, 2009-09-30
www.isocat.org
Data category
The result of the specification of a given data field
A data category is an elementary descriptor in a linguistic structure or an annotation scheme (ISO 1087-2)
Linguistic data categories: /part of speech/, /noun/, /verb/, /definition/, /context/, etc.
© DCR Group 2009
Data Category Applications
DCs are used as:
  Field names in databases
  Permissible values for closed and constrained data categories
  Tag names and attribute values in annotation frameworks
DCs are used by:
  Different broad thematic domains (e.g., terminology, morphosyntax, lexicography, etc.)
  Different communities of practice within a given domain
Data category selections exist as:
  Resource tag sets (e.g., tagsets used in major corpora)
  Standardized sets of field names and values (e.g., TBX Basic, TBX [ISO 30042])
Data category
TC 37 practice treats both data fields and enumerated domain values as data categories:
  Open data categories: e.g., term, which can take any value designated as a term
  Closed data categories: e.g., grammatical gender, which takes a set of enumerated values as its content
  Constrained data categories: e.g., Olympic years, which takes as its content values defined by a formal constraint (i.e., every fourth year starting from a certain date)
  Simple data categories: e.g., masculine, member of an enumerated value domain
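The "Olympic years" constraint can be written as a small predicate. This is only an illustrative sketch: the function name is made up for this example, and the start year 1896 is an assumption, since the slide only says "starting from a certain date".

```python
def is_olympic_year(year: int, first_year: int = 1896) -> bool:
    """Check the 'every fourth year starting from a certain date' constraint.

    first_year is an assumed starting point, not taken from the slides.
    """
    return year >= first_year and (year - first_year) % 4 == 0
```

A constrained data category accepts any value that passes such a test, whereas a closed data category accepts only values from an enumerated list.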
Data category types
[Diagram: two complex data categories]
  Closed: grammaticalGender (enumerated string), with the simple data categories neuter, masculine, and feminine
  Constrained: emailAddress (constrained string), with the constraint .+@.+
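The two complex data categories in the diagram can be checked along these lines. This is a minimal sketch, not ISOcat code: the function and constant names are hypothetical, and treating the constraint `.+@.+` as a regular expression that must match the whole value is an assumption.

```python
import re

# Closed data category: the value domain is an enumerated set of simple DCs.
GRAMMATICAL_GENDER_VALUES = {"neuter", "masculine", "feminine"}

# Constrained data category: the constraint shown on the slide, read as a regex.
EMAIL_ADDRESS_CONSTRAINT = re.compile(r".+@.+")


def is_valid_grammatical_gender(value: str) -> bool:
    """A closed DC accepts only values listed in its value domain."""
    return value in GRAMMATICAL_GENDER_VALUES


def is_valid_email_address(value: str) -> bool:
    """A constrained DC accepts any value satisfying the formal constraint."""
    # fullmatch: the entire value must satisfy the pattern (an assumption).
    return EMAIL_ADDRESS_CONSTRAINT.fullmatch(value) is not None
```

The constraint is deliberately loose: it only requires at least one character on each side of the `@` sign.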
Data category relationships
Value domain membership
Subsumption relationships between simple data categories
Relationships between complex data categories are not stored in the DCR
[Diagram: partOfSpeech (enumerated string) and pronoun]
Data Category Registry (DCR)
Set of data categories to be used as a reference for the definition of linguistic annotation schemes or any other formats used in the area of language resources
Implemented as the TC 37 ISOcat registry
Registration Authority: Max Planck Institute for Psycholinguistics, Nijmegen
Open and accessible at: http://www.isocat.org
Come play with the cat!
But he’s a bit fussy and likes to have people follow some simple rules!
Simple rules are spelled out in the DCR Guidelines.
ISOcat model and mission & metaphor
Not a layered onion …
A segmented aggregate, like a knob of garlic, instead:
  “Cloves” are sets of private data categories
  The center stem represents the standardization core
  Many DCs and DCSs may never be intended for standardization
  Only the standardized core is described in ISO 12620:2009
  Need to define non- and pre-standardization procedures
ISOcat Data model
The ISO 12620 data model consists of 3 main parts:
  Administrative part
    Administration and identification
  Descriptive part
    Documentation and information for working language or languages
    Data element names and identifiers
    Data element concept definitions
  Linguistic part
    Conceptual domain of object language
    Data element type declarations
    Special object language constraints
Data Model and DC life cycle
Part 1 of the ISOcat data model reflects the DC standardization cycle
Major steps in the workflow = classes in the DC model
But the creation cycle precedes standardization
A DC must be created, and ideally discussed in a group, before the standardization process even begins.
Not all DCs will be standardized.
The process starts out here, and we need to define this process.
Non- & Pre-Standardization Workflow
DC created in private work space
  Option: DC remains private
  Option: assign DC to a group
  Option: DC discussed & revised in the group to achieve consensus
  Option: DC used in group
  Option: DC used widely by public group
  Option: DC submitted for standardization
Standards process starts with submission
[Diagram: the DCR, with a Standardized Core and multiple DCSs]
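One way to picture the options listed on this slide is as an ordered series of stages, any of which may be the last one a DC ever reaches. The sketch below is purely illustrative: the stage names paraphrase the slide, and the strictly linear ordering is an assumption, since the slide does not say whether stages may be skipped.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage names paraphrasing the slide's options."""
    PRIVATE = 0
    ASSIGNED_TO_GROUP = 1
    DISCUSSED_IN_GROUP = 2
    USED_IN_GROUP = 3
    USED_BY_PUBLIC_GROUP = 4
    SUBMITTED_FOR_STANDARDIZATION = 5


def advance(stage: Stage) -> Stage:
    """Move a DC to the next stage; every step is optional for the owner."""
    if stage is Stage.SUBMITTED_FOR_STANDARDIZATION:
        # From here on the standards process takes over.
        raise ValueError("the standards process starts with submission")
    return Stage(stage + 1)
```

A DC that never calls `advance` simply remains private, which matches the first option on the slide.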
Cascade of Responsibility
ISOcat Model
  Design by the ISOcat development group
  Approved by TC 37
  Standardized in ISO 12620:2009
ISOcat input template, interface presentation
  Implementation by the ISOcat programmer/system administrator
  Approved by development group
  Scrutiny of beta testers, user community
ISOcat Guidelines for data category specifications
  http://www.isocat.org/manual/DCRGuidelines.pdf
  Instantiation by the individual expert user
  Scrutiny by other users, eventually by DCR TDGs/DCRB
ISO 12620 DCR: Overview
Three parts; linchpin: Data Category
[UML overview diagram combining Figure 4 (the administration part), Figure 5 (the description part), and Figure 6 (the linguistic part). Classes shown: Data Category, Global Information, Administration Information Section, Administration Record, Registration Group, Submission Group, Stewardship Group, Decision Group, Change Section, Description Section, Language Section, Name Section, Definition Section, Example Section, Explanation Section, Data Element Name Section, Linguistic Section, Conceptual Domain, Value Domain. The remaining class labels are truncated in the source.]
Part 1: Global Information & Administration Information Section
[UML diagram (Figure 4, the administration part). Classes shown: DCR, Data Category, Global Information, Administration Information Section, Administration Record, Registration Group, Submission Group, Stewardship Group, Decision Group, Change. The multiplicity labels (0..1, 1, 1..*) cannot be matched to their associations in this transcript.]
Identifiers – Responsibilities
Required
Justification – Creator Responsibility
Justification for /part of speech/:
  The justification for part of speech is obvious, but that is not true of every DC for every potential user.
Required for standardization
Highly desirable for any DC that will be shared outside a private scope
Required
Administration Information Section
Implementation of the standardization workflow
Embodied in the information workflow associated with the standardization process
Standardized in ISO 12620:2009 in compliance with ISO Directives Annex ST for Standards as Databases
Represented by the flowchart in slide 19/20
Responsibility:
  Thematic Domain Groups (TDGs), which act as stewards in maintaining data category specifications (DCs) and data category selections (DCSs)
  Data Category Registry Board (DCRB), which validates DCs and DCSs and endeavors to harmonize among TDGs
Data category: The standardization option
Data categories can be kept private or submitted to the standardization process, in which case they are assigned to a Thematic Domain Group, which judges them.
[Diagram: DCR Board and Thematic Domain Groups: TDG metadata, TDG morphosyntax, TDG terminology, TDG …]
At regular intervals, snapshots of the standardized subset of the DCR will be submitted to ISO to form a “standard as database” according to Annex ST of the ISO/IEC Directives.
TDG Role: Maintenance Team
DCRB Role: Validation Role
Part 2: Descriptive Part
[UML diagram (Figure 5, the description part). Classes shown: Data Category, Description Section, Language Section, Name Section, Definition, Example, Explanation, Data Element Name. The multiplicity labels (0..*, 1, 1..*) cannot be matched to their associations in this transcript.]
Diagram annotations:
  Describes equivalents in working languages; English data element name, definition, and justification comment required
  Database-, format-, or application-specific data element names
  Rigorous terminological definition consisting of a single sentence fragment linked to a logical concept system
Part 2: Guideline Responsibilities
Data Element Name:
  Language-independent name for the data category used in a specific application domain (specified in the Source)
  PoS / POS / pos are all common short forms used for /part of speech/ in various application environments.
Name Section in a Language Section
  (Min. one required in English Language Section)
  (Multiple in multiple Language Sections permitted)
  Human-legible (mnemonic) name
  ‘part of speech’ in the English language section
  ‘partie du discours’ in the French language section
One en Name required.
Multiple Names optional.
Multiple Names in other languages optional.
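The distinction between language-independent data element names and per-language names can be sketched as a small data structure. The class and method names below are hypothetical and do not reflect the actual ISOcat schema; only the rule that at least one English name is required comes from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LanguageSection:
    language: str                       # language code, e.g. "en" or "fr"
    names: List[str] = field(default_factory=list)


@dataclass
class Description:
    # Application-specific, language-independent names such as "PoS".
    data_element_names: List[str] = field(default_factory=list)
    language_sections: List[LanguageSection] = field(default_factory=list)

    def has_required_english_name(self) -> bool:
        """True if some English language section holds at least one name."""
        return any(
            section.language == "en" and len(section.names) > 0
            for section in self.language_sections
        )


part_of_speech = Description(
    data_element_names=["PoS", "POS", "pos"],
    language_sections=[
        LanguageSection("en", ["part of speech"]),
        LanguageSection("fr", ["partie du discours"]),
    ],
)
```

Here `part_of_speech.has_required_english_name()` holds, while a description with only a French language section would fail the check.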
Part 2: Guideline Responsibilities
Definition:
  Rigorous intensional definitions (ISO 704)
  Single sentence fragment
  Additional information in comments fields, justification, etc.
Example:
  Die Klasse von Wörtern einer Sprache (broader concept) … auf Grund der Zuordnung (characteristic) nach gemeinsamen grammatischen Merkmalen (characteristic).
  In English: the class of words of a language (broader concept) … on the basis of assignment (characteristic) according to shared grammatical features (characteristic).
Source:
  The source for any quoted material; here: Wikipedia
Part 3: Linguistic Part
[UML diagram (Figure 6, the linguistic part). Classes shown: Data Category, Complex Data Category, Simple Data Category, Closed Data Category, Open Data Category, Constrained Data Category, Linguistic Section, Closed Linguistic Section, Constrained Linguistic Section, Conceptual Domain, Value Domain, Open Conceptual Domain, Schema Specific Domain, Profile Value Domain, Example, Explanation. The multiplicity labels (0..*, 0..1, 1, 1..*) cannot be matched to their associations in this transcript.]
Data category: Linguistic part
[UML diagram of the linguistic part, repeated with annotations:]
  Complex, constrained and simple data categories are explicitly modeled here
  Constraints for a given object language
  Enumeration of permissible values in closed value domains
Data category: Linguistic part (example)
Data category: /grammatical gender/
  Conceptual domain: /masculine/, /feminine/, /neuter/
    Lists all admissible values for all languages
  Linguistic Section
    Language: fr
    Value Domain: /masculine/, /feminine/
    Lists all admissible values for French
    Linguistic section values must be a subset of the defined conceptual domain.
Data category: /part of speech/, value: /partitive/
  Limited in the Linguistic Section to French
  Issue with the partitive case in Finnish: some values are very language dependent
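The subset rule for linguistic sections can be expressed as a simple set check. The following is a sketch under stated assumptions: the function name and the dictionary layout are invented for illustration, and only the /grammatical gender/ values come from the slide.

```python
from typing import Dict, Iterable, List

# Conceptual domain: all admissible values across all languages.
CONCEPTUAL_DOMAIN = {"masculine", "feminine", "neuter"}

# Linguistic sections: admissible values per object language.
LINGUISTIC_SECTIONS = {"fr": {"masculine", "feminine"}}


def find_violations(
    conceptual_domain: Iterable[str],
    sections: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return, per language, the values that fall outside the conceptual domain."""
    allowed = set(conceptual_domain)
    violations: Dict[str, List[str]] = {}
    for language, value_domain in sections.items():
        extra = sorted(set(value_domain) - allowed)
        if extra:
            violations[language] = extra
    return violations
```

An empty result means every linguistic section is a subset of the conceptual domain, as is the case for the French section above.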
QA Components
Option for ad hoc group validation
TDG approval during standardization
DCRB harmonization & validation
ISOcat Checker
Thank you for your attention
Come play with the cat!
http://www.isocat.org
http://blogs.warwick.ac.uk/jmiles/tag/shadow_of_the_colossus/