Toward Unified Metadata for the Department of Defense1

Arnon Rosenthal, Edward Sciore∗, Scott Renner
The MITRE Corporation, Bedford MA
{arnie, sar}@mitre.org, [email protected]

ABSTRACT

Data sharing within and among large organizations is possible only if adequate metadata is captured. But administrative and technological boundaries have increased the cost and reduced the effectiveness of metadata exploitation. We examine practices in the Department of Defense (DOD) and in industry’s Electronic Data Interchange (EDI) to explore pragmatic difficulties in three major areas: the collection of metadata; the use of intermediary transfer structures such as formatted messages and file exchange formats; and the adoption of standards such as IDEF1X-97.

We realize that in large organizations, a complete metadata specification will need to evolve gradually. We are concerned here with initial steps. We therefore propose a simple framework, for both databases and transfer structures, which can accommodate varying degrees of metadata specification. We then propose some conceptually simple (but rarely practiced) techniques and policies to increase metadata reusability within this framework.

1. INTRODUCTION

The barriers toward shared and integrated data are severe and well known. They include heterogeneous requirements and practices, inadequate tools for documentation, and inadequate will to provide good documentation. Efforts to overcome these barriers through better data administration have had limited success. This paper examines some of the shapes these barriers take within the Department of Defense (DOD), and describes some steps taken by DOD and standards groups to ameliorate those difficulties. Our goal is to describe a vision of a metadata environment that supports data sharing with and within large-scale organizations, and to compare this vision with the current situation in DOD and with some relevant standards.

1 Presented at the IEEE Metadata Conference, Silver Spring, MD, 1997. Proceedings may be at http://www.llnl.gov/liv_comp/metadata/md97.html or from the IEEE home page.

∗ Edward Sciore is also a member of the Computer Science Department, Boston College, Chestnut Hill MA.


In this paper, “data sharing” will refer both to situations where there is, in effect, an integrated database accessible to multiple applications (sometimes called integration), and also to situations where one application or database transmits data for use by another (sometimes called interoperability). In either case the key consideration is dealing with semantic and structural incompatibilities between the participating systems. We distinguish four ways to cope with such incompatibilities, involving successively lower uniformity, and hence greater autonomy and evolvability:

1. Require all systems to be implemented using the same interface schema (i.e., tables, attributes, and attribute details, but not necessarily indexes, tablespaces, etc.)

2. Require systems to communicate via a single, global interface schema, while allowing them to implement local data schemas in other ways.

3. Require systems to communicate via a single, global abstract interface schema. The abstraction omits representation details, such as implementation datatype or measurement units. It may also combine or split attributes; for example, mapping Date to Day, Month, and Year.

4. Allow systems to communicate through explicit correspondences between separately-developed interface schemas.
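The attribute combining and splitting mentioned in the third approach can be illustrated as a pair of mapping functions between a local schema and the abstract interface schema. This sketch, including the function names, is ours and not taken from any DOD system:

```python
from datetime import date

# A local system stores Day, Month, Year as separate attributes;
# the abstract interface schema exposes a single Date concept.
def to_abstract(record: dict) -> dict:
    """Combine local (Day, Month, Year) attributes into the abstract Date."""
    return {"Date": date(record["Year"], record["Month"], record["Day"])}

def from_abstract(record: dict) -> dict:
    """Split the abstract Date back into the local attributes."""
    d = record["Date"]
    return {"Day": d.day, "Month": d.month, "Year": d.year}
```

A mediator composing `to_abstract` on the source side with `from_abstract` on the client side translates between two systems that never agreed on a concrete date representation.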

The first approach works on a small scale; it fails whenever the participants must organize their representation of the world in radically different forms. The second approach works on a medium scale, especially when there is a dominant partner to dictate the global schema. Neither of these approaches is feasible as a complete solution for an organization on the scale of the DOD [RRS96].

The third approach abandons the single concrete interface schema, and uses mediators to translate data from the server’s context to the client’s context [AK91, CHS91, GMS94]. System administrators are required to describe the meaning, structure, and representation of their system’s data in terms of the abstract interface schema. The mediator uses this metadata from the source and client systems to ensure semantically-correct data sharing.

For the DOD, the single interface schema required for the third approach might involve tens of thousands of entity types. It is difficult to believe such a schema could ever be constructed in a single piece. This leaves the fourth approach, in which we have several interface schemas, and explicit correspondences between these schemas. While this approach places the greatest burden on tools, and is beyond the state of current prototypes, it is the only one that matches the scale and the degree of autonomy found in the DOD. Part of our goal is to show how an organization can prepare to collect the system descriptions and correspondences now, even though mediator technology is not yet mature.
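A minimal sketch of the fourth approach: a registry of explicit correspondences between two separately-developed interface schemas, with a value conversion attached to each correspondence. The schema names, element names, and unit conversion are hypothetical, invented purely for illustration:

```python
# Hypothetical correspondence registry: each entry maps an element of
# one interface schema to an element of another, with a conversion
# function that reconciles their representations.
correspondences = {
    ("LogisticsSchema", "acWeightKg"): (
        "PlanningSchema", "grossWt", lambda kg: kg * 2.2046),  # kg -> lb
    ("LogisticsSchema", "tailNumber"): (
        "PlanningSchema", "id", lambda v: v),  # same meaning, same form
}

def mediate(src_schema: str, element: str, value):
    """Translate one value from the source schema into the target's terms."""
    tgt_schema, tgt_element, convert = correspondences[(src_schema, element)]
    return tgt_schema, tgt_element, convert(value)
```

The point of the sketch is that the correspondences, not a single global schema, carry the translation knowledge; a mediator product would consult such a registry rather than hard-coded bridge programs.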


The data administration process responsible for collecting this metadata is a collaboration between an enterprise-wide effort on the one hand, and the individual system builders on the other. The builders must be involved, because they possess the knowledge needed for writing system descriptions. There must be a central enterprise-wide effort, because individual builders will not put forth the effort to build infrastructure for everyone. In fact, builders often see metadata collection as a low-priority, low-return burden. In Section 2 we provide a framework that enables metadata to be specified and reused more easily, and suggest policies and tools that encourage administrators to specify reusable metadata.

There are two typical means for a system to pass data to another: have it converted to the required format (either explicitly or via a mediator), or put it into a standard transfer format (using a file or a message) and have the receiving system pull it out again. From the viewpoint of a database purist, the use of transfer structures should be dropped. However, their use is well-established within the DOD, the EDI community, and others; the impact on existing operations and legacy systems would be too traumatic.2 From a practical viewpoint, transfer structures are notoriously expensive to administer and maintain, and inhibit system flexibility. In Section 3 we show how transfer structures may be accommodated at a more reasonable cost.

Both DOD and various disciplinary groups have addressed requirements for metadata registries that combine all the above information. Where possible, this paper employs ideas and terminology from a draft standard (ISO 11179) [ISO97], elaborating parts of it to enable greater metadata reuse. We found the standard quite informative and useful, but fear that key omissions will greatly reduce its influence. Section 4 discusses this issue, along with problems associated with several other related standards efforts.

To summarize, we are proposing a framework for metadata and metadata administration which:

• describes a form for metadata that will be friendly to reuse and sharing,
• provides the proper incentives to align parochial interests with the global interest, and
• unifies the treatment of metadata describing database and transfer structures.

We also identify some desiderata for any metadata framework, including good short-term return on investment, flexibility to accommodate emerging technologies, and minimum personnel costs.

2 Also, we expect that transfer structures will have continuing use in performance optimizations, as a way of avoiding sending large numbers of similar messages and/or as a means of compressing information to be sent over a narrow channel.


For now, the challenge for user organizations with large scale problems is to do simple things on a large scale. Their next steps must be made despite uncertainties about which long-term approaches, formalisms, standards, and tools will attain success. We therefore focus on metadata-improvement steps that will be valuable in many scenarios and with many different mediators.

Finally, a disclaimer: The DOD is an enormous organization, with voluminous and sometimes inconsistent documentation. Our descriptions here are unavoidably incomplete, and some details may no longer be accurate. The opinions expressed represent only those of the authors. While we have had considerable exposure to the DOD central data administration organization (the Defense Information Systems Agency, or DISA), and to some organizations doing data administration for command and control (C2) and logistics, we judged that the outside metadata community would gain more from a rich, best-efforts description than if we confined discussion to the areas we know best. Our examples are made up rather than based on real systems, in order to avoid long and unnecessarily detailed explanations.

2. METADATA TO SUPPORT DATA SHARING

In this section we propose some metadata constructs to be used for describing data in databases and other structures. One may regard this as a partial specification for a repository or metadata registry for database information; a few parts of this have been prototyped. Several requirements drive our design choices in ways that differ from much previous research:

• reuse of definitions: Reuse leads to uniformity that greatly simplifies sharing. Also, local administrators will participate more fully if reuse can be made easier than reinvention.

• correspondences between definitions: It will frequently be necessary to combine information systems whose data definitions were developed separately. To share between such systems, we need the ability to identify correspondences that support sharing (i.e., “is substitutable for”), rather than correspondences that simply guide more detailed exploration (e.g., “is somehow related”).

• transfer structures and application program interfaces: The metamodel (constructs) and the atomic metadata instances here can be largely the same as those used for describing databases. Reuse and correspondence constructs still apply.

• incentives: The knowledge required for data sharing resides in system-building organizations, not in a central administration group. Incentives must be structured so that parochial interests will be aligned with the global interest.


Each of these issues is addressed in a corresponding subsection.

2.1 METADATA SPECIFICATION FOR REUSE

The knowledge representation literature identifies three major kinds of descriptive information: structure, meaning, and representation. Structure identifies the atomic units of information and the way the system combines them into aggregated units (such as entities and relationships). Meaning refers to the abstract concept corresponding to a data item. Representation refers to the way in which the system implements the item. Other information that repositories often contain (e.g., tracking “owners” and status of definitions) is not discussed here.

Much of the necessary technology for capturing these kinds of metadata in a repository is familiar. It needs to be brought together and adapted to meet the needs of sharing rather than documenting an individual system. The intent is that there be a repository that supports several constituencies: tools that capture metadata; programmers who write applications that touch multiple databases; and mediators that automatically translate data (when sufficient metadata has already been captured). The repository’s data descriptions should cover structured databases, transfer formats, and (eventually) arguments in application program interfaces.

2.1.1 Basic Definitions

Data modeling terminology is varied and often conflicting. To avoid confusion, we adopt a simple model, consisting of three kinds of data element: entities (e.g. Aircraft), properties (e.g. CurrentAirspeed), and domains (e.g. Velocity).3 We do not consider methods, nor inheritance rules, nor data model heterogeneity. Relationships can be treated similarly to attributes.

Both entities and domains may have properties. Entity properties are often called attributes. Domain properties are often called meta-attributes, and are used to interpret the domain’s values. For example, the domain Velocity might have the three properties Datatype, SpeedUnit, and Precision. We assume that all domains have a (possibly implicit) property called Datatype, as a slot to hold the implementation type of the domain (e.g., float, integer).
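This three-kind model can be sketched with simple data structures. The sketch is ours, and the specific meta-attribute values (e.g., knots as the SpeedUnit) are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    name: str
    # Meta-attributes interpret the domain's values. Datatype is assumed
    # to be present (possibly implicitly) on every domain.
    meta: dict = field(default_factory=dict)

@dataclass
class Entity:
    name: str
    properties: dict = field(default_factory=dict)  # property name -> Domain

# The Velocity domain with its three meta-attributes from the text;
# the values themselves are invented for illustration.
velocity = Domain("Velocity",
                  meta={"Datatype": "float", "SpeedUnit": "knots", "Precision": 1})
aircraft = Entity("Aircraft", properties={"CurrentAirspeed": velocity})
```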

3 This definition is not commonly used in the DOD data administration community, where “data element” and “attribute” are often thought of as synonyms. Here we are following the ISO standard.


What is needed is a way to encourage reuse (or refinement) of concepts already known. This can be done by viewing the repository in three complementary (though not completely disjoint) ways: one to describe the structures, another for meanings of data elements, a third to describe representations.

2.1.2 Structure

Database administrators are more familiar with database schemas than with data dictionaries, and knowledge engineering terminology is likely to be quite foreign. Therefore, we want the structural description of the data to look like a database schema, with a little more information attached.

A schema contains several sorts of structural information, split into collection information and type information. (In a relational database, the only collections are sets and they are closely identified with types, i.e., relations. An object database offers more flexibility.) The type information has several parts: each property is identified with a pair (entity, domain); each entity is identified with a set of properties; the entire database has a set of domains that it uses. The following table illustrates some structural metadata for a very simple example about aircraft:

NAME        KIND      INFO
Aircraft    entity    {id, maxSpeed, grossWt, hoursFlown}
Velocity    domain    {speedUnits}
Identifier  domain    {}
Weight      domain    {}
id          property  from Aircraft to Identifier
maxSpeed    property  from Aircraft to Velocity
grossWt     property  from Aircraft to Weight
hoursFlown  property  from Aircraft to Number
speedUnits  property  from Velocity to String

An equivalent model of structural information is as a directed graph. The nodes of the graph denote entities and domains, and the arrows denote properties.
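The graph view can be sketched directly from the aircraft example: each property is an arrow from its owning entity or domain to its value domain. This encoding is our own illustration:

```python
# Structural metadata as a directed graph: nodes are entities and
# domains, and each property is an arrow (source node -> target node).
edges = {
    "id":         ("Aircraft", "Identifier"),
    "maxSpeed":   ("Aircraft", "Velocity"),
    "grossWt":    ("Aircraft", "Weight"),
    "hoursFlown": ("Aircraft", "Number"),
    "speedUnits": ("Velocity", "String"),
}

def properties_of(node: str):
    """All properties whose arrow leaves the given entity or domain."""
    return sorted(p for p, (src, _) in edges.items() if src == node)
```

Note that both the entity Aircraft and the domain Velocity appear as property sources, which is exactly the uniformity the three-kind model provides.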

2.1.3 Meaning

Each data element appearing in a schema will be associated with a meaning. The meaning of a data element can be thought of as a pair, consisting of a Body (which gives the actual meaning) and a Name. The Body may be text, a document URL, or an object identifier (permitting comparisons for “is the meaning object identical”).


The following table gives the (grossly oversimplified) meanings of some elements of our aircraft example:

NAME        BODY
Aircraft    “flying thing”
Velocity    “how fast it goes”
Identifier  “unique id in Navy”
Weight      “its weight or mass”
id          “id of aircraft”
maxSpeed    “max air speed”
grossWt     “max weight, fully loaded”
hoursFlown  “flying time since last repair”
speedUnits  “mph, kph, etc.”

The combination of structural information and meanings resembles what the knowledge representation community calls an ontology. As a side point, we note that ontologies also have classification relationships among the nodes, such as generalization hierarchies. Generalization hierarchies are useful for finding the right definition and for factoring a concept’s definition (e.g., one can define triangle as 3-sided polygon), and are hence very desirable for a repository to support.

An ontological concept has no implied representation; i.e., the definition may be used by systems that implement the concept in different ways. In particular, the meaning of an entity or domain does not imply the existence of any of its properties; the specification of an entity’s (or domain’s) properties is part of its representation. This definition allows metadata sharing that might otherwise be impossible. For example, the Crew_Planning and Maintenance systems both involve aircraft, but have widely different ideas of what properties an Aircraft entity has. By circumventing the need to agree on all properties, the two communities will be able to agree on the same meaning for the Aircraft entity.

An ontological concept also has no context, which means that a concept cannot mean different things to different people, or somehow change its meaning based on the user or state of the system. For example, developers sometimes use a field of a record for multiple purposes; this is not supported within the ontology. (Instead, one could map the physical representation to a view schema in which each field has a specific meaning, and then relate elements of that view to the ontology.)


Current standards and systems restrict the kinds of representations that can be specified; these restrictions correspond to the extremes of our framework. For example: An entity representation in which all properties are fully bound corresponds to a traditional database relation scheme. An entity representation in which all properties are unbound corresponds to a reference scheme. A domain representation in which all specified properties are bound corresponds to what the DOD calls a generic data element [DOD94]. A domain representation in which all specified properties are unbound corresponds to a semantic domain [SSR94].

Our framework avoids arbitrary restrictions on the kinds of representations that can be named and shared. We permit administrators to name and reuse any set of such specifications, such as a partial data element, or even a partial database description. There are many purposes for which such metadata might be used. In general, when two systems agree on something, it may be handy to define a concept that captures the amount agreed, even if there remain some disagreements. For example, in the above table, WeightKg is such a domain. This domain specifies that the value must be in kilograms, but the implementation datatype is unspecified (and will be left to the individual systems).

An advantage of this flexible representation framework is that any degree of property specification is possible, depending on the data administrator’s needs. This promotes metadata sharing — an administrator can register representations in the repository as accurately as possible.
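A partially-bound domain such as WeightKg might be sketched as a set of property bindings in which some values are deliberately left open. The property names here are our own assumptions for illustration:

```python
# A representation as a set of property bindings, where None marks a
# property that is named but left unbound (each system binds it locally).
# WeightKg binds the unit but leaves the implementation datatype open.
weight_kg = {"MassUnit": "kilogram", "Datatype": None}

def is_fully_bound(rep: dict) -> bool:
    """True when every named property has a bound value."""
    return all(v is not None for v in rep.values())

def unbound(rep: dict) -> list:
    """Names of the properties a reusing system must still bind."""
    return sorted(k for k, v in rep.items() if v is None)
```

The fully-bound and fully-unbound extremes of the previous paragraph fall out as the two boundary cases of this one structure.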

2.2 SPECIFYING CORRESPONDENCES BETWEEN DATA ELEMENTS

Data sharing requires some degree of mediation, based on metadata about correspondences between definitions. (For simplicity, we assume that all mediator-relevant definitions are in repositories, and that all the repositories use the same schema constructs for expressing this information; heterogeneous repository structures are a second-order problem, and outside our scope.) The necessary correspondence information is substitutability, roughly defined as the appropriateness of using data conforming to one system’s metadata as the value for an element described in another system. (One can also define substitutability for definitions in registries, based on substitutability of data elements that conform.)

Gathering such information is difficult and costly, but not impossible. Humans are typically needed to confirm whether it is appropriate to assert substitutability4; such assertions are implicit whenever programmers or users share information among databases. Unfortunately, they are generally not recorded as accessible knowledge, due to inadequate repository support.

4 Full automation seems infeasible, and techniques for automated assistance are outside the scope of this paper.


The construct for registering “substitutability” correspondences needs to say more than “Element X can substitute for element Y.” There is a long distance between “I think this textual definition fits my data” and “I am sharing data with three other systems that subscribe to this definition.” One cannot assume that all registered correspondences have met the second criterion, and therefore the repository needs to capture compliance information.

The repository schema must define the form of the assertions. Some additional fields that ought to be supported by any repository include:

• For what purpose (i.e., in what context). Distinctions that are irrelevant to one organization (“Price” versus “Bid Price”) may be critical to another purpose.

• Who says so.

• How certain they are (certainty).

The above limited features are intended to allow administrators to record some knowledge easily, to support discovery of what might be relevant, and exploitation of what is known to be relevant. They are intended as a core for supporting large organizations. If one were supporting schema integration tools over the repository, one would certainly want to record negative assertions (“don’t ask me again, these are different”), assertions about specialization, and perhaps assertions that certain pairs are candidates for further investigation.
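The assertion fields listed above might be registered as a record like the following. All names and values are hypothetical, chosen only to show the shape of such a record:

```python
from dataclasses import dataclass

@dataclass
class SubstitutabilityAssertion:
    source_element: str   # element whose data would be substituted in
    target_element: str   # element it is claimed to substitute for
    purpose: str          # for what purpose / in what context
    asserted_by: str      # who says so
    certainty: float      # how certain they are, 0.0 to 1.0

# A hypothetical assertion a local administrator might register.
claim = SubstitutabilityAssertion(
    source_element="Logistics.acWeightKg",
    target_element="Planning.grossWt",
    purpose="fuel planning",
    asserted_by="Logistics DBA",
    certainty=0.9,
)
```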

One must be aware that overworked system builders may obey the letter but not necessarily the spirit of interoperability rules. We give several examples of (ill-advised) mandates that can create unintended consequences:

• Mandate that all data elements shared among systems must be recorded in standard form in a repository, without provisions for recording the state of uncertainty. Unless there are incentives and tools to facilitate reuse, it may be easier for system builders to record new elements than to reuse existing ones.

• Assume that all registered correspondences are accurate. They may not be – a short textual definition may not adequately capture the assumptions.

• Provide incentives for reuse of data elements, without estimating certainty of compliance. System builders may reuse existing elements, but may impose additional local meanings that prevent sharing.

• Require system builders to provide definitions of a quality sufficient to determine substitutability. Now system builders will try to avoid describing information that is currently local, even if it may someday be useful to others.


A better set of incentives would encourage system-builders to speak honestly. It should be possible for a system builder to say “I agree with that textual definition” (so perhaps this is an opportunity to share data), or “I have checked that I can really substitute data from that source, and/or that source has checked and they will accept data from me.” These statements can be accomplished using the constructs described above.

If the central administration wishes to measure how well each local administrator contributes to data sharing, they need a metric that balances factors of stand-alone quality, reuse, and accuracy of documentation. Stand-alone quality is the current practice; definition reuse is easily measured. To avoid the difficulties listed above, one should give bonuses for higher degrees of certainty, but penalties if the error rate proves higher than indicated by the certainty.

2.3 METADATA FOR TRANSFER STRUCTURES

We now discuss the metadata needed to describe the intermediate transfer structures used to transmit data from one system to another. The key insight is that a transfer structure is a kind of database. As with a database, we need metadata to describe the meaning, representation, and structure of everything in the transfer structure. Fortunately, meaning and representation descriptions are exactly the same as with databases, and can be made with reference to the same repository. This by itself means that we manage one metadata resource, not two, and that unnecessary diversity is avoided.

The main difference is in describing structural information. A database can be described as a collection of relational tables. Transfer messages, however, are usually arranged in a hierarchical format; here, we must be able to describe the hierarchy. To transmit a transfer structure (or a database), one needs externalization functions that convert the contents of a table or message to a string of characters or bits, and internalization functions that recreate the database from the bitstream.
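A minimal externalization/internalization round trip can be sketched as follows; JSON stands in here for whatever encoding a real transfer format would use, and the message contents are invented:

```python
import json

# Externalization: flatten a hierarchical transfer structure to a
# byte string suitable for a file or message.
def externalize(message: dict) -> bytes:
    return json.dumps(message, sort_keys=True).encode("utf-8")

# Internalization: recreate the structure from the byte stream.
def internalize(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

# A hypothetical hierarchical message about one aircraft.
msg = {"Aircraft": {"id": "N123", "maxSpeed": 540, "grossWt": 79000}}
```

The requirement on any such pair is simply that internalization invert externalization, so the receiving system recovers exactly the structure that was sent.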

At present, it is necessary to define and administer the message hierarchy and the preparation/parsing rules by hand. This is not a simple task, and can consume quite a lot of effort. In the future, it should be possible to come up with a canonical hierarchy, or a self-describing structure, to automate the business of arranging the message contents. There is reason to believe that the automated arrangement could be as efficient (in terms of bandwidth) as the present manual approach.

2.4 ADMINISTRATIVE POLICIES AND INCENTIVES

It is now generally accepted that the more accurately and completely a system can specify its metadata, the greater the interoperability it can have with other systems. However, this goal is not always easy to achieve. A database administrator, when specifying metadata, may be uncertain whether data element descriptions really match an existing concept, or whether a new data element can be created that captures the meaning better. There is an unavoidable tension between the goals of sharing metadata and accurately specifying its meaning.

A solution is likely to involve a mix of policies and automated tools. Some of the required tools resemble tools that support software reuse libraries. One needs to provide users easy access for browsing metadata, and to provide information-retrieval style searches to identify definitions that might be relevant. Ideally, one would also allow administrators to inspect existing data and (for more technically-oriented administrators) the code that manipulates it.

We propose that metadata specification tools satisfy the following requirements:

A. Be Incremental

• Accommodate gradual progress. Gather partial metadata as a byproduct of developers’ efforts to interchange high-priority information between pairs of systems. Provide good interoperability services for the parts of systems that are already documented.

• Avoid “everybody together” deadlines on upgrades. When all systems in a family must cut over simultaneously (as when a standard message format is changed), the speed of adaptation may be governed by the slowest development group. Instead, support coexistence using backward gateways and data pumps, versioning and configuration management.

• Insist on good short-term return on investment. Both the shape of the future and the chance of success are too uncertain to justify projects that are strictly long term.

• Stay flexible. Provide a good start toward supporting the emerging mediation technologies, positioning the organization to use whatever products emerge.

B. Minimize Burdens on Individual Administrators (especially within development programs)

• Ask administrators concrete rather than metaphysical questions. A question such as “are these two concepts the same?” may be difficult to answer, because the answer might be 90% true. The more concrete question “would you accept data from this source?” would elicit a more useful answer.

• Make metadata convenient to access. Information-retrieval tools should aid an administrator in finding an existing definition that suits a requirement. Automated tools, such as interface generators, mediators, and query tools, should be able to access and exploit the existing metadata.


• Minimize the granule of specification and evolution. For example, a user should not need to read a complex SQL query to arrange interchange of one additional data element.

• Allow import of composite definitions, not just atomic ones. The unit of import may be different from the unit of specification and change.

• Minimize the effort of data administration, including retraining effort. The metadata structure should be administrator-friendly; automated translators could then produce the representation required by a mediator product.

C. Plan to Serve a Muddled World

While parts of systems will become well behaved (e.g., communicating standardized data through standard interfaces), the scope of our systems will increase. New requirements and the need to collaborate closely with additional systems will guarantee that heterogeneity will persist. Therefore:

• Partially-compliant systems must be able to play. Systems that do not conform to the latest standards will continue to be important. (Some may be legacy systems; others may belong to other organizations.) We suspect that a little wrapping will make almost any system compliant to a small but useful degree.

• Categorization should not be critical. It was never possible to obtain consistency in what was modeled as an entity versus as a relationship. Such ambiguities will continue in our metadata. Sharing should not depend on getting the categorization “right,” and different communities should not be forced to argue about which approach is “right.”

3. CURRENT PRACTICES

This section discusses current practices in the DOD, with a few brief remarks about commercial practice.

The DOD is a very large organization with complex data needs. Organizations within the DOD at many levels in the hierarchy have a surprising amount of autonomy in acquiring information systems and, until recently, in defining separate data schemas to suit their needs. As a result, the DOD is filled with information systems that cannot share data, both because semantically-equivalent items are difficult to identify, and because different names, structures, and representations have been chosen for equivalent items. These failures in data interoperability have been a known problem for many years. Efforts to solve these data sharing problems have included standard message formats, and more recently, a data element standardization program. In this section we describe these two efforts at improved data sharing, note which parts correspond to our metadata framework, and point out where additional or different kinds of administration may be needed.

3.1 STANDARD MESSAGE FORMATS

Standard messages have been used to communicate between DOD information systems for decades. The JINTACCS5 program was created in 1977 as an attempt to link existing command and control systems across service boundaries (e.g., Army to/from Air Force). It is responsible for two families of standard messages: USMTF and TADIL. The USMTF message family is a set of character-based messages with a strong resemblance to EDI. These messages are intended to be man-readable as well as machine-processable. The TADIL message family consists of binary, fixed-length messages used in tactical data links. There are other similar standards, both for communicating within individual services, and with NATO allies.

These message standards were introduced in order to solve interoperability problems. They are widely judged to have failed at this task. The main reasons are the expense of maintaining the standard, the high cost of building and maintaining the interface software which produces and consumes the transmitted messages, and the resulting inflexible architecture of systems which can only communicate through messages that must be defined far in advance. This in no way means that their use can be abandoned. In many cases their continued use is required by international agreement. There is a huge software base that depends on their continued use. There are even advantages. Some messages are prepared once, then multicast to several receivers. Messages also enable communication between systems (e.g., in vehicles) that have not upgraded to modern workstations or even PCs, and cannot afford the overhead of a DBMS.

In this section we describe some problems in the administration and use of standard message formats, and ways that our metadata framework might lead to solutions.

• The bridge software combines functions that should be separate. Frequently a bridge program searches the source system’s database to determine data to extract, transforms the data to match the target schema, transforms attributes to match the detailed target representations, and merges the result into the target’s database. The tight coupling makes such programs very difficult to maintain, and makes it difficult to use mediation. We would recommend separating the extraction, transformation, and merge steps, with a view toward replacing the transformations with automated mediation.

5 JINTACCS: Joint Interoperability of Tactical Command and Control Systems

TADIL: Tactical Digital Information Link
USMTF: United States Message Text Format


• Correspondences are buried in source code. Bridge programs as described above embody a great deal of knowledge of each system, but that knowledge is effectively lost. For example, one project required weeks of analysis using custom tools that reverse-engineered C source with embedded SQL in order to figure out what actually happens in one such bridge connecting two large Air Force applications.

• Transfer structure definitions and stringification code are manually maintained. Ideally, one would use commercial software for transferring between databases. Currently, each message family has different rules for externalization and internalization functions. Character-based messages cannot be easily extended to pass binary data (e.g., images), except through an inefficient uuencode-style encoding mechanism.

• No ad-hoc queries can be accommodated. Transfer structures can be large, and their generation can be slow. A program that generates the entire message does not provide a means of extracting a small amount of up-to-date information from the source.

• Sources and producers may disagree about the need to change widely-used transfer structures. Some DOD organizations have bridges to old versions of transfer structures, and not all the changes have been backward compatible.
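The separation recommended in the first bullet can be sketched concretely. In the toy bridge below, the extract, transform, and merge steps are independent functions, so the transform step could later be handed to an automated mediator. All field names, the unit conversion, and the "releasable" flag are invented for illustration; they are not taken from any actual DOD bridge program.

```python
# Hedged sketch: a bridge decomposed into separable steps, so that the
# transformation step could later be replaced by automated mediation.

def extract(source_rows):
    """Source-side logic only: select the rows to be transferred."""
    return [row for row in source_rows if row.get("releasable")]

def transform(rows, field_map, converters):
    """Rename fields and convert representations to match the target schema."""
    result = []
    for row in rows:
        out = {}
        for src_name, dst_name in field_map.items():
            value = row[src_name]
            if dst_name in converters:          # representation conversion
                value = converters[dst_name](value)
            out[dst_name] = value
        result.append(out)
    return result

def merge(target_db, rows, key):
    """Target-side logic only: upsert the transformed rows."""
    for row in rows:
        target_db[row[key]] = row

# A toy run, standing in for two real systems:
source = [
    {"releasable": True, "ACFT_ID": "F16-001", "ALT_FT": 30000},
    {"releasable": False, "ACFT_ID": "C130-02", "ALT_FT": 12000},
]
target = {}
merge(
    target,
    transform(
        extract(source),
        field_map={"ACFT_ID": "aircraft_id", "ALT_FT": "altitude_m"},
        converters={"altitude_m": lambda ft: round(ft * 0.3048)},
    ),
    key="aircraft_id",
)
print(target)  # {'F16-001': {'aircraft_id': 'F16-001', 'altitude_m': 9144}}
```

Because each step sees only one concern, a mediator product would need to replace only `transform`, leaving the source- and target-side code untouched.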

Many of the same problems arise in the commercial EDI community. Message elements are not yet coordinated with elements in databases, or in business object standards; each participant organization must manually map their databases to files described in terms of the standard elements of a message. With some products, the mapping must be repeated when mapping to another message, even if the same element is used. Furthermore, pairwise negotiation is often needed about the detailed semantics; as a result, dialects have emerged. For example, both Wal-Mart and Kmart are major EDI users, but a supplier who communicates with both will probably have redundant administrative tasks.

3.2 DATA ADMINISTRATION AND DATA ELEMENT STANDARDIZATION

Data administration within the DOD is governed by a policy and set of procedures known as the 8320.1 directive [DOD94]. This policy states that all new and modified information systems in the DOD must be built using only standard data elements. The intent is to eliminate data interoperability problems by eliminating all differences in name, structure, and representation of semantically-equivalent data items.6

Until very recently the emphasis in DOD data administration has been on collecting a critical mass of standard data elements. The intention is that these data elements will be reused in new and modified systems; new elements are to be added only when equivalent elements cannot be found. Every standard data element is tied to an attribute in an IDEF1X data schema7. (IDEF1X is an extended entity-relationship model [Bruce]; see also the standards discussion in Section 4.) The policy defines certain metadata to be collected for each data element (e.g., datatype, maximum length), and prescribes a naming convention to ensure a unique name (which embeds other metadata) for each element. Data elements are added to the standard through an approval process which attempts to assign control (or stewardship) of each data element to the functional group most concerned with its subject area. The approval process is also intended, through a step known as cross-functional review, to ensure that schemas and data elements submitted for review are fully integrated into the global DOD schema before acceptance, eliminating any redundant representations of semantically-equivalent concepts.

6 Problems in entity identification and conflict resolution are largely ignored in 8320.1.
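A minimal sketch can make the prescribed per-element metadata concrete: a record carrying datatype, maximum length, and stewardship, plus a naming rule that embeds other metadata in a unique name. The record fields, the naming rule, and the example element below are all invented for illustration; they are not taken from the 8320.1 manual.

```python
from dataclasses import dataclass

# Hedged sketch of per-element metadata of the kind the policy prescribes.
# Field names and the example element are illustrative only.

@dataclass(frozen=True)
class StandardDataElement:
    name: str          # unique name produced by the naming convention
    definition: str
    datatype: str
    max_length: int
    steward: str       # functional group holding stewardship

def conventional_name(entity: str, attribute: str, class_word: str) -> str:
    """Builds a unique name that embeds other metadata in its structure."""
    parts = (entity, attribute, class_word)
    return "-".join(p.upper().replace(" ", "-") for p in parts)

element = StandardDataElement(
    name=conventional_name("aircraft", "tail number", "identifier"),
    definition="The identifier assigned to an individual aircraft.",
    datatype="CHARACTER",
    max_length=10,
    steward="Logistics",
)
print(element.name)  # AIRCRAFT-TAIL-NUMBER-IDENTIFIER
```

Names built this way are unambiguous but long, which foreshadows the usability complaint raised against 8320.1 naming in Section 4.2.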

The 8320.1 policy was established in 1991. While the DOD has been successful in collecting definitions for approximately 17,000 standard data elements, this has not yet provided major improvements in data interoperability. A consensus also holds that the schema integration that was supposed to occur during the review process has not really succeeded: the current data element repository is known to contain many redundant (and poorly-designed) data elements. However, there is no consensus on what ought to be done next: some hold that with renewed effort and stricter enforcement, the existing process can succeed; others believe that some changes and improvements to the process are needed.

Major software packages (such as R/3 from SAP) also require extensive data administration to connect to an organization’s systems. Yet here too the lack of repositories and standards is costly: the results of data administration cannot easily be shared with other organizations.

4. STANDARDS DISCUSSION

This section discusses several “standards”8 that have a large potential role in DOD data modeling:

• The International Standards Organization standard for data element and other metadata registries (ISO-11179)

• DOD’s guidance for defining data element dictionaries (8320.1)


• The IDEF1X formalism for describing data models

• STEP and CDIF

4.1 DATA ELEMENT REGISTRY STANDARD (ISO-11179)

The standard describes some important meta-properties and behaviors. However, the standard specifies only conceptual behaviors. It does not specify schemas or APIs that a registry must support, and as a result it provides no software interoperability; it does not widen the market in the way that the SQL and ODBC standards did when they catalyzed the explosion of 4GL and application-development tools. We discuss these issues in greater detail below.

The standard does have significant strengths. It provides a very clear tutorial, and very useful guidance in many aspects of specifying a registry’s contents and behavior. It standardizes metadata attributes to be attached to element definitions. It aids politically in getting metadata providers to agree to a form for the metadata they provide (e.g., “We’re following an international standard, so don’t try to redefine it.”) And it enables a certification process for appropriateness of some of the registry’s policies (though for DOD, the benefits of this certification are uncertain).

Reflecting the interests of activists who provide large, public datasets (e.g., census or scientific data), the standard contains features that would make data easier for humans to find and judge, and that would provide bridges for import into user-owned applications. It aims to make the necessary information visible, but does not aim at a high degree of automation. In contrast, DOD is more concerned with peer-to-peer sharing among its components.

For us, the main flaw of the standard is that its scope excludes features that are essential to attracting vendor support. We want the metadata registry to promote the development of an industry, with niches that include repository management software, metadata capture tools, and tools for exploiting the captured metadata. For such an industry to flourish, two things must occur: all the above metadata-driven software must interoperate, to minimize the number of interfaces each vendor must support; and the resulting mix should support information-sharing.

To enable software interoperation, one needs to standardize: a schema for (at least a core subset of) the metadata; a query language (for accessing this information); and a transfer format (for archiving and shipping metadata). The registry standard describes what needs to be shared, but does not provide such concrete mechanisms. The necessary tasks, then, are to choose a query language (e.g., a subset of SQL or ODL) and a transfer format (e.g., CDIF), and within the language’s constructs, to standardize a schema by which registries could be accessed.
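As a rough sketch of what such concrete mechanisms might look like, a core registry schema queryable through a subset of SQL could be quite small. The table layout, element names, and correspondence vocabulary below are invented for illustration; they are not drawn from ISO-11179.

```python
import sqlite3

# Hedged sketch: a core registry schema plus substitutability
# correspondences, accessed through ordinary SQL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data_element (
    element_id  TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    datatype    TEXT,
    max_length  INTEGER
);
CREATE TABLE correspondence (
    from_element TEXT REFERENCES data_element(element_id),
    to_element   TEXT REFERENCES data_element(element_id),
    kind         TEXT  -- e.g. 'substitutable' or 'convertible'
);
""")
con.execute("INSERT INTO data_element VALUES ('e1','ALTITUDE-FEET','INTEGER',6)")
con.execute("INSERT INTO data_element VALUES ('e2','ALTITUDE-METERS','INTEGER',6)")
con.execute("INSERT INTO correspondence VALUES ('e1','e2','convertible')")

# Any compliant tool could then discover sharable elements with plain SQL:
rows = con.execute("""
    SELECT d1.name, c.kind, d2.name
    FROM correspondence c
    JOIN data_element d1 ON c.from_element = d1.element_id
    JOIN data_element d2 ON c.to_element = d2.element_id
""").fetchall()
print(rows)  # [('ALTITUDE-FEET', 'convertible', 'ALTITUDE-METERS')]
```

The point is not this particular layout but that, once a schema and query language are fixed, every repository tool, capture tool, and exploitation tool can interoperate through them.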


The main purpose in populating the repository is not to describe a single database but to share information among systems. For the industry’s tools to support such sharing, the repository standard must specify the form of the substitutability relationships discussed in Section 2.2.

ISO-11179 needs correspondences, so that buyers can have tools for data sharing. With its current scope, it describes information needed for accessing databases, but not for sharing between them; this greatly reduces its utility to our customers. Most organizations in DOD give low priority to their role as data providers to others, and are not clamoring for ways to improve. The need is for data sharing across organizations.

4.2 DOD Directive 8320.1

The role of 8320.1 within the DOD was discussed in Section 3. We now briefly describe its technical underpinnings.

8320.1 is a standard for descriptions of data elements. It parallels the treatment in ISO-11179 rather closely, but contains more specifics. For example, 11179 says that one may define domains for values; 8320 defines numerous useful ones (but imposes central control on domain-creation).

8320.1 shares the weaknesses of 11179 by not supporting (representation-independent) semantic domains, by not having a mechanism to package arbitrary sets of definitions into a reusable unit, and by not having a special construct for substitutability correspondences. In addition, it inhibits reuse by not supporting 11179’s representation-independent “conceptual data elements.” Finally, 8320.1 requires a complex structure for names. While the structure seems well founded (and is reflected in our treatment of semantic domains and representation), it creates unusable names. Now that most users can afford good graphical tools, the information ought instead to be captured only in a data dictionary to be searched and displayed in many ways; in the meantime, usable names should be provided.

4.3 IDEF1X VERSUS UML

DOD standard schemas use IDEF1X, an extended entity-relationship (EER) model. The IDEF1X models have a rich variety of constraint constructs. There are good tools for displaying and manipulating diagrams; LogicWorks’ ERwin seems very popular within DOD. The existing standard (IDEF1X-93) has several serious weaknesses, so a new standard is needed. Omissions from IDEF1X-93 include:

• There are no constructs for mapping between models, i.e., no language for defining views.


• There is no standard meta-schema or even file exchange format to enable interchange of models among tools from different vendors. Ideally, one would be able to ship both entire models and finer-grained descriptions (e.g., of individual entities) to another organization, to aid collaborative design.

• Multiple inheritance is not supported, making it harder to reuse portions of a specification, or to support an object model that does have multiple inheritance.

• There is no connection to process or organizational modeling.

• IDEF1X defines models, but the standard currently has no facilities for tying those models to database schemas. (Some tools can produce database schemas from models, or reverse-engineer models from schemas, but there is no vendor-independent mechanism for recording the connections.)
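The interchange gap noted in the second bullet can be made concrete with a small sketch: an entity description reduced to a neutral, shippable text stream that any tool could re-import. The keys and the example entity below are invented for illustration; this is not an actual IDEF1X meta-schema, only an indication of what a vendor-neutral exchange form would do.

```python
import json

# Hedged sketch of a vendor-neutral interchange form for one entity
# description; structure and content are illustrative only.
entity = {
    "entity": "Aircraft",
    "attributes": [
        {"name": "tail_number", "datatype": "CHAR(10)", "key": True},
        {"name": "altitude_ft", "datatype": "INTEGER", "key": False},
    ],
    "relationships": [
        {"to": "Squadron", "cardinality": "many-to-one"},
    ],
}

stream = json.dumps(entity, indent=2)   # the shippable / archivable form
restored = json.loads(stream)           # what the receiving tool rebuilds
assert restored == entity               # the round trip is lossless
```

With an agreed meta-schema behind such a stream, both entire models and finer-grained descriptions could move between tools from different vendors without loss.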

We note that some of the supporting tools do have constructs for some of the above features. But without a standard, one cannot expect consistency, let alone interoperability.

A major extension to IDEF1X (IDEF1X-97) has been proposed [IEEE96]. It remedies many of the above deficiencies, and gives improved support for object-oriented analysis and design, while preserving an easy transition path from IDEF1X-93. Most of its constructs seem well designed. Unfortunately, it does not seem in sync with major industry trends: de facto standards, and tighter coupling across lifecycle stages. We compare it with an emerging alternative, the Unified Modeling Language (UML).

IDEF1X-97 rightly includes an object model, a mapping (i.e., query) language, and a transfer syntax. Unfortunately, in each case, the specification goes its own way, rather than adopting and extending an existing standard. We note also that it is insufficient for IDEF1X-97 to receive support from data-modeling vendors; the object model and mapping language need support from vendors at many other stages of the software lifecycle. For example, one wants mappings between schemas to be executable (providing database translation), but this requires that the mappings be in SQL or (for some object database vendors) ODL. Furthermore, as the object-relational formalisms used for physical schemas approach EER constructs in richness, the burden of separate formalisms for physical and conceptual specification will become unjustified.

In recent months, UML has become an industry juggernaut for object-oriented analysis and design, winning support from both Microsoft and the Object Management Group. It provides substantial integration across the lifecycle, and seems assured of support from many leading tool vendors, not just those who serve the government. It contains analogs to most of the data modeling constructs of IDEF1X-97; the missing constructs could be provided as extensions to the UML model, and would be candidates for future standardization. Its close connections with the object community indicate that it will remain consistent with their formalisms, and its ancestry guarantees that it suits life-cycle stages beyond data modeling.

One further motivator for the shape of IDEF1X-97 was to minimize the pain of transitioning existing conceptual models to the new standard. However, if one examines the entire software engineering lifecycle, the choice seems to be IDEF1X with UML, or UML alone. The former requires a translation each time a model is imported or exported from conceptual modeling tools, and will inhibit maintaining consistency across stages. The UML-based strategy requires just a one-time translation of existing models (probably automatable). The new formalism should be fairly easy for existing staff to learn, while schools will include UML in their training of future software engineers.

4.4 STEP, CDIF, ETC.

Several existing standards point the way toward hiding the syntax used to transfer designs among systems. Both the Case Data Interchange Format and the STEP format for exchange of engineering product data specify a rule for translating an instance of a given schema to a data stream (or file) that can be shipped or archived. STEP has, and CDIF is developing, a way to see the transferred information as an object database. (Because it was developed early, STEP defined its own database interface.)

The argument for using CDIF seems stronger. It already plays in the CASE arena (and is the basis for transferring UML models); also, its database interface will be OMG-compliant. Mappings from database format to the transfer form are built in, and reused for each schema. This saves software development effort, but may be less efficient than a custom encoding.

5. SUMMARY AND CONCLUSIONS

We have presented some “next steps” that large organizations can take to improve metadata capture and thereby improve data sharing. None requires a technology breakthrough, though several may require significant changes to organizations and incentives. The problems and opportunities were illustrated with examples from Department of Defense practices, augmented by comments that the Electronic Data Interchange world behaves rather similarly.

We adapted the conventional story about required metadata (i.e., descriptions of structures, concepts, and representation) to enhance sharing by allowing flexibility in the granularity of reuse. We also addressed the need both to support a shared pool of definitions, and to incrementally record connections among definitions that were developed independently. A form for such connections was proposed. The treatment can be considered a refinement of existing standards (both ISO-11179 and DOD-8320), to make greater use of semantic domains and to define a relationship type to handle correspondence information.

We also discussed why metadata administration needs to be shared across technology boundaries, to capture metadata about databases, transfer formats, and even arguments in application interfaces. We gave several reasons why intermediate structures will not be replaced by direct communication among applications’ databases, but argued that the intermediate structures could be regarded as temporary databases whose schemas are fairly conventional even though their syntax is peculiar.

After discussing how these concepts play out in practice, we then surveyed some applicable standards. We discussed why, for different reasons, neither the metadata registries standard (ISO-11179) nor the proposed IDEF1X update seems to adequately answer the concerns of software vendors, and hence neither seems likely to be supported by many products. Yet there is much technical insight in both proposals, so they need to be extended (for ISO-11179) or merged into UML (for IDEF1X-97).

The major barrier to adopting our approaches seems to be the weak position of repositories. A new generation of repository products is emerging and, due to low price and web access, seems likely to receive greater acceptance. It would also be highly desirable for vendors and standards committees to devise repository schemas that are friendly to definition reuse and to capturing correspondences.

6. REFERENCE LIST

[AK91] Arens, Y., C. Knoblock, “Planning and reformulating queries for semantically-modeled multidatabase systems.” In Proceedings of the 1st International Conference on Information and Knowledge Management (1992), pp. 92-101.

[Bruce] Bruce, T., Designing Quality Databases With IDEF1X Information Models. NewYork: Dorset House, 1992.

[CHS91] Collet, C., M. Huhns, W. Shen, “Resource Integration Using a Large Knowledge Base in Carnot,” IEEE Computer, 24(12), December 1991.

[DOD94] Department of Defense, Data Element Standardization Procedures, March 1994. DOD 8320.1-M. <http://ssed1.ncr.disa.mil/srp/datadmn/8320-1-m.html>.

[GMS94] Goh, C., S. Madnick, M. Siegel, “Context interchange: overcoming the challenges of large-scale interoperable database systems in a dynamic environment.” Third International Conference on Information and Knowledge Management (1994), pp. 337-346.


[IEEE96] IEEE IDEF1X Standards Working Group, “IDEF1X-97 Conceptual Modeling (IDEF-object) Semantics and Syntax”, 1996.

[ISO97] International Standards Organization (Gilman, ed.), “ISO/IEC 11179 - Specification and Standardization of Data Elements”, http://www.lbl.gov/~olken/X3L8/drafts/draft.docs.html

[RRS96] Renner, S., A. Rosenthal, J. Scarano, “Data Interoperability: Standardization or Mediation” (poster presentation), IEEE Metadata Workshop, Silver Spring, MD, April 1996. http://www.computer.org/conferen/meta96/renner/data-interop.html

[SR96] Seligman, L., A. Rosenthal, “A Metadata Resource to Promote Data Integration”, IEEE Metadata Workshop, Silver Spring, MD, April 1996. http://www.computer.org/conferen/meta96/seligman/seligman.html

[SSR94] Sciore, E., M. Siegel, A. Rosenthal, “Using semantic values to facilitate interoperability among heterogeneous information systems,” ACM Transactions on Database Systems, June 1994.

ACKNOWLEDGEMENTS

This work was sponsored by the US Air Force Electronic Systems Center (ESC) System Integration Office (SIO) under contract AF19628-94-C-0001.

