11/9/2000 Information Organization and Retrieval
Information Structures and Metadata
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Review
• Change of schedule… (Thesauri later)
• Metadata
• Controlled Vocabularies
• Dublin Core
Metadata
• Metadata is:
– “data about data” (from Database)
– Information about Information
– Structures and Languages for the Description of Information Resources and their elements (components or features)
– “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)
Types of Metadata systems and standards
• Naming and ID systems – URLs, ISBNs
• Bibliographic description – MARC, Dublin Core, TEI, etc.
• Music – SMDL
• Images and objects – CIMI, VRA Core Categories
• Numeric Data – DDI, SDSM
• Geospatial Data – FGDC
• Collections – EAD
Controlled Vocabularies
• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.
The problem
• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
• Examples
– from Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)
Conditions of Authorship?
• Single person or single corporate entity
• Unknown or anonymous authors
– Fictitiously ascribed works
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g. translations)
• Related Works
Name Authority Files
Different names for the same person
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name Authority Files
Different people writing with the same name
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
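The way an authority file steers a searcher from a variant name to the established heading can be sketched in a few lines of Python. This is only an illustration: the record below copies the 100 heading and a handful of the 400 "see from" variants from the Creasey record, while `AUTHORITY_RECORDS`, `build_index`, and `resolve_heading` are hypothetical names and not part of any cataloguing system.

```python
# Sketch of a name authority lookup. The data comes from the Creasey record;
# the structure and function names are assumptions made for illustration.
AUTHORITY_RECORDS = [
    {
        "heading": "Creasey, John",  # MARC 100: established heading
        "see_from": [                # MARC 400: variant forms (subset of the record)
            "Cooke, M. E.",
            "Fecamps, Elise",
            "Marsden, James",
            "Wilde, Jimmy",
        ],
    },
]


def build_index(records):
    """Map every heading and variant (case-insensitively) to its established heading."""
    index = {}
    for record in records:
        index[record["heading"].lower()] = record["heading"]
        for variant in record["see_from"]:
            index[variant.lower()] = record["heading"]
    return index


def resolve_heading(name, index):
    """Return the established heading for a name, or None if it is not in the index."""
    return index.get(name.strip().lower())


INDEX = build_index(AUTHORITY_RECORDS)
print(resolve_heading("Fecamps, Elise", INDEX))
```

A flat one-name-to-one-heading index like this handles "different names for the same person" only. The pseudonym J. J. Marric was used by both Creasey and Butler, so a fuller model would have to map a name to a list of candidate headings.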
Other Types of Controlled Vocabularies
• Gazetteers (Geographic Names)
• Code lists (e.g. LC Language Codes)
• Subject Heading Lists
• Classification Schemes
• Thesauri
Today
• SGML
• XML
• DTDs
• Document Markup
• Uses of XML
SGML & XML
• What is SGML/XML?
• Document Type Definitions
• Document Markup
• Sources and Resources
What is SGML/XML?
• A. SGML stands for Standard Generalized Markup Language
– XML stands for eXtensible Markup Language
• B. What it is NOT:
– Not a visual document description
– Not an application specific markup
– Not proprietary
What is SGML/XML?
• What it is:
– An international standard (SGML: ISO 8879:1986)
– A generic language for describing the structure of documents, and markup that can be used for those documents
– Intended for generating markup for content rather than form elements
• XML is a simplified subset of SGML (a W3C Recommendation since February 1998)
The Documents of Commerce
• Customer Profiles
• Vendor Profiles
• Catalogs
• Datasheets
• Price Lists
• Purchase Orders
• Invoices
• Inventory Reports
• Bill of Materials
• Contracts
• Credit Reports
• Bank Statements
• Proposals
• Directories
• Transportation Schedules
• Receipts
Source Dr. Robert J Glushko
Alternatives for Exchanging Documents
[Diagram: a spectrum running from format based to API based]
• HTML (format based) – Publish information for a universal client
• EDI – Batch and high-volume exchange between trading partners
• CORBA / COM (API based) – Application Integration
Source Dr. Robert J Glushko
Limitations of each Exchange Model
[Diagram: the same spectrum, from format based to API based]
• HTML (format based)
– Formatting markup “for eyes”
– “Scrape and hope” integration
• EDI
– Must be pre-arranged
– High cost
– Rigid and inflexible
• CORBA / COM (API based)
– Pre-wired
– Heavyweight to implement
– Not native to the web
Source Dr. Robert J Glushko
Having our Cake and Eating it Too
We need:
• the precision of APIs
• the simplicity of HTML
Source Dr. Robert J Glushko
XML to the Rescue (SGML-- and HTML++)
• Extensible Markup Language
– a simplification of SGML, the Standard Generalized Markup Language
– instead of a fixed set of format-oriented tags like HTML, XML allows you to create the schema -- whatever set of tags are needed -- for your information type or application
– this makes any XML instance “self-describing” and easily understood by computers and people
• Version 1.0 ratified by W3C in 2/98; backed by Microsoft, Sun, Netscape, many others
Source Dr. Robert J Glushko
Why XML is Revolutionary
• XML enables a business to preserve any “document type” or “database schema” when it publishes on the Web
• XML enables a business to send self-describing “business messages” that can be understood by programs, not just “by eye”
• This information cannot be encoded in HTML
• XML-encoded information is smart enough to support new classes of Web applications
Source Dr. Robert J Glushko
XML Enables New Web Applications
• Data interchange between Web clients
– use Web for application integration without information loss (example: product information in supply chain, EDI)
• Moving processing from server to client
– reduce network traffic and server load (example: download airline schedule, find best flights without “back-and-forth” thrashing)
Source Dr. Robert J Glushko
XML Enables New Web Applications
• Multiple client-side views of same data
– expert and novice versions
– manager and worker versions
– localization (currency or measurement conversions)
• “Information push” from personalized applications
– selecting information based on user preferences (example: custom news feed by matching article keywords against user profile)
Source Dr. Robert J Glushko
The First Generation Web
.. making information accessible through browsers
[Diagram: Computers connected to Browsers by scripts and HTML]
• Eyeballs only
• No automation
• Limited integration
Source Dr. Robert J Glushko
HTML Airline Schedule Seen “By Eye”
Airline Schedule
Flight Information
United Airlines #200
• San Francisco
• 9:30 AM
• Honolulu
• 12:30 PM
• $368.50
Source Dr. Robert J Glushko
HTML Airline Schedule Seen “By Computer”
<Title>Airline Schedule</Title>
<Body>
<H2>Flight Information</H2>
<H3>United Airlines #200</H3>
<UL>
<LI>San Francisco
<LI>9:30 AM
<LI>Honolulu
<LI>12:30 PM
<LI>$368.50
</UL>
</Body>
Source Dr. Robert J Glushko
Next Generation Web
.. making information and services accessible to computers (and people)
[Diagram: Computers connected to Computers by Java and XML]
• Structured searches
• Agents
• New models
Source Dr. Robert J Glushko
Airline Schedule in XML
<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
Source Dr. Robert J Glushko
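To make "seen by computer" concrete, the following Python sketch reads the airline schedule with the standard library's `xml.etree.ElementTree` and pulls out fields by name rather than by position on the page. The function name `read_segments` and the dictionary keys are assumptions chosen for illustration; attribute values use straight quotes, as XML requires.

```python
import xml.etree.ElementTree as ET

# The airline schedule from the slide, with straight-quoted attribute values.
SCHEDULE_XML = """
<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
"""


def read_segments(xml_text):
    """Return one dictionary per <Segment>, using element names as field names."""
    root = ET.fromstring(xml_text)
    segments = []
    for segment in root.findall("Segment"):
        price = segment.find("Price")
        segments.append({
            "flight": segment.get("Id"),
            "origin": segment.findtext("Origin"),
            "destination": segment.findtext("Destination"),
            "price": float(price.text),
            "currency": price.get("Currency"),
        })
    return segments


SEGMENTS = read_segments(SCHEDULE_XML)
print(SEGMENTS[0]["destination"], SEGMENTS[0]["price"], SEGMENTS[0]["currency"])
```

Doing the same with the HTML version would mean guessing that the third list item is the destination and the fifth is the price, which is the "scrape and hope" integration described earlier.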
XML is a Foundation for Interoperability
.. exchange information in an application and vendor neutral format
[Diagram: XML underlies the whole spectrum, from format based (WEB) through EDI to API based (CORBA / COM)]
Source Dr. Robert J Glushko
Industry Initiatives for Information Exchange
• OBI – Corporate Procurement – AMEX, Wal-Mart, National Semi
• OTP – Retail – Mastercard, Mondex
• OFX – Personal Finance – Intuit, Microsoft
• ECOM – Computer Supply Chain – Ingram + 24 largest channel players
• HL/7 – Health care – HP, Major Hospitals, Insurance companies
All will use XML; this list will continue to grow …..
Source Dr. Robert J Glushko
Digital Anarchy - stovepipe protocols
[Diagram: separate protocol “stovepipes” labelled OMG/CBO, HL/7, EDI-lite, OBI, OTP, SCOR, OFX, CIP, VCI, Pinnacles, OPS, Gold, API, IMS, EDI X.12, ICE]
• Narrowly defined
• Semantic Conflicts
• Time to Market / Development costs
Source Dr. Robert J Glushko
Open framework for commerce
[Diagram: a Common Business Language layer shared by industry domains (Computer, Automotive, Pinnacles, HL/7, Health Care, Office, Consumer, Manufacturing, Supply Chain, Appliances), functions (Procure, Retail), and protocols (XML/EDI, OBI, OTP, SCOR, OFX)]
• Shared Semantics
• Extensible and “aggressively interoperable”
Source Dr. Robert J Glushko
Shared Semantics for Time and Location
Shared semantics for location and time in all schemas that need them enable richer “commerce networks” of services:
<TransportSchedule Type="Airline"> ... <Destination>Honolulu</Destination>
<Accommodation Type="Hotel"> ... <Destination>Honolulu</Destination>
<Event Type="Concert"> … <Destination>Honolulu</Destination>
Source Dr. Robert J Glushko
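A small Python sketch can suggest why a shared `Destination` element matters: one query works across documents from three otherwise unrelated schemas. The documents below are an assumption made for illustration. Each is reduced to its root element and the shared element, since the slide elides everything else, and `services_for` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Minimal stand-ins for the three schemas; the content elided on the slide
# is simply left out here (an assumption, not the real document structure).
DOCUMENTS = [
    '<TransportSchedule Type="Airline"><Destination>Honolulu</Destination></TransportSchedule>',
    '<Accommodation Type="Hotel"><Destination>Honolulu</Destination></Accommodation>',
    '<Event Type="Concert"><Destination>Honolulu</Destination></Event>',
]


def services_for(destination, documents):
    """List (root element, Type attribute) for every document naming the destination."""
    matches = []
    for text in documents:
        root = ET.fromstring(text)
        if any(el.text == destination for el in root.iter("Destination")):
            matches.append((root.tag, root.get("Type")))
    return matches


print(services_for("Honolulu", DOCUMENTS))
```

Because all three schemas agree on what `Destination` means, the function needs no per-schema logic.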
Automated Vacation Planning Service
• Book me the cheapest flight to Honolulu the first week of January
• Find a hotel room for the day I arrive
• What concerts are taking place the next day?
Source Dr. Robert J Glushko
The Common Business Language
• Specifies common semantics, common syntax, and message packaging for information held by and exchanged among transaction partners and market participants
• These documents are the interfaces among the commerce components envisioned in the overall eCo architecture being realized in a current ATP project being carried out by CNgroup, CommerceNet, BusinessBots, and Tesserae
• CBL’s focus is on the functions and information that are common to all business domains
Source Dr. Robert J Glushko
CBL and XML
• CBL documents are described by XML DTDs to make them “self-descriptive” and validatable
• CBL builds on existing standard or industry semantics where possible
• Complex descriptions and messages can be composed from primitives
• Domain-specific XML applications can be implemented in “native” form or as “hybrids” for maximal interoperability
Source Dr. Robert J Glushko
Common Business Language Building Blocks
[Diagram: CBL Documents built from five “core” groups]
• Business Forms: Catalog, Purchase Order, Invoice
• Business Descriptions: Vendor, Services, Products
• Measurements: Time, Currency, Weight
• Locale: Address, Country, Language
• Classification: SIC, NAICS, FSC
Source Dr. Robert J Glushko
SGML/XML Structure
• An SGML document consists of three parts:
– The SGML Declaration
– The Document Type Definition (DTD)
– The Document Instance
• An XML document REQUIRES only the document instance, but for effective processing a DTD is very important.
Document Type Definitions
• The DTD describes the structural elements and "shorthand" markup for a particular document type. It defines:
– Names of "legal" elements
– How many times elements can appear
– The order of elements in a document
– Whether markup can be omitted (SGML only)
– Contents of elements (i.e., nested structures)
– Attributes associated with elements
– Names of "entities"
– Short-hand conventions for element tags (SGML only)
DTD Components
• The major components of a DTD are:
– Entity Declarations
– Element Declarations
– Attribute Declarations
Document Type Definitions
• Entity Declarations are a "macro" definition facility for both DTD and Document instance parts.
– General Internal Entity Definitions
<!ENTITY name "substitute string">
referenced by &name;
– General External Entity Definitions
<!ENTITY name SYSTEM "file path">
referenced by &name;
– Parameter Entity Definitions (used only inside DTDs)
<!ENTITY % name "substitute string">
or
<!ENTITY % name SYSTEM "file path">
referenced by %name; or %name
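As a rough illustration of the "macro" behaviour of a general internal entity, the Python sketch below declares one in the internal DTD subset of a tiny document and lets the standard library parser expand it. The entity name `sims` and the memo text are made up for the example. Only internal entities are shown, because the standard library parser does not fetch external entities by default.

```python
import xml.etree.ElementTree as ET

# A general internal entity declared in the internal DTD subset and
# referenced in the document instance as &sims; (name is illustrative).
MEMO_XML = """<?xml version="1.0"?>
<!DOCTYPE memo [
<!ELEMENT memo (#PCDATA)>
<!ENTITY sims "School of Information Management and Systems">
]>
<memo>Posted by the &sims;.</memo>"""

root = ET.fromstring(MEMO_XML)
EXPANDED = root.text  # the parser has already substituted the entity text
print(EXPANDED)
```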
Document Type Definitions
• Element Declarations define the structural elements of a document and its associated markup.
<!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) >
– Omitted tag minimization (the two characters after the element name) indicates whether start-tags or end-tags can be omitted (o) or must be present (-) in the markup. The (o) or (-) indicators are required in SGML but can NOT be used in XML.
Document Type Definitions
• Content model provides a nested structural description of the elements that make up this element, e.g.:
<!ELEMENT memo - - ((to & from), body, close?)>
<!ELEMENT body - O (p)* >
<!ELEMENT p - O (#PCDATA | q)* >
<!ELEMENT q - - (#PCDATA)>
...
– ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order.
Document Type Definitions
• Same Content model in XML
<?xml version="1.0"?>
<!DOCTYPE memo [
<!ELEMENT memo ((to | from)+, body, close?)>
<!ELEMENT body (p)* >
<!ELEMENT p (#PCDATA | q)* >
<!ELEMENT q (#PCDATA)>
…
]>
– Note the XML declaration that opens the “Prolog”
– Note that the & connector on the previous slide is not legal in XML
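One way to read a content model is as a regular expression over the sequence of child element names. The Python sketch below hand-translates the XML memo model `((to | from)+, body, close?)` into such an expression and checks a document's top-level children against it. It is an illustration of the idea only, not a DTD validator: the translation is done by hand, it checks a single element, and `children_match_model` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of ((to | from)+, body, close?): each child name is
# followed by one space so that names cannot run together.
MEMO_MODEL = re.compile(r"(?:(?:to|from) )+body (?:close )?")


def children_match_model(xml_text):
    """True if the root is <memo> and its child sequence fits the content model."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag + " " for child in root)
    return root.tag == "memo" and MEMO_MODEL.fullmatch(sequence) is not None


GOOD = "<memo><to>A</to><from>B</from><body><p>Hi</p></body></memo>"
BAD = "<memo><body><p>Hi</p></body><to>A</to></memo>"
print(children_match_model(GOOD), children_match_model(BAD))
```

The SGML `&` connector (both elements, in either order) has no single-operator equivalent here, which is one reason the XML version of the model falls back to `(to | from)+`.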
Document Type Definitions
• Declared content can be: PCDATA, CDATA, RCDATA, EMPTY
• Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model. (NOT in XML) E.g.:
<!ELEMENT memo - - ((to & from), body, close?) +(fn)>
– says that element fn can appear anyplace in the memo.
Document Type Definitions
• Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes.
Attributes Example
• <!ATTLIST associate_element attribute_name declared_value default_value >
• <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
– In markup of a document:
<memo status="CONFIDENTIAL">
also, because of the default set:
<memo>
would be the same as
<memo status="PUBLIC">
• There are a variety of special defaults and data types that can be given in attribute definitions
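The effect of an enumerated attribute with a default can be mimicked in a few lines of Python. This is a hand-rolled sketch, not a parser: the `ATTLIST` dictionary simply restates the `memo status` declaration, and `effective_attributes` is a hypothetical helper that fills in the default and rejects values outside the declared list.

```python
# Restatement of: <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
# The dictionary layout is an assumption made for this illustration.
ATTLIST = {
    "memo": {
        "status": {"allowed": ("PUBLIC", "CONFIDENTIAL"), "default": "PUBLIC"},
    },
}


def effective_attributes(element, given, attlist=ATTLIST):
    """Apply declared defaults and check enumerated values for one element."""
    resolved = {}
    for name, declaration in attlist.get(element, {}).items():
        value = given.get(name, declaration["default"])
        if value not in declaration["allowed"]:
            raise ValueError(f"{element}/@{name}: {value!r} is not a declared value")
        resolved[name] = value
    return resolved


print(effective_attributes("memo", {}))
print(effective_attributes("memo", {"status": "CONFIDENTIAL"}))
```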
Sample SGML DTD
<!doctype ELIB-TEXTS [
<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->
<!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)>
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
   ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT ID - o (#PCDATA)>
<!ELEMENT ABSTRACT - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
… etc…
]>
XML version
<!DOCTYPE ELIB-TEXTS [
<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->
<!ELEMENT ELIB-TEXTS (ELIB-BIB*)>
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
   ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT ID (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
… etc…
]>
Document Using That DTD
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<ENTRY>February 13 1995</ENTRY>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<ORGANIZATION>California Department of Water Resources</ORGANIZATION>
<SERIES>120-93</SERIES>
<TYPE>bulletin</TYPE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
<TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html </TEXT-REF>
<PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE </PAGED-REF>
</ELIB-BIB>
A More Complex DTD
<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->
<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>
<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
…etc…
More complex DTD (cont.)
<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>
<!-- Component Parts of Variable Data Fields -->
<!-- Numbers & Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?,
  Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*,
  Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*, Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?, Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*, Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*, Fld071*, Fld072*, Fld074*, Fld080?, Fld082*,
  Fld084*, Fld086*, Fld088*, Fld090*, Fld096*)>
<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>
<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*,
  Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)>
<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>
<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?,
  Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)>
…etc…
Complex DTD (cont.)
<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)+)>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>
…etc…
<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>
Document Markup
• All document markup is derived from the DTD for the particular document type.
• The DTD must be referenced in the document using the DOCTYPE declaration:
– <!DOCTYPE name SYSTEM "file_path" >
or
<!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
or
<!DOCTYPE name [doctype_declaration_subset]>
• The doctype_declaration_subset can be any combination of element, entity, and attribute declarations.
XML-Data
• Proposal to W3C for a “schema language” based on XML for describing a large variety of metadata descriptions and structures
• More generally -- as seen in the SGML examples previously -- XML can be used as a record description language for metadata records.
HTML
• HTML was not originally "real" SGML; the DTD was invented after the language.
• It is often more concerned with the form of the output on the screen than with the structural contents of the HTML docs.
• Relies on the application (such as Netscape) to implement interesting actions like hypertext linking.
How can you describe an information-bearing object?
Dublin Core
• Review…
• Simple metadata for describing internet resources.
• For “Document-Like Objects”
• 15 Elements.
Dublin Core Elements
• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
Title
• Label: TITLE
• The name given to the resource by the CREATOR or PUBLISHER.
Author or Creator
• Label: CREATOR
• The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents; artists, photographers, or illustrators in the case of visual resources.
Subject and Keywords
• Label: SUBJECT
• The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.
Description
• Label: DESCRIPTION
• A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.
Publisher
• Label: PUBLISHER
• The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.
Other Contributors
• Label: CONTRIBUTORS
• Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).
Date
• Label: DATE
• The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.
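The YYYYMMDD form is easy to produce and to read back with Python's standard `datetime` module. The sketch below uses the December 3, 1996 example from the element description; the helper names `to_dc_date` and `from_dc_date` are hypothetical.

```python
from datetime import date, datetime


def to_dc_date(day):
    """Format a date as the 8 digit YYYYMMDD string recommended for DATE."""
    return day.strftime("%Y%m%d")


def from_dc_date(text):
    """Parse an 8 digit YYYYMMDD string back into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


print(to_dc_date(date(1996, 12, 3)))
print(from_dc_date("19961203"))
```

A fixed-width, most-significant-first form like this also sorts chronologically when compared as plain text.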
Resource Type
• Label: TYPE
• The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at the following URL: http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html
Format
• Label: FORMAT
• The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
Resource Identifier
• Label: IDENTIFIER
• String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names, would also be candidates for this element.
Source
• Label: SOURCE
• The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.
Language
• Label: LANGUAGE
• Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html
Relation
• Label: RELATION
• Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
Coverage
• Label: COVERAGE
• The spatial locations and temporal duration characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
Rights Management
• Label: RIGHTS
• The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.
SGML and XML Sources and Resources
• Books:
– van Herwijnen, Eric. Practical SGML. (2nd Ed.) Boston: Kluwer Academic Publishers, 1994.
– Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990.
– (And MANY XML books)
• Web Sites:
– Robin Cover’s SGML/XML Site http://www.oasis-open.org/cover/sgml-xml.html
–