Date post: 19-Dec-2015

2002.10.08 - SLIDE 1IS 202 – FALL 2002

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002

http://www.sims.berkeley.edu/academics/courses/is202/f02/

SIMS 202: Information Organization and Retrieval

Lecture 12: Metadata and Markup

2002.10.08 - SLIDE 2IS 202 – FALL 2002

Lecture Overview

• Review
  – Thesaurus Design And Development
  – Thesaurus Design
  – Steps In Thesaurus Development

• Metadata And Markup
  – XML As A Metadata Lingua Franca
  – XML DTD Construction
  – XML For Protocols And Metadata Languages

2002.10.08 - SLIDE 4IS 202 – FALL 2002

Structure of an IR System

[Figure: Information Storage and Retrieval System]

• Search line: Interest profiles & queries → Formulating query in terms of descriptors → Storage of profiles → Store 1: Profiles/search requests
• Storage line: Documents & data → Indexing (descriptive and subject) → Storage of documents → Store 2: Document representations
• Comparison/matching of Store 1 and Store 2 → Potentially relevant documents
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of lead-in vocabulary and indexing language)

Adapted from Soergel, p. 19

2002.10.08 - SLIDE 5IS 202 – FALL 2002

Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms
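
The definition above can be read as a small data structure. The following is a minimal sketch, assuming one record per preferred term; the class names, field names, and the sample "Automobiles" entry are illustrative and do not come from any particular thesaurus.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class TermRecord:
    """One thesaurus entry; the field names are illustrative assumptions."""
    preferred: str
    use_for: Set[str] = field(default_factory=set)   # synonymous / equivalent terms
    broader: Set[str] = field(default_factory=set)   # broader terms (BT)
    narrower: Set[str] = field(default_factory=set)  # narrower terms (NT)
    related: Set[str] = field(default_factory=set)   # other related terms (RT)


class Thesaurus:
    def __init__(self) -> None:
        self.records: Dict[str, TermRecord] = {}
        self.lead_in: Dict[str, str] = {}  # non-preferred term -> preferred term

    def add(self, record: TermRecord) -> None:
        self.records[record.preferred] = record
        for synonym in record.use_for:
            self.lead_in[synonym.lower()] = record.preferred

    def resolve(self, term: str) -> Optional[str]:
        """Return the preferred term for `term`, or None if it is unknown."""
        for preferred in self.records:
            if preferred.lower() == term.lower():
                return preferred
        return self.lead_in.get(term.lower())


# Hypothetical sample entry.
thesaurus = Thesaurus()
thesaurus.add(TermRecord("Automobiles",
                         use_for={"Cars", "Motor cars"},
                         broader={"Vehicles"},
                         narrower={"Sports cars"},
                         related={"Highways"}))
```

In this sketch, `thesaurus.resolve("cars")` follows the synonym link to the preferred term, which is the role the lead-in vocabulary plays for indexers and searchers.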

2002.10.08 - SLIDE 6IS 202 – FALL 2002

Thesauri (cont.)

• Examples
  – The ERIC Thesaurus of Descriptors
  – The Medical Subject Headings (MeSH) of the National Library of Medicine
  – The Art and Architecture Thesaurus

2002.10.08 - SLIDE 7IS 202 – FALL 2002

Why Develop a Thesaurus?

• To provide a conceptual structure or “space” for a body of information
  – To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
  – To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)

2002.10.08 - SLIDE 8IS 202 – FALL 2002

Development of a Thesaurus

• Term selection

• Merging and development of concept classes

• Definition of broad subject fields and subfields

• Development of classificatory structure

• Review, testing, application, revision

2002.10.08 - SLIDE 9IS 202 – FALL 2002

Flow of Work in Thesaurus Construction

[Flowchart; box labels in reading order]

• Select Sources
• Assign codes
• Select Terms
• Record Selected Terms
• Sort Terms
• Merge identical Terms
• Define Broad Subject Fields
• Merge Terms in Same Concept class
• Sort Terms into Broad Subject Fields
• Define Subfields within one Subject Field
• Work out detailed structure of the Subject Field
• Select Preferred Terms
• Decision: All Subfields of Broad Subject finished? (Yes / No)
• Decision: All Broad Subjects finished? (Yes / No)
• Improve Class Structure
• Print Classified Index and review
• Discuss with Experts and Users
• Select descriptors and checklist items
• Produce Full Thesaurus and Check references
• Assign Notation
• Review and Test
• Decision: Many Modifications? (Yes / No)
• Revise as needed

Based on Soergel, pp 327-333

2002.10.08 - SLIDE 10IS 202 – FALL 2002

The Indexing Process

• Concept identification

• Term selection (via thesaurus)

• Term assignment
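
The three steps above can be sketched in a few lines. This is a minimal illustration, assuming concept identification has already been done by a human indexer and that the thesaurus is reduced to a lookup table; the `LEAD_IN` table and the `index_document` function are hypothetical names.

```python
# Hypothetical lead-in vocabulary: entry term (lower case) -> preferred descriptor.
LEAD_IN = {
    "cars": "Automobiles",
    "automobiles": "Automobiles",
    "freeways": "Highways",
    "highways": "Highways",
}


def index_document(concepts):
    """Map identified concepts to descriptors; collect concepts with no match."""
    assigned, unmatched = [], []
    for concept in concepts:
        descriptor = LEAD_IN.get(concept.lower())   # term selection via thesaurus
        if descriptor is None:
            unmatched.append(concept)               # candidate for a new thesaurus term
        elif descriptor not in assigned:
            assigned.append(descriptor)             # term assignment, once per descriptor
    return assigned, unmatched


assigned, unmatched = index_document(["Cars", "Freeways", "automobiles", "Tolls"])
```

Concepts that end up in `unmatched` would, in a manual workflow, either be expressed by combining existing terms or be proposed as new terms.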

2002.10.08 - SLIDE 11IS 202 – FALL 2002

Application: The Indexing Process (Manual)

[Flowchart; box labels]

• Start
• Examine Document and Identify Significant Concepts
• Consider First Concept
• Establish Term Denoting Concept
• Decision: Does Thesaurus contain term for Concept? (Yes / No)
• Decision: Preferred Term? (Yes / No)
• Select Preferred Term
• Consider Preferred Term
• Decision: Is Term suitable? (Yes / No)
• Select Alternative term to represent Concept
• Consider any associated terms in Thesaurus (NT, BT)
• Decision: Would Concept be better represented by one of these terms? (Yes / No)
• Prefer Alternative Term(s)
• Decision: Can Concept be expressed combining terms? (Yes / No)
• Consider Each of These Terms
• Admit New Term Into Thesaurus
• Assign Terms to Document
• Decision: Is There Another Concept? (Yes / No)
• End

Adapted from ISO 5963, p.5

2002.10.08 - SLIDE 12IS 202 – FALL 2002

Lecture Overview

• Review
  – Thesaurus Design And Development
  – Thesaurus Design
  – Steps In Thesaurus Development

• Metadata And Markup
  – XML As A Metadata Lingua Franca
  – XML DTD Construction
  – XML For Protocols And Metadata Languages

2002.10.08 - SLIDE 13IS 202 – FALL 2002

What is SGML/XML?

• SGML stands for Standard Generalized Markup Language
  – XML stands for eXtensible Markup Language

• What it is NOT:
  – Not a visual document description
  – Not an application specific markup
  – Not proprietary

2002.10.08 - SLIDE 14IS 202 – FALL 2002

What is SGML/XML?

• What it is:
  – An international standard (SGML: ISO 8879:1986)
  – A generic language for describing the structure of documents, and markup that can be used for those documents
  – Intended for generating markup for content rather than form elements

• XML is a simplified subset of SGML (established by W3C)

2002.10.08 - SLIDE 15IS 202 – FALL 2002

The Documents of Commerce

• Customer profiles
• Vendor profiles
• Catalogs
• Datasheets
• Price lists
• Purchase orders
• Invoices
• Inventory reports
• Bill of materials
• Contracts
• Credit reports
• Bank statements
• Proposals
• Directories
• Transportation schedules
• Receipts

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 16IS 202 – FALL 2002

Alternatives for Exchanging Documents

• Format based
  – HTML: Publish information for a universal client
  – EDI: Batch and high-volume exchange between trading partners
• API based
  – CORBA / COM: Application integration

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 17IS 202 – FALL 2002

Limitations of Each Exchange Model

• Format based
  – HTML: Formatting markup “for eyes”; “Scrape and hope” integration
  – EDI: Must be pre-arranged; High cost; Rigid and inflexible
• API based
  – CORBA / COM: Pre-wired; Heavyweight to implement; Not native to the web

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 18IS 202 – FALL 2002

Having Our Cake And Eating It Too

• We need:

– The precision of APIs
– The simplicity of HTML

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 19IS 202 – FALL 2002

XML to the Rescue (SGML and HTML++)

• Extensible Markup Language
  – A simplification of SGML, the Standard Generalized Markup Language
  – Instead of a fixed set of format-oriented tags like HTML, XML allows you to create the schema (whatever set of tags is needed) for your information type or application
  – This makes any XML instance “self-describing” and easily understood by computers and people

• Version 1.0 ratified by W3C in February 1998
  – Backed by Microsoft, Sun, Netscape, many others

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 20IS 202 – FALL 2002

Why XML is Revolutionary

• XML enables a business to preserve any “document type” or “database schema” when it publishes on the Web

• XML enables a business to send self-describing “business messages” that can be understood by programs, not just “by eye”

• This information cannot be encoded in HTML

• XML-encoded information is smart enough to support new classes of Web applications

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 21IS 202 – FALL 2002

XML Enables New Web Applications

• Data interchange between Web clients
  – Use Web for application integration without information loss (example: product information in supply chain, EDI)

• Moving processing from server to client
  – Reduce network traffic and server load (example: download airline schedule, find best flights without “back-and-forth” thrashing)

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 22IS 202 – FALL 2002

XML Enables New Web Applications

• Multiple client-side views of same data
  – Expert and novice versions
  – Manager and worker versions
  – Localization (currency or measurement conversions)

• “Information push” from personalized applications
  – Selecting information based on user preferences (example: custom news feed by matching article keywords against user profile)

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 23IS 202 – FALL 2002

The First Generation Web

[Figure: Computers → (scripts, HTML) → Browsers]

.. making information accessible through browsers

• Eyeballs only
• No automation
• Limited integration

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 24IS 202 – FALL 2002

HTML Airline Schedule Seen “By Eye”

Airline Schedule Flight Information United Airlines #200 San Francisco 9:30 AM Honolulu 12:30 PM $368.50

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 25IS 202 – FALL 2002

HTML Airline Schedule Seen “By Computer”

<Title>Airline Schedule</Title>
<Body>
<H2>Flight Information</H2>
<H3>United Airlines #200</H3>
<UL>
<LI>San Francisco
<LI>9:30 AM
<LI>Honolulu
<LI>12:30 PM
<LI>$368.50
</UL>
</Body>

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 26IS 202 – FALL 2002

Next Generation Web

[Figure: Computers → (Java, XML) → Computers]

.. making information and services accessible to computers (and people)

• Structured searches
• Agents
• New models

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 27IS 202 – FALL 2002

Airline Schedule in XML

<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>

Source Dr. Robert J. Glushko
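
To see what "seen by computer" buys with the XML version, here is a small sketch that reads the schedule with Python's standard `xml.etree.ElementTree` module and picks the cheapest segment to a destination. The element and attribute names are taken from the slide; the function name `cheapest_segment` is a hypothetical helper, and the sketch assumes every segment carries a numeric `<Price>`.

```python
import xml.etree.ElementTree as ET

SCHEDULE_XML = """
<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
"""

root = ET.fromstring(SCHEDULE_XML)


def cheapest_segment(schedule, destination):
    """Return the lowest-priced <Segment> for a destination, or None."""
    candidates = [seg for seg in schedule.iter("Segment")
                  if seg.findtext("Destination") == destination]
    if not candidates:
        return None
    # Assumes each candidate has a <Price> whose text parses as a number.
    return min(candidates, key=lambda seg: float(seg.findtext("Price")))


segment = cheapest_segment(root, "Honolulu")
```

The same question asked of the HTML version would require guessing which `<LI>` holds the price, which is the "scrape and hope" problem named earlier in the lecture.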

2002.10.08 - SLIDE 28IS 202 – FALL 2002

Shared Semantics for Time and Location

• Shared semantics for location and time in all schemas that need them enables richer “commerce networks” of services:
  – <TransportSchedule Type="Airline"> ... <Destination>Honolulu</Destination>
  – <Accommodation Type="Hotel"> ... <Destination>Honolulu</Destination>
  – <Event Type="Concert"> ... <Destination>Honolulu</Destination>

Source Dr. Robert J. Glushko
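
A rough sketch of why the shared `<Destination>` element matters: one generic function can find every kind of service for a place without knowing each schema in detail. The four sample documents below are hypothetical and contain only the elements shown on the slide; `services_for` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; only the element and attribute names come from the slide.
DOCUMENTS = [
    '<TransportSchedule Type="Airline"><Destination>Honolulu</Destination></TransportSchedule>',
    '<Accommodation Type="Hotel"><Destination>Honolulu</Destination></Accommodation>',
    '<Event Type="Concert"><Destination>Honolulu</Destination></Event>',
    '<Event Type="Concert"><Destination>Oakland</Destination></Event>',
]


def services_for(destination, documents):
    """List (document type, Type attribute) for documents mentioning the destination."""
    matches = []
    for text in documents:
        root = ET.fromstring(text)
        # The shared <Destination> element is the only schema knowledge needed here.
        if any(d.text == destination for d in root.iter("Destination")):
            matches.append((root.tag, root.get("Type")))
    return matches


honolulu = services_for("Honolulu", DOCUMENTS)
```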

2002.10.08 - SLIDE 29IS 202 – FALL 2002

Automated Vacation Planning Service

• Book me the cheapest flight to Honolulu the first week of January

• Find a hotel room for the day I arrive

• What concerts are taking place the next day?

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 30IS 202 – FALL 2002

The Common Business Language

• Specifies common semantics, common syntax, and message packaging for information held by and exchanged among transaction partners and market participants

• These documents are the interfaces among the commerce components envisioned in the overall eCo architecture being realized in a current ATP project being carried out by CNgroup, CommerceNet, BusinessBots, and Tesserae

• CBL’s focus is on the functions and information that are common to all business domains

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 31IS 202 – FALL 2002

CBL and XML

• CBL documents are described by XML DTDs to make them “self-descriptive” and validatable

• CBL builds on existing standard or industry semantics where possible

• Complex descriptions and messages can be composed from primitives

• Domain-specific XML applications can be implemented in “native” form or as “hybrids” for maximal interoperability

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 32IS 202 – FALL 2002

CBL Building Blocks

[Figure: CBL Documents are assembled from groups of building blocks; the label "core" appears five times in the figure]

• CBL Documents
  – Business Forms: Catalog, Purchase Order, Invoice
  – Business Descriptions: Vendor, Services, Products
  – Measurements: Time, Currency, Weight
  – Locale: Address, Country, Language
  – Classification: SIC, NAICS, FSC

Source Dr. Robert J. Glushko

2002.10.08 - SLIDE 34IS 202 – FALL 2002

If Interested In CBL

• Visit:
  – http://www.xcbl.org/

• And for e-commerce applications using CBL, visit:
  – http://www.commerceone.com/

2002.10.08 - SLIDE 35IS 202 – FALL 2002

Lecture Overview

• Review
  – Thesaurus Design And Development
  – Thesaurus Design
  – Steps In Thesaurus Development

• Metadata And Markup
  – XML As A Metadata Lingua Franca
  – XML DTD Construction
  – XML For Protocols And Metadata Languages

2002.10.08 - SLIDE 36IS 202 – FALL 2002

SGML/XML Structure

• An SGML document consists of three parts:
  – The SGML Declaration
  – The Document Type Definition (DTD)
  – The Document Instance

• An XML document REQUIRES only the document instance, but for effective processing a DTD is very important

• XML Schema provides an alternative to DTDs for XML applications

2002.10.08 - SLIDE 37IS 202 – FALL 2002

Document Type Definitions

• The DTD describes the structural elements and "shorthand" markup for a particular document type and defines:
  – Names of "legal" elements
  – How many times elements can appear
  – The order of elements in a document
  – Whether markup can be omitted (SGML only)
  – Contents of elements (i.e., nested structures)
  – Attributes associated with elements
  – Names of "entities"
  – Short-hand conventions for element tags (SGML only)

2002.10.08 - SLIDE 38IS 202 – FALL 2002

DTD Components

• The major components of a DTD are:
  – Entity Declarations
  – Element Declarations
  – Attribute Declarations

2002.10.08 - SLIDE 39IS 202 – FALL 2002

Document Type Definitions

• Entity Declarations are a "macro" definition facility for both the DTD and the document instance
  – General Internal Entity Definitions
      <!ENTITY name "substitute string">
      referenced by &name;
  – General External Entity Definitions
      <!ENTITY name SYSTEM "file path">
      referenced by &name;
  – Parameter Entity Definitions (used only inside DTDs)
      <!ENTITY % name "substitute string">
      or
      <!ENTITY % name SYSTEM "file path">
      referenced by %name; (SGML also accepts %name where the end of the reference is unambiguous; XML always requires the semicolon)
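
A short sketch of a general internal entity at work, using Python's standard `xml.etree.ElementTree` parser, which expands entities declared in a document's internal DTD subset. The entity name `sims` and the memo text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The internal subset declares one general internal entity named "sims" (hypothetical).
MEMO = """<!DOCTYPE memo [
  <!ENTITY sims "School of Information Management and Systems">
]>
<memo>Welcome to the &sims;.</memo>"""

memo = ET.fromstring(MEMO)

# After parsing, the reference &sims; has been replaced by its substitute string.
expanded_text = memo.text
```

External entities (the `SYSTEM "file path"` form) are a different matter: this parser does not fetch them, and many applications disable them for security reasons.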

2002.10.08 - SLIDE 40IS 202 – FALL 2002

Document Type Definitions

• Element Declarations define the structural elements of a document and their associated markup
    <!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) >
  – Omitted tag minimization indicates whether start-tags or end-tags can be omitted in the markup; the indicators (O) or (-) are required in SGML but can NOT be used in XML

2002.10.08 - SLIDE 41IS 202 – FALL 2002

Document Type Definitions

• Content model provides a nested structural description of the elements that make up this element, e.g.:
    <!ELEMENT memo - - ((to & from), body, close?)>
    <!ELEMENT body - O (p)* >
    <!ELEMENT p - O (#PCDATA | q)*>
    <!ELEMENT q - - (#PCDATA)>
    ...
  – ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order

2002.10.08 - SLIDE 42IS 202 – FALL 2002

Document Type Definitions

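• Nearly the same content model in XML
    <?xml version="1.0"?>
    <!DOCTYPE memo [
    <!ELEMENT memo ((to | from)+, body, close?)>
    <!ELEMENT body (p)* >
    <!ELEMENT p (#PCDATA | q)* >
    <!ELEMENT q (#PCDATA)>
    …
    ]>
  – Note the XML declaration (<?xml ... ?>) that opens the prolog
  – Note that the & connector on the previous slide is not legal in XML; (to | from)+ is a looser substitute that accepts to and from in any order and any number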
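
Python's standard library has no validating XML parser, so the following is only a hand-rolled sketch of what the memo content model means: the model `((to | from)+, body, close?)` is rewritten as a regular expression over the names of the child elements. It checks the direct children of `memo` only and is not a substitute for a real DTD validator; `matches_memo_model` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide: ((to | from)+, body, close?)
# Each child is written as "name " so the model becomes an ordinary regex.
MEMO_MODEL = re.compile(r"(?:(?:to|from) )+body (?:close )?")


def matches_memo_model(xml_text):
    """Check only the sequence of memo's direct children against the model."""
    root = ET.fromstring(xml_text)
    if root.tag != "memo":
        return False
    child_names = "".join(child.tag + " " for child in root)
    return MEMO_MODEL.fullmatch(child_names) is not None


ok = matches_memo_model("<memo><to/><from/><body/></memo>")
out_of_order = matches_memo_model("<memo><body/><to/></memo>")
```

The correspondence is direct: `,` becomes concatenation, `|` stays alternation, and the occurrence indicators `?`, `*`, and `+` keep their regular-expression meaning.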

2002.10.08 - SLIDE 43IS 202 – FALL 2002

Document Type Definitions

• Declared content can be: PCDATA, CDATA, RCDATA, EMPTY

• Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model (NOT in XML), e.g.:
    <!ELEMENT memo - - ((to & from), body, close?) +(fn)>
  – Says that element fn can appear anyplace in the memo

2002.10.08 - SLIDE 44IS 202 – FALL 2002

Document Type Definitions

• Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes

2002.10.08 - SLIDE 45IS 202 – FALL 2002

Attributes Example

• <!ATTLIST associate_element attribute_name declared_value default_value >

• <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
  – In markup of a document:
      <memo status="CONFIDENTIAL">
    also, because of the default set:
      <memo>
    would be the same as
      <memo status="PUBLIC">
  – There are a variety of special defaults and data types that can be given in attribute definitions
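
The effect of the `status` declaration can be imitated in application code. This sketch hard-codes the enumerated values and the default from the ATTLIST above rather than reading them from a DTD; `memo_status` and the two constants are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Mirrors <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>; hard-coded by assumption.
STATUS_VALUES = {"PUBLIC", "CONFIDENTIAL"}
STATUS_DEFAULT = "PUBLIC"


def memo_status(xml_text):
    """Return the memo's status, applying the declared default when it is absent."""
    memo = ET.fromstring(xml_text)
    status = memo.get("status", STATUS_DEFAULT)
    if status not in STATUS_VALUES:
        raise ValueError(f"illegal status value: {status!r}")
    return status


default_status = memo_status("<memo/>")
explicit_status = memo_status('<memo status="CONFIDENTIAL"/>')
```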

2002.10.08 - SLIDE 46IS 202 – FALL 2002

Sample SGML DTD

<!doctype ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
    (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL |
     AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
     AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
     CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION |
     TEXTSOUP | LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES |
     ABSTRACT)*,
    (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID - o (#PCDATA)>
<!ELEMENT ABSTRACT - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
… etc…
]>

2002.10.08 - SLIDE 47IS 202 – FALL 2002

XML Version

<!DOCTYPE ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
    (SERIES | TYPE | REVISION | REVISION-DATE | AUTHOR-PERSONAL |
     AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
     AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
     CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION |
     TEXTSOUP | LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES |
     ABSTRACT)*,
    (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
… etc…
]>

2002.10.08 - SLIDE 48IS 202 – FALL 2002

Document Using That DTD

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<ENTRY>February 13 1995</ENTRY>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<ORGANIZATION>California Department of Water Resources</ORGANIZATION>
<SERIES>120-93</SERIES>
<TYPE>bulletin</TYPE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
<TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html </TEXT-REF>
<PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE </PAGED-REF>
</ELIB-BIB>
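
Because the record is well-formed, a program can turn it into fields without any knowledge of its layout on the page. The sketch below uses a shortened copy of the record above and collects each element into a list, since the DTD lets most elements repeat; `record_to_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copy of the sample record, for illustration only.
RECORD = """<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
</ELIB-BIB>"""


def record_to_fields(xml_text):
    """Map each child element name to the list of its (whitespace-trimmed) values."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


fields = record_to_fields(RECORD)
```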

2002.10.08 - SLIDE 49IS 202 – FALL 2002

A More Complex DTD

<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC
    Material (BK|AM|CF|MP|MU|VM|SE) "BK"
    id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->

<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>

<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
…etc…

2002.10.08 - SLIDE 50IS 202 – FALL 2002

More Complex DTD (cont.)

<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>

<!-- Component Parts of Variable Data Fields -->
<!-- Numbers & Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?,
    Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*,
    Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*,
    Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?,
    Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*,
    Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*,
    Fld071*, Fld072*, Fld074*, Fld080?, Fld082*, Fld084*, Fld086*,
    Fld088*, Fld090*, Fld096*)>

<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>

<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*,
    Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)>

<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>

<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?,
    Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)>

…etc…

2002.10.08 - SLIDE 51IS 202 – FALL 2002

Complex DTD (cont.)

<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)+)>
<!ATTLIST Fld245
    AddEnty (No|Yes|Blank) #IMPLIED
    NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>

…etc…

<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>

2002.10.08 - SLIDE 52IS 202 – FALL 2002

Document Markup

• All document markup is derived from the DTD for the particular document type

• The DTD must be referenced in the document using the DOCTYPE declaration:

    <!DOCTYPE name SYSTEM "file_path" >
  or
    <!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
  or
    <!DOCTYPE name [doctype_declaration_subset]>

  The doctype_declaration_subset can be any combination of element, entity, and attribute declarations

2002.10.08 - SLIDE 53IS 202 – FALL 2002

HTML

• HTML was not originally "real" SGML; the DTD was invented after the language

• It is often more concerned with the form of the output on the screen than with the structural contents of the HTML docs

• Relies on the application (such as Netscape) to implement interesting actions like hypertext linking

2002.10.08 - SLIDE 54IS 202 – FALL 2002

Lecture Overview

• Review
  – Thesaurus Design And Development
  – Thesaurus Design
  – Steps In Thesaurus Development

• Metadata And Markup
  – XML As A Metadata Lingua Franca
  – XML DTD Construction
  – XML For Protocols And Metadata Languages

2002.10.08 - SLIDE 55IS 202 – FALL 2002

Dublin Core

• Review…

• Simple metadata for describing internet resources

• For “Document-Like Objects”

• 15 Elements

2002.10.08 - SLIDE 56IS 202 – FALL 2002

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management

2002.10.08 - SLIDE 57IS 202 – FALL 2002

DC DTD Implementation

• There have been various versions

• This one is the one recommended (required) by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

• Uses XML Namespaces

• Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/

2002.10.08 - SLIDE 58IS 202 – FALL 2002

DC Element and Attribute Definitions

<!-- The elements from DCMES 1.1 -->

<!-- The name given to the resource. -->
<!ELEMENT dc:title (#PCDATA)>
<!ATTLIST dc:title xml:lang CDATA #IMPLIED>

<!-- An entity primarily responsible for making the content of the resource. -->
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator xml:lang CDATA #IMPLIED>

<!-- The topic of the content of the resource. -->
<!ELEMENT dc:subject (#PCDATA)>
<!ATTLIST dc:subject xml:lang CDATA #IMPLIED>

<!-- An account of the content of the resource. -->
<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description xml:lang CDATA #IMPLIED>

<!-- The entity responsible for making the resource available. -->
<!ELEMENT dc:publisher (#PCDATA)>
<!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>

<!-- An entity responsible for making contributions to the content of the resource. -->
<!ELEMENT dc:contributor (#PCDATA)>
<!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>

<!-- A date associated with an event in the life cycle of the resource. -->
<!ELEMENT dc:date (#PCDATA)>
<!ATTLIST dc:date xml:lang CDATA #IMPLIED>

2002.10.08 - SLIDE 59IS 202 – FALL 2002

DC Element Definitions (cont.)

<!-- The nature or genre of the content of the resource. -->
<!ELEMENT dc:type (#PCDATA)>
<!ATTLIST dc:type xml:lang CDATA #IMPLIED>

<!-- The physical or digital manifestation of the resource. -->
<!ELEMENT dc:format (#PCDATA)>
<!ATTLIST dc:format xml:lang CDATA #IMPLIED>

<!-- An unambiguous reference to the resource within a given context. -->
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>

<!-- A Reference to a resource from which the present resource is derived. -->
<!ELEMENT dc:source (#PCDATA)>
<!ATTLIST dc:source xml:lang CDATA #IMPLIED>
<!ATTLIST dc:source rdf:resource CDATA #IMPLIED>

<!-- A language of the intellectual content of the resource. -->
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language xml:lang CDATA #IMPLIED>

<!-- A reference to a related resource. -->
<!ELEMENT dc:relation (#PCDATA)>
<!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
<!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>

<!-- The extent or scope of the content of the resource. -->
<!ELEMENT dc:coverage (#PCDATA)>
<!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>

<!-- Information about rights held in and over the resource. -->
<!ELEMENT dc:rights (#PCDATA)>
<!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
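
As a rough illustration of the `dc:` elements in use, this sketch builds a small Dublin Core description with Python's standard `xml.etree.ElementTree` module. The namespace URI is the DCMES 1.1 namespace; the `metadata` wrapper element and the `make_dc_record` helper are hypothetical, and the sample values are borrowed from the ELIB record shown earlier in the lecture.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set, version 1.1
ET.register_namespace("dc", DC_NS)           # serialize with the conventional dc: prefix


def make_dc_record(values):
    """Build a record from (element name, text) pairs, e.g. ("title", "...")."""
    record = ET.Element("metadata")  # hypothetical wrapper element, not part of DCMES
    for name, text in values:
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = text
    return record


record = make_dc_record([
    ("title", "Water Conditions in California Report 2"),
    ("creator", "California Department of Water Resources"),
    ("date", "March 1, 1993"),
    ("type", "bulletin"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Every element is optional and repeatable in Dublin Core, which is why the helper takes a list of pairs rather than a fixed set of fields.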

2002.10.08 - SLIDE 60IS 202 – FALL 2002

Other Protocols and Metadata Systems Using XML

• SOAP (Simple Object Access Protocol)
• DAV/DASL (Distributed Authoring and Versioning)
• SDLIP (Simple Digital Library Interoperability Protocol)
• RDF (Resource Description Framework)
• ADL Gazetteer Protocol
• OAI-PMH (already discussed)
• MPEG-7
• Also versions of MARC and other formats in XML

2002.10.08 - SLIDE 61IS 202 – FALL 2002

SGML and XML Sources and Resources

• Books:
  – van Herwijnen, Eric. Practical SGML. (2nd Ed.) Boston: Kluwer Academic Publishers, 1994.
  – Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990.
  – (and MANY XML books)

• Web Sites:
  – The W3C web site (all XML standards documents)
      http://www.w3.org
  – Robin Cover’s SGML/XML Site
      http://www.oasis-open.org/cover/sgml-xml.html

2002.10.08 - SLIDE 62IS 202 – FALL 2002

Next Time

• Assignment 5 Due

• Come to class having thought about the strengths and weaknesses of the consolidated photo classification

