Date post: | 26-Mar-2015 |
Upload: | cody-flores |
METS and TEI
Richard Gartner
Oxford University
Introduction (verbal)
• METS provides a framework within which any data or metadata can be referenced or embedded
• This presentation shows how easily METS and TEI can be used in tandem
• The context is an image database with full OCR’d text encoded in TEI
Cobbett’s Parliamentary History
Incorporating TEI into METS
<fileGrp ID="modhis006-aab-TEI">
<file GROUPID="TEI" MIMETYPE="text/xml" ADMID="modhis006-aab-001-TEI">
<FLocat LOCTYPE="URL" xlink:href="modhis006-aab.xml"/>
</file>
</fileGrp>
Incorporating TEI into METS
<div ID="modhis006-aab-div.1.1.1" LABEL="Half page">
<fptr FILEID="modhis006-aab-fgrp-0001">
<area FILEID="modhis006-aab-TEI" BEGIN="modhis006-aab-TEI.pb.1" END="modhis006-aab-TEI.pb.2"/>
</fptr>
</div>
Incorporating TEI into METS
<pb id="modhis006-aab-aaa.pb.3"/>
THE Parliamentary History OF ENGLAND, FROM THE EARLIEST PERIOD TO THE YEAR 1803.
FROM WHICH LAST-MENTIONED EPOCH IT IS CONTINUED DOWNWARDS IN THE WORK ENTITLED,' THE PARLIAMENTARY DEBATES."
VOL. II. A.D. 1625-1642.
LONDON: PRINTED BY T. C. HANSARD, PETERBOROUGH-COURT, FLEET-STREET s R LONGMAN, HURST, REES, ORME, & BROWN; J. RICHARDSON; BLACK, PARRY, & co,; j. HATCH ARD; J. RIDGWAY; E. JEFFERY; J. BOOKER; J- RODWELL; CRADOCK & JOY; R. H. EVANS; J. BUDD; J. BOOTH; T. C. HANSARD.
1807. ;
<pb id="modhis006-aab-aaa.pb.4"/>
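As a rough illustration of what the BEGIN and END pointers buy you, the sketch below pulls the text of one page out of a TEI file by walking from one <pb> to the next. It is not part of the original workflow: the sample text, the "demo" IDs and the function name are invented, and it assumes the minimal encoding described in these slides, where every <pb> is a direct child of a single <p>.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI body; the IDs follow the <idstem>.pb.<n> pattern.
TEI_SAMPLE = """<text><body><div0 id="demo.div.1"><p>
<pb id="demo.pb.1"/>Parliamentary History. VOL. II.
<pb id="demo.pb.2"/>Second page of text.
<pb id="demo.pb.3"/></p></div0></body></text>"""


def page_text(tei_root, begin_id, end_id):
    """Collect the text that follows the BEGIN <pb> up to the END <pb>.

    Assumes every <pb> is a direct child of one <p>, so the text of a page
    is held in the .tail of each <pb> from BEGIN up to (not including) END.
    """
    para = tei_root.find(".//p")
    chunks, collecting = [], False
    for child in para:
        if child.tag == "pb" and child.get("id") == end_id:
            break
        if child.tag == "pb" and child.get("id") == begin_id:
            collecting = True
        if collecting and child.tail:
            chunks.append(child.tail)
    return "".join(chunks).strip()


root = ET.fromstring(TEI_SAMPLE)
print(page_text(root, "demo.pb.1", "demo.pb.2"))
```

A richer encoding, with <pb> elements nested at different depths, would need a proper document-order traversal rather than this sibling walk.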
OCR -> TEI
• TEI in Libraries level 1: simplest level of encoding, designed for OCR texts
– One <div> element enclosing complete text
– One <p> element within this
– Page breaks marked with <pb>
OCR -> TEI (verbal)
• OCR’d text put into skeletal TEI file with minimal header
• Page-breaks in file replaced with <pb>
• A simple stylesheet assigns a sequential ID to each <pb>
• Another stylesheet adds <area> elements to METS structural map pointing to <pb> elements
<?xml version="1.0" encoding="utf-8"?>
<tei.2>
  <teiHeader status="new" type="text">
    <fileDesc>
      <titleStmt>
        <title>modhis006-aab OCR text</title>
      </titleStmt>
      <publicationStmt>
        <publisher>Oxford Digital Library</publisher>
      </publicationStmt>
      <sourceDesc default="NO">
        <p>OCR text from modhis006-aab</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div0 id="modhis006-aab-aaa.div.1" part="N" sample="complete" org="uniform">
        <p>
          <!-- Put your OCR text here! -->
        </p>
      </div0>
    </body>
  </text>
</tei.2>
□Parliamentary History.VOL. n.□
<pb/>Parliamentary History.VOL. n.<pb/>
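The replacement step could be scripted along these lines. This is only a sketch: it assumes the OCR software marks page breaks with a form-feed character (the slides do not say which character is used), and the function name is made up for the example. The text is XML-escaped first so that stray "&" or "<" characters in the OCR output do not break the TEI file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def ocr_to_tei_body(ocr_text, page_break="\f"):
    """Escape the OCR text for XML, then turn each page-break character into <pb/>."""
    pages = ocr_text.split(page_break)
    return "<pb/>".join(escape(page) for page in pages)


raw = "\fParliamentary History. VOL. II.\f"
body = ocr_to_tei_body(raw)
print(body)

# The result can be dropped into the single <p> of the skeleton file;
# parsing it confirms that the fragment is well-formed.
tei = ET.fromstring(f"<text><body><div0><p>{body}</p></div0></body></text>")
print(len(tei.findall(".//pb")))
```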
<xsl:template match="//pb">
  <xsl:element name="pb">
    <xsl:attribute name="id"><xsl:value-of select="$idstem"/>.pb.<xsl:number count="pb" format="1" level="any"/></xsl:attribute>
  </xsl:element>
</xsl:template>
<pb id="modhis006-aab-aaa.pb.1"/>Parliamentary History.VOL. n.<pb id="modhis006-aab-aaa.pb.2"/>
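For readers without an XSLT processor to hand, the same numbering can be sketched in Python. This is an illustrative equivalent, not the stylesheet actually used; the function name is hypothetical and the ID stem is passed in as a parameter, mirroring $idstem.

```python
import xml.etree.ElementTree as ET


def number_page_breaks(root, idstem):
    """Give every <pb> an id of the form '<idstem>.pb.<n>', counting in document order."""
    for n, pb in enumerate(root.iter("pb"), start=1):
        pb.set("id", f"{idstem}.pb.{n}")
    return root


root = ET.fromstring("<p><pb/>Parliamentary History. VOL. II.<pb/></p>")
number_page_breaks(root, "modhis006-aab-aaa")
print(ET.tostring(root, encoding="unicode"))
```

Like level="any" in xsl:number, the count runs across the whole document rather than restarting inside each parent element.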
<!-- $idstem and $currentcount are assumed to be declared elsewhere in the stylesheet -->
<xsl:element name="fptr">
  <xsl:attribute name="FILEID"><xsl:value-of select="@FILEID"/></xsl:attribute>
  <xsl:element name="area">
    <xsl:attribute name="FILEID"><xsl:value-of select="$idstem"/></xsl:attribute>
    <xsl:attribute name="BEGIN"><xsl:value-of select="$idstem"/>.pb.<xsl:number count="mets:fptr" format="1" level="any"/></xsl:attribute>
    <xsl:attribute name="END"><xsl:value-of select="$idstem"/>.pb.<xsl:value-of select="$currentcount+1"/></xsl:attribute>
  </xsl:element>
</xsl:element>
<div ID="modhis006-aab-div.1.1.1" LABEL="Half page">
<fptr FILEID="modhis006-aab-fgrp-0001">
<area FILEID="modhis006-aab-TEI" BEGIN="modhis006-aab-TEI.pb.1" END="modhis006-aab-TEI.pb.2"/>
</fptr>
</div>
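The step that produces this output, adding an <area> to every <fptr>, might look as follows in Python. It is a sketch under stated assumptions: one <fptr> per page image, listed in page order, so that the n-th pointer pairs with page breaks n and n+1. The "demo" IDs and the function name are invented; the METS namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def add_text_areas(struct_map, tei_file_id, idstem):
    """Append an <area> to each <fptr>, pairing the n-th pointer with pb n and pb n+1."""
    # Take a list first so the tree is not modified while it is being iterated.
    fptrs = list(struct_map.iter(f"{{{METS_NS}}}fptr"))
    for n, fptr in enumerate(fptrs, start=1):
        ET.SubElement(fptr, f"{{{METS_NS}}}area", {
            "FILEID": tei_file_id,
            "BEGIN": f"{idstem}.pb.{n}",
            "END": f"{idstem}.pb.{n + 1}",
        })
    return struct_map


sample = f"""<mets:structMap xmlns:mets="{METS_NS}">
  <mets:div LABEL="Page 1"><mets:fptr FILEID="img-0001"/></mets:div>
  <mets:div LABEL="Page 2"><mets:fptr FILEID="img-0002"/></mets:div>
</mets:structMap>"""

struct_map = add_text_areas(ET.fromstring(sample), "demo-TEI", "demo-TEI")
for area in struct_map.iter(f"{{{METS_NS}}}area"):
    print(area.get("BEGIN"), area.get("END"))
```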
Why use METS and TEI together?
• Images
• Overlapping hierarchies
Verbal
• Images
– As far as P4, TEI’s image facilities are clumsy
  • Have to use entity references only: no URLs, URIs etc.
  • No way to distinguish between inline images (designed for these) and whole-page images
  • No scope for administrative metadata
• Overlapping hierarchies
– CONCUR was the SGML mechanism for this: clumsy to use and gone in XML
– Various other approaches, all distinguished by notational complexity
Images
<figure entity="page1">
<head>Page 1</head>
</figure>
<!ENTITY page1 SYSTEM "location_of_image_file" NDATA jpeg>
Overlapping hierarchies
• Some approaches used with TEI
– CONCUR (SGML)
– MECS (Wittgenstein archive)
– Stand-off markup: XLink mechanisms to impose markup (varying hierarchies)
– TexMECS
– Witt: PROLOG
Images in METS
• List all variants of image files in <fileSec>
• Each can have extensive administrative or descriptive metadata attached
• Reference them by URLs, URIs etc. or embed them in the METS file
• FILEID attribute in <structMap> indicates exact correspondence of image to part of the item
Overlapping hierarchies
<structMap TYPE="physical">
<div LABEL="Page 1">
<fptr FILEID="image_file_for_page_1">
<area FILEID="teifile" BEGIN="page1" END="page2"/>
</fptr>
</div>
</structMap>
<structMap TYPE="logical">
<div LABEL="Chapter 1">
<fptr FILEID="image_file_for_page_1">
<area FILEID="teifile" BEGIN="page1" END="page23"/>
</fptr>
</div>
</structMap>
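To show how two structural maps can describe overlapping views of the same TEI file, the sketch below asks which divisions, in either map, contain a given page break. Everything in it is illustrative: the tiny METS document is unnamespaced for brevity, the list of page-break IDs is assumed to have been read from the TEI file in document order, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical METS fragment with a physical and a logical view of one TEI file.
METS_SAMPLE = """<mets>
  <structMap TYPE="physical">
    <div LABEL="Page 1"><fptr><area FILEID="teifile" BEGIN="page1" END="page2"/></fptr></div>
    <div LABEL="Page 2"><fptr><area FILEID="teifile" BEGIN="page2" END="page3"/></fptr></div>
  </structMap>
  <structMap TYPE="logical">
    <div LABEL="Chapter 1"><fptr><area FILEID="teifile" BEGIN="page1" END="page3"/></fptr></div>
  </structMap>
</mets>"""


def divs_covering(mets_root, pb_order, pb_id):
    """Return (structMap TYPE, div LABEL) for every div whose area contains pb_id.

    pb_order is the list of <pb> IDs in TEI document order; an area covers
    everything from its BEGIN marker up to, but not including, its END marker.
    """
    pos = pb_order.index(pb_id)
    hits = []
    for smap in mets_root.iter("structMap"):
        for div in smap.iter("div"):
            for area in div.iterfind("fptr/area"):
                begin = pb_order.index(area.get("BEGIN"))
                end = pb_order.index(area.get("END"))
                if begin <= pos < end:
                    hits.append((smap.get("TYPE"), div.get("LABEL")))
    return hits


mets_root = ET.fromstring(METS_SAMPLE)
print(divs_covering(mets_root, ["page1", "page2", "page3"], "page2"))
```

Because each map only points into the TEI file, neither hierarchy has to nest inside the other, and the TEI markup itself stays flat.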
Overlapping hierarchies
<structMap>
<div LABEL="Chapter 1">
<div LABEL="Page 1">
<fptr FILEID="image_file_for_page_1">
<area FILEID="teifile" BEGIN="page1" END="page2"/>
</fptr>
</div>
</div>
</structMap>
More information
• http://www.loc.gov/standards/mets
• http://www.jisc.ac.uk/index.cfm?name=techwatch_report_0205