Dr Susan Schreibman
Long Room Hub Senior Lecturer in Digital Humanities
XML and TEI
From Metadata to Linked Data
July 2011 (with thanks to Kevin Hawkins)
proprietary vs. non-proprietary formats
closed vs. open standards
Plain text isn’t good enough for many editorial projects
123 Kelly Road
Dublin 19
15 January 2009
Dear Awards Committee:
The candidate has fine penmanship.
Sincerely yours,
Jane Murphy
What if you want to …
• Publish a collection of letters and decide after beginning that you want to have the sender’s address and closing always right-aligned?
• Search your collection of letters to extract a list of all senders and another list of all recipients?
You need to make explicit certain features of text in order to aid the processing of that text by computer programs.
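As a rough sketch of what "making features explicit" buys you, the sample letter can be wrapped in markup and queried with Python's standard library. The element names (`letter`, `sender`, `recipient`, and so on) are invented for this illustration and are not taken from any schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the sample letter; the element names are
# invented for this illustration and do not come from any schema.
LETTERS = """
<letters>
  <letter>
    <address>123 Kelly Road, Dublin 19</address>
    <date>15 January 2009</date>
    <salutation>Dear <recipient>Awards Committee</recipient>:</salutation>
    <body>The candidate has fine penmanship.</body>
    <closing>Sincerely yours,</closing>
    <sender>Jane Murphy</sender>
  </letter>
</letters>
"""

root = ET.fromstring(LETTERS)

# Because senders and recipients are explicitly marked, listing them is
# a simple lookup rather than guesswork over plain text.
senders = [el.text for el in root.iter("sender")]
recipients = [el.text for el in root.iter("recipient")]

print(senders)
print(recipients)
```

With a larger collection, the same two lines would gather every sender and recipient without any change to the letters themselves.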
Word processor styles: create your own!
Extensible Markup Language (XML): word processor styles on steroids
• Can have one style inside another (‘nesting’)
• Can give properties to these styles, e.g.,
– This salutation is formal.
– This sentence contains sarcasm.
– This word is misspelled.
XML in brief (1)
• Open, non-proprietary standard
• Stored in plain text but usually thought of as contrasting with it (as above)
• Marks the beginning and end of spans of text using tags:
<sentence>This is a sentence.</sentence>
XML in brief (2)
• Spans of text must nest properly:
Wrong: <sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>
Right: <sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>
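One way to see the nesting rule in action is to hand both versions to an XML parser. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised, among other cases, when a closing tag does not match
        # the most recently opened element.
        return False

wrong = "<sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>"
right = "<sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>"

print(is_well_formed(wrong))
print(is_well_formed(right))
```

The parser rejects the overlapping version outright, which is why overlap has to be avoided rather than merely discouraged.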
Elements (tags), attributes, values, content
<sentence type="declarative">This is a sentence.</sentence>
<sentence type="interrogative">Is this a sentence?</sentence>
Elements may have zero, one, or more than one attribute.
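To connect the four terms to something concrete, here is a small sketch that parses one of the sentences above and reads back each part. It uses only Python's standard library.

```python
import xml.etree.ElementTree as ET

sentence = ET.fromstring(
    '<sentence type="interrogative">Is this a sentence?</sentence>'
)

print(sentence.tag)          # the element name
print(sentence.attrib)       # all attributes, as a dictionary
print(sentence.get("type"))  # the value of one attribute
print(sentence.text)         # the content
```

Note that attribute values in real XML must be wrapped in straight quotes; typographic ("curly") quotes are not accepted by a parser.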
Wait, this all looks a lot like HTML!
HTML is a specific application of SGML, the predecessor of XML, with pre-defined elements and attributes. You can’t create your own elements, so its usefulness is limited.
XML as a tree
Remember, everything must nest properly!
We use family tree terms: parent, child, sibling, ancestor, and descendant.
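The family tree terms can be tried out on a small document. The markup below is a hypothetical fragment of the sample letter; `parent_of` and `ancestors` are helper names invented for this sketch.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<letter>"
    "<opener><address>123 Kelly Road</address><date>15 January 2009</date></opener>"
    "<body><sentence>The candidate has fine penmanship.</sentence></body>"
    "</letter>"
)

# ElementTree elements do not store a link to their parent, so build a
# child-to-parent lookup table first.
parent_of = {child: parent for parent in doc.iter() for child in parent}

def ancestors(el):
    """Walk upwards from an element, collecting the names of its ancestors."""
    chain = []
    while el in parent_of:
        el = parent_of[el]
        chain.append(el.tag)
    return chain

address = doc.find("opener/address")
parent = parent_of[address]
siblings = [el.tag for el in parent if el is not address]
descendants = [el.tag for el in doc.iter() if el is not doc]

print("parent:", parent.tag)
print("siblings:", siblings)
print("ancestors:", ancestors(address))
print("descendants of the root:", descendants)
```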
Schemas (DTDs & others)
A syntax for your XML documents, specifying:
• Which elements may nest inside of others
• In what order these elements must occur
• How many times they may repeat
• What attributes they may have
• What values those attributes may have
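The kinds of constraints listed above can be imitated in a few lines of Python. This is only a toy: the `RULES` table and the `problems` function are invented for this sketch, each child is assumed to occur at most once, and a real project would use a DTD, RELAX NG, or W3C XML Schema with a proper validator instead.

```python
import xml.etree.ElementTree as ET

# A toy 'schema' (hypothetical): for each element, the children it may
# contain (in this order, each at most once) and the values its
# attributes may take.
RULES = {
    "letter": {"children": ["salutation", "body", "closing"], "attributes": {}},
    "salutation": {"children": [], "attributes": {"type": {"formal", "informal"}}},
    "body": {"children": [], "attributes": {}},
    "closing": {"children": [], "attributes": {}},
}

def problems(element):
    """Return a list of messages describing how the element breaks RULES."""
    rule = RULES.get(element.tag)
    if rule is None:
        return [f"<{element.tag}> is not a known element"]
    found = []
    allowed = rule["children"]
    positions = []
    for child in element:
        if child.tag not in allowed:
            found.append(f"<{child.tag}> may not appear inside <{element.tag}>")
            continue
        positions.append(allowed.index(child.tag))
        found.extend(problems(child))
    if positions != sorted(positions):
        found.append(f"children of <{element.tag}> are out of order")
    if len(positions) != len(set(positions)):
        found.append(f"a child of <{element.tag}> is repeated")
    for name, value in element.attrib.items():
        permitted = rule["attributes"].get(name)
        if permitted is None:
            found.append(f"<{element.tag}> may not have attribute '{name}'")
        elif value not in permitted:
            found.append(f"'{value}' is not an allowed value for '{name}'")
    return found

good = ET.fromstring(
    '<letter><salutation type="formal">Dear Awards Committee:</salutation>'
    "<body>The candidate has fine penmanship.</body>"
    "<closing>Sincerely yours,</closing></letter>"
)
bad = ET.fromstring(
    "<letter><closing>Sincerely yours,</closing>"
    '<salutation type="sarcastic">Dear Awards Committee:</salutation></letter>'
)

print(problems(good))
print(problems(bad))
```

The second document is well-formed XML, yet it breaks two of the toy rules (element order and an attribute value), which is the gap a schema is meant to close.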
Why would you want to constrain your document structure like this?
• Prevent errors in creating the XML
• Make it easier to search the text
Structure, not appearance
Most people use XML to describe the structure of a document rather than its appearance.
Information about how to render various components of the document is usually stored separately, in a stylesheet.
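A small sketch of this separation: the letter's structure lives in the XML, while a separate table of rules decides how each part is displayed. The `STYLES` dictionary and `render` function are stand-ins invented for this example; in practice the rules would usually be written in a stylesheet language such as CSS or XSLT.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    "<letter>"
    "<address>123 Kelly Road</address>"
    "<body>The candidate has fine penmanship.</body>"
    "<closing>Sincerely yours,</closing>"
    "</letter>"
)

# A stand-in for a stylesheet (hypothetical): presentation rules are
# kept apart from the text they apply to.
STYLES = {"address": "right", "closing": "right", "body": "left"}

def render(doc, styles, width=40):
    """Lay out each child element according to the style for its name."""
    lines = []
    for el in doc:
        text = el.text or ""
        if styles.get(el.tag) == "right":
            lines.append(text.rjust(width))
        else:
            lines.append(text)
    return "\n".join(lines)

print(render(letter, STYLES))
```

Deciding later that the address should be left-aligned means editing one entry in `STYLES`; the encoded letters stay untouched.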
Brilliant. But how do we keep from reinventing the wheel in determining good ways to constrain our document structure? And wouldn’t it be good to make sure we use the same vocabulary of element and attribute names as our colleagues so that we can use each other’s texts?
Use an existing schema!
• Dublin Core
• DocBook
• Encoded Archival Description
• National Library of Medicine
• VRA Core
• MARC
www.tei-c.org
TEI Mission
“The mission of the Text Encoding Initiative is to develop and maintain a set of high-quality guidelines for the encoding of humanities texts, and to support their use by a wide community of projects, institutions, and individuals.”
http://www.tei-c.org/About/mission.xml
TEI Guidelines
The term means two things:
• the formal documentation, printed or online, produced by the TEI Consortium to define and describe the encoding system
• the encoding scheme (markup language and tag set) described in that documentation
Defining terms: “markup” and “encoding”
Together, the formal schema and the documentation provide a means to make explicit certain features of a text in such a way as to aid the processing of that text by computer programs, regardless of platform and operating system.
The process of making these features explicit is called markup or encoding.
What does TEI make explicit?
• structural divisions within a text
– title-page, chapter, scene, stanza, line, etc.
• typographical elements
– changes in typeface, special characters, etc.
• other textual features
– grammatical structures, location of illustrations, variant forms, etc.
TEI Guidelines: what flexibility facilitates . . .
Modular and customizable schema to encompass a wide range of texts, periods & purposes:
• documentary texts
• literary texts
• linguistics
• dictionaries
• corpora creation
• written texts
• spoken texts
• born digital texts
• ancient texts
– on papyri, stone
• medieval texts
– illuminated manuscripts
• modern texts
– variorum
– handwritten
– typewritten
– born digital
TEI and Regularising
• allows the search engine to find like strings even when spelled differently, or referred to by another name
• allows us, via attribute values, to add this intelligence to the document without altering the original text.
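A brief sketch of regularising through attribute values: each name keeps its original spelling in the text, while an attribute carries a shared identifier. The sentence, the `key` values, and the `forms_by_key` index are all invented for this illustration, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical encoded sentence: two spellings of one name share a key.
text = ET.fromstring(
    "<p>"
    '<name key="murphy_jane">Jane Murphy</name> wrote on 15 January; '
    '<name key="murphy_jane">J. Murphy</name> wrote again in March.'
    "</p>"
)

# Group the forms found in the text under their regularised key.
forms_by_key = defaultdict(list)
for name in text.iter("name"):
    forms_by_key[name.get("key")].append(name.text)

# The reading text itself is unchanged by the added attributes.
original = "".join(text.itertext())

print(dict(forms_by_key))
print(original)
```

A search for the key finds both spellings, while the transcription still shows exactly what the source says.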
Ensuring the use of the TEI by a large community with customization
http://www.tei-c.org/Roma/
Example: Roma
• all TEI documents follow the same essential format . . .
• TEI header
– documents the electronic edition being created
• TEI body
– contains the content being created
TEI Header and TEI Body
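A minimal sketch of that two-part shape, assuming a TEI P5 document in the TEI namespace, where the body sits inside a `<text>` element. The sample content and the `local_name` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Illustrative document; the titles and descriptions are made up.
DOC = f"""
<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished teaching example.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a typed original.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The candidate has fine penmanship.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(DOC)

def local_name(el):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return el.tag.split("}", 1)[-1]

top_level = [local_name(child) for child in root]
print(top_level)
```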
• the common core of textual features is easily shared . . .
The TEI Class System: To aid comprehension, modularity, and modification, elements are grouped into classes based on commonality:
(1) attribute classes: elements that share some set of attributes
(2) model classes: elements that appear in the same locations in a content model (e.g., ).
Customization: Roma and TEI Lite
TEI Lite:
• a pre-compiled subset of TEI designed to meet average encoding needs
• ‘conceived of as a simple demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community’
TEI Lite is most suited to
• printed texts
• light encoding
• a good way to start new projects
Goals of TEI Lite
• includes most of the TEI ‘core’ tag set, which contains elements relevant to virtually all text types and all kinds of text-processing work;
• handles adequately a reasonably wide variety of texts, at the level of detail found in existing practice (for example, the Oxford Text Archive);
• is useful for the production of new documents as well as the encoding of existing ones
The TEI header
The TEI header (<teiHeader>) is the ‘virtual title page’ of a TEI document. It contains metadata (information about the TEI document).
<teiHeader> is the first, mandatory child element of the root <TEI> element; therefore, it appears at the top (‘at the head’) of every TEI document.
The header metadata provides:
• a bibliographic record of the electronic text as well as the source from which the electronic text is derived
• documentation of the encoding and editorial principles used in tagging the electronic text
• terms for indexing, searching, and retrieval
• a record of changes made to the electronic document
Structure of the header
The header contains many specialised elements not found anywhere in the ‘body’ of a TEI document (everything after <teiHeader>). These elements allow for highly structured descriptions of the document.
Many parts of the header allow free-form prose descriptions as an alternative to the highly structured descriptions.
Few header elements are required, so a header can be quite minimal.
The four children of <teiHeader>
1. <fileDesc>: bibliographic info (required)
2. <encodingDesc>: description of encoding practices (optional)
3. <profileDesc>: search terms (optional)
4. <revisionDesc>: record of changes (optional)
Each can have many child elements
The four children of <teiHeader>, plus children of <fileDesc>
1. <fileDesc>: bibliographic info (required)
   1. <titleStmt> (required)
   2. <editionStmt> (optional)
   3. <extent> (optional)
   4. <publicationStmt> (required)
   5. <seriesStmt> (optional)
   6. <notesStmt> (optional)
   7. <sourceDesc> (required)
2. <encodingDesc>: description of encoding practices (optional)
3. <profileDesc>: search terms (optional)
4. <revisionDesc>: record of changes (optional)
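The required parts listed above can be checked mechanically. This is a hedged sketch rather than real schema validation: `missing_required` is a hypothetical helper, the two sample headers are invented, and it only tests for the three children of `<fileDesc>` that the list marks as required.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
REQUIRED = ["titleStmt", "publicationStmt", "sourceDesc"]

def missing_required(header_xml):
    """List the required parts of <fileDesc> that a header lacks."""
    header = ET.fromstring(header_xml)
    file_desc = header.find("tei:fileDesc", NS)
    if file_desc is None:
        return ["fileDesc"]
    return [name for name in REQUIRED
            if file_desc.find(f"tei:{name}", NS) is None]

# Illustrative headers; the content is made up.
minimal = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>A sample letter</title></titleStmt>
    <publicationStmt><p>Unpublished teaching example.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a typed original.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""
incomplete = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>A sample letter</title></titleStmt>
    <publicationStmt><p>Unpublished teaching example.</p></publicationStmt>
  </fileDesc>
</teiHeader>
"""

print(missing_required(minimal))
print(missing_required(incomplete))
```

Element names are case-sensitive, so a header written with `<filedesc>` would be reported as lacking `fileDesc`.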
Children of <encodingDesc> (2)
• <editorialDecl> explains editorial principles of encoding or transcribing texts. Can contain a prose description or use up to seven specialised child elements to describe:
– corrections or normalisation performed during the transcription
– handling of quotation marks and hyphenation
– any standardisation of dates or numbers performed
– analytic or interpretive information added to the text
a lot of work …
Creating good, consistent metadata for a collection of documents is hard, and it’s not something most of us find interesting.
However, digital texts, just like the primary source material we all study, often end up being studied in ways that the authors never intended or even imagined. It’s good to give as much context about the text as is feasible to help others make use of the TEI document in the future.
Controlled vocabularies, thesauri and authority files
A controlled vocabulary is a standard set of keywords designed to cover a particular area of study.
A thesaurus or authority file is a controlled vocabulary containing synonyms pointing to the ‘authorised’ form that you should use. Some thesauri even contain a hierarchy of terms.
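As a small illustration of synonyms pointing to an authorised form, here is a hand-made lookup table. The `AUTHORITY` table and the `authorised` function are hypothetical; a real project would draw on an existing vocabulary rather than build its own.

```python
# A hypothetical, hand-made authority file: each variant form points to
# the authorised form that should be used in the metadata.
AUTHORITY = {
    "Baile Átha Cliath": "Dublin",
    "Dublin City": "Dublin",
    "Dublin": "Dublin",
    "Corcaigh": "Cork",
    "Cork": "Cork",
}

def authorised(form):
    """Return the authorised form, or None if the term is not listed."""
    return AUTHORITY.get(form.strip())

print(authorised("Baile Átha Cliath"))
print(authorised("Corcaigh"))
print(authorised("Atlantis"))
```

Returning `None` for an unknown term makes it easy to flag entries that still need an authorised form.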
Controlled vocabularies, thesauri and authority files
Some controlled vocabularies are built into the TEI (like codes for languages). Others are given in the TEI as suggestions (like Library of Congress Subject Headings).
If you use the authorised forms of names, you can disambiguate people with similar names, and your users will be able to search your materials alongside other materials.
There are lots of controlled vocabularies out there. Don’t ‘reinvent the wheel’!
Some examples
• Library of Congress Authorities:
– subject headings (LCSH)
– names of authors, editors, etc.
– titles of well-known literary works
• Getty Thesaurus of Geographic Names
• Placenames Database of Ireland
• Northern Ireland Place-Name Project
• Dictionary of Irish Biography
Questions?