
Dr Susan Schreibman

Long Room Hub Senior Lecturer in Digital Humanities

XML and TEI

From Metadata to Linked Data

July 2011 (with thanks to Kevin Hawkins)


proprietary vs. non-proprietary formats

closed vs. open standards


Plain text isn’t good enough for many editorial projects

123 Kelly Road

Dublin 19

15 January 2009

Dear Awards Committee:

The candidate has fine penmanship.

Sincerely yours,

Jane Murphy


What if you want to …

• Publish a collection of letters and decide after beginning that you want to have the sender’s address and closing always right-aligned?

• Search your collection of letters to extract a list of all senders and another list of all recipients?

You need to make explicit certain features of text in order to aid the processing of that text by computer programs.


Word processor styles: create your own!


Extensible Markup Language (XML): word processor styles on steroids

• Can have one style inside another (‘nesting’)

• Can give properties to these styles, e.g.,
– This salutation is formal.
– This sentence contains sarcasm.
– This word is misspelled.


XML in brief (1)

• Open, non-proprietary standard

• Stored in plain text but usually thought of as contrasting with it (as above)

• Marks the beginning and end of each span of text using tags: <sentence>This is a sentence.</sentence>


XML in brief (2)

• Spans of text must nest properly:

Wrong: <sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>

Right: <sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>
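The nesting rule can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is introduced here for illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Overlapping tags (among other errors) make the parser give up.
        return False


wrong = "<sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>"
right = "<sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>"

print(is_well_formed(wrong))  # False
print(is_well_formed(right))  # True
```

A parser stops at the first nesting error, which is why overlapping spans cannot be expressed directly with ordinary start and end tags.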


Elements (tags), attributes, values, content

<sentence type="declarative">This is a sentence.</sentence>

<sentence type="interrogative">Is this a sentence?</sentence>

Elements may have zero, one, or more than one attribute.
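To see how a program tells these four parts apart, here is a small sketch using Python's standard xml.etree.ElementTree module. Note that attribute values must be delimited by straight quotation marks; the curly quotes that presentation software tends to substitute are not accepted by a parser.

```python
import xml.etree.ElementTree as ET

sentence = ET.fromstring('<sentence type="declarative">This is a sentence.</sentence>')

print(sentence.tag)          # element name: sentence
print(sentence.attrib)       # all attributes: {'type': 'declarative'}
print(sentence.get("type"))  # value of one attribute: declarative
print(sentence.text)         # content: This is a sentence.

# Asking for an attribute the element does not carry returns None.
print(sentence.get("mood"))  # None
```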


Wait, this all looks a lot like HTML!

HTML is a specific application of SGML, the predecessor of XML, with a fixed set of pre-defined elements and attributes. You can’t create your own elements, so its usefulness for marking up texts is limited.


XML as a tree

Remember, everything must nest properly!

We use family tree terms: parent, child, sibling, ancestor, and descendant.
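The family tree terms map directly onto operations on a parsed document. The sketch below assumes Python's standard xml.etree.ElementTree module and a made-up letter vocabulary (letter, opener, address, date, body, sentence are illustrative names, not a published schema).

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    "<letter>"
    "<opener><address>123 Kelly Road</address><date>15 January 2009</date></opener>"
    "<body><sentence>The candidate has fine penmanship.</sentence></body>"
    "</letter>"
)

# ElementTree elements do not remember their parent, so build a child -> parent map.
parent_of = {child: parent for parent in letter.iter() for child in parent}

date = letter.find("./opener/date")
parent = parent_of[date]
siblings = [el.tag for el in parent if el is not date]
descendants = [el.tag for el in letter.iter() if el is not letter]

# Walk upwards from <date> to collect its ancestors.
ancestors = []
node = date
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

print(parent.tag)    # opener
print(siblings)      # ['address']
print(ancestors)     # ['opener', 'letter']
print(descendants)   # ['opener', 'address', 'date', 'body', 'sentence']
```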




Schemas (DTDs & others)

A syntax for your XML documents, specifying:

• Which elements may nest inside of others

• In what order these elements must occur

• How many times they may repeat

• What attributes they may have

• What values those attributes may have
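To make these kinds of constraints concrete, here is a toy checker in Python. It is only an illustration of the idea: it is not a DTD or any real schema language, the rule tables and the letter vocabulary are invented for this example, and it does not check repetition counts.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: allowed children of each element, in the order required.
ALLOWED_CHILDREN = {
    "letter": ["opener", "body", "closer"],
    "opener": ["address", "date", "salute"],
    "body": ["sentence"],
    "closer": ["salute", "signed"],
}
# Hypothetical rules: allowed attributes and their allowed values.
ALLOWED_ATTRIBUTES = {"salute": {"type": {"formal", "informal"}}}


def check(element, problems=None):
    """Collect rule violations for element and everything nested inside it."""
    problems = [] if problems is None else problems
    allowed = ALLOWED_CHILDREN.get(element.tag, [])
    position = 0
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed in <{element.tag}>")
        else:
            index = allowed.index(child.tag)
            if index < position:
                problems.append(f"<{child.tag}> out of order in <{element.tag}>")
            position = max(position, index)
        check(child, problems)
    for name, value in element.attrib.items():
        values = ALLOWED_ATTRIBUTES.get(element.tag, {}).get(name)
        if values is None:
            problems.append(f"attribute {name} not allowed on <{element.tag}>")
        elif value not in values:
            problems.append(f"value {value!r} not allowed for {name} on <{element.tag}>")
    return problems


good = ET.fromstring(
    '<letter><opener><address>123 Kelly Road</address>'
    '<salute type="formal">Dear Awards Committee:</salute></opener>'
    '<body><sentence>The candidate has fine penmanship.</sentence></body></letter>'
)
bad = ET.fromstring(
    '<letter><body><date>15 January 2009</date></body>'
    '<opener><salute type="sarcastic">Dear Awards Committee:</salute></opener></letter>'
)

print(check(good))  # []
for problem in check(bad):
    print(problem)
```

In practice the rules are written in a schema language and enforced by a validating parser or editor rather than by hand-written code like this.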


Why would you want to constrain your document structure like this?

• Prevent errors in creating the XML

• Make it easier to search the text
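As a sketch of the search benefit, suppose every letter in a collection marks its sender and recipient the same way. The element names (letters, letter, sender, recipient, body) and the second letter are invented for this illustration.

```python
import xml.etree.ElementTree as ET

collection = ET.fromstring("""
<letters>
  <letter>
    <sender>Jane Murphy</sender>
    <recipient>Awards Committee</recipient>
    <body>The candidate has fine penmanship.</body>
  </letter>
  <letter>
    <sender>Awards Committee</sender>
    <recipient>Jane Murphy</recipient>
    <body>Thank you for your letter.</body>
  </letter>
</letters>
""")

# Because the structure is predictable, one line per list is enough.
senders = sorted({el.text for el in collection.iter("sender")})
recipients = sorted({el.text for el in collection.iter("recipient")})

print(senders)     # ['Awards Committee', 'Jane Murphy']
print(recipients)  # ['Awards Committee', 'Jane Murphy']
```

With plain text, the same task would mean guessing which lines of each letter hold a name.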


Structure, not appearance

Most people use XML to describe the structure of a document rather than its appearance.

Information about how to render various components of the document is usually stored separately, in a stylesheet.
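The separation can be sketched in a few lines of Python. Real projects normally use a stylesheet language such as CSS or XSLT; the dictionary below merely stands in for one, and the element names and the render function are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    "<letter>"
    "<address>123 Kelly Road</address>"
    "<salute>Dear Awards Committee:</salute>"
    "<body>The candidate has fine penmanship.</body>"
    "<closer>Sincerely yours,</closer>"
    "</letter>"
)

# The "stylesheet" lives apart from the document: element name -> alignment.
STYLESHEET = {"address": "right", "closer": "right"}


def render(doc, stylesheet, width=40):
    """Render each child element as one line, aligned as the stylesheet says."""
    lines = []
    for element in doc:
        text = element.text or ""
        if stylesheet.get(element.tag) == "right":
            lines.append(text.rjust(width))
        else:
            lines.append(text)
    return "\n".join(lines)


print(render(letter, STYLESHEET))
```

Deciding later that the address and closing should be right-aligned means editing the stylesheet once, not every letter in the collection.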


Brilliant. But how do we keep from reinventing the wheel in determining good ways to constrain our document structure? And wouldn’t it be good to make sure we use the same vocabulary of element and attribute names as our colleagues so that we can use each other’s texts?


Use an existing schema!

• Dublin Core

• DocBook

• Encoded Archival Description

• National Library of Medicine

• VRA Core

• MARC


www.tei-c.org


TEI Mission

“The mission of the Text Encoding Initiative is to develop and maintain a set of high-quality guidelines for the encoding of humanities texts, and to support their use by a wide community of projects, institutions, and individuals.”

http://www.tei-c.org/About/mission.xml


TEI Guidelines

The term means two things:

• the formal documentation, printed or online, produced by the TEI Consortium to define and describe the encoding system

• the encoding scheme (markup language and tag set) described in that documentation


Defining terms: “markup” and “encoding”

The formal schema and the documentation work together to provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer programs, regardless of platform and operating system.

The process of making these features explicit is called markup or encoding.


What does TEI make explicit?

• structural divisions within a text: title page, chapter, scene, stanza, line, etc.

• typographical elements: changes in typeface, special characters, etc.

• other textual features: grammatical structures, location of illustrations, variant forms, etc.


TEI Guidelines: what flexibility facilitates . . .

Modular and customizable schema to encompass a wide range of texts, periods & purposes:

• documentary texts

• literary texts

• linguistics

• dictionaries

• corpora creation

• written texts

• spoken texts

• born digital texts

• ancient texts
– on papyri, stone

• medieval texts
– illuminated manuscripts

• modern texts
– variorum
– handwritten
– typewritten
– born digital


TEI and Regularising

– allows the search engine to find like strings even when spelled differently, or referred to by another name

– allows us, via attribute values, to add this intelligence to the document without altering the original text.
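A minimal sketch of regularising through attribute values, assuming the TEI key attribute on name elements: the key value murphy_jane and the sample sentence are invented, and the fragment omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

passage = ET.fromstring(
    "<p>"
    'Letters from <name key="murphy_jane">Jane Murphy</name>, '
    'from <name key="murphy_jane">J. Murphy</name> and '
    'from <name key="murphy_jane">Mrs Murphy</name>.'
    "</p>"
)


def find_by_key(root, key):
    """Return the text of every <name> whose key attribute matches."""
    return [name.text for name in root.iter("name") if name.get("key") == key]


# All three spellings are found through the shared key ...
print(find_by_key(passage, "murphy_jane"))
# ['Jane Murphy', 'J. Murphy', 'Mrs Murphy']

# ... while the transcribed text itself is unchanged.
original_text = "".join(passage.itertext())
print(original_text)
# Letters from Jane Murphy, from J. Murphy and from Mrs Murphy.
```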


Ensuring the use of the TEI by a large community with customization

http://www.tei-c.org/Roma/

Example: Roma


All TEI documents follow the same essential format . . .

• TEI header
– documents the electronic edition being created

• TEI body
– contains the content being created


TEI Header and TEI Body
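The overall shape can be sketched as a very small TEI document parsed with Python's standard library. The namespace is the TEI P5 namespace; the title, publication and source statements are placeholder content invented for this example, and local_name is a helper introduced here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

document = ET.fromstring(f"""
<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Letter to the Awards Committee</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Typed letter, 15 January 2009.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The candidate has fine penmanship.</p>
    </body>
  </text>
</TEI>
""")


def local_name(element):
    """Strip the {namespace} prefix that ElementTree puts on tag names."""
    return element.tag.split("}", 1)[-1]


children = [local_name(child) for child in document]
print(local_name(document))  # TEI
print(children)              # ['teiHeader', 'text']
```

The header comes first and describes the electronic edition; the text element that follows holds the content itself.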


The common core of textual features is easily shared . . .

The TEI Class System: To aid comprehension, modularity, and modification, elements are grouped into classes based on commonality:

(1) attribute classes: elements that share some set of attributes

(2) model classes: elements that appear in the same locations in a content model (e.g., ).


Customization: Roma and TEI Lite

TEI Lite:

• a pre-compiled subset of TEI designed to meet average encoding needs

• ‘conceived of as a simple demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community’


TEI Lite is most suited to

• printed texts

• light encoding

• a good way to start new projects


Goals of TEI Lite

• include most of the TEI ‘core’ tag set, which contains elements relevant to virtually all text types and all kinds of text-processing work;

• handle adequately a reasonably wide variety of texts, at the level of detail found in existing practice (for example, the Oxford Text Archive);

• be useful for the production of new documents as well as the encoding of existing ones


The TEI header

The TEI header (<teiHeader>) is the ‘virtual title page’ of a TEI document. It contains metadata (information about the TEI document).

<teiHeader> is the first, mandatory child element of the root <TEI> element; therefore, it appears at the top (‘at the head’) of every TEI document.


The header metadata provides:

• a bibliographic record of the electronic text as well as the source from which the electronic text is derived

• documentation of the encoding and editorial principles used in tagging the electronic text

• terms for indexing, searching, and retrieval

• a record of changes made to the electronic document


Structure of the header

The header contains many specialised elements not found anywhere in the ‘body’ of a TEI document (everything after <teiHeader>). These elements allow for highly structured descriptions of the document.

Many parts of the header allow free-form prose descriptions as an alternative to the highly structured descriptions.

Few header elements are required, so a header can be quite minimal.


The four children of <teiHeader>

1. <fileDesc>: bibliographic info (required)

2. <encodingDesc>: description of encoding practices (optional)

3. <profileDesc>: search terms (optional)

4. <revisionDesc>: record of changes (optional)


Each can have many child elements


The four children of <teiHeader>, plus children of <fileDesc>

1. <fileDesc>: bibliographic info (required)
   1. <titleStmt> (required)
   2. <editionStmt> (optional)
   3. <extent> (optional)
   4. <publicationStmt> (required)
   5. <seriesStmt> (optional)
   6. <notesStmt> (optional)
   7. <sourceDesc> (required)

2. <encodingDesc>: description of encoding practices (optional)

3. <profileDesc>: search terms (optional)

4. <revisionDesc>: record of changes (optional)
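The required parts of the file description can be checked with a short script. This is a sketch only: missing_required is a helper invented for this example, the header content is placeholder text, and a real project would rely on schema validation rather than a hand-written check.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
REQUIRED_IN_FILEDESC = ["titleStmt", "publicationStmt", "sourceDesc"]


def missing_required(header):
    """List the required <fileDesc> parts that a <teiHeader> lacks."""
    file_desc = header.find("tei:fileDesc", NS)
    if file_desc is None:
        return ["fileDesc"]
    return [
        name for name in REQUIRED_IN_FILEDESC
        if file_desc.find(f"tei:{name}", NS) is None
    ]


minimal = ET.fromstring("""
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Sample letter</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Invented for this example.</p></sourceDesc>
  </fileDesc>
</teiHeader>
""")

incomplete = ET.fromstring("""
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Sample letter</title></titleStmt>
  </fileDesc>
</teiHeader>
""")

print(missing_required(minimal))     # []
print(missing_required(incomplete))  # ['publicationStmt', 'sourceDesc']
```

The minimal header shows how little is strictly needed: one required child of the header, with three required children of its own.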


Children of <encodingDesc> (2)

• <editorialDecl> explains editorial principles of encoding or transcribing texts. Can contain a prose description or use up to seven specialised child elements to describe:
– corrections or normalisation performed during the transcription
– handling of quotation marks and hyphenation
– any standardisation of dates or numbers performed
– analytic or interpretive information added to the text


a lot of work …

Creating good, consistent metadata for a collection of documents is hard, and it’s not something most of us find interesting.

However, digital texts, just like the primary source material we all study, often end up being studied in ways that the authors never intended or even imagined. It’s good to give as much context about the text as is feasible to help others make use of the TEI document in the future.


Controlled vocabularies, thesauri and authority files

A controlled vocabulary is a standard set of keywords designed to cover a particular area of study.

A thesaurus or authority file is a controlled vocabulary containing synonyms pointing to the ‘authorised’ form that you should use. Some thesauri even contain a hierarchy of terms.
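In code, an authority file behaves like a lookup table from variant forms to the authorised form. The entries below are invented for illustration and are not taken from any real authority file.

```python
# Hypothetical authority file: variant form -> authorised form.
AUTHORITY = {
    "Jane Murphy": "Murphy, Jane",
    "J. Murphy": "Murphy, Jane",
    "Mrs Murphy": "Murphy, Jane",
}


def authorised_form(name):
    """Return the authorised form of a name, or the name itself if unknown."""
    return AUTHORITY.get(name, name)


print(authorised_form("J. Murphy"))     # Murphy, Jane
print(authorised_form("Kevin Hawkins")) # Kevin Hawkins (no entry, returned unchanged)
```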


Controlled vocabularies, thesauri and authority files

Some controlled vocabularies are built into the TEI (like codes for languages). Others are given in the TEI as suggestions (like Library of Congress Subject Headings).

If you use the authorised forms of names, you can disambiguate people with similar names, and your users will be able to search your materials together with other materials that use the same forms.

There are lots of controlled vocabularies out there. Don’t ‘reinvent the wheel’!


Some examples

• Library of Congress Authorities:
– subject headings (LCSH)
– names of authors, editors, etc.
– titles of well-known literary works

• Getty Thesaurus of Geographic Names

• Placenames Database of Ireland

• Northern Ireland Place-Name Project

• Dictionary of Irish Biography


Questions?
