Dr Susan Schreibman Long Room Hub Senior Lecturer in Digital Humanities XML and TEI From Metadata to Linked Data July 2011 (with thanks to Kevin Hawkins)
Transcript
Page 1:

Dr Susan Schreibman
Long Room Hub Senior Lecturer in Digital Humanities

XML and TEI

From Metadata to Linked Data

July 2011 (with thanks to Kevin Hawkins)

Page 2:

proprietary vs. non-proprietary formats

closed vs. open standards

Page 3:

Plain text isn’t good enough for many editorial projects

123 Kelly Road

Dublin 19

15 January 2009

Dear Awards Committee:

The candidate has fine penmanship.

Sincerely yours,

Jane Murphy

Page 4:

What if you want to …

• Publish a collection of letters and decide after beginning that you want to have the sender’s address and closing always right-aligned?

• Search your collection of letters to extract a list of all senders and another list of all recipients?

You need to make explicit certain features of text in order to aid the processing of that text by computer programs.
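As a rough sketch of what "making features explicit" buys you, the following Python snippet (standard library only) lists senders and recipients from a small marked-up collection. The element names <letters>, <letter>, <sender>, <recipient> and <body>, and the second letter, are invented for illustration and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a small collection of letters; the element names
# and the second letter are illustrative assumptions, not a real schema.
COLLECTION = """
<letters>
  <letter>
    <sender>Jane Murphy</sender>
    <recipient>Awards Committee</recipient>
    <body>The candidate has fine penmanship.</body>
  </letter>
  <letter>
    <sender>Awards Committee</sender>
    <recipient>Jane Murphy</recipient>
    <body>Thank you for your letter.</body>
  </letter>
</letters>
"""

root = ET.fromstring(COLLECTION)

# Because senders and recipients are tagged explicitly, listing them is a
# simple query rather than a guess based on position within the letter.
senders = [el.text for el in root.iter("sender")]
recipients = [el.text for el in root.iter("recipient")]

print(senders)
print(recipients)
```

With plain text, the same task would depend on guessing which line of each letter holds the sender.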

Page 5:

Word processor styles: create your own!

Page 6:

Extensible Markup Language (XML): word processor styles on steroids

• Can have one style inside another (‘nesting’)
• Can give properties to these styles, e.g.,
– This salutation is formal.
– This sentence contains sarcasm.
– This word is misspelled.

Page 7:

XML in brief (1)

• Open, non-proprietary standard
• Stored in plain text but usually thought of as contrasting with it (as above)
• Marks the beginnings and ends of spans of text using tags:

<sentence>This is a sentence.</sentence>

Page 8:

XML in brief (2)

• Spans of text must nest properly:

Wrong: <sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>

Right: <sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>
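The nesting rule can be checked mechanically. This minimal sketch feeds both versions to Python's standard-library XML parser, which rejects overlapping tags; the helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

WRONG = "<sentence>Overlap is <emphasis>not allowed!</sentence></emphasis>"
RIGHT = "<sentence>Overlap is <emphasis>not allowed!</emphasis></sentence>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for overlapping or unclosed tags, among other errors.
        return False


print(is_well_formed(WRONG))
print(is_well_formed(RIGHT))
```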

Page 9:

Elements (tags), attributes, values, content

<sentence type="declarative">This is a sentence.</sentence>

<sentence type="interrogative">Is this a sentence?</sentence>

Elements may have zero, one, or more than one attribute.
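To see the four terms side by side, here is a small sketch that parses the declarative sentence with Python's standard library and pulls out the element name, its attributes with their values, and its content. Note that the parser requires straight quotation marks around attribute values.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<sentence type="declarative">This is a sentence.</sentence>'
)

name = element.tag           # the element (tag) name
attributes = element.attrib  # attribute names mapped to their values
content = element.text       # the text content between the tags

print(name)
print(attributes)
print(content)
```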

Page 10:

Wait, this all looks a lot like HTML!

HTML is a specific implementation of XML (well, actually, its predecessor SGML) that has pre-defined elements and attributes. You can’t create your own elements, so its usefulness is limited.

Page 11:

XML as a tree

Remember, everything must nest properly!

We use family tree terms: parent, child, sibling, ancestor, and descendant.
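The family-tree terms map directly onto how software walks an XML document. The sketch below uses an invented <letter> structure (the element names are assumptions) and Python's standard library to find the parent, siblings, ancestors, and descendants of one element.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure for the sample letter; element names are assumptions.
DOC = """
<letter>
  <opener>
    <address>123 Kelly Road</address>
    <date>15 January 2009</date>
  </opener>
  <body>The candidate has fine penmanship.</body>
</letter>
"""

root = ET.fromstring(DOC)

# ElementTree elements do not link back to their parent,
# so build a child-to-parent lookup table first.
parent_of = {child: parent for parent in root.iter() for child in parent}

address = root.find("opener/address")
parent = parent_of[address]

# Siblings share the same parent.
siblings = [el.tag for el in parent if el is not address]

# Ancestors: follow parent links up to the root.
ancestors = []
node = address
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Descendants of the root: every element below it, in document order.
descendants = [el.tag for el in root.iter() if el is not root]

print(parent.tag, siblings, ancestors, descendants)
```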

Page 12:

XML as a tree

Remember, everything must nest properly!

Page 13:

Schemas (DTDs & others)

A syntax for your XML documents, specifying:

• Which elements may nest inside of others
• In what order these elements must occur
• How many times they may repeat
• What attributes they may have
• What values those attributes may have
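To give a feel for what a schema constrains, here is a toy checker in Python. It is a stand-in written for this example, not DTD or RELAX NG syntax, and it covers only three of the constraints listed above (nesting, permitted attributes, permitted values), leaving out order and repetition. The rule tables and the validate function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy rules standing in for a schema (hypothetical, for illustration only).
ALLOWED_CHILDREN = {
    "letter": {"sentence"},
    "sentence": {"emphasis"},
    "emphasis": set(),
}
ALLOWED_ATTRIBUTES = {
    "sentence": {"type": {"declarative", "interrogative"}},
}


def validate(element):
    """Return a list of human-readable problems found in the element tree."""
    problems = []
    if element.tag not in ALLOWED_CHILDREN:
        problems.append(f"unknown element <{element.tag}>")
        return problems
    allowed_attrs = ALLOWED_ATTRIBUTES.get(element.tag, {})
    for attr, value in element.attrib.items():
        if attr not in allowed_attrs:
            problems.append(f"<{element.tag}> may not have attribute '{attr}'")
        elif value not in allowed_attrs[attr]:
            problems.append(f"'{value}' is not an allowed value for '{attr}'")
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> may not appear inside <{element.tag}>")
        else:
            problems.extend(validate(child))
    return problems


good = ET.fromstring('<letter><sentence type="declarative">Fine.</sentence></letter>')
bad = ET.fromstring('<letter><sentence type="sarcastic"><letter/></sentence></letter>')

print(validate(good))
print(validate(bad))
```

In practice you would validate against a real schema with a validating parser or XML editor rather than hand-written rules like these.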

Page 14:

Why would you want to constrain your document structure like this?

• Prevent errors in creating the XML

• Make it easier to search the text

Page 15:

Structure, not appearance

Most people use XML to describe the structure of a document rather than its appearance.

Information about how to render various components of the document is usually stored separately, in a stylesheet.
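The separation of structure from appearance can be sketched in a few lines of Python. Real projects typically use a stylesheet language such as XSLT or CSS; the two dictionaries below are only hypothetical stand-ins for stylesheets, and the element names are assumptions.

```python
import xml.etree.ElementTree as ET

# The document records structure only (hypothetical element names).
LETTER = (
    "<letter>"
    "<salutation>Dear Awards Committee:</salutation>"
    "<closing>Sincerely yours,</closing>"
    "</letter>"
)

# Two stand-in "stylesheets": rendering rules kept apart from the document.
PLAIN_STYLE = {"salutation": "{}", "closing": "{}"}
RIGHT_ALIGNED_STYLE = {"salutation": "{}", "closing": "{:>40}"}


def render(xml_text, style):
    """Render each child element as a line, using the style's format string."""
    root = ET.fromstring(xml_text)
    return [style[child.tag].format(child.text) for child in root]


for line in render(LETTER, RIGHT_ALIGNED_STYLE):
    print(line)
```

Deciding later that every closing should be right-aligned means changing one rule in the style, not editing each letter.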

Page 16:

Brilliant. But how do we keep from reinventing the wheel in determining good ways to constrain our document structure? And wouldn’t it be good to make sure we use the same vocabulary of element and attribute names as our colleagues so that we can use each other’s texts?

Page 17:

Use an existing schema!

• Dublin Core
• DocBook
• Encoded Archival Description
• National Library of Medicine
• VRA Core
• MARC

Page 18:

www.tei-c.org

Page 19:

TEI Mission

“The mission of the Text Encoding Initiative is to develop and maintain a set of high-quality guidelines for the encoding of humanities texts, and to support their use by a wide community of projects, institutions, and individuals.”

http://www.tei-c.org/About/mission.xml

Page 20:

TEI Guidelines

• The term means two things:
– the formal documentation, printed or online, produced by the TEI Consortium to define and describe the encoding system.
– the encoding scheme (markup language and tag set) described in that documentation.

Page 21:

Defining terms: “markup” and “encoding”

Together, the formal schema and the documentation provide a means to make explicit certain features of a text in such a way as to aid the processing of that text by computer programs, regardless of platform and operating system.

The process of making those features explicit is called markup or encoding.

Page 22:

What does TEI make explicit?

• structural divisions within a text: title-page, chapter, scene, stanza, line, etc.
• typographical elements: changes in typeface, special characters, etc.
• other textual features: grammatical structures, location of illustrations, variant forms, etc.

Page 23:

TEI Guidelines: what flexibility facilitates . . .

Modular and customizable schema to encompass a wide range of texts, periods & purposes:

• documentary texts
• literary texts
• linguistics
• dictionaries
• corpora creation
• written texts
• spoken texts
• born digital texts
• ancient texts
– on papyri, stone
• medieval texts
– illuminated manuscripts
• modern texts
– variorum
– handwritten
– typewritten
– born digital

Page 24:

TEI and Regularising

– allows the search engine to find like strings even when spelled differently, or referred to by another name

– allows us, via attribute values, to add this intelligence to the document without altering the original text.
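A minimal sketch of regularising through attribute values: each name keeps its original spelling in the text, while an attribute carries a shared identifier that a search can match on. The snippet is simplified (no TEI namespace), and the key values and sample sentence are invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; the key values are hypothetical identifiers.
TEXT = """
<p>
  <name key="murphy_jane">Jane Murphy</name> wrote the reference.
  The committee thanked <name key="murphy_jane">J. Murphy</name> and
  <name key="hawkins_kevin">Kevin Hawkins</name>.
</p>
"""

root = ET.fromstring(TEXT)


def find_by_key(tree, key):
    """Return every spelling in the text that is tagged with the given key."""
    return [el.text for el in tree.iter("name") if el.get("key") == key]


print(find_by_key(root, "murphy_jane"))
```

The search matches both spellings because it compares the attribute value, leaving the transcribed text untouched.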

Page 25:

Ensuring the use of the TEI by a large community with customization

Example: Roma

http://www.tei-c.org/Roma/

Page 26:

All TEI documents follow the same essential format . . .

• TEI header
– documents the electronic edition being created
• TEI body
– contains the content being created

Page 27:

TEI Header and TEI Body

Page 28:

The common core of textual features is easily shared . . .

The TEI Class System: To aid comprehension, modularity, and modification, elements are grouped into classes based on commonality:

(1) attribute classes: elements that share some set of attributes

(2) model classes: elements that appear in the same locations in a content model (e.g., ).

Page 29:

Customization: Roma and TEI Lite

TEI Lite:
• a pre-compiled subset of TEI designed to meet average encoding needs
• ‘conceived of as a simple demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community’

Page 30:

TEI Lite is most suited to

• printed texts
• light encoding
• a good way to start new projects

Page 31:

Goals of TEI Lite

• include most of the TEI ‘core’ tag set, which contains elements relevant to virtually all text types and all kinds of text-processing work;

• handle adequately a reasonably wide variety of texts, at the level of detail found in existing practice (for example, the Oxford Text Archive);

• be useful for the production of new documents as well as the encoding of existing ones

Page 32:

The TEI header

The TEI header (<teiHeader>) is the ‘virtual title page’ of a TEI document. It contains metadata (information about the TEI document).

<teiHeader> is the first, mandatory child element of the root <TEI> element; therefore, it appears at the top (‘at the head’) of every TEI document.
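To make the position of the header concrete, here is a sketch of a very small TEI document, parsed with Python's standard library to list the children of the root <TEI> element. The title, publication statement, and source description are placeholder text invented for this example, and the helper local_name is an assumption.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# A deliberately small TEI document; the prose inside it is placeholder text.
MINIMAL_TEI = f"""
<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A letter to the Awards Committee</title></titleStmt>
      <publicationStmt><p>Unpublished teaching example.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a typewritten letter.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The candidate has fine penmanship.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(MINIMAL_TEI)


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return element.tag.split("}", 1)[-1]


top_level = [local_name(child) for child in root]
print(top_level)
```

The header comes first, followed by the element that holds the text itself.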

Page 33:

The header metadata provides:

• a bibliographic record of the electronic text as well as the source from which the electronic text is derived

• documentation of the encoding and editorial principles used in tagging the electronic text

• terms for indexing, searching, and retrieval

• a record of changes made to the electronic document

Page 34:

Structure of the header

The header contains many specialised elements not found anywhere in the ‘body’ of a TEI document (everything after <teiHeader>). These elements allow for highly structured descriptions of the document.

Many parts of the header allow free-form prose descriptions as an alternative to the highly structured descriptions.

Few header elements are required, so a header can be quite minimal.

Page 35:

The four children of <teiHeader>

1. <fileDesc>: bibliographic info (required)

2. <encodingDesc>: description of encoding practices (optional)

3. <profileDesc>: search terms (optional)

4. <revisionDesc>: record of changes (optional)

Page 36:

Each can have many child elements

Page 37:

The four children of <teiHeader>, plus children of <fileDesc>

1. <fileDesc>: bibliographic info (required)
   1. <titleStmt> (required)
   2. <editionStmt> (optional)
   3. <extent> (optional)
   4. <publicationStmt> (required)
   5. <seriesStmt> (optional)
   6. <notesStmt> (optional)
   7. <sourceDesc> (required)
2. <encodingDesc>: description of encoding practices (optional)
3. <profileDesc>: search terms (optional)
4. <revisionDesc>: record of changes (optional)
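The required/optional split above can be turned into a quick check. This sketch looks only for the three required children of <fileDesc> and is no substitute for validating against the TEI schema; the function name missing_required and the sample headers are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
REQUIRED_IN_FILEDESC = ["titleStmt", "publicationStmt", "sourceDesc"]


def missing_required(header_xml):
    """List the required header parts that are absent from a <teiHeader>."""
    header = ET.fromstring(header_xml)
    file_desc = header.find("tei:fileDesc", NS)
    if file_desc is None:
        return ["fileDesc"]
    return [
        name
        for name in REQUIRED_IN_FILEDESC
        if file_desc.find(f"tei:{name}", NS) is None
    ]


# Placeholder headers written for this example.
COMPLETE = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Sample letter</title></titleStmt>
    <publicationStmt><p>Unpublished.</p></publicationStmt>
    <sourceDesc><p>Typewritten original.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

INCOMPLETE = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Sample letter</title></titleStmt>
  </fileDesc>
</teiHeader>
"""

print(missing_required(COMPLETE))
print(missing_required(INCOMPLETE))
```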

Page 38:

Children of <encodingDesc> (2)

• <editorialDecl> explains editorial principles of encoding or transcribing texts. Can contain a prose description or use up to seven specialised child elements to describe:
– corrections or normalisation performed during the transcription
– handling of quotation marks and hyphenation
– any standardisation of dates or numbers performed
– analytic or interpretive information added to the text

Page 39:

A lot of work …

Creating good, consistent metadata for a collection of documents is hard, and it’s not something most of us find interesting.

However, digital texts, just like the primary source material we all study, often end up being studied in ways that the authors never intended or even imagined. It’s good to give as much context about the text as is feasible to help others make use of the TEI document in the future.

Page 40:

Controlled vocabularies, thesauri and authority files

A controlled vocabulary is a standard set of keywords designed to cover a particular area of study.

A thesaurus or authority file is a controlled vocabulary containing synonyms pointing to the ‘authorised’ form that you should use. Some thesauri even contain a hierarchy of terms.
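In code, an authority file behaves like a lookup from any variant to its authorised form. The sketch below uses a tiny vocabulary invented for illustration; it is not drawn from any published authority file, and the names AUTHORITY_FILE and authorised_form are assumptions.

```python
# A tiny, invented authority file: each authorised form lists its variants.
AUTHORITY_FILE = {
    "Dublin": ["Baile Átha Cliath", "Dublin City"],
    "Cork": ["Corcaigh", "Cork City"],
}

# Invert it so that every variant (and the authorised form itself)
# points to the authorised form, ignoring differences in letter case.
VARIANT_TO_AUTHORISED = {
    variant.casefold(): authorised
    for authorised, variants in AUTHORITY_FILE.items()
    for variant in [authorised, *variants]
}


def authorised_form(term):
    """Return the authorised form for a term, or None if it is unknown."""
    return VARIANT_TO_AUTHORISED.get(term.strip().casefold())


print(authorised_form("baile átha cliath"))
print(authorised_form("Cork City"))
print(authorised_form("Galway"))
```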

Page 41:

Controlled vocabularies, thesauri and authority files

Some controlled vocabularies are built into the TEI (like codes for languages). Others are given in the TEI as suggestions (like Library of Congress Subject Headings).

If you use the authorised forms of names, you can disambiguate people with similar names, and your users will be able to search your materials together with other materials.

There are lots of controlled vocabularies out there. Don’t ‘reinvent the wheel’!

Page 42:

Some examples

• Library of Congress Authorities:
– subject headings (LCSH)
– names of authors, editors, etc.
– titles of well-known literary works
• Getty Thesaurus of Geographic Names
• Placenames Database of Ireland
• Northern Ireland Place-Name Project
• Dictionary of Irish Biography

Page 43:

Questions?

