UVA MDST 3073 Texts and Models-2012-09-11

Post on 17-Nov-2014


Lecture 4: Texts and Models

Prof. Alvarado

MDST 3703/7703

11 September 2012

Review

• Posting “Hello, World!”
– Put file in the public_html directory of your UVA Home Directory
– Create a post and insert a link to this file
– Categorize as: 09.06: (S) HTML
• If you cannot get to your home directory, try uploading to http://homedir.virginia.edu

Some Quick Corrections

• Digital text is not necessary
– It’s an open question (i.e. do we have to have it?)
• Nelson did not conceive of “trails,” Bush did
• HTML is not the “first big idea” in the liberal arts; hypertext is (according to me)
• The idea that “text shapes knowledge” is not ancient, but relatively new
– Media determinism is a 20th century perspective
– Although Plato notes the effects of literacy in the Phaedrus
• Not everything can be translated into HTML
– i.e. HTML is not the richest framework for digital representation

Your Questions and Observations

• Is commercialization killing creativity?
– What is the relationship between how the web is organized economically and how it shapes expression? EFFECT OF SOCIAL ORGANIZATION
• What happens if the associations that someone makes are ‘off’ and illogical to others?
– Does it loosen the way logical connections can be made and argued? EFFECT ON LOGIC

Your Questions and Observations

• Computers in general still heavily rely on a hierarchical structure
– To what extent has rationalization occurred with the invention of hypertext?
• Do things lose value and meaning in exchange for digital coding?
– What is the effect of digitization on value?
• Hypertexts and links online can be distracting
– Non-linear thinking or mindless surfing?

Your Questions and Observations

• People are trying to create the same exact classroom experience online that exists in the physical classroom, which is impossible
– We need to rethink and restructure the online learning experience as a new and unique learning experience
• How can we keep hypertext from altering us too much?
• The beauty and the risk of an open source web

Practical Questions

• How can an HTML webpage on your own computer be found by the search bar but not be on the web?
– Your browser lives on your machine
– The protocol name tells it where to look
• I wondered if the picture from my computer would still show up if I opened the page from another computer?
• It is interesting to see how one little thing out of place can ruin the entire code
– Computers are stupid in that way
• Why should coders learn HTML?
– HTML is an interface language that can be easily generated from print statements in your code
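
The last point can be made concrete with a minimal Python sketch. The function name render_list_page and the sample items are hypothetical, chosen only for illustration; the idea is simply that a program can assemble HTML out of ordinary strings and print it.

```python
from html import escape


def render_list_page(title, items):
    """Build a small HTML page as a string from ordinary Python data."""
    lines = ["<html>", "<head>", f"<title>{escape(title)}</title>", "</head>", "<body>"]
    lines.append(f"<h1>{escape(title)}</h1>")
    lines.append("<ul>")
    for item in items:
        # escape() turns &, < and > into entities so the data cannot break the markup.
        lines.append(f"<li>{escape(item)}</li>")
    lines.append("</ul>")
    lines.extend(["</body>", "</html>"])
    return "\n".join(lines)


# Hypothetical sample data.
page = render_list_page("Hello, World!", ["Bush & trails", "Nelson <hypertext>"])
print(page)
```

Saving the printed output as a .html file in public_html would give a page that a browser can open like any hand-written one.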

What is HTML?

• HTML is not a programming language
– Programming languages express IF … THEN logic
– But it is code that obeys a syntax & gets interpreted
– And it is produced and consumed by programs
• HTML is a very general interface language
• HTML is written in XML, which we discuss today
– Technically called “XHTML”
– The original version was written in SGML

In general, don’t conflate HTML with hypertext or with digital representation in general

HTML is a language that generates a species of hypertext which is, in turn, a species of digital representation

A provisional taxonomy

Is hypertext new?

[Study Bible]

1 = Mishna, the first major transcription of the oral law
2 = Gemara, analytical discussions
3 = Rashi, glossary
4 = Tosefos, additions
5 = Hananel, comments
6 = Eye of Justice, legal decisions
8 = Light of the Bible, references to Biblical quotations
9 = Bach's Annotations
10 = Gra's Annotations

[Talmud]

[Charrette]

[The Waste Land]

[Critical Edition]

[OED]

These are all examples of traditional texts

They exhibit “latent hypertext”

Landow

• The concept of hypertext parallels poststructuralist views of text
– Barthes, Foucault, Derrida, Kristeva, et al.
• In this view, a text is not, and has never been, a bounded, closed thing
– it is a network of signifiers that connect meanings across time and space …

Digital humanists have been concerned with encoding historical texts since at least 1949

Father Busa

• Creator of the Index Thomisticus
• Saw the computer as a solution to indexing the works of Aquinas in 1949
– 13,000,000 words
– “in” took 4 years
• Solution:
– Lemmatization
– Variations tagged as instances of a type

The complete works of Aquinas will be typed onto punch cards; the machines will then work through the words and produce a systematic index of every word St. Thomas used, together with the number of times it appears, where it appears, and the six words immediately preceding and following each appearance (to give the context). This will take the machines 8,125 hours; the same job would be likely to take one man a lifetime.

Time Magazine, 1956, “Religion: Sacred: Electronics”
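
The index described in the quotation (every word, its count, its locations, and the words around it) can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of Busa's procedure: the LEMMAS table, the function name build_index, and the toy Latin word sequence are all assumptions made for the example, and the context window is shortened from six words to two so the output stays readable.

```python
from collections import defaultdict

# Hypothetical lemma table: each variant form is tagged as an instance of a type.
LEMMAS = {"est": "sum", "sunt": "sum", "esse": "sum"}


def build_index(words, context=6):
    """Map each lemma to a list of (position, preceding words, following words)."""
    index = defaultdict(list)
    for i, word in enumerate(words):
        lemma = LEMMAS.get(word, word)  # unknown forms are indexed as themselves
        before = words[max(0, i - context):i]
        after = words[i + 1:i + 1 + context]
        index[lemma].append((i, before, after))
    return index


# Toy word sequence for illustration only, not checked against any edition.
text = "veritas est adaequatio rei et intellectus et omnia entia sunt vera".split()
index = build_index(text, context=2)

for position, before, after in index["sum"]:
    print(position, " ".join(before), "[" + text[position] + "]", " ".join(after))
print("occurrences of 'et':", len(index["et"]))
```

Both "est" and "sunt" are filed under the single type "sum", which is the point of lemmatization: the index counts the word, not its surface variations.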

So, what is text?

Let’s look at some material examples

page o’ text

Real world text comes packaged in documents

How is text conveyed in a document?

A document is a material artifact

What is text?

Visual Signifiers

• Small caps
• Indentation
• Alignment
• Italics
• Space

All used to signify elements of text

Documents have three Levels: Content, Structure, Style

• Content
– TEXT, images, video clips, etc.
• Structure
– The organization of content into units (elements) and logical relationships (e.g. reading order)
• Style
– Screen and print layout
– Fonts, colors, etc.

Descriptive markup languages allow us to define structure of documents for computational purposes

Theoretically, they do not specify layout or content

[PDF, Procedural Markup]

In contrast to procedural markup like PDF

So, how are docs structured?

Hierarchically …

(theoretically)

Document Elements and Structures

Play
  – Act +
    • Scene +
      – Line +

Book
  – Chapter +
    • Verse +

Letter
  – Heading
    • Return Address
    • Date
    • Recipient Info
      – Name
      – Title
      – Address
  – Content
    • Salutation
    • Paragraph +
    • Closing

These are all “trees”

XML is a markup language

What is XML?

• Stands for eXtensible Markup Language
– Actually invented after the web
– A simplification of SGML, the language used to create HTML
– It specifies a set of rules for creating specialized markup languages such as HTML and TEI
• It is a simplified version of SGML
– Standard Generalized Markup Language
• SGML was invented in the early 1970s to wrest the control of documents from computer people who were taking over industries like law and accounting

XML looks like this

Notice how the element names reference units, not layout or style

Also markup for “in-line” elements
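
Since the slide images are not reproduced here, the following is a stand-in sketch rather than the original example. The element names (poem, stanza, line, name, emph) are hypothetical and do not come from any real schema. The sample is held in a Python string and read with the standard library's ElementTree parser to show that the names refer to units of the text and that in-line elements sit inside running text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names are illustrative, not from a real schema.
SAMPLE = """<poem>
  <title>An Example</title>
  <stanza>
    <line>The first line mentions <name>Ovid</name> in passing.</line>
    <line>The second line has <emph>stress</emph> on one word.</line>
  </stanza>
</poem>"""

root = ET.fromstring(SAMPLE)

# Element names describe units of the text (poem, stanza, line), not fonts or layout.
tags = [element.tag for element in root.iter()]
print(tags)

# In-line elements sit inside the running text of a line.
first_line = root.find("stanza/line")
print(repr(first_line.text), repr(first_line[0].text), repr(first_line[0].tail))
```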

XML Premises

1. All documents are composed of elements.
2. Elements contain content.
3. Elements have no layout.
4. Elements are hierarchically ordered.
5. Elements are to be indicated by “markup” – tags that define the beginning and end of an element

XML Markup Rules

• Tags signify structural elements
• Three kinds of tag
– Start and End, e.g. <p> and </p>
– Singleton, e.g. <br />
• Start and singleton tags can have attributes
– Simple key/value pairs
– <div class="stanza" style="color:red;">
• Basic rules
– All attributes must be quoted
– All tags must nest (no overlaps!)

Documents in XML that meet these rules are “well formed”
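
One rough way to test these rules is to hand a string to an XML parser and see whether it objects. The sketch below uses Python's standard ElementTree parser; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML parser accepts the string, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Properly nested tags, a singleton tag, and a quoted attribute.
print(is_well_formed('<div class="stanza"><p>Nested properly</p><br /></div>'))
# Overlapping tags: <b> is opened inside <p> but closed outside it.
print(is_well_formed('<p>Tags <b>overlap</p></b>'))
# Attribute value without quotes.
print(is_well_formed('<div class=stanza>Unquoted attribute</div>'))
```

Only the first string passes; the other two each break one of the basic rules.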

XML also provides Document Types

• A Document Type Definition (DTD) defines a set of tags and rules for using them
– Specifies elements, attributes, and possible combinations
– E.g. in HTML, the ol and ul elements must contain li elements
• A DTD is just one kind of schema system used by XML
• Schemas express data models of/for texts
– TEI is a powerful way of describing primary source materials for scholars
• Documents that use a schema properly are called “valid”
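
The difference between “well formed” and “valid” can be sketched with a toy check. Python's standard library does not validate against DTDs, so the code below is not DTD validation: it hard-codes a single rule (lists may contain only li elements) in a hypothetical ALLOWED_CHILDREN table and reports any element that breaks it.

```python
import xml.etree.ElementTree as ET

# Toy content rule standing in for one DTD constraint: lists may contain only <li>.
ALLOWED_CHILDREN = {"ol": {"li"}, "ul": {"li"}}


def rule_violations(markup):
    """List (parent, child) tag pairs that break the toy rule above."""
    root = ET.fromstring(markup)  # raises ParseError if the markup is not well formed
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule recorded for this element
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems


good = "<body><ul><li>one</li><li>two</li></ul></body>"
bad = "<body><ol><p>not a list item</p></ol></body>"
print(rule_violations(good))
print(rule_violations(bad))
```

Both strings are well formed, but only the first obeys the rule; the second parses without error and still would not count as valid.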

Originally, DTDs defined “genres” like business letter or mortgage form

They were later used to define more abstract models of textual content

XML is used everywhere

• HTML
– E.g. Embed codes
• TEI (Text Encoding Initiative)
• RSS
• Civilization IV
• Playlists (e.g. XSPF or “spiff”)
• Google Maps (KML)

A Look Again at HTML

• aka XHTML
– And now becoming HTML5
• An instance of XML (formerly SGML)
• An interface language
• Language of the World Wide Web
• Defined by a DTD that prescribes a specific set of elements and relations

HTML Document Structure

• Head
  – Title
  – [Directives]
• Body
  – H1+
  – H2+
    • P+
    • UL
      – LI

Basic Elements with associated Tags

Element         Tags                                          Attributes
Paragraph       <p> ... </p>
Numbered List   <ol> <li> ... </li> </ol>
Bulleted List   <ul> <li> ... </li> </ul>
Table           <table> <tr> <td> ... </td> </tr> </table>
Anchor          <a> ... </a>                                  href, target
Image           <img />                                       src, border
Object          <object> ... </object>

The Text Encoding Initiative created TEI to mark up scholarly documents

Mainly primary sources such as books and manuscripts

TEI

• The dominant language used to encode scholarly text

• The current room was the location of UVa’s EText Center
– World famous for text encoding
– Now part of the library and catalog

• Scholars create their own schema to match what they are interested in

Examples

• The TEI Header
– http://tbe.kantl.be/TBE/examples/TBED02v00.htm
• TEI Prose
– http://tbe.kantl.be/TBE/examples/TBED03v00.htm
• Find others at the TEI By Example Project
– http://tbe.kantl.be/TBE/

XML contains an implicit theory of text

What is it?

OHCO

• XML (and therefore HTML and TEI) imply a certain theory of text
– A text is an OHCO
• OHCO
– Ordered Hierarchy of Content Objects
• An OHCO is a kind of tree
– Elements follow each other in sequences
– Elements can contain other elements

What are the advantages of this view?

OHCO allows for easy processing

• Every element has a precise address in the text
– E.g. HTML/body/p[1]
• Texts can be described in the language of kinship
– Ancestors, parents, siblings, children, etc.
• Texts can be restructured and manipulated by known patterns and algorithms
– Traversing
– Pruning
– Cross-referencing
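
These operations can be tried out on a tiny document. The sketch below uses Python's standard ElementTree module, which supports a limited subset of XPath; the sample document DOC is hypothetical, and the path is written relative to the root html element, so HTML/body/p[1] becomes body/p[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<html><body>
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second <em>paragraph</em>.</p>
</body></html>"""

root = ET.fromstring(DOC)

# Addressing: the first <p> under <body>, relative to the root <html> element.
first_p = root.find("body/p[1]")
print(first_p.text)

# Kinship: ElementTree keeps no parent pointers, so build a child-to-parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}
em = root.find(".//em")
print(parent_of[em].tag, parent_of[parent_of[em]].tag)

# Traversing: visit every element in document order.
order = [element.tag for element in root.iter()]
print(order)

# Pruning: remove the heading from the body.
body = root.find("body")
body.remove(body.find("h1"))
remaining = [child.tag for child in body]
print(remaining)
```

Each step relies only on the tree shape: an element's address, its relatives, and its place in the reading order all follow from the hierarchy.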

What are the disadvantages of OHCO?

Logical vs. Physical Structure

Two common structures that overlap

Pages and Paragraphs

Solution 1: Split Elements

<page n="2">
. . .
<p id="foo">His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife</p>
</page>
<page n="3">
<p id="bar" prev_id="foo"> a very superior character to anything deserved by his own.</p>
. . .
</page>

Solution 2: Use “Milestones”

<p>His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife <pb n="3" /> a very superior character to anything deserved by his own.</p>

One structure gets backgrounded
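
With milestones the page structure is backgrounded but not lost: it can be recovered by a program. The sketch below assumes a paragraph containing an empty pb element and uses Python's standard ElementTree module; the function name text_by_page is hypothetical, and the code handles only the flat case in which pb is a direct child of the paragraph.

```python
import xml.etree.ElementTree as ET

PARAGRAPH = (
    '<p>His good looks and his rank had one fair claim on his attachment, '
    'since to them he must have owed a wife <pb n="3" /> a very superior '
    'character to anything deserved by his own.</p>'
)


def text_by_page(paragraph, starting_page):
    """Split a paragraph's text at <pb/> milestones; return {page: text}."""
    pages = {starting_page: (paragraph.text or "").strip()}
    for child in paragraph:
        if child.tag == "pb":
            # Text that follows an empty element is stored in its .tail attribute.
            pages[child.get("n")] = (child.tail or "").strip()
    return pages


p = ET.fromstring(PARAGRAPH)
pages = text_by_page(p, starting_page="2")
print(pages["2"])
print(pages["3"])
```

The paragraph stays a single element in the tree, while the page break survives only as a marker that has to be interpreted after the fact.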

Wittgenstein’s Manuscripts

What about this?

[Charrette]

The problem of overlap suggests the need for a richer set of tools

What tools do McCarty and Unsworth reference?

Tables

A database for Ovid

McCarty

• A different use of markup
– From document description to interpretation
– Creative “misuse”
• Reverse engineering a “grammar” of personification from a markup strategy
– Thickness = description (of text)
– Depth = explanation (of text by reference to grammar)
• Is forced to use tables in collaboration with markup

Thick description = Markup
Deep description = Tables

How to reconcile these tools?

A Proposed Model

• Texts are not documents
– Documents are media, Texts are messages
• Texts and documents are part of a system composed of “levels”
– They are effectively archaeology sites with stratigraphic layers
– Erasures are like cities building on top of each other
• Each level of the system is described by an appropriate set of tools
– Document structures: XML
– Textual structures, embedded ontologies: Tables
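
One way to picture the two levels working together is sketched below, under loud assumptions: the XML lines, their ids, the table name personification, and its columns are all invented for illustration and are not taken from McCarty's work or from any real edition. XML describes the document's units, and a table (here an in-memory SQLite database from Python's standard library) holds interpretive claims that point at those units by id.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Document level: XML records the units. The lines and ids are invented placeholders.
DOC = (
    '<poem>'
    '<line id="l1">An invented line about Envy</line>'
    '<line id="l2">An invented line about Sleep</line>'
    '</poem>'
)
root = ET.fromstring(DOC)

# Text level: a table records interpretive claims about those units.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE personification (line_id TEXT, figure TEXT, feature TEXT)")
db.executemany(
    "INSERT INTO personification VALUES (?, ?, ?)",
    [("l1", "Envy", "dwelling"), ("l2", "Sleep", "posture"), ("l2", "Sleep", "furniture")],
)

# Join the two levels: count annotations per line and show each line's text.
lines = {line.get("id"): line.text for line in root.iter("line")}
rows = db.execute(
    "SELECT line_id, COUNT(*) FROM personification GROUP BY line_id ORDER BY line_id"
).fetchall()
for line_id, count in rows:
    print(line_id, count, lines[line_id])
```

Because the table refers to the XML by id rather than living inside it, any number of overlapping or competing interpretations can be recorded without disturbing the document hierarchy.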

Basic Levels

• Document
– Physical objects (paper)
– Logical objects (defined by space, style, punctuation, etc.)
– Style and layout (also defined by space, color, etc.)
– Can have superimposed versions
• Text
– Sequences of characters
– Grammatical features
– Figures and poetic features
– Etc.