Mdst3705 2013-02-19-text-into-data

Text into Data

Prof Alvarado

MDST 3705

19 February 2013

Business
• Quiz 1 graded
– Let me know if you have questions
• Readings
– Apologies for mis-posting!

Review
• Last week, we took a 30,000 foot view of the use of databases in the digital humanities
– We found that databases are everywhere
• Databases form the foundation of all projects
– Even if a database management system is not used
• Relational databases are sophisticated and mature choices for foundations

Overview
• We began this course by looking at code as language
– Code structured like natural language
– Code implies, models, and creates a world
• We then looked at the opposite process – looking at language, and the products of culture, as code
– We called this “reverse engineering”
• Today we continue this and look specifically at text

What do you remember when you read a book?

We remember scenes, images, plot lines, values, etc.

We sometimes remember verbatim passages

We don’t normally remember the words

We get much of our culture through books (and other "cultural models" in Colby's words)

Like cigarettes, books are a delivery mechanism

(not of nicotine, but of culture)

Colby's theory

CULTURE

If texts contain cultural meanings . . .

How do we get to them?

How do we represent them?

Models of Text

Competing Approaches
• A common approach to model text is to use XML
– XML is like HTML, but more general
– It allows you to mark up a text
• XML assumes a text is like a tree
– An “ordered hierarchy of content objects”
• XML was also specifically designed to work with text

XML looks like this

Notice how the element names reference units, not layout or style
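
A minimal sketch of what such markup might look like, parsed with Python's standard library. The element names (poem, stanza, line) and the sample text are invented for illustration and are not taken from the original slide. The point is only that the tags name units of the text and that those units nest into a tree.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: the tags name units of the text (poem, stanza, line),
# not how the text should look on the page.
SAMPLE = """
<poem>
  <stanza n="1">
    <line n="1">First line of the first stanza</line>
    <line n="2">Second line of the first stanza</line>
  </stanza>
  <stanza n="2">
    <line n="3">First line of the second stanza</line>
  </stanza>
</poem>
"""

root = ET.fromstring(SAMPLE.strip())


def outline(element, depth=0):
    """Return one indented row per element, making the tree shape visible."""
    label = element.tag
    if "n" in element.attrib:
        label += " " + element.attrib["n"]
    rows = ["  " * depth + label]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


tree_outline = outline(root)
print("\n".join(tree_outline))
```

Every line sits inside exactly one stanza, and every stanza inside the poem. That strict nesting is what the "ordered hierarchy of content objects" view assumes.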

Text as Tree

XML turns out to be very useful for defining the physical or logical structure of a text, but not for figures and meanings

Texts are actually more like networks

This image shows three "figures" in the text of an Old French poem. Note how they do not "nest" neatly into the structure of the text, but instead cross-cut it.

It is hard to model this kind of data with XML.

Relational databases are a better choice for this since they are more abstract

The problem is, what data model to use?

How do you model text in a relational database?

Liu and Smith argue for a radical model, in which text is parsed at the word level

Each word gets its own record

The Princeton Charrette Project used a database-driven application called Figura

It was designed to represent the critical edition of an Old French poem along with the figural annotations of the text made by scholars

A “figure” is a figure of speech or rhetorical device, like rhyming or the use of chiasmus

The database stored information about grammar, manuscript images, figures, and other data that had been accumulated over the years prior to building the database

At the heart of the database is the text model that links figures to text
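
One way such a text model could be sketched is with three tables: one record per word, one record per figure, and a link table joining them. This is an assumption made for illustration, not the actual Figura schema; the table names, column names, and placeholder tokens (w1 to w6) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    word_id  INTEGER PRIMARY KEY,
    line_no  INTEGER NOT NULL,   -- line of the poem the word belongs to
    position INTEGER NOT NULL,   -- position of the word within that line
    form     TEXT NOT NULL
);
CREATE TABLE figure (
    figure_id   INTEGER PRIMARY KEY,
    figure_type TEXT NOT NULL
);
CREATE TABLE figure_word (       -- many-to-many link between figures and words
    figure_id INTEGER REFERENCES figure(figure_id),
    word_id   INTEGER REFERENCES word(word_id),
    PRIMARY KEY (figure_id, word_id)
);
""")

# Placeholder tokens standing in for two lines of verse.
lines = {1: ["w1", "w2", "w3"], 2: ["w4", "w5", "w6"]}
word_id = 0
for line_no, tokens in lines.items():
    for position, form in enumerate(tokens, start=1):
        word_id += 1
        conn.execute("INSERT INTO word VALUES (?, ?, ?, ?)",
                     (word_id, line_no, position, form))

conn.executemany("INSERT INTO figure VALUES (?, ?)",
                 [(1, "chiasmus"), (2, "rhyme")])
# Figure 1 runs across the line break (words 2 to 5);
# figure 2 links the two line endings (words 3 and 6).
conn.executemany("INSERT INTO figure_word VALUES (?, ?)",
                 [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 6)])

# How many lines does each figure touch?
spans = conn.execute("""
    SELECT f.figure_type, COUNT(DISTINCT w.line_no)
    FROM figure f
    JOIN figure_word fw ON fw.figure_id = f.figure_id
    JOIN word w ON w.word_id = fw.word_id
    GROUP BY f.figure_id
    ORDER BY f.figure_id
""").fetchall()

# Which words belong to more than one figure?
shared = conn.execute("""
    SELECT w.form, COUNT(*)
    FROM figure_word fw
    JOIN word w ON w.word_id = fw.word_id
    GROUP BY w.word_id
    HAVING COUNT(*) > 1
""").fetchall()
```

Both figures touch two lines, and one word belongs to both figures at once. A link table records this overlap directly, whereas a single XML hierarchy has no natural place for it.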

In my model and in Liu & Smith’s, the text becomes a database

The readable text is just a query

As is the index, table of contents, etc.
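
A minimal sketch of this idea, assuming a hypothetical word table with one row per word (the table and column names are invented here, not taken from either model). The running text and the index are produced by two different queries over the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,
        line_no  INTEGER,
        position INTEGER,
        form     TEXT
    )
""")

# Break a tiny sample text into one record per word.
text = ["the readable text", "is just a query"]
rows = [(line_no, position, form)
        for line_no, line in enumerate(text, start=1)
        for position, form in enumerate(line.split(), start=1)]
conn.executemany(
    "INSERT INTO word (line_no, position, form) VALUES (?, ?, ?)", rows)


def readable_text(conn):
    """Rebuild the running text by reading the words back in order."""
    lines = {}
    query = "SELECT line_no, form FROM word ORDER BY line_no, position"
    for line_no, form in conn.execute(query):
        lines.setdefault(line_no, []).append(form)
    return [" ".join(forms) for _, forms in sorted(lines.items())]


def index(conn):
    """Build an index mapping each word form to the lines where it occurs."""
    entries = {}
    query = "SELECT DISTINCT form, line_no FROM word ORDER BY form, line_no"
    for form, line_no in conn.execute(query):
        entries.setdefault(form, []).append(line_no)
    return entries


print(readable_text(conn))
print(index(conn))
```

Neither the text nor the index is stored as such. Each is a view generated on demand from the word records.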

The database of words and figures can be read by a program to generate a visually rich and interactive edition on the web

But it can also be used to discover patterns in the text not visible to the reader

It can help us discover the cultural patterns that are “delivered” by the text to our brains

The results of a query showing the relationship between proper nouns (agents) and figure types
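
A query of this kind might look roughly like the sketch below. The schema, the part-of-speech column, and the toy rows (AgentA, AgentB, and so on) are all invented for illustration; they are not the project's data or results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT, pos TEXT);
CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);
""")

# Toy data: placeholder agents and tokens, tagged with a part of speech.
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    (1, "AgentA", "proper_noun"),
    (2, "w2", "other"),
    (3, "AgentB", "proper_noun"),
    (4, "AgentA", "proper_noun"),
])
conn.executemany("INSERT INTO figure VALUES (?, ?)",
                 [(1, "chiasmus"), (2, "rhyme"), (3, "chiasmus")])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3), (3, 4), (2, 4)])

# Count how often each proper noun takes part in each type of figure.
agent_by_figure = conn.execute("""
    SELECT w.form, f.figure_type, COUNT(*)
    FROM word w
    JOIN figure_word fw ON fw.word_id = w.word_id
    JOIN figure f ON f.figure_id = fw.figure_id
    WHERE w.pos = 'proper_noun'
    GROUP BY w.form, f.figure_type
    ORDER BY w.form, f.figure_type
""").fetchall()

for agent, figure_type, count in agent_by_figure:
    print(agent, figure_type, count)
```

The counts come from joining the word and figure records, so the pattern emerges from the annotations as a whole and not from any single passage.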

A structural reading of the data

Form and content are interwoven, each reinforcing the other

Form – the delivery system – is used to transmit the meaningful content, the stuff that remains in your brain after reading or hearing the story

This is a "hypergraph" of the same data, also easily generated from the database by code
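
In a hypergraph, a single edge can connect any number of nodes, which suits figures that tie several words together at once. A minimal sketch of building one from link rows follows; the figure and word labels are invented, and the pairs stand in for what a join over a figure-to-word link table might return.

```python
from collections import defaultdict

# Invented (figure, word) pairs, as a database join might return them.
links = [
    ("chiasmus-1", "AgentA"),
    ("chiasmus-1", "w2"),
    ("rhyme-1", "AgentB"),
    ("rhyme-1", "AgentA"),
    ("rhyme-1", "w5"),
]

hyperedges = defaultdict(set)  # figure -> every word it connects
incidence = defaultdict(set)   # word -> every figure it takes part in
for figure, word in links:
    hyperedges[figure].add(word)
    incidence[word].add(figure)

# Words that sit at the intersection of more than one figure.
hubs = sorted(word for word, figures in incidence.items() if len(figures) > 1)
print(hubs)
```

Each figure becomes one hyperedge, and words shared between figures are the points where hyperedges overlap.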

Text is like this

http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps

A text is a signal

Culture is a transmitter
