2012 eHumanities Amsterdam - Descartes text conversion: Lessons learned

Date post: 26-Jun-2015
Upload: dirk-roorda
Description:
The arduous process of producing a digital text of Descartes' letters, including mathematical formulas. It was a subtask of the CKCC project at the Huygens Institute. Lessons learned. With Erik-Jan Bos, Utrecht.
Letters from Descartes in digital format An exercise in conversion Dirk Roorda @ eHumanities 2012-01-26
Transcript
Page 1: 2012 eHumanities Amsterdam - Descartes text conversion: Lessons learned

Letters from Descartes in digital format

An exercise in conversion

Dirk Roorda @ eHumanities 2012-01-26

overview

- the task
- the method
- the lessons
- the result
  ◦ demo

The task: converting from ... JapAM

Descartes Correspondence

ca. 700 letters

69,237 lines

600 formulas

4.2 MB (without the 311 pictures)

The task: converting to ...

CKCC corpus Descartes

XML: Text Encoding Initiative (TEI)

~ 35,000 elements, of which

- 7,200 metadata
- 7,700 paragraphs
- 6,200 formulas
- 6,000 text-formattings
- 4,200 structure
- 2,900 page-breaks
- 538 images

The (re)Sources

EJB

Metadata

Google Books

EJB's head

The method

- observation
- non-algorithmic changes
- consolidation
- proofs

Observation

use digital equipment:

- your text-editor
- your scripting language
- your regular expressions

observation: italic scopes

replace =(.*?)$

by <italic>match1</italic>

???

Aargh!#@\€]
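
The frustration on this slide can be reproduced in a few lines. The following is a minimal sketch in Python (the talk's own converter, convert.pl, is a Perl script), assuming that "=" marks the start of an italic run in the source; the sample sentence is invented. The rule wraps everything from the marker to the end of the line, so an italic scope that ends mid-line gets too much text.

```python
import re

# Assumption: "=" opens an italic run in the source text.
# The rule below mirrors the slide: capture from "=" to the end of the line.
ITALIC_RULE = re.compile(r"=(.*?)$", flags=re.MULTILINE)


def naive_italics(text: str) -> str:
    """Wrap everything from '=' to the end of the line in <italic> tags."""
    return ITALIC_RULE.sub(r"<italic>\1</italic>", text)


# Invented sample: only the title should be italic, but nothing marks its end.
sample = "He cites the =Dioptrique and then continues in roman type."
converted = naive_italics(sample)
print(converted)
```

The whole remainder of the line ends up inside the italic element, which suggests why the end of an italic scope could not be found by pattern matching alone.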

observation: greek

non-algorithmic changes

closers: hints

consolidating: metadata

[Diagram: conversion process. Labels: ... formulas, meta, closers ...; canonical, initial, corrected, improved, checked; metadata combining.]

merging meta

proofs: formulas

proofs: formulas in gif

quick formula checking

The anatomy of conversion

convert.pl

100 KB of program code text = 25 densely typed pages = 3427 lines

of which

2175 real code lines

Code/Input = 1/32
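
As a quick check of the ratio, here is a small calculation in Python. It assumes that "Input" means the 69,237 lines of source text from the task slide and that "Code" means the 2175 real code lines; the slide does not spell out which figures were paired.

```python
# Figures taken from the slides; pairing them this way is an assumption.
input_lines = 69_237   # lines in the source text
total_lines = 3_427    # all lines in convert.pl
code_lines = 2_175     # "real" code lines
pages = 25             # densely typed pages

ratio = input_lines / code_lines
print(f"input lines per real code line: {ratio:.1f}")
print(f"share of real code lines: {code_lines / total_lines:.0%}")
print(f"lines per page: {total_lines / pages:.0f}")
```

Under that reading, one line of code handles roughly 32 lines of input, which matches the 1/32 on the slide.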

Statistics

1/3 of the tasks need 2/3 of the code

- formulas: (2) 37 %
- headers, openers, closers: (3) 16 %
- meta and images: (3) 11 %

run time of same tasks

- formulas: (2) 29 %
- headers, openers, closers: (3) 6 %
- meta and images: (3) 10 %

total run time (25) 40 sec
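
The headline claim can be checked by adding up the listed shares. This sketch assumes that the numbers in parentheses are subtask counts and that 25 is the total number of subtasks; the slide does not label them.

```python
# (subtasks, % of code, % of run time) per task group, as listed on the slide.
# Reading the parenthesized numbers as subtask counts is an assumption.
groups = {
    "formulas": (2, 37, 29),
    "headers, openers, closers": (3, 16, 6),
    "meta and images": (3, 11, 10),
}
TOTAL_SUBTASKS = 25

subtasks = sum(g[0] for g in groups.values())
code_share = sum(g[1] for g in groups.values())
time_share = sum(g[2] for g in groups.values())

print(f"{subtasks}/{TOTAL_SUBTASKS} subtasks = {subtasks / TOTAL_SUBTASKS:.0%} of the tasks")
print(f"{code_share} % of the code, {time_share} % of the run time")
```

Under that reading, 8 of 25 subtasks (about a third) account for 64 % of the code (about two thirds) and 45 % of the run time.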

The tricks of conversion

1. Unicode is your friend
2. Split into many subtasks
3. task = configuration + workflow
4. Count and check
5. Performance matters
6. Do not give up automation

1. Unicode is your friend

2. Split into many subtasks

(2a) that can be run separately

(2b) that can be reordered easily
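
One way to get both properties is to register each subtask under a name and pass the run order in as a plain list. The sketch below is in Python rather than the Perl of convert.pl, and the two subtasks, their names, and the "[pb]" marker are all invented for illustration.

```python
from typing import Callable, Dict, List

Subtask = Callable[[str], str]


# Hypothetical subtasks; names and behaviour are invented for illustration.
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def mark_page_breaks(text: str) -> str:
    """Turn the invented '[pb]' marker into an empty <pb/> element."""
    return text.replace("[pb]", "<pb/>")


SUBTASKS: Dict[str, Subtask] = {
    "whitespace": normalize_whitespace,
    "pagebreaks": mark_page_breaks,
}


def run(text: str, order: List[str]) -> str:
    """Run the named subtasks in the given order."""
    for name in order:
        text = SUBTASKS[name](text)
    return text


raw = "first  page [pb]  second page"
only_breaks = run(raw, ["pagebreaks"])         # (2a) one subtask on its own
full = run(raw, ["whitespace", "pagebreaks"])  # (2b) the order is just a list
print(only_breaks)
print(full)
```

Because every subtask maps text to text, any one of them can be run in isolation while debugging, and reordering means editing a list rather than restructuring the program.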

3. task = config + workflow
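
One possible reading of this slogan is that a task is a piece of data (the configuration) fed to a shared piece of code (the workflow). The Python sketch below illustrates that reading only; the task names, rules, and the "[pb 12]" marker are invented and are not taken from convert.pl.

```python
import re
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (regular expression, replacement)

# Configuration: data only. These rules are invented for illustration.
CONFIG: Dict[str, List[Rule]] = {
    "pagebreaks": [(r"\[pb (\d+)\]", r'<pb n="\1"/>')],
    "ampersands": [(r"&", "&amp;")],
}


def workflow(text: str, rules: List[Rule]) -> str:
    """Generic workflow: apply each (pattern, replacement) rule in turn."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def run_task(name: str, text: str) -> str:
    """A task is a named configuration run through the shared workflow."""
    return workflow(text, CONFIG[name])


print(run_task("pagebreaks", "end of page [pb 12] next page"))
print(run_task("ampersands", "Descartes & Mersenne"))
```

With this split, adding a task means adding an entry to the configuration, while a fix to the workflow benefits every task at once.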

4. Count and check (ad nauseam)
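
A minimal sketch of what counting and checking could look like, in Python: tally the elements in the output and compare the tallies with what is expected. The sample output and the expected numbers are invented; a real check would use counts known from the source, such as the number of formulas or page breaks.

```python
import re
from collections import Counter
from typing import Dict, List


def count_tags(xml_text: str) -> Counter:
    """Count opening and self-closing tags by element name."""
    return Counter(re.findall(r"<([A-Za-z][\w.-]*)", xml_text))


def check(counts: Counter, expected: Dict[str, int]) -> List[str]:
    """Return one message for every element whose count is not as expected."""
    return [
        f"{name}: expected {want}, found {counts[name]}"
        for name, want in expected.items()
        if counts[name] != want
    ]


# Invented sample output and expectations.
output = "<p>text <hi>italic</hi></p><pb/><p>more <formula>x</formula></p>"
counts = count_tags(output)
problems = check(counts, {"p": 2, "pb": 1, "formula": 2})
print(counts)
print(problems)
```

Running such a check after every conversion run makes a regression visible as soon as a count drifts from its expected value.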

5. Performance matters!

- was 30+ seconds
- is now 2.07 seconds
- many new subtasks based on same template (gain = 15 * 30 sec = 7.5 min per run)
- many, many runs before everything is OK (gain = 100 * 7.5 min = 12.5 hours CPU-time)
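
The arithmetic on this slide mixes seconds, minutes, and hours, so here it is spelled out in Python. The variable names are mine; the figures are the slide's. The slide rounds by ignoring the 2.07 seconds that each subtask still takes, so the last line also shows the net gain per run.

```python
SECONDS_BEFORE = 30        # "was 30+ seconds" per subtask
SECONDS_AFTER = 2.07       # "is now 2.07 seconds"
SUBTASKS_ON_TEMPLATE = 15  # subtasks built on the same template
RUNS = 100                 # runs before everything is OK

# The slide's rounded figures: seconds -> minutes -> hours.
gain_per_run_min = SUBTASKS_ON_TEMPLATE * SECONDS_BEFORE / 60
total_gain_hours = RUNS * gain_per_run_min / 60

# Net gain per run when the remaining 2.07 seconds are subtracted.
net_gain_per_run_min = SUBTASKS_ON_TEMPLATE * (SECONDS_BEFORE - SECONDS_AFTER) / 60

print(f"gain per run: {gain_per_run_min} min")
print(f"gain over {RUNS} runs: {total_gain_hours} hours")
print(f"net gain per run: {net_gain_per_run_min:.1f} min")
```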

6. Do not give up automation

we used a lot of expert knowledge, which has all been transferred to

- the source
- consolidated extra inputs

so the conversion is still repeatable and modifiable

[Diagram: conversion program. Top row of labels: source, formulas, meta, closers, results. Bottom row: corrections, hints, hints, hints, CKCC.]

Thank You

