Post on 11-Jan-2016
Digital Reformatting of Text
Aaron Choate
Digital Library Production Services
The University of Texas Libraries
From last time:
Calculating potential file size (no really… this time we got it!)
file size in bytes = (height x width x bit-depth x dpi²) / 8
(divide by 8 because there are 8 bits per byte)
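The file-size formula can be tried out with a few lines of Python. This is only a sketch: the function name and the sample page (8.5 x 11 inches) are assumptions made for illustration, dimensions are taken to be in inches, and the result is the uncompressed size.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed size in bytes: (height x width x bit-depth x dpi^2) / 8."""
    # height x width x dpi^2 gives the pixel count, bit_depth the bits per pixel,
    # and dividing by 8 converts bits to bytes.
    return height_in * width_in * bit_depth * dpi ** 2 / 8


# Hypothetical letter-size page scanned three different ways.
for label, bits, dpi in [("bitonal, 600 dpi", 1, 600),
                         ("8-bit grayscale, 300 dpi", 8, 300),
                         ("24-bit color, 300 dpi", 24, 300)]:
    size = file_size_bytes(11, 8.5, bits, dpi)
    print(f"{label}: {size / 1_000_000:.1f} million bytes")
```

Note how the dpi term is squared: doubling the resolution quadruples the file size, while doubling the bit depth only doubles it.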
imaging: Benchmarking
Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.
Physical Type, size and presentation
Physical condition
• Darkening pages
• Fading ink
• Stains
• Bleed-through
• Uneven printing
• Fold lines
• Smearing
Document classification
• Simple text / printed line art
  • Distinct-edge-based representation: bitonal?
• Manuscripts
  • Soft-edge-based representation: grayscale / color
• Mixed material
Medium and support
• Support – (paper, clay tablet, etc.)
  • Thin paper? (bleed-through)
• Medium – (graphite pencil, inks, etc.)
  • Fading of ink
  • Variations in color or density
Tonal Representation
Color Appearance
• Is color reproduction necessary to the document’s meaning?
• What purpose does the color serve?
• How important is maintaining the color appearance?
Detail
• Printed text
  • Measure the height of the smallest lowercase letter that typifies the item or group of items.
• Manuscripts, line art
  • Measure the finest stroke width that must be represented and characterize the needed level of quality.
QI (Quality Index)
• Defining detail as character height
• ANSI/AIIM preservation microfilming standard for determining requirements for text legibility
• Defines a range from barely legible through excellent that maps to technical test targets
Line pairs
Excellent = 8 line pairs
Good = 5 line pairs
Marginal = 3.6 line pairs
Barely legible = 3.0 line pairs
Digital QI
(h = character height in mm; .039 converts mm to inches)
Bitonal (only black pixels)
QI = (dpi x .039h)/3
h = 3QI/(.039 x dpi)
dpi = 3QI/(.039h)
Tonal images (grayscale for printed text)
QI = (dpi x .039h)/2
h = 2QI/(.039 x dpi)
dpi = 2QI/(.039h)
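The digital QI formulas can be worked through in Python. This is a sketch under assumptions: the function names are made up for illustration, h is taken to be the character height in millimeters, and the legibility labels treat the values from the line-pairs slide (8, 5, 3.6, 3.0) as lower thresholds, which is one reading of that slide rather than something it states.

```python
MM_TO_INCH = 0.039  # conversion factor used in the QI formulas


def quality_index(dpi, h_mm, bitonal=True):
    """QI = (dpi x .039h) / 3 for bitonal, / 2 for tonal (grayscale) images."""
    divisor = 3 if bitonal else 2
    return dpi * MM_TO_INCH * h_mm / divisor


def required_dpi(qi, h_mm, bitonal=True):
    """Rearranged formula: dpi = 3QI / (.039h) for bitonal, 2QI / (.039h) for tonal."""
    divisor = 3 if bitonal else 2
    return divisor * qi / (MM_TO_INCH * h_mm)


def legibility(qi):
    """Map a QI value to a label, assuming the listed values are minimums."""
    if qi >= 8:
        return "excellent"
    if qi >= 5:
        return "good"
    if qi >= 3.6:
        return "marginal"
    if qi >= 3.0:
        return "barely legible"
    return "below barely legible"


# Hypothetical document whose smallest lowercase letter is 1 mm high.
qi_600 = quality_index(600, 1.0, bitonal=True)
print(f"600 dpi bitonal: QI = {qi_600:.1f} ({legibility(qi_600)})")
print(f"dpi for QI 8, bitonal: {required_dpi(8, 1.0, bitonal=True):.0f}")
print(f"dpi for QI 8, grayscale: {required_dpi(8, 1.0, bitonal=False):.0f}")
```

Under these assumptions, 1 mm text needs roughly 615 dpi bitonal but only about 410 dpi in grayscale to reach the same QI, which is why the tonal formula divides by 2 instead of 3.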
Text Capture
Methods
• Rekeying
• OCR
Accuracy …
Software
• Scansoft – Omnipage Pro
• Abbyy – Fine Reader
• Adobe Acrobat …
• PrimeOCR – Prime Recognition
Encoding
XML vs SGML
SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages.
XML is a subset of SGML intended to be the format for use on the Internet.
XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused because of this (table structures for layout, clear 1-pixel GIFs, etc.).
xml: DTDs vs Schemas
xml: TEI
Text Encoding Initiative
• Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.
Levels of encoding
• Level 1: Fully Automated Conversion and Encoding
• Level 2: Minimal Encoding
• Level 3: Simple Analysis
• Level 4: Basic Content Analysis
• Level 5: Scholarly Encoding Projects
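To make the idea of structural encoding concrete, here is a small Python sketch that parses a TEI-style document and pulls out its paragraphs. The sample text, the choice of elements, and the helper name are assumptions made for illustration; the sample is not claimed to match any particular level above, and the parser only checks well-formedness, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace used by TEI P5 documents

# Hypothetical, hand-written sample: a header plus light structural markup.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter (hypothetical)</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="letter">
        <p>First paragraph of the transcription.</p>
        <p>Second paragraph of the transcription.</p>
      </div>
    </body>
  </text>
</TEI>"""


def body_paragraphs(xml_string):
    """Return the text of every <p> inside <text>/<body>."""
    # fromstring raises ET.ParseError if the input is not well-formed XML.
    root = ET.fromstring(xml_string)
    ns = {"tei": TEI_NS}
    return [p.text for p in root.findall("./tei:text/tei:body//tei:p", ns)]


for number, paragraph in enumerate(body_paragraphs(SAMPLE), start=1):
    print(number, paragraph)
```

Because the paragraphs are marked up rather than merely laid out, a program can separate the transcription from the header without guessing, which is the practical payoff of even light encoding.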
Character sets
Unicode –
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
character sets: Unicode
Greek & Coptic
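The "unique number for every character" idea can be seen directly in Python. This is a minimal sketch: the helper name is an assumption, and the range U+0370 to U+03FF is the Greek and Coptic block as laid out in the Unicode code charts.

```python
import unicodedata

# Code point range of the Greek and Coptic block (U+0370 through U+03FF).
GREEK_COPTIC = range(0x0370, 0x0400)


def describe(ch):
    """Return (code point label, Unicode name, whether it is in Greek and Coptic)."""
    cp = ord(ch)  # the character's unique number
    name = unicodedata.name(ch, "<unnamed>")
    return (f"U+{cp:04X}", name, cp in GREEK_COPTIC)


# Greek small alpha, Greek capital omega, and Latin capital A for contrast.
for ch in "\u03b1\u03a9A":
    print(ch, describe(ch), ch.encode("utf-8"))
```

The code point is the same on every platform; only the byte encoding (UTF-8 here) is a storage choice layered on top of it.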
Software
XMetal, Oxygen, Cooktop
Software
MetaE