+PDF/A: A Preservation Format
Mid-Atlantic Regional Archives Conference, 21 October 2011
Geof [email protected]
+File Format Confusion
From 5,000 to 15,000 extant file formats
Most are proprietary
The numbers add complexity to preservation
Real preservation formats are few in number
And we can really count on none of them
+Two General Classes of Formats
Proprietary: controlled by one company; underlying code is a trade secret; if the company goes under, the file format becomes obsolete
Open: controlled by a standards body, a consortium, or a wiki-like body; code is free and open to all; in the absence of an “owner,” the code can still be used to make a reader
Neither guarantees preservation, but open formats give you an opening to preservation
+Proprietary Formats
Tend to be rich in features
Limited readers for each format
Limited ability to exchange data
Difficult for long-term accessibility
Greater associated costs
+Advantages of Open Formats
More choice in what application to use
Better exchange of data
Better support of long-term preservation
Possible lower costs
Ability to create own readers
+Format/Software Confusion
Software: creates a file in the format; reads the file for you; allows you to interact with the file
Format: is the specific technical form in which a certain file exists; can be created by one software product or many
Examples: Adobe Acrobat (and many others) vs. PDF; Microsoft Word vs. .doc (and .docx, etc.)
+Criteria for Preservation Formats (and Files)
Ubiquitous
Long-lived
Documented
Metadata-supporting
Accurate
Open
Uncompressed
Unencrypted
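The criteria above can be treated as a simple checklist when comparing candidate formats. The following is a minimal sketch only: the class name, field names, and the example assessment are hypothetical, and the true/false judgments are ones an archivist would have to supply.

```python
from dataclasses import dataclass, fields


@dataclass
class FormatAssessment:
    """One yes/no judgment per criterion on the slide (field names are illustrative)."""
    ubiquitous: bool
    long_lived: bool
    documented: bool
    metadata_supporting: bool
    accurate: bool
    open_standard: bool
    uncompressed: bool
    unencrypted: bool


def unmet_criteria(assessment: FormatAssessment) -> list:
    """Return the names of the criteria the format fails, in slide order."""
    return [f.name for f in fields(assessment) if not getattr(assessment, f.name)]


# A made-up assessment of a hypothetical proprietary, encrypted format.
example = FormatAssessment(True, True, True, True, True, False, True, False)
print(unmet_criteria(example))
```

A format with an empty list meets every criterion; in practice few formats will, so the list is better read as a prompt for discussion than as a pass/fail score.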
+When to Use a Preservation Format
Creation: begin with a format you know will last; if you do, choose a format that allows the file to be modified
Recordation: when information becomes a record, save it in a chosen format; this freezes the file and demonstrates it is a record
Archiving: convert records needed long-term to persistent formats; the conversion preserves the records and marks them as permanent
Early action can save money and time
+Normalization (action at the point of archiving)
Conversion to a format
Not expected to change
Not expected to disappear
Not expected to become unreadable
Usually conversion to a different format from original
Generally how preservation formats are used
Still, may cause data loss or corruption
+Options for Preservation of Text
American Standard Code for Information Interchange (ASCII)
Unicode
Portable Document Format / Archive (PDF/A)
Extensible Markup Language (XML)
Open Document Format (ODF) (ISO/IEC 26300:2006)
Office Open XML (OOXML) (ISO/IEC 29500:2008)
+What is Portable Document Format?
Originally developed by Adobe in 1991
Specifications made available for free in 2001
Format made an open international standard in 2008
Includes text and image features
+Advantages of PDF
Has accessibility across platforms
Saves look and searchability of original
Embeds fonts (if desired)
Allows copying of text from files
Remains fairly stable and universal
Is difficult to modify
Has enhanced document security
Supports authenticity
+Disadvantages of PDF
Won’t always perfectly represent original
Some files are more difficult to convert
Some formatting may be lost if saved back to original file format
Limited ability to modify
A complex format saving image and text
Tends to be larger than a word processing document
+PDF’s Advantage over Others
Image and text in one bundle
Intelligent text
Accepts importance of format to meaning
Ubiquity of format and readers
+Conversion Practices
Have necessary fonts installed
Ensure lossless compression
Important for embedded images
When converting PDF to PDF/A
Eliminate prohibited features
Check beforehand or fix during
+Flavors of the PDF Standard
PDF (vanilla)
PDF/A (for archival preservation)
PDF/X (for publishing)
PDF/E (for engineering drawings)
PDF/VT (for variable data and transactional printing)
PDF/UA (for accessibility—in development)
PDF/H (for healthcare records—a guide, not a standard)
GeoPDF (for geospatial records—only based on standards)
+Portable Document Format / Archive Standards
PDF/A-1
ISO Standard 19005-1:2005
Based on PDF Reference 1.4 (Acrobat 5)
PDF/A-2
ISO Standard 19005-2:2011
Based on PDF Reference 1.7
Published 20 June 2011
New versions of PDF/A expected
+Uses of PDF/A
Standard textual documents
Paper documents
Word-processing and PDF documents
Sequences of related digital images
Documents where appearance matters
Static documents
+Less Appropriate for PDF/A
Webpages
Databases
Spreadsheets
Dynamic documents
+Creating PDF/As
Need a product that can produce one
Like Adobe Acrobat 8 Professional
Can convert documents individually
Opening and converting one at a time
Can use batch processing
Converting multiple documents at once
Supported by Acrobat 8
+General Goals of PDF/A
Specifies limited stable set of features
To ensure long-term validity
Eliminate features that are not “archival”
An open preservation standard
Format designed to be a preservation standard
+Required in PDF/A
All fonts embedded
Unlimited legal use of embedded fonts
Device-independent color
Metadata describing the file
File must self-identify the PDF/A version
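The self-identification requirement can be illustrated with a short sketch. A PDF/A file states its version in its XMP metadata through the PDF/A identification schema (properties pdfaid:part and pdfaid:conformance). The code below assumes the XMP packet is stored as readable, uncompressed XML inside the file, and it reports only what the file claims; it does not validate conformance. The function name and the file name in the usage note are hypothetical.

```python
import re

# XMP namespace of the PDF/A identification schema.
PDFA_ID_NS = b"http://www.aiim.org/pdfa/ns/id/"

# The properties may appear as XML elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); \x22 is the double-quote character.
PART_RE = re.compile(rb"pdfaid:part\s*(?:=\s*[\x22']|>)\s*(\d+)")
LEVEL_RE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\x22']|>)\s*([A-Za-z])")


def read_pdfa_claim(data: bytes):
    """Return (part, conformance level) claimed in the XMP packet, or None."""
    if PDFA_ID_NS not in data:
        return None
    part = PART_RE.search(data)
    if part is None:
        return None
    level = LEVEL_RE.search(data)
    level_text = level.group(1).decode("ascii").upper() if level else None
    return int(part.group(1)), level_text
```

Usage might look like `read_pdfa_claim(open("example.pdf", "rb").read())`, which for a file claiming PDF/A-1b would return `(1, "B")`. A claim is not proof: a validation tool is still needed to confirm the file actually conforms.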
+Excluded from PDF/A-1
Audio and video content
JavaScript and executable files
Encryption
LZW and JPEG 2000 image compression
Reference to outside content
Transparency
Embedded files
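A rough pre-conversion check for some of these exclusions can be sketched as a scan for PDF name tokens. This is a heuristic under stated assumptions, not a validator: the token list is illustrative and incomplete, a token can appear in harmless contexts, and features such as transparency or references to outside content need a real PDF parser to detect. The function and dictionary names are hypothetical.

```python
# PDF name tokens that commonly signal features excluded from PDF/A-1.
# Illustrative and incomplete; a hit means "look closer", not "invalid".
WARNING_TOKENS = {
    b"/JavaScript": "JavaScript",
    b"/Encrypt": "encryption",
    b"/LZWDecode": "LZW compression",
    b"/JPXDecode": "JPEG 2000 compression",
    b"/EmbeddedFiles": "embedded files",
    b"/Sound": "audio content",
    b"/Movie": "video content",
}


def flag_excluded_features(data: bytes) -> list:
    """Return a sorted list of excluded features whose tokens occur in the raw bytes."""
    return sorted({label for token, label in WARNING_TOKENS.items() if token in data})


# Example on a made-up fragment of PDF syntax.
fragment = b"<< /Filter /LZWDecode >> trailer << /Encrypt 5 0 R >>"
print(flag_excluded_features(fragment))
```

Files that come back with an empty list can still fail validation, so a scan like this is best used to sort a batch into "likely easy" and "needs attention" before running a proper conversion tool.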
+Differences in PDF/A-2
Allows embedding of OpenType fonts
Allows JPEG 2000 image compression
Supports transparent objects
Supports layers, which can be hidden for viewing
Defines use of digital signatures, with rules via PDF Advanced Electronic Signatures (PAdES)
Specifies requirements for custom XMP metadata
Allows embedded files, but in only one context: in a PDF/A-2 you can embed PDF/A files, which allows creation of sets of documents in a single file (e.g., emails)
All PDF/A-1s are compliant with the PDF/A-2 standard, since PDF/A-2 is an extension of PDF/A-1
+PDF/A-1 Conformance Levels
PDF/A-1, Level A (full compliance)
Preserves document’s logical structure
Preserves text stream in reading order
Requires language specification
Requires Unicode mapping
PDF/A-1, Level B (minimal compliance)
Preserves visual appearance
Doesn’t require as much descriptive info
Less “accessible” format
+Flavors of PDF/A
PDF/A-1a (a = accessible): RGB color or CMYK color
PDF/A-1b (b = basic): same color choices
PDF/A-2a (extension of A-1a)
PDF/A-2b (extension of A-1b)
PDF/A-2u (u = Unicode): must use Unicode; does not require representation of logical structure
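The flavor labels follow a regular pattern (part number plus conformance letter), so they are easy to check in a script. The sketch below covers only the five flavors listed on this slide; the function and table names are hypothetical.

```python
import re

# Conformance levels available for each part, per the slide above.
ALLOWED_LEVELS = {1: {"a", "b"}, 2: {"a", "b", "u"}}


def parse_pdfa_flavor(label: str):
    """Split a label such as 'PDF/A-2u' into (part, level), rejecting unknown flavors."""
    match = re.fullmatch(r"PDF/A-(\d+)([a-z])", label.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PDF/A flavor label: {label!r}")
    part, level = int(match.group(1)), match.group(2).lower()
    if level not in ALLOWED_LEVELS.get(part, set()):
        raise ValueError(f"no such flavor on the slide: {label!r}")
    return part, level
```

For example, `parse_pdfa_flavor("PDF/A-2u")` returns `(2, "u")`, while `parse_pdfa_flavor("PDF/A-1u")` raises `ValueError` because the Unicode level was introduced with PDF/A-2.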
+PDF/A Product Lines
Adobe Acrobat (www.adobe.com)
Apago (www.apagoinc.com)
Callas (www.callassoftware.com)
Compart (www.compart.net)
PDFlib (www.pdflib.com)
PDF Tools AG (www.pdf-tools.com)
+PDF/A Validation Tools
Adobe Acrobat Preflight Function (www.adobe.com)
Callas Software pdfaPilot (www.callassoftware.com)
PDF Tools AG's 3-Heights PDF Validator (www.pdf-tools.com)
+Formats are Not Everything
Preservation programs require work:
Conversion procedures
Quality control
Version control
Environmental controls
Metadata creation and maintenance (metadata about the records and their information; metadata about your preservation actions)
Data management controls (backups, etc.)
Ensuring that chosen normalized formats are still valid
Vigilance
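One small piece of this work, recording metadata about a preservation action and later checking that the file is unchanged, can be sketched with a checksum. This is an illustration under assumptions, not a prescribed procedure: the record fields, function names, and the choice of SHA-256 are all hypothetical choices for the example.

```python
import hashlib
from datetime import datetime, timezone


def preservation_record(data: bytes, action: str, when=None) -> dict:
    """Describe one preservation action and capture a checksum of the resulting file."""
    when = when or datetime.now(timezone.utc)
    return {
        "action": action,                           # e.g. "normalized to PDF/A-1b"
        "timestamp": when.isoformat(),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def still_intact(data: bytes, record: dict) -> bool:
    """True if the file's current checksum matches the one recorded earlier."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


# Example with stand-in bytes rather than a real file.
record = preservation_record(b"stand-in file contents", "normalized to PDF/A-1b")
print(still_intact(b"stand-in file contents", record))
```

Stored alongside the descriptive metadata, records like this give later staff both a history of what was done to a file and a way to confirm, after a backup restore or migration, that the bytes are the same ones that were archived.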