1eLegislature September 12, 2005
Discussion Points
Preservation of the records of the e-Legislature
Richard Marciano
Methodologies for Preservation & Access of Software-dependent Electronic Records
Exploring the ability to compare versions of records, retrieve changes, and run historical queries.
SALT
Research & develop prototypes that will lead to the creation of useful tools for archivists to preserve and provide access to electronic records over the long-term
While an RMA is capable of storing and providing access to electronic records, it cannot ensure that they remain accessible as software becomes obsolete.
Test and evaluate the best available means to implement a cost-effective application and architecture for preserving electronic records.
[Figure: timeline of projects, 1996 to 2007, color-coded by funding agency (DARPA, NARA, NHPRC, NSF, LoC). Projects shown: MDAS, DOCT, SRB, NSDL, NDIIPP w. CDL, PERM, ICAP, PAT, Archivist Workbench, Interpares, DigArch, e-Legislature, Chronopolis (?), Cassis (?). Annotations: integration of DB and archival storage (elimination of the notion of a file); large-scale persistent object computational support (a distributed document handling system).]
Newly Created Research Labs at SDSC
Storage Resource Broker – Arcot Rajasekar
Sustainable Archives & Library Technologies – Richard Marciano
“Born Digital” Content
DIGARCH / VanMAP / e-Legislature
“CASSIS”
The Opportunity…
Storage Resource Broker
The SDSC Storage Resource Broker is client-server middleware that virtualizes data space by providing a unified view of multiple heterogeneous storage resources over the network.
It is software that sits between users and resources and provides a storage service by managing users, file locations, storage resources, and metadata.
SRB Space
Storage Resource Broker Collections at SDSC (8/2/2005)
Collection | GBs of data stored | Number of files | Users with ACLs
Data Grid
NSF/ITR - National Virtual Observatory | 53,862 | 9,536,751 | 100
NSF - National Partnership for Advanced Computational Infrastructure | 36,149 | 7,539,180 | 380
Static collections - Hayden planetarium | 8,013 | 161,352 | 227
Pzone - public collections | 12,998 | 6,707,952 | 68
NSF/NPACI - Biology and Environmental collections | 40,155 | 76,083 | 67
NSF/NPACI - Joint Center for Structural Genomics | 15,731 | 1,577,260 | 55
NSF - TeraGrid, ENZO Cosmology simulations | 176,730 | 2,125,945 | 3,267
NIH - Biomedical Informatics Research Network | 10,561 | 7,596,888 | 303
Digital Library
NSF/NPACI - Long Term Ecological Reserve | 256 | 9,033 | 36
NSF/NPACI - Grid Portal | 2,620 | 53,048 | 460
NIH - Alliance for Cell Signaling microarray data | 741 | 84,594 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,733 | 1,083,998 | 27
NSF/ITR - Southern California Earthquake Center | 131,010 | 2,702,421 | 73
Persistent Archive
NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota) | 100 | 382,186 | 28
UCSD Libraries archive | 4,147 | 408,050 | 29
NARA - Research Prototype Persistent Archive | 1,478 | 893,434 | 58
NSF - National Science Digital Library persistent archive | 3,600 | 27,034,150 | 136
TOTAL | 501 TB | 68 million | 5,335
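The TOTAL row of the collections table can be cross-checked against the individual rows. The short sketch below simply sums the figures transcribed from the table; the conversion of 1 TB = 1000 GB is an assumption made here for the comparison, not something the slide states.

```python
# Rows transcribed from the collections table:
# (GB of data stored, number of files, users with ACLs)
rows = [
    (53_862, 9_536_751, 100),    # NSF/ITR - National Virtual Observatory
    (36_149, 7_539_180, 380),    # NSF - NPACI
    (8_013, 161_352, 227),       # Static collections - Hayden planetarium
    (12_998, 6_707_952, 68),     # Pzone - public collections
    (40_155, 76_083, 67),        # NSF/NPACI - Biology and Environmental
    (15_731, 1_577_260, 55),     # NSF/NPACI - Joint Center for Structural Genomics
    (176_730, 2_125_945, 3_267), # NSF - TeraGrid, ENZO Cosmology simulations
    (10_561, 7_596_888, 303),    # NIH - Biomedical Informatics Research Network
    (256, 9_033, 36),            # NSF/NPACI - Long Term Ecological Reserve
    (2_620, 53_048, 460),        # NSF/NPACI - Grid Portal
    (741, 84_594, 21),           # NIH - Alliance for Cell Signaling
    (2_733, 1_083_998, 27),      # NSF - NSDL SIO Explorer collection
    (131_010, 2_702_421, 73),    # NSF/ITR - Southern California Earthquake Center
    (100, 382_186, 28),          # NHPRC Persistent Archive Testbed
    (4_147, 408_050, 29),        # UCSD Libraries archive
    (1_478, 893_434, 58),        # NARA - Research Prototype Persistent Archive
    (3_600, 27_034_150, 136),    # NSF - NSDL persistent archive
]

total_gb = sum(r[0] for r in rows)
total_files = sum(r[1] for r in rows)
total_users = sum(r[2] for r in rows)

# Assumption: 1 TB = 1000 GB for the purpose of comparing with the TOTAL row.
print(f"{total_gb / 1000:.0f} TB, {total_files / 1e6:.0f} million files, {total_users:,} users")
```

The row sums agree with the stated totals of 501 TB, 68 million files, and 5,335 users.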
Senate Legislative Collection
from the 106th US Congress database
keeps track of Senate bills, resolutions, and amendments
raw format: 99 RTF (Rich Text Format) files on CD-ROM (provided by NARA)
one file per senator
Legislative Bio Collection: NARA, 106th Senate
Raw data: RTF (Senator 1, Senator 2, ..., Senator 99)

Paul S. Sarbanes of Maryland
January 06, 1999 to March 31, 2000
Section I: Sponsored measures
Section II: Cosponsored measures
Section III: Sponsored measures organized by committee referral
* Senate: Armed Services
* Senate: Banking
* House: Judiciary
Section IV: Cosponsored measures organized by committee referral
* Senate: Agriculture
* House: Science
Section V: Sponsored amendments
Section VI: Cosponsored amendments
Section VII: Subject index to measures and amendments

**** S. 151
Date Introduced: 01/19/1999
Cosponsors: NONE
Official title: A bill to amend the International Maritime Satellite Telecommunications Act…
Latest status: Jan 19, 1999 Read twice and referred to the Committee on Commerce
Abstract: NONE

Subject Index:
Academic Performance: S.7, S.514, S.564
Access to Health Care: S.6, S.1678, S.1690
…
Zoning and zoning law: S.9, S.Con.Res.10, S.Res.41, S.J.Res.39
Senate Legislative Collection
• What you see:
**** S. 345
DATE INTRODUCED: 02/03/1999
SPONSOR: Allard
OFFICIAL TITLE
A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.
LATEST STATUS
Feb 3, 1999 Read twice and referred to the Committee on Agriculture.
… may NOT be what you get (the format is not well documented):
^@^@y^K^@^@\206^K^@^@Ê^K^@^@Ô^K^@^@^@^L^@^@^N^L^@^@u^L^@^@\202^L^@^@È^L^@^@Ò^L^@^@ÿ\^L^@^@^M^M^@^@j^M^@^@w^M^@^@»^M^@^@Æ^M^@^@ô^M^@^@^B^N^@^@\203^N^@^@÷ëßÓëßǹ¹®¨®Â\Â\230Â\230 Â\230Â\230®¨®Â Â\230Â\230 Â\230Â\230 Â\230Â\230 Â\230Â\230 Â\230Â\
^N6^H\201OJ^C^@QJ^C^@]^H\201^@^N5^H\201OJ^C^@QJ^C^@\^H\201^@^K^B^H\201OJ^C^@QJ^C^@^\...
^ction sent to the House.^M^M**** S. 345^MDATE INTRODUCED: 02/03/1999^MSPONSOR: Alla\rd^MOFFICIAL TITLE^MA bill to amend the Animal Welfare Act to remove the limitation\that permits interstate movement of live birds, for the purpose of fighting, to St\
ates in which animal fighting is lawful.^MLATEST STATUS^MFeb 3, 1999 Read twice \and referred to the Committee on Agriculture.^M^M**** S. 387^MDATE INTRODUCED: 02/0\8/1999^MSPONSOR: McConnell^MOFFICIAL TITLE^MA bill to amend the Internal Revenue Co\d
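Once the binary debris is stripped away, the readable part of the raw stream is a sequence of carriage-return-delimited records, each starting with a `**** <bill number>` line. The following is a minimal sketch of how such records might be pulled out; the function name `parse_bills`, the dictionary keys, and the shortened sample title are illustrative assumptions, not the project's actual conversion code (which the slides attribute to Omnimark).

```python
import re

def parse_bills(raw):
    """Split a carriage-return-delimited text stream into bill records (dicts)."""
    # Normalise the ^M (carriage return) line endings seen in the raw dump.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    # Every record starts with a line of the form "**** S. 345";
    # whatever precedes the first marker is discarded.
    for chunk in re.split(r"^\*{4} ", text, flags=re.MULTILINE)[1:]:
        lines = [ln.strip() for ln in chunk.split("\n") if ln.strip()]
        if not lines:
            continue
        bill = {"name": lines[0]}
        i = 1
        while i < len(lines):
            line = lines[i]
            if line.startswith("DATE INTRODUCED:"):
                bill["date_introduced"] = line.split(":", 1)[1].strip()
            elif line.startswith("SPONSOR:"):
                bill["sponsor"] = line.split(":", 1)[1].strip()
            elif line in ("OFFICIAL TITLE", "LATEST STATUS") and i + 1 < len(lines):
                # These labels sit on their own line; the value is the next line.
                bill[line.lower().replace(" ", "_")] = lines[i + 1]
                i += 1
            i += 1
        records.append(bill)
    return records

# Sample modelled on the S. 345 record; the title is shortened for brevity.
sample = (
    "action sent to the House.\r\r**** S. 345\rDATE INTRODUCED: 02/03/1999\r"
    "SPONSOR: Allard\rOFFICIAL TITLE\rA bill to amend the Animal Welfare Act.\r"
    "LATEST STATUS\rFeb 3, 1999 Read twice and referred to the Committee on "
    "Agriculture.\r\r"
)
bills = parse_bills(sample)
print(bills[0]["name"], bills[0]["sponsor"])
```

A real converter would also have to remove the RTF control data before this step; that part is not sketched here.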
Senate Collection Example
… the XML can be lifted from the presentation level:
<p bold="off">**** S. 345</p>
<p align="right" bold="off">DATE INTRODUCED: 02/03/1999</p>
<p bold="off">SPONSOR: Allard</p>
<p align="center" bold="off" italic="off">OFFICIAL TITLE</p>
<p bold="off" italic="off">A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.</p>
<p align="center" bold="off" italic="off">LATEST STATUS</p>
<p><string>Feb 3, 1999&tab;Read twice and referred to the Committee on Agriculture.</string></p>
<p></p>
… to the information level:
<bill name="S.345">
  <committees>
    <committee>SENATE: AGRICULTURE</committee>
  </committees>
  <date_introduced>02/03/1999</date_introduced>
  <latest_status_list>
    <latest_status>
      <ls_date>Feb 3, 1999</ls_date>
      <ls_txt>Read twice and referred to the Committee on Agriculture</ls_txt>
    </latest_status>
  </latest_status_list>
  <official_title>A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.</official_title>
  <sponsor>Allard, Wayne [CO]</sponsor>
</bill>
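The lift from presentation level to information level can be sketched as follows. This is an illustration under stated assumptions, not the project's actual lift step: the function `lift` and the `FIELD_ORDER` list are hypothetical, the input is the text of each `<p>` paragraph with `&tab;` left as literal text, and the enrichment visible in the slide (the `<committees>` element and the expanded sponsor name "Allard, Wayne [CO]") is not reproduced because the paragraphs alone do not carry that information.

```python
import xml.etree.ElementTree as ET

# Child order used when emitting the <bill>; it follows the order of the example above.
FIELD_ORDER = ["date_introduced", "latest_status_list", "official_title", "sponsor"]

def lift(paragraphs):
    """Build an information-level <bill> element from presentation-level paragraph text."""
    fields = {}
    name = ""
    i = 0
    while i < len(paragraphs):
        text = paragraphs[i].strip()
        nxt = paragraphs[i + 1].strip() if i + 1 < len(paragraphs) else ""
        if text.startswith("****"):
            name = text.lstrip("* ").replace(" ", "")   # "**** S. 345" -> "S.345"
        elif text.startswith("DATE INTRODUCED:"):
            fields["date_introduced"] = text.split(":", 1)[1].strip()
        elif text.startswith("SPONSOR:"):
            fields["sponsor"] = text.split(":", 1)[1].strip()
        elif text == "OFFICIAL TITLE":
            fields["official_title"] = nxt
            i += 1                                      # value paragraph consumed
        elif text == "LATEST STATUS":
            fields["latest_status"] = nxt.partition("&tab;")
            i += 1
        i += 1

    bill = ET.Element("bill", {"name": name})
    for tag in FIELD_ORDER:
        if tag == "latest_status_list":
            if "latest_status" in fields:
                ls_date, _, ls_txt = fields["latest_status"]
                status = ET.SubElement(ET.SubElement(bill, tag), "latest_status")
                ET.SubElement(status, "ls_date").text = ls_date.strip()
                ET.SubElement(status, "ls_txt").text = ls_txt.strip()
        elif tag in fields:
            ET.SubElement(bill, tag).text = fields[tag]
    return bill

paragraphs = [
    "**** S. 345",
    "DATE INTRODUCED: 02/03/1999",
    "SPONSOR: Allard",
    "OFFICIAL TITLE",
    "A bill to amend the Animal Welfare Act ...",   # shortened for the example
    "LATEST STATUS",
    "Feb 3, 1999&tab;Read twice and referred to the Committee on Agriculture.",
]
bill = lift(paragraphs)
print(ET.tostring(bill, encoding="unicode"))
```

The point of the sketch is the change of vocabulary: layout attributes such as `bold` and `align` disappear, and the element names describe what the text means rather than how it looked.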
Ingestion Network: Y2K Example
[Figure: ingestion network diagram. File types: .rtf, .xml, .XML, .TM. Operations: convert (Omnimark), lift, consolidate, generate, archive, decompose. Stages: S0 to S6. Legend (stages): SIP, AIP, DIP.]
XML as an Archival Format
Information level “schema” as an XML DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT bills (bill*)>
<!ELEMENT bill (abstract?, committees?, congressional_record?, cosponsors?, date_introduced?,
                digest?, latest_status_list?, official_title?, sponsor?, statement_of_purpose?,
                submitted_by?, submitted_for?)>
<!ATTLIST bill name CDATA #REQUIRED>
<!ELEMENT committees (committee*)>
<!ELEMENT cosponsors (cosponsor*)>
<!ELEMENT digest (#PCDATA)>
<!ELEMENT latest_status_list (latest_status*)>
<!ELEMENT latest_status (ls_date, ls_txt)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT committee (#PCDATA)>
<!ELEMENT congressional_record (#PCDATA)>
<!ELEMENT cosponsor (co_name)>
<!ELEMENT co_name (#PCDATA)>
<!ATTLIST co_name a-date CDATA #IMPLIED>
<!ELEMENT date_introduced (#PCDATA)>
…
<!ELEMENT statement_of_purpose (#PCDATA)>
<!ELEMENT submitted_by (#PCDATA)>
<!ELEMENT submitted_for (#PCDATA)>
Check Demo
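The content model for `bill` in the DTD fixes both which child elements may appear and their order. The sketch below checks a `<bill>` element against that model using only the Python standard library. It is not a DTD validator: `check_bill` and `BILL_CHILD_ORDER` are hypothetical names, only the `bill` element is checked, and it assumes the attribute declaration is meant as a required `name` attribute on `bill`, as in the `<bill name="S.345">` example.

```python
import xml.etree.ElementTree as ET

# Optional children of <bill>, in the order given by the DTD content model.
BILL_CHILD_ORDER = [
    "abstract", "committees", "congressional_record", "cosponsors",
    "date_introduced", "digest", "latest_status_list", "official_title",
    "sponsor", "statement_of_purpose", "submitted_by", "submitted_for",
]

def check_bill(bill):
    """Return a list of problems; an empty list means the element fits the model."""
    problems = []
    if bill.tag != "bill":
        problems.append(f"expected <bill>, found <{bill.tag}>")
    if "name" not in bill.attrib:
        problems.append("missing required attribute 'name'")
    last = -1
    for child in bill:
        if child.tag not in BILL_CHILD_ORDER:
            problems.append(f"unexpected element <{child.tag}>")
            continue
        pos = BILL_CHILD_ORDER.index(child.tag)
        if pos <= last:
            # Each child is optional ('?'), so it may appear at most once and in order.
            problems.append(f"<{child.tag}> is repeated or out of order")
        last = max(last, pos)
    return problems

good = ET.fromstring(
    '<bill name="S.345"><committees/><date_introduced>02/03/1999</date_introduced>'
    '<sponsor>Allard, Wayne [CO]</sponsor></bill>'
)
bad = ET.fromstring('<bill><sponsor>x</sponsor><abstract/></bill>')
print(check_bill(good))
print(check_bill(bad))
```

For archival use, a validating parser that reads the DTD itself would be the more thorough choice; the sketch only illustrates what the content model constrains.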
PAT Project: Persistent Archives Testbed
Building Preservation Environments for State Archives
Testing proven technologies in new environments
Exploring the use of data grid technology to support preservation
Participants & Observers
STATES: • California • Kentucky • Michigan • Minnesota • Ohio
FEDERAL AGENCIES & FOREIGN COUNTRIES: • NHPRC & NARA • Stanford SLAC • Korea
CULTURAL HERITAGE (museums/libraries/archives): • Getty Research Institute
NEWSPAPERS: • Los Angeles Times
UNIVERSITIES: • Georgia Tech • University of California Los Angeles • University of California San Diego • University of Florida • University of Illinois Urbana Champaign • Yale
RESEARCH LABS: • San Diego Supercomputer Center
PAT Testbed
Explore advantages of national preservation infrastructure
• Shared use of preservation resources
• Shared evaluation of new technology
• Demonstration of generic infrastructure
• Shared development of archival procedures
• Expanded assessment of technology across diverse types of records
• Shared risk
Collaborative use of preservation environment by multiple states
PAT Project
Test a community model for electronic records management, with archival and technological functions in a distributed network (using the SRB: Storage Resource Broker – data grid technology)
Initial test sites: (1) Michigan Department of History, Arts and Libraries, (2) Ohio Historical Society, (3) Kentucky Department for Libraries and Archives, (4) Minnesota Historical Society, (5) SLAC Stanford Linear Accelerator Archives and History Office.
Participants: (a) California State Archives, (b) Kansas State Historical Society, (c) University of Illinois Urbana Champaign, (d) University of California Los Angeles (UCLA), (e) Yale Manuscripts and Archives, (f) Georgia Tech
Observers: (a) Getty
Shared Infrastructure
[Figure: shared infrastructure diagram. Local storage resources: Kentucky Grid Brick, Michigan Grid Brick, Minnesota Grid Brick, Ohio Grid Brick, SLAC Storage. Shared preservation environment: SDSC Archive, MCAT, metadata catalog (Oracle), archival storage (HPSS, Sam-QFS).]
Shared Development of Archival Processes
The processes being automated are: • appraisal • accessioning • arrangement • description • preservation • access
Archival Processes explored in PAT
Collections (table columns): Kentucky (Web), Michigan (RMA - Precinct Results DB), Minnesota (Spatial), Ohio (E-mail), SLAC (Documents)
Appraisal X
Accession X X X
Arrangement X X X X
Description X X X X X
Preservation X X X X
Access X X X X
National Archives and Records Administration - Research Prototype Persistent Archives
Federation of Four Independent Data Grids
[Figure: four federated data grids, each with its own MCAT catalog: NARA, SDSC, U Md, GTech.]
Powerful Platform for Collaborative Research
• Synchronization across zones
• Interoperability across diverse platforms
• Sufficient metadata to ensure complete and authentic records
• Mitigation of risk of data loss
• Replication of data
• Federation of catalogs
• Deep archive
Local Storage Resource
Grid Brick - commodity-based disk system
• Current cost is about $2,000 per terabyte
• 2.8 GHz CPU
• 1 GB of memory
• RAID controller
• Gigabit / 100 / 10 Ethernet
• 1-5 terabytes of disk
What is BIRN?
• Testbed for Biomedical Knowledge Infrastructure
• Biomedical Informatics Research Network
• Creation and support of federated bioscience databases
• Data integration
• Interoperable analysis tools
• Data mining software
• Scalable and extensible
• Complex access control (HIPAA requirements)
• Three fairly large testbeds interoperating with each other: Morphological, Functional, Mouse Brain
[Figure: virtual data grid (SRB) with an MCAT catalog, linking Duke, UCLA, NCMIR, CalTech, and SDSC.]
The BIRN Data Grid
Monitoring Grid Status
[Figure: grid status monitor showing a storage figure for each monitored resource, ranging from 0 TB to 5.2 TB.]
Logical Resources in BIRN
Data Grids
Data grids provide the ability to name, organize, and manage data on distributed storage resources
Federation provides a way to name, organize, and manage data on multiple data grids.
Data Grids
Distributed data sources
• Inter-realm authentication and authorization
Heterogeneity
• Storage repository abstraction
Scalability
• Differentiation between context and content management
Preservation
• Support for automated processing (migration, archival processes)
The SRB, an example of a Data Grid
The SRB is a distributed file system (a data grid) based on a client-server architecture.
It is also more than that: it provides a way to access files and computers based on their attributes rather than just their names or physical locations.
It replicates, syncs, archives, and connects heterogeneous resources in a logical manner using abstraction mechanisms.
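Access by attribute rather than by name or location can be pictured with a small in-memory stand-in for a metadata catalog. Everything in this sketch is hypothetical (the `CATALOG` entries, the paths, the resource names, and `find_by_attributes`); it is not the SRB or MCAT interface, only an illustration of the idea that a query names properties and gets logical names back.

```python
# Hypothetical catalog entries: a logical name, descriptive attributes,
# and the physical location, which the caller never needs to know.
CATALOG = [
    {
        "logical_name": "/senate106/bills/S.345.xml",
        "attributes": {"type": "bill", "sponsor": "Allard", "date_introduced": "02/03/1999"},
        "physical": "archive-a:/vault/0001",
    },
    {
        "logical_name": "/senate106/bills/S.151.xml",
        "attributes": {"type": "bill", "date_introduced": "01/19/1999"},
        "physical": "brick-b:/disk2/0417",
    },
]

def find_by_attributes(catalog, **wanted):
    """Return logical names of entries whose attributes match every requested value."""
    return [
        entry["logical_name"]
        for entry in catalog
        if all(entry["attributes"].get(key) == value for key, value in wanted.items())
    ]

by_sponsor = find_by_attributes(CATALOG, sponsor="Allard")
all_bills = find_by_attributes(CATALOG, type="bill")
print(by_sponsor)
print(all_bills)
```

Because callers hold only logical names, the physical location of a record can change without any query having to change.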
Data Grid Support for Preservation
Authenticity - the assurance that the records are what they purport to be
• Support for metadata necessary to maintain the provenance of the records
• Support for maintaining the essential characteristics of the records across transformations
Integrity - the assurance that the electronic records are not corrupted
• Support for integrity metadata (audit trails, access controls, checksums, replicas)
• Support for distributed environments (replication, federation)
Infrastructure Independence
• Manage electronic records independently of the choice of storage system
• Standard operations across databases and storage repositories
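Checksums are one of the integrity metadata items listed above. A minimal sketch of the idea, assuming SHA-256 as the digest (the slides do not name an algorithm) and using hypothetical names `checksum` and `corrupted_replicas`: a checksum is recorded at ingest, and each replica is later compared against it.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Hex digest used as the integrity fingerprint (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()

def corrupted_replicas(recorded: str, replicas: dict) -> list:
    """Return the names of replicas whose content no longer matches the recorded checksum."""
    return [name for name, data in replicas.items() if checksum(data) != recorded]

original = b"<bill name=\"S.345\">...</bill>"
recorded = checksum(original)          # stored as integrity metadata at ingest

replicas = {
    "local-disk": original,
    "remote-archive": original,
    "tape-copy": original[:-1] + b"?",   # simulated corruption of the last byte
}
damaged = corrupted_replicas(recorded, replicas)
print(damaged)
```

A mismatch identifies which copy to repair from an intact replica, which is why checksums and replication appear together in the list.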
Eliminate Infrastructure Dependence Upon Naming
Need persistent identifiers for
• Archivists
• Records
• Metadata
• Storage resources
• Access controls
Need automatic update of mapping from persistent identifiers to names used in storage systems
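The mapping from persistent identifiers to storage-system names, and its update when a record moves, can be sketched as a small registry. The class `NameRegistry`, its methods, and the identifier and path strings are all hypothetical, chosen only to illustrate the idea; they are not taken from the SRB.

```python
class NameRegistry:
    """Maps persistent identifiers to the names currently used in storage systems."""

    def __init__(self):
        self._locations = {}   # persistent id -> list of storage-system names

    def register(self, pid, storage_name):
        """Record that the object with this persistent id is stored under storage_name."""
        self._locations.setdefault(pid, []).append(storage_name)

    def migrate(self, pid, old_name, new_name):
        """Update the mapping when a copy moves; the persistent id itself never changes."""
        names = self._locations[pid]
        names[names.index(old_name)] = new_name

    def resolve(self, pid):
        """Return the current storage-system names for a persistent id."""
        return list(self._locations.get(pid, []))

registry = NameRegistry()
registry.register("record:S.345", "hpss:/archive/2005/0001.xml")
registry.register("record:S.345", "brick:/disk1/0001.xml")

# Storage is replaced: the copy on the old disk moves, citations of the record do not.
registry.migrate("record:S.345", "brick:/disk1/0001.xml", "newbrick:/vol7/0001.xml")
current = registry.resolve("record:S.345")
print(current)
```

In a working system the update in `migrate` would be triggered by the migration itself rather than by a separate manual call, which is the "automatic" part of the requirement.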
Storage Resource Broker 3.3.1
[Figure: SRB layered architecture diagram.
Application layer and access interfaces: C, C++ library; Java; DLL / Python, Perl, Windows; Unix shell; NT browser, Kepler actors; Linux I/O; HTTP, WSDL, (WSRF), GIS; OAI, DSpace, OpenDAP, GridFTP; ORB.
Middleware: federation management; consistency & metadata management / authorization, authentication, audit; logical name space; latency management; data transport; metadata transport.
Abstraction layers: storage repository abstraction; database abstraction.
Storage systems: archives (tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS); file systems (Unix, NT, Mac OSX); databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix).
Side labels: Infrastructure Independence; Preservation Processes (Accession to Access).]
Mitigating Risk of Data Loss
Replication - keeping two copies of the records
• Protect against media corruption
Data replication to remote site
• Protect against local operational error
• Protect against natural disaster
Data replication to another type of storage
• Protect against systemic vendor problem
Data federation - keeping two independent metadata catalogs
• Protect against malicious users
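The four mitigation strategies above can be read as a checklist over the set of replicas held for a record. The sketch below encodes that checklist; the `Replica` fields, the function `unmitigated_risks`, and the sample site and storage names are illustrative assumptions, and the risk labels are taken from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    site: str           # where the copy is held
    storage_type: str   # kind of storage system holding it
    catalog: str        # metadata catalog that registers it

def unmitigated_risks(replicas):
    """Return the risks from the list above that the given replicas do not yet cover."""
    risks = []
    if len(replicas) < 2:
        risks.append("media corruption")
    if len({r.site for r in replicas}) < 2:
        risks.append("local operational error or natural disaster")
    if len({r.storage_type for r in replicas}) < 2:
        risks.append("systemic vendor problem")
    if len({r.catalog for r in replicas}) < 2:
        risks.append("malicious users")
    return risks

# Two copies at one site on different storage types, registered in a single catalog.
copies = [
    Replica(site="site-1", storage_type="disk", catalog="catalog-A"),
    Replica(site="site-1", storage_type="tape", catalog="catalog-A"),
]
remaining = unmitigated_risks(copies)
print(remaining)
```

Adding a copy at a second site, registered in an independent catalog, would clear the remaining entries.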
Federation Between Data Grids
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection A
Access controls and consistency constraints on cross registration of digital entities
Summary
Working preservation environment
• Infrastructure independence
• Authenticity
• Integrity
Collaborative use of preservation environment by multiple states
• Shared technology assessment
• Shared infrastructure
• Shared development
• Shared risk