Preserving scientific data on our physical universe: a new strategy for archiving the nation's...

Preserving Scientific Data On Our Physical Universe A New Strategy for Archiving the Nation's Scientific Information Resources Steering Committee for the Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government Commission on Physical Sciences, Mathematics, and Applications National Research Council NATIONAL ACADEMY PRESS Washington, D.C. 1995
Page 1: Preserving scientific data on our physical universe: a new strategy for archiving the nation's scientific information resources

Preserving Scientific Data On Our Physical Universe

A New Strategy for Archiving the Nation's Scientific Information Resources

Steering Committee for the Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government

Commission on Physical Sciences, Mathematics, and Applications

National Research Council

NATIONAL ACADEMY PRESS
Washington, D.C. 1995

title: Preserving Scientific Data On Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources

author:

publisher: National Academies Press

isbn10 | asin: 030905186X

print isbn13: 9780309051866

ebook isbn13: 9780585022888

language: English

subject: Communication in science--Government policy--United States, Science--United States--Data processing, Technology--United States--Data processing, Information storage and retrieval systems--Science.

publication date: 1995

lcc: Q224.3.U6N37 1995eb

ddc: 353.00819

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competences and with regard for appropriate balance.

This report has been reviewed by a group other than the authors according to procedures approved by a Report Review Committee consisting of members of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. Robert M. White is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.

The National Research Council was established by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy's purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. Robert M. White are chairman and vice chairman, respectively, of the National Research Council.

Support for this project was provided by the National Archives and Records Administration (under Contract No. NAMA-S-92-0019), the National Oceanic and Atmospheric Administration (under Contract No. 50-DGNE-3-00105), and the National Aeronautics and Space Administration (under Contract No. S-54040-Z). The views expressed in this report are those of the authors and do not necessarily reflect the views of the sponsoring agencies or subagencies.

Library of Congress Catalog Card Number 94-68991
International Standard Book Number 0-309-05186-X

Additional copies of this report are available from:

National Academy Press
2101 Constitution Ave., NW
Box 285
Washington, DC 20055
800-624-6242
202-334-3313 (in the Washington Metropolitan Area)

B-499

Copyright 1995 by the National Academy of Sciences. All rights reserved.

Printed in the United States of America

Steering Committee For The Study On The Long-Term Retention Of Selected Scientific And Technical Records Of The Federal Government

JEFF DOZIER, University of California, Santa Barbara, Chair

SHELTON ALEXANDER, Pennsylvania State University

MARJORIE COURAIN, Consultant (deceased, January 14, 1994)

JOHN A. DUTTON, Pennsylvania State University

WILLIAM EMERY, University of Colorado

BRUCE GRITTON, Monterey Bay Aquarium Research Institute

ROY JENNE, National Center for Atmospheric Research

WILLIAM KURTH, University of Iowa

DAVID LIDE, Consultant, Gaithersburg, Maryland

B.K. RICHARD, TRW

JOAN WARNOW-BLEWETT, American Institute of Physics

National Research Council Staff

Paul F. Uhlir, Associate Executive Director, Commission on Physical Sciences, Mathematics, and Applications

Mark David Handel, Program Officer, Board on Atmospheric Sciences and Climate

Alice Killian, Research Associate, Commission on Geosciences, Environment, and Resources

James E. Mallory, Staff Officer, Computer Science and Telecommunications Board

Scott T. Weidman, Senior Program Officer, Board on Chemical Sciences and Technology

Julie M. Esanu, Research Assistant, Commission on Physical Sciences, Mathematics, and Applications

David J. Baskin, Project Assistant, Commission on Physical Sciences, Mathematics, and Applications

Commission On Physical Sciences, Mathematics, And Applications

RICHARD N. ZARE, Stanford University, Chair

RICHARD S. NICHOLSON, American Association for the Advancement of Science, Vice Chair

STEPHEN L. ADLER, Institute for Advanced Study

SYLVIA T. CEYER, Massachusetts Institute of Technology

SUSAN L. GRAHAM, University of California at Berkeley

ROBERT J. HERMANN, United Technologies Corporation

RHONDA J. HUGHES, Bryn Mawr College

SHIRLEY A. JACKSON, Department of Physics

KENNETH I. KELLERMANN, National Radio Astronomy Observatory

HANS MARK, University of Texas at Austin

THOMAS A. PRINCE, California Institute of Technology

JEROME SACKS, National Institute of Statistical Sciences

L.E. SCRIVEN, University of Minnesota

A. RICHARD SEEBASS III, University of Colorado

LEON T. SILVER, California Institute of Technology

CHARLES P. SLICHTER, University of Illinois at Urbana-Champaign

ALVIN W. TRIVELPIECE, Oak Ridge National Laboratory

SHMUEL WINOGRAD, IBM T.J. Watson Research Center

CHARLES A. ZRAKET, MITRE Corporation (retired)

NORMAN METZGER, Executive Director

PAUL F. UHLIR, Associate Executive Director

Preface

In January 1992 the National Archives and Records Administration (NARA) sponsored a three-day planning meeting at the National Research Council (NRC) to review the issues related to the long-term retention of the federal government's scientific and technical data in the physical sciences. The planning meeting was organized by the NRC's Commission on Physical Sciences, Mathematics, and Applications and provided the basis for this study, which was initiated in the fall of 1992 at the request of NARA. The National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA) subsequently provided additional support.

The study's steering committee, in consultation with the sponsors, developed the following charge to guide the writing of this report:

Describe the status and plans for the government's archiving of observational and experimental data in the physical sciences.

Identify the principal scientific, technical, information management, and institutional issues regarding the permanent archiving of such data.

Assess the commonalities and differences among the case studies provided by the panels organized under this study (see below) in order to determine the extent to which common long-term retention policies and appraisal guidelines can be applied to disciplines that collect observational and experimental data in the physical sciences.

Establish a set of goals, principles, and priorities, as well as generic retention criteria and appraisal guidelines that NARA can incorporate into its mission, program, and budget planning.

Suggest mechanisms and processes for NARA and NOAA to use in implementing a program of data appraisal, retention, and preservation, and later in evaluating the effectiveness of the program.

Provide a summary of findings, conclusions, and recommendations.

The steering committee formed five panels (in space sciences; atmospheric sciences; ocean sciences; geosciences; and physics, chemistry, and materials sciences) to provide their views on the key data retention issues from different disciplinary perspectives in the physical sciences. These panels each met twice and produced a set of working papers, which are published separately in Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government: Working Papers (National Academy Press, Washington, D.C., 1995). The work of the panels was invaluable to the steering committee in framing the issues, informing its conclusions and recommendations, and in producing its final report.

There are several aspects regarding the scope and focus of this report that should be mentioned. The committee devoted most of its attention to data stored on electronic media, rather than on paper or on other media. Almost all data are now acquired, stored, and distributed electronically. Thus, the preponderance of data archiving problems and their solutions must be considered in this context. Nevertheless, much of the advice offered here is equally relevant to data in other formats.

The principal focus of this report is on the long-term retention of data in the physical sciences. Much of the discussion, however, includes near-term data management issues, because effective archiving begins when the plans for acquiring a data set are made and extends throughout the life cycle of the data. Although the focus is exclusively on data in the physical sciences, the committee believes that the distinctions it has drawn between the experimental and the observational data, as well as the data management principles it has provided, are broadly applicable to most data in the other natural sciences. In addition, the strategic approach adopted by the committee necessarily involves all federal agencies that acquire and manage physical science data, and not simply the three agencies that sponsored this study.

Finally, it is necessary to point out that the committee was unable to achieve consensus on one major recommendation of the study, namely, the proposal to establish the National Scientific Information Resource (NSIR) Federation. Appendix B contains the minority opinion of the dissenting committee member, Roy Jenne. The rest of the committee members, who strongly support the NSIR Federation recommendation, are disappointed by this lack of unanimity and consider many of the assertions in the minority opinion to be based on an erroneous interpretation of what the report actually states or recommends. We leave that to the reader to judge. Nevertheless, we believe that the minority opinion can perhaps serve a useful purpose by drawing greater attention to these issues and by broadening the discussion of them among the sponsors of the study, the other science agencies, and the research community.

In conclusion, the committee hopes that its advice will help bring about the changes necessary to effectively preserve the valuable scientific data on our physical universe.

Jeff Dozier
Steering Committee Chair

Paul F. Uhlir
Study Director

Acknowledgments

The steering committee is very grateful to the many individuals who played a significant role in the completion of this study, including the members of the five ad hoc panels that provided conclusions and recommendations on data archiving from the different physical science disciplines; the individuals who briefed the steering committee and panels; and members of the National Research Council (NRC) staff who worked on various aspects of this study. The steering committee also extends its thanks to Trudy Peterson and Kenneth Thibodeau of the National Archives and Records Administration (NARA), William Turnbull and Helen Wood of the National Oceanic and Atmospheric Administration (NOAA), and Joseph King of the National Aeronautics and Space Administration (NASA), from the study's sponsoring agencies.

Gerd Rosenblatt, of Lawrence Berkeley Laboratory, chaired the Physics, Chemistry, and Materials Sciences Data Panel. The members were R. Stephen Berry, University of Chicago; Edward Galvin, The Aerospace Corporation; J.G. Kaufman, The Aluminum Association; Kirby Kemper, Florida State University; David R. Lide, Jr., consultant; and Edgar Westrum, Jr., University of Michigan. The steering committee gratefully acknowledges the detailed briefings and information provided to this panel by Donald Alderson, Department of Defense Nuclear Information Analysis Center; Frank Biggs, Sandia National Laboratories; Robert Billingsley, Defense Technical Information Center; Mark Conrad, NARA; Suzanne Leech, Bionetics, Inc.; Victoria McLane, Brookhaven National Laboratory; and Patricia Schuette, Battelle Pacific Northwest Laboratory.

The Space Sciences Data Panel was chaired by Christopher Russell of the University of California at Los Angeles. The panel members were Guiseppina Fabbiano, Harvard-Smithsonian Center for Astrophysics; Sarah Kadec, consultant; William Kurth, University of Iowa; Steven Lee, University of Colorado; and R. Stephen Saunders, Jet Propulsion Laboratory. The steering committee extends its thanks for the assistance of the following individuals, who provided briefings and other information to the Space Sciences Data Panel: Joe Allen, National Geophysical Data Center; Steven Blair, Los Alamos National Laboratory; Joseph Bredekamp, NASA; Dean Bundy, Naval Research Laboratory; David de Young, National Optical Astronomy Observatories; Robert Frederick, Air Force Space Forecast Center; Joseph King, National Space Science Data Center; Knox Long, Space Science Telescope Institute; Guenther Riegler, NASA Astrophysics Division; Thomas Smith and Jud Stailey, Air Force Environmental Technical Applications Center; Earl Tech, Los Alamos National Laboratory; Raymond Walker, University of California at Los Angeles; and James Willet, NASA Space Physics Division.

Werner Baum, of Florida State University, was the chair of the Atmospheric Sciences Data Panel. The members were Marjorie Courain, consultant (deceased, January 14, 1994); William Haggard, Climatological Consulting Corporation; Roy Jenne, National Center for Atmospheric Research; Kelly Redmond, Desert Research Institute; and Thomas Vonder Haar, Colorado State University. The steering committee gratefully acknowledges the diverse and substantial inputs provided by the following individuals to the Atmospheric Sciences Data Panel: Larry Baume, NARA; Thomas Boden, Carbon Dioxide Information and Analysis Center; Dean Bundy, Naval Research Laboratory; Donald Collins, NASA; Richard Davis, National Climatic Data Center; P.C. Hariharan, Johns Hopkins University; and Gerald Stokes, Pacific Northwest Laboratories.

The Ocean Sciences Data Panel was chaired by Bruce Gritton, Monterey Bay Aquarium Research Institute. The members were Richard Dugdale, University of Southern California; Thomas Duncan, University of California at Berkeley; Robert Evans, Rosenstiel School of Marine and Atmospheric Science; Terrence Joyce, Woods Hole Oceanographic Institution; and Victor Zlotnicki, Jet Propulsion Laboratory. The steering committee extends its thanks for the briefings and other information provided to the Ocean Sciences Data Panel by Larry Baume, NARA; Donald Collins and Susan Digby, Jet Propulsion Laboratory; Ronald Fauquet, NOAA; Ted Tsui, Naval Research Laboratory; and R.S. Winokur, Office of Naval Research.

The Geoscience Data Panel was chaired by Theodore Albert, a private consultant. The members were Shelton Alexander, Pennsylvania State University; Sara Graves, University of Alabama in Huntsville; David Landgrebe, Purdue University; and Soroosh Sorooshian, University of Arizona. The steering committee gratefully acknowledges the information provided at the meetings of the Geosciences Data Panel by the following individuals: Roger Barry, National Snow and Ice Data Center; Daniel Cavanaugh, U.S. Geological Survey; Donald Collins, Jet Propulsion Laboratory; Katrin Douglass, Southern California Earthquake Center Data Center; William Draegar, U.S. Geological Survey; John Dwyer, NARA; Claire Henson, National Snow and Ice Data Center; Herb Meyers, National Geophysical Data Center; Ron Weaver, National Snow and Ice Data Center; and Thomas Yorke, U.S. Geological Survey.

Finally, the steering committee is grateful to the staff of the National Research Council: Paul F. Uhlir, associate executive director of the Commission on Physical Sciences, Mathematics, and Applications, who served as study director; Mark David Handel and Theresa Fisher (Board on Atmospheric Sciences and Climate), Alice Killian (Commission on Geosciences, Environment, and Resources), James E. Mallory (Computer Science and Telecommunications Board), and Scott T. Weidman and Taña Spencer (Board on Chemical Sciences and Technology), who provided staff support for the five panels; Julie M. Esanu, for the program assistance provided to the steering committee and panels and for the preparation of the final manuscript; David Baskin, for his work on preparing the final manuscript; Liz Panos, for coordinating the report review; and Roseanne Price, who edited the final manuscript.

Contents

SUMMARY

1 INTRODUCTION

Imperatives for Preserving Data on Our Physical Universe

A New Future for Scientific Data

2 THE CHALLENGE: PRESERVATION AND USE OF SCIENTIFIC DATA

Experimental Laboratory Data

Observational Data in the Physical Sciences

Summary of Major Issues

3 RETENTION CRITERIA AND THE APPRAISAL PROCESS

Retention Criteria

Other Elements of the Appraisal Process

Recommendations

4 THE OPPORTUNITIES: THE RELATIONSHIP OF TECHNOLOGICAL ADVANCES TO NEW DATA USE AND RETENTION STRATEGIES

Enabling Technologies and Related Developments

Opportunities for New Organizational Structures

5 A NEW STRATEGY FOR ARCHIVING THE NATION'S SCIENTIFIC AND TECHNICAL DATA

Fundamental Principles for Long-term Data Retention

The Proposed National Scientific Information Resource Federation

Recommendations for the Creation of the NSIR Federation

Recommendations Specifically for NARA

Recommendations Specifically for NOAA

REFERENCES

APPENDIX A List of Acronyms

APPENDIX B Minority Opinion

This study is dedicated in fond memory of Marjorie Courain.

Summary

Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to develop concepts, theories, and models to make sense of the patterns they represent. The resulting abstractions are the formal and systematic ideas that constitute the understanding of relationships between causes and consequences, and perhaps may enable prediction of future sequences of events. Because scientists transform data from the material world into ideas, the observations of objects and processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific ideas.

There are strong motivations for preserving scientific observations:

Many observations about the natural world are a record of events that will never be repeated exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.

Observed data provide a baseline for determining rates of change and for computing the frequency of occurrence of unusual events. They specify the observed envelope of variability. The longer the record, the greater our confidence in the conclusions we draw from it.

A data record may have more than one life. As scientific ideas advance, new concepts may emerge in the same or entirely different disciplines from study of observations that led earlier to different kinds of insights. New computing technologies for storing and analyzing data enhance the possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus, the relative importance of data, both current and historical, can change dramatically, often in entirely unanticipated directions.

The substantial investments made to acquire data records justify their preservation. The cost of preservation will almost always be small in comparison with the cost of observation.

Because we cannot predict which data will yield the most scientific benefit in years ahead, the data we discard today may be the data that would have been invaluable tomorrow.

The assembled record of observational data thus has dual value: it is simultaneously a history of events in the natural world and a record of human accomplishment. The history of the physical world is an essential part of our accumulating knowledge, and the underlying data form a significant part of that heritage. They also portray a history of our scientific and technological development.

There are numerous socioeconomic reasons, in addition to the compelling scientific and historical motivations, for the long-term retention of observational, as well as certain types of experimental, data. For example, historical climate data have had well-documented uses in a broad range of applications in the manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance, and entertainment sectors. Such applications are common as well for other types of observational data on the Earth's environment. Experimental data in the physical sciences also have many industrial and other practical uses.

Today we can foresee the possibility of using the national resource of scientific data more advantageously than ever before as technological advances open new vistas for managing scientific information. Advances in data storage technologies make the long-term retention of virtually all data both feasible and affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII) enables nationwide sharing and application of data that reside in appropriately configured databases.

Our new power to store, distribute, and access data and information is changing the way we work and think. However, the communities involved in the creation, retention, and use of scientific data about the physical world are not optimally organized. They commonly work toward disparate goals, are not well connected, and do not take full advantage of technological and conceptual advances in data management and communication. An entirely new approach to the long-term preservation of scientific data is now both feasible and essential. It must take advantage of advancing technology and of distributed communications and management structures to empower both the creators and the users of such data.

This study, performed at the request of the National Archives and Records Administration (NARA), and partially supported by the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA), identifies the major issues regarding efforts to archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for those data, reviews important technological advances and related opportunities, and proposes a new strategy to help ensure access to the data by future generations.

The Challenge Of Effective Preservation And Use Of Scientific Data

The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary abstracting and indexing services provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits.

The current system, however, is not well suited to handle the scientific and technical electronic databases that are the focus of this study. The cost of maintaining these databases is typically too great to be covered by user fees; instead these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating the data resulting from their research and development. In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term raises questions in all cases, however.

A general problem prevalent among all scientific disciplines is the low priority attached to data management and preservation by most agencies. Experience indicates that new research projects tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater.

With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. Despite chronic underfunding, these programs have produced databases of lasting value to the nation, and the government investment in creating and maintaining these databases has been repaid many times over.

In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities are sometimes well-documented and maintained in readily accessible form; in many other cases, however, while the data are saved, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable.

The most important deficiencies are in the documentation, access, and long-term preservation of data in usable form. Insufficient documentation is a generic problem that affects, in varying degrees, all the classes of data addressed in this study. Furthermore, few of the federal data centers can give adequate attention to long-term archiving because they are stretched thin by current demands and inadequate resources. Even the data that are archived may become inaccessible because they are not regularly migrated to new storage media as the hardware and software used to access the data become obsolete or inoperable.

Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. In many cases the existence of the data is unknown outside the original scientific groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness. The lack of adequate directories adversely affects the exploitation of our national data resources and leads to unnecessary duplication of effort.

A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees. In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in an accessible archive at the conclusion of the project. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others.

Finally, the holdings of scientific and technical data by NARA in electronic or any other form are very small in comparison with the data holdings of the federal agencies and the organizations supported by them. Moreover, NARA's budget for its Center for Electronic Records, which has the formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of these scientific data sets. Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.

Page 29: Preserving scientific data on our physical universe: a new strategy for archiving the nation's scientific information resources

Page4

RetentionCriteriaAndTheAppraisalProcess

TheNationalArchivesandRecordsAdministrationappraisesrecordsonthebasisoftheirinformationalandevidentialvalue.Itisconcernedwithrecordsoflong-termvalue,thoserecordsthatwillprobablyhavevaluelongaftertheyceasetohaveimmediate,orprimary,uses.Thevalueofscientificandtechnicaldataisprimarilyinformationalandisbasedonthescientificcontentoftherecords,ratherthanontheevidencetheyprovideconcerningtheactivitiesoftheagencythatcollectedorcreatedthem.

Recommendations

The recommendations below regarding the retention criteria and appraisal process should be applied by those responsible for stewardship to all physical science data. Similar criteria and appraisal guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records.

As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set.
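
The retention criteria above lend themselves to a simple checklist. The following is a minimal sketch in Python, not a procedure prescribed by the committee: the class name `DataSetAppraisal`, its field names, and the two helper functions are hypothetical, introduced only for illustration. It records the five criteria and the four metadata elements named in the text and reports which are satisfied. It deliberately makes no retain-or-discard decision, because the report leaves that judgment to the appraisal process.

```python
from dataclasses import dataclass

# The four elements that complete metadata should define, per the text.
# The short keys used here are an assumption made for this sketch.
METADATA_ELEMENTS = ("content", "format", "structure", "context")


@dataclass
class DataSetAppraisal:
    """Hypothetical record of one data set's standing on each criterion."""
    name: str
    is_unique: bool               # uniqueness (nonredundant data)
    metadata: dict                # metadata element -> description
    hardware_available: bool      # can the data records still be read?
    replacement_cost: float       # estimated cost to re-acquire (any unit)
    peer_review_favorable: bool   # outcome of evaluation by peer review


def missing_metadata(appraisal: DataSetAppraisal) -> list:
    """List the metadata elements that are absent or empty."""
    return [e for e in METADATA_ELEMENTS if not appraisal.metadata.get(e)]


def appraisal_report(appraisal: DataSetAppraisal) -> dict:
    """Summarize the five retention criteria without deciding the outcome."""
    return {
        "uniqueness": appraisal.is_unique,
        "adequacy of documentation": not missing_metadata(appraisal),
        "availability of hardware": appraisal.hardware_available,
        "cost of replacement": appraisal.replacement_cost,
        "evaluation by peer review": appraisal.peer_review_favorable,
    }


if __name__ == "__main__":
    # Illustrative values only; not drawn from any real data set.
    example = DataSetAppraisal(
        name="example storm observations",
        is_unique=True,
        metadata={"content": "wind and pressure readings",
                  "format": "ASCII table",
                  "structure": "one record per hour"},
        hardware_available=True,
        replacement_cost=0.0,
        peer_review_favorable=True,
    )
    print(missing_metadata(example))
    print(appraisal_report(example))
```

A report of this kind could be handed to the principal investigators and project managers who, under the recommendations, perform the appraisal of individual data sets.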

The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities and must be able to respond to special events, such as when the survival of data sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention.

Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.

Opportunities Created by Technological Advances for New Data Use and Retention Strategies

Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage distributed scientific data archives much more effectively.

Table S.1 provides a summary of new technologies and related developments that enable a new strategy for the management of scientific and technical data. These advances in information technologies and data management support the creation of a highly distributed, federated management structure for our nation's scientific information resources.

TABLE S.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data

| New Technology Trends and Related Developments | Key Features | What Is Enabled? |
| --- | --- | --- |
| High-performance computer networks | Distributed functions; rapid delivery of large data volumes | Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility |
| Low and declining cost of storage | Inexpensive backup; continually declining cost; ease of migration | Deferral of archiving decisions; trust in distributed management due to safe storage backup |
| Advanced data management | Ability to rigorously and formally manage diverse data types | More complex data structures (other than "flat files") handled in archives, with great potential advantages |
| Changing requirements for information technology professionals | Ability of personnel with lower technical skills to succeed in data management roles | Ability to entrust scientific data management in a distributed environment |
| High reliability of technology components | Availability of better components and connections; reduced procurement and operations costs | Reduced cost and effort in data migration; trusted connections for communication and collaboration |
| Development and acceptance of standards | Agreement on terms, interfaces, media, procedures | Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support |

A New Strategy for Archiving the Nation's Scientific and Technical Data

In order to respond adequately to the imperatives for preserving data about the physical universe and to take advantage of the technological advances described above, the federal government should create an integrated and adaptive infrastructure and related processes for providing ready access to the national resource of scientific and technical data and related information. Such an effort must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations. The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:

Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time.

The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation.

Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data.

A successful archive is affordable, durable, extensible, evolvable, and readily accessible.

The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data.

Planning activities at the point of data origin must include long-term data management and archiving.

The Proposed National Scientific Information Resource Federation

The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important data and related information. Such an initiative would begin to exploit fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment. Several critical concepts must govern any federated management structure for it to function properly (Handy, 1992):

Subsidiarity: the power is assumed to lie with the subordinate units of an organization. Power can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization. It is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings. Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.

Pluralism: the members are interdependent. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. The existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.

Standardization: interdependence requires compatible languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. Standards that are developed by consensus of the subsidiary elements (e.g., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.

Separation of powers (responsibilities): a system of checks and balances is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.

Strong leadership: the central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.

A federated data management system would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society. The technology is available to make a fully networked, but highly distributed system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the very large investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. The committee believes this approach is especially timely and important in an era of federal government budget reductions.

Recommendations

The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:

Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation thus should provide the means for defining a coherent approach to managing the life cycle of scientific data. This approach should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The interagency Global Change Data and Information System is an example of a prototype NSIR Federation, focused on data for a specific set of interdisciplinary science problems. The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide broad access to data in other disciplines.

The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to help define and inaugurate this initiative.

Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities. Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate. In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, as well as the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the executive office function remains fully responsive to all members of the federation.

Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution should provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise. This measure might increase the cost to the providing institution, but would decrease the overall cost to the federation, the government, and the taxpayer.

The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative. Each participating organization would contribute to the federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory board consisting of experts from user groups should be formed in support of each initiative.

The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed to help guide these efforts.

The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards. The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community, so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers seek to increase the credibility of their products.

It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.

Recommendations Specifically for NARA

Although NARA has a legislative mandate to preserve federal records, it cannot today, nor will it likely ever be able to, act as the custodian of most physical science data. The data volume is too great in relation to the very low funding appropriated to NARA, the NARA staff do not have the specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated by NARA. In addition, the designation of a federal record is sometimes irrelevant to the archival process for scientific and technical data, and many data of long-term interest do not meet the existing definition of a federal record.* Hence, NARA has a special role as a partner in the archiving process for scientific and technical data sets that is different from its traditional role as the nation's archives.

* "'[Federal] records' includes all books, papers, maps, photographs, machine readable materials, or other documentary materials, regardless of physical form or characteristics, made or received by an agency of the United States Government under Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency or its legitimate successor as evidence of the organization, function, policies, decisions, procedures, operations, or other activities of the Government or because of the informational value of the data in them" (44 U.S.C. 3301).

The committee makes the following specific recommendations to NARA in addition to those made elsewhere in this report:

NARA should strengthen its liaison with each federal agency that produces scientific and technical data to ensure that appropriate attention is devoted to their long-term retention in a distributed storage environment.

NARA should form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections and related issues.

NARA should collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.

Finally, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and media, rather than mandating narrow and inflexible product standards.

Recommendations Specifically for NOAA

As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at a number of facilities across the country. NOAA thus has an especially important role in the preservation of our nation's observational data on the physical environment. The committee makes the following specific recommendations to NOAA:

NOAA should place a higher priority on documenting and establishing directories of its data holdings.

NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview.

Finally, NOAA, as well as every other federal science agency, should ensure that: all its data are shared and readily available; it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products; it participates in electronic networks that enable access, sharing, and transfer of data; and it expressly incorporates the long-term view in planning and carrying out its data management responsibilities.

The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.

1 Introduction

Standing at the intersection of past and future, we humans are fascinated with the events of yesteryear and intrigued with what tomorrow will bring. Our prehistoric ancestors began the process of recording aspects of the environment that were important to them (Marshack, 1985; Boorstin, 1992). Today we are curious about many more worlds, ranging from those of atomic size to those of cosmic scale. With instruments on Earth and in space, we seek to capture views of reality that will help us understand nature and our relationship to it.

Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to develop concepts, theories, and models to make sense of the patterns they represent. The resulting abstractions are the product of scientific endeavor, the goal being to develop the formal and systematic ideas that constitute the understanding of relationships between causes and consequences and perhaps may enable prediction of future sequences of events. Because scientists transform data from the material world into ideas, the observations of objects and processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific ideas.

Science generally works by proceeding from data to understanding through a process of organizing the data and analyzing their implications. The following definitions, adapted from Setting Priorities for Space Research: Opportunities and Imperatives (NRC, 1992a), indicate how the process works:

Data are numerical quantities or other factual attributes derived from observation, experiment, or calculation.

Information is a collection of data and associated explanations, interpretations, or other textual material concerning a particular object, event, or process.

Knowledge is information organized, synthesized, or summarized to enhance comprehension, awareness, or understanding.

Understanding is the possession of a clear and complete idea of the nature, significance, or explanation of something; it is the power to render experience intelligible by ordering particulars under broad concepts.

This process is cyclical. New data confirm or refute existing theories and stimulate new understanding, which generates new and deeper questions that often need entirely new sets of observations to begin the process of answering them. New understanding also leads to increased technological capability, and that in turn makes new observations possible and again allows us to contemplate more sophisticated questions.

Thus observations and scientific progress are intertwined; data from the physical world ensure that science is founded on reality as we try to answer the unending "how" and "why" questions that are part of being human. The answers become understanding that enables us to develop schemes for predicting or not being surprised by future events. And understanding, we hope, ultimately leads to wisdom about our interactions with the world around us.

Imperatives for Preserving Data on Our Physical Universe

The scientific reasons for preserving data derive from the fact that observations, knowledge, and understanding are cumulative. Thus we believe that the more complete the record, the more we can extract from it.

Many observations about the natural world are a record of events that will never be repeated exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.

Observed data provide a baseline for determining rates of change and for computing the frequency of occurrence of unusual events. The longer the record, the greater our confidence in the conclusions we draw from it. Our traditional observational records have portrayed frozen instants of reality. If preserved, they will continue to provide insights, but if neglected, they will melt away.

A data record is also worth preserving because it may have more than one life. As scientific ideas advance, new concepts emerge in the same or entirely different disciplines from study of observations that led earlier to different kinds of insights. New computing technologies for storing and analyzing data enhance the possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus, the relative importance of data, both current and historical, can change dramatically, often in entirely unanticipated directions. This means that the reanalysis of data, even in the distant future, may bring new understanding, which will again increase the value of those data over that which we might have assigned to them at the time of their archiving. Finally, the substantial investments made to acquire data records usually justify their preservation. The cost of preservation will almost always be small in comparison with the cost of observation. Because we cannot predict which data will yield the most scientific benefit in years ahead, the data we discard today may be the data that would have been invaluable tomorrow.

The assembled record of observational data thus has dual value: it is simultaneously a history of events in the natural world and a record of human accomplishment. The history of the physical world is an essential part of our accumulating knowledge, and the underlying data form a significant part of that heritage. They also portray a history of our scientific and technological development.

With appropriate explanatory documentation, often referred to as metadata, the data demonstrate the increasing sophistication of our attempts to understand our natural surroundings and the technological capabilities we apply to the task. Preserved for study by future generations, the data will speak across the years about what we tried to do, where we succeeded, and where we failed. With increasing capabilities for analyzing and conceptualizing patterns in data, those who follow may find in our archived data important clues that we could not or did not see. At the same time, our descendants will be grateful that we preserved a sufficiently long history of their world that they can make important decisions about their own future.

There are numerous socioeconomic reasons, in addition to the compelling scientific and historical motivations, for the long-term retention of observational, as well as certain types of experimental, data. For example, historical climate data have had well-documented uses in a broad range of applications in manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance, and entertainment (OTA, 1994). Such applications are common for other types of observational data on the Earth's environment. Experimental data in the physical sciences also have many industrial and other practical uses. Additional examples of the long-term uses of the various physical science data are provided in the next chapter.

A New Future for Scientific Data

The collections of scientific data acquired with government and private support are the foundation for our understanding of the physical world and for our capabilities to predict changes in that world. In the years ahead, the volumes of those collections of data will increase dramatically. They will stimulate advances in our scientific understanding and in our applications of that understanding to pursue important national goals. The scientific data in federal, state, and private databases thus constitute a critical national resource, one whose value increases as the data become more readily and broadly available.

Today, we can foresee the possibility of using the national resource of scientific data more advantageously than ever before, as technological advances open new vistas for managing and accessing scientific information. Growing computational power enables new approaches to the analysis, management, and application of data. Advances in data storage technologies make the long-term retention of virtually all data both feasible and affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII) enables unprecedented nationwide sharing and application of data that reside in appropriately configured databases. Automatic search procedures, file transfer capabilities, and the accelerating use of the World Wide Web functions on the Internet illustrate the power of the contemporary technology. It is important to note that these enabling technologies have emerged in a short time span; equally rapid advances can be anticipated in the years ahead, which will further facilitate the search for and access to the nation's data resources.

Our new power to store and distribute data and information is changing the way we work and think. However, the communities involved in the creation, retention, and use of scientific data about the physical world are not optimally organized. They commonly work toward disparate goals, are not well connected, and do not take full advantage of technological and conceptual advances in data management and communication. An entirely new approach to the long-term preservation of scientific data is now both feasible and essential. It must take advantage of advancing technology and of distributed communications and management structures to empower both the creators and the users of such data.

This study identifies the major issues regarding existing efforts to archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for those data, reviews important technological advances and related opportunities, and proposes a new strategy to ensure access to the data by future generations.

2 The Challenge: Preservation and Use of Scientific Data

We advance our understanding of the physical universe by building on current and past studies in individual disciplines, by collecting and analyzing new types of data, and by using past observations in entirely new ways not envisioned when the data were initially collected. The more complete the record of scientific data and information, the more new understanding and knowledge we can extract from it. Observations of natural phenomena typically represent a record of events that will never be repeated in a dynamic universe that continually changes in time and varies in space. New scientific advances have had significant, sometimes profound, societal and economic impacts and may be expected to be equally important in the future. Scientific data and information are at the heart of these advances and are essential for new discoveries. Therefore, they constitute a precious national resource.

The sections that follow describe briefly the two major types of data that are of critical importance in the physical sciences: experimental laboratory data in physics, chemistry, and materials sciences, and observational data in the earth and space sciences. In each of these broad areas the progress that has been made to date in terms of long-term preservation and accessibility is characterized, and the key issues identified. More comprehensive descriptions of the status of long-term data retention in the various physical science discipline areas are in the volume of working papers prepared as background for this report (NRC, 1995).

Experimental Laboratory Data

The experimental sciences have progressed over the centuries by building on the concepts, theories, and factual information resulting from each generation of scientific inquiry. The observations of Tycho Brahe were used by Kepler to develop his laws of planetary orbits, and Newton's formulation of mechanics drew upon the previous work of Galileo, Kepler, and others. A century of measurements on properties of the chemical elements provided the raw material needed for Mendeleev to construct his periodic table. The history of science is rich in examples where the introduction of new, often revolutionary, concepts rested on data that had been preserved from previous scientific investigations. Furthermore, the technology of tomorrow is often based on the laboratory data of today or yesterday.

The explosive growth of science in this century provides many other examples of the key role of data from previous experiments. When Townes and Schawlow published their landmark 1958 paper that demonstrated the theoretical possibility of building a laser, intensive efforts were started to find a real physical system that would meet the necessary requirements. Data on atomic spectra, some of them 60 to 70 years old, provided the key to creation of the first working gas laser. If it had been necessary to make new measurements on every conceivable system in order to select the most promising for trial, the invention of the laser and all the new technology and economic benefits that it has brought would have been delayed for many years.

The crash program to improve rocket propulsion systems following the launch of the first Soviet Sputnik provides another example. Data on the thermodynamic properties of a wide range of substances were essential to the efforts to optimize rocket engine performance. A concerted government program was started to build a database of thermodynamic properties for rocket engine design. Although some new laboratory measurements were required, many of the needed data were in the scientific literature, some published as early as 1880. The availability of these older data significantly aided the rocket engine program.

Data generated by scientists and engineers in the fields of physics, chemistry, and materials science have traditionally been published in research journals, which serve both a current dissemination and an archival function. This journal system has served science well for 300 years. Many scientific libraries throughout the country provide access to these journals. Because back volumes are kept in libraries in many different places, there is little danger of irreparable loss from a natural catastrophe. Many scientific societies also have depository systems that allow authors to submit voluminous data sets that cannot be published in the journals because of lack of space. The societies maintain these archives, generally on microfilm, and supply copies on request.

While the growing use of electronic recording and storage techniques is already affecting the traditional journal system, we can expect publishers to take advantage of the new technology to meet new needs. Scientific societies are beginning to implement electronic archives for preserving data that are too voluminous to publish in paper formats. For example, the American Chemical Society recently began to make data from papers in its leading journal (Journal of the American Chemical Society) available on the Internet. It is a natural step from the paper and microfilm archives that such societies now maintain to the electronic archives of the future. Clearly, these private sector archives must be an integral part of the overall concept of a "National Scientific Information Resource."

Electronically recorded data in the laboratory physical sciences are of two forms, original experimental measurements and evaluated compilations of published data. These are examined here in turn.

Original Experimental Measurements

Recent decades have seen significant changes in the form of "original data." A raw experimental result was, in the past, typically a measured value such as a voltage or distance. The investigator read these measurements from instruments, wrote them in a notebook, treated them arithmetically to obtain the desired scientific variable from the raw measurement, and interpreted them. The original measurements were eventually discarded in most cases. Today, many raw data are acquired and processed electronically as soon as they are entered into the computer, so that only the processed data exist long enough for anyone to look at. With rapid, automated data acquisition and manipulation, the option exists to keep electronic data and reanalyze them as required. However, automated data collection often results in large volumes of insignificant data, so that in many experiments the data stream is screened and most of the data are discarded in real time by a computer program or by the experimenter. For example, spectroscopists used to keep, at least temporarily, the photographic plates or recorder charts from which they had taken measurements. Now the spectral features may be analyzed electronically immediately upon measurement, and only the attributes of relevant features are recorded. The fraction of the raw data that is saved after initial processing may be small, sometimes less than one part in 10,000. In virtually all cases, there is no justification for preserving the raw data, because the experiment can be repeated in those rare instances in which an unanticipated future interest appears.

When considering laboratory data of this kind, it is usually best to recognize that no one knows as much about the original data as the original experimenter. If the experimenter does not find the raw data worth preserving (and worth documenting), then the data are probably not going to be of use to anyone else. Because the stages of processing (e.g., replication, averaging, coordinate transformations, applying corrections, and so on) differ for every type of measurement and undergo continual evolution as new techniques are introduced, it would be fruitless to try to formulate generic retention criteria for all types of laboratory data.

However, there are certain classes of laboratory data (where "laboratory" is used in a broad sense) that should be candidates for preservation if properly documented, because it would be impossible or impractical to reproduce the measurements. Some of the data taken in large plasma physics facilities fall in this category, because reproduction of the facilities would be extremely costly. A more striking example is the spectroscopic and other measurements from nuclear tests in the atmosphere, which it is hoped will never be reproduced. On a more mundane level, properties of engineering materials, measured as a part of large government research and development programs, provide many data of possible interest in the future. Such data are acquired as a small step in a larger program and usually are not published in the scientific literature or disseminated by the usual channels. They would be costly to reproduce because many of the materials were specially prepared with unique fabrication technology. Examples include polymer and sensor data from the Strategic Defense Initiative, engineering data from the National Aeronautics and Space Administration (NASA), and the superconducting materials measurements carried out to develop magnet fabrication techniques for the canceled Superconducting Super Collider. Even though this project will not be completed, the materials measurements should be saved, because they may well be applicable to future engineering projects.

Evaluated Compilations

Compilations resulting from the critical analysis of a large body of data from the scientific literature are a separate area for consideration. Well-known examples include thermodynamic property compilations such as the National Institute of Standards and Technology's Joint Army-Navy-Air Force (JANAF) tables and the thermophysical properties disseminated by the Department of Defense's Center for Information and Data Analysis and Synthesis at Purdue University (see the Physics, Chemistry, and Materials Sciences Data Panel report in the NRC (1995) report for a detailed discussion of these examples). The Department of Energy operates several data evaluation centers in nuclear physics and chemistry. In such centers, the data and backup documentation are not impossible to replace; they simply represent so much effort and exercise of specialized scientific judgment that it would be extremely costly to redo the work. The cost of not having the data available, although usually difficult to measure other than anecdotally, can be much higher than the cost of preserving them. In particular, if it becomes necessary in the future to expand or extend the compilation, the full documentation (e.g., data extracted from references, fitting programs, notes on the analysis techniques, and the like) will provide a valuable base for the new work. A major concern in considering these data collections is how the data and the underlying documentation can be preserved and made accessible if the centers producing them lose their funding or expert personnel. This concern increases as government agencies downsize their activities.

Observational Data in the Physical Sciences

Over the past two decades, the National Research Council and other groups have issued numerous reports that have addressed data management issues, including long-term retention requirements, for digital observational data in the earth and space sciences (NRC, 1982, 1984, 1986a,b, 1988a,b, 1990, 1992b, 1993; GAO, 1990a,b; Haas et al., 1985; NAPA, 1991). Most of these reports have focused quite narrowly on the data management or archiving problems of specific disciplines or agencies, and none has addressed comprehensively the issues associated with the long-term retention of observational and experimental data in the physical sciences.

Major Characteristics of Observational Data

Observational data sets, like laboratory data, include digital information (in both written and electronic form), graphical records, and verbal descriptions. The records exist as ink on paper, punched paper, film (including microforms), magnetic tape of many types (including videotape), magnetic disk, and digital optical media (including CD-ROM). Over the past three decades, however, the dominant form of data collection and storage has been electronic.

Observational data can be characterized by the collection and management practices applied throughout the life cycle of their existence. One might characterize two major practices driven by the funding models for conducting the underlying science. The "big science" funding model creates a funding umbrella for multiple individuals and institutions to conduct coordinated data acquisition, investigation, and publication. Often, these large programs adopt a standard approach for life-cycle data management. However, there is usually little standardization among the big science programs. Examples of such programs include the World Ocean Circulation Experiment, the World Climate Research Program, and NASA's Mission to Planet Earth (CENR, 1994). The other funding model, "small science," funds individuals or small groups of individuals to conduct independent data acquisition, analysis, and publication. Typically, these investigators plan, design, and implement their own data management strategy with little interaction with the rest of the scientific community. The data generated under both models have long-term value, both for science and for the broader interests of the nation.

Specific subdisciplines also impose different requirements on long-term data management. For instance, while there is general agreement within the physical oceanography community on the definition of standard observation variables and the processes of measuring those variables, the same cannot be said for biological oceanography. Because of differences in measuring techniques, lack of community agreement on naming standards, and the scientific process by which biology progresses, data management for biological data sets is inherently more complex than in physical oceanography. Archives of data from these two subdisciplines will have to accommodate multiple naming schemes and alternative taxonomies. Therefore, data managers and archivists have to deal with differing approaches and vocabularies among disciplines, evolution of discipline research paradigms over time, and diverging concepts and methods within a discipline.

Scientific research leads to the creation of data that can be processed and interpreted at different levels of complexity. Typically, each level of processing adds value to the original (level-0) data by summarizing the original product, synthesizing a new product, or providing an interpretation of the original data. The processing of data leads to an inherent paradox that may not be readily apparent. The original unprocessed, or minimally processed, data are usually the most difficult to understand or use by anyone other than the expert primary user. With every successive level of processing, the data tend to become more understandable and often better documented for the nonexpert user. One might therefore assume that it is the most highly processed data products that have the greatest value for long-term preservation, because they are more easily understood by a broader spectrum of potential users. In fact, just the opposite is usually the case for observational data, for it is only with the original unprocessed data that it will be possible to recreate all other levels of processed data and data products. To do so, however, requires preservation of the necessary information about processing steps and ancillary data.
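The point that higher-level products can be regenerated only from level-0 data plus a record of the processing steps can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the `ProcessingStep` record, the `reprocess` function, the linear calibration constants, and the sample values are assumptions, not part of any archive's actual practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProcessingStep:
    """One documented stage of processing (hypothetical record layout)."""
    name: str
    description: str
    apply: Callable[[List[float]], List[float]]


def reprocess(level0: List[float], steps: List[ProcessingStep]) -> List[List[float]]:
    """Rebuild every product level from the raw data and the recorded steps.

    The returned list holds level 0 first, followed by one product per step.
    """
    products = [list(level0)]
    for step in steps:
        products.append(step.apply(products[-1]))
    return products


# Hypothetical calibration: convert raw instrument counts to physical units.
def calibrate(counts: List[float]) -> List[float]:
    return [0.5 * c - 1.0 for c in counts]


# Hypothetical summary product: a single mean value.
def average(values: List[float]) -> List[float]:
    return [sum(values) / len(values)]


STEPS = [
    ProcessingStep("calibrate", "linear counts-to-units conversion", calibrate),
    ProcessingStep("average", "mean over the observing period", average),
]

raw_counts = [10.0, 12.0, 14.0]          # level-0 data (illustrative values)
products = reprocess(raw_counts, STEPS)  # levels 0, 1, and 2
```

In this sketch the final product is a single number, from which the three raw counts cannot be recovered; the reverse direction works only because both the raw values and the step definitions were kept.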

Another important characteristic of observational data is their volume. In this respect, observational data can be divided into two different classes: small-volume and large-volume data sets. The majority of traditional ground-based, in situ observations form small-volume data sets because they are based on individually conducted measurements or sample collections. Satellite and other remotely sensed observations generally form large-volume data sets.

The committee defines small-volume data sets as those with volumes that are small in relation to the capacity of low-cost, widely available storage media and related hardware. The hardware and software to write and produce CD-ROMs are now generally available for less than $10,000, and personal computers capable of reading CD-ROMs are being marketed as home-use, consumer items. For example, the total volume of the small-volume oceanographic data is projected to be less than 50 gigabytes by 1995, and thus the entire historical data set for all observations could be stored on fewer than 100 CD-ROMs. This is fewer discs than many people have in their compact disc music collections.
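The disc count quoted for the oceanographic example can be checked with a short calculation. The sketch below assumes a nominal CD-ROM capacity of 650 megabytes and decimal units (1 gigabyte = 1,000 megabytes); neither figure is stated in the report, and the function name `discs_needed` is hypothetical.

```python
import math


def discs_needed(total_gigabytes: float, capacity_megabytes: float = 650.0) -> int:
    """Number of discs required to hold a data set of the given size.

    Assumes decimal units (1 GB = 1000 MB) and a nominal per-disc capacity
    of 650 MB; both are assumptions made for this illustration.
    """
    total_megabytes = total_gigabytes * 1000.0
    return math.ceil(total_megabytes / capacity_megabytes)


oceanographic_discs = discs_needed(50.0)  # the 50-gigabyte projection above
```

Under these assumptions the 50-gigabyte data set needs 77 discs, which is consistent with the statement that fewer than 100 CD-ROMs would suffice.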

Issues such as archiving cost, longevity of media, and maintenance of the data holdings are not the dominant considerations with regard to retaining small-volume data sets. Rather, the major issue with respect to this class of data is the completeness of the descriptive information, or metadata. If a data set has been properly prepared and documented, the operations required to migrate the data should be amenable to significant automation and therefore pose only a minor challenge to the long-term maintenance of the archive. Further, these data may be widely distributed with simple replication of the media. For example, the various NOAA and NASA data centers have provided copies of their data sets to many users for a number of years.

A different problem is posed by large-volume data sets. The biggest data sets typically come from Earth observation satellite sensors and space science missions, and are challenging to some contemporary storage devices. However, it is clear that for the data set to exist at all, an adequate storage medium capable of capturing and maintaining the data for some time period must exist when the data are generated. Further, the time period for reliable, initial storage should at least cover the lifetime of the data set at the organization acquiring and using the data before the records need to be migrated to new media or transferred to another organization, such as NOAA or NARA. In addition, during the initial storage period, there are likely to be major increases in the density of mass storage accompanied by significant decreases in the cost of storage of the data. Thus, data sets that are challenging today will gradually be transformed to "small-volume" status in the future, as advancing technology increases the capacity and lowers the cost of storage devices. Nevertheless, it is important to note that the largest data sets (e.g., larger than one terabyte) can present significant organizational and management problems that require special analysis of the data flow, volume, access, and timing characteristics.

Observational Data in the Space and Earth Sciences

Astronomy and Astrophysics Data

Astronomy and astrophysics are observational sciences; that is, they are based on what the sky provides and we collect. Therefore, in many astronomical investigations there is no such thing as "repeating an experiment" with the expectation of getting the same results. Many objects have properties that change with time because of their intrinsic nature (e.g., variable stars), evolution (e.g., stars going supernova), or reasons yet unknown. It happens quite frequently that a highly variable object is found in satellite data and subsequent archival research in optical plates allows its identification as a given type of star.

Astronomy and astrophysics data are acquired by both ground-based and space-based observatories. Ground-based observatories, which are operated by universities or other nonprofit organizations (e.g., the Association of Universities for Research in Astronomy, the Smithsonian Institution) and funded by these organizations or by the National Science Foundation (NSF), have traditionally been used to study the sky at visible wavelengths. Since the Second World War, astronomers have used improving technologies to observe at radio and infrared wavelengths. Consortia of universities, including both U.S. and foreign institutions, are constructing new telescopes, which use advanced technology to build larger mirrors that will allow us to look deeper into the universe. Radio observatories range from smaller ones operated by universities to larger national facilities, such as the National Radio Astronomy Observatory, funded by NSF. Most telescopes are for individual observing programs, but some are dedicated to systematic sky surveys.

Data from ground observations have traditionally been the property of the observer; therefore, observatories have no standard policies for data archiving. The exceptions are some big projects, such as the Palomar Sky Survey, where data either are made public and sold or are archived within the university or observatory. Some centers, such as the National Radio Astronomy Observatory, the National Optical Astronomy Observatories, and the Harvard-Smithsonian Center for Astrophysics, have begun to archive most data obtained from major telescopes. These data are valued and used broadly by astronomers. Nevertheless, archival activities remain of generally low priority.

Although the older astronomical data consist of photographic plates and other analog data, virtually all data today are collected digitally. There also have been major efforts to digitize old photographic data to allow their analysis by computer. An example of this is the digitization of a whole-sky survey by the Space Telescope Science Institute, and this survey is now available for sale on CD-ROM from the Astronomical Society of the Pacific. Recently, the astronomical community adopted a standard format for transfers of digital files, the Flexible Image Transport System (FITS). With the advent of digital data, there also has been an evolution from individual data analysis packages to a few widely distributed packages (e.g., IRAF, AIPS, VISTA, XANADU), which provide standard tools for baseline analysis.
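Part of what makes a format such as FITS suitable for exchange and archiving is that each file begins with a plain-text header describing the data that follow. The sketch below builds a minimal primary header using only the Python standard library, relying on the general FITS conventions of 80-character header records and 2880-byte blocks. The helper names `card` and `minimal_header` are hypothetical, and the sketch covers only logical and integer values; it is an illustration, not a substitute for a full FITS library.

```python
def card(keyword: str, value) -> str:
    """Format one 80-character header record in fixed format.

    The keyword occupies columns 1-8, "= " columns 9-10, and the value is
    right-justified so that it ends in column 30.
    """
    if isinstance(value, bool):          # check bool first: bool is an int subclass
        text = "T" if value else "F"
    else:
        text = str(value)
    return f"{keyword:<8}= {text:>20}".ljust(80)


def minimal_header(bitpix: int, axes: list) -> bytes:
    """Build a primary header for an image with the given axis lengths."""
    cards = [
        card("SIMPLE", True),            # file conforms to the standard
        card("BITPIX", bitpix),          # bits per data value
        card("NAXIS", len(axes)),        # number of data axes
    ]
    cards += [card(f"NAXIS{i}", n) for i, n in enumerate(axes, start=1)]
    cards.append("END".ljust(80))        # marks the end of the header
    header = "".join(cards)
    padding = -len(header) % 2880        # pad with blanks to a full block
    return (header + " " * padding).encode("ascii")


header = minimal_header(bitpix=16, axes=[512, 512])
```

Because the header is ordinary ASCII text, a future user can read the data description without the software that wrote the file, which is the kind of self-documentation the preceding sections identify as essential for long-term retention.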

Because of the filtering and distortion produced by the Earth's atmosphere, the amount of energy emitted by celestial bodies that can be detected on the ground is limited significantly. Observations from space above the atmosphere remove such limitations. From their inception, space astronomy and astrophysics have been mostly under NASA's purview, although some important experiments have been financed by the Department of Defense. The data are collected through telescopes and detectors placed on airborne devices (balloons or planes), rockets, NASA's Space Shuttle, and orbiting satellites. The largest volume of data is collected by satellites, and most of these missions are international collaborations. The U.S. portion has always been handled by NASA.

Within NASA, space astronomy and astrophysics are organized in different wavelength-based disciplines, reflecting the organization in the scientific community. These disciplines include the infrared, whose main data center is the Infrared Processing and Analysis Center in Pasadena, California, where the data from the Infrared Astronomical Satellite mission are archived; the optical and ultraviolet, with data centers at the Space Telescope Science Institute in Baltimore, Maryland, where the Hubble Space Telescope data are archived, and at the NASA Goddard Space Flight Center in Greenbelt, Maryland, where the International Ultraviolet Explorer archive resides; and high-energy astrophysics, which maintains x-ray data at the Einstein Observatory Data Center in Cambridge, Massachusetts.

Table 2.1 provides a representative sample of NASA Astrophysics Archives. The earlier NASA astrophysics projects were so-called "principal investigator" missions, where a contract was awarded to a group of principal investigators, who built the hardware, received the data from the experiments, and analyzed and interpreted them. These principal investigators had no clearly stated guidelines to prepare data for archiving, other than to deliver the reduced data to the NASA data depository at the National Space Science Data Center (NSSDC) at the NASA Goddard Space Flight Center. Documentation generally was minimal, and the data, which often were not well-documented or well-organized, were difficult to retrieve for scientific use, even if they were adequately physically preserved.

It has become fully apparent, however, that the uniqueness and high acquisition cost of these space data make their effective preservation and archiving a high priority. Even after the active operation of a space observatory has ended, the data typically are retrieved and used by scientists for many more years. As a result, the situation has improved considerably at the NSSDC in recent years. Moreover, NASA now funds wavelength-specific scientific data centers to process the data, eliminate anomalies in the data, and provide software for scientific analysis.

TABLE 2.1 A Representative Sample of NASA Astrophysics Archives, by Satellite Mission

| Satellite mission | Data type | Year of launch | Duration | Total data volume (gigabytes) | Data center |
| High Energy Astronomy Observatory 2 | X-ray data | 1978 | 2.5 years | ~100 | Einstein Observatory Data Center, Cambridge, Massachusetts |
| International Ultraviolet Explorer | Ultraviolet data | 1978 | Ongoing | ~100 | National Space Science Data Center, Greenbelt, Maryland |
| Infrared Astronomical Satellite | Infrared data | 1983 | 300 days | ~150 | Infrared Processing and Analysis Center, Pasadena, California |
| Hubble Space Telescope | Optical/Ultraviolet data | 1990 | Ongoing | ~5500 by year 2005 | Space Telescope Science Institute, Baltimore, Maryland |

Planetary Science Data

Planetary data also are acquired by both ground-based and space-based observations. Planetary data include observations of the entire physical system and forces affecting a planet or other body, including the geology and geophysics, atmosphere, rings, and fields. The sensors used collect data across much of the electromagnetic spectrum. Currently, most planetary observations are supported by NASA, either as the direct result of planetary missions or as ground-based observations that support a mission. Over the past three decades, NASA has sent robotic spacecraft to every planet in the solar system except Pluto, to two asteroids, and to a comet. Men have walked on the Moon, performed experiments there, and returned samples. The knowledge we have about the bodies in the solar system, with the exception of our own planet, comes mostly from space missions. In some cases, such as the gas giants Jupiter, Saturn, Uranus, and Neptune, robotic space probes have provided most of our current knowledge. Many of the satellites of the other planets were no more than points of light with minimal spectral and light-curve measurements before the Voyager mission. Now each is recognized as a separate world with highly individual characteristics.

The scientific and historical importance of space-based planetary observations, the realization that additional missions cannot replicate the original observations, and the expense of planetary missions all prompted NASA to create the Planetary Data System (PDS) to improve the acquisition, archiving, and distribution of planetary data. The developers and current staff of the PDS recognize that the data from planetary missions make up the scientific capital of the agency's planetary exploration program and that these data are a national resource. The PDS tries to acquire all existing planetary data from NASA's missions and even from international ventures, in order to have a complete archive of our exploration of the solar system. In addition to the space-based measurements, the PDS accepts relevant ground-based observations and laboratory measurements that support planetary missions by providing baseline or calibration data. A basic condition for acceptance is that the data set must be properly documented and include all relevant ancillary data, including planet and spacecraft ephemerides, calibration tables, and experimenter notes about the shortcomings of the data. Members of the PDS scientific staff and scientists in the community who have expertise within the relevant disciplines peer-review each data set.
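An acceptance condition of this kind lends itself to a simple automated pre-check before peer review. The Python sketch below is illustrative only: the item names in `REQUIRED_ITEMS`, the dictionary layout of a submission, and the function `missing_items` are assumptions made for this example and do not describe the PDS's actual procedures or formats.

```python
# Ancillary items named in the acceptance condition above, expressed as
# hypothetical keys of a submission record.
REQUIRED_ITEMS = (
    "documentation",
    "planet_ephemeris",
    "spacecraft_ephemeris",
    "calibration_tables",
    "experimenter_notes",
)


def missing_items(submission: dict) -> list:
    """Return the required items that are absent or empty in a submission."""
    return [item for item in REQUIRED_ITEMS if not submission.get(item)]


# Hypothetical submission with file names invented for illustration.
example_submission = {
    "documentation": "instrument_description.txt",
    "planet_ephemeris": "planet_positions.tab",
    "spacecraft_ephemeris": "trajectory.tab",
    "calibration_tables": "",            # supplied but empty
    "experimenter_notes": "known_issues.txt",
}

gaps = missing_items(example_submission)
```

A check like this can confirm only that the pieces are present; judging whether the documentation is adequate still requires the expert peer review described above.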

One of the more important contributions of the PDS, especially with regard to the ongoing preservation of data in a useful form, is the electronic "publication" of the majority of the data from many planetary missions in the form of CD-ROMs. These include not only the data, but also documentation, format specifications, ancillary data, and even, in some cases, display and analysis tools.

Space Physics Data

Space physics involves the study of the largest structures in the solar system: the plasma environments of the planets and other bodies and the solar wind. Those environments consist of plasmas ranging from low energies (the thermal component) to charged particles of high energies, including cosmic rays accelerated by galactic processes. They also consist of the magnetic fields (if they exist) of planets or the Sun, as well as electrostatic and electromagnetic fields generated from natural instabilities in plasmas and charged-particle populations. Furthermore, in many locales, such as comets and the Earth's ionosphere, dust and neutral gases play an important role in mediating the behavior of plasmas and electromagnetic fields. As a consequence, the field of space physics requires a broad array of sensors and instruments at all levels of complexity.

Many instruments make in situ observations, but novel techniques enable remote sensing of various plasma regimes. Because some of the most apparent manifestations of space physics processes result in the northern lights and in planetary-scale modifications of the terrestrial magnetic field (and subsequent catastrophic effects on power grids and communications), space physics relies heavily on a wide array of ground-based observations from magnetometers, ionospheric sounders, incoherent scatter radar facilities, all-sky cameras, and photometers. In addition, a broad range of ground-based and space-based solar monitors has become crucial to study the correlations between various disruptions in the terrestrial plasma environment and solar activity, including sunspots, flares, and prominences.

For many reasons, it is essential to preserve space physics data for long periods of time. The Sun drives solar-terrestrial relationships, and many studies require observations over 22-year solar cycles. During this cycle the Sun reverses its magnetic polarity twice and goes through periods of increased activity with sunspots and associated flares. At solar activity minimum, flare and sunspot activity decreases, but expanded coronal holes appear. Long intervals of records are required because each solar cycle is different from previous ones and because there are long-term deviations, such as the Maunder minimum, from "normal" patterns. From the terrestrial point of view, there are motions of the magnetic dipole and even magnetic field reversals on time scales of thousands of years.

Because many space physics observations are taken in situ, models of the magnetosphere need data collected by many spacecraft, having different kinds of orbits and trajectories. To make sense out of data from one of these missions, it is important to be able to examine what another spacecraft in a different orbit found. Only by preserving the data from numerous missions do we acquire a sufficient archive.

Space physics has generated about 50 gigabytes of data per year over the last 30 years. The field has enjoyed this extraordinary productivity primarily because most missions were in Earth orbit and were tracked continuously for years. Many of these data sets were "archived" by sending the tapes and sometimes the relevant documentation to the NSSDC. Copies of the data on microfilm or on other media were sent there as well. Unfortunately, for every well-prepared, thoroughly documented space physics data set at the NSSDC, there are several poorly prepared and improperly documented data sets. For the earliest space missions, the archiving techniques were undeveloped, and archiving was not deemed a high priority. Thus, there are many data at the NSSDC that most scientists would find difficult to use with only the information originally supplied. Given the recent emphasis on the proper preservation of data and the importance of archiving, prompted in part by two General Accounting Office reports (GAO, 1990a,b) and also by a heightened awareness and desire for high-quality archives by the community, many recently archived data sets are in better condition than their predecessors. Even though the Space Physics Data System has been in existence only since 1993, the more advanced data activities in other disciplines have influenced the space physics community favorably. Hence, it is becoming more likely that the data now being submitted are of a higher quality, have more adequate documentation, and are more complete than earlier data sets.

NOAA, NSF, the Department of Defense, private and educational institutions, and foreign organizations typically support the ground-based observations. Most of these data, not managed by NASA, eventually come under the purview of the National Geophysical Data Center, operated by NOAA at Boulder, Colorado. The center's holdings consist of over 300 digital and analog databases, some of which are very large. However, many important data sets still reside solely in the hands of the original investigators, the military, or foreign sources.

Atmospheric Science Data

Atmospheric science data sets are diverse and present a variety of problems for distribution, archiving, and later interpretation. Some data sets on the atmosphere stand out as the largest in any scientific discipline, particularly those from remote sensing by satellite or radar; others consist of contributions from thousands of individuals all over the world, and the provenance of those data is sometimes uncertain. Many data sets span decades, and a few span more than a century, with accompanying problems due to lack of homogeneity in measurement techniques and sampling strategies. The largest atmospheric science data holdings in the United States are those of the federal government. However, significant amounts of material are available only from state or private sources.

Not all atmospheric data sets are large and conspicuous; many are small. There are hundreds of data sets of only a few megabytes or less. There are also many medium-sized data sets that range from perhaps 100 megabytes to tens of gigabytes, as well as very large data sets, many terabytes in volume. Table 2.2 provides a sampling of some of the larger data sets. Data volume does not drive the cost of archiving small-sized and medium-sized data sets if proper technical choices are made. Rather, it is the labor-intensive process of readying a data set for indefinite preservation that can be costly.

Many atmospheric data sets are dynamic, continually growing or being otherwise modified. Because weather keeps occurring, observational time series from operational meteorological activities are never "complete." In contrast, field programs usually have finite extent, and the resulting data sets have a definite end. However, many recent large, complex field programs have spawned associated monitoring activities that have continued after the initial phases of the project. Despite the frequent usage of the term "experiment" to denote field programs, these intensive efforts are observational, rather than experimental, exercises. Some truly experimental data exist, including a few data sets that include the results from such work as sensor development and tests, fluid dynamics experiments, thermodynamic measurements, and laboratory chemical studies. Nevertheless, the vast majority of atmospheric science data describe observations of ever-changing phenomena, and thus they are unique, valuable, and irreplaceable.

For much meteorological and climate research, as well as for many applications, it is essential to have archives of global data. This goal has been largely achieved in the United States, although older data sets still need to be digitized. Collectively, U.S. archives have the best sets of global data of any nation, particularly for data since the early 1950s. However, many valuable data stored in other nations are inaccessible to U.S. scientists (and in some cases are inaccessible to those nations' scientists as well).

Meteorological and other atmospheric data are used for varying purposes on different time scales. It is convenient to delineate three: (1) real-time or current, (2) recent past or short-term retrospective, and (3) distant past or retrospective. Meteorological data are probably used by a wider segment of the U.S. population than other scientific data, because they relate directly to practical, daily concerns. There is a large lay audience for weather and climate information.

The real-time or current use of most data sets usually motivates decisions on collection strategies and therefore quality. For example, the primary reason for collecting most meteorological data is for operational weather forecasting and warning, including forecasting for aviation operations. These data are perishable, and timeliness and spatial resolution are more important than absolute accuracy and continuity.

There are many recent past or short-term retrospective uses of meteorological data that can be of great significance. In this context, short term typically means from yesterday to a few weeks, or occasionally a few months, ago. A good example of such usage of data is in monitoring the development of a drought, a significant function for predicting crop yields. The transportation industry uses past data for verification of weather conditions for delay claims.

Most retrospective uses require data from several months old through the traditional (though now suspect) 30-year averaging periods used for climate normals. The National Climatic Data Center handles over 100,000 data requests per year. The state climatologists and regional climate centers also process about this many. Legal proceedings and insurance claims often require accurate meteorological records for corroboration of witness testimony, criminal investigations, and validations of weather claims related to accidents and property damage. Farmers and agronomists need data covering months to years for studies of pesticide residue and toxicology, decisions about pesticide spraying, planning of fertilizer usage, and crop selection. Architects and building engineers require site-specific data on heating and cooling needs, wind stresses, snow loads, and solar availability. Airport designers need prevailing wind patterns. Utility planners need aggregate heating and cooling loads for their areas.

Long-term retrospective uses of atmospheric data are the primary concern in this study. These uses are highly diverse and difficult to predict, and they make great demands on the data and their associated metadata.

TABLE 2.2 Volume of Selected Data Sets in Atmospheric Sciences

| Type of Data Set | Comments | Dates | Years | Volume |
|---|---|---|---|---|
| Atmospheric In Situ Observations | | | | |
| World upper air | Two times per day, 1,000 stations | 1962-1993 | 32 | 25 GB |
| World land surface | Every 3 hours, 7,500 stations | 1967-1993 | 27 | 60 GB |
| World ocean surface | Every 3 hours (40,000 observations per day) | 1854-1993 | 139 | 15 GB |
| World observations during First GARP Global Experiment | Surface and aloft, but not satellite | 1978-1979 | 1 | 10 GB |
| U.S. surface | Daily, now 9,000 stations | 1900-1993 | 94 | 15 GB |
| Selected Analyses (mostly global) | | | | |
| Main National Meteorological Center analyses | Two times per day, increasing at 4 GB/year | 1945-1993 | 48 | 50 GB |
| National Meteorological Center advanced analyses | Four times per day, increasing at 19 GB/year | 1990-1993 | 4 | 58 GB |
| National Center for Atmospheric Research's ocean observations and analyses | Thirty-eight data sets | | | 8 GB |
| European Center for Medium Range Weather Forecasting advanced analyses | Four times per day, increasing at 8 GB/year | 1985-1993 | 9 | 76 GB |
| Selected Satellites | | | | |
| NOAA geostationary satellites | Half-hour, visible and infrared | 1978-1993 | 16 | 130 TB |
| NOAA polar orbiting satellites | | 1978-1993 | 15 | |
| Sounders (TIROS Operational Vertical Sounder) | | | 15 | 720 GB |
| Advanced Very High Resolution Radiometer (4-km coverage, 5 channel) | | | 15 | 5 TB |
| NASA Earth Observing Satellite-AM | In development, 88 TB/year, level-1 data | 1998- | | |
| U.S. Radar Data | | | | |
| | Domains of 30 to 60 km | 1973-1991 | 19 | 1 GB |
| Next Generation Radar System (NEXRAD) (a) | 650 GB per radar each year, 104 TB/year for 160-site system | 1997- | | 100s TB |

Notes: Many other atmospheric data sets have volumes of only 1 to 500 MB.

1 MB (megabyte) = 10^6 bytes; 1 GB (gigabyte) = 10^9 bytes; 1 TB (terabyte) = 10^12 bytes.

(a) First radars were deployed in 1993.
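
The NEXRAD figures in Table 2.2 can be checked with a few lines of arithmetic. The sketch below assumes the decimal units defined in the table notes and simply multiplies the per-radar rate by the number of sites; the function and variable names are illustrative and do not come from the report.

```python
# Decimal units, as defined in the notes to Table 2.2.
GB = 10**9   # bytes per gigabyte
TB = 10**12  # bytes per terabyte


def network_tb_per_year(gb_per_radar_per_year, n_sites):
    """Return the yearly data volume of a radar network, in terabytes."""
    return gb_per_radar_per_year * GB * n_sites / TB


# Figures quoted in Table 2.2 for NEXRAD.
nexrad_tb_per_year = network_tb_per_year(650, 160)

# Years of full-network NEXRAD operation needed to equal the 130 TB
# listed for NOAA geostationary satellites (1978-1993).
years_to_match_geostationary = 130 / nexrad_tb_per_year

print(f"NEXRAD network: {nexrad_tb_per_year:.0f} TB/year")
print(f"Years to reach 130 TB: {years_to_match_geostationary:.2f}")
```

The product reproduces the 104 TB/year quoted in the table. At that rate, the full radar network would match sixteen years of geostationary satellite holdings in a little over one year.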

Most of the uses discussed above do not need data covering more than a few decades. Several of these applications, however, require the longest time series we can provide.

When technology advances and alters the method of data collection, there is a strong impetus to scrap the data collected by "obsolete" technology. However, these old data may become critical in the future. A notable example involves upper air wind profiles. These were originally collected by kites and later by radiosondes carried on balloons. With the onset of the space program, there was an urgent need for detailed low-altitude wind data for analysis of stresses on rockets at launch. Appropriate data could not be obtained from radiosondes, because of their high ascent rate, but older kite-based data, which had been scheduled for disposal, were available. Fortunately, they had not yet been destroyed when they were again needed.

There have been dramatic retrospective uses for military purposes (e.g., Jacobs, 1947). Planning for the D-day invasion of France, bombing runs over Japan, and the recent desert war in Iraq all required detailed climatic information, some long thought useless but not yet discarded. Such unexpected uses require the retention of many types of data from many places for a long time. Since the first flights of meteorological satellites in 1959, we already have had several examples of important retrospective uses of satellite data sets. For instance, a combination of reprocessed Nimbus-7 satellite data and old data from the Dobson network helped to confirm the recurring seasonal loss of stratospheric ozone over the Antarctic in the early 1980s.

If meteorologists are to study past weather events, such as severe hurricanes, damaging winter storms, or outbreaks of tornadoes, they must have at their disposal all data for the periods of time and geographical areas involved. Hurricane track records spanning more than a century are still regularly used for both research and operational purposes.

An increasingly significant use of meteorological data is the monitoring of the climate of the planet. Although barely two decades ago the study of climate was not a very high priority, today climate research issues are prominent; some of the nation's leading scientists specialize in climate studies, and policymakers seek information on likely climatic conditions of the future. The importance of old atmospheric data has become clear, but the reanalysis of these old data in the search for trends has often found them inadequate and poorly documented. The growing interest in global climate change and the difficulties with historical data that it helped uncover have strongly motivated earth scientists to take a serious interest in the long-term preservation of atmospheric data. Similarly, studies of long-term water and land usage require time series of many decades, or more. Such data needs also apply to planning aquifer usage and studies on deforestation and desertification.

Some historians examine connections between environmental conditions and human events. The time scales studied can range from the immediate, such as the influence of weather on battles, to the very long term, such as the rise or decline of a civilization affected by water availability. Workers in this field often search through the oldest existing data and have even provided meteorological information to atmospheric scientists from unconventional sources such as diaries and agricultural records.

Contemporary arrangements for the storage and archiving of atmospheric data are diverse and complex, and they present many problems. Some of these arrangements could be improved. Atmospheric data are in many locations, and they have a broad range of life cycles. Difficult problems arise in preparing metadata, packaging data for extended archiving, motivating researchers to prepare their data for use by others, and simply dealing with the large size of some of the atmospheric data sets. Criteria for identifying data sets to save indefinitely are not necessarily obvious. Finally, any proposed solutions must be made in full recognition of their impact on budgets and other resources.

Geoscience Data

Spatially, the domain covered by the geosciences extends from the Earth's core to the surface and into space. Temporally, it covers broad trends from the remote origins of the Earth to possible future scenarios, but it also is concerned with rapidly varying, often short-lived phenomena. Data in the geosciences fall into two broad categories. One is the observation and description of unique events, such as earthquakes, volcanic eruptions, and floods. In most cases, such data need to be archived for a long time period, regardless of their quality. The other category consists of observations of quantities continuous in space and time, such as gravity and the Earth's magnetism and structure, seismic sampling, and groundwater distribution.

The volume of geoscience data obtained with public funding has increased dramatically over the past few decades. This increase is the result of several converging factors, including the extremely varied types of observational data collected by the scientific community; the large volumes available through better measurement techniques, more sophisticated instrumentation, and advancing computer technology; and increasing demand from not only the scientific community but also the general public, including engineers, lawyers, and statisticians. Nongovernmental and commercial institutions also are major collectors and sources of pertinent data.

Two examples, the Landsat database and the nation's holdings of seismic data, illustrate many of the characteristics and issues inherent in the long-term archiving of geoscience data. Other examples are provided in the working paper of the Geoscience Data Panel (NRC, 1995).

The Landsat database consists of multispectral images of the Earth's surface, which have been accumulating since the launch of Landsat 1 in July 1972. The archive includes digital tapes of multispectral image data in several formats, black-and-white film, and false-color composites of synoptic views of the Earth's surface, all from 700 km in space. This database thus constitutes an important record of the evolving characteristics of the Earth's land surface, including that of the United States, its territories, and possessions. The record documents not only the results of various federal government policies and programs, but also those of many state and local governments and private programs and activities. It further provides documentation of the impact of various large-scale episodic events, such as floods, storms, and volcanic eruptions, and is of great value to both current and future public and private activities.

Landsat data are currently available in either image or digital form from the Earth Resources Observing System (EROS) Data Center in Sioux Falls, South Dakota. The Landsat satellites were originally under the control of NASA. However, in 1980 they became the responsibility of NOAA. The currently operational Landsat 4 and 5 spacecraft were placed under control of the EOSAT Company in 1985. Under EOSAT's control, the data are not in the public domain, are significantly more expensive, and carry proprietary restrictions on their use. Beginning with the launch of Landsat 7, responsibility for the Landsat system will pass back to NASA, which will build and launch the satellite in the late 1990s. NASA will operate the systems and deliver the data to the EROS Data Center for distribution. The data will once again be in the public domain, although the EROS Data Center still plans to charge more than the marginal cost of reproduction in fulfilling user requests. It is now widely recognized that the shift to private control of the Landsat system significantly reduced the access to and use of the data.

As of January 1993 the Landsat database contained more than 100,000 tapes of varying density and formats, and over 2,850,000 frames of hard copy imagery. Digital Landsat data are usually delivered to users as magnetic tapes. Other media, such as CD-ROMs and streaming tapes, also may soon be used. Data requests occur most frequently in reference to a particular geographic location, commonly expressed as latitude and longitude, for a particular time of the year, and meeting certain cloud cover limitations.
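
The typical request described above (a location, a time of year, and a cloud cover limit) can be expressed as a simple filter over a scene catalog. The following is only a minimal sketch: the `Scene` record, its field names, the `find_scenes` helper, and the sample entries are all hypothetical and are not taken from any actual EROS Data Center catalog.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Scene:
    """One catalog entry; all field names are hypothetical."""
    scene_id: str
    lat: float              # scene center latitude, degrees north
    lon: float              # scene center longitude, degrees east
    acquired: date          # acquisition date
    cloud_cover_pct: float  # estimated cloud cover, 0-100


def find_scenes(catalog, lat, lon, months, max_cloud_pct, half_width_deg=1.0):
    """Return scenes near (lat, lon), acquired in `months`, within the cloud limit."""
    return [
        s for s in catalog
        if abs(s.lat - lat) <= half_width_deg
        and abs(s.lon - lon) <= half_width_deg
        and s.acquired.month in months
        and s.cloud_cover_pct <= max_cloud_pct
    ]


# Invented sample records, for illustration only.
catalog = [
    Scene("A", 43.5, -96.7, date(1975, 7, 14), 10.0),
    Scene("B", 43.6, -96.5, date(1985, 7, 2), 60.0),   # too cloudy
    Scene("C", 43.4, -96.9, date(1992, 1, 20), 5.0),   # wrong season
    Scene("D", 30.0, -90.0, date(1992, 7, 11), 0.0),   # wrong place
]

summer_hits = find_scenes(catalog, lat=43.5, lon=-96.7,
                          months={6, 7, 8}, max_cloud_pct=20.0)
print([s.scene_id for s in summer_hits])
```

Filtering on month rather than on a full date range reflects the "time of the year" criterion: a user studying summer vegetation wants every qualifying summer in the record, not one particular year.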

Landsat data are used widely across the spectrum of geoscience applications in both civilian and military operations and research. These include such applications as the impact of human activities on the environment, land-use planning and resource-allocation decisions, disaster assessment, measurement and assessment of renewable and nonrenewable resources, and many others. They are used also by the general public in any context where views of the Earth's surface are needed. Examples include such diverse applications as visual aids in elementary and secondary education, background for highway maps, and illustrations for magazine articles about various regions of the world.

The Landsat database is unique because data from any given area may be available at sampled instants over a period of more than 20 years, thus making possible for the first time the study of slowly varying phenomena on Earth. Even though data from the early 1970s may now have a low frequency of use, their potential value remains high and they represent a significant archival record.

In contrast to the Landsat database, seismic data are broadly distributed rather than concentrated in one data center or system. This example focuses primarily on seismic data from earthquakes and explosions, both nuclear and chemical. Some federal agencies, notably the U.S. Geological Survey (USGS) and NOAA's National Geophysical Data Center, collect and archive important seismic exploration data. In addition, the Department of Defense (DOD), Department of Energy (DOE), U.S. Nuclear Regulatory Commission (USNRC), USGS, and NOAA have been and continue to be engaged in the collection and archiving of earthquake and explosion data. These agency programs are carried out independently of one another with the result that each agency has its own data management and archiving policies and practices. Consequently, these data holdings are greatly distributed among the agencies in fundamentally different forms and formats.

Global earthquake data have been acquired systematically since the early 1960s, when the U.S. Coast and Geodetic Survey of the Department of Commerce deployed a global seismic network of about 130 stations called the World-Wide Standardized Seismographic Network (WWSSN) and produced an archive of photographic film "chips" of the 24-hour/day recordings at all stations. Researchers and other users could obtain copies of these analog data at modest cost. The success of this precursor to today's global digital network cannot be overestimated, because the availability of a global data set in standard format from well-calibrated instruments permitted previously impossible studies of global seismicity patterns, earthquake source mechanisms, and the Earth's structure. These studies have led to a vastly improved understanding of the dynamics of the Earth as a whole, including tectonic plate movements, generation of new ocean floor, evolution of the Earth's crust, and occurrences of destructive earthquakes and volcanic eruptions.

The USNRC has funded the operation of regional seismic networks over much of the United States, some since the early 1970s, in support of programs for the siting and safety of nuclear power plants. USGS also has co-funded or separately funded regional networks for earthquake hazard assessments in seismogenic areas of the United States. However, changes in the funding priorities of USGS and USNRC in recent years have resulted in the interruption or discontinuation of some of these networks, particularly in the eastern United States. This has adversely affected data flow and seismic research. Seismic data have been archived in a broadly distributed, nonuniform mode by the organizations (mostly universities) that collected the data from the various networks. Many of these data have long-term value for characterizing in detail the tectonic activity of seismogenic areas in the United States.

In addition to the federal agencies, several private sector organizations now collect, distribute, and archive seismic data sets of long-term significance. The Incorporated Research Institutions for Seismology (IRIS), a not-for-profit consortium of universities and private research organizations, is engaged in a major development of a global digital seismic network of about 100 continuously recording stations (the Global Seismic Network) in cooperation with USGS. The project also includes a versatile, portable digital seismic array of up to 1,000 stations that can be deployed for various time intervals for special seismological studies. Data sets from the global and portable array are being permanently archived at the IRIS Data Management Center (DMC) in Seattle, Washington. The DMC also serves as the International Federation of Digital Networks' center for continuous digital data, which adds observations from many additional stations to the archive. IRIS funding for this activity comes primarily from NSF and DOD. Finally, individual universities, such as the California Institute of Technology, the University of California at Berkeley, the University of Alaska, the University of Washington, Columbia University, Memphis State University, and St. Louis University, also maintain archives of the seismic data that they collect.

The volume of digital data currently held and anticipated to be acquired by the IRIS DMC is summarized in Table 2.3. Although some data sets have been completed because they are project- or program-specific, most of the current operations continue to add large amounts of new data and implement new technology for recording, storage, retrieval, and distribution, thereby creating a dynamic, highly distributed archive whose holdings and access protocols change with time. For example, the IRIS DMC recently began providing both archived and near-real-time data on the Internet, thereby greatly facilitating rapid access.

TABLE 2.3 Summary of Actual and Projected Data Volumes Archived in the IRIS Data Management Center

Projected Data Volumes (gigabytes/year), 1994-1999

| Network | Number of Instruments (a) | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 |
|---|---|---|---|---|---|---|---|
| GSN | 100 | 1,159 | 2,359 | 3,959 | 6,003 | 8,047 | 10,091 |
| FDSN | 146 | 370 | 670 | 1,070 | 1,530 | 2,050 | 2,670 |
| JSP arrays | 5 | 1,095 | 2,190 | 3,650 | 5,475 | 7,300 | 9,125 |
| OSN | 30 | 0 | 0 | 15 | 58 | 218 | 498 |
| PASSCAL-BB | 500 | 1,318 | 2,277 | 3,556 | 5,154 | 7,073 | 9,312 |
| PASSCAL-RR | 500 | 542 | 885 | 1,341 | 1,912 | 2,597 | 3,397 |
| Regional-Trig | 500 | 150 | 290 | 490 | 730 | 1,030 | 1,390 |
| Total | 1,781 | 4,634 | 8,671 | 14,081 | 20,862 | 28,315 | 36,483 |

Note: Abbreviations are as follows:

GSN: Global Seismic Network (IRIS)
FDSN: Federation of Digital Seismic Networks
JSP: Joint Seismic Program (with the former Soviet Union) (IRIS)
OSN: Ocean Seismic Network
PASSCAL-BB: Program for Array Studies of the Continental Lithosphere Broadband (IRIS)
PASSCAL-RR: Program for Array Studies of the Continental Lithosphere Regional Recordings (IRIS)
Regional-Trig: Regional Triggered Recordings

(a) Projected numbers by year 2000.

Source: IRIS Data Management Center, private communication, 1994.
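
The totals row of Table 2.3 can be verified by summing each year's column, which also gives a quick measure of how fast the projected holdings grow. The sketch below is a plain consistency check on the published figures; the variable names are illustrative.

```python
# Rows of Table 2.3: projected volumes in gigabytes for 1994-1999.
YEARS = [1994, 1995, 1996, 1997, 1998, 1999]
VOLUMES = {
    "GSN":           [1159, 2359, 3959, 6003, 8047, 10091],
    "FDSN":          [370, 670, 1070, 1530, 2050, 2670],
    "JSP arrays":    [1095, 2190, 3650, 5475, 7300, 9125],
    "OSN":           [0, 0, 15, 58, 218, 498],
    "PASSCAL-BB":    [1318, 2277, 3556, 5154, 7073, 9312],
    "PASSCAL-RR":    [542, 885, 1341, 1912, 2597, 3397],
    "Regional-Trig": [150, 290, 490, 730, 1030, 1390],
}
PUBLISHED_TOTALS = [4634, 8671, 14081, 20862, 28315, 36483]

# Sum each year's column across all networks.
computed_totals = [sum(column) for column in zip(*VOLUMES.values())]

# Growth of the total between the first and last projected years.
growth_factor = computed_totals[-1] / computed_totals[0]

for year, total, published in zip(YEARS, computed_totals, PUBLISHED_TOTALS):
    status = "ok" if total == published else "MISMATCH"
    print(year, total, status)
print(f"1994 to 1999 growth: {growth_factor:.2f}x")
```

Every column sum agrees with the published totals row, and the projected total grows nearly eightfold over the five years shown.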

Significant volumes of exploratory seismic data obtained by geophysical contractors are held by the Department of the Interior. These data are used by the federal government and by petroleum companies in preparing for oil and gas exploration activities. There are, however, various proprietary restrictions on access to these data by other users.

In summary, the sources of seismic data are diverse, the archiving is highly distributed, and the data are in many different formats with different metadata structures. Moreover, data sets with long-term scientific and historical value reside in both federal and nongovernmental organizations, although in most of the latter cases federal funds have paid at least in part for their acquisition, archiving, and distribution.

The users of seismic data are many and diverse as well. They include federal and state government agencies, universities, and private industry, particularly the petroleum industry. Thousands of individuals are direct or indirect users of seismic data. Certainly, the public as a whole is an end user of historical seismic data and information, including the location, magnitude, and damage associated with earthquakes around the world.

Most seismic data sets have been or are now used both for operational purposes and for research, although for operational activities the data are used primarily immediately following their collection. Examples of their use for operational activities include tsunami warning and the rapid determination of the magnitude, location, and fault mechanism of destructive earthquakes and their aftershocks, both to inform the public and to assist in emergency response and special monitoring. On a longer time scale the data are used for hazard reduction and seismic safety in seismogenic regions, including local zoning decisions for future development, and siting and safety of critical facilities such as nuclear power plants. Data are obtained and used for continuous global monitoring of earthquake activity and of threshold or comprehensive test bans on underground nuclear explosions. Of course, there also is a broad spectrum of research that uses historical seismic data, including studies of the physics of earthquake and explosive sources, propagation effects on seismic signals, imaging of the Earth's structures at all scales, seismicity patterns, and earthquake prediction or hazard estimation. Older data are important and are commonly used for most of these types of research. For example, establishing the recurrence rate for larger-magnitude earthquakes requires decades to centuries of observations, even in the most seismically active areas.

In conclusion, most of the seismic data have long-term value for scientific research, disaster mitigation, and various socioeconomic uses. The data are archived in a broadly distributed manner. However, only a fraction of the archived data are under the direct control of federal government agencies, and it appears that many of these data sets are not considered official federal records. Except for most commercial exploratory seismic data, federal funds have paid for much of the instrumentation, station operation and maintenance, collection, storage, and distribution of seismic data. These important seismic data sets should be kept indefinitely in a form accessible to both the scientific community and other users.

Ocean Science Data

The oceans and atmosphere are turbulent fluids, constantly changing over many spatial and temporal scales. The numerous types of data that describe the oceans are often unrelated to one another, and even those that are related frequently have nonlinear and poorly understood interactions. For example, temperature data from a specific point and time in the North Atlantic cannot be accurately predicted from data collected in the same place the year before, or even the week before, or from data collected at the same time 1,000 kilometers or even 100 kilometers away, or from salinity data collected at the same place and time. Each datum contributes unique information as long as it is accurate, corresponds to a different physical quantity, is obtained from a different time and place, and cannot be accurately computed from other existing data.

One source of oceanographic data is the field program. Large and small field programs conducted in support of specific research projects are the prime contributors of in situ and in vitro observational data sets for all the ocean disciplines. In situ data sets are those that are derived by processing the measurements from sensors immersed directly into the ocean environment. Processing of in situ data is largely automated, and so the data sets are relatively dense. In vitro data sets are produced by laboratory analyses of samples collected from the ocean environment. These laboratory analyses combine sophisticated measurement equipment with labor- and time-intensive procedures. Therefore, in vitro data are typically sparse. Remotely sensed observations also may be associated with field program data by synchronizing in situ sampling with the use of remote sensing platforms.

The harsh and remote nature of the world ocean environment has inhibited the establishment of a routine data collection system. Although several remote sensing platforms do provide daily monitoring of ocean surface conditions on a global basis, continuous measurement of subsurface conditions with adequate time and space resolution for effective monitoring is not a reality. The lack of continuous and comprehensive oceanographic data may contribute most to the inconsistent data management practices and lack of community-wide standards for data reporting and exchange in the ocean disciplines. Because of the need for daily global prediction, such standards and practices are much more highly developed in the atmospheric community. The establishment of the Global Ocean Observation System presents an opportunity to engage the ocean community in the identification and implementation of appropriate standards.

Like other observational data, oceanographic data extend beyond directly or remotely measured observations of the environment. The data products based on the analyses, interpretations, and presentations of aggregates of observations also must be considered in the design, implementation, and maintenance of any data management and archiving mechanism. The more traditional products, such as parameter grids and output from ocean models, will surely be supplemented from innovative sources likely to emerge from the interactive scientific collaboration and value-added services that are becoming increasingly available through electronic networks.

The principal federal agency ocean data holdings are at the NOAA National Oceanographic Data Center (NODC), the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC) at the Jet Propulsion Laboratory, and at several Navy centers, which hold mostly classified data sets. In addition, significant amounts of data are held by the universities.

Located in Washington, D.C., the NODC archives physical, chemical, and biological oceanographic data collected by other federal agencies, including data collected by principal investigators under grants from the National Science Foundation; state and local government agencies; universities and research institutions; and private industry. The center also obtains foreign data through bilateral exchanges with other nations and through the facilities of World Data Center A for Oceanography, which is operated by the NODC under the auspices of the National Academy of Sciences. The NODC provides a broad range of oceanographic data and information products and services to thousands of users worldwide, and increasingly, these data are being distributed on CD-ROMs and on the Internet. Table 2.4 presents a summary of the NODC's data holdings.

The PO.DAAC is a major federally sponsored oceanographic data center, which is operated by the California Institute of Technology's Jet Propulsion Laboratory in Pasadena, California. The PO.DAAC is one element of the NASA Earth Observing System Data and Information System, and its mission is to archive and distribute data on the physical state of the oceans. Unlike the data at the NODC, most of the data sets at the PO.DAAC are derived from satellite observations. Data products include sea-surface height, surface-wind vector, surface-wind stress vector, surface-wind speed, integrated water vapor, atmospheric liquid water, sea-surface temperature, sea-ice extent and concentration, heat flux, and in situ data that are related to the satellite data. The satellite missions that have produced these data include the NASA Ocean Topography Experiment (TOPEX/Poseidon, done in cooperation with France), Geos-3, Nimbus-7, and Seasat; the NOAA Polar-Orbiting Operational Environmental Satellite series; and the DOD's Geosat and Defense Meteorological Satellite Program.

Summary of Major Issues

The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary services, such as abstracting and indexing, provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits.

The current system, however, is not well suited to handle the scientific electronic databases that are the focus of this study. The costs of maintaining these databases are typically too great to be covered by user fees; instead, these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating data resulting from their own research and development. In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term raises questions in all cases, however.

TABLE 2.4 National Oceanographic Data Center Data Holdings (as of October 1994)

| Discipline | Volume (megabytes) |
|---|---|
| Physical/Chemical Data | |
| Master data files | |
| Buoy data (wind/waves) | 9,679 |
| Currents | 4,290 |
| Ocean stations | 1,645 |
| Salinity/temperature/depth | 1,557 |
| BT temperature profiles | 872 |
| Sea level | 125 |
| Marine chemistry/marine pollutants | 89 |
| Other | 68 |
| Subtotal | 18,325 |
| Individual data sets, for example | |
| Geosat data sets | 12,841 |
| CoastWatch data | 60,000 |
| Levitus Ocean Atlas 1994 data sets | 4,743 |
| Other (estimated) | 11,000 |
| Subtotal | 88,584 |
| Total Physical/Chemical | 106,909 |
| Marine Biological Data | |
| Master data files | |
| Fish/shellfish | 115 |
| Benthic organisms | 69 |
| Intertidal/subtidal organisms | 30 |
| Plankton | 32 |
| Marine mammal sighting/census | 21 |
| Primary productivity | 7 |
| Subtotal | 274 |
| Individual data sets, for example | |
| Marine bird data sets | 52 |
| Marine mammal data sets | 4 |
| Marine pathology data sets | 4 |
| Other (estimated) | 200 |
| Subtotal | 260 |
| Total Biological | 534 |
| Total Data Holdings | 107,443 |

Source: NOAA, private communication, 1994.

A general problem common to all scientific disciplines is the low priority attached to data management and preservation. Experience indicates that new experiments tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater. For instance, according to figures supplied by NOAA, NOAA's budget for its National Data Centers in FY 1980 was $24.6 million, and their total data volume was approximately one terabyte. In FY 1994, the budget was only $22.0 million (not adjusted for inflation), while the volume of their combined data holdings was about 220 terabytes! During this same period, the overall NOAA budget increased from $827.5 million to $1.86 billion.
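
The scale of the mismatch in the NOAA figures is easier to see as funding per terabyte held and as a share of the agency's total budget. The sketch below uses only the numbers quoted in the text, in nominal dollars; the data volumes are the approximate values given there, and the variable names are illustrative.

```python
# Figures quoted in the text (nominal dollars, not adjusted for inflation).
DATA_CENTERS = {
    "FY1980": {"budget_usd": 24_600_000, "volume_tb": 1},    # "approximately one terabyte"
    "FY1994": {"budget_usd": 22_000_000, "volume_tb": 220},  # "about 220 terabytes"
}
NOAA_TOTAL_USD = {"FY1980": 827_500_000, "FY1994": 1_860_000_000}

# Data center funding available per terabyte held.
usd_per_tb = {
    fy: d["budget_usd"] / d["volume_tb"] for fy, d in DATA_CENTERS.items()
}

# Data center budget as a fraction of the overall NOAA budget.
share_of_noaa = {
    fy: d["budget_usd"] / NOAA_TOTAL_USD[fy] for fy, d in DATA_CENTERS.items()
}

# How many times smaller the per-terabyte funding became.
per_tb_drop = usd_per_tb["FY1980"] / usd_per_tb["FY1994"]

for fy in DATA_CENTERS:
    print(f"{fy}: ${usd_per_tb[fy]:,.0f} per TB, {share_of_noaa[fy]:.1%} of NOAA budget")
print(f"Per-terabyte funding fell by a factor of {per_tb_drop:.0f}")
```

On these figures, funding per terabyte fell from $24.6 million to $100,000, while the data centers' share of the NOAA budget dropped from about 3 percent to a little over 1 percent. Falling storage costs account for part of that gap, but the comparison illustrates the low priority the text describes.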

With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. For instance, the National Institute of Standards and Technology operates its Standard Reference Data Program, which covers a broad range of data in physics, chemistry, and materials science. The Department of Energy also supports a number of data centers of this type. Despite chronic underfunding, these programs have produced databases of lasting value to the nation. To cite one example, the Mass Spectral Database managed by the National Institute of Standards and Technology, the National Institutes of Health, and the Environmental Protection Agency contains spectra of over 60,000 compounds. It has been installed in many thousands of mass spectrometers that are being used for monitoring environmental pollution, designing drugs, characterizing new materials, and many other applications. The government investment in creating and maintaining this database has been repaid many times over.

In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities sometimes are well-documented and maintained in readily accessible form; but in many other cases, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable. In general, the agencies and other organizations do a good job of making data and information available to the scientists (primary users) during the active stages of projects and for some time afterward. Examples of notable successes include the NASA Planetary Data System, where the premise has been that the data have long-term value and must be accessible indefinitely into the future, and the NOAA National Data Centers, where the policy is to migrate archived data to new media every 10 years.

Technological advances have kept pace with the large growth in data volumes in scientific disciplines such that the long-term retention of all or nearly all of the data collected is feasible. Indeed, in most fields the entire collection of data from the past is not large in comparison with the current and anticipated data volumes that will be collected during only a year or two. However, significant fractions of the older data are difficult or in some cases impossible to access, because they have not been transferred to new storage media. This transfer often has received low priority because many data management and data retention activities are chronically underfunded and just handling the current data flow uses nearly all of the available resources. Thus, many valuable data sets are stored on low-density round tapes or on specialized magnetic tape media requiring hardware that is now obsolete or inoperable. For example, a large volume of the early Landsat coverage of the Earth resides on tapes that cannot be read by any existing hardware. Recent data-rescue efforts have been successful in getting older data into accessible form, but these efforts are time-consuming and costly. The reason these efforts have been undertaken, particularly in the observational sciences, is the recognition that retrospective data are vital to understanding long-term changes in natural phenomena. Given the extraordinarily rapid advances in computing and storage technology in recent years, planned periodic migration of data to new media will be increasingly important in all scientific disciplines to ensure long-term access to our scientific data resources.

It is axiomatic that a database has limited utility unless the auxiliary information required to understand and use it correctly (the metadata) is included in the record. An unambiguous description of the storage format is obviously essential for interpretation of an electronic database. The requirement is even more stringent to support meaningful access to data over the long term, because the hardware, software, and even the language by which formats are described will likely be different decades and centuries from now. The same is true regarding the scientific details of the content of the data. Auxiliary information such as environmental conditions (e.g., temperature and pressure), method of calibrating the instruments, and data analysis techniques must be given to be able to fully and correctly use the data. Providing this information is time-consuming and costly if done retrospectively, but much less so if it is prepared at the time the data are collected. Documentation that is inadequate for understanding and using the data greatly diminishes the value of the data, particularly for secondary and tertiary users.

Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. This, too, is especially a problem for potential secondary and tertiary users. In many cases the existence of the data is unknown outside the primary user groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness. This realization has resulted in an interagency effort, led by NASA, to build a Master Directory of Global Change Data and Information. This Master Directory is intended to inform users of where data sets of potential interest reside and how to access them. Similar directories are needed in other scientific disciplines, as well as across all disciplines. The lack of adequate directories adversely affects the exploitation of our national data resources and commonly leads to unnecessary duplication of effort.
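A directory of the kind described above can be pictured as a small catalog whose entries record what a data set is, who holds it, and how it can be obtained. The sketch below is purely illustrative: the `DirectoryEntry` fields, the `search` helper, and the sample entries are hypothetical and are not taken from the Master Directory or any agency system.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """One hypothetical directory record describing a data set."""
    title: str
    holder: str    # organization that holds the data
    access: str    # how a user can obtain the data
    keywords: set = field(default_factory=set)


def search(entries, keyword):
    """Return titles of entries tagged with the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [e.title for e in entries
            if wanted in {k.lower() for k in e.keywords}]


# Illustrative entries only; holders and access notes are placeholders.
catalog = [
    DirectoryEntry("Station temperature records", "Example Data Center A",
                   "request by mail", {"temperature", "climate"}),
    DirectoryEntry("Plankton counts", "Example Data Center B",
                   "on-site access", {"plankton", "ocean"}),
]
```

Even this minimal record answers the three questions the text raises (what exists, where it is, and how to get it), which is what lets a secondary user discover data outside the originating group.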

A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees. In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in a publicly accessible archive at the conclusion of the project. At best, scientists in the same field may be able to obtain desired data sets on an ad hoc basis by contacting the original investigators directly; secondary and tertiary users typically are unaware of the existence of the data and have no mechanism (other than personal contact) to access the data. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others.

As seen from the discussion in earlier sections and addressed in detail in the individual discipline panel reports (NRC, 1995), there is a large and diverse collection of scientific data and information extant in federal agencies and nonfederal organizations, including state and local agencies, universities, not-for-profit institutions, and the private sector. At a minimum, those data that are acquired with the support of federal funding should be regarded as part of the National Scientific Information Resource.

Finally, NARA's holdings of scientific and technical data in electronic or any other form are very small in comparison to the data holdings of these other organizations. Moreover, NARA's budget for its Center for Electronic Records, which has formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of the scientific data sets that require archiving. Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving infrastructure and procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.


3 Retention Criteria and the Appraisal Process

The National Archives and Records Administration appraises and retains records on the basis of their informational and evidential value. It is concerned with records of long-term value, those records that will probably have value long after they cease to have immediate, or primary, uses. Although scientific databases can provide evidence of the research conducted by an agency, their value is primarily informational; it is based on the content of the records rather than on their description of activities by the agency that collected or created them.

Special problems arise in appraising scientific data for their long-term value, particularly beyond the community of research scientists working in the specific field to which the measurements refer. Scientific data are voluminous, constantly increasing, and often difficult for those in other fields to use in their original formats. The data typically are expensive to collect, provide baselines for future observations, enhance understanding of other data, and are of immense importance for advancing scientific knowledge and for educating new scientists. The data also are important to an understanding of the world in which we live; the data (or the conclusions drawn from them) may be important to economists, historians, statisticians, politicians, and the general public. At the same time, it is difficult to predict the full value of the data to researchers and other users decades or centuries from now, although past experience has shown that scientific data collected many years ago provide unique contributions to new understanding of our physical universe.

Retention Criteria

The criteria that follow are to be used during the appraisal process to determine retention of physical science data. They should be applied by those responsible for stewardship to all physical science data, whether created by small individual projects or in the course of large-scale research programs. Similar criteria and guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records, and was provided in the charge to the committee as a central issue. Although the committee found that many retention criteria apply to both the observational and the laboratory sciences, significant differences are noted below. The metadata requirements, which tend to be either poorly understood or ignored, are given particular emphasis. Additional details and distinctions are discussed in the working papers of the discipline panels (NRC, 1995).


Criteria Common to Both Observational and Laboratory Sciences

Uniqueness of data. Do other authenticated copies of the data under consideration already exist in an accessible repository that meets NARA standards of permanence and security? If so, are they adequately backed up? If the answers are yes, the data set need not necessarily be retained.

Accessibility: adequacy of documentation. Though we might wish that all data sets were of high quality and accompanied by detailed metadata, that is not always the case. At a minimum, the metadata should be sufficient for a scientist working in the discipline to make use of the data set. If documentation is lacking or is so poor that a data set is not likely to be of value to someone interested in data of that type, or the data are more likely to mislead than to inform, that data set should have a low priority for archiving, or perhaps should not be archived even if resources are available. Nevertheless, the committee does not believe that many data sets should be purged because they lack sufficient documentation. The vast majority of data sets now meet minimum standards of documentation, which means that a skilled user either is given sufficient information or can figure it out. Adequacy of documentation is thus but one criterion to consider in the appraisal of data for long-term retention. Metadata requirements are discussed in greater detail below.

Accessibility: availability of hardware. Is the hardware needed to access the data obsolescent, inoperable, or otherwise unavailable? If so, the data are not usable. Decisions on whether to keep such data should be based on the feasibility of building or acquiring the necessary hardware, the usability of the data if they were accessible, and the nature of the data set, if known. To avoid this situation, migration of data to current storage media should be part of the normal routine to maintain the archive.

Cost of replacement. Could the data be reacquired if a future national need for the data were to arise? If so, would reacquisition of the data be more costly than their preservation? For the observational sciences, the answer is almost always that the data cannot be reacquired. The exception is with a data set in a discipline in which the changes of nature are so slow that the data could be recaptured at another time. For example, data on the fossil record of evolution contained in stratigraphic rock units could be reacquired. The laboratory sciences generate data that can, in principle, be reacquired. The question is whether the data can be reproduced at an acceptable cost. Data sets in the laboratory sciences that are candidates for long-term preservation can be classified into three generic types: (1) massive records and data from an original experiment, particularly a costly "mega-experiment," that there is no realistic chance of replicating (e.g., data obtained from expensive facilities such as plasma fusion devices, or data of interest in physics and chemistry derived from special events such as nuclear tests); (2) unique, perhaps sample-dependent or environment-dependent, engineering data, many of which never reach the published literature; and (3) critically evaluated compilations of data from a large number of original sources, together with the backup data and documentation on selection of recommended values, that represent tremendous accumulated effort.

Peer review. Has the data set undergone a formal peer review to certify its integrity and completeness, or is there documented evidence of use of the data set in publications in peer-reviewed journals? Have expert users provided evidence that this data set is as described in the documentation? Formal review of data sets is not now common. It should be encouraged, however, especially in the observational sciences. A good model is the peer review system for NASA's Planetary Data System. In the laboratory sciences, the critically evaluated compilations of data referred to in Chapter 2 have undergone extensive peer review.

Differences Between the Observational and the Laboratory Sciences

Data derived from laboratory experiments, such as the hardness of steel produced in a particular melt, differ from data based on observations of transient natural phenomena, such as the records of the 1993 midwestern floods. Thus, they stimulate different questions related to data preservation issues. As has already been noted, one difference arises from the fact that transient natural phenomena are not reproducible; the fact that the resulting observational data are "snapshots in time" sometimes means that the data have historical or evidential value in addition to their informational value. Observational data sets that provide a continuous time-series record of the physical universe, or of human impact upon it, are important to future generations for comparison and the identification of trends. In addition, many observational data sets represent major engineering or worker-intensive collection activities that warrant documentation and could not feasibly be carried out again.

Experimenters have good reason to believe that if and when their data are recreated in the future, instruments will be better. In many experiments, raw data (e.g., the initial sensor readings before any transformations, conversions, averaging, or corrections are made) may exist only for a fleeting instant before they are discarded or further processed. Even when raw (level-0) data are acquired and saved, principal investigators frequently fail to provide appropriate documentation because they do not expect anyone else to use these data. Instead, the processed data sets are more likely to have adequate metadata and meet the committee's other criteria for retention.

Quite the opposite situation seems to prevail for the observational sciences, where many secondary scientific users feel they need to be able to get back to the level-0 data and are becoming more active in demanding that the collectors of the data provide adequate metadata.

Special Issues in the Retention of Observational Data

All observational data that are nonredundant, reliable, and usable by most primary users should be permanently maintained. This judgment is based on the committee's belief that advancing technologies and better data management practices make it possible to stay ahead of the growing data volumes, as discussed in Chapter 4. It also is likely that it will be more expensive to reappraise data sets than simply to keep them. If the committee is wrong on these two counts, it may be possible that the volume of the data can be reduced through sampling techniques and through intelligent selection of the data sets of highest priority, as explained below.

Data sampling issues arise in measurement systems and in considering archival strategies to provide ready user access. Even before a data manager faces archiving decisions, many sampling rate decisions already have been made. For example, in the atmospheric sciences, we could easily sample temperature sensors and wind gauges 100 times per minute, but that frequency is unnecessary for nearly all uses. In general, it is necessary to keep only data properly sampled in time and space; that is, the sampling interval must be such that the most-rapidly-varying component is not aliased. At least two samples per cycle are required according to the Sampling Theorem. Thus reduction of oversampled data to the minimum sampling rate needed, coupled with lossless data compression, can significantly reduce data volumes with no loss of scientific content. However, if the phenomena of interest are slowly varying, then more rapid fluctuations, which might have value for other purposes, can be filtered out and the data reduced to retain the desired data unaliased; this technique can further reduce the data volume at the expense of losing higher-frequency data. The archiving of only "representative" subsets of our largest data sets is often suggested, but the notion raises difficult issues in statistics, data management philosophy, and budgeting. In concept, there may be acceptable procedures for the long-term archiving of representative subsets of large data sets, but no effective methodology exists today to choose those that would satisfy the needs of future users.
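The two-samples-per-cycle rule mentioned above can be made concrete with a little arithmetic. The following sketch is illustrative only: the function names are hypothetical, and the figure of 0.25 cycles per minute for the fastest component of interest is an assumed value chosen for the example, not one given by the committee.

```python
import math


def min_sampling_rate(f_max):
    """Lowest rate giving two samples per cycle of the fastest component.

    In practice the rate should exceed this value slightly, because exactly
    two samples per cycle can land on the zero crossings of a sinusoid.
    """
    return 2.0 * f_max


def apparent_frequency(f, fs):
    """Frequency at which a sinusoid of frequency f appears when sampled at fs."""
    # Fold the true frequency into the band [0, fs/2].
    folded = math.fmod(f, fs)
    return min(folded, fs - folded)


def decimation_factor(fs, f_max):
    """Largest integer factor by which an oversampled record can be thinned."""
    return max(1, math.floor(fs / min_sampling_rate(f_max)))


# A sensor sampled 100 times per minute; assume (hypothetically) that nothing
# faster than 0.25 cycles per minute is of scientific interest.
factor = decimation_factor(100.0, 0.25)
```

Under that assumption only one sample in every `factor` would need to be kept, provided the record truly contains no faster variation (otherwise it must be low-pass filtered first). `apparent_frequency` shows what goes wrong if the rate is too low: a component at 0.9 cycles per minute sampled once per minute is indistinguishable from one at about 0.1 cycles per minute.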

An example of the approach to deciding which observational data sets to retain comes from the atmospheric sciences. In this field the value of a data set as part of a long time series is an important criterion for archiving decisions. The temperature record for a given year from a station operating over a century is much more valuable than a similar record from a nearby station with a shorter lifetime. Studies of climate change and other types of environmental change find long time series to be essential. For example, confirmation of the seasonal stratospheric ozone depletion over the Antarctic in the 1980s required reference back to the Dobson column ozone data from the first half of this century for comparative purposes. The U.S. Historical Climate Network data are a high priority for archiving because they represent a long time series of high-quality data, with excellent metadata; this combination of attributes of data of a common type makes the overall data set exceptionally valuable.

Metadata Issues

The committee has arrived at several related conclusions concerning the importance of documentation, or metadata, to the effective archiving of scientific data. These include the following:

Effective archiving needs to begin whenever a decision to collect data is made. Originators of data should prepare them initially so they can be archived or passed on without significant additional processing.

The greatest barrier to contemporary and future use of scientific data by other researchers, policymakers, educators, and the general public is lack of adequate documentation. A data set without metadata, or with metadata that do not support effective access and assessment of data lineage and quality, has little long-term use.

For data sets of modest volume, the major problem is completeness of the metadata, rather than archiving cost, longevity of media, or maintenance of data holdings.

Lack of effective policies, procedures, and technical infrastructure, rather than technology, is the primary constraint in establishing an effective metadata mechanism.

This suite of conclusions led the committee to recommend that "adequacy of documentation" be a critical evaluation criterion for data set retention. The following discussion illuminates the multiple perspectives of metadata, the essence of the problem, and important elements of any metadata solution.

Perspectives on Metadata

The term metadata often is used to denote "data about data," that is, the auxiliary information needed to use the actual data in a database properly and to avoid possible misinterpretation of those data. The term is used in many scientific disciplines, but not always with precisely the same meaning. Some comments on different types of metadata may be helpful.

The most basic class of metadata comprises the information that is essential to any use of the data. An obvious example is the units in which physical quantities are expressed. If units are not specified, the numbers are ambiguous; at best, the user must attempt to deduce the units by comparison with other data sources. In dealing with observational data, the coordinates and the coordinate system (spatial and temporal) obviously must be specified. Laboratory data are often sensitive functions of some environmental condition such as temperature or pressure. For example, the boiling point of a liquid varies with pressure, so that a boiling point value has no meaning unless the pressure is specified. Although this is well known, many mistakes occur when a user assumes a value taken from a compilation to be a boiling point at normal atmospheric pressure, while it actually refers to a reduced pressure.
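The boiling point example suggests one way such essential metadata might travel with a value: store the unit and the pressure alongside the number, and refuse to interpret the number when the pressure is missing. The sketch below is a hypothetical illustration; the `BoilingPoint` record, its field names, the tolerance, and the "compound X" entries are assumptions made for the example, and only the water value and the standard atmosphere of 101.325 kPa are real figures.

```python
from dataclasses import dataclass
from typing import Optional

NORMAL_PRESSURE_KPA = 101.325  # one standard atmosphere, in kilopascals


@dataclass(frozen=True)
class BoilingPoint:
    """A boiling point stored together with the metadata needed to use it."""
    substance: str
    value: float
    unit: str                      # e.g. "degC" or "K"
    pressure_kpa: Optional[float]  # None means the pressure was not recorded


def is_normal_boiling_point(bp, tolerance_kpa=0.5):
    """True only if the recorded pressure is close to one standard atmosphere."""
    if bp.pressure_kpa is None:
        # Without the pressure the number cannot be interpreted safely.
        raise ValueError(f"{bp.substance}: pressure not recorded")
    return abs(bp.pressure_kpa - NORMAL_PRESSURE_KPA) <= tolerance_kpa


water = BoilingPoint("water", 100.0, "degC", 101.325)
# Placeholder numbers for a made-up compound measured at reduced pressure.
reduced = BoilingPoint("compound X", 85.0, "degC", 2.0)
undocumented = BoilingPoint("compound X", 85.0, "degC", None)
```

With the pressure recorded, a user can tell the first two records apart; the third raises an error instead of silently being read as a value at atmospheric pressure, which is the mistake the text describes.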

A significant problem in planning a long-term data archive is simple carelessness on the part of the creators and custodians of the data. Current practitioners in a scientific field may implicitly understand what the units or environmental conditions are. Shortcuts are taken by the authors that cause no problem in communicating with their contemporary colleagues (although they may be confusing to those in a different discipline), but practices and language can change over a generation or two. For a long-term archive, even the most obvious metadata should be specified in detail.

Beyond this basic type of metadata, there is auxiliary information that is not needed by the majority of users (present or future), but is of interest to a few specialists. Included here are the parameters that have only a slight influence on the data in question, so that most users do not need to know about them. For example, the typical user of a database of atomic spectra is concerned only with the wavelength and a rough value of the intensity of each spectral line. However, a few users who are trying to extract further information from the data may want to know the conditions under which the spectrum was recorded, such as the current density, type of electrode, and gas pressure. Referring to the JANAF Thermochemical Tables, which are discussed in the Physics, Chemistry, and Materials Sciences Data Panel report (NRC, 1995), most users are perfectly content with the values given (along with the confidence that the compilers did a good job of selecting the most reliable values). A minority of users, however, will want more details on how the data were analyzed, such as whether the heat capacity values were fitted to a fifth-degree polynomial or a cubic spline, and so forth.

Perhaps the most pervasive form of metadata is the accuracy of the values. To a purist, no number has meaning unless it is accompanied by an estimate of uncertainty. Specifying the uncertainty of each data point increases the size and complexity of the database, but sometimes may be necessary. At a minimum, the metadata should include general comments on the maximum expected errors, even if a quantitative measure such as standard deviation cannot be given. Finally, the term metadata is sometimes understood to encompass the full documentation necessary to trace the pedigree of the database. For laboratory data, this includes citations to all the primary research papers relevant to the database. A critical evaluation of especially important quantities (such as the fundamental physical constants or key thermodynamic values) may end up with only a few hundred data points, but include massive documentation and citations to a hundred years of literature. In such cases the metadata occupy far more space than the data themselves.

From this discussion, it is evident that metadata can span the range from a few simple statements about the data to very extensive (and expensive) documentation. It is difficult to give general guidelines on the amount of metadata needed; each case must be considered in the context of how future users may use the data and what auxiliary information they will need. Some guidance may be obtained from formal efforts to set metadata standards for experimenters to follow in preserving their data. In chemistry, for example, many organizations have developed detailed recommendations on reporting data from specific subfields. These have been collected in a recent book, Reporting Experimental Data (ACS, 1993). The American Society for Testing and Materials Committee E49 on Computerization of Material Property Data has an ambitious program to develop consensus standards for metadata requirements for databases of properties of engineering materials. These documents emphasize that metadata requirements must be approached on a case-by-case basis and must involve experts in each field.

The conclusion is that metadata, whatever the particular form, are crucial to the use of almost every data set and must be included in any archiving plan. The necessary metadata usually add very little to the storage requirements, but may require considerable intellectual effort to prepare, especially if they are assembled retrospectively rather than when the data are first collected.

The preceding discussion defines metadata from the perspective of the research scientist. An additional, and somewhat overlapping, perspective is provided by the computer science community. In this community, the term metadata refers to the specification of electronic representation of individual data items, the logical structure of groups of data items, and the physical access and storage media and formats that hold the data. To the computer scientist or database administrator, the contextual data that the research scientist refers to as metadata encompass other data entities. In fact, divergence can exist even among research scientists as to the differences between data and metadata. What is metadata for one may be data for the other.

In view of this confusion, the committee has chosen to keep the term metadata and to explicitly define its fundamental components. As such, the committee views metadata as representing information that supports the effective use of data from creation through long-term use. It spans four ancillary realms: content, format or representation, structure, and context. The content realm identifies, defines, and describes primary data items including units, acceptable values, and so forth. The representation realm specifies the physical representation of each value domain, often technology dependent, and the physical storage structure of aggregated data items, often arbitrary. The structure realm defines the logical aggregation of items into a meaningful concept. The context realm typically supplies the lineage and quality assessment of the primary data. It includes all ancillary information associated with the collection, processing, and use of the primary data. On the basis of this explicit definition, the following section describes metadata objectives, implementation issues, and potential for defining a standardized framework.
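The four realms can be pictured as four sections of a single metadata record. The sketch below is one hypothetical way to lay that out; the `Metadata` class, the `missing_realms` helper, and every field value in the sample record are assumptions made for illustration, not a format proposed by the committee.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Metadata:
    """Hypothetical record with one section per realm named in the text."""
    content: dict = field(default_factory=dict)         # items, units, acceptable values
    representation: dict = field(default_factory=dict)  # physical encoding and storage layout
    structure: dict = field(default_factory=dict)       # logical grouping of items
    context: dict = field(default_factory=dict)         # lineage and quality assessment


def missing_realms(md):
    """Names of realms for which nothing has been recorded yet."""
    return [f.name for f in fields(md) if not getattr(md, f.name)]


# Illustrative values only.
record = Metadata(
    content={"air_temperature": {"units": "degC", "valid_range": (-90.0, 60.0)}},
    representation={"air_temperature": "ASCII decimal, one value per line"},
    structure={"observation": ["time", "station_id", "air_temperature"]},
)
```

A check such as `missing_realms(record)` would flag that the context realm (lineage and quality) has not been filled in, which is the kind of gap the committee identifies as limiting long-term use.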

Analysis of Metadata: From Challenge to Solution

The problem of data set documentation is receiving increased attention in the context of scientific data management. In the earth sciences, global climate change research and general environmental concerns have ignited interest in a more interdisciplinary and long-term approach to conducting science. Interdisciplinary collaboration requires more effective sharing of data and information among individual researchers, disciplines, programs, and institutions, all of which may operate under different paradigms or have different terminology for similar concepts (NRC, in press). Further, long-term research requires that researchers be able to access and compare data sets that were created by past researchers and collected in different contexts by different technologies. Therefore, to support the interdisciplinary sharing and long-term usefulness of data, adequate metadata must be included within a framework that accomplishes the following objectives: provides meaningful selection criteria for accessing pertinent data; supports the translation of logical concepts and terminology among communities; supports the exchange of data stored in differing physical formats; and enhances the assessment of data sets by consumers.

A critical question is how to motivate the user community to participate in the process of metadata preparation and standardization. The issue of motivation is best addressed by the value system of the community itself. It may be argued that the problem will not be solved until the production of verified data sets and their provision to scientific colleagues become more highly valued activities. Developments such as the peer-reviewed publication of data sets should contribute to this shift in values. However, until these activities are assimilated into the fabric of career advancement, such as being incorporated into criteria for tenure in academic institutions, progress will continue to be slow and uneven.

Nevertheless, there are a number of specific actions that can be taken to promote the preparation and standardization of metadata. Funding agencies could help facilitate change by requiring and enforcing minimal documentation of data sets created under their grants (as well as other desirable data management and archiving practices discussed elsewhere in this report). This will not be an effective mechanism, however, unless the minimal standards for consistency and completeness are provided as a target for grantees and as a measuring stick for the funding agent. To be effective, these standards must be created through the collaboration of researchers, data managers, librarians, archivists, and policymakers.

Individuals and institutions in the scientific community could contribute by recognizing that data management and the provision of appropriate documentation of data are an essential science infrastructure function spanning all disciplines. Greater cost-effectiveness, consistency, and quality can be achieved if the many diverse data management activities are better coordinated. The essential requirement for making these value system changes and developing effective solutions is the recognition that all segments of the scientific community need to be educated on this issue. Funding agencies and the scientific community thus must move forward together in the development of a coherent strategy for end-to-end management that focuses on metadata requirements as a major element.

The ultimate solution for metadata handling will include an approach that not only supports the documentation of a data set throughout its life cycle, but also supports evolutionary documentation requirements. For example, early in the development and use of an instrument system, the scientific community may not be able to specify completely what metadata will be important for the effective use of the observations produced by this system. In this case, some of the documentation may include free-form narratives without the benefit of controlled vocabularies. Documentation of this nature is useful only to a limited audience that understands the specialized vocabulary of the source instrument, project, discipline, or institution. In addition, it is still difficult to make these descriptions useful to an automated agent performing a search on behalf of a user. As instrument use becomes more routine, this documentation could evolve to a more structured, but not cumbersome, form. One potentially useful approach constrains the textual descriptions to a well-defined, controlled vocabulary. If the vocabulary is clearly specified and made easily available with the data and associated documentation, users beyond those closely associated with the creation of the data set may be able to use this information to assess its relevance, significance, and reliability. Eventually, this more structured alternative will evolve into the specification of structured records with appropriately defined fields, standard value domains, and relationships with data set records. The committee also expects that improvements in software for natural language understanding will enable the automatic translation of free-form narratives into easily searched metadata fields.
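The intermediate step described above, constraining descriptions to a controlled vocabulary, can be sketched as a simple check of descriptive keywords against an agreed list of terms. The function name, the normalization rule, and the three-term vocabulary below are hypothetical and serve only to illustrate the idea.

```python
def unknown_terms(keywords, vocabulary):
    """Return the keywords (normalized) that are not in the controlled vocabulary."""
    allowed = {term.strip().lower() for term in vocabulary}
    normalized = [k.strip().lower() for k in keywords]
    return [k for k in normalized if k not in allowed]


# A made-up vocabulary for illustration.
VOCABULARY = {"air temperature", "wind speed", "sea surface temperature"}

# "SST" is project shorthand that an outside user or a search agent
# could not be expected to recognize.
flagged = unknown_terms(["Air Temperature", "SST"], VOCABULARY)
```

Terms flagged in this way could be mapped to vocabulary entries or proposed as additions, which keeps the descriptions searchable by users and automated agents outside the originating project.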


An equally important component of the metadata solution is the identification and detailed definition of classes of information that are critical to the complete and consistent documentation of data sets. Information modeling techniques can be used to develop these classes of information, some of which will have clear, concise definitions and a set of defined attributes, while others will be identified but will not have clearly defined attributes or boundaries with other classes. The resulting information model should present a technology-independent description of metadata entities and their relationships with the primary data. The model should identify metadata that may be generalized across all classifications of data sets and usage patterns, as well as accommodate specialized needs. Such a model should provide the basis for intelligent information policies, data management practices, and metadata standards. The information policies, however, must not saddle data providers with long, cumbersome "forms" to fill out. That would discourage the contribution of the data themselves, and the committee recognizes that data with incomplete documentation are better than no data at all. Nevertheless, appropriately established metadata standards do not necessarily need to be difficult or costly to apply, and therefore need not be onerous to the data provider. An example of a generalized metadata framework in the observational sciences is presented in the working paper of the Ocean Sciences Data Panel (NRC, 1995).

Other Elements of the Appraisal Process

A data management plan should be created for any new research project or mission plan, consistent with the requirements of OMB (1994) Circular A-130. A good example of this is the Project Data Management Plan of the NASA National Space Science Data Center (NASA, 1992). At a minimum, those individuals who have responsibility for implementing the data management plan and ensuring accessibility and maintenance of the data should play a key role in the subsequent appraisal process.


Most individual investigators and peer reviewers do not recognize their roles as appraisers for archival purposes, but the views of these experts should weigh heavily in the decisions relating to long-term value or permanency of the data obtained. The principal investigators and project managers who collect and analyze the data clearly have the best sense of how long the data will be valuable for their own scientific purposes. Primary users also can provide a detailed understanding regarding the uses of the data for their own discipline, but they may not comprehend the long-term value of the data for application to other research or national problems. Because such primary users and other data collectors sometimes do not think beyond their own needs, the agencies should work with NARA to provide good documentation at the inception of scientific projects, especially documentation that would be useful to secondary and tertiary users. Although providing more extensive documentation often may be viewed as an extra burden by the principal investigators and data managers, the labor and expense can be minimized if it is planned at the inception of a project, whereas it is extremely difficult after the project is completed. Proper data management practices can be promoted by considering data management in the evaluation of an investigator's past performance.

Because many scientific endeavors require participation by a number of agencies and organizations, it is important to coordinate data management activities and assign responsibilities for the maintenance of the data during periods of primary use. NARA is currently responsible for the final appraisal of federal records and the determination of their value as accessions to the permanent national collection under its statutory mandate. However, NARA should take advantage of the expertise of the other participants involved throughout the life cycle of the data.

The committee believes that all stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be seen as an ongoing, informal process associated with the active research use of the data, and therefore should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources manager to help with issues of long-term retention. Although the committee believes that formal appraisals should be kept to a minimum, appraisals should be performed according to the data management plan established for each project.

Although the committee was not expressly charged with advising on classified data, there is an obvious need to save classified scientific data as well. The complete records of the atmospheric atomic bomb tests are a clear example. It is more difficult to provide and assess metadata for a classified data set, and it costs more to maintain classified data. Also, there is a trade-off between the value of the data for national security, the risk to national security if the data are declassified, and the potential value to society of having the data declassified. Thus, it is highly beneficial and cost-effective to have mechanisms in place that consider these issues periodically for any given classified data set and that promote declassification when appropriate.

Recommendations

The committee makes the following recommendations regarding the retention criteria and appraisal process for physical science data:

As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that the long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set.


The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities, and be able to respond to special events, such as when the survival of data sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention.

Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.


4 The Opportunities: The Relationship of Technological Advances to New Data Use and Retention Strategies

Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage scientific data archives much more effectively in a distributed manner.

Assumptions about effective management of scientific data that have been long and firmly held are being directly challenged by new information technology. These assumptions have been based on experience with management of paper records, generally in domains outside of science. Some of the outdated assumptions that are rapidly losing their relevance include the following:

Physical possession of the data is essential to their management and archiving. This principle has outlived its usefulness in the context of electronic physical science data and has made access difficult for legitimate users. Electronic information is easily copied and disseminated. This feature removes constraints imposed by the limited physical access. Because most government physical science data are considered to be in the public domain, the constraints of copyright and fee collection to the free movement of data are removed as well.

Cost of an archive increases in proportion to collection size and use. Physical archive cost is a function of space, as well as cataloging, repair, and access efforts. Improved inventory technology has eased some of the cost burden over the last several years, but, fundamentally, archives with large physical holdings operate in traditional ways with linearly scaling costs. Such costs actually discourage use, since physical handling of items scales with use, whereas budgets reflect usage indirectly. In contrast, electronic information storage and management costs have declined as rapidly as the costs of computer technology and processing over the last 30 years. There is no foreseeable end to this process. Storing and using the next byte will be cheaper than storing and using the most recent byte for a long time to come.

Only archivists and librarians have the capabilities to manage archived data. While librarians and archivists are important advisors and participants in scientific data management, the dominant management responsibility falls to the scientific community and its designated scientific data managers (who are a blend of scientist, computer scientist, and librarian/archivist). If practicing scientists do not participate in the management of scientific information, such data will fall into obscurity or obsolescence.

The locator information (catalog) about the managed objects is simple and compact. Finding relevant scientific information often requires searching the full content, and this content generally is not in the conveniently compressed form of text. For example, to search for all data sets where the stratospheric ozone concentration is less than some ad hoc threshold in some region, one would need to execute a complex algorithm on every data sample covering the region in question. Queries such as this become even more complex if the region of interest is determined after retrieval (e.g., how many days in a row was the areal extent of the ozone hole over open ocean greater than 5,000 square kilometers?). The selection and use of scientific data to solve complex problems can be simplified through the use of the concept of browsing information based on content. Browsing often involves examination of large numbers of samples and data volumes. Specialized "browsing products" can be defined to locate records of interest. For the query examples above, low-resolution ozone maps could be used to find candidate data sets with high probability of relevance. Information about the processes (including sensor characteristics, computer program capabilities, and calibration points) used to develop the data set is needed for its proper use. Such information increases the size and complexity of the locator service.
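The idea of a browsing product for the ozone query can be sketched in Python as a two-stage search: a small, low-resolution summary is scanned first, and the expensive full-resolution check runs only on the candidates it flags. The function names, the grid layout (a list of rows), and the choice of block minima as the summary are assumptions made for this sketch; the report does not prescribe how a browsing product is built.

```python
def downsample(grid, factor):
    """Build a low-resolution browse product from block minima.

    Keeping the minimum of each block means any cell that falls below a
    threshold in the full grid is still visible in the browse product.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            min(grid[r][c]
                for r in range(r0, min(r0 + factor, rows))
                for c in range(c0, min(c0 + factor, cols)))
            for c0 in range(0, cols, factor)
        ]
        for r0 in range(0, rows, factor)
    ]


def candidates(browse_products, threshold):
    """Return ids of data sets whose browse product dips below the threshold."""
    return [ds_id for ds_id, browse in browse_products.items()
            if any(value < threshold for row in browse for value in row)]


def count_below(grid, threshold):
    """Full-resolution check, intended to run only on the candidates."""
    return sum(value < threshold for row in grid for value in row)
```

With block minima the first stage cannot miss a data set that satisfies a "less than threshold" query, at the price of storing one extra small grid per data set. A mean-based summary would be smaller in spirit but could hide a narrow minimum, so the right summary depends on the queries the archive expects.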

The remainder of this chapter describes how advancing information technologies enable the data manager, librarian, and archivist to deal with the challenges of scientific data management in a collaborative fashion with the scientific user community.

Enabling Technologies and Related Developments

Table 4.1 provides a summary of aspects of scientific data management changed by new technologies and related developments. These six areas are discussed in more detail below.

High-Performance Computer Networks


The rapid expansion of computer networks and their use for electronic mail and database access have obviated the need for researchers and other users of scientific and technical data to be in physical proximity to colleagues, information resources, and even advanced technical facilities. This has presented a menu of choices about the best means to distribute data and the responsibility of managing them.

A worldwide, "virtual" library is being created on the Internet. Application programs such as Mosaic are demonstrating the power of free and simple navigation across an ocean of available resources. Improving network capacity, reliability, performance, and security measures are helping to make these resources more widely accessible and useful.

High-performance networks also support movement of information for new applications (e.g., for producing safely managed backup copies, "profiling" information for individual user's needs, or staging data through a number of refinement steps in different locations for focused research). Networks support collaborative work and research projects that span traditional research boundaries. Such work requires easy access to a variety of data sources at once.

High-performance networks enable scientific data resources to be widely distributed and managed by groups of scientists. Users thus are freed to concentrate on the most effective use of the data, rather than on their own data management issues. Networks can provide a vehicle for regularly distributing backup copies of data and metadata to ensure safe storage. Distribution of data to users can be done via the network in addition to, or instead of, via physical media such as tapes and CD-ROMs. Data can be linked together to help users navigate among related items. This kind of linking is at the heart of the World Wide Web concept and brought to users by Mosaic. The population of information providers (e.g., people who can contribute to the knowledge base) has now grown to include all networked members of a user population. Such contributions can be as simple as an annotation on an existing item, or as complex as a fully processed and peer-reviewed new item. Most profoundly, the evolving network infrastructure enables new concepts for distribution of functions and responsibility in organizations (NRC, 1994).

TABLE 4.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data

| New Technology Trends and Related Developments | Key Features | What Is Enabled? |
|---|---|---|
| High-performance computer networks | Distributed functions; rapid delivery of large data volumes | Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility |
| Low and declining cost of storage | Inexpensive backup; continually declining cost; ease of migration | Deferral of archiving decisions; trust in distributed management due to safe storage backup |
| Advanced data management | Ability to rigorously and formally manage diverse data types | More complex data structures (other than "flat files") handled in archives, with great potential advantages |
| Changing requirements for information technology professionals | Ability of personnel with lower technical skills to succeed in data management roles | Ability to entrust scientific data management in a distributed environment |
| High reliability of technology components | Availability of better components and connections; reduced procurement and operations costs | Reduced cost and effort in data migration; trusted connections for communication and collaboration |
| Development and acceptance of standards | Agreement on terms, interfaces, media, procedures | Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support |

Although networks can provide a quick and easy means to distribute data, it must be noted that CD-ROMs have been used to distribute data for several years and have been very successful. CD-ROMs not only permit users to have a huge local library of data, but they often come with a better set of data access tools than are normally available. Some data sets are large enough that the most cost-effective method to deliver them is on media such as Exabyte tapes (8 mm).

Low and Declining Cost of Storage

As for most aspects of computer hardware, the cost of storage has declined continuously and rapidly for the 30 years of the modern computer age. New storage technology is also increasingly compact and supports ever greater access speeds (Gelsinger et al., 1989). The historical trends are expected to continue for up to 20 years. Already, laboratory engineering results confirm this projection for at least the next decade. The most significant implication is that the decisions about sampling or discarding scientific data can generally be deferred, particularly for data sets for which the necessary metadata exist and whose quality has been certified. For relatively smaller data sets, the deliberation regarding long-term retention may well cost more than the recurring acts of migration. The cost of storage is small in relation to overall mission or investigation costs and therefore should not be a decision driver. Experience suggests, however, that the funds to meet these costs need to receive special protection in the annual agency budget cycles. The support for the data management aspects of scientific missions has typically had a lower priority than the data collection aspects. The low cost of storage also implies that the incremental cost of supporting a remote safe copy of data and metadata also will be small, except for the very large data sets. Therefore, over the next few decades, data received and stored may be expected to be cheaply and quickly migrated to new technologies when storage media reach their nominal limits of reliability or for convenience of improved access.

It is important not to expect a perpetual advantage from this technological discontinuity. The fact that data require significant time periods for their migration must be considered. The cost decay trend will slow down at some point in the future, causing the overall cost of storage to return to something closer to the linear relationship to volume. We also must be realistic and expect that funds will not always be available to save and back up every data set. Decisions on retention or sampling will have to be made.

Nevertheless, the already low and continually declining cost of storage allows a priori decisions to be made in certain circumstances to keep scientific data sets indefinitely. Backup or safe storage copies of data are becoming more affordable as data migration becomes less expensive with smaller, faster, and cheaper storage devices. Reliability also is improving with new software-based archive systems (including migration and backup features). However, there is an enhanced need for ongoing technology monitoring by an appropriate body for media, standards, and migration automation. Such monitoring should be incorporated in any scientific data management and archiving strategy.
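The cost argument in this section can be made concrete with a small Python sketch. It assumes, purely for illustration, that the yearly cost of holding a fixed-size data set falls by a constant fraction each year; the function names and the numbers used with them are hypothetical, and real cost curves will not be this regular.

```python
def cumulative_storage_cost(first_year_cost, annual_decline, years):
    """Total cost of holding a fixed-size data set for `years` years.

    Assumes the yearly cost falls by the fraction `annual_decline` every
    year (a hypothetical constant rate), so the yearly costs form a
    geometric series: c, c(1-d), c(1-d)^2, ...
    """
    ratio = 1.0 - annual_decline
    return sum(first_year_cost * ratio ** year for year in range(years))


def cost_ceiling(first_year_cost, annual_decline):
    """Limit of the series as the holding period grows without bound: c / d.

    Valid only for 0 < annual_decline <= 1.
    """
    return first_year_cost / annual_decline
```

Under this assumption the total cost of keeping a data set indefinitely is bounded, which is one way to read the statement that retention decisions can often be deferred. Setting `annual_decline` to zero reproduces the linear growth that the text warns will return once the cost decay slows.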

The rapid change of storage technologies suggests that efforts to protect today's scientific data legacy must be accelerated. The obsolescence of media types and recorders/players is occurring within shorter and shorter time periods. This implies that "salvage" activities will be increasingly difficult for data left out of migrations to new media. This "join or be left behind" by-product of rapid technological change intensifies short-term budget pressures on archives. It demands in response a strong management commitment to provide resources and save important data sets.

If digital data are to survive, it is of fundamental importance to manage and constrain the costs of archive maintenance. The problem is that new data will be coming in, old data will need to be migrated to new media, the building will need to be repaired, and there usually will not be a lot of extra money for new equipment or added staff. To avoid problems, the data migration process in the system design must be almost totally automated. This refinement often has not been achieved, and it can cause unnecessary budget difficulties. Finally, it is essential for agencies to preserve all the hardware and software necessary to access all their data until the data have been successfully migrated or otherwise disposed of.
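One small piece of an automated migration process is sketched below in Python: copy every file from the old store to the new one and confirm each copy by checksum. This is a minimal illustration, assuming both stores are mounted as ordinary directories; `migrate` and `sha256_of` are hypothetical helper names, and a production archive system would add logging, retries, and media-specific handling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large data sets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(old_root, new_root):
    """Copy every file to the new store and verify each copy by checksum.

    Returns (migrated, failed) lists of relative paths. Nothing is deleted
    from old_root; disposing of the old media is a separate decision.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    migrated, failed = [], []
    for source in sorted(p for p in old_root.rglob("*") if p.is_file()):
        relative = source.relative_to(old_root)
        target = new_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also carries over timestamps
        if sha256_of(source) == sha256_of(target):
            migrated.append(str(relative))
        else:
            failed.append(str(relative))
    return migrated, failed
```

The function deliberately leaves the source untouched, in keeping with the advice to preserve the means of reading the old data until migration is known to have succeeded. The `failed` list gives an operator a short worklist instead of requiring a manual audit of every file.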

Advanced Data Management

There are signs that data management technology is beginning to address and, perhaps, to catch up with the complexities of the very large volumes of scientific data. Improvements have occurred in database management systems, hierarchical file systems, data representation standards, query optimizers, data distribution techniques, specialized access methods, and data security tools (Silberschatz et al., 1991). Further, investment in standards and cooperative approaches is accelerating, fueled in part by the demands of medicine, education, entertainment, journalism, financial services, and other commercial applications. While competing approaches and inconsistent vocabulary create near-term confusion, the attention and investment levels bode well for the longer-term capability to go beyond "flat file" representations of data that need to be archived. The new tools and techniques are more descriptive of the data, their heritage, the processes that have worked upon the data, and the relationships of data to each other.

New data management technology will enable easier representation of more diverse types of scientific data. Because of the rigor that new techniques require (e.g., for self-documentation or for precise definition of access methods), long-term archives will benefit from data structures other than flat files. The new technology also implies that the creation of a richer set of metadata will be easier to implement and that these data will be of high scientific value for content-based retrievals. To realize the potential of this enabled facility with metadata, the scientific community will have to accept and support efforts to develop and apply new metadata requirements.

The Changing Requirements for Information Technology Professionals

Information technology professionals with high skill levels can now be found in all parts of the United States and around the world. But as they bring the information technology industry to higher levels of maturity, the effect is to reduce the complexity of major tasks in managing information. Such tasks previously required their skilled use of sophisticated assembly language or job control language (JCL) programming. JCL programming refers to the steps in the old days that one used at the system console to get programs to run, attach the right files, print to the right printer, and similar functions. Today, much of this work is masked, made automatic, and controlled through icons and other means. These tasks can now be performed by competent scientists or professionals with lower technical skills, rather than by highly trained specialists. Because more functions can be completely handled by machines, management of the data can be greatly automated and operated by less skilled individuals. The data themselves can be widely distributed without fear of loss, particularly with a backup copy in safe storage.

Over the next 5 to 10 years, the costs for information technology professionals at individual scientific data centers and archives can be dramatically reduced. The reasons for the reduction in costs include more automatic processes for storage management, rudimentary learning capability in systems, services performed by end users based on their preferences, improved systems management, higher component reliability, improved application of standards, and vendor consistency with standards.

Although the dominant trend will be for a smaller, less technically skilled staff to manage the physical aspects of the archive, there will be a pressing demand for fewer, highly skilled people who blend the skills of physical scientist, computer scientist, and archivist. These people must be able to handle the intellectual challenges of bridging these disciplines while providing the coaching and direction to help develop data and operations standards for scientific communities.

High Reliability of Technology Components

Microprocessors, new storage media technologies, mature software, error correction capabilities, improved packaging, and reduced power consumption have all made significant contributions to the reliability of computer systems and networks. What was recently considered unreliable, requiring constant attention and expensive repair, is now regarded as reliable and not worthy of effort to repair. Although precautions have always been taken to protect against loss of valuable data, many of these precautions are now built into the base of mature software or are increasingly familiar parts of facilities' operating procedures.

High reliability of technology supports a capacity for high levels of trust and the ability to widely distribute functions and databases. These distributed systems can achieve the same levels of quality and trust as centralized archives through the use of the same underlying hardware and software technology, operating procedures, safe storage of copies, and high-quality (error-corrected) telecommunication connections. High reliability has enabled new applications such as the World Wide Web, in which context switching from one machine to the next on a worldwide basis is readily accomplished. Increased reliability also has allowed computing technology to be put into the hands of business managers, consumers, and shop clerks. Without such reliability, maintenance effort would outweigh productivity benefit. As a result, powerful organizational or operational frameworks can be built, much as new materials enable new architecture or new machines.


Development and Acceptance of Standards

The development of effective standards has been pivotal to promoting the widespread use of electronic information. Communication protocols such as TCP/IP have fueled the growth of the Internet. Other format standards for documents support their interchange. For example, the Standard Generalized Markup Language (SGML) provides a uniform way of formatting textual documents so that they can be read by different document processing tools. The HyperText Markup Language (HTML) is a standard used to represent and link documents; it is used to describe pages viewed with Internet viewers such as Mosaic. Hardware and software standards such as the instruction set architectures for microprocessor-based computers, modem protocols, media formats, and query languages also have played critical roles.

Standards can simplify many of the traditional data management jobs. For example, the time that would be used to decipher a tape format is saved and the job of installing a new application is facilitated. Having effective standards in place reduces the level of tedious, nonproductive effort and frees up time for new tasks for the archivist. Standards determined now will typically be in effect for long periods of time, perhaps a decade or more, with some small evolutionary augmentations. This means that a baseline of appropriate standards can be selected for a body of information with some reasonable expectation that they will not be quickly replaced. When it appears that the existing standards baseline needs to be updated, the information can then be migrated to a new one. A deliberate data migration strategy based on standards tracking is possible.

The role of standards certainly is not limited to the general computing community. Scientific teams and discipline groups continuously work to codify best practices, definitions, and algorithms. These are propagated as community standards. Standards developed by the scientific community are often the most important to promote and apply. If properly promulgated, they can enable improved understanding, broader collaboration, and facilitation of the data management and related research.

Finally, it should be emphasized that standards and guidelines to support long-term archiving must not inhibit innovation, or the evolution of information systems and technology. Often the best standards and guidelines are those that are independent of technology.

Opportunities for New Organizational Structures

With rapid technological improvements and newly enabled capabilities, it is sometimes easy to forget the importance of long-term commitment by managers to policy and resource requirements. No technological changes will by themselves replace the basic, unsung efforts of high-quality scientific data management. In fact, although technology itself can improve the availability of data, truly accessible and useful scientific information will be achieved only through such management commitment. This commitment must be based on a coherent strategy for life-cycle management of data, including technology acquisition, data and information management practices, and technology-independent standards to ensure that the minimum levels of data content and consistency for research uses are met. Further, such a comprehensive strategy will be successful only with the active and committed involvement of the scientific community itself. The level of effort and change that may be required to achieve this community involvement should not be underestimated, and fundamental change to the value system of the community may be required.

Nevertheless, as discussed above, technological advances allow the creation of new infrastructure, challenging existing organizational assumptions. Effective organizational designs based on new allocations of responsibility are enabled. For scientific data management, the technological changes support organizations with the following attributes:

Widely distributed responsibility. New telecommunications, data management, and standards technology allows for high levels of trust in distributed data management. Physical possession of data by archivists is no longer essential. The wide availability of information technology professionals and other skilled data managers (along with the lower technical skill levels actually needed) enhances the ability to distribute the data more broadly and increase user participation. Such distribution of data and their ownership (whether actual or implied) by user groups improves the utility of the data and helps create important support for long-term retention.

High-value peer-to-peer communication. With access to data and to people online, a variety of new collaborative relationships can develop. Information can be broadcast to interested individuals in a timely fashion. Data can be provided directly to field researchers to focus new data collection. Physical proximity and formal lines of communication are no longer vital to effective organizational operation. Indeed, closed, highly structured organizations often will be uncompetitive or fail to take full advantage of innovation.

Specialized data centers. Distribution of resources implies that some specific locations can specialize and yet still contribute effectively to all. Specialized groups or institutions could be created in a scientific discipline or in some aspect of data management, archives, or standards. Designation of such specialized centers, in addition to those already in existence, is a significant mechanism for achieving economies of scale, reducing overall costs while enhancing the effectiveness of certain functions for the benefit of all.

Explicit long-term (technology) strategies. A long-term technology strategy needs to be developed. The rapidly changing base of technology requires that a deliberate sequence of phases be selected, through which data and data management will migrate. The constant evolution of information technologies demands that an organizational element take on this "technology navigation" function.

Measurement as a vital tool. In a fast-paced, and, perhaps, widely distributed effort, metrics are important to clearly communicate expectations of performance, register results, and help in detecting weak spots for corrective action. In particular, metrics could be established to determine data set use and to support archiving strategy decisions. Metrics also could be developed to help ensure high-quality service and proper data protection.


5 A New Strategy for Archiving the Nation's Scientific and Technical Data

The scientific and technical data held by federal government agencies and by other institutions supported by federal funds constitute an extremely valuable national resource. Unfortunately, in many cases this resource can be exploited only with great difficulty because key elements of the infrastructure for broad and easy access to it are incomplete or missing.

Currently, the most important development within the federal government for improving the management and long-term retention of scientific and technical data is the National Information Infrastructure (NII) initiative. The NII focuses on the application of public, private, and academic resources to define, implement, and maintain an evolving network of knowledge resources (IITF, 1993). This infrastructure will be the foundation for information-centered enterprises of the next century (NRC, 1994). The scientific community, whose lifeblood is widely available data and information, must become fully engaged in this national effort. A coherent strategy needs to be defined and implemented, to combine new technological capability with a new way of doing business throughout all phases of the scientific information life cycle (observation, measurement, analysis, interpretation, application, dissemination, and education).

An effective information infrastructure must build on enabling technologies to create an integrated and adaptive system that is easily accessible to all potential users. Each user community will have its own view of what the NII means to its enterprise and how the NII can best serve its users because the NII will be made up of many separate "enterprise information infrastructures." The existing scientific and technical data centers and archives already constitute a separate enterprise information infrastructure, which must become fully integrated into the NII.

In the discussion that follows, the committee lays out a three-part strategy for the long-term retention of scientific and technical data. The elements of this strategy are based on the technological advances outlined in Chapter 4 and on the issues raised in Chapter 2, which provide the context and the need for action.

The strategy begins with a set of fundamental principles for the long-term retention of scientific and technical data. The second major element outlines the committee's proposal to form a National Scientific Information Resource Federation, which would provide a coordination mechanism for end-to-end management of networked scientific and technical data facilities. The final sections highlight some specific recommendations for NARA and NOAA in their long-term retention of scientific and technical data.


Fundamental Principles for Long-Term Data Retention

In order to respond adequately to the imperatives for preserving data about the physical universe and eventually to create an integrated, adaptive, and accessible infrastructure, the federal government should help establish effective and affordable processes for providing ready access to the vast national resource of scientific and technical data and related information. The process must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations. The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:

Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time. The original collection and analysis of scientific and technical data traditionally have been used primarily to support the scholarly publication of scientific interpretation by individual investigators. The availability of complete and consistent data sets for broader uses, both within and outside the scientific community, would significantly increase the return on the investment made in obtaining those data and provide insights not attainable if the original data were lost or unusable.

The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation. Technology can make data available through fast computers, large-bandwidth networks, massive storage capabilities, and portable media. However, if the paths to data are obscure, or there is no way for a user to determine what is significant and relevant, then the data become inaccessible and are effectively lost.

Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data. The problem of inadequate metadata is amplified when users are removed from the point of origin by being in a different discipline, by having a different level of expertise, or by time. Addressing this problem comprehensively will make data useful in the broadest possible context.

A successful archive is affordable, durable, extensible, evolvable, and readily accessible. These terms may appear to be vague targets, but they imply basic goals. The costs of developing, operating, and using an archive must not be excessive. The archive must endure the ravages of long-term use, and it must be able to extend broadly the services it offers and the records it manages. It must evolve to support the assimilation of new technology, policies, procedures, and uses. Finally, an archive is not effective if a broad population of users cannot use it. The archiving system thus should provide multiple levels of access to any subset of its holdings, although holdings not accessed often may not require a sophisticated access mechanism.

The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data. Archive centers generally should be at the agencies or institutions that collect the data, and they should be responsible for archiving and providing access to the data as long as the agency's or institution's mission and scientific competence continue to encompass the subject field. Physical transfers of the data should be avoided if possible, so agencies and institutions will need to allocate adequate resources to the entire life cycle of their data holdings.

Planning activities at the point of data origin must include long-term data management and archiving. This principle is recognized in the Office of Management and Budget Circular A-130 on the "Management of Federal Information Resources" (OMB, 1994). The scientific information management spectrum spans data collected from a sensor to the scholarly publications that report scientists' interpretations of the data. Scientists, information technology professionals, data managers, librarians, and archivists must unify their expertise in the establishment of a coherent strategy for end-to-end data and information management. Although these communities traditionally have not worked closely together, their combined knowledge and effort are now required. The benefit of incorporating planning at the point of origin is that it is cheaper and more effective to plan for retention than to reconstruct data sets later.

The Proposed National Scientific Information Resource Federation

The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important scientific and technical data and related information. Such an initiative would begin to exploit more fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment. In the discussion that follows, the committee reviews the basic elements of a federated management structure, describes some notable examples of existing federal government organizations for large-scale distributed data management, and outlines the most important aspects of the proposed National Scientific Information Resource Federation.

Elements of a Federated Management Structure

Several critical concepts must govern any federated management structure for it to function properly. These include the notions of subsidiarity, pluralism, standardization, the separation of powers, and strong leadership at all levels (Handy, 1992).

Subsidiarity means that power is assumed to lie with the subordinate units of an organization and can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization. For example, the Constitution of the United States reserves only specified powers for the federal government, with any unstated powers belonging to the states. Applied to the situation at hand, it is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings. Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.

Pluralism may be defined as interdependence of the members. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. As noted in the previous chapter, the existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.

Interdependence, in turn, requires standardization of languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. This too was discussed in Chapter 4, regarding the development of standards in software, hardware, and data management. Standards that are developed by consensus of the subsidiary elements (e.g., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.

A separation of powers (responsibilities), with a system of checks and balances, is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.

Finally, a federation requires strong leadership that is effective, yet not overbearing. The central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.

Examples of Distributed Data Management Organizations

Successful examples of a federated management structure are numerous in the private sector (Handy, 1992). More specifically, however, there already are two large-scale, federal government, distributed data management groups that embody many, though not all, of the federated management attributes outlined above. These are the Interagency Working Group on Data Management for Global Change and the Federal Geographic Data Committee.

Interagency Working Group on Data Management for Global Change

In 1990, Congress formally established the U.S. Global Change Research Program (GCRP), "aimed at understanding and responding to global change, including the cumulative effects of human activities and natural processes on the environment, [and] to promote discussions toward international protocols in global change research" (CENR, 1994). The activities of the GCRP are coordinated by the Committee on Environment and Natural Resources (CENR), under the President's National Science and Technology Council.

The timely availability of a broad spectrum of scientific data and information, from both governmental and nongovernmental sources, is fundamental to meeting the goals of this program. A Global Change Data and Information System (GCDIS) is being created to facilitate access to and use of the data and information necessary to support global change research. The federal organizations involved in the GCDIS planning include the Departments of Agriculture, Commerce, Defense, Energy, Interior, and State, as well as the Environmental Protection Agency, the National Aeronautics and Space Administration, and the National Science Foundation.

According to The U.S. Global Change Data and Information System Draft Implementation Plan (CENR, in press), the GCDIS is building on the resources and responsibilities of each participating agency, linking the data and information services of the agencies to each other and to the users. The system thus is composed largely of the separately funded components contributed by the participating agencies. It is supplemented by a minimal amount of crosscutting new infrastructure through the use of standards, common management approaches, technology sharing, and data policy coordination. Neither a lead agency nor a separately funded budget for the GCDIS is planned; rather, implementation of the system is being coordinated through the Interagency Working Group on Data Management for Global Change (IWGDMGC). Decision making, therefore, is done through a consensus process based on the common interests of all participants.

Plans for the GCDIS recognize that the global change data must be available for a very long time, regardless of the changing interests of the researcher, group, or agency that originally collected and analyzed the observations. Although each agency participating in the GCDIS is expected to manage, store, and maintain the data sets under its purview, the plan does allow an agency to designate another GCDIS agency to archive some of its data. The participating agencies are expected to adhere to government standards for media, storage, and handling as prescribed by NARA and the National Institute of Standards and Technology. The agency archives associated with the GCDIS access system will be staffed by professionals who understand the data and their sources. The IWGDMGC expects to develop guidelines for preparing data sets and associated documentation for long-term retention at the participating agencies. Ideally, the GCDIS archives also will be associated with research groups, both within and outside government, who, as principal users of those data, will verify quality and documentation of the data.

The GCDIS plan gives each agency responsibility for its own data-purging policies, although interagency coordination procedures will be developed to prevent the loss of important data sets. Before any data sets are purged, however, an agency will be required to notify the IWGDMGC of its plans at least one year in advance, and to allow other GCDIS agencies to indicate their requirements for those data, or to agree to assume responsibility for the archiving of those data. In the event that no agreement can be reached on the disposition of a data set identified for purging, existing NARA procedures will apply (CENR, in press).
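
The purge-notification rule described above can be read as a small decision procedure. What follows is a minimal sketch in Python and is not part of the GCDIS plan itself. The names `PurgeNotice` and `disposition`, the 365-day approximation of "one year," the placeholder agency names, and the outcome when the agencies agree that nobody needs the data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# "At least one year in advance," approximated here as 365 days (assumption).
MIN_NOTICE = timedelta(days=365)


@dataclass
class PurgeNotice:
    """Hypothetical record of an agency's purge notice to the IWGDMGC."""
    data_set: str
    owning_agency: str
    notified_on: date
    planned_purge: date
    agreement_reached: bool = False      # did the agencies agree on a disposition?
    new_custodian: Optional[str] = None  # agency agreeing to archive the data, if any


def disposition(notice: PurgeNotice) -> str:
    """Return the outcome the text describes for a proposed purge."""
    if notice.planned_purge - notice.notified_on < MIN_NOTICE:
        return "insufficient notice"
    if not notice.agreement_reached:
        # No agreement on disposition: existing NARA procedures apply.
        return "apply NARA procedures"
    if notice.new_custodian is not None:
        return f"transfer to {notice.new_custodian}"
    # Assumption: with agreement and no new custodian, the owner may purge.
    return "purge permitted"


notice = PurgeNotice(
    data_set="example data set",
    owning_agency="Agency A",
    notified_on=date(1995, 1, 1),
    planned_purge=date(1996, 6, 1),
    agreement_reached=True,
    new_custodian="Agency B",
)
print(disposition(notice))
```

The ordering of the checks reflects one reading of the text: the notice period is a precondition, and the fallback to NARA procedures applies only when the agencies fail to agree.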

Federal Geographic Data Committee

The other major federal data coordination entity important to the long-term management of observational data (including some data from the biological and social sciences) is the Federal Geographic Data Committee (FGDC). The Office of Management and Budget (OMB) established the FGDC in 1990 to develop a National Spatial Data Infrastructure (NSDI) to work toward the coordinated development, use, sharing, and dissemination of geographic data (OMB, 1990). Participating government organizations include the Departments of Agriculture, Commerce, Defense, Energy, Housing and Urban Development, Interior, State, and Transportation, as well as the Environmental Protection Agency, Federal Emergency Management Agency, Library of Congress, National Aeronautics and Space Administration, National Archives and Records Administration, and Tennessee Valley Authority. In fulfilling its mandate, the FGDC carries out the following activities, among others:

promotes the development, maintenance, and management of distributed database systems that are national in scope for geographic data;

encourages the development and implementation of standards, exchange formats, specifications, procedures, and guidelines;

promotes technology development, transfer, and exchange; and

promotes interaction with other existing federal coordinating mechanisms that have interest in the generation, collection, use, and transfer of spatial data (FGDC, 1994).

The FGDC has received authority and some limited funding to pursue these objectives. Specifically, Executive Order 12906 on "Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure," assigns to the FGDC the responsibility to coordinate the federal government's development of the NSDI. That Executive Order also instructs the FGDC to involve state and local governments in its NSDI activities, and to use the expertise of academia, professional societies, the private sector, and others as necessary to assist the FGDC.

The FGDC has established a matrix of subcommittees and working groups according to discipline-related data categories and interests. The working group issues include a framework for data, a clearinghouse for data, standards, technology, and data archiving. The FGDC plans for data archiving are still being developed, however.

Creation of the National Scientific Information Resource Federation

The two examples cited above indicate that a federated management structure for highly distributed scientific data can be created. In fact, between these two groups, the life-cycle management of many of the data that are the topic of this report is beginning to be systematically approached. Nevertheless, as discussed in this report and in the volume of working papers (NRC, 1995), many important gaps and inadequacies remain in the management and retention of our nation's scientific data and related information. The committee believes that these deficiencies can best be addressed by a comprehensive federated system, a National Scientific Information Resource (NSIR) Federation, that builds on the successes of the existing groups and helps coordinate them with other data management entities that still need improvement.

There are many reasons why it is now propitious to establish a system of federated data management, with an emphasis on long-term retention. From a policy perspective, it would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society, with the federal government acting as facilitator for such activities. The technology is available to make a fully networked, but highly distributed, system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the significant investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. Moreover, this type of approach is especially timely and important in an era of federal government budget reductions. The committee therefore envisions a broadly networked organization, which would be implemented through the collaboration of the federal government's scientific and technical agencies as well as commercial and noncommercial organizations outside the government, and integrated into the emerging National Information Infrastructure.

Most of the elements of the NSIR Federation are already in place. These include the data centers and field archives run by several of the federal agencies that are among the primary generators and collectors of the nation's scientific data and information. In addition to holding data, these centers and archives have highly skilled staff with the requisite expertise. The organizations are widely distributed, both geographically and by discipline.

The existing data centers and field archives, however, do not approach the federated organizational model for several reasons. There is no unifying organization among the various elements, there is wide disparity in the quality and depth of service provided, and few of them have a charter to preserve data "permanently." Although NARA has the statutory charter to preserve federal records in perpetuity, its current and projected holdings of electronic scientific records are very small. While the committee does not believe that NARA's archives of scientific data should increase substantially, it found little evidence of activity within the scientific and technical agencies that would indicate that their ability to provide for long-term retention and access to their data would improve without some restructuring.

A fundamental precept is that those most familiar with scientific data, the scientists themselves, are in the best position to oversee the management of those data (NRC, 1982). In light of the volume and diversity of scientific data, a distributed approach that maintains the data closest to the primary user community is the most effective method for managing them. As mentioned above, several agencies have adopted an approach of caring for their data in systems of field archives or discipline data centers. Although these agencies have devoted significant attention to the preservation of data, their concern is limited to providing immediate service to primary users of the data for their originally intended purpose. Little thought has been given to the perpetual archiving of the data within most agencies, with the notable exception of NARA and NOAA, which already have a statutory mandate that allows them to preserve data collected by the federal government. Because it is not possible to be sure that any data center will exist in perpetuity, some mechanism must be in place to ensure that the data will be retained by an appropriate organization within or outside the government in the event that the continued existence of a data center is jeopardized.

If a lead agency can be determined for a subject matter, then it should take responsibility for coordination of scientific data on that subject, no matter which agency has physical ownership or custody of those data. The committee recognizes, however, that some data sets are largely of interest at the boundaries of disciplines or agency charters and that consequently these may be more difficult to manage or document properly. Large data sets that are of an interdisciplinary nature cause special problems in this regard. For these complex situations, no simple rule will take the place of negotiations among the involved agencies to make the necessary arrangements for long-term archiving. Indeed, every agency should assume the obligation to keep its holdings of scientific data in usable form, even if the data are not in active use, until agreeing on disposition of those data with NARA or another agency.

In addition to the agency-administered data centers, there are educational or private concerns that hold and administer data important to one or more agencies, such as the archived data from the NOAA Geostationary Operational Environmental Satellites at the University of Wisconsin or the seismic data held by the Incorporated Research Institutions for Seismology. While some of these nonfederal archives are firmly associated with one or more federal agencies through contractual and funding relationships, in other cases a one-to-one association is less clear. It follows that a well-defined chain of responsibility must be established for all data that are to be preserved. This decision should be made by the individuals and institutions most closely associated with and interested in those data, and it should be made with due consideration for cost efficiency, appropriate expertise, scientific interest, and convenience, among other factors. Establishing a clear connection between a field archive and an agency should in no way limit the community of users served by the archive, but should ensure an orderly and secure path of responsibility for the data.

The structure of the nation's scientific and technical organizations continues to change. In some instances, institutions or even agencies will merge, while in other cases, organizations may disappear. When such changes occur, it is likely that the scientific interests formerly represented by those organizations will be subsumed by existing or new agencies or organizations. The general topology of the NSIR Federation, however, would not change.

The committee does not anticipate that the creation and implementation of the Federation will require much additional funding, if any, because it will consist primarily of improving linkages and coordination among existing data centers, archives, and related organizations within a highly decentralized management structure. Moreover, any costs incurred in this process should be more than offset by the improvements in efficiency and access to the data and related information resources.

Recommendations for the Creation of the NSIR Federation

The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:

Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation should provide the foundation for defining a coherent approach to management of the life cycle of scientific data, with the goal of providing broad and effective access to all potential users as cost effectively as possible. The Federation should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The GCDIS, in particular, is an example of a prototype NSIR, focused on data for a specific set of interdisciplinary science problems. The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide access to data in other disciplines.

The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. The NSIR Federation could be created in a manner similar to the creation of the Federal Geographic Data Committee and its National Spatial Data Infrastructure (e.g., through an Office of Management and Budget Circular and Executive Order), or of the Interagency Working Group on Data Management for Global Change and its Global Change Data and Information System (e.g., through legislation in cooperation with the administration). A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to define and inaugurate this initiative, focusing on the most significant issues and problems identified at the end of Chapter 2.

Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities. In addition, NARA should consider concluding interagency agreements to give formal recognition of this process as appropriate. Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small, coordinating executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee, similar to the FGDC, or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate. In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, and the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the central executive function remains fully responsive to all members of the Federation.

Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution may provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise (e.g., the specialized data centers suggested in Chapter 4). This measure might increase the cost to the providing institution, but would decrease the overall cost to the federation, the government, and the taxpayer. An example of this is the method by which backup copies of data might be kept. NARA may have at any given time the most cost-effective "vault" in which to keep physically separate backup copies of data for all agencies, and, hence, the federal government would save money by increasing NARA's budget to provide this service for the other agencies. On the other hand, if cost trade-off studies were to find that a single large "vault" is not as cost-effective as distributed facilities, then each agency would be responsible for its own backup. In all NSIR Federation activities, emphasis should be placed on control of costs, with the most successful methods used by individual members identified and shared with all other members.
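
The backup trade-off in the preceding paragraph comes down to comparing two totals. The sketch below, in Python, shows one way such a comparison might be set up. The function name `compare_backup_plans`, the cost model (a fixed vault cost plus a per-agency increment), and every figure are hypothetical and are not taken from the report or from any cost study.

```python
from typing import Dict, Tuple


def compare_backup_plans(
    own_costs: Dict[str, float],
    vault_fixed: float,
    vault_per_agency: float,
) -> Tuple[str, float, float]:
    """Compare one shared vault against each agency keeping its own backups.

    own_costs maps each agency to what it would spend on its own backup.
    The shared vault is modeled (as an assumption) as a fixed cost plus an
    increment for each agency served.
    """
    distributed_total = sum(own_costs.values())
    central_total = vault_fixed + vault_per_agency * len(own_costs)
    choice = "central" if central_total < distributed_total else "distributed"
    return choice, central_total, distributed_total


# Hypothetical annual costs in arbitrary units.
costs = {"Agency A": 40.0, "Agency B": 55.0, "Agency C": 30.0}
choice, central, distributed = compare_backup_plans(
    costs, vault_fixed=50.0, vault_per_agency=15.0
)
print(choice, central, distributed)
```

With these made-up figures the shared vault costs its host 95 units, more than any single agency would spend alone, yet less than the 125 units the agencies would spend in total, which is the pattern the text describes. Different inputs would favor distributed backups instead.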

The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative. Each participating organization would contribute to the Federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory body consisting of experts from user groups should be formed in support of each initiative.

The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed, as suggested in Chapter 4.

The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards. The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers will seek to increase the credibility of their products.

It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.

Recommendations Specifically For NARA

To help NARA better carry out its responsibilities in the long-term retention of scientific and technical data, the committee recommends that NARA strengthen its liaison with each federal agency that produces such data to ensure that appropriate attention is devoted to long-term data retention in a distributed storage environment.

As shown earlier in this report, NARA cannot today, nor will it likely ever be able to, act as the custodian of most physical science data. The data volume is too great in relation to the funding appropriated to NARA, the NARA staff do not have the necessary specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated at NARA. The agencies closest to the data sets and best equipped to deal with them are themselves already struggling with these issues. However, NARA does have great expertise in issues involving the long-term storage of data and the packaging requirements for data to be of value to future users.

The committee therefore believes that NARA's role should be primarily advisory or consultative, to help ensure that the agencies that are the actual custodians of data at the working level follow all the relevant federal laws and guidelines in taking care of the data. The committee suggests that scientific data and related information should go to NARA's physical possession only as a last resort, when the agency that collected the data can no longer provide access for the user community. As has already been noted, scientific data are best maintained by the agency that originally acquired those data as long as there is any regular active use. The holding agencies should collect, analyze, store, and make available the maximum feasible amount of relevant physical science data, consistent with the principles and goals set forth for the NSIR Federation and with the retention criteria and appraisal guidelines discussed above.

Currently, agencies inform NARA of their intentions for their federal records, including scientific data, through various schedules. All agencies are required to schedule records when they reach 30 years of age, although they are encouraged to do so earlier. The National Climatic Data Center even provides schedules for data that it plans to hold indefinitely, noting that intention. For most types of records, the pressure to schedule provides the useful function of preventing an agency from simply warehousing continually increasing volumes of unused records without examination. For data that an agency does not wish to destroy, but that are not frequently accessed, NARA makes available storage space without taking ownership. If NARA did not provide some worthiness test for records before agreeing to provide storage for another agency, the Federal Records Centers could become inundated with records of little value or potential for future use.

As discussed in this report, we are heading increasingly toward a system of distributed archives for electronic records. Data sets are distributed among various physical locations, and the expertise to interpret these data sets is likewise already distributed and becoming more so. The rapid increase in computer networks within the United States and in the rest of the world is beginning to significantly affect the way people access information. There is a lessening need for data users and providers to physically possess the data they need or distribute, and users are increasingly unaware of the source location(s) of the data they are accessing. NARA therefore should continue to study arrangements regarding the physical custody of electronic records, the relationship between NARA and other agencies, and how these will and should be affected by the expansion of electronic networks.

During the course of this study, the committee found that with the exception of some staff members at government data centers, many government scientists and most nongovernment scientists are not aware of the requirements of the Records Disposal Act (44 U.S.C. 3301 et seq.). Even some of those entrusted with large quantities of valuable data were largely unaware of NARA and its related responsibilities until contacted by the committee, or by its panels. This may be partially because scientists, even those within the federal government, sometimes do not respond to the bureaucratic requirements of their own institutions. The committee is encouraged that NARA is working to address this problem. Nevertheless, many panel visitors and members observed that the NARA brochures have an authoritarian and legalistic tone and are not conducive to establishing productive partnerships with NARA. NARA's future effectiveness in overseeing and advising on the archiving of scientific and technical data requires that it improve its relations with other agencies and institutions.

As a corollary, none of the committee's suggestions should be construed to imply that NARA should issue additional proclamations or regulations. The goal should be to present more carrots than sticks. For example, NARA should consider providing rewards and recognition to researchers, managers, and funders for developing and implementing successful data retention plans, with appropriate metadata. With better communications and greater sensitivity to the needs of the scientific community, NARA can play the role of a "service provider" and "appraisal consultant." For instance, NARA is already working with the DOD Legacy Resource Management Program to identify and preserve cultural resources under DOD jurisdiction. NARA and this DOD program together have sponsored a conference to assist military contractors in preserving their documentary heritage. The committee suggests that NARA pursue other such collaborations in the same spirit of partnership.

As a matter of formal responsibility and training, NARA staff are more concerned with long-term archiving issues than most staff at other agencies. NARA therefore can serve an essential role in reminding agencies of the long-term value of data and should regularly provide advice to agencies that keep scientific data on hand for extended periods of time. NARA also should conduct continuous research on retention and appraisal issues to remain well-informed. The committee recommends that NARA form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections, and related issues.

Unfortunately, NARA has almost no scientific expertise within its ranks (except related to physical records preservation). Despite the large amounts of scientific information within some federal records, NARA officials have indicated that they do not believe that they could keep a scientist on the staff interested in the work and do not plan to hire any permanent scientific personnel. Nevertheless, NARA will continue to be faced with difficult issues involving the archiving of scientific data. In the interim, the committee suggests that NARA should arrange for temporary staff assignments from the active scientific ranks of the federal government on a frequent as-needed basis. Given the great challenges that NARA will face from scientific data and the proven ability of other agencies to hold scientifically trained personnel in data management positions, NARA should rethink its position and consider creating a cadre of permanent staff with scientific expertise.

NARA also might consider setting up an in-house database to track federal holdings, especially to anticipate problems with data sets housed in other agencies that may eventually need NARA protection or other help from NARA. To do this effectively would require establishing a set of contacts in other agencies with people who understand the databases in the agency collections.

This brings us to the need for a more general locator function, or "directory of directories," for the NSIR Federation's network of networks. Archives must not be viewed or managed as data cemeteries, with only rare and dwindling visits after the deposition of data. The provision of broad access to data must be part of archive design and construction, and thus some sort of broad locator is much needed. The committee is encouraged by the recent interagency efforts, organized by the Office of Management and Budget, to develop a Government Information Locator Service. Nevertheless, there is a need for a NARA-maintained directory of archived data within its own system. This should include archived records maintained by other government agencies and federally funded institutions that are recognized as part of a distributed archive system overseen broadly by NARA. The committee recommends that NARA collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.

Finally, with regard to its requirements for accession of data, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and media, rather than mandating narrow and inflexible product standards. The goal would be to meet NARA's basic need to ensure long-term usability while also enabling accession of data, such as images and structures, that cannot be accommodated by NARA's current restrictive file-format and media standards.

Recommendations Specifically For NOAA

As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at many facilities across the country. The primary storage sites are the National Data Centers, which include the National Climatic Data Center (NCDC), the National Oceanographic Data Center (NODC), and the National Geophysical Data Center (NGDC). Each of these data centers now has its own on-line information service. The data centers are accessible through common nodes, for example through NOAA's web server or NASA's Master Directory server. Thus a user who understands the structure of NOAA's data holdings can navigate through the different data centers, look for data of interest in each center's holdings, and retrieve the data over the Internet. However, it is not possible to search NOAA's data holdings with the same precision and accuracy with which one can search for bibliographic data, through, for example, the Current Contents or INSPEC databases. The diversity and volume of data that the National Data Centers hold and regularly receive make it difficult to produce an overall directory for all of NOAA's data holdings. In particular, NCDC receives daily all of the weather information for the United States. Without such a general directory it is difficult for users to query across NOAA archives to locate and integrate diverse data. Moreover, once the user finds data, the variety of storage formats and data types makes access cumbersome. Thus, the committee encourages NOAA to be ambitious. Development of a new comprehensive directory covering all NOAA's holdings of geoscience data would set the standard for other agencies and would make the data much more accessible to the public.

This directory may incorporate capabilities of the many different on-line directory services currently in use at the National Data Centers, but the emphasis should be on connectivity, data access, and information. For this reason, NOAA should concentrate first on the more recent digital data that can most easily be incorporated into such a directory system. Efforts to get older analog data digitized should continue, although some data may have to remain in their original format. An important facet of this directory is to list, along with the directory entry, how to locate and access the data. Once they have located the data of interest, most users want mainly to retrieve the data in a form that they can use for further analysis.

Thus, the directory should specify the actual location of the data, as well as the methods by which the data can be acquired. Under the present NOAA system, acquisition involves a formal ordering procedure and the transfer of funds, at least for any data that must be transferred via tape or hard copy. Experimental NOAA systems (NOAA's Satellite Active Archive) make it possible to order limited satellite imagery over the network at no cost. For those orders requiring the transfer of funds, the directory service should be able to estimate the cost of the data order so that the user can factor cost into the decision to order.
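As a rough illustration of a directory entry that records location, acquisition methods, and an estimated order cost, consider the minimal Python sketch below. It is not a description of any NOAA system: the class `DirectoryEntry`, the function `estimate_cost`, the sample data set, and every fee in `RATES` are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass

# Hypothetical charges per delivery method: (fixed handling fee, fee per megabyte).
# These numbers are invented for illustration and reflect no real price list.
RATES = {
    "network": (0.0, 0.0),
    "tape": (25.0, 0.10),
    "hardcopy": (15.0, 0.50),
}


@dataclass
class DirectoryEntry:
    """One data set record in a hypothetical holdings directory."""
    title: str
    location: str          # center that physically holds the data
    access_methods: tuple  # delivery methods this data set supports
    size_mb: float         # approximate size of a full order


def estimate_cost(entry: DirectoryEntry, method: str) -> float:
    """Return the estimated charge for ordering `entry` via `method`."""
    if method not in entry.access_methods:
        raise ValueError(f"{entry.title!r} is not available via {method!r}")
    fixed_fee, per_mb = RATES[method]
    return round(fixed_fee + per_mb * entry.size_mb, 2)


entry = DirectoryEntry(
    title="Hourly surface observations (sample)",
    location="NCDC",
    access_methods=("network", "tape"),
    size_mb=200.0,
)
print(entry.location, estimate_cost(entry, "tape"))
```

Keeping the cost rule next to the entry lets a user compare delivery methods before placing an order, which is the decision the committee wants the directory to support.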

This interconnected NOAA directory service also would assist the NOAA data centers in their management of data. By having access to tools and techniques developed at other NOAA data centers and elsewhere in the data storage community, the NOAA data centers would be better able to stay abreast of new developments and to incorporate them into their data access systems. Similarities among various earth science data and the emerging need for interdisciplinary research make it necessary to implement such an overall directory for managing NOAA data, for both data location and access. As noted earlier, NOAA already has started to develop data directories, on-line data systems, and data access.

NOAA and NASA have made progress in data rescue and in deriving better products from old data. Since 1990, NCDC has copied thousands of tapes of satellite data that were at the end of their useful shelf life. The NOAA/NASA Pathfinder program was established to make the satellite data more generally available to researchers and to calculate new products; it has been an effective program. Although the committee supports activities to preserve old data, rescued data (including data moved to better media and analog data that have been digitized) are of little value if they cannot be accessed or retrieved. The committee advocates more emphasis on improving access to data for interested users.

Most federal agencies are now aware that storage and retrieval of data are important. Problems arise because each agency, and sometimes even different parts of the same agency, sets up data centers and facilities, and each of these establishes its own type of system. In addition, because the technology for storing data changes frequently, it is difficult if not impossible to decide just what hardware and software system should be used. This uniqueness of systems often hinders system portability and the exchange of data among systems.

There are some approaches and procedures that are designed to be technology-independent and therefore can be used to avoid some of these problems. Moreover, the technological and portability requirements for archiving, storage, and transmission are different, so a "universal" format will not work. An archival format must be utterly portable and self-describing, on the assumption that, apart from the transcription device, neither the software nor the hardware that wrote the data will be available when the data are read. A storage format should be optimized for retrieving any addressable subset of a data set. A secondary, but important, consideration is the ease with which the storage format may be cast into a transmission format. A transmission format should be optimized for ease of conversion to other formats, accommodation of both data and metadata in a single data stream, portability, and extensibility (i.e., accommodating data and metadata types and structures not yet invented). Because both NOAA and NARA have a long-term archival problem, the committee suggests that they work together to locate and test hardware and software units that can be used for this technology-independent approach. By locating the most simple common technologies, it should be possible to set up systems that are sufficiently capable, but yet are able to interact with each other. Once a few of these "standards" are set up and operating, it is likely that other users will want to run this suite of software. Ideally, this type of project would be best carried out under the auspices of the NSIR Federation.
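The idea of a self-describing archival format can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the layout (an `ARCHIVE-SKETCH 1` marker, `FIELD name type units` header lines, an `END-HEADER` line, then comma-separated rows) and the function names `write_archive` and `read_archive` are hypothetical and correspond to no existing standard. The point is that the reader recovers field names, types, and units from the plain-text header alone.

```python
# Converters the reader selects using the type name stored in the header.
_PARSERS = {"int": int, "float": float, "str": str}


def write_archive(fields, rows):
    """Serialize rows as plain text preceded by a header describing each field.

    fields: list of (name, type_name, units); names and type names contain no spaces.
    rows:   list of tuples whose values contain no commas (a limit of this sketch).
    """
    lines = ["ARCHIVE-SKETCH 1"]
    for name, type_name, units in fields:
        lines.append(f"FIELD {name} {type_name} {units}")
    lines.append("END-HEADER")
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


def read_archive(text):
    """Rebuild field descriptions and records using only the text itself."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("ARCHIVE-SKETCH"):
        raise ValueError("missing archive marker")
    fields = []
    i = 1
    while i < len(lines) and lines[i] != "END-HEADER":
        _, name, type_name, units = lines[i].split(" ", 3)
        fields.append((name, type_name, units))
        i += 1
    if i == len(lines):
        raise ValueError("header is not terminated by END-HEADER")
    records = []
    for line in lines[i + 1:]:
        values = line.split(",")
        records.append(
            {name: _PARSERS[type_name](value)
             for (name, type_name, _), value in zip(fields, values)}
        )
    return fields, records


fields = [("station", "str", "none"), ("temp", "float", "degC")]
rows = [("KDCA", 21.5), ("KBOS", 18.0)]
text = write_archive(fields, rows)
recovered_fields, records = read_archive(text)
print(text)
print(records)
```

Because the header is human-readable text, someone holding only the file could write a new reader without the original software, which is the assumption the archival requirement rests on.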


Considering the foregoing discussion, the committee makes the following recommendations:

NOAA should place a higher priority on documenting and establishing directories of its data holdings.

Furthermore, NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview.

Finally, NOAA, as well as every other federal science agency, should ensure that all its data are shared and readily available; it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products; it participates in electronic networks that enable access, sharing, and transfer of data; and it expressly incorporates the long-term view in planning and carrying out its data management responsibilities.

The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.




Appendix A
List of Acronyms

CD-ROM    Compact Disk-Read Only Memory
CENR      Committee on Environment and Natural Resources
DMC       Data Management Center
DOD       Department of Defense
DOE       Department of Energy
EROS      Earth Resources Observing System
ESDM      Earth Science Data Management
FGDC      Federal Geographic Data Committee
FITS      Flexible Image Transport System
GARP      Global Atmospheric Research Program
GCDIS     Global Change Data and Information System
GCRP      Global Change Research Program
GILS      Government Information Locator Service
HTML      HyperText Markup Language
IRIS      Incorporated Research Institutions for Seismology
IWGDMGC   Interagency Working Group on Data Management for Global Change
JANAF     Joint Army-Navy-Air Force
JCL       Joint Control Language
NARA      National Archives and Records Administration
NCDC      National Climatic Data Center
NGDC      National Geophysical Data Center
NII       National Information Infrastructure
NOAA      National Oceanic and Atmospheric Administration
NODC      National Oceanographic Data Center
NRC       National Research Council
NSDI      National Spatial Data Infrastructure
NSF       National Science Foundation
NSIR      National Scientific Information Resource
NSSDC     National Space Science Data Center
OMB       Office of Management and Budget
PDS       Planetary Data System
PO.DAAC   Physical Oceanography Distributed Active Archive Center
SGML      Standard Generalized Markup Language
TCP-IP    Transmission Control Protocol-Internet Protocol
USGS      United States Geological Survey
USNRC     United States Nuclear Regulatory Commission
WWSSN     World-Wide Standardized Seismographic Network


Appendix B
Minority Opinion

This report has a wealth of good material in it, but I feel that I must write a minority opinion on one main issue, the committee's recommendation to create the NSIR Federation. I think that the exact functions of the NSIR Federation are still not clear enough to immediately form it, especially since mechanisms to coordinate data activities already exist.

A group such as the NSIR Federation would not be a good method to set the hardware standards that are used in data systems (networks, tapes, etc.). The coordinated part of data directory efforts can be built around present interagency work. It is reasonable that NARA should request lists of data sets intended for long-term archival, but most of the process of evaluating data sets needs to be kept close to the working level. The discussion of standardization in the report should not be interpreted to mean that all agencies and archives should be forced to adopt certain standards and rework their data holdings into a common form and format. There are other concerns for which an analysis of the issues could be useful, but I believe that the NSIR Federation requires a better description of tasks and more debate before such a new body is established. Otherwise we may have more coordination, more systems, more cost, and less data.

Consider the important task of developing information about data. Information about data sets is needed in at least two or three levels of detail. At the highest level of information, the Master Directory methods that are in place for the GCDIS can be adopted (or even simplified more) to describe the data sets. This interagency Directory Interchange Format (DIF) is used nationally and internationally. We need to keep it simple enough so that people will submit the information. Some agency-level catalog efforts for data sets have existed since about 1968, and became more serious in the late 1970s. We should build on the GCDIS catalog efforts, and certainly not invent more complicated systems. Other data information efforts are needed, but they will be based on a bottom-up flow of ideas, on workshops, and the like. Each data system does not have to do exactly the same thing, but they must be easy to use. It is not clear that a formal NSIR Federation is needed to coordinate this.

How does the NSIR Federation relate to other data coordinating mechanisms? The Interagency Working Group on Data Management for Global Change (IWGDMGC) meets regularly to help coordinate data issues across many "global change" disciplines, which include air, water, ice, rocks, soils, and some biology. It seems to me that the IWGDMGC and the proposed NSIR Federation are mainly trying to do the same thing. They cover much of the same turf in terms of disciplines. They both want information about data, access to data, and data that will exist for more than 20 years. If we create separate organizations doing roughly the same thing, then it becomes even less likely that key agency

