HathiTrust is a Solution

The Foundations of a Disaster Recovery Plan for the Shared Digital Repository

This report serves as recommendations made by Michael J. Shallcross, 2009 Digital Preservation Intern, University of Michigan School of Information

Executive Summary

This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust’s Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.

The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:

o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Manmade as well as natural disasters

As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details solutions HathiTrust has enacted through direct quotations from the HathiTrust website and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.

The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.) and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.

o Short-term goals include:
   - Describing the nature and extent of HathiTrust’s insurance coverage
   - Testing and validation of current tape backup procedures
   - Improved physical and intellectual control over system hardware
   - Establishment, distribution, and maintenance of phone trees
   - Increased documentation of institutional knowledge
   - Identification of Disaster Recovery measures in place at the Indianapolis site

o Intermediate-term objectives focus on:
   - Creation of a Disaster Recovery Planning Committee
   - Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high-level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)

o Long-term action items deal with:
   - Completion and implementation of the suite of Disaster Recovery documents
   - Initiation of staff training and tests of organizational compliance
   - Storage of an additional copy of backup tapes at a remote third location
   - Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
   - Consideration of a third instance of the repository
   - Avoidance of vendor lock-in if a key supplier should go out of business

This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

Acknowledgements

The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger-Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.

Table of Contents

• Executive Summary
• Acknowledgements
• Introduction
   o Goals for HathiTrust’s Disaster Recovery Program
   o The Mandate for Disaster Recovery Planning in Digital Preservation
   o Disaster Preparedness in the Design and Operation of HathiTrust
   o Essential HathiTrust Business Functions
• HathiTrust’s Disaster Recovery Strategies
   o Basic Requirements for Disaster Recovery
   o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
   o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
• Scenario 1: Hardware Failure or Obsolescence and Data Loss
   o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
   o HathiTrust’s Solutions for Hardware Failure and Data Loss
   o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
   o Key Features of HathiTrust’s Isilon IQ Clustered Storage
   o Hardware Support and Service
   o Equipment Tracking
   o Hardware Replacement Schedule
   o Timeline for Emergency Replacement of HathiTrust Infrastructure
   o HathiTrust and Insurance Coverage at the University of Michigan
• Scenario 2: Network Configuration Errors
   o Review: Risks Involving Network Configuration Errors
   o HathiTrust’s Solutions for Network Configuration Errors
   o Extent of ITCom Support
   o ITCom Responsibilities
   o ITCom Services in Response to Outages or Degradation Impacting the Network
   o HathiTrust Responsibilities
• Scenario 3: Network Security and External Attacks
   o Review: Risks Involving Network Security and External Attacks
   o HathiTrust’s Solutions for Network Security
• Scenario 4: Format Obsolescence
   o Review: Risks Involving Format Obsolescence
   o HathiTrust’s Solutions for Format Obsolescence
   o Selection of File Formats
   o Format Migration Policies and Activities
• Scenario 5: Core Utility and/or Building Failure
   o Review: Risks Involving Core Utility or Building Failure
   o HathiTrust’s Solutions for Utility or Building Failure
   o General Maintenance and Repairs in University of Michigan Facilities
   o The Michigan Academic Computing Center (MACC)
   o Arbor Lakes Data Facility (ALDF)
• Scenario 6: Software Failure or Obsolescence
   o Review: Risks Involving Software Failure or Obsolescence
   o HathiTrust’s Solutions for Software Issues
• Scenario 7: Operator Error
   o Review: Risks Involving Operator Error
   o HathiTrust’s Solutions for Operator Error
   o Ingest
   o Archival Storage
   o Dissemination
   o Data Management
• Scenario 8: Physical Security Breach
   o Review: Risks Involving a Physical Security Breach
   o HathiTrust’s Solutions for Physical Security
   o Security at the MACC
   o Security at the ALDF
• Scenario 9: Natural or Manmade Disaster
   o Review: Risks Involving a Natural or Manmade Disaster
   o HathiTrust’s Solutions for Natural or Manmade Catastrophic Events
   o Basic Disaster Recovery Strategies
• Scenario 10: Media Failure or Obsolescence
   o Review: Risks Involving Media Failure or Obsolescence
   o HathiTrust’s Solutions for Media Failure
   o Remaining Vulnerabilities
• Conclusions and Action Items
   o Conclusions
   o Short-Term Action Items
   o Intermediate-Term Action Items
   o Long-Term Action Items
• APPENDIX A: Contact Information for Important HathiTrust Resources
• APPENDIX B: HathiTrust Outages from March 2008 through April 2009
• APPENDIX C: Washtenaw County Hazard Ranking List
• APPENDIX D: Annotated Guide to Disaster Recovery Planning References
• APPENDIX E: Overview of the Disaster Recovery Planning Process
• APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
• APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
• APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
• APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)

**Appendices F–I are embedded PDF files.**

2009‐08‐24 1

Introduction

In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a ‘disaster’ when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.

So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is ‘done’; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.

• Goals for HathiTrust’s Disaster Recovery Program

While a more formal statement of HathiTrust’s goals and requirements for its Disaster Recovery Program must be elucidated, the repository’s mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim to “contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge,” HathiTrust seeks “to help preserve these important human records by creating reliable and accessible electronic representations.” [1] This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects will retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.

[1] HathiTrust. “Mission & Goals” (2009) retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.

• The Mandate for Disaster Recovery Planning in Digital Preservation

HathiTrust’s mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The “Institutional Data Resource Management Policy” (2008) of the University of Michigan’s Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that data resources “be safeguarded [and] protected” and “contingency plans […] be developed and implemented.” [2] In its discussion of the latter point, the policy specifies that:

    Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data […] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis. [3]

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust’s operation, the repository now looks to formalize the other strategies suggested by the “Institutional Data Resource Management Policy.” Beyond the example laid out by this document, HathiTrust’s mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System (OAIS) identifies Disaster Recovery as an essential component of its “Archival Storage” function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive’s holdings. As outlined in the OAIS document, “the Disaster Recovery function provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility.” [4] HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) is even more explicit in its requirement that repositories document their policies and procedures with “suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).” [5] Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust’s development of a formal Disaster Recovery Plan.

• Disaster Preparedness in the Design and Operation of HathiTrust

One of the primary goals of HathiTrust is to provide “transparency in all of its operations, including its work to comply with digital preservation standards and review processes.” [6] Nowhere is this commitment more clear than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist, [7] this document serves two purposes. First, it provides an overview of the policies, procedures, resources and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust website (http://www.hathitrust.org), the most recent version of HathiTrust’s review of its compliance with the minimum required elements of the TRAC Criteria and Checklist, [8] and relevant literature provided by key vendors and service providers. [9] Second, this report examines HathiTrust’s current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place with regard to “specific types of disasters” that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and manmade as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository’s response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust. [10]

[2] University of Michigan. “Institutional Data Resource Management Policy” (2008) Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
[3] Ibid.
[4] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
[5] OCLC and CRL. “Section C3.4” Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[6] HathiTrust. “Accountability” (2009) retrieved from http://www.hathitrust.org/accountability on 25 June 2009.

• Essential HathiTrust Business Functions

As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust’s Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository’s functions. By directing planning efforts toward specific functions (rather than the organization’s activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussion of Disaster Recovery strategies and risk management solutions in this report is presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority. [11]

[7] “Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.” OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[8] HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
[9] Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
[10] A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated List of Disaster Recovery Planning Resources).
[11] This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.

o Ingest
   - Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
   - Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)

o Archival Storage
   - Preserve indefinitely digital objects and metadata (AIPs) in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
   - Record changes to and actions on items while they are in the repository
   - Maintain a persistent object address for items within the repository

o Dissemination
   - Provide access to digital objects for users
   - Allow for text searches through a variety of fields
   - Enable large scale full-text searches
   - Permit the creation of public and private content collections
   - Disseminate digital objects (DIPs) to users (via the page-turner access system and data API)
   - Distribute datasets and HathiTrust APIs to developers
   - Research and develop additional applications and resources for HathiTrust

o Administration
   - Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
   - Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees

o Data Management
   - Update and manage the Rights and GeoIP databases
   - Build and maintain Collection Builder and Large Scale Search Solr indexes
   - Determine appropriate user access to texts via database queries
   - Sync content with the Indianapolis site and back up content to tape


HathiTrust’s Disaster Recovery Strategies

• Basic Requirements for Disaster Recovery

Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (i.e., RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site. [12] HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups and the replication of data to a fully operational mirror site located at Indiana University in Indianapolis with the same levels of power and environmental conditioning provide multiple copies as well as geographic distribution of content.

o “HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility). Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss.” [13]

Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).

• Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites

HathiTrust's first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, both sites possess two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 ‘nodes,’ servers composed of Central Processing Units as well as storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, “We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy.” [14] However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository’s ability to acquire new content.

[12] Tennant, Roy. “Digital Libraries: Coping with Disasters.” Library Journal, 15 November 2009. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
[13] HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.
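The division of labor between the two sites can be illustrated with a small model. The following is a minimal sketch, not HathiTrust's actual failover logic: the `Site` class, the `sites_able_to_serve` function, and the "access"/"ingest" request kinds are hypothetical names introduced only to show that access survives the loss of either site while ingest depends on Ann Arbor.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One repository instance (hypothetical model, not HathiTrust code)."""
    name: str
    healthy: bool = True
    supports_ingest: bool = False


def sites_able_to_serve(kind: str, sites: list[Site]) -> list[str]:
    """Return the names of healthy sites that can handle a request.

    kind is "access" (any healthy site) or "ingest" (healthy ingest sites only).
    """
    candidates = [s for s in sites if s.healthy]
    if kind == "ingest":
        candidates = [s for s in candidates if s.supports_ingest]
    return [s.name for s in candidates]


# Per the report, only Ann Arbor performs ingest; both sites serve users.
ann_arbor = Site("Ann Arbor", supports_ingest=True)
indianapolis = Site("Indianapolis")
sites = [ann_arbor, indianapolis]

print(sites_able_to_serve("access", sites))  # both sites share the traffic

ann_arbor.healthy = False                    # simulate a single-site outage
print(sites_able_to_serve("access", sites))  # the mirror site carries on alone
print(sites_able_to_serve("ingest", sites))  # empty: ingest is unavailable
```

The empty result for ingest during an Ann Arbor outage corresponds to the exception noted above: access is redundant across sites, but the acquisition of new content is not.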

HathiTrust utilizes Isilon Systems’ SyncIQ Application Software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis runs on 24 separate subsets of the data, with one subset starting every two hours, except on Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours). [15]
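The three-day figure can be checked with a short calculation. The sketch below rests on one reading of the schedule, which is an assumption rather than a documented configuration: a new subset starts every two hours, the rotation pauses on Sundays, and it resumes with the next subset on Monday. The function names and the chosen start date (Monday, 13 July 2009) are illustrative only.

```python
from datetime import datetime, timedelta

SUBSETS = 24                # number of data subsets named in the report
STEP = timedelta(hours=2)   # one subset starts every two hours


def sync_starts(week_start: datetime, weeks: int = 2) -> dict[int, list[datetime]]:
    """List the start times of each subset; week_start is a Monday at midnight.

    Assumption: Sunday slots are skipped and the rotation resumes where it
    left off, so the subset counter advances only on slots that actually run.
    """
    starts: dict[int, list[datetime]] = {n: [] for n in range(1, SUBSETS + 1)}
    slot, subset = week_start, 1
    end = week_start + timedelta(weeks=weeks)
    while slot < end:
        if slot.weekday() != 6:          # 6 is Sunday in datetime.weekday()
            starts[subset].append(slot)
            subset = subset % SUBSETS + 1
        slot += STEP
    return starts


def longest_gap(times: list[datetime]) -> timedelta:
    """Longest interval between two consecutive runs of the same subset."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))


starts = sync_starts(datetime(2009, 7, 13))   # a Monday
worst = max(longest_gap(times) for times in starts.values())
print(worst)   # the longest wait between two runs of any one subset
```

Under these assumptions each subset normally repeats every 48 hours, and the Sunday pause stretches one interval per week to 72 hours, which agrees with the stated maximum of three days before the run time of the sync itself is added.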

o “SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location.” [16]

o “All nodes [… in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real time, without impacting users reading and writing to the system.” [17]

o “A robust wizard-driven web-based interface is fully integrated into [… Isilon’s proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization.” [18]

o “Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used.” [19]

o “In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation.” [20]

o “Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts.” [21]
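Two of the behaviors quoted above, copying only changed files and resuming after an interruption, can be illustrated generically. The following is a minimal sketch and makes no claim about how SyncIQ is implemented: the functions, the checksum dictionaries that stand in for the two clusters, and the `copy` callback are all hypothetical.

```python
from typing import Callable


def changed_files(source: dict[str, str], target: dict[str, str]) -> list[str]:
    """Paths that are missing on the target or whose checksum differs."""
    return sorted(path for path, digest in source.items()
                  if target.get(path) != digest)


def replicate(source: dict[str, str], target: dict[str, str],
              copy: Callable[[str], None]) -> list[str]:
    """Copy changed files one at a time and return the paths copied.

    The target record is updated after each successful copy, so a job that
    is interrupted and run again skips what has already been transferred.
    """
    copied = []
    for path in changed_files(source, target):
        copy(path)                     # may raise if the network drops
        target[path] = source[path]    # record success for later restarts
        copied.append(path)
    return copied


# Illustrative data: path -> checksum on each cluster.
primary = {"a": "v2", "b": "v1", "c": "v1"}
mirror = {"a": "v1"}                   # "a" is stale; "b" and "c" are missing

failures = {"b"}                       # simulate one interruption at "b"


def flaky_copy(path: str) -> None:
    if path in failures:
        failures.discard(path)
        raise ConnectionError(f"network interruption while copying {path}")


try:
    replicate(primary, mirror, flaky_copy)
except ConnectionError as err:
    print(err)                         # the first attempt stops at "b"

print(replicate(primary, mirror, flaky_copy))   # the restart copies the rest
```

Recording each successful copy on the target is what lets the second run pick up at the point of failure rather than from the beginning.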

[14] HathiTrust. “Update on May 2009 Activities” (2009) retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
[15] Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
[16] “Backup and Recovery With Isilon IQ Clustered Storage,” 2007, p. 11.
[17] Ibid.
[18] Ibid.
[19] Ibid.
[20] Ibid.
[21] Ibid.
[22] Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).

• Disaster Recovery Strategy #2: Nightly Automated Tape Backups

HathiTrust’s ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan’s ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement [22] outlines the obligations and responsibilities of both the service provider and HathiTrust:

o “The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups.” [23]

o “ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers.” (TSM Backup Service SLA, sec. 4.1)

o “ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target up-time is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS.” (sec. 4.2)

o “In an emergency, [email protected] (this will go to the on-call staff’s pager in real time).” (sec. 4.6)

o “ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS.” (sec. 4.9)

o “The service […] includes data compression, data encryptions, and data replication.” (sec. 1.0)

o “ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr.” (sec. 4.10)

o “Both facilities are secure, climate controlled sites designed and built for high available production services.” [24]

o “In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we’ll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11)

o “Disaster Recovery planning is the responsibility of the customer unit.” (sec. 5.8)

Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.
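For planning purposes it can help to translate the 99.9% up-time target quoted from the TSM Service Level Agreement into an allowance of hours. The following is a minimal sketch of that arithmetic; the function name and the choice of a 365-day year are assumptions, and the SLA itself does not state over what period the target is measured.

```python
HOURS_PER_YEAR = 365 * 24   # assumes a non-leap year


def downtime_budget_hours(target_uptime: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of outage permitted in a period at the given up-time fraction."""
    return (1.0 - target_uptime) * period_hours


yearly = downtime_budget_hours(0.999)
monthly = downtime_budget_hours(0.999, period_hours=30 * 24)
print(f"{yearly:.2f} hours per year, {monthly * 60:.0f} minutes per 30-day month")
```

On these assumptions a 99.9% target permits a little under nine hours of backup-service outage per year, a figure that can be compared against the maximum allowable outage periods HathiTrust eventually defines for its own functions.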

[23] IBM. “IBM Tivoli Storage Manager: Features and Benefits” (2009) retrieved from http://www-01.ibm.com/software/tivoli/products/storage-mgr/features.html?S_CMP=rnav on 16 June 2009.
[24] Information Technology Central Services at the University of Michigan. “Frequently Asked Questions about the TSM Backup Service” (2009) retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.

Scenario 1: Hardware Failure or Obsolescence and Data Loss

• Review: Risks Involving Hardware Failure or Obsolescence and Data Loss

The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or as a result of external events that include physical security breaches and natural or manmade disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.

Severity: High Impact
Event: Loss at a single point of failure
   • An additional failure past tolerances when only one site is operational
   • Service is unavailable and cannot be restored until component is repaired/restored

Severity: Moderate Impact
Event: Failure of a component past redundancy tolerance
   • System no longer has redundancy: additional loss or failure of components will result in loss of system. This is a particular problem if one site is already down.
   • Loss of db server (home of Rights db) or of both Web servers at a site will render that location inaccessible
   • Loss of four drives or nodes in either Isilon storage cluster will result in the loss of that instance. The cluster will be offline and unable to handle read or write requests; all traffic would have to be handled by the remaining site.
   • Loss of UM Arbor Lakes site would prevent performance of tape backups
   • Loss of UM MACC site would deprive IU site of data redundancy
   • Loss of ingest servers would prevent new content from entering repository

Severity: Low Impact
Event: Failure of redundant system components
   • Includes redundant components within each site as well as general redundancy between the IU and UM sites
      o HT infrastructure has been designed to avoid single points of failure and to ensure data and equipment redundancy
      o Service continues in an uninterrupted and transparent manner

• HathiTrust’s Solutions for Hardware Failure and Data Loss

The threats faced by HathiTrust’s hardware (and associated applications as well as the data stored therein) are comprised of the failure of redundant features, failure that exceeds components’ tolerance for redundancy, and single points of failure. While the failure of redundant components may happen more frequently (e.g., the loss of an individual drive within the Isilon IQ cluster), such losses do not have a large impact on the repository; events which compromise single points of failure will have much greater consequences for the continuity of HathiTrust operations. At the same time, while a component may have redundancy on one level (for example, there are five servers dedicated to ingest), that component simultaneously may be considered at a higher level to be a single point of failure (e.g., because the ingest servers are housed in a single chassis, the entire unit is vulnerable to an event such as a fire). This duality highlights the need for vigilance and foresight in managing the repository’s infrastructure.

Because HathiTrust relies heavily upon hardware to fulfill its mission and deliver services to its designated community of users, the selection of equipment and development of system architecture has aimed at minimizing the dangers posed by single points of failure through the introduction of strategic redundancies. The basic means for avoiding the disastrous effects of hardware failure or data loss have been the establishment of the Indianapolis mirror site and the nightly backup of content to tape. (For more detail, please refer to the preceding section.) While these strategies account for extraordinary events, HathiTrust’s server replacement schedule allows the repository to anticipate the results of normal equipment use and depreciation. Steps to safeguard the long-term functionality of HathiTrust have therefore been complemented by a consideration of best practices for disaster preparedness.

• Redundant Components and Single Points of Failure in the HathiTrust Infrastructure

The following sections provide a general outline of HathiTrust’s redundant components and single points of failure. Given the complexity of the repository’s infrastructure, unknown or unanticipated scenarios may exist; future Disaster Recovery Planning will thus involve a periodic review of key features and vulnerabilities.

o Site Redundancy: The establishment of the mirror site in Indiana provides HathiTrust with a fully redundant operation. Because both instances provide full access to content in addition to other repository functions, users will not experience a loss or degradation of service in the event that service is lost from one site. Key exceptions to HathiTrust’s site redundancy are noted below.

o Redundant Components at Each Site: The following components provide each site with a tolerance under which limited failures will not disrupt major HathiTrust functions and user services.

   Web servers: each site has two servers so that if one fails, the other may continue to handle traffic. These also host the GeoIP database.

   Isilon IQ clusters: the current configuration of 21 nodes features N+3 parity protection; this data redundancy permits the simultaneous failure of 3 drives on separate nodes or the loss of three entire nodes without service degradation.

   Ingest servers: the Ann Arbor site possesses five servers so that ingest may continue (albeit at a slower rate) in the event of any failures.

   Large Scale Search (LSS) Solr index: currently housed on the web servers, but will soon be maintained on five new servers in Ann Arbor.

o Single Points of Failure:25 These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (e.g., if one web server at a site has already been lost).

   Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.

      • MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index
      • Server network switches
      • Outbound network switches

   Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.

      • Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster’s failure tolerance and result in a service disruption.
      • Web servers: should one fail, the remaining server will be a single point of failure.
      • Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.
      • LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.
      • Mirlyn database and Mirlyn2 Solr index26: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.

25 Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).

• Key Features of HathiTrust’s Isilon IQ Clustered Storage

The Isilon IQ storage cluster stores and provides digital objects for HathiTrust’s partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites considerable tolerance with regard to the failure of various aspects of the storage units. As one example, Isilon’s proprietary OneFS operating system permits the individual storage nodes (the individual servers that are the building blocks of the cluster) to function as ‘coherent peers’ so that any one node ‘knows’ everything contained on the other units in the cluster.

o “Isilon's OneFS operating system […] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage.”27

o “Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster.”28

o “A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster.”29

26 Mirlyn is the name of the University of Michigan’s current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM’s recently implemented next generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
27 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
28 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7. “In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. […] if one drive fails and the system crashes, the data can be restored by using the other drives in the array.” (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
29 Isilon Systems. “Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers” (2008) p. 8


HathiTrust’s Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.

o “Traditional RAID-5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures.”30

o “Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block.”31
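The N+3 tolerance described above can be expressed as a simple check. The following is only an illustrative sketch: `ClusterState` and its fields are hypothetical names rather than anything provided by OneFS, and it assumes that failed drives (on separate nodes) and failed nodes draw on the same budget of three, which the sources quoted here do not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical model of one storage cluster's failure state."""

    total_nodes: int = 21     # node count reported for the current configuration
    parity_level: int = 3     # the "3" in N+3 parity protection
    failed_nodes: int = 0     # entire nodes lost
    failed_drives: int = 0    # drives lost on separate, otherwise healthy nodes

    def concurrent_failures(self) -> int:
        # Assumption: drive and node failures count against the same tolerance.
        return self.failed_nodes + self.failed_drives

    def data_available(self) -> bool:
        """True while the failures stay within the parity tolerance."""
        return self.concurrent_failures() <= self.parity_level

    def remaining_tolerance(self) -> int:
        """How many further failures the cluster could absorb."""
        return max(self.parity_level - self.concurrent_failures(), 0)


healthy = ClusterState()
degraded = ClusterState(failed_drives=2, failed_nodes=1)
offline = ClusterState(failed_nodes=4)

print(healthy.remaining_tolerance())   # 3
print(degraded.data_available())       # True, but no tolerance remains
print(offline.data_available())        # False: a fourth loss exceeds N+3
```

A cluster in the `degraded` state still serves data but has become a single point of failure in the sense used earlier in this section, since one more loss would take the instance offline.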

The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.

The Isilon “restriper” is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime.32 33

o FlexProtect repairs data (e.g., in the event of a drive loss) using parity.

   “Isilon OneFS with FlexProtect can boast the industry leading Mean Time to Data Loss (MTTDL) for petabyte clusters.”34

   “FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components.”35

o AutoBalance “rebalances the data in a cluster according to business rules, in real time, non-disruptively.”36

   “As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect back-end switch, re-balancing all of the content across all nodes in the cluster and maximizing utilization.”37

30 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
31 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7
32 Isilon X-Series Specifications (product brochure)
33 Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
34 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 4
35 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
36 McFarland, Anne. “Isilon Accelerates Delivery of Digital Content” The Clipper Group Navigator (2003).
37 Isilon Systems. “The Clustered Storage Revolution” (2008) p. 13


o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.

o MediaScan verifies disk sectors.

   The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.

   MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.

o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.

Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node’s failure.

o “OneFS is a fully-journaled file system with large amounts of battery-backed non-volatile random access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation.”38

o “The Isilon SmartConnect module […ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. […] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly fail over across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization.”39

• Hardware Support and Service

HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is found in the “Platinum” support provided by Isilon Systems, which includes:

o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts - Email Home (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements).40

38 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 9
39 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 6

• Equipment Tracking

LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server’s name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.

• Hardware Replacement Schedule

o “HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates” (HT TRAC C1.7)

o “HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions.” (HT TRAC C1.10)

• Timeline for Emergency Replacement of HathiTrust Infrastructure

Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.

o Submission of Purchase Orders:

   For orders under $5,000, the M-Pathways application allows the University Library’s business manager to send purchase orders directly to vendors.

   For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.

o Delivery of Equipment:

   Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.

   Items that need to be configured (such as servers) usually take 1-2 weeks.

   Isilon storage will take 3 weeks to be delivered in a worst case scenario.

o Installation:

   3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.

40 Isilon Systems. “Support Advantage Offerings” (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.


o Data Restoration: about 0.5 TB/hour (15 days, as of June 2009)41

   While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).

   The length of time required for a ‘bare-metal restoration’ will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, et cetera.

   If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.

   In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that “ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11) This determination will reflect the University of Michigan’s organizational priorities42:

      • Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
      • Priority 2: Delivery of health care and hospital patient services
      • Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
      • Priority 4: Delivery of teaching/learning processes and services
      • Priority 5: Security and preservation of University facilities/equipment.
      • Priority 6: Maintenance of community/University partnerships.

o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.

o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
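The arithmetic behind the restoration estimate and the “month and a half” figure can be checked with a short calculation. This is a rough sketch under stated assumptions: the steps are treated as strictly sequential, weeks are counted as seven calendar days, only the worst-case Isilon figures are used, and installation of the other servers, switches, and PDUs is left out. The function and variable names are illustrative only.

```python
HOURS_PER_DAY = 24


def restore_days(tape_tb: float, rate_tb_per_hour: float) -> float:
    """Days of continuous restoration for a given volume and throughput."""
    return tape_tb / rate_tb_per_hour / HOURS_PER_DAY


TAPE_VOLUME_TB = 176  # data held on the backup tapes (June 2009)

current = restore_days(TAPE_VOLUME_TB, 0.5)    # present throughput
upgraded = restore_days(TAPE_VOLUME_TB, 1.0)   # with an additional tape drive

# Worst-case duration of each step, in days, taken from the timeline above.
worst_case_days = {
    "purchase order approval": 7,    # "up to a week"
    "Isilon storage delivery": 21,   # 3 weeks in a worst case scenario
    "Isilon installation": 3,        # 3 days FTE
    "data restoration": 15,          # figure quoted in the report
}
total_days = sum(worst_case_days.values())

print(f"Restore at 0.5 TB/hour: {current:.1f} days")   # 14.7 days
print(f"Restore at 1.0 TB/hour: {upgraded:.1f} days")  # 7.3 days
print(f"Sequential worst case: {total_days} days")     # 46 days
```

Under these assumptions the total of 46 days is consistent with the estimate of at least a month and a half, and data restoration accounts for roughly a third of it.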

• HathiTrust and Insurance Coverage at the University of Michigan

The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University’s property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan’s insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.

41 Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
42 University of Michigan Administrative Information Services. “Emergency Management, Business Continuity, and Disaster Recovery Planning” (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.


Scenario 2: Network Configuration Errors

• Review: Risks Involving Network Configuration Errors

The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM’s Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.

Severity: High Impact
• Loss of server network switch or outbound network switch
• Loss of access to UMnet Backbone

Severity: Moderate Impact
• Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.

Severity: Low Impact
• Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
   o The library remains (for now) a priority recipient of electricity from the UM power plant
   o Campus data centers have UPSs and redundant backup power
• Failure of local/server-side connections
   o Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
   o If connections fail at one HT site, traffic can be handled by remaining site.

• HathiTrust’s Solutions for Network Configuration Errors

HathiTrust’s continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM’s ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement.43

• Extent of ITCom Support

o “ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPS’s), firewalls, and other identified and agreed upon components.” (ITCS sec. 1.0)

43 Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).

• ITCom Responsibilities

o “Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure.” (sec. 5.2)

o “Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement.” (sec. 5.3)

o “Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Units network up to and including the extension to the first hub or switch.” (sec. 5.6)

o “Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement.” (sec. 5.7)

o “Provide maintenance on the station cabling as installed by ITCom, or an approved U-M vendor which met ITCom installation specifications.” (sec. 5.8)

o “Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly.” (sec. 5.9)

• ITCom Services in Response to Outages or Degradation Impacting the Network

o “A response within 30 minutes of the ITCom NOC notification or the Unit’s call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem.” (sec. 7.2.1)

o “An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if on site and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage.” (sec. 7.2.1)

o “If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours.” (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)

o Conduct monitoring via SNMP polling at one minute intervals. (sec. 7.2.1)
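The response windows quoted above combine into concrete deadlines once a notification time is known. The sketch below is illustrative only: the function name, the returned keys, and the example timestamp are hypothetical, and it assumes that the two-hourly updates are counted from the latest permitted on-site arrival, which the agreement excerpt does not specify.

```python
from datetime import datetime, timedelta

INITIAL_RESPONSE = timedelta(minutes=30)  # response after NOC notification or call
ONSITE_WINDOW = timedelta(hours=2)        # on-site visit after that response
UPDATE_INTERVAL = timedelta(hours=2)      # status updates during an outage


def response_deadlines(notified_at: datetime, updates: int = 3) -> dict:
    """Latest permitted times for each step of the outage response."""
    respond_by = notified_at + INITIAL_RESPONSE
    onsite_by = respond_by + ONSITE_WINDOW
    # Assumption: updates are counted from the latest on-site deadline.
    update_times = [onsite_by + UPDATE_INTERVAL * n for n in range(1, updates + 1)]
    return {"respond_by": respond_by, "onsite_by": onsite_by, "updates": update_times}


notified = datetime(2009, 6, 23, 9, 0)  # example notification time
deadlines = response_deadlines(notified)

print(deadlines["respond_by"])            # 2009-06-23 09:30:00
print(deadlines["onsite_by"] - notified)  # 2:30:00, the stated maximum
```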

• HathiTrust Responsibilities

ITCom’s responsibilities end at the first network switch; from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10GB Infiniband ports for internal (i.e., intra-cluster) communication and dual 1GB Ethernet for external communication.

Scenario 3: Network Security and External Attacks


• Review: Risks Involving Network Security and External Attacks

The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.

Severity: High Impact
• Unauthorized access to HathiTrust content leads to the infringement of copyrights.
• Loss of data or functionality for an extended period of time as a result of malicious activity.

Severity: Moderate Impact
• HathiTrust services are temporarily unavailable as a result of malicious activity.

Severity: Low Impact
• The delivery of HathiTrust services slows as the result of malicious activity.
• A security weakness exists within the system but remains unexploited.

• HathiTrust’s Solutions for Network Security

Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.

o “HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week.” (HT TRAC C1.10)


Scenario 4: Format Obsolescence

• Review: Risks Involving Format Obsolescence

The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.

Severity: High Impact
• Applications and hardware are no longer able to read or display digital objects.
• Errors in translating and reading files are not understood or acknowledged by repository users.

Severity: Moderate Impact
• Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.

Severity: Low Impact
• Formats and associated applications change but retain compatibility with older versions of the file formats.

• HathiTrust’s Solutions for Format Obsolescence

An awareness and acknowledgement of the dangers of format obsolescence has led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository’s content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.

• Selection of File Formats

o “HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007.”44 (HT TRAC B1.1)

o “HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to “presentation” formats using many of the widely available software tools associated with HathiTrust’s preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration.”45

o “Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML).” (HT TRAC B4.2)

44 Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf

• Format Migration Policies and Activities

o “HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change.” (HT TRAC B1.1)

o “HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow.” (HT TRAC C1.7)

o “[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs.” (HT TRAC B4.2)
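The checks described in the second quotation (checksum validation plus a total file count after a large transfer) might look roughly like the sketch below. It is not HathiTrust’s actual tooling: the function names are hypothetical, and SHA-256 is an assumption, since the sources quoted here do not name the checksum algorithm in use.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_transfer(source: Path, target: Path) -> dict:
    """Compare file counts and checksums between source and target trees."""
    src, dst = manifest(source), manifest(target)
    shared = src.keys() & dst.keys()
    return {
        "file_count_matches": len(src) == len(dst),
        "missing": sorted(set(src) - set(dst)),
        "mismatched": sorted(name for name in shared if src[name] != dst[name]),
    }
```

Calling `verify_transfer(Path("/old/storage"), Path("/new/storage"))` (placeholder paths) returns a small report; a transfer would be accepted only if the counts match and both lists are empty. A file count alone cannot detect silent corruption, which is why the checksum comparison is included.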

45 HathiTrust. “Preservation” (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.


Scenario 5: Core Utility and/or Building Failure

• Review: Risks Involving Core Utility or Building Failure

The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.

Severity: High Impact
• Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
• Additional failure past tolerance in backup cooling or power infrastructure

Severity: Moderate Impact
• Failure of backup power past redundancy tolerance (failure of 2 generators)
   o Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
• Structural damage renders facility temporarily unsafe and/or unusable.

Severity: Low Impact
• Loss of power
• Loss of environmental control units within redundancy

• HathiTrust’s Solutions for Utility or Building Failure

The continued delivery of HathiTrust’s services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group’s backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust’s infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan’s Hatcher Graduate Library. The service and cooperation of Michigan’s Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.

• General Maintenance and Repairs in University of Michigan Facilities

Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility’s manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.

• The Michigan Academic Computing Center (MACC)

The MACC hosts many of the key components of Michigan’s University Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement46 lists the responsibilities of the data center as well as the repository; of particular significance are the MACC’s agreements to:

o “Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical).” (sec. 4.1)

o “Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup” (sec. 4.2)

o “Provide 7x24 telephone contact for emergencies and for emergency access to facility.” (sec. 4.4)

In addition to features such as redundant electrical and environmental systems, the MACC maintains a full-time coordinator and staff who provide 24x7 responses to failures or malfunctions in the server environment. Alerts prompted by issues with the environmental systems or power are sent to the University of Michigan Network Operations Center (NOC) during non-business hours.

o Overview:

   “The MACC's redundancy is designed to ensure the safety and security of the data housed within. It consists of:
      • A dual power path from the property line to the power distribution units
      • Diesel powered generators for electrical backup
      • Flywheels (not batteries) to provide power while the generators come on
      • State-of-the-art generators and flywheels for backup power
      • Three extra computer room air conditioners
      • Two extra dry coolers
      • Glycol loop for cooling with two parallel pathways with crossover valves at regular intervals.”47

   “A state-of-the-art monitoring system keeps track of 1,700 different parameters and automatically notifies staff of any irregularity.”48

o Environmental Controls and Monitoring

   “The MACC has 18 Computer Room Air Conditioning units (CRACs). At any given time, only 15 are necessary to maintain the required temperature and humidity. [Thus, the computer room has N5+1 redundancy in its cooling ability.] It also is equipped with a number of portable coolers to address specific cooling needs. The heat from the room is transferred to an under-floor glycol loop that releases the heat to the outdoors.”49

46 Please refer to Appendix H (MACC Server Hosting Service Level Agreement).
47 Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.
48 --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
49 --. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.


“Thelayoutofthefacilityallowsthefrontonthecomputerrackstobefacingthecoldaisles.Theseaisleshaveperforatedfloortilesthroughwhichthecoolairispumpeddirectlytothecomputerslocatedthere.Heatisdischargedfromthebacksofthecomputers,whichcreatesthehotaisles.Thisalternatingarrangementfacilitatesthecoolingprocess,asthehotairproducedbythecomputerscanbesiphonedoffbeforeitminglestoomuchwiththecoolerairofthefacility.”50

“TwoseparatesmokedetectionandfirealarmsystemsprotecttheMACC.Oneisforthebuilding;theotherisfortheMACCitself.Thetwosystemsworktogethertoactivatealarmsystemsandnotifythefiredepartmentandkeypersonnel.Intheeventofanactualfire,thefire‐suppressionsystempipeswillnotfillwithwaterunlessthereisapressuredropcausedbymeltingofoneormoreofthesprinklerheads.”51

o Backup Power
    “Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage.”[52]

    “The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS).”[53]

    The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout.[54]

    In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary.[55]
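The redundancy figures quoted for the MACC reduce to simple arithmetic, which the following minimal sketch makes explicit. It is illustrative only: the function name `spare_units` is hypothetical, and the unit counts are the ones quoted from the MACC sources (18 CRACs with 15 required; three generators with two required).

```python
def spare_units(installed: int, required: int) -> int:
    """Return how many units can fail before capacity drops below the requirement."""
    if installed < 0 or required < 0:
        raise ValueError("unit counts must be non-negative")
    return installed - required


# Figures quoted from the MACC "Vital Statistics" page.
crac_spares = spare_units(installed=18, required=15)
generator_spares = spare_units(installed=3, required=2)

print(f"CRAC units: N+{crac_spares}")
print(f"Generators: N+{generator_spares}")
```

A result of zero or less would indicate that the loss of a single unit forces a response such as the load shed procedure described for the generators.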

• Arbor Lakes Data Facility (ALDF)
The ALDF houses the TSM Group’s infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust’s Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository’s backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility’s power and environmental systems.

[50] Ibid.
[51] Ibid.
[52] --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[53] Ibid.
[54] Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
[55] Ibid.


Scenario 6: Software Failure or Obsolescence

• Review: Risks Involving Software Failure or Obsolescence
The following table details various risks inherent to software failure or obsolescence and ranks them according to their severity.

• HathiTrust’s Solutions for Software Issues

The development and use of HathiTrust’s tools and resources depends on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open source applications that are well-supported and enjoy widespread use and development within the digital library community.

o “Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated “development” environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail.” (HT TRAC C1.8).

o “Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production.” (HT TRAC C1.9)

o “In order to design, build and modify software for the designated end-user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services.” (HT TRAC C2.2)

Severity          Events

High Impact       • Software bug escapes detection in development environment and results in crash of application.

Moderate Impact   • Software bug escapes detection in development environment and prevents full access to digital objects.
                  • Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository’s ability to detect it).

Low Impact        • Software bug escapes detection in development environment and prevents full use of system capabilities (e.g., rotation of images or additional functionality).


Scenario 7: Operator Error

• Review: Risks Involving Operator Error
The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.

• HathiTrust’s Solutions for Operator Error

In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion.[56] To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.
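The retention behavior described above (up to seven versions of a file kept for up to six months) can be sketched as a simple pruning rule. This is a simplified model and not TSM's actual implementation or configuration syntax; the function name `prune_versions`, the 183-day approximation of six months, and the sample dates are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

MAX_VERSIONS = 7                 # "up to seven versions"
MAX_AGE = timedelta(days=183)    # assumed approximation of "six months"


def prune_versions(backup_times, now):
    """Return the backup timestamps that this simplified policy would retain."""
    # Drop anything older than the age limit, then keep the newest few.
    recent = [t for t in backup_times if now - t <= MAX_AGE]
    recent.sort(reverse=True)
    return recent[:MAX_VERSIONS]


now = datetime(2009, 8, 24)
nightly = [now - timedelta(days=d) for d in range(1, 11)]  # ten nightly versions
stale = [now - timedelta(days=200)]                        # older than six months

kept = prune_versions(nightly + stale, now)
print(len(kept), "versions retained; oldest:", min(kept).date())
```

Under this model, an error backed up last night can still be undone by restoring any of the older retained versions, which is the safeguard the paragraph above relies on.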

• Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
o Identification of material for ingest
o Decryption and unzipping of files
o Format verification and validation with JHOVE
o Luhn barcode and MD5 checksum validation
o Creation of HathiTrust METS documents
o Establishment of HathiTrust handles (persistent URLs)
o Extension of the pairtree file directory (as new material enters the system)
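Two of the ingest steps listed above, barcode validation and MD5 checksum validation, lend themselves to a short illustration. The report does not describe how GROOVE implements them, so the sketch below is an assumption-laden example rather than HathiTrust's code: it assumes the barcode check is the standard Luhn (mod 10) algorithm, and the function names `luhn_is_valid` and `md5_matches` are hypothetical.

```python
import hashlib


def luhn_is_valid(number: str) -> bool:
    """Check a numeric string against the standard Luhn (mod 10) algorithm."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the MD5 digest of the data with an expected hexadecimal digest."""
    return hashlib.md5(data).hexdigest() == expected_hex.strip().lower()


payload = b"example page image bytes"          # stand-in for a delivered file
recorded = hashlib.md5(payload).hexdigest()    # stand-in for the supplied checksum

print(luhn_is_valid("79927398713"))            # a commonly cited valid Luhn number
print(md5_matches(payload, recorded))
print(md5_matches(payload + b"x", recorded))   # a single altered byte fails
```

Running such checks automatically on every delivered package means a corrupted or mislabeled file is rejected before it reaches archival storage, without relying on an operator to notice.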

• Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.

• Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG2000s) file for display to the viewer.

• Data Management: “New versions of software are released using automated mechanisms (in order to prevent manual errors).” (HT TRAC C1.8)

[56] Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).

Severity          Events

High Impact       • Operator error results in the irreparable loss of data or damage to equipment.
                  • Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.

Moderate Impact   • Operator error remains undetected and causes persistent problems in the system but has no long term consequences.

Low Impact        • Operator error is detected by normal procedures or via an activity log and can be readily corrected.


Scenario 8: Physical Security Breach

• Review: Risks Involving a Physical Security Breach
Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository’s efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from acts of vandalism, destruction, or malicious tampering. Details on the potential impacts of a physical security breach are covered in “Scenario 1: Hardware Failure” and “Scenario 3: Network Security.”

• HathiTrust’s Solutions for Physical Security
o “Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel.”[57]

• Security at the MACC
The MACC Server Hosting SLA states the data center staff will:

o “Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC.” (sec. 4.7)

o “Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list.” (sec. 4.5)

The MACC Website and the Michigan Academic Computing Center Operating Agreement[58] provide additional details concerning the resources and procedures that help protect HathiTrust’s equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.

o Security Systems
    “State-of-the-art security devices such as iris scanners, cameras, closed circuit television and on-call staff keep the data and machines housed in the MACC safe.”[59]

    “Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card.” (MACC OA, sec. 5.3.1)

    “Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator.” (sec. 5.2.1)

o Security Procedures

[57] HathiTrust. “Technology” (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
[58] Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
[59] Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 17 June 2009.


    “The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting.” (sec. 5.3.2)

    “Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center.” (sec. 5.3.8)

• Security at the ALDF
As noted in the TSM Backup Service SLA, the University of Michigan’s ITCS “is responsible for physical security” at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF’s operation, multiple levels of security and oversight are employed.


Scenario 9: Natural or Manmade Disaster

• Review: Risks Involving a Natural or Manmade Disaster
The following table details the risks to HathiTrust posed by a natural or manmade disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.

• HathiTrust’s Solutions for Natural or Manmade Catastrophic Events

The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism.[60] In all cases, staff will follow the directions of Public Safety and not re-enter buildings or resume work “until advised to do so by DPS or OSEH or someone from on-site incident command.”

In the event of a severe natural or manmade disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building’s infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).

[60] Please see Appendix C (Washtenaw County Hazard Ranking List).

Severity          Events

High Impact       • Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
                  • Damage to work areas forces staff to relocate to a new center of operations.
                  • Extensive loss or damage to hardware requires large-scale replacement.
                  • With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality, i.e., the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
                  • An act of violence or terrorism occurs at or near HathiTrust facilities.

Moderate Impact   • An event results in an extended outage at one site that exceeds the recovery time objective.
                  • Hardware sustains some damage and site is able to continue operation in a reduced capacity.
                  • An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.

Low Impact        • Local conditions result in a temporary outage at a HathiTrust site.


• Basic Disaster Recovery Strategies

In the immediate aftermath of a large-scale manmade or natural disaster, the repository’s initial recovery will be enabled by its basic system architecture:

o “the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor).”[61]

The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is off-line for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).

[61] HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.


Scenario 10: Media Failure or Obsolescence

• Review: Risks Involving Media Failure or Obsolescence
The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.

• HathiTrust’s Solutions for Media Failure

Given the nature of HathiTrust’s storage system, this scenario is only a concern in regards to the digital magnetic tapes used by the TSM Group for backups.

o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.

o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).

o If a degraded or otherwise ‘bad’ section of tape is detected during a backup procedure, that tape is immediately marked as “read only.”

    Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.

    If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.
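The tape-handling rules above amount to a small decision procedure, sketched below. The sketch is a hypothetical model for discussion and does not reflect the TSM Group's software or commands: the `Tape` class, the `next_action` function, and the tape labels are invented for illustration, and only the 80% threshold and the "read only" response come from the text.

```python
from dataclasses import dataclass

RECLAIM_THRESHOLD = 0.80  # "80% full", per the description above


@dataclass
class Tape:
    label: str
    fill_ratio: float        # 0.0 (empty) to 1.0 (full)
    read_only: bool = False


def next_action(tape: Tape, bad_section_detected: bool) -> str:
    """Return the handling step suggested by the rules described above."""
    if bad_section_detected:
        # A bad section takes priority: stop writing to this tape.
        tape.read_only = True
        return "mark read-only; copy existing data to good media; write new data elsewhere"
    if tape.fill_ratio >= RECLAIM_THRESHOLD:
        return "transfer content to new tape (defragmentation)"
    return "continue writing"


print(next_action(Tape("A0001", 0.45), bad_section_detected=False))
print(next_action(Tape("A0002", 0.85), bad_section_detected=False))
print(next_action(Tape("A0003", 0.30), bad_section_detected=True))
```

Note that nothing in this procedure inspects a tape that is neither being written nor nearly full, which is why idle tapes can degrade without being noticed.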

• Remaining Vulnerabilities

There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as “sticky shed” (the hydrolysis of the tape’s binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.

Severity          Events

High Impact       • Physical degradation (i.e., in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.

Moderate Impact   • Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.

Low Impact        • Bad tape is detected during a tape backup.


Conclusions and Action Items

• Conclusions
As this report demonstrates, a variety of risk management strategies in addition to design elements, operating procedures, and service and support contracts endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. As it is, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

In the effort to secure HathiTrust’s long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust’s policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken. The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short Term through Long Term and the arrangement within each category represents a suggested (but by no means definitive) order of accomplishment. For a more detailed explanation of action items related explicitly to Disaster Recovery Planning, please refer to the overview of the planning process in Appendix E or consult Appendix D for a list of more comprehensive guides and resources. (NB: * = Denotes an ongoing activity.)

• Short Term Action Items (0-6 months)
a. Resolve the nature and extent of the insurance coverage for HathiTrust equipment.
b. Arrange with TSM Group administrators to periodically perform a volume audit of backup tapes to ensure data integrity.
c. Institute periodic test restores with TSM Group to ensure that the process will run smoothly in the event of a disaster.
d. Discuss the creation of a long-term replacement schedule for backup tapes with the TSM Group to avoid the possibility of media degradation.
e. Improve control over system components
    i. Update the hardware inventory to include all important system components; document models, serial numbers, UM IDs, associated software and version number, date of purchase, original cost, as well as vendor contact information and product support contracts.*


    ii. Establish a software inventory to document necessary applications in the event of hardware loss; should include purpose, acquisition date, cost, license number, and version number.*
    iii. Create a map identifying where components are in the MACC and within individual racks.*
    iv. Review and assess points of failure as well as the adequacy of redundant components.*
f. Establish phone trees
    i. Include key contacts for different types of disaster
    ii. Prioritize phone trees to target individuals who
        1. Make decisions
        2. Have vital information
        3. Can offer assistance in resolving situations
    iii. Distribute information and explain protocols to all relevant staff*
    iv. Develop a regular maintenance/update schedule (once every 4-6 months)*
g. Thoroughly document and make available (as needed) important institutional knowledge so that HathiTrust may continue to function in the event of the extended absence or loss of key staff.*
h. Identify disaster preparedness and disaster recovery measures in place at Indianapolis.

• Intermediate Term (6-12 months)
a. Form a Disaster Recovery Planning Committee to research and develop plans and to oversee their implementation.
b. Communicate and coordinate planning activities between Ann Arbor and Indianapolis.*
    i. Consider the formation of sub-committees for localized research and development of plans and an executive committee to oversee the implementation and management of plans.
c. Draft a Disaster Recovery Planning policy statement to define the mandate, responsibilities, and objectives for the plan.
d. Initiate the data collection and analysis phase of the planning process.
    i. Identify core repository functions and associated hardware and infrastructure elements.
    ii. Determine the potential impact from the loss of those functions.
    iii. Define the levels of functionality required for partial as well as full recovery. Establish what level is needed for HT to fulfill its mission and the needs of its users.
    iv. Define HathiTrust’s Recovery Time Objective (RTO: the maximum allowable outage period for services) and Recovery Point Objective (RPO: the point in time to which data stores must be returned following a disaster).
    v. Determine the availability of resources in the event of a disaster and establish the repository’s prioritization with major service providers and vendors (i.e., TSM Group, ITCom, Isilon, etc.).


e. Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.*
f. Develop recovery strategies to bring core functions back online as soon as possible within a set cost range.
    i. Establish a logical progression in the restoration of services and associated components.
    ii. Identify the resources required for these efforts.
    iii. Consider alternative solutions, including partial (vs. full) recovery.
g. Communicate planning goals and efforts to key contacts from service providers and vendors to better coordinate recovery efforts.*
h. Initiate the production of core Disaster Recovery documents (see Appendix E for more information). The following list is not exhaustive; data collection and analysis will help determine if all or other plans (i.e., a web continuity plan) are needed.
    i. Business Continuity Plan: details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption.
    ii. Continuity of Operations Plan: focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
    iii. IT Contingency Plan: addresses explicitly the disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
    iv. Crisis Communications Plan: establishes procedures for internal and external communications during and after an emergency.
    v. Cyber-Incident Response Plan: defines the procedures for responding to cyber attacks against the HathiTrust IT system.
    vi. Occupant Emergency Plan: defines response procedures for staff in the event of a situation that poses a potential threat to the health and safety of HathiTrust personnel or their environment. (This requirement is addressed by University of Michigan Building Emergency Action Plans.)
    vii. Disaster Recovery Plan: brings together guidance and procedures from the other plans to enable the restoration of core information systems, applications, and services. This plan defines roles and responsibilities within Disaster Response Teams.
    viii. Disaster Recovery Training Plan: establishes the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.

• Long Term (12+ months)

a. Complete and implement Disaster Recovery Plans.
    i. Distribute physical copies of the plans as needed and include at least one copy in an off-site location.
    ii. Integrate elements of response strategies into system architecture to facilitate their deployment in the event of a disaster.*


b. Disaster Recovery Committee should monitor changes in best practices and technology, update plans, and oversee organizational readiness.*
    i. Initiate staff training so that individuals are familiar with Disaster Recovery procedures and communication protocols.*
    ii. Institute regular tests of disaster preparedness with simulated disasters involving different components of HathiTrust operations.*
    iii. Establish a schedule for maintenance and revisions to the Disaster Recovery documents.*
    iv. Coordinate Disaster Recovery Plan implementation, training, and review with Indianapolis.*
c. Store an additional copy of backup tapes at a third site to reduce risk exposure and limit the chance that a widespread event in Ann Arbor could impact both local copies.
d. Explore the possibility of establishing a third site for HathiTrust’s digital objects to reduce risk exposure and address concerns over the relative geographical proximity of Indianapolis and Ann Arbor.
e. Determine the feasibility of moving operations to a “hot” site in Ann Arbor should a disaster render the MACC unusable.
    i. Identify suitable sites and consider making preliminary arrangements.
    ii. Identify and price out equipment/infrastructure necessary to continue operations.
f. Plan for integration of new system components should the sudden collapse of Isilon leave HathiTrust without service/support.
g. Consider an increase to system security measures as content becomes accepted from a wider range of sources and as HathiTrust becomes a higher-profile organization.


APPENDIX A: Contact Information for Important HathiTrust Resources

Indiana University Mirror Site

• Andrew Poland (Staff, Information Technology Services)
    o [email protected]
    o (317) 274-0746
• Troy Dean Williams (Vice President for Information Technology, IU at Bloomington)
    o [email protected]
    o (812) 856-5323

University of Michigan

Michigan Academic Computing Center (MACC): Houses much of the technical infrastructure of the University Library’s digital resources.

• Rene Gobeyn (MACC Data Center Coordinator)
    o [email protected]
    o (734) 936-2654
• ITCom UMNOC (Network Operations Center)
    o [email protected]
    o (734) 647-8888

ITCS-ITCom: Responsible for maintaining network connections to the UMnet Backbone and Internet; ITCom provides maintenance and support services for hardware and software.

• Mike Brower (Senior Project Manager, UM Libraries)
    o [email protected]
    o (734) 936-9736
• Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
    o [email protected]
    o (734) 647-3214
• ITCom UMNOC (Network Operations Center)
    o [email protected]
    o (734) 647-8888

Tivoli Storage Manager Group: Responsible for nightly automated tape backups of storage servers.

• Andrew Inman (Service Manager)
    o [email protected]
    o (734) 615-6286
• Cameron Hanover (Storage Engineer)
    o [email protected]
    o (734) 764-7019
• General Support: [email protected]
• Emergency contact: [email protected]
    o Message will go to on-call staff’s pager in real time
• [email protected]

Arbor Lakes Data Facility: Houses one instance of the TSM backup tape library.

• ITCom UMNOC (Network Operations Center)


    o [email protected]
    o (734) 615-4209
• Ken Pritchard (ALDF facility manager)
    o [email protected]
    o (734) 615-2812

Procurement Services: Approves departmental purchases over $5,000; buyers also work as intermediaries with vendors.

• Steve Worden (UM Hardware Purchasing Specialist)
    o [email protected]
    o (734) 645-8972
• Shelly Eauclaire (Senior Buyer, Purchasing Services)
    o [email protected]
    o (734) 615-8767
• Ian Pepper (UM Dell Computers Contract Administrator)
    o [email protected]
    o (734) 647-4981
• Jeff Rabbitt (Alternate Dell Contract Administrator)
    o [email protected]
    o (734) 644-9232

Property Control: Responsible for tracking and tagging the university’s assets.

• Mary Ellen Lyon (Business Operation Manager)
    o [email protected]
    o (734) 647-3351 (t, th)
    o (734) 763-1197 (m, w, f)

Office of Financial Analysis:

• David Storey (Inventory Coordinator): Delivers UM property tags to equipment at the MACC.
    o [email protected]
    o (734) 647-4264

Risk Management Services: Provides insurance coverage of university assets.

• Kathleen Rychlinski (Assistant Director, Risk Management Services)
    o [email protected]
    o (734) 763-1587

Non-University Contact Information

Isilon Systems

• Jim Ramberg (Regional Territory Manager)
    o [email protected]
    o Desk: (847) 330-6399
    o Cell: (630) 561-2463

Sun Microsystems

• Christine Sluman (Service Sales Rep, Education)
    o [email protected]
    o (303) 557-3660, ext. 60519


    o (303) 949-1567 (Cell)
• Larry Zimmerman (Michigan Account Manager, Sales)
    o [email protected]
    o (248) 880-3756

CDW-G

• University of Michigan Account Team
    o [email protected]
• Hansen Chennikkra (Account Manager)
    o [email protected]
    o (866) 339-3639
• Adam Sullivan (Account Manager)
    o [email protected]
    o (866) 339-4118

Dell Computers

• Brian Ullestad (Higher Education Account Manager)
    o [email protected]
    o 1-800-274-7799 ext. 7249522


APPENDIX B: HathiTrust Outages from March 2008 through April 2009 [62]

• April 2009: HathiTrust experienced reduced performance from 11:00 pm EDT on Thursday, April 23 to 8:22 am EDT on Friday, April 24 due to a database problem at one of the sites and from 5:30 pm to 9:00 pm EDT on Thursday, April 30 due to unintended consequences from a networking configuration change.
• March 2009: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00 am EST and on Thursday, March 5 from 7:00-7:45 am EST for operating system and database software upgrades.
• February 2009: On Sunday, February 22 at 8:40 am EST, a power surge resulting from electrical system maintenance caused HathiTrust database and web servers to go offline. Staff learned of the problem at approximately 6:00 pm EST, and service was restored by 6:30 pm EST.
• January 2009: A brief outage is scheduled in January for a storage system software upgrade.
• December 2008: On Friday, December 19 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40 am EST.
• November 2008: On Tuesday, November 4 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:45 am EST.
• October 2008: No outages reported.
• September 2008: On Thursday, September 18 at approximately 9:30 am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45 am EDT.
• August 2008: On Tuesday, August 26 at approximately 9:00 am EDT, a database server was brought down to move to Indianapolis. Prior to shutting this server down, we did not update a manual failover configuration, causing volumes to be inaccessible to some users. The problem was resolved at 11:15 am EDT.
• July 2008: Service was unavailable on Thursday July 31 from 7:00-7:30 am EDT for a storage system software upgrade.
• June 2008: No outages reported.
• May 2008: No outages reported.
• April 2008: No outages reported.
• March 2008: No outages reported.
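An outage log of this kind becomes more useful for planning once durations are computed and compared with a Recovery Time Objective. The sketch below does this for four of the unplanned events listed above. It is an illustration under stated assumptions: the report leaves HathiTrust's RTO undefined, so the four-hour `HYPOTHETICAL_RTO` is a placeholder, the helper name `duration` is invented, and the times are transcribed from the list (two of them are given there as approximate).

```python
from datetime import datetime, timedelta

# Placeholder value only; the report recommends defining the RTO but does not set one.
HYPOTHETICAL_RTO = timedelta(hours=4)

# (start, end, cause), transcribed from the outage list above.
outages = [
    ("2009-04-23 23:00", "2009-04-24 08:22", "database problem (reduced performance)"),
    ("2009-02-22 08:40", "2009-02-22 18:30", "power surge"),
    ("2008-09-18 09:30", "2008-09-18 10:45", "storage system software problem"),
    ("2008-08-26 09:00", "2008-08-26 11:15", "failover configuration not updated"),
]


def duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


for start, end, cause in outages:
    elapsed = duration(start, end)
    status = "exceeds" if elapsed > HYPOTHETICAL_RTO else "within"
    print(f"{start}: {elapsed} ({cause}) {status} the hypothetical RTO")
```

With any plausible RTO, the February 2009 event stands out because most of its duration passed before staff learned of the problem, which points to detection rather than repair as the limiting factor.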

[62] HathiTrust. “Updates” from http://www.hathitrust.org/updates retrieved on 16 June 2009.


APPENDIX C: Washtenaw County Hazard Ranking List

The following list ranks a variety of natural and manmade events within Washtenaw County, Michigan, based upon their frequency of occurrence and the extent of their potential impact (on the county’s population).

Rank  Hazard                                                               Frequency            Population Impacted
1     Convective weather (severe winds, lightning, tornados, hailstorms)  Once or more/yr.     250,000
2     Hazardous materials incidents: transportation                       Once or more/yr.     2,000
3     Hazardous materials incidents: fixed site                           Once or more/yr.     10,000
4     Severe winter weather hazards (ice/sleet/snow storms)               Once or more/yr.     250,000
5     Infrastructure failures                                             Once every 5 yrs.    30,000
6     Transportation accidents: air and land                              Once or more/yr.     100
7     Extreme temperatures                                                Once every 5 yrs.    10,000
8     Flood hazards: riverine/urban flooding                              Once every 10 yrs.   2,000
9     Nuclear attack                                                      Has not occurred     250,000
10    Petroleum and natural gas pipeline accidents                        Once every 10 yrs.   1,000
11    Fire hazards: wildfires                                             Once or more/yr.     0

Source: Washtenaw County Hazard Mitigation Plan (available online at http://www.ewashtenaw.org/government/departments/planning_environment/planning/planning/hazard_html)


APPENDIX D: Annotated Guide to Disaster Recovery Planning References

The topic of disaster recovery planning for the print and analog resources of libraries has been widely dealt with in professional literature, but comparatively little information exists concerning the development and implementation of plans for the digital content of cultural institutions. The following bibliography details resources which provide guidance, examples, and explanations of the objectives and strategies for digital Disaster Recovery Plans. It consists primarily of material compiled by Lance Stuchell (ICPSR Intern) and Nancy McGovern (ICPSR Digital Preservation Officer) and is included here with their permission.

University of Michigan Resources

• University of Michigan Administrative Information Services (MAIS): Emergency Management, Business Continuity, and Disaster Recovery Planning.
    o http://www.mais.umich.edu/projects/drbc_methodology.html
    o This site broadly outlines the need for and functions of Emergency Management, Business Continuity, and Disaster Recovery Planning at UM. It also contains templates designed to help units plan, test, and audit disaster and continuity programs.

• Provost and Executive Vice President for Academic Affairs: Standard Practice Guide: Institutional Data Resource Management Policy
    o http://spg.umich.edu/
    o This policy defines institutional data resources as University assets and makes recommendations on identifying, preserving, and providing access to these assets. The digital resources of the library may be identified as such, based upon their use by departments across the university.

• ICPSR Disaster Planning Resources:
    o Digital Preservation Officer Nancy McGovern is part of a Disaster Recovery initiative at ICPSR and over the past several years her team (including Lance Stuchell) has produced a variety of documents and templates to help other institutions work through the planning process.
    o Documents are available upon request and should be posted in the near future (as of July 2009) to the ICPSR Website (http://icpsr.umich.edu/).

• Disaster Recovery Experts:
    o Rene Gobeyn (MACC Data Center Coordinator)
        Managed and coordinated Disaster Recovery for U.S. military data centers
        [email protected]
    o Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
        Helped develop current ITCS Disaster Recovery plans
        [email protected]


External Resources

• General Guide to Disaster Planning
    o Contingency Planning Guide for Information Technology Systems: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-34, June 2002.
        http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf
        An indispensable resource which was used heavily by ICPSR in its Disaster Recovery planning. It covers everything from initial data collection and policy formation to the structure of disaster response teams and the articulation of recovery strategies.

• Examples and Tools for the Documentation Outlined by NIST Guide:
    o Full Disaster Recovery Plan:
        United States Department of Agriculture Disaster Recovery and Business Resumption Plans
            • http://www.ocio.usda.gov/directives/doc/DM3570-001.htm
    o Business Continuity Plan (BCP):
        MAIS: Emergency Management, Business Continuity, and Disaster Recovery Planning
            • http://www.mais.umich.edu/projects/drbc_templates.html
            • This site provides several resources that deal with continuity planning.
    o Continuity of Operations Programs (COOP):
        FEMA: Continuity of Operations (COOP) Programs
            • http://www.fema.gov/government/coop/index.shtm
            • Contains a lot of useful information on government policy, templates, and training resources to assist in the creation of a COOP.
        Ready.gov: Continuity of Operations Planning
            • http://www.ready.gov/business/plan/planning.html
            • Guidelines for composing a business COOP, including what outside actors should be involved in the planning process.
        The Florida Department of Health: Continuity of Operations Plan for Information Technology
            • http://www.naphit.org/global/library/basement_docs/FL_DisasterRecovery_template.doc
            • Lengthy (40 pages) and detailed COOP template written for an IT environment.
        Florida Atlantic University Libraries: Continuity of Operations Plan
            • http://www.staff.library.fau.edu/policies/coop-2007.pdf
            • A detailed working COOP, which includes reactions to specific disaster scenarios.
    o IT Contingency Plan:

2009‐08‐24 41

SeetheUSDADisasterRecoveryPlanforanexampleofanITContingencyPlan.o CyberIncidentResponsePlan:

Multi‐StateInformationSharingandAnalysisCenterCyberIncidentResponseGuide

• http://www.msisac.org/localgov/documents/FINALIncidentResponseGuide.pdf

• Theguideprovidesastep‐by‐stepprocessforrespondingtoincidentsanddevelopinganincidentresponseteam.ItmayalsoserveatemplateinordertodraftaCyber‐IncidentResponsePolicyandPlan.

  o Crisis Communication Plan:
    ▪ Ready.gov: Write a Crisis Communication Plan
      • http://www.ready.gov/business/talk/crisisplan.html
      • This site provides guidelines for composing a business disaster communication plan and includes suggestions for the plan’s Web presence.
    ▪ NC State University: Crisis Communication Plan
      • http://www.ncsu.edu/emergency-information/crisisplan.php
      • This is the policy and plan for the University as a whole. While much of this policy deals with communication at a high level, useful sections detail vital contacts within the organization (including who to contact first), and how to manage external communications.
    ▪ Other thorough university policies and plans include the LSU: Crisis Communication Plan and the Missouri S&T: Crisis Communication Plan.
    ▪ Heritage Microfilm Flood Update Email
      • This email was sent in response to the June 2008 flooding that occurred in the Midwest.
      • It updates clients on the outage of NewspaperArchive.com which resulted from a flood-induced widespread power failure. It is an excellent example of an external crisis communication to users.

  o Disaster Recovery Plans (DRP):
    ▪ The University of Iowa: IT Services Disaster Recovery Plan
      • http://cio.uiowa.edu/ITplanning/Plans/ITSdisasterPrep.shtml
      • This policy details the data collection and assessment which informs the UI plan and also includes emergency procedures, response strategies, and a crisis communication plan.
    ▪ University of Arkansas: Computing Services Disaster Recovery Plan
      • http://www.uark.edu/staff/drp/
      • A complete and thorough plan that outlines the initiation of emergency and recovery procedures, and addresses how the plan will be maintained.
    ▪ Adams State College (CO): Information Technology Disaster Recovery Plan
      • http://www.adams.edu/administration/computing/dr-plan100206.pdf
      • This plan has a thorough section on risk assessment.
    ▪ Digital Preservation Europe Repository Planning Checklist and Guidance
      • http://www.digitalpreservationeurope.eu/platter.pdf
      • Designed for use with the Planning Tool for Trusted Electronic Repositories (PLATTER), this document outlines considerations for a Disaster Recovery Strategic Objective Plan (SOP) and places them in context with other repository plans.

  o Occupant Emergency Plan (OEP):
    ▪ This requirement is addressed by University of Michigan Building Emergency Action Plans (EAP).
      • http://www.umich.edu/~oseh/guideep.pdf
  o Disaster Recovery Training Guides:
    ▪ dPlan.org
      • Provides useful information on training and an online form that would be useful in assigning trainers and monitoring the training process.
    ▪ CalPreservation.org: Disaster Plan Exercise
      • http://calpreservation.org/disasters/exercise.html
      • Provides roles and teaching points for a role-play training exercise that focuses on a disaster in a library.

• Policy Planning Tools:
  o Association of Public Treasurers of the United States and Canada: Disaster Policy Certification Guidelines
    ▪ www.aptusc.org/includes/getpdf.php?f=Disaster_Policy.pdf
    ▪ This planning document and template for disaster management policies provides outlines and example language on several facets of a strong policy, including the possible loss of a building, the replacement of computer resources, and testing and training for the disaster plan. It also outlines the need to identify possible threats to assets.

• Examples of Disaster Planning Policies:
  o Arkansas Secretary of State: Disaster Planning Policy
    ▪ http://www.sos.arkansas.gov/elections/elections_pdfs/register/oct_reg/016.14.01-020.pdf
    ▪ This policy outlines areas of responsibility between departments and units, and includes training, communication, and recovery plan updates.
  o Washington State Department of Information Services: Disaster Recovery and Business Resumption Planning Policy
    ▪ http://isb.wa.gov/policies/portfolio/500p.doc
    ▪ This document illustrates policy formation for an IT Disaster Recovery Plan. It provides guidelines for Disaster Recovery Planning as well as maintenance, testing, and training involved with the recovery plan.


  o Florida State University: Information Technology Disaster Recovery and Data Backup Policy
    ▪ http://oti.fsu.edu/oti_pdf/Information%20Technology%20Disaster%20Recovery%20and%20Data%20Backup%20Policy.pdf
    ▪ This document includes policy for data backup as well as Disaster Recovery. Part of the policy includes a definition of Best Practice Disaster Recovery Procedures, as well as an outline of the university’s own IT recovery planning and implementation procedures.

• Example of a Relevant Disaster Planning Program:
  o OCLC Digital Archive Preservation Policy and Supporting Documentation
    ▪ http://www.oclc.org/support/documentation/digitalarchive/preservationpolicy.pdf
    ▪ This document has a clear articulation of OCLC's disaster policy, along with an outline of disaster prevention and recovery procedures and a time-frame for the restoration of services in the event of a disaster.
    ▪ The policy includes a good definition of a disaster prevention and recovery plan: “A set of responses based on sound principles and endorsed by senior management, which can be activated by trained staff with the goal of preventing or reducing the severity of the impact of disasters and incidents.”
    ▪ OCLC embeds its disaster plan within its overall preservation policy, stating: “The goal of disaster prevention is to safeguard the data (content and metadata) in the Digital Archive and to safeguard the Digital Archive’s software and systems. For disaster prevention and recovery, all data (content and metadata) is considered of equal value.”

• Designing a Disaster Planning Program:
  o Michigan State University: Step by Step Guide to Disaster Recovery Planning
    ▪ http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm
    ▪ This program breaks down the disaster planning process into steps, and provides information relevant to individual units within a university setting. The MSU Disaster Recovery Planning Homepage (http://www.drp.msu.edu/) also offers a variety of resources.
  o Minnesota State Archives: Disaster Preparedness
    ▪ http://www.mnhs.org/preserve/records/docs_pdfs/disaster_000.pdf
    ▪ This document is a detailed guide to the disaster planning process. While mostly dealing with paper records, the document clearly identifies different roles and responsibilities for members of the planning and recovery team.
  o Cisco Systems: Disaster Recovery Best Practices White Paper
    ▪ http://www.cisco.com/warp/public/63/disrec.pdf
    ▪ The paper outlines Disaster Recovery using the framework of the above resources, but tailors it to an IT point of view. It has useful information on how to prepare and recover both hardware and software assets.
  o AT&T: Key Elements to an Effective Business Continuity Plan
    ▪ http://www.business.att.com/content/article/Key_to_Effective_BC_Plan.pdf
    ▪ A short paper that summarizes business continuity planning in the private sector.

• General Information
  o Federal Emergency Management Agency: Emergency Management Guide for Business & Industry
    ▪ http://www.fema.gov/business/guide/index.shtm
    ▪ A practical guide with step-by-step advice on creating a Disaster Recovery program. Includes information on the formation of a planning committee, organizational analysis, and details on specific hazards.
  o Special Libraries Association Information Portal: Disaster Planning and Recovery
    ▪ http://www.sla.org/content/resources/infoportals/disaster.cfm
    ▪ An exhaustive list of resources, this page includes articles on digital disaster recovery strategies as well as information on planning, examples of plans, and links to a wide range of resources in the public and private sector.

Written Resources:

• Wellheiser, Johanna and Jude Scott. An Ounce of Prevention: Integrated Disaster Planning for Archives, Libraries, and Record Centres. Lanham, MD: Scarecrow Press, 2002.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004233950&local_base=AA_PUB

• Cox, Richard J. Flowers After the Funeral: Reflections on the Post-9/11 Digital Age. Lanham, MD: Scarecrow Press, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004341258&local_base=AA_PUB

• Matthews, Graham and John Feather, eds. Disaster Management for Libraries and Archives. Burlington, VT: Ashgate, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004354795&local_base=AA_PUB


APPENDIX E: Overview of the Disaster Recovery Planning Process

Various resources agree that there is no one way to go about initiating a Disaster Recovery program or drafting a DR plan. An organization must proceed according to its functions and resources as well as the needs of its designated community of users. The following discussion draws heavily upon the ICPSR Disaster Planning Policy Framework (written by Nancy McGovern and Lance Stuchell) and the Contingency Planning Guide for Information Technology Systems published by NIST (2002). As such, it represents a consolidation and simplification of information presented in more depth elsewhere. A list of planning resources (with link information to full texts) is available in Appendix D.

• Basic Precepts of Disaster Recovery Planning

1) Disaster Recovery Planning is a continuous activity that involves monitoring internal conditions as well as evolutions in technology and threats; responding to new developments that arise; revising plans so that they remain relevant and effective; training staff according to plans; and testing organizational readiness.

   a. There is no single document which contains “the plan”; rather, a Disaster Recovery Plan consists of a suite of documents that require a regular schedule of testing and revision to be effective.

   b. There is no point at which a Disaster Recovery Plan is “finished.”

2) Disaster Recovery Planning needs to be an organization-wide activity.

   a. Disaster recovery must be one of the basic functions of HathiTrust.

   b. An effective plan needs full administrative support.

   c. Policies and procedures must complement and conform to disaster response plans established by the university, city, and Department of Homeland Security.

3) Disaster recovery cannot be limited to the hardware and software components or data collections of HathiTrust; planning must also account for the impact of human emergencies on the repository’s operations.

• Essential Steps in Disaster Recovery Planning

1) Establish a Disaster Recovery Planning Committee.

   a. This group will research and develop the plan and help with its implementation as well as monitor the training, testing, and revising of plans to ensure organizational compliance and readiness.

   b. The committee should involve individuals representing the various mission-critical units within the library (from administration to Core Services to the Digital Preservation Librarian) who will participate in the development of policy and recovery planning.

   c. It is essential that the committee involve individuals with the authority to support and enforce recommendations.

   d. The committee’s activities should initiate the formation of a Disaster Response Program.

2) Draft a Disaster Recovery Planning Policy Statement.


   a. Enables the organization, and others, to understand the scope and nature of the Disaster Recovery Plan.

   b. Establishes the organizational framework and responsibilities for the planning process.

   c. Key policy elements (as detailed in the NIST report):

      i. Roles and responsibilities within the organization with regard to planning

      ii. Mandate for Disaster Recovery as well as any statutory or regulatory requirements

      iii. Scope as applies to the type(s) of platform(s) and organizational functions subject to Disaster Recovery Planning

      iv. Resource requirements for the Disaster Recovery program

      v. Training requirements

      vi. Exercise and testing schedules (at least one major annual test)

      vii. Plan maintenance schedule (elements should be reviewed annually)

      viii. Frequency of backups and storage of backup media

3) Conduct Data Collection and Analysis (i.e., “Business Impact Analysis”).

   a. Determine critical functions and identify specific system resources required to perform them. Minimum requirements for functionality should be established.

   b. Determine risks and vulnerabilities facing the repository’s systems and infrastructure.

   c. Identify and coordinate with internal and external points of contact to determine how they depend on or support the repository and its functions; consider how one failure might cascade into others.

      i. Identify resources that are crucial to HathiTrust (e.g., Mirlyn).

      ii. Determine the allowable outage/disruption time for these resources.

   d. Develop recovery priorities; balance the cost of inoperability against the cost of recovery.

      i. Determine HathiTrust’s position within the priorities of the university as well as with its major service providers and vendors (e.g., TSM Group, ITCom, Isilon) to better understand how that prioritization will impact recovery efforts.

      ii. Establish the most crucial functions which must be restored first.

      iii. Determine HathiTrust’s Recovery Time Objective (RTO, i.e., the maximum allowable outage period) and Recovery Point Objective (RPO, i.e., the point in time to which data files must be restored after a disaster).

      iv. Review potential resources (financial, personnel, etc.) within HathiTrust as well as those available via contracts, service providers, and product support. This step should involve the clarification of HathiTrust’s position within the university’s as well as key service providers’ and vendors’ priorities.
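The RTO and RPO named in step 3(d)(iii) can be made concrete with a small check. The following Python sketch is illustrative only: the names RecoveryObjectives and meets_objectives, and every duration and timestamp, are hypothetical and do not come from any HathiTrust agreement. It also assumes the RPO is expressed as the maximum tolerable age of the most recent restorable backup, which is one common way to operationalize "the point in time to which data files must be restored."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum allowable outage period
    rpo: timedelta  # assumed: maximum tolerable age of the restored data


def meets_objectives(objectives, last_backup, outage_start, service_restored):
    """Return (rto_met, rpo_met) for a single outage."""
    downtime = service_restored - outage_start
    # Data written after the last backup is what a restore would lose.
    data_loss_window = outage_start - last_backup
    return downtime <= objectives.rto, data_loss_window <= objectives.rpo


# Hypothetical figures, for illustration only.
objectives = RecoveryObjectives(rto=timedelta(hours=48), rpo=timedelta(hours=24))
rto_met, rpo_met = meets_objectives(
    objectives,
    last_backup=datetime(2009, 8, 23, 2, 0),
    outage_start=datetime(2009, 8, 24, 9, 0),
    service_restored=datetime(2009, 8, 25, 9, 0),
)
print(rto_met, rpo_met)
```

In this invented scenario the outage lasts 24 hours, which is inside the RTO, but the last backup is 31 hours old when the outage begins, so the RPO is missed. A result like this would point toward more frequent backups rather than faster restoration.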

4) Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.


5) Develop recovery strategies that respond to the potential impacts and maximum allowable outage times established in the data collection phase. Efforts should focus on solutions that are cost-effective and technically viable.

   a. Strategies should be designed to bring core functions back online as soon as possible within an established cost range.

   b. Recovery efforts must be prioritized according to the nature of core functions as well as logical order of procedures.

   c. Alternative solutions should be considered based upon cost, availability of resources, outage times, levels of functionality (partial vs. full), and ability to integrate methods with existing infrastructure.

   d. Determine the practicality of partial (vs. full) recovery in order to bring services back online in a timely and cost-effective manner.

   e. Recovery strategies and resources should be incorporated (as possible) into the repository’s system architecture so that in the event of a disaster, the response may proceed in an efficient and straightforward manner.
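Step 5(b) asks that recovery be ordered both by the importance of each function and by the logical order of procedures. One way to sketch that combination is a dependency-aware ordering: a function is restored only after everything it depends on, and functions that become restorable at the same time are taken in priority order. The example below uses Python's standard graphlib module (Python 3.9 or later); the function names, priorities, and dependencies are hypothetical placeholders, not an inventory of HathiTrust services.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: name -> (priority, dependencies).
# A lower priority number means "restore sooner".
functions = {
    "storage": (1, set()),
    "network": (1, set()),
    "web access": (2, {"storage", "network"}),
    "ingest": (3, {"storage", "network"}),
    "full-text search": (3, {"web access"}),
}


def recovery_order(functions):
    """Order functions so dependencies come first, then by priority."""
    sorter = TopologicalSorter(
        {name: deps for name, (_, deps) in functions.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies restored.
        ready = sorted(sorter.get_ready(), key=lambda n: (functions[n][0], n))
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


print(recovery_order(functions))
```

Priority is applied only among functions that become ready together, so a high-priority function never jumps ahead of something it depends on. A circular dependency is reported as an error, which is itself a useful finding during data collection.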

6) Formalize and record collected data and recovery strategies in Disaster Recovery Documents. In the process of producing this wide range of documents, an organization is forced to consider and document policies and procedures related to a variety of key administrative and technical issues. The decision of which plans to include (and which to exclude) must be determined based upon a review of HathiTrust’s needs and objectives. Additional documents (a Web continuity plan, for instance) may be necessary based upon data collection and analysis.

   a. Business Continuity Plan

      i. Business continuity is the ability of a business to continue its operations with minimal disruption or downtime in the event of natural or manmade disasters.

      ii. Such planning allows an organization to ensure its survival by considering potential business interruptions and establishing appropriate, cost-effective responses.

      iii. The Business Continuity Plan details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption. It should address key administrative and support functions as well as those which directly involve the repository’s designated community.

      iv. The plan should thoroughly document the nature of key functions, interdependences, the impact of their loss, and alternative means to ensure their continuation in the event of a disaster. MAIS offers a useful Business Continuity planning template at http://www.mais.umich.edu/projects/drbc_templates.html.

   b. Continuity of Operations Plan (COOP)

      i. The COOP focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.


      ii. This plan may include the Business Continuity Plan and Disaster Recovery Plan as appendices.

   c. IT Contingency Plan

      i. The IT Contingency Plan addresses disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.

      ii. It should account for the following:

         1. Document hardware and software

         2. Develop an emergency contact list

         3. Back up and store all data files off-site

         4. Proactively monitor equipment and data

         5. Install and update antivirus software on both computers and servers

         6. Develop recovery scenarios

         7. Communicate and monitor the plan

      iii. The plan allows HathiTrust to formalize and document procedures and policies already in place and details the repository’s adherence to these goals.

   d. Crisis Communications Plan

      i. Communication is a vitally important aspect of Disaster Recovery Planning and an organization’s actual response in a disaster.

      ii. The Crisis Communications Plan establishes procedures for internal and external communications during and after an emergency.

      iii. The different phases of crisis communication encompass the initial notification of an event, damage assessment, and plan activation as well as status reports (as needed) and the eventual completion of recovery efforts.

      iv. Activation of the communications plan must be the responsibility of a specific individual.

      v. The Disaster Response Team coordinates with the Crisis Communication Team to ensure that information provided about an emergency is clear, concise, and consistent.

   e. Cyber-Incident Response Plan

      i. This plan defines the procedures for responding to cyber attacks against the HathiTrust IT system.

      ii. It provides a formal framework for the identification, mitigation, and recovery from malicious computer incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software, or data.


   f. Occupant Emergency Plan

      i. The Occupant Emergency Plan defines response procedures for library staff in the event of a situation that poses a potential threat to the health and safety of personnel, the environment, or HathiTrust property.

      ii. HathiTrust may utilize the framework provided by UM Building Emergency Action Plans for this element.

   g. Disaster Recovery Plan

      i. The primary focus of the Disaster Recovery Plan is the restoration of core information systems, applications, and services.

      ii. The plan brings together guidance and procedures from the other plans (e.g., Business Continuity Plan, IT Contingency Plan, Crisis Communications Plan) pertaining to emergencies that result in interruptions of service that exceed acceptable downtimes, as defined in the BCP.

      iii. The plan should detail established recovery strategies for specific disaster situations as well as the teams involved in their execution.

      iv. Personnel should be chosen to staff disaster response teams based on their skills and knowledge. Ideally, teams would be staffed with the personnel responsible for the same or similar operation under normal conditions. It is also important that team members be familiar with the goals and procedures of other teams to facilitate inter-team coordination. Each team is led by a team leader (with a suitable alternate) who directs overall team operations, acts as the team’s representative to management, and liaises with other team leaders. Disaster Response cannot be individual-specific or overly reliant on specific people. Teams must assign each role at least one alternate in the event that core people are unavailable at the time of a disaster.

      v. NIST suggests that a capable strategy will require some or all of the following functional groups. For HathiTrust, many of these are already in place in the form of University of Michigan units and service providers.

         1. An authoritative role for overall decision-making responsibility

         2. Senior Management Official

         3. Management Team

         4. Damage Assessment Team

         5. Operating System Administration Team

         6. Systems Software Team

         7. Server Recovery Team (e.g., client server, Web server)

         8. LAN/WAN Recovery Team

         9. Database Recovery Team

         10. Network Operations Recovery Team

         11. Application Recovery Team(s)


         12. Telecommunications Team

         13. Hardware Salvage Team

         14. Alternate Site Recovery Coordination Team

         15. Original Site Restoration/Salvage Coordination Team

         16. Test Team

         17. Administrative Support Team

         18. Transportation and Relocation Team

         19. Media Relations Team

         20. Legal Affairs Team

         21. Physical/Personnel Security Team

         22. Procurement Team (equipment and supplies)
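The staffing rule in item g(iv), that every role needs at least one alternate so the response does not depend on a single person, lends itself to an automated check of the team roster. The sketch below is a minimal illustration: the roster structure, the function name roles_without_alternate, and the placeholder people are all hypothetical.

```python
def roles_without_alternate(roster):
    """Return roles that lack a lead or an alternate distinct from the lead."""
    gaps = []
    for role, staffing in roster.items():
        lead = staffing.get("lead")
        # Listing the lead as their own alternate does not provide cover.
        alternates = [p for p in staffing.get("alternates", []) if p != lead]
        if not lead or not alternates:
            gaps.append(role)
    return gaps


# Hypothetical roster, for illustration only.
roster = {
    "Damage Assessment Team leader": {"lead": "Person A", "alternates": ["Person B"]},
    "Server Recovery Team leader": {"lead": "Person C", "alternates": []},
    "Media Relations Team leader": {"lead": "Person D", "alternates": ["Person D"]},
}

print(roles_without_alternate(roster))
```

Running such a check whenever personnel change ties the roster to the plan maintenance schedule and surfaces gaps before a disaster does.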

   h. Disaster Recovery Training Plan

      i. This plan will establish the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.

      ii. The contents of the plan should reflect the range of responsibilities held between administrators, department heads, and staff within HathiTrust.

      iii. The plan should accommodate Disaster Recovery Planning Committee members as well as those of the Disaster Response Team. For the latter, it should identify key roles and responsibilities in recovery efforts.

      iv. The plan should allow in-house training to be supplemented by external opportunities.

      v. Regularly scheduled emergency drills should also be included to test the readiness of staff and the appropriateness of response procedures.

7) Implement elements developed in the planning process. Procedures and policies related to communication, technological solutions, etc. must be incorporated into HathiTrust’s overall design and operation so that Disaster Recovery becomes a critical organizational function.

8) Institute a regular program of training and testing to be sure that staff understand and accept policies and procedures and to ensure that HathiTrust is prepared for a disaster.

9) Conduct regular review and maintenance of Disaster Recovery documents to respond to changes in personnel, organizational structure or functions, and evolutions in technology and/or threats.

• Main Phases in a Disaster Response:

1) Notification/Activation: This phase covers the initial actions once a situation has been detected or is threatened. It includes damage assessment and the implementation of an appropriate response strategy.

   a. Proper diagnosis and communication (both internal and external) of a disaster is essential.


   b. The nature of individual events will determine who needs to be involved (e.g., facilities management, core services).

2) Recovery: This phase focuses on the return to a pre-established level of functionality (plans should detail partial as well as full recoveries).

   a. Response teams implement recovery strategies and adhere to procedures and protocols outlined in Disaster Recovery Documents.

3) Reconstitution: After recovery efforts are complete, normal operations must be restored. This may involve the reconstruction of facilities and/or infrastructure as well as the testing of restored elements to ensure their full functionality.
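The three response phases above form a simple sequence, which can be sketched as a small state machine, for example to drive a status page or an incident log. The Python below is a minimal illustration under stated assumptions: the Phase and NEXT_PHASE names are hypothetical, a NORMAL state is added to represent ordinary operations, and the progression is assumed to be strictly linear (a real response might, for instance, stand down after a false alarm).

```python
from enum import Enum


class Phase(Enum):
    NORMAL = "normal operations"  # assumed state, not one of the three phases
    NOTIFICATION_ACTIVATION = "notification/activation"
    RECOVERY = "recovery"
    RECONSTITUTION = "reconstitution"


# Assumed strictly linear progression through the phases.
NEXT_PHASE = {
    Phase.NORMAL: Phase.NOTIFICATION_ACTIVATION,
    Phase.NOTIFICATION_ACTIVATION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.RECONSTITUTION,
    Phase.RECONSTITUTION: Phase.NORMAL,
}


def advance(current):
    """Return the phase that follows the current one."""
    return NEXT_PHASE[current]


phase = Phase.NORMAL
history = [phase]
for _ in range(4):
    phase = advance(phase)
    history.append(phase)

print(" -> ".join(p.value for p in history))
```

Keeping the transitions in a single table makes it easy to review with the Disaster Response Team and to extend if the plan later defines additional states.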


APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008) (Right click to open the Adobe Document Object located below)


APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard SA (2006) (Right click to open the Adobe Document Object located below)


APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009) (Right click to open the Adobe Document Object located below)


APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006) (Right click to open the Adobe Document Object located below)

