Global Data Grids for 21st Century Science
Paul Avery
University of Florida
http://www.phys.ufl.edu/~avery/
[email protected]
Colloquium, University of Texas at Arlington, Jan. 24, 2002
Paul Avery
What is a Grid?
- Grid: geographically distributed computing resources configured for coordinated use
- Physical resources & networks provide the raw capability
- Middleware software ties it together
Applications for Grids
- Climate modeling: climate scientists visualize, annotate, & analyze terabytes of simulation data
- Biology: a biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- High energy physics: 3,000 physicists worldwide pool petaflops of CPU resources to analyze petabytes of data
- Engineering: civil engineers collaborate to design, execute, & analyze shake-table experiments; a multidisciplinary analysis in aerospace couples code and data in four companies
(From Ian Foster)
Applications for Grids (cont.)
- Application service providers: a home user invokes architectural design functions at an application service provider; an application service provider purchases cycles from compute-cycle providers
- Commercial: scientists at a multinational soap company design a new product
- Communities: an emergency response team couples real-time data, weather models, and population data; a community group pools members' PCs to analyze alternative designs for a local road
- Health: hospitals and international agencies collaborate on stemming a major disease outbreak
(From Ian Foster)
Proto-Grid: SETI@home
- Community: SETI researchers + enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
More Advanced Proto-Grid: Evaluation of AIDS Drugs
- Community: 1000s of home computer users; philanthropic computing vendor (Entropia); research group (Scripps)
- Common goal: advance AIDS research
Early Information Infrastructure
- Network-centric: simple, fixed end systems; few embedded capabilities; few services; no user-level quality of service
- O(10^8) nodes
Emerging Information Infrastructure
- Application-centric: heterogeneous, mobile end systems; many embedded capabilities; rich services (processing, caching, resource discovery); user-level quality of service (QoS)
- O(10^10) nodes
- Qualitatively different, not just faster and more reliable
Why Grids?
- Resources for complex problems are distributed: advanced scientific instruments (accelerators, telescopes, ...); storage and computing; groups of people
- Communities require access to common services: scientific collaborations (physics, astronomy, biology, engineering, ...); government agencies; health care organizations, large corporations, ...
- Goal is to build Virtual Organizations: make all community resources available to any VO member; leverage strengths at different institutions; add people & resources dynamically
Grid Challenges
- Overall goal: coordinated sharing of resources
- Technical problems to overcome: authentication, authorization, policy, auditing; resource discovery, access, allocation, control; failure detection & recovery; resource brokering
- Additional issue: lack of central control & knowledge; preservation of local site autonomy; policy discovery and negotiation important
Layered Grid Architecture (Analogy to Internet Architecture)
- Fabric: controlling things locally (accessing, controlling resources)
- Connectivity: talking to things (communications, security)
- Resource: sharing single resources (negotiating access, controlling use)
- Collective: managing multiple resources (ubiquitous infrastructure services)
- User: specialized services (application-specific distributed services)
(From Ian Foster)
Globus Project and Toolkit
- Globus Project (Argonne + USC/ISI): O(40) researchers & developers; identify and define core protocols and services
- Globus Toolkit: a major product of the Globus Project; reference implementation of core protocols & services; growing open-source developer community
- Globus Toolkit used by all Data Grid projects today:
  - US: GriPhyN, PPDG, TeraGrid, iVDGL
  - EU: EU DataGrid and national projects
Globus General Approach
- Define Grid protocols & APIs: protocol-mediated access to remote resources; integrate and extend existing standards
- Develop reference implementation: open-source Globus Toolkit; client & server SDKs, services, tools, etc.
- Grid-enable a wide variety of tools: FTP, SSH, Condor, SRB, MPI, ...
- Learn about real-world problems: deployment, testing, applications
Globus Toolkit Protocols
- Security (connectivity layer): Grid Security Infrastructure (GSI)
- Resource management (resource layer): Grid Resource Allocation Management (GRAM)
- Information services (resource layer): Grid Resource Information Protocol (GRIP)
- Data transfer (resource layer): Grid File Transfer Protocol (GridFTP)
Data Grids
Data Intensive Science: 2000-2015
- Scientific discovery increasingly driven by IT: computationally intensive analyses; massive data collections; data distributed across networks of varying capability; geographically distributed collaboration
- Dominant factor: data growth (1 Petabyte = 1000 TB)
  - 2000: ~0.5 Petabyte
  - 2005: ~10 Petabytes
  - 2010: ~100 Petabytes
  - 2015: ~1000 Petabytes?
- How to collect, manage, access and interpret this quantity of data?
- Drives demand for Data Grids to handle the additional dimension of data access & movement
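A quick back-of-the-envelope check of what this projected growth curve implies (my arithmetic, not from the talk): going from ~0.5 PB in 2000 to ~1000 PB in 2015 is a factor of 2000 in 15 years.

```python
import math

# Projected data volumes from the slide above (petabytes).
volumes = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}

# Overall growth factor and the annual rate it implies over 15 years.
factor = volumes[2015] / volumes[2000]                    # 2000x
annual_rate = factor ** (1 / 15) - 1                      # roughly 66% per year
doubling_time = math.log(2) / math.log(1 + annual_rate)   # under 1.5 years

print(f"growth factor: {factor:.0f}x")
print(f"implied annual growth: {annual_rate:.0%}")
print(f"doubling time: {doubling_time:.1f} years")
```

A sustained doubling of stored data well under every two years is what makes wide-area data movement, not CPU, the binding constraint.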
Global Data Grid Challenge
Global scientific communities will perform computationally demanding analyses of distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.
Data Intensive Physical Sciences
- High energy & nuclear physics
- Gravity wave searches: LIGO, GEO, VIRGO
- Astronomy: digital sky surveys
  - Now: Sloan Sky Survey, 2MASS
  - Future: VISTA, other gigapixel arrays
  - Virtual Observatories (Global Virtual Observatory)
- Time-dependent 3-D systems (simulation & data): Earth observation; climate modeling; geophysics, earthquake modeling; fluids, aerodynamic design; pollutant dispersal scenarios
Data Intensive Biology and Medicine
- Medical data: X-ray, mammography data, etc. (many petabytes); digitizing patient records (ditto)
- X-ray crystallography: bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines: Human Genome, other genome databases; proteomics (protein structure, activities, ...); protein interactions, drug delivery
- Brain scans (3-D, time dependent)
- Virtual Population Laboratory (proposed): database of populations, geography, transportation corridors; simulate likely spread of disease outbreaks
(Craig Venter keynote @ SC2001)
Data and Corporations
- Corporations and Grids: national, international, global; business units, research teams; sales data; transparent access to distributed databases
- Corporate issues: short-term and long-term partnerships; overlapping networks; managing and controlling access to data and resources; security
Example: High Energy Physics
- Compact Muon Solenoid (CMS) at the LHC (CERN)
- (Figure: detector shown to scale against the Smithsonian standard man)
LHC Computing Challenges
- Events resulting from beam-beam collisions: the signal event is obscured by 20 overlapping uninteresting collisions in the same crossing
- CPU time does not scale from previous generations (2000 vs. 2007)
LHC: Higgs Decay into 4 Muons
- 10^9 events/sec; selectivity: 1 in 10^13
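The quoted rates make the needle-in-a-haystack problem concrete; a short illustrative check (my arithmetic only):

```python
collision_rate = 1e9    # beam-beam events per second
selectivity = 1e13      # one signal event per 10^13 collisions

signal_rate = collision_rate / selectivity    # signal events per second
seconds_per_event = 1 / signal_rate           # 10,000 s between candidates

print(f"signal rate: {signal_rate:.0e} events/s")
print(f"one candidate roughly every {seconds_per_event / 3600:.1f} hours")
```

So even at a billion collisions per second, a clean Higgs-to-4-muon candidate appears only every few hours, which is why the trigger and filtering chain must discard almost everything in real time.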
LHC Computing Challenges
- Complexity of the LHC interaction environment & resulting data
- Scale: petabytes of data per year (100 PB by ~2010-12)
- Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries
Global LHC Data Grid
- Tier0: CERN
- Tier1: national lab
- Tier2: regional center (university, etc.)
- Tier3: university workgroup
- Tier4: workstation
- Key ideas: hierarchical structure; Tier2 centers
Global LHC Data Grid (tiered data flow)
- Experiment: bunch crossing every 25 ns; ~100 triggers per second; each event is ~1 MB
- Online system to CERN Computer Center (Tier 0+1, >20 TIPS): ~100 MB/s (raw detector output ~PB/s)
- Tier 0+1 to Tier 1 national centers (USA, France, Italy, UK): 2.5 Gb/s
- Tier 1 to Tier 2 regional centers (~0.25 TIPS each): ~622 Mb/s
- Tier 2 to Tier 3 institutes: 100-1000 Mb/s; Tier 4: workstations, other portals
- Physicists work on analysis channels; each institute has ~10 physicists working on one or more channels; physics data cache
- CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1
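The figure's rates are self-consistent, which is easy to verify (illustrative arithmetic, not from the slide):

```python
crossing_interval = 25e-9   # seconds between bunch crossings
trigger_rate = 100          # accepted events per second
event_size = 1e6            # bytes per event (~1 MB)

crossing_rate = 1 / crossing_interval     # raw collision rate: 40 MHz
tier0_rate = trigger_rate * event_size    # bytes/s flowing into Tier 0

print(f"crossing rate: {crossing_rate / 1e6:.0f} MHz")
print(f"Tier 0 ingest: {tier0_rate / 1e6:.0f} MB/s")
```

100 triggered events/s at ~1 MB each gives exactly the ~100 MB/s link from the online system to the CERN center, sustained around the clock.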
Example: Global Virtual Observatory
- Multi-wavelength astronomy; multiple surveys
GVO Data Challenge
- Digital representation of the sky: all-sky + deep fields; integrated catalog and image databases; spectra of selected samples
- Size of the archived data:
  - 40,000 square degrees at resolution < 0.1 arcsec: > 50 trillion pixels
  - One band (2 bytes/pixel): 100 Terabytes
  - Multi-wavelength: 500-1000 Terabytes
  - Time dimension: many Petabytes
- Large, globally distributed database engines: integrated catalog and image databases; multi-petabyte data size; GB/s aggregate I/O speed per site
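The pixel and byte counts above follow directly from the survey parameters; a quick check (my arithmetic, reproducing the slide's numbers):

```python
sky_area_deg2 = 40_000    # square degrees surveyed
pixel_scale = 0.1         # arcsec per pixel
bytes_per_pixel = 2       # one band, 2 bytes/pixel

arcsec2_per_deg2 = 3600 ** 2                        # arcsec^2 in one square degree
pixels_per_deg2 = arcsec2_per_deg2 / pixel_scale**2
total_pixels = sky_area_deg2 * pixels_per_deg2      # > 50 trillion
total_bytes = total_pixels * bytes_per_pixel        # ~100 TB for one band

print(f"total pixels: {total_pixels:.2e}")
print(f"one-band size: {total_bytes / 1e12:.0f} TB")
```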
Sloan Digital Sky Survey Data Grid
- Distributed collaboration: Fermilab, U. Washington, U. Chicago, USNO, JHU, NMSU, Apache Point Observatory, Institute for Advanced Study, Princeton U., Japan
- Networks: vBNS, Abilene, ESnet
LIGO (Gravity Wave) Data Grid
- Hanford Observatory and Livingston Observatory feed Caltech (Tier1) and MIT
- Links: OC3, OC12, OC48
Data Grid Projects
Large Data Grid Projects
Funded projects:
- GriPhyN (US, NSF): $11.9M + $1.6M, 2000-2005
- EU DataGrid (EU, EC): EUR 10M, 2001-2004
- PPDG (US, DOE): $9.5M, 2001-2004
- TeraGrid (US, NSF): $53M, 2001-?
- iVDGL (US, NSF): $13.7M + $2M, 2001-2006
- DataTAG (EU, EC): EUR 4M, 2002-2004
Proposed projects:
- GridPP (UK, PPARC): >$15M?, 2001-2004
Many national projects:
- Initiatives in US, UK, Italy, France, NL, Germany, Japan
- EU networking initiatives (GÉANT, SURFnet)
PPDG Middleware Components
- Object- and file-based application services (request interpreter)
- Cache manager
- File access service (request planner)
- Matchmaking service; cost estimation
- File fetching service; file replication index
- File movers (crossing site boundaries / security domains)
- End-to-end network services; mass storage manager; resource management
- Future: OO-collection export; cache and state tracking; prediction
EU DataGrid Project
Work packages and lead contractors:
- WP1: Grid Workload Management (INFN)
- WP2: Grid Data Management (CERN)
- WP3: Grid Monitoring Services (PPARC)
- WP4: Fabric Management (CERN)
- WP5: Mass Storage Management (PPARC)
- WP6: Integration Testbed (CNRS)
- WP7: Network Services (CNRS)
- WP8: High Energy Physics Applications (CERN)
- WP9: Earth Observation Science Applications (ESA)
- WP10: Biology Science Applications (INFN)
- WP11: Dissemination and Exploitation (INFN)
- WP12: Project Management (CERN)
GriPhyN: PetaScale Virtual-Data Grids
- Scale: ~1 Petaflop, ~100 Petabytes
- Users: production teams, individual investigators, workgroups (interactive user tools)
- Services: virtual data tools; request planning & scheduling tools; request execution & management tools
- Transforms applied to raw data sources; distributed resources (code, storage, CPUs, networks)
- Underpinned by resource management services, security and policy services, and other Grid services
GriPhyN Research Agenda
- Virtual Data technologies: derived data, calculable via algorithm; instantiated 0, 1, or many times (e.g., caches); fetch value vs. execute algorithm; very complex (versions, consistency, cost calculation, etc.)
- LIGO example: "Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year"
- For each requested data value, need to: locate the item and its algorithm; determine the costs of fetching vs. calculating; plan the data movements & computations required to obtain results; execute the plan
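The fetch-versus-recompute decision at the heart of virtual data can be sketched as a cost comparison. This toy planner is my illustration only; the site names, cost model, and function are hypothetical, not GriPhyN code:

```python
def plan_request(replicas, link_bandwidth, compute_cost):
    """Toy fetch-vs-compute planner for one virtual-data request.

    replicas:       {site: size_in_bytes} for materialized copies
    link_bandwidth: {site: bytes_per_second} from each site to the requester
    compute_cost:   estimated seconds to re-run the derivation algorithm
    """
    # Estimated transfer time from each site holding a replica.
    fetch_costs = {site: size / link_bandwidth[site]
                   for site, size in replicas.items()}
    best_site, best_fetch = min(fetch_costs.items(),
                                key=lambda kv: kv[1],
                                default=(None, float("inf")))
    # Fetch when a cached copy is cheaper than re-deriving the value.
    if best_fetch <= compute_cost:
        return ("fetch", best_site, best_fetch)
    return ("compute", None, compute_cost)

# 2 GB replicas at two sites, 100 MB/s and 20 MB/s links, 60 s to recompute.
decision = plan_request({"siteA": 2e9, "siteB": 2e9},
                        {"siteA": 1e8, "siteB": 2e7},
                        compute_cost=60)
print(decision)   # ('fetch', 'siteA', 20.0)
```

The real problem is far harder (versions, consistency, policies, shared queues), but the core planner step, compare the cheapest replica fetch against recomputation, has this shape.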
Virtual Data in Action
- A data request may: compute locally; compute remotely; access local data; access remote data
- Scheduling based on: local policies; global policies; cost
- Resource hierarchy: major facilities, archives; regional facilities, caches; local facilities, caches
GriPhyN/PPDG Data Grid Architecture
- Execution path: Application to Planner (produces DAG) to Executor (consumes DAG) to compute and storage resources
- Supporting services: catalog services, information services, policy/security, monitoring, replica management, reliable transfer service
- An initial solution is operational
Catalog Architecture
- Metadata Catalog: maps object names (e.g. X, Y, F.X, G(1).Y) to logical object names (logO1-logO4)
- Replica Catalog: maps logical container names (logC1-logC4) to physical file names, i.e. URLs for physical file locations (URL1-URL6); one logical container may have several replicas
- Together the catalogs provide transparency with respect to location
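The two-step resolution, object name to logical name via the metadata catalog, then logical name to physical URLs via the replica catalog, can be sketched with plain dictionaries. The names echo the figure; the object-to-container mapping and the lookup function are my assumptions for illustration:

```python
# Metadata catalog: application object names -> logical object names.
metadata_catalog = {"X": "logO1", "Y": "logO2", "F.X": "logO3", "G(1).Y": "logO4"}

# Replica catalog: logical container names -> physical file URLs.
replica_catalog = {
    "logC1": ["URL1"],
    "logC2": ["URL2", "URL3"],
    "logC3": ["URL4"],
    "logC4": ["URL5", "URL6"],
}

# Which container holds each logical object (assumed; implicit in the figure).
container_of = {"logO1": "logC1", "logO2": "logC2",
                "logO3": "logC3", "logO4": "logC4"}

def resolve(object_name):
    """Return every physical location holding the named object."""
    logical = metadata_catalog[object_name]         # step 1: metadata catalog
    return replica_catalog[container_of[logical]]   # step 2: replica catalog

print(resolve("F.X"))   # ['URL4']
print(resolve("Y"))     # ['URL2', 'URL3']
```

An application never sees URLs directly; it names the object it wants, and the catalogs supply every replica, which is what makes location-transparent planning possible.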
Early GriPhyN Challenge Problem: CMS Data Reconstruction (April 2001)
Sites: Caltech workstation, Wisconsin Condor pool, NCSA Linux cluster, NCSA UniTree (GridFTP-enabled FTP server)
1) Master Condor job running at Caltech
2) Launch secondary job on Wisconsin pool; input files via Globus GASS
3) 100 Monte Carlo jobs on Wisconsin Condor pool
4) 100 data files transferred via GridFTP, ~1 GB each
5) Secondary reports complete to master
6) Master starts reconstruction jobs via Globus jobmanager on the NCSA cluster
7) GridFTP fetches data from UniTree
8) Processed Objectivity database stored to UniTree
9) Reconstruction job reports complete to master
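The nine steps form a dependency chain that the master job drives. A minimal sketch of expressing such a pipeline as a DAG and deriving an execution order (step names mirror the slide; this scheduler is my illustration, not the Condor-G machinery):

```python
from graphlib import TopologicalSorter

# Step dependencies for the April 2001 challenge run (simplified to a chain).
dag = {
    "launch_secondary":       {"master_start"},
    "monte_carlo_jobs":       {"launch_secondary"},
    "gridftp_transfer":       {"monte_carlo_jobs"},
    "secondary_reports":      {"gridftp_transfer"},
    "reconstruction_jobs":    {"secondary_reports"},
    "fetch_from_unitree":     {"reconstruction_jobs"},
    "store_to_unitree":       {"fetch_from_unitree"},
    "reconstruction_reports": {"store_to_unitree"},
}

# A topological sort yields a valid execution order for the executor.
order = list(TopologicalSorter(dag).static_order())
print(" -> ".join(order))
```

In the real run, Condor and DAGMan-style tooling played this executor role, dispatching each step once its predecessors reported completion.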
Trace of a Condor-G Physics Run
iVDGL: A World Grid Laboratory
- International Virtual-Data Grid Laboratory: a global Grid laboratory (US, EU, Asia, ...); a place to conduct Data Grid tests at scale; a mechanism to create common Grid infrastructure; a facility to perform production exercises for LHC experiments; a laboratory for other disciplines to perform Data Grid tests
- US part funded by NSF on Sep. 25, 2001: $13.65M + $2M
- "We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science."
From NSF proposal, 2001
iVDGL Summary Information
- Principal components: Tier1 sites (laboratories); Tier2 sites (universities); selected Tier3 sites (universities); fast networks (US, Europe, transatlantic, transpacific); Grid Operations Center (GOC); computer science support teams (6 UK Fellows); coordination, management
- Proposed international participants: initially US, EU, Japan, Australia; other world regions later; discussions w/ Russia, China, Pakistan, India, Brazil
- Complementary EU project DataTAG: transatlantic network from CERN to STAR-TAP (+ people); initially 2.5 Gb/s
US iVDGL Proposal Participants
- U. Florida: CMS
- Caltech: CMS, LIGO
- UC San Diego: CMS, CS
- Indiana U.: ATLAS, iGOC
- Boston U.: ATLAS
- U. Wisconsin, Milwaukee: LIGO
- Penn State: LIGO
- Johns Hopkins: SDSS, NVO
- U. Chicago: CS
- U. Southern California: CS
- U. Wisconsin, Madison: CS
- Salish Kootenai: Outreach, LIGO
- Hampton U.: Outreach, ATLAS
- U. Texas, Brownsville: Outreach, LIGO
- Fermilab: CMS, SDSS, NVO
- Brookhaven: ATLAS
- Argonne Lab: ATLAS, CS
(Categories: T2/software; CS support; T3/outreach; T1/labs not funded)
Initial US-iVDGL Data Grid
- Sites: Caltech/UCSD, Florida, Wisconsin, Fermilab, BNL, Indiana, BU, Michigan
- Other sites to be added in 2002: SKC, Brownsville, Hampton, PSU
iVDGL Map (2002-2003)
- Transatlantic links: DataTAG, SURFnet
Infrastructure Data Grid Projects
- GriPhyN (US, NSF): petascale virtual-data Grids (http://www.griphyn.org/)
- Particle Physics Data Grid (US, DOE): Data Grid applications for HENP (http://www.ppdg.net/)
- European Data Grid (EC, EU): Data Grid technologies, EU deployment (http://www.eu-datagrid.org/)
- TeraGrid Project (US, NSF): distributed supercomputing resources, 13 TFlops (http://www.teragrid.org/)
- iVDGL + DataTAG (NSF, EC, others): global Grid lab & transatlantic network
- Common threads: collaborations of application scientists & computer scientists; focus on infrastructure development & deployment; broad application
Data Grid Project Timeline
Need for Common Grid Infrastructure
- Grid computing is sometimes compared to the electric grid: you plug in to get a resource (CPU, storage, ...) and you don't care where the resource is located
- We want to avoid fragmentation in Grid computing, and the analogy is more appropriate than originally intended: it expresses a USA viewpoint, a uniform power grid
- What happens when you travel around the world? Different frequencies (60 Hz, 50 Hz); different voltages (120 V, 220 V); different sockets! (USA 2-pin, France, UK, etc.)
Role of Grid Infrastructure
- Provide essential common Grid services: cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.)
- Meet the needs of high-end scientific & engineering collaborations: HENP, astrophysics, GVO, earthquake, climate, space, biology, ...; already international and even global in scope; drive future requirements
- Be broadly applicable outside science: government agencies (national, regional (EU), UN); non-governmental organizations (NGOs); corporations, business networks (e.g., suppliers, R&D); other virtual organizations (see "The Anatomy of the Grid")
- Be scalable to the global level
Grid Coordination Efforts
- Global Grid Forum (GGF): www.gridforum.org; international forum for general Grid efforts; many working groups, standards definitions; next one in Toronto, Feb. 17-20
- HICB (high energy physics): represents HEP collaborations, primarily LHC experiments; joint development & deployment of Data Grid middleware (GriPhyN, PPDG, TeraGrid, iVDGL, EU DataGrid, LCG, DataTAG, CrossGrid); common testbed, open-source software model; several meetings so far
- New infrastructure Data Grid projects? Fold them into the existing Grid landscape (primarily US + EU)
SummaryData Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
The iVDGL will provide vast experience for new collaborations
Many challenges during the coming transition:
- New Grid projects will provide rich experience and lessons
- Difficult to predict the situation even 3-5 years ahead
Grid References
- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org
We define Grid architecture in terms of a layered collection of protocols.

The fabric layer includes the protocols and interfaces that provide access to the resources being shared, including computers, storage systems, datasets, programs, and networks. This layer is a logical view rather than a physical view. For example, the view of a cluster with a local resource manager is defined by the local resource manager, not the cluster hardware. Likewise, the fabric provided by a storage system is defined by the file system available on that system, not the raw disks or tapes.

The connectivity layer defines the core protocols required for Grid-specific network transactions. This layer includes the IP protocol stack (system-level application protocols such as DNS, RSVP, and routing, plus the transport and internet layers), as well as core Grid security protocols for authentication and authorization.

The resource layer defines protocols to initiate and control the sharing of (local) resources. Services defined at this level include the gatekeeper and GRIS, along with some user-oriented application protocols from the Internet protocol suite, such as file transfer.

The collective layer defines protocols that provide system-oriented capabilities expected to be wide-scale in deployment and generic in function. This includes GIIS, bandwidth brokers, resource brokers, and so on.

The application layer defines protocols and services that are parochial in nature, targeted toward a specific application domain or class of applications.
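The layer assignments in the notes above can be tabulated compactly; this is just my summary of the text, in code form:

```python
# Layer -> example protocols/services, as described in the notes above.
layers = {
    "Fabric":       ["local resource managers", "file systems"],
    "Connectivity": ["IP stack (DNS, RSVP, routing)",
                     "Grid security (authentication, authorization)"],
    "Resource":     ["gatekeeper", "GRIS", "file transfer"],
    "Collective":   ["GIIS", "bandwidth brokers", "resource brokers"],
    "Application":  ["domain-specific protocols and services"],
}

for layer, examples in layers.items():
    print(f"{layer:>12}: {', '.join(examples)}")
```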