UT Arlington Colloquium (Jan. 24, 2002)
Global Data Grids for 21st Century Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/  [email protected]
Transcript
  • Global Data Grids for 21st Century Science
    Paul Avery, University of Florida
    http://www.phys.ufl.edu/~avery/  [email protected]
    Physics Colloquium, University of Texas at Arlington, Jan. 24, 2002


  • What is a Grid?
    Grid: geographically distributed computing resources configured for coordinated use
    Physical resources & networks provide raw capability
    Middleware software ties it together


  • Applications for Grids
    Climate modeling: climate scientists visualize, annotate, & analyze terabytes of simulation data
    Biology: a biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
    High energy physics: 3,000 physicists worldwide pool petaflops of CPU resources to analyze petabytes of data
    Engineering: civil engineers collaborate to design, execute, & analyze shake table experiments;
    a multidisciplinary analysis in aerospace couples code and data in four companies
    (From Ian Foster)


  • Applications for Grids (cont.)
    Application Service Providers: a home user invokes architectural design functions at an application service provider;
    an application service provider purchases cycles from compute cycle providers
    Commercial: scientists at a multinational soap company design a new product
    Communities: an emergency response team couples real-time data, weather models, and population data;
    a community group pools members' PCs to analyze alternative designs for a local road
    Health: hospitals and international agencies collaborate on stemming a major disease outbreak
    (From Ian Foster)


  • Proto-Grid: SETI@home
    Community: SETI researchers + enthusiasts
    Arecibo radio data sent to users (250 KB data chunks)
    Over 2M PCs used


  • More Advanced Proto-Grid: Evaluation of AIDS Drugs
    Community: 1000s of home computer users; philanthropic computing vendor (Entropia); research group (Scripps)
    Common goal: advance AIDS research


  • Early Information Infrastructure
    Network-centric: simple, fixed end systems; few embedded capabilities; few services; no user-level quality of service
    O(10^8) nodes


  • Emerging Information Infrastructure
    Application-centric: heterogeneous, mobile end systems; many embedded capabilities; rich services; user-level quality of service
    O(10^10) nodes
    Qualitatively different, not just faster and more reliable
    (Diagram: QoS, resource discovery, processing, caching across the Grid)


  • Why Grids?
    Resources for complex problems are distributed: advanced scientific instruments (accelerators, telescopes, ...); storage and computing; groups of people
    Communities require access to common services: scientific collaborations (physics, astronomy, biology, engineering, ...); government agencies; health care organizations, large corporations, ...
    Goal is to build Virtual Organizations: make all community resources available to any VO member; leverage strengths at different institutions; add people & resources dynamically


  • Grid Challenges
    Overall goal: coordinated sharing of resources
    Technical problems to overcome: authentication, authorization, policy, auditing; resource discovery, access, allocation, control; failure detection & recovery; resource brokering
    Additional issue: lack of central control & knowledge
    Preservation of local site autonomy
    Policy discovery and negotiation important


  • Layered Grid Architecture (analogy to Internet architecture)
    Fabric: controlling things locally; accessing and controlling resources
    Connectivity: talking to things; communications, security
    Resource: sharing single resources; negotiating access, controlling use
    Collective: managing multiple resources; ubiquitous infrastructure services
    User: specialized services; application-specific distributed services
    (From Ian Foster)


  • Globus Project and Toolkit
    Globus Project (Argonne + USC/ISI): O(40) researchers & developers; identify and define core protocols and services
    Globus Toolkit: a major product of the Globus Project; reference implementation of core protocols & services; growing open-source developer community
    Globus Toolkit used by all Data Grid projects today
    US: GriPhyN, PPDG, TeraGrid, iVDGL
    EU: EU DataGrid and national projects


  • Globus General Approach
    Define Grid protocols & APIs: protocol-mediated access to remote resources; integrate and extend existing standards
    Develop reference implementation: open-source Globus Toolkit; client & server SDKs, services, tools, etc.
    Grid-enable a wide variety of tools: FTP, SSH, Condor, SRB, MPI, ...
    Learn about real-world problems: deployment, testing, applications
    (Diagram: applications built on diverse global services, core services, and diverse OS services)


  • Globus Toolkit Protocols
    Security (connectivity layer): Grid Security Infrastructure (GSI)
    Resource management (resource layer): Grid Resource Allocation Management (GRAM)
    Information services (resource layer): Grid Resource Information Protocol (GRIP)
    Data transfer (resource layer): Grid File Transfer Protocol (GridFTP)
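    To make these protocols concrete, here is a minimal sketch (not from the talk) of how a user-level script of that era might drive them: GSI supplies the proxy credential and GridFTP moves the file. It assumes the Globus Toolkit 2.x command-line clients (grid-proxy-init, globus-url-copy) are installed; the host name and file paths are hypothetical.

```python
# Minimal sketch (not from the talk): GSI-authenticated GridFTP transfer driven
# from a script. Assumes the Globus Toolkit client tools are on the PATH; the
# host and paths are hypothetical placeholders.
import subprocess

def fetch_with_gridftp(remote_host: str, remote_path: str, local_path: str) -> None:
    # GSI: create a short-lived proxy credential (prompts for the key passphrase)
    subprocess.run(["grid-proxy-init"], check=True)
    # GridFTP: authenticated, high-performance file transfer
    src = f"gsiftp://{remote_host}{remote_path}"
    dst = f"file://{local_path}"
    subprocess.run(["globus-url-copy", src, dst], check=True)

if __name__ == "__main__":
    fetch_with_gridftp("tier2.example.edu", "/data/cms/run42/events.dat",
                       "/tmp/events.dat")
```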


  • Data Grids


  • Data Intensive Science: 2000-2015
    Scientific discovery increasingly driven by IT: computationally intensive analyses; massive data collections; data distributed across networks of varying capability; geographically distributed collaboration
    Dominant factor: data growth (1 Petabyte = 1000 TB)
      2000: ~0.5 Petabyte
      2005: ~10 Petabytes
      2010: ~100 Petabytes
      2015: ~1000 Petabytes?
    How to collect, manage, access and interpret this quantity of data?
    Drives demand for Data Grids to handle the additional dimension of data access & movement
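    A quick back-of-envelope check (added here for illustration, not part of the talk) shows what those projections imply: growing from ~0.5 PB in 2000 to ~1000 PB in 2015 is roughly a 66% annual growth rate, i.e. the data volume doubles about every 16 months.

```python
# Back-of-envelope check of the implied growth rate, using the figures above.
import math

start_pb, end_pb = 0.5, 1000.0        # ~0.5 PB in 2000 -> ~1000 PB in 2015
years = 2015 - 2000

annual_factor = (end_pb / start_pb) ** (1 / years)        # ~1.66x per year
doubling_years = math.log(2) / math.log(annual_factor)    # ~1.4 years

print(f"annual growth ~{annual_factor:.2f}x, doubling every ~{doubling_years:.1f} years")
```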


  • Global Data Grid Challenge

    Global scientific communities will perform computationally demanding analyses of distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.


  • Data Intensive Physical Sciences
    High energy & nuclear physics
    Gravity wave searches: LIGO, GEO, VIRGO
    Astronomy: digital sky surveys
      Now: Sloan Sky Survey, 2MASS
      Future: VISTA, other Gigapixel arrays
      Virtual Observatories (Global Virtual Observatory)
    Time-dependent 3-D systems (simulation & data):
      Earth observation; climate modeling; geophysics, earthquake modeling; fluids, aerodynamic design; pollutant dispersal scenarios


  • Data Intensive Biology and Medicine
    Medical data: X-ray, mammography data, etc. (many petabytes); digitizing patient records (ditto)
    X-ray crystallography: bright X-ray sources, e.g. Argonne Advanced Photon Source
    Molecular genomics and related disciplines: Human Genome, other genome databases; proteomics (protein structure, activities, ...); protein interactions, drug delivery
    Brain scans (3-D, time dependent)
    Virtual Population Laboratory (proposed): database of populations, geography, transportation corridors; simulate likely spread of disease outbreaks
    (Craig Venter keynote @ SC2001)


  • Data and Corporations
    Corporations and Grids: national, international, global; business units, research teams; sales data; transparent access to distributed databases
    Corporate issues: short-term and long-term partnerships; overlapping networks; manage and control access to data and resources; security


  • Example: High Energy Physics
    (Figure: the Compact Muon Solenoid detector at the LHC, CERN, with a "Smithsonian standard man" for scale)


  • LHC Computing Challenges
    Events resulting from beam-beam collisions: the signal event is obscured by 20 overlapping uninteresting collisions in the same crossing
    CPU time does not scale from previous generations
    (Figure: event complexity, 2000 vs. 2007)


  • LHC: Higgs Decay into 4 Muons
    10^9 events/sec; selectivity: 1 in 10^13 (i.e. only about 10^-4 selected events per second)


  • LHC Computing Challenges
    Complexity of the LHC interaction environment & resulting data
    Scale: petabytes of data per year (100 PB by ~2010-12)
    Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries


  • Global LHC Data Grid
    Tier0: CERN
    Tier1: national laboratory
    Tier2: regional center (university, etc.)
    Tier3: university workgroup
    Tier4: workstation
    Key ideas: hierarchical structure; Tier2 centers


  • Global LHC Data Grid
    Experiment / online system: bunch crossings every 25 ns; 100 triggers per second; each event is ~1 MByte; ~PBytes/sec off the detector, ~100 MBytes/sec into the Tier 0 center
    Tier 0 + 1: CERN Computer Center (> 20 TIPS) with physics data cache
    Tier 1 (2.5 Gbits/sec links): USA, France, Italy, UK centers
    Tier 2 (~622 Mbits/sec links): regional centers, ~0.25 TIPS each
    Tier 3 (100-1000 Mbits/sec): institutes; each institute has ~10 physicists working on one or more analysis channels
    Tier 4: workstations, other portals
    CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1


  • Example: Global Virtual Observatory
    Multi-wavelength astronomy; multiple surveys


  • GVO Data Challenge
    Digital representation of the sky: all-sky + deep fields; integrated catalog and image databases; spectra of selected samples
    Size of the archived data: 40,000 square degrees at resolution < 0.1 arcsec, i.e. > 50 trillion pixels
      One band (2 bytes/pixel): 100 Terabytes
      Multi-wavelength: 500-1000 Terabytes
      Time dimension: many Petabytes
    Large, globally distributed database engines: integrated catalog and image databases; multi-Petabyte data size; GByte/s aggregate I/O speed per site
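    The pixel and volume figures above follow from simple arithmetic. The short check below (added for illustration, not part of the talk) reproduces them from the quoted sky area, pixel scale, and bytes per pixel.

```python
# Reproduce the GVO size estimates quoted above from the stated assumptions.
ARCSEC_PER_DEG = 3600.0

sky_area_deg2 = 40_000           # all-sky coverage
pixel_scale_arcsec = 0.1         # resolution < 0.1 arcsec
bytes_per_pixel = 2              # one band, 2 bytes/pixel

pixels = sky_area_deg2 * (ARCSEC_PER_DEG / pixel_scale_arcsec) ** 2
one_band_tb = pixels * bytes_per_pixel / 1e12

print(f"pixels: {pixels:.2e}")             # ~5.2e13, i.e. > 50 trillion pixels
print(f"one band: ~{one_band_tb:.0f} TB")  # ~100 Terabytes
```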


  • Sloan Digital Sky Survey Data Grid


    Distributed collaboration: Fermilab, U. Washington, U. Chicago, USNO, JHU, NMSU, Apache Point Observatory, Institute for Advanced Study, Princeton U., Japan
    Networks: vBNS, Abilene, ESnet

  • LIGO (Gravity Wave) Data Grid
    (Figure: Hanford Observatory, Livingston Observatory, Caltech, MIT, and a Tier1 center, connected by OC3, OC12, and OC48 links)


  • Data Grid Projects


  • Large Data Grid Projects
    Funded projects:
      GriPhyN        USA   NSF     $11.9M + $1.6M   2000-2005
      EU DataGrid    EU    EC      €10M             2001-2004
      PPDG           USA   DOE     $9.5M            2001-2004
      TeraGrid       USA   NSF     $53M             2001-?
      iVDGL          USA   NSF     $13.7M + $2M     2001-2006
      DataTAG        EU    EC      €4M              2002-2004
    Proposed projects:
      GridPP         UK    PPARC   >$15M?           2001-2004
    Many national projects: initiatives in US, UK, Italy, France, NL, Germany, Japan; EU networking initiatives (GÉANT, SURFnet)


  • PPDG Middleware Components
    Object- and file-based application services (request interpreter)
    Cache manager; file access service (request planner); matchmaking service; cost estimation
    File fetching service; file replication index; file movers
    End-to-end network services; mass storage manager; resource management
    (Components span a site boundary / security domain)
    Future: OO-collection export; cache and state tracking; prediction


  • EU DataGrid Project


    Work Package   Work Package Title                         Lead Contractor
    WP1            Grid Workload Management                   INFN
    WP2            Grid Data Management                       CERN
    WP3            Grid Monitoring Services                   PPARC
    WP4            Fabric Management                          CERN
    WP5            Mass Storage Management                    PPARC
    WP6            Integration Testbed                        CNRS
    WP7            Network Services                           CNRS
    WP8            High Energy Physics Applications           CERN
    WP9            Earth Observation Science Applications     ESA
    WP10           Biology Science Applications               INFN
    WP11           Dissemination and Exploitation             INFN
    WP12           Project Management                         CERN

  • GriPhyN: PetaScale Virtual-Data Grids
    Users: individual investigators, workgroups, production teams (via interactive user tools)
    Virtual data tools; request planning & scheduling tools; request execution & management tools
    Supporting services: resource management services; security and policy services; other Grid services
    Transforms over distributed resources (code, storage, CPUs, networks) and a raw data source
    Scale: ~1 Petaflop, ~100 Petabytes


  • GriPhyN Research Agenda
    Virtual Data technologies (see figure)
    Derived data, calculable via an algorithm
    Instantiated 0, 1, or many times (e.g., in caches)
    Fetch value vs. execute algorithm
    Very complex (versions, consistency, cost calculation, etc.)
    LIGO example: "Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year"
    For each requested data value, need to:
      Locate the item's location and algorithm
      Determine costs of fetching vs. calculating
      Plan data movements & computations required to obtain results
      Execute the plan
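    The "fetch value vs. execute algorithm" decision at the heart of virtual data can be summarized in a few lines of code. The following is a self-contained toy sketch of that idea only, not GriPhyN's actual virtual-data machinery; the catalog layout, cost numbers, and dataset name are hypothetical.

```python
# Toy sketch (not GriPhyN code) of the virtual-data decision described above:
# a derived data product is fetched from an existing replica or re-derived by
# executing its transformation, whichever is estimated to be cheaper.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Replica:
    site: str
    size_gb: float
    bandwidth_mb_s: float                       # achievable rate from that site

    def fetch_cost_s(self) -> float:
        return self.size_gb * 1024 / self.bandwidth_mb_s

def materialize(name: str,
                replicas: Dict[str, List[Replica]],
                transforms: Dict[str, Callable[[], str]],
                compute_cost_s: Dict[str, float]) -> str:
    """Return a derived data product, fetching or recomputing as appropriate."""
    copies = replicas.get(name, [])                       # 0, 1, or many instances
    fetch = min((r.fetch_cost_s() for r in copies), default=float("inf"))
    compute = compute_cost_s.get(name, float("inf"))

    if fetch <= compute:                                  # plan: move the data
        best = min(copies, key=Replica.fetch_cost_s)
        return f"fetched {name} from {best.site} in ~{best.fetch_cost_s():.0f}s"
    result = transforms[name]()                           # plan: run the algorithm
    replicas.setdefault(name, []).append(                 # register the new instance
        Replica(site="local-cache", size_gb=1.0, bandwidth_mb_s=1000.0))
    return result

if __name__ == "__main__":
    catalog = {"strain-GRB-017": [Replica("tier2.example.edu", 1.0, 5.0)]}
    derive = {"strain-GRB-017": lambda: "recomputed strain-GRB-017 locally"}
    print(materialize("strain-GRB-017", catalog, derive, {"strain-GRB-017": 120.0}))
```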


  • Virtual Data in Action
    A data request may: compute locally; compute remotely; access local data; access remote data
    Scheduling based on: local policies; global policies; cost
    Resource hierarchy: major facilities, archives; regional facilities, caches; local facilities, caches


  • GriPhyN/PPDG Data Grid Architecture
    Application -> Planner -> Executor (via DAGs), using Catalog Services, Information Services, Policy/Security, Monitoring, Replica Management, and a Reliable Transfer Service, over Compute and Storage Resources
    (An initial solution is operational)


  • Catalog Architecture
    Metadata Catalog: maps object names to logical object names (X -> logO1, Y -> logO2, F.X -> logO3, G(1).Y -> logO4), giving transparency with respect to location
    Replica Catalog: maps logical container names to physical file names/URLs (logC1 -> URL1; logC2 -> URL2, URL3; logC3 -> URL4; logC4 -> URL5, URL6)
    Physical file storage: URLs give the physical file locations
    (The figure also shows GCMS components mediating between the catalogs)
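    To make the two-step lookup concrete, here is a small illustrative sketch (not the actual catalog services): an object name is resolved through the metadata catalog to a logical name, and its logical container is then resolved through the replica catalog to one or more physical URLs. The pairing of logical objects to containers is assumed from the figure.

```python
# Illustrative two-level catalog lookup (not the real Grid catalog services).
# Names and URLs mirror the figure above: object name -> logical name -> physical URLs.
metadata_catalog = {"X": "logO1", "Y": "logO2", "F.X": "logO3", "G(1).Y": "logO4"}
container_of = {"logO1": "logC1", "logO2": "logC2",             # assumed pairing of
                "logO3": "logC3", "logO4": "logC4"}             # objects to containers
replica_catalog = {"logC1": ["URL1"], "logC2": ["URL2", "URL3"],
                   "logC3": ["URL4"], "logC4": ["URL5", "URL6"]}

def resolve(object_name: str) -> list:
    """Return all physical replicas of an object, hiding where the data lives."""
    logical_object = metadata_catalog[object_name]        # metadata catalog lookup
    logical_container = container_of[logical_object]      # container holding the object
    return replica_catalog[logical_container]             # replica catalog lookup

print(resolve("F.X"))     # ['URL4'] -- location transparency for the caller
```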


  • Early GriPhyN Challenge Problem: CMS Data Reconstruction (April 2001)
    Sites: Caltech workstation, Wisconsin Condor pool, NCSA Linux cluster, NCSA UniTree (a GridFTP-enabled FTP server)
    Master Condor job running at Caltech
    2) Launch secondary job on the Wisconsin pool; input files via Globus GASS
    3) 100 Monte Carlo jobs run on the Wisconsin Condor pool
    4) 100 data files transferred via GridFTP, ~1 GB each
    5) Secondary job reports completion to the master
    6) Master starts reconstruction jobs via the Globus jobmanager on the cluster
    7) GridFTP fetches data from UniTree
    8) Processed Objectivity database stored to UniTree
    9) Reconstruction job reports completion to the master


  • Trace of a Condor-G Physics Run


  • iVDGL: A World Grid Laboratory
    International Virtual-Data Grid Laboratory:
      A global Grid laboratory (US, EU, Asia, ...)
      A place to conduct Data Grid tests at scale
      A mechanism to create common Grid infrastructure
      A facility to perform production exercises for LHC experiments
      A laboratory for other disciplines to perform Data Grid tests
    US part funded by NSF on Sep. 25, 2001: $13.65M + $2M
    "We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science."

    From NSF proposal, 2001


  • iVDGL Summary Information
    Principal components: Tier1 sites (laboratories); Tier2 sites (universities); selected Tier3 sites (universities); fast networks (US, Europe, transatlantic, transpacific); Grid Operations Center (GOC); computer science support teams (6 UK Fellows); coordination, management
    Proposed international participants: initially US, EU, Japan, Australia; other world regions later; discussions with Russia, China, Pakistan, India, Brazil
    Complementary EU project: DataTAG, a transatlantic network from CERN to STAR-TAP (+ people), initially 2.5 Gb/s


  • US iVDGL Proposal Participants
    Tier2 / software:
      U Florida                CMS
      Caltech                  CMS, LIGO
      UC San Diego             CMS, CS
      Indiana U                ATLAS, iGOC
      Boston U                 ATLAS
      U Wisconsin, Milwaukee   LIGO
      Penn State               LIGO
      Johns Hopkins            SDSS, NVO
    CS support:
      U Chicago                CS
      U Southern California    CS
      U Wisconsin, Madison     CS
    Tier3 / outreach:
      Salish Kootenai          Outreach, LIGO
      Hampton U                Outreach, ATLAS
      U Texas, Brownsville     Outreach, LIGO
    Tier1 / labs (not funded):
      Fermilab                 CMS, SDSS, NVO
      Brookhaven               ATLAS
      Argonne Lab              ATLAS, CS


  • Initial US-iVDGL Data Grid
    Initial sites: Caltech/UCSD, Florida, Wisconsin, Fermilab, BNL, Indiana, BU, Michigan
    Other sites to be added in 2002: SKC, Brownsville, Hampton, PSU


  • iVDGL Map (2002-2003)
    (Map of iVDGL sites; transatlantic connectivity via DataTAG and SURFnet)


  • Infrastructure Data Grid Projects
    GriPhyN (US, NSF): Petascale Virtual-Data Grids, http://www.griphyn.org/
    Particle Physics Data Grid (US, DOE): Data Grid applications for HENP, http://www.ppdg.net/
    European Data Grid (EC, EU): Data Grid technologies, EU deployment, http://www.eu-datagrid.org/
    TeraGrid Project (US, NSF): distributed supercomputing resources (13 TFlops), http://www.teragrid.org/
    iVDGL + DataTAG (NSF, EC, others): global Grid lab & transatlantic network
    Collaborations of application scientists & computer scientists
    Focus on infrastructure development & deployment
    Broad application


  • Data Grid Project Timeline


  • Need for Common Grid Infrastructure
    Grid computing is sometimes compared to the electric grid: you plug in to get a resource (CPU, storage, ...); you don't care where the resource is located
    Want to avoid this situation in Grid computing
    The analogy is more appropriate than originally intended: it expresses a USA viewpoint of a uniform power grid
    What happens when you travel around the world?
      Different frequencies: 60 Hz, 50 Hz
      Different voltages: 120 V, 220 V
      Different sockets! USA 2-pin, France, UK, etc.


  • Role of Grid Infrastructure
    Provide essential common Grid services: cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.)
    Meet the needs of high-end scientific & engineering collaborations: HENP, astrophysics, GVO, earthquake, climate, space, biology, ...; already international and even global in scope; drive future requirements
    Be broadly applicable outside science: government agencies (national, regional (EU), UN); non-governmental organizations (NGOs); corporations, business networks (e.g., suppliers, R&D); other virtual organizations (see "Anatomy of the Grid")
    Be scalable to the global level


  • Grid Coordination Efforts
    Global Grid Forum (GGF), www.gridforum.org: international forum for general Grid efforts; many working groups, standards definitions; next meeting in Toronto, Feb. 17-20
    HICB (high energy physics): represents HEP collaborations, primarily LHC experiments; joint development & deployment of Data Grid middleware (GriPhyN, PPDG, TeraGrid, iVDGL, EU DataGrid, LCG, DataTAG, CrossGrid); common testbed, open-source software model; several meetings so far
    New infrastructure Data Grid projects? Fold into the existing Grid landscape (primarily US + EU)


  • Summary
    Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing

    The iVDGL will provide vast experience for new collaborations

    Many challenges during the coming transition
    New Grid projects will provide rich experience and lessons
    Difficult to predict the situation even 3-5 years ahead


  • Grid References
    Grid Book: www.mkp.com/grids
    Globus: www.globus.org
    Global Grid Forum: www.gridforum.org
    TeraGrid: www.teragrid.org
    EU DataGrid: www.eu-datagrid.org
    PPDG: www.ppdg.net
    GriPhyN: www.griphyn.org
    iVDGL: www.ivdgl.org


    We define Grid architecture in terms of a layered collection of protocols.
    The fabric layer includes the protocols and interfaces that provide access to the resources being shared: computers, storage systems, datasets, programs, and networks. This layer is a logical view rather than a physical one. For example, the view of a cluster with a local resource manager is defined by the local resource manager, not by the cluster hardware; likewise, the fabric provided by a storage system is defined by the file system available on that system, not by the raw disks or tapes.
    The connectivity layer defines the core protocols required for Grid-specific network transactions. It includes the IP protocol stack (system-level application protocols such as DNS, RSVP, and routing, plus the transport and internet layers) as well as the core Grid security protocols for authentication and authorization.
    The resource layer defines protocols to initiate and control the sharing of (local) resources. Services defined at this level include the gatekeeper and GRIS, along with some user-oriented application protocols from the Internet protocol suite, such as file transfer.
    The collective layer defines protocols that provide system-oriented capabilities expected to be wide-scale in deployment and generic in function, including GIIS, bandwidth brokers, and resource brokers.
    The application layer defines protocols and services that are parochial in nature, targeted toward a specific application domain or class of applications.
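    As a compact restatement of these notes (added for illustration only), the sketch below maps each layer to the example protocols and services named above; it is a summary in code form, not an implementation.

```python
# Summary of the layered Grid architecture described in the notes above,
# listing example protocols/services named in the text for each layer.
GRID_LAYERS = {
    "fabric":       ["local resource managers", "file systems", "networks"],
    "connectivity": ["IP stack (DNS, RSVP, routing, transport)", "GSI authentication/authorization"],
    "resource":     ["gatekeeper", "GRIS", "file transfer (e.g. GridFTP)"],
    "collective":   ["GIIS", "bandwidth brokers", "resource brokers"],
    "application":  ["domain-specific services (e.g. HEP analysis, virtual observatory tools)"],
}

for layer, examples in GRID_LAYERS.items():
    print(f"{layer:>12}: {', '.join(examples)}")
```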

