Distributed Data Access and Analysis for Next Generation HENP Experiments
Harvey Newman, Caltech
CHEP 2000, Padova, February 10, 2000
LHC Computing: Different from Previous Experiment Generations
Geographical dispersion: of people and of resources
Complexity: the detector and the LHC environment
Scale: Petabytes per year of data
~5000 physicists, 250 institutes, ~50 countries
Major challenges associated with:
Coordinated use of distributed computing resources
Remote software development and physics analysis
Communication and collaboration at a distance
R&D: a new form of distributed system, the Data Grid
Four Experiments: The Petabyte to Exabyte Challenge
ATLAS, CMS, ALICE, LHCb: Higgs and new particles; quark-gluon plasma; CP violation
Data written to "tape": ~5 Petabytes/year and up (1 PB = 10^15 Bytes)
0.1 to 1 Exabyte (1 EB = 10^18 Bytes) (~2010) (~2020?), total for the LHC experiments
To Solve: the LHC "Data Problem"
While the proposed LHC computing and data handling facilities are large by present-day standards, they will not support FREE access, transport or processing for more than a minute part of the data
A balance must be struck between proximity to large computational and data handling facilities, and proximity to end users and more local resources for frequently accessed datasets
Strategies must be studied and prototyped, to ensure both acceptable turnaround times and efficient resource utilisation
Problems to be explored:
How to meet the demands of hundreds of users who need transparent access to local and remote data, in disk caches and tape stores
How to prioritise hundreds of requests from local and remote communities, consistent with local and regional policies
How to ensure that the system is dimensioned, used and managed optimally, for the mixed workload
MONARC General Conclusions on LHC Computing
Following discussions of computing and network requirements, technology evolution and projected costs, support requirements, etc.:
The scale of LHC computing requires a worldwide effort to accumulate the necessary technical and financial resources
A distributed hierarchy of computing centres will lead to better use of the financial and manpower resources of CERN, the Collaborations, and the nations involved than a highly centralized model focused at CERN
The distributed model also provides better use of physics opportunities at the LHC by physicists and students
At the top of the hierarchy is the CERN Centre, with the ability to perform all analysis-related functions, but not the capacity to do them completely
At the next step in the hierarchy is a collection of large, multi-service "Tier1 Regional Centres", each with 10-20% of the CERN capacity devoted to one experiment
There will be Tier2 or smaller special-purpose centres in many regions
Bandwidth Requirements Estimate (Mbps) [*] - ICFA Network Task Force

                                                    1998                2000            2005
BW utilized per physicist (and peak BW used)        0.05-0.25 (0.5-2)   0.2-2 (2-10)    0.8-10 (10-100)
BW utilized by a university group                   0.25-10             1.5-45          34-622
BW to a home laboratory or regional center          1.5-45              34-155          622-5000
BW to a central laboratory housing one or more
  major experiments                                 34-155              155-622         2500-10000
BW on a transoceanic link                           1.5-20              34-155          622-5000
[*] See http://l3www.cern.ch/~newman/icfareq98.html
Circa 2000, the predictions are roughly on track: "universal" BW growth of ~2X per year;
622 Mbps links, European and transatlantic, by ~2002-3; Terabit/sec US backbones (e.g. ESNet) by ~2003-5
Caveats: distinguish raw bandwidth from effective line capacity, and from the maximum end-to-end rate for individual data flows; "QoS"/IP has a way to go
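As an added, illustrative consistency check (not part of the original table): starting from the ~155 Mbps upper end of the year-2000 transoceanic figures, the "~2X per year" growth rule gives
\[
155~\mathrm{Mbps}\times 2^{\,(2002-2000)} \;\approx\; 620~\mathrm{Mbps},
\]
which matches the ~622 Mbps European and transatlantic links expected by ~2002-3.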
CMS Analysis and Persistent Object Store
On-demand object creation
Data organized in a(n object) "hierarchy": Raw, Reconstructed (ESD), Analysis Objects (AOD), Tags
Data distribution:
All raw and reconstructed data, and the master parameter DBs, at CERN
All event TAGs and AODs, and selected reconstructed data sets, at each regional centre
HOT data (frequently accessed) moved to RCs
Goal of location and medium transparency
[Figure: CMS dataflow into the persistent object store. The online system (L1, L2/L3 and "L4" filtering, slow control, detector monitoring) feeds common filters and pre-emptive object creation; the persistent object store (an object database management system) then serves offline filtering, simulation, calibrations, group analyses and user analysis.]
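A minimal Java sketch of "on demand object creation" across the Raw -> ESD -> AOD hierarchy, added for illustration; the class names are invented, not the CMS persistent classes, and the reconstruction/summarizing steps are stand-ins. Each derived view is built only when first requested and cached thereafter.

```java
// Illustrative only: lazy ("on demand") creation of derived event objects.
// RawEvent/EsdEvent/AodEvent are invented stand-ins, not CMS classes.
class RawEvent { final byte[] detectorData;  RawEvent(byte[] d)   { detectorData = d; } }
class EsdEvent { final double[] hits;        EsdEvent(double[] h) { hits = h; } }
class AodEvent { final double summaryValue;  AodEvent(double v)   { summaryValue = v; } }

class Event {
    private final RawEvent raw;   // always persistent (kept at CERN in the CMS model)
    private EsdEvent esd;         // built (or fetched) only when first needed
    private AodEvent aod;

    Event(RawEvent raw) { this.raw = raw; }

    EsdEvent esd() {              // on-demand creation of the ESD view
        if (esd == null) esd = reconstruct(raw);
        return esd;
    }

    AodEvent aod() {              // AOD derived from the ESD, again on demand
        if (aod == null) aod = summarize(esd());
        return aod;
    }

    // Stand-in "reconstruction": real code would run pattern recognition etc.
    private static EsdEvent reconstruct(RawEvent r) {
        return new EsdEvent(new double[r.detectorData.length]);
    }

    // Stand-in "summary": real code would compute physics quantities.
    private static AodEvent summarize(EsdEvent e) {
        double sum = 0.0;
        for (double h : e.hits) sum += h;
        return new AodEvent(sum);
    }
}
```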
GIOD Summary
[Figure: hit, track and detector objects (Java 3D visualization).]
GIOD has:
Constructed a Terabyte-scale set of fully simulated CMS events, and used these to create a large OO database
Learned how to create large database federations
Completed the "100" (to 170) MByte/sec CMS milestone
Developed prototype reconstruction and analysis codes, and Java 3D OO visualization demonstrators, that work seamlessly with persistent objects over networks
Deployed facilities and database federations as useful testbeds for Computing Model studies
Data Grid Hierarchy (CMS Example)
[Figure: the Tier 0 - Tier 4 hierarchy and its links.
Tier 0: the CERN Computer Centre with the Offline Farm (~20 TIPS), fed by the Online System at ~100 MBytes/sec (detector output ~PBytes/sec; one bunch crossing per 25 ns, 100 triggers per second, ~1 MByte per event).
Tier 1: Regional Centres such as Fermilab (~4 TIPS) and the France, Italy and Germany Regional Centres, connected to CERN by ~2.4 Gbits/sec and ~622 Mbits/sec links (or by air freight).
Tier 2: centres of ~1 TIPS each, connected at ~622 Mbits/sec.
Tier 3: institutes (~0.25 TIPS) with physics data caches, connected at 100-1000 Mbits/sec.
Tier 4: physicists' workstations.
1 TIPS = 25,000 SpecInt95; a PC (today) = 10-15 SpecInt95.
Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.]
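A back-of-the-envelope conversion, added here using only the numbers quoted in the figure:
\[
1~\mathrm{TIPS} = 25{,}000~\mathrm{SpecInt95} \;\approx\; \frac{25{,}000}{10\text{-}15~\mathrm{SpecInt95/PC}} \;\approx\; 1{,}700\text{-}2{,}500~\text{(year-2000) PCs},
\]
so the ~20 TIPS CERN offline farm corresponds to roughly 34,000-50,000 such PCs.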
LHC (and HEP) Challenges of Petabyte-Scale Data
Technical requirements:
Optimize the use of resources with next-generation middleware
Co-locate and co-schedule resources and requests
Enhance database systems to work seamlessly across networks: caching, replication, mirroring
Balance proximity to centralized facilities against proximity to end users for frequently accessed data
Requirements of the worldwide collaborative nature of the experiments:
Make appropriate use of data analysis resources in each world region, conforming to local and regional policies
Involve scientists and students in each world region in front-line physics research, through an integrated collaborative environment
Time-Scale: CMS Recent "Events"
A PHASE TRANSITION in our understanding of the role of CMS Software and Computing occurred in October - November 1999:
"Strong coupling" of the S&C task, Trigger/DAQ, the Physics TDR, detector performance studies and other main milestones
Integrated CMS Software and Trigger/DAQ planning for the next round: the May 2000 milestone
Large simulated samples are required: ~1 million events fully simulated, a few times during 2000, in ~1 month
A smoothly rising curve of computing and data handling needs from now on
Mock Data Challenges from 2000 (1% scale) to 2005
Users want substantial parts of the functionality formerly planned for 2005, starting now
Roles of Projects for HENP Distributed Analysis
RD45, GIOD: networked object databases
Clipper/GC, FNAL/SAM: high-speed access to object or file data for processing and analysis
SLAC/OOFS: distributed file system + Objectivity interface
NILE, Condor: fault-tolerant distributed computing with heterogeneous CPU resources
MONARC: LHC Computing Models: architecture, simulation, strategy, politics
PPDG: first distributed data services and Data Grid system prototype
ALDAP: database structures and access methods for astrophysics and HENP data
GriPhyN: production-scale Data Grid
APOGEE: simulation/modeling, application and network instrumentation, system optimization/evaluation
MONARC: Common Project
Models Of Networked Analysis At Regional Centres
Caltech, CERN, Columbia, FNAL, Heidelberg, Helsinki, INFN, IN2P3, KEK, Marseilles, MPI Munich, Orsay, Oxford, Tufts
PROJECT GOALS
Develop "Baseline Models"
Specify the main parameters characterizing the Model's performance: throughputs, latencies
Verify resource requirement baselines (computing, data handling, networks)
TECHNICAL GOALS
Define the Analysis Process
Define RC Architectures and Services
Provide Guidelines for the final Models
Provide a Simulation Toolset for further Model studies
[Figure: Model circa 2005. CERN: 350k SI95, 350 TBytes of disk, tape robot. A Tier1 centre such as FNAL/BNL: 70k SI95, 70 TBytes of disk, robot. A Tier2 centre: 20k SI95, 20 TB of disk, robot. Universities (Univ 1, Univ 2, ..., Univ M) and the centres are interconnected by 622 Mbits/s links (N x 622 Mbits/s where indicated).]
MONARC Working Groups/Chairs
"Analysis Process Design": P. Capiluppi (Bologna, CMS)
"Architectures": Joel Butler (FNAL, CMS)
"Simulation": Krzysztof Sliwa (Tufts, ATLAS)
"Testbeds": Lamberto Luminari (Rome, ATLAS)
"Steering" & "Regional Centres Committee": Laura Perini (Milan, ATLAS), Harvey Newman (Caltech, CMS)
MONARC Architectures WG: Regional Centre Facilities & Services
Regional Centres should provide:
All technical and data services required to do physics analysis
All Physics Objects, Tags and Calibration data
A significant fraction of the raw data
Caching or mirroring of calibration constants
Excellent network connectivity to CERN and the region's users
Manpower to share in the development of common validation and production software
A fair share of post- and re-reconstruction processing
Manpower to share in ongoing work on common R&D projects
Excellent support services for training, documentation and troubleshooting at the Centre or at remote sites served by it
Service to members of other regions
A long-term commitment for staffing, hardware evolution and support for R&D, as part of the distributed data analysis architecture
MONARC and Regional Centres
MONARC RC Forum: representative meetings quarterly
Regional Centre planning is well advanced, with an optimistic outlook, in the US (FNAL for CMS; BNL for ATLAS), France (CCIN2P3), Italy and the UK; proposals were submitted in late 1999 or early 2000
Active R&D and prototyping is underway, especially in the US, Italy and Japan; also in the UK (LHCb), Russia (MSU, ITEP) and Finland (HIP)
Discussions in the national communities are also underway in Japan, Finland, Russia and Germany
There is a near-term need to understand the level and sharing of support for LHC computing between CERN and the outside institutes, to enable the planning in several countries to advance
MONARC uses the traditional 1/3 : 2/3 sharing assumption
Regional Centre Architecture: Example by I. Gaines (MONARC)
[Figure: block diagram of a Regional Centre.
Inputs: network from CERN; network from Tier 2 and simulation centres; tapes.
Core services: tape mass storage and disk servers; database servers.
Processing streams: production reconstruction (Raw/Sim -> ESD; scheduled, predictable; experiment/physics groups), production analysis (ESD -> AOD, AOD -> DPD; scheduled; physics groups), and individual analysis (AOD -> DPD and plots; chaotic; physicists on desktops).
Support services: physics software development, R&D systems and testbeds; info servers and code servers; web servers and telepresence servers; training, consulting and help desk.
Outputs: to Tier 2 centres, local institutes, CERN, and tapes.]
Data Grid: Tier2 Layer
Create an ensemble of (university-based) Tier2 Data Analysis Centres, with site architectures complementary to the major Tier1 lab-based centres:
Medium-scale Linux CPU farm, Sun data server, RAID disk array
Less need for 24 x 7 operation; some lower component costs
Less production-oriented, to respond to local and regional analysis priorities and needs
Supportable by a small local team and physicists' help
One Tier2 Centre in each region (e.g. of the US):
Catalyze local and regional focus on particular sets of physics goals
Encourage coordinated analysis developments emphasizing particular physics aspects or subdetectors; example: CMS EMU in the Southwest US
Emphasis on training and on the involvement of students at universities in front-line data analysis and physics results
Include a high-quality environment for desktop remote collaboration
MONARC Analysis Process Example
[Figure: analysis process flow, starting from the DAQ/RAW and Slow Control/Calibration data.]
MONARC Analysis Model Baseline: ATLAS or CMS "Typical" Tier1 RC

CPU power               ~100 kSI95
Disk space              ~100 TB
Tape capacity           300 TB, 100 MB/sec
Link speed to Tier2     10 MB/sec (1/2 of 155 Mbps)

Raw data            1%     10-15 TB/year
ESD data            100%   100-150 TB/year
Selected ESD        25%    5 TB/year      [*]
Revised ESD         25%    10 TB/year     [*]
AOD data            100%   2 TB/year      [**]
Revised AOD         100%   4 TB/year      [**]
TAG/DPD             100%   200 GB/year
Simulated data      25%    25 TB/year     (repository)

[*] Covering five analysis groups, each selecting ~1% of the annual ESD or AOD data for a typical analysis
[**] Covering all analysis groups
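A quick arithmetic check of footnote [*], added for clarity using only the table's numbers:
\[
5~\text{groups}\;\times\;1\%\;\times\;(100\text{-}150)~\mathrm{TB/yr}\;\approx\;5\text{-}7.5~\mathrm{TB/yr},
\]
consistent with the ~5 TB/year quoted for Selected ESD.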
MONARC Testbeds WG: Isolation of Key Parameters
Some parameters have been measured, installed in the MONARC simulation models, and used in the first-round validation of the models:
The Objectivity AMS response time-function, and its dependence on
object clustering, page size, data class-hierarchy and access pattern
mirroring and caching (e.g. with the Objectivity DRO option)
Scalability of the system under "stress":
performance as a function of the number of jobs, relative to the single-job performance
Performance and bottlenecks for a variety of data access patterns
Tests over LANs and WANs
MONARC Testbeds WG
Test-bed configuration defined and widely deployed
"Use Case" applications using Objectivity:
GIOD/JavaCMS, CMS test beams, ATLASFAST++, the ATLAS 1 TB milestone
Both LAN and WAN tests
ORCA4 (CMS): the first "production" application
Realistic data access patterns
Disk/HPSS
"Validation" milestone carried out, with the Simulation WG
MONARC Testbed Systems
Multitasking Processing Model
Concurrently running tasks share resources (CPU, memory, I/O)
"Interrupt"-driven scheme: for each new task, or when one task finishes, an interrupt is generated and all "processing times" are recomputed
It provides:
an easy way to apply different load-balancing schemes
an efficient mechanism to simulate multitask processing
A Java 2-based, CPU- and code-efficient simulation for distributed systems has been developed, using process-oriented discrete event simulation (see the sketch below)
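The following is a minimal, self-contained Java sketch of such an interrupt-driven, processor-sharing scheme. It is not the MONARC simulation code; the class names and the equal-share load-balancing rule are illustrative assumptions.

```java
// Minimal sketch (not the MONARC code): tasks share a CPU, and their
// remaining times are recomputed at every "interrupt" (arrival or finish).
import java.util.*;

class Task {
    final String name;
    double remainingWork;          // CPU work still to do, in SI95*seconds
    Task(String name, double work) { this.name = name; this.remainingWork = work; }
}

class SharedCpu {
    private final double power;    // total CPU power, in SI95
    private final List<Task> active = new ArrayList<>();
    private double now = 0.0;      // simulated time, in seconds

    SharedCpu(double power) { this.power = power; }

    // Advance simulated time, charging each active task its share of the CPU.
    private void advanceTo(double t) {
        if (active.isEmpty() || t <= now) { now = Math.max(now, t); return; }
        double share = power / active.size();          // equal-share load balancing
        for (Task task : active) task.remainingWork -= share * (t - now);
        now = t;
    }

    // "Interrupt": a new task arrives; from here on the per-task CPU share,
    // and hence every remaining processing time, is different.
    void submit(double arrivalTime, Task t) {
        advanceTo(arrivalTime);
        active.add(t);
    }

    // Run until all tasks finish; each completion is the next interrupt.
    void runToCompletion() {
        while (!active.isEmpty()) {
            double share = power / active.size();
            Task next = Collections.min(active, Comparator.comparingDouble(t -> t.remainingWork));
            double finishTime = now + next.remainingWork / share;
            advanceTo(finishTime);
            active.remove(next);
            System.out.printf("t=%.1fs  %s finished%n", now, next.name);
        }
    }
}

public class MultitaskDemo {
    public static void main(String[] args) {
        SharedCpu cpu = new SharedCpu(100.0);          // e.g. a 100 SI95 node
        cpu.submit(0.0, new Task("reco-job", 500.0));
        cpu.submit(2.0, new Task("analysis-job", 100.0));
        cpu.runToCompletion();
    }
}
```

In this example the shorter job finishes at t = 4 s and the longer one at t = 6 s; the interrupts are the submission at t = 2 s and each completion, at which point all processing times are recomputed.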
Role of Simulation for Distributed Systems
Simulations are widely recognized and used as essential tools for the design, performance evaluation and optimisation of complex distributed systems
From battlefields to agriculture; from the factory floor to telecommunications systems
Discrete event simulations with an appropriate and high level of abstraction
Just beginning to be part of the HEP culture
Some experience in trigger, DAQ and tightly coupled computing systems: CERN CS2 models (event-oriented)
MONARC (process-oriented; Java 2 threads + class library)
These simulations are very different from HEP "Monte Carlos": "time" intervals and interrupts are the essentials
Simulation is a vital part of the study of site architectures, network behavior, and data access/processing/delivery strategies, for HENP Grid design and optimization
Example: Physics Analysis at Regional Centres
Similar data processing jobs are performed in each of several RCs
Each Centre has the "TAG" and "AOD" databases replicated
The Main Centre provides the "ESD" and "RAW" data
Each job processes AOD data, and also a fraction of the ESD and RAW data
Example: Physics Analysis
Simple Validation Measurements: The AMS Data Access Case
[Figure, left: mean time per job (ms) versus number of concurrent jobs (up to ~32), for raw-data DB access over a LAN from a 4-CPU client, comparing simulation and measurement.
Figure, right: distribution of the 32 jobs' processing times on monarc01; simulation mean 109.5, measurement mean 114.3.]
MONARC Phase 3
Involving CMS, ATLAS, LHCb and ALICE
Timely and useful impact:
Facilitate the efficient planning and design of mutually compatible site and network architectures and services, among the experiments, the CERN Centre and the Regional Centres
Provide modelling consultancy and service to the experiments and Centres
Provide a core of advanced R&D activities, aimed at LHC computing system optimisation and production prototyping
Take advantage of work on distributed data-intensive computing for HENP this year in other "next generation" projects [*], for example PPDG
MONARC Phase 3
Technical goal: system optimisation (maximise throughput and/or reduce long turnaround)
Phase 3 system design elements:
RESILIENCE, resulting from flexible management of each data transaction, especially over WANs
SYSTEM STATE & PERFORMANCE TRACKING, to match and co-schedule requests and resources, and to detect or predict faults
FAULT TOLERANCE, resulting from robust fall-back strategies to recover from bottlenecks or abnormal conditions
Base developments on large-scale testbed prototypes at every stage: for example ORCA4
[*] See H. Newman, http://www.cern.ch/MONARC/progress_report/longc7.html
MONARC Status
MONARC is well on its way to specifying baseline Models representing cost-effective solutions to LHC Computing
Discussions have shown that LHC computing has a new scale and level of complexity
A Regional Centre hierarchy of networked centres appears to be the most promising solution
A powerful simulation system has been developed, and is a very useful toolset for further model studies
Synergy with other advanced R&D projects has been identified
Important information and example Models have been provided, timely for the Hoffmann Review and the discussions of LHC Computing over the next months
MONARC Phase 3 has been proposed: based on prototypes, with increasing detail and realism, and coupled to the Mock Data Challenges in 2000
The Particle Physics Data Grid (PPDG)
Coordinated reservation/allocation techniques; integrated instrumentation; DiffServ
First-year goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of up to one Petabyte
DoE/NGI Next Generation Internet project: ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
[Figure: two PPDG services.
Site-to-Site Data Replication Service, at 100 MBytes/sec, from a primary site (data acquisition, CPU, disk, tape robot) to a secondary site (CPU, disk, tape robot).
Multi-Site Cached File Access Service, linking a primary site (DAQ, tape, CPU, disk, robot), satellite sites (tape, CPU, disk, robot) and universities (CPU, disk, users).]
PPDG: Architecture for Reliable High-Speed Data Delivery
[Figure: component diagram. Object-based and file-based application services; cache manager; file access service; matchmaking service; cost estimation; file fetching service; file replication index; end-to-end network services; mass storage manager; resource management; and file movers on either side of the site boundary / security domain.]
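A minimal Java sketch of how the services named in the diagram might compose for a single cached read: the replication index lists replicas, the cost estimator prices each, a trivial matchmaker picks the cheapest, and the file mover populates the cache. The interfaces, method names and the "/cache/" path are hypothetical, not the PPDG APIs.

```java
// Illustrative sketch only: a PPDG-style cached file read.
import java.util.*;

interface ReplicaIndex  { List<String> replicasOf(String logicalFile); }            // File Replication Index
interface CostEstimator { double estimateSeconds(String replicaUrl, long bytes); }  // Cost Estimation
interface FileMover     { void fetch(String replicaUrl, String localPath); }        // File Fetching / File Mover

class CacheManager {
    private final Map<String, String> cache = new HashMap<>();  // logical name -> local path
    private final ReplicaIndex index;
    private final CostEstimator cost;
    private final FileMover mover;

    CacheManager(ReplicaIndex i, CostEstimator c, FileMover m) { index = i; cost = c; mover = m; }

    // Matchmaking reduced to its simplest form: pick the replica with the
    // lowest estimated delivery time, fetch it, and remember the local copy.
    String open(String logicalFile, long bytes) {
        String cached = cache.get(logicalFile);
        if (cached != null) return cached;                       // cache hit: no WAN traffic
        String best = index.replicasOf(logicalFile).stream()
                .min(Comparator.comparingDouble(r -> cost.estimateSeconds(r, bytes)))
                .orElseThrow(() -> new NoSuchElementException(logicalFile));
        String local = "/cache/" + logicalFile;
        mover.fetch(best, local);
        cache.put(logicalFile, local);
        return local;
    }
}
```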
Distributed Data Delivery and LHC Software Architecture
Software architectural choices:
Traditional, single-threaded applications
Allow for data arrival and reassembly
OR
Performance-oriented (complex):
I/O requests up front; multi-threaded; data-driven; respond to an ensemble of (changing) cost estimates (see the sketch below)
Possible code movement as well as data movement
Loosely coupled, dynamic
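A small Java illustration of the performance-oriented style (the names and the simulated remote read are invented for this sketch, not taken from any LHC framework): all I/O requests are issued up front on a thread pool, and processing proceeds in whatever order the data arrives.

```java
// Sketch: issue all object/file requests up front and process data-driven,
// i.e. handle whichever object arrives first, rather than reading serially.
import java.util.*;
import java.util.concurrent.*;

public class DataDrivenRead {
    public static void main(String[] args) throws Exception {
        List<String> objectIds = Arrays.asList("evt-001", "evt-002", "evt-003");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);

        // 1. Issue every I/O request up front.
        for (String id : objectIds) done.submit(() -> fetchRemote(id));

        // 2. Data-driven processing: take results in arrival order.
        for (int i = 0; i < objectIds.size(); i++) {
            byte[] payload = done.take().get();
            process(payload);
        }
        pool.shutdown();
    }

    // Stand-in for a remote read (e.g. over a WAN); latency varies per object.
    static byte[] fetchRemote(String id) throws InterruptedException {
        Thread.sleep((long) (Math.random() * 200));
        return id.getBytes();
    }

    static void process(byte[] payload) {
        System.out.println("processed " + new String(payload));
    }
}
```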
ALDAP: Accessing Large Data Archives in Astronomy and Particle Physics
ALDAP (NSF/KDI) Project: NSF Knowledge Discovery Initiative (KDI); Caltech, Johns Hopkins, FNAL (SDSS)
Explore advanced adaptive database structures and physical data storage hierarchies for archival storage of next-generation astronomy and particle physics data
Develop spatial indexes, novel data organizations, and distribution and delivery strategies, for efficient and transparent access to data across networks
Example: (Kohonen) maps for data "self-organization"
Create prototype network-distributed data query execution systems using Autonomous Agent workers
Explore commonalities and find effective common solutions for particle physics and astrophysics data
Beyond Traditional Architectures: Mobile Agents (Java Aglets)
"Agents are objects with rules and legs" -- D. Taylor
Mobile agents are reactive, autonomous, goal-driven and adaptive:
Execute asynchronously
Reduce network load: local conversations
Overcome network latency, and some outages
Adaptive: robust and fault tolerant
Naturally heterogeneous
Extensible concept: agent hierarchies
[Figure: an application served by a hierarchy of service agents and sub-agents.]
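A plain-Java caricature of these properties, added for illustration; this is not the Aglets API, and the DataSite interface and QueryAgent class are invented names. The agent carries its goal to the data, does all per-file work locally, and sends back only a short summary.

```java
// Illustrative only: not the Aglets API. A goal-driven agent that works
// locally at a data site and reports back a single summary message.
interface DataSite {
    java.util.List<String> localFiles(String pattern); // what is stored here
    void report(String summary);                        // the only WAN "conversation"
}

class QueryAgent implements Runnable {
    private final DataSite site;   // the host the agent has migrated to
    private final String goal;     // e.g. a dataset pattern to summarize

    QueryAgent(DataSite site, String goal) { this.site = site; this.goal = goal; }

    @Override public void run() {
        // Local conversation: all per-file work happens at the data site,
        // so only a short result, not the data itself, crosses the network.
        int matches = site.localFiles(goal).size();
        site.report(goal + ": " + matches + " matching files");
    }
}
```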
Grid Services Architecture [*]: Putting it all Together
Grid Fabric: archives, networks, computers, display devices, etc.; associated local services
Grid Services: protocols, authentication, policy, resource management, instrumentation, data discovery, etc.
Application Toolkits: remote visualization toolkit; remote computation toolkit; remote data toolkit; remote sensors toolkit; remote collaboration toolkit; ...
Applications: HEP data-analysis related applications
[*] Adapted from Ian Foster
Grid Hierarchy Goals: Better Resource Use and Faster Turnaround
Efficient resource use and improved responsiveness through:
Treatment of the ensemble of site and network resources as an integrated (loosely coupled) system
Resource discovery, query estimation (redirection), co-scheduling, prioritization, local and global allocations
Network and site "instrumentation": performance tracking, monitoring, forward prediction, problem trapping and handling
Exploitation of superior network infrastructures (national, land-based) per unit cost for frequently accessed data:
transoceanic links are relatively expensive; shorter links normally give higher throughput
Ease of development, operation, management and security, through the use of layered, (de facto) standard services
Grid Hierarchy Concept: Broader Advantages
Greater flexibility to pursue different physics interests, priorities, and resource allocation strategies by region
Lower tiers of the hierarchy mean more local control
Partitioning of users into "proximate" communities for support, troubleshooting and mentoring
Partitioning of facility tasks, to manage and focus resources
"Grid" integration and common services are a principal means for effective worldwide resource coordination
An opportunity to maximize global funding resources and their effectiveness, while meeting the needs for analysis and physics
Grid Development Issues
Integration of applications with Grid middleware:
A performance-oriented user application software architecture is needed, to deal with the realities of data access and delivery
Application frameworks must work with system state and policy information ("instructions") from the Grid
ODBMSs must be extended to work across networks: "invisible" (to the DBMS) data transport, and catalog update
Interfacility cooperation at a new level, across world regions:
Agreement on the use of standard Grid components, services, security and authentication
Match with heterogeneous resources, performance levels, and local operational requirements
Consistent policies on the use of local resources by remote communities
Accounting and "exchange of value" software
Worldwide Integrated Distributed Systems for Dynamic Content Delivery, Circa 2000
Content Delivery Networks: a Web-enabled pre-"Data Grid"
Akamai, Adero, Sandpiper server networks:
1200 (and growing toward thousands of) network-resident servers
25 to 60 ISP networks; 25 to 30 countries
40+ corporate customers; ~$25 B capitalization
Resource discovery; build a "weathermap" of the server network (state tracking)
Query estimation; matchmaking/optimization; request rerouting; virtual IP addressing
Mirroring, caching
(1200) autonomous-agent implementation
The Need for a "Grid": the Basics
Computing for the LHC will never be "enough" to fully exploit the physics potential, or to exhaust the scientific potential of the collaborations
The basic Grid elements are required to make the ensemble of computers, networks and storage management systems function as a self-consistent system, implementing consistent (and complex) resource usage policies
A basic "Grid" will be an information-gathering, workflow-guiding, monitoring and repair-initiating entity, designed to ward off resource wastage (or meltdown) in a complex, distributed and somewhat "open" system
Without such information, experience shows that effective global use of such a large, complex and diverse ensemble of resources is likely to fail, or at the very least be sub-optimal
The time to accept the charge to build a Grid, for sober and compelling reasons, is now
Grid-like systems are starting to appear in industry and commerce, but Data Grids on the LHC scale will not be in production until significantly after 2005
Summary
The HENP/LHC data analysis problem: Petabyte-scale compact binary data, and computing resources, distributed worldwide
Development of an integrated, robust, networked data access, processing and analysis system is mission-critical
An aggressive R&D program is required to develop reliable, seamless systems that work across an ensemble of networks
An effective inter-field partnership is now developing through many R&D projects (PPDG, GriPhyN, ALDAP, ...)
HENP analysis is now one of the driving forces for the development of "Data Grids"
Solutions to this problem could be widely applicable in other scientific fields and in industry, by LHC startup: national and multi-national "Enterprise Resource Planning"