
A Large Hadron Collider Case Study - Where HPC and Big Data Converge

Frank Würthwein

Professor of Physics, University of California San Diego
November 15th, 2013 (HP-CAST21)

Outline
- The Science
- Software & Computing Challenges
- Present Solutions
- Future Solutions

The Science

The Universe is a strange place!
- ~67% of the energy is dark energy. We have no clue what this is.
- ~29% of the matter is dark matter. We have some ideas but no proof of what this is!
- All of what we know makes up only about 4% of the universe.

To study Dark Matter, we need to create it in the laboratory.
[Aerial view of the LHC ring: Mont Blanc, Lake Geneva, and the four experiments ALICE, ATLAS, LHCb, and CMS.]

Big bang in the laboratory
We gain insight by colliding particles at the highest energies possible to measure:
- Production rates
- Masses & lifetimes
- Decay rates
From this we derive the spectroscopy as well as the dynamics of elementary particles. Progress is made by going to higher energies and brighter beams.

Explore Nature over 15 Orders of Magnitude
Perfect agreement between Theory & Experiment.

Dark Matter is expected somewhere below this line.

And for the Sci-Fi Buffs

Imagine our 3D world to be confined to a 3D surface in a 4D universe. Imagine this surface to be curved such that the 4th-D distance is short for locations light years away in 3D. Imagine space travel by tunneling through the 4th D.
The LHC is searching for evidence of a 4th dimension of space.

Recap so far
- The beams cross in the ATLAS and CMS detectors at a rate of 20 MHz.
- Each crossing contains ~10 collisions.
- We are looking for rare events that are expected to occur in roughly 1 in 10,000,000,000,000 (10^13) collisions, or less.

Software & Computing Challenges

The CMS Experiment
80 million electronic channels
x 4 bytes
x 40 MHz
-----------------------------------
~ 10 Petabytes/sec of information
x 1/1000 zero-suppression
x 1/100,000 online event filtering
-----------------------------------
~ 100-1000 Megabytes/sec of raw data to tape

1 to 10 Petabytes of raw data per year written to tape, not counting simulations.
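As a sanity check, the reduction chain above can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch using only the numbers quoted on the slide; the ~10^7 live seconds per running year is an assumed figure, not from the slides.

```python
# Back-of-envelope check of the CMS data-reduction chain quoted above.
# All inputs come from the slide except LIVE_SECONDS_PER_YEAR, which is an
# assumed typical figure for an LHC running year.

CHANNELS = 80e6              # electronic channels
BYTES_PER_CHANNEL = 4        # bytes read out per channel
CROSSING_RATE = 40e6         # beam crossings per second (Hz)
ZERO_SUPPRESSION = 1e-3      # fraction of data kept after zero-suppression
TRIGGER_FRACTION = 1e-5      # fraction of crossings kept by online filtering
LIVE_SECONDS_PER_YEAR = 1e7  # assumed LHC live time per year

raw_rate = CHANNELS * BYTES_PER_CHANNEL * CROSSING_RATE   # bytes/s off the detector
to_tape = raw_rate * ZERO_SUPPRESSION * TRIGGER_FRACTION  # bytes/s written to tape
per_year = to_tape * LIVE_SECONDS_PER_YEAR                # bytes per year

print(f"off the detector : {raw_rate / 1e15:.1f} PB/s")   # ~12.8 PB/s (slide: ~10 PB/s)
print(f"to tape          : {to_tape / 1e6:.0f} MB/s")     # ~128 MB/s  (slide: 100-1000 MB/s)
print(f"per year         : {per_year / 1e15:.1f} PB/yr")  # ~1.3 PB/yr (slide: 1-10 PB/yr)
```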

2000 scientists (1200 Ph.D. in physics)
~ 180 institutions
~ 40 countries

12,500 tons, 21m long, 16m diameter

Example of an interesting Event

Higgs candidate event display.

Zoomed-in R-Z view of a busy event: yellow dots indicate individual collisions, all during the same beam crossing.

Active Scientists in CMS

5-40% of the scientific members are actively doing large-scale data analysis in any given week.
~1/4 of the collaboration, scientists and engineers, contributed to the common source code of ~3.6M C++ SLOC.

Evolution of LHC Science Program
Event rate written to tape: 150 Hz -> 1000 Hz -> 10000 Hz.
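To connect these trigger rates to the "10s to 1000s of Petabytes" quoted in the next section, here is a rough extrapolation. The per-event size is derived from the readout numbers quoted earlier, and the ~10^7 live seconds per year is an assumed figure; both are illustrative assumptions rather than numbers from the slides.

```python
# Rough extrapolation of annual raw-data volume versus trigger rate.
# EVENT_SIZE is derived from the earlier slide (80e6 channels * 4 bytes * 1/1000
# zero-suppression ~ 0.3 MB per stored event); LIVE_SECONDS_PER_YEAR is assumed.

EVENT_SIZE = 80e6 * 4 * 1e-3        # ~3.2e5 bytes per stored event
LIVE_SECONDS_PER_YEAR = 1e7         # assumed LHC live time per year

for rate_hz in (150, 1000, 10000):  # event rates written to tape, from the slide
    petabytes = rate_hz * EVENT_SIZE * LIVE_SECONDS_PER_YEAR / 1e15
    print(f"{rate_hz:>5} Hz  ->  ~{petabytes:.1f} PB of raw data per year")
# ~0.5, ~3 and ~32 PB/yr respectively, before counting simulations and the
# derived data formats that multiply the total further.
```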

The Challenge
How do we organize the processing of 10s to 1000s of Petabytes of data by a globally distributed community of scientists, and do so with manageable change costs for the next 20 years?

Guiding Principles for Solutions
- Choose technical solutions that allow computing resources to be as distributed as the human resources.
- Support distributed ownership and control, within a global single sign-on security context.
- Design for heterogeneity and adaptability.

Present Solutions

Federation of National Infrastructures. In the U.S.A.: Open Science Grid.

Among the top 500 supercomputers there are only two that are bigger when measured by power consumption.

Tier-3 Centers
- Locally controlled resources not pledged to any of the 4 collaborations.
- Large clusters at major research universities that are time-shared.
- Small clusters inside departments and individual research groups.
- Requires the global sign-on system to be open for dynamically adding resources: easy to support APIs, and easy to work around unsupported APIs.

Me -- My friends -- The grid/cloud
- Me: O(10^4) users; thin client.
- My friends: O(10^1-2) VOs; thick VO middleware & support; domain-science specific.
- The anonymous grid or cloud: O(10^2-3) sites; thin Grid API; common to all sciences and industry.

My Friends Services
- Dynamic resource provisioning.
- Workload management: schedule resource, establish runtime environment, execute workload, handle results, clean up (sketched below).
- Data distribution and access: input, output, and relevant metadata; file catalogue.
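The workload-management service above is, in essence, a late-binding "pilot" pattern: a placeholder job is provisioned on some resource, and only once it is running does it pull down real work. The sketch below is a minimal, self-contained illustration of that pattern; the queue, job payloads, and all names are hypothetical stand-ins for the glideinWMS/VO middleware described later, not actual CMS code.

```python
# Minimal illustration of the late-binding "pilot" pattern used by the
# "my friends" layer: provision a slot, set up the environment, then pull
# workloads from a central queue until none remain. All names are hypothetical.
import queue

def provision_runtime():
    """Stand-in for environment setup (software release, security context)."""
    return {"release": "CMSSW_X_Y_Z", "proxy": "/tmp/x509up_uNNN"}  # hypothetical

def run_task(task: dict, env: dict) -> dict:
    """Stand-in for executing one workload in the provisioned environment."""
    return {"task": task["name"], "status": "done", "release": env["release"]}

def report(result: dict) -> None:
    """Stand-in for handing results back to the VO layer."""
    print(f"finished {result['task']} with {result['release']}")

def pilot(task_queue: "queue.Queue[dict]") -> None:
    env = provision_runtime()               # establish runtime environment
    while True:
        try:
            task = task_queue.get_nowait()  # late binding: fetch work at runtime
        except queue.Empty:
            break                           # no more work: clean up and exit
        report(run_task(task, env))         # execute workload, handle results

if __name__ == "__main__":
    q: "queue.Queue[dict]" = queue.Queue()
    for i in range(3):
        q.put({"name": f"analysis_job_{i}"})  # hypothetical job payloads
    pilot(q)
```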


Optimize Data Structure for Partial Reads

[Histogram: number of files vs. the fraction of each file that is read, with an overflow bin.]
For the vast majority of files, less than 20% of the file is read.
Average: 20-35%. Median: 3-7% (depending on the type of file).

Future Solutions

From present to future
Initially, we operated a largely static system:
- Data was placed quasi-statically before it could be analyzed.
- Analysis centers have contractual agreements with the collaboration.
- All reconstruction is done at centers with custodial archives.
Increasingly, we have too much data to afford this:
- Dynamic data placement: data is placed at T2s based on job backlog in global queues.
- WAN access (Any Data, Anytime, Anywhere): jobs are started on the same continent as the data instead of on the same cluster attached to the data.
- Dynamic creation of data processing centers: Tier-1 hardware is bought to satisfy steady-state needs instead of peak needs. Primary processing as data comes off the detector => steady state; annual reprocessing of accumulated data => peak needs.

Any Data, Anytime, Anywhere

A global redirection system unifies all CMS data into one globally accessible namespace. This is made possible by paying careful attention to the IO layer, to avoid inefficiencies due to IO-related latencies.
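The combination of the two ideas above (read only what you need, and read it from wherever it happens to be) can be illustrated with a short sketch. The CMS tools of the era were C++ (CMSSW/ROOT over XRootD); the snippet below instead uses the Python library uproot purely for illustration, and the redirector hostname, file path, tree name, and branch names are all hypothetical placeholders.

```python
# Illustration of "Any Data, Anytime, Anywhere" plus partial reads:
# open a file through a global XRootD redirector (wherever the data lives)
# and read only the columns the analysis needs, not the whole file.
# Hostname, path, tree and branch names are hypothetical placeholders.
import uproot

URL = "root://global-redirector.example.org//store/data/Run2012/example.root"

with uproot.open(URL) as f:   # the redirector resolves to a site holding the file
    tree = f["Events"]        # hypothetical tree name
    # Partial read: only the requested branches are fetched over the WAN,
    # so only a small fraction of the file's bytes actually moves.
    arrays = tree.arrays(["muon_pt", "muon_eta"], entry_stop=10_000)
    print(arrays["muon_pt"][:5])
```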

Vision going forward
This vision was implemented for the first time in Spring 2013, using the Gordon Supercomputer at SDSC.

CMS "My Friends" Stack

Job environment:
- CMSSW release environment: NFS-exported from Gordon IO nodes. Future: CernVM-FS via Squid caches (J. Blomer et al., 2012 J. Phys.: Conf. Ser. 396 052013).
- Security context (CA certs, CRLs) via the OSG worker node client.
- CMS calibration data access via FroNTier (B. Blumenfeld et al., 2008 J. Phys.: Conf. Ser. 119 072007); Squid caches installed on Gordon IO nodes.

Data and job handling:
- glideinWMS (I. Sfiligoi et al., doi:10.1109/CSIE.2009.950): implements late-binding provisioning of CPU and job scheduling; submits pilots to Gordon via BOSCO (GSI-SSH; see the sketch below).
- WMAgent to manage CMS workloads.
- PhEDEx data transfer management; uses SRM and GridFTP.

This is clearly mighty complex!!! So let's focus only on the parts that are specific to incorporating Gordon as a dynamic data processing center.
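The Gordon-specific pieces are itemized just below; the key enabler among them is BOSCO-style pilot submission over SSH, which hands a pilot job to the cluster's own batch system without requiring grid middleware on the cluster itself. The sketch below illustrates only that idea; the hostname, script path, and the use of plain ssh with a SLURM-like scheduler are assumptions for illustration, not the actual BOSCO or Gordon configuration.

```python
# Conceptual sketch of BOSCO-style pilot submission: hand a pilot script to a
# remote cluster's batch system over SSH, with no grid middleware installed on
# the cluster itself. Hostname, paths, and batch commands are hypothetical.
import subprocess

LOGIN_NODE = "login.gordon.example.org"  # hypothetical login host
PILOT_SCRIPT = "/home/cmsuser/pilot.sh"  # hypothetical pilot wrapper on the cluster

def submit_pilot(n_pilots: int = 1) -> None:
    """Submit n_pilots pilot jobs to the remote batch system via SSH."""
    for _ in range(n_pilots):
        # The pilot wrapper would source the NFS-exported CMSSW environment,
        # point FroNTier/CVMFS traffic at the local Squid cache, and then
        # start the glidein that pulls real work from the central queue.
        cmd = ["ssh", LOGIN_NODE, "sbatch", PILOT_SCRIPT]  # assumes a SLURM-like scheduler
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout.strip() or result.stderr.strip())

if __name__ == "__main__":
    submit_pilot(n_pilots=4)
```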

Items in red were deployed or modified to incorporate Gordon:
- BOSCO
- Minor modification of the PhEDEx config file
- Deploy Squid
- Export CMSSW & WN client

Gordon Results
- Work completed in February/March 2013 as a result of a lunch conversation between SDSC & US-CMS management: dynamically responding to an opportunity.
- 400 million RAW events processed.
- 125 TB in and ~150 TB out.
- ~2 million core-hours of processing.
- Extremely useful both for science results and as a proof of principle in software & computing.

Summary & Conclusions
Guided by the principles:
- Support distributed ownership and control in a global single sign-on security context.
- Design for heterogeneity and adaptability.
the LHC experiments very successfully developed and implemented a set of new concepts to deal with Big Data.

Outlook (I)
The LHC experiments had to largely invent an island of Big Data technologies, with limited interaction with industry and other domain sciences. Is it worth building bridges to other islands? The IO stack and HDF5? MapReduce? What else? Is there a mainland emerging that is not just another island?

Outlook (II)
Problem: with increasing brightness of the beams, the number of simultaneous collisions increases from ~10 to ~140. The resulting increase in the number of hits in the detector leads to an exponential growth in the CPU time needed to do the pattern recognition at the core of our reconstruction software: O(10^4) by 2023.
Hoped-for solution (see the sketch below): O(10^4) ~ O(10) x O(10) x O(10) x O(10)
- Moore's law
- New hardware architectures
- New algorithms
- Build a better detector
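As a rough illustration of why the O(10^4) has to come from several directions at once, the snippet below compares what Moore's-law-style scaling alone would plausibly deliver over the decade to 2023 with the required factor. The two-year doubling period is an assumed figure, and the split into four roughly equal factors simply follows the slide's own decomposition.

```python
# Back-of-envelope: why an O(10^4) increase in reconstruction throughput by 2023
# cannot come from transistor scaling alone. The 2-year doubling period is an
# assumed Moore's-law figure; the factor-of-10 split follows the slide.

REQUIRED_FACTOR = 1e4        # needed increase in throughput by 2023 (from the slide)
YEARS = 10                   # roughly 2013 -> 2023
DOUBLING_PERIOD_YEARS = 2    # assumed Moore's-law doubling period

moore_gain = 2 ** (YEARS / DOUBLING_PERIOD_YEARS)  # ~32x from scaling alone
remaining = REQUIRED_FACTOR / moore_gain           # must come from elsewhere

print(f"Moore's law alone : ~{moore_gain:.0f}x")
print(f"still missing     : ~{remaining:.0f}x "
      f"(new architectures, new algorithms, a better detector)")
# The slide's decomposition O(10^4) ~ 10 x 10 x 10 x 10 assigns roughly one
# order of magnitude to each of the four directions.
```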

