PASTA Review.pptx

Michael Ernst, FNAL Large Scale Cluster Computing Workshop October 21, 2002 PASTA Review Technology for the LHC Era


PASTA Review: Technology for the LHC Era
Michael Ernst, FNAL
Large Scale Cluster Computing Workshop, 21 October 2002

Approach to PASTA III
- Technology review of what was expected from PASTA II and of what might be expected in 2005 and beyond.
- Understand the technology drivers, which may be market and business driven. In particular, many suppliers of basic technologies have undergone major business changes through divestment, mergers and acquisitions.
- Translate, where possible, into costs that will enable us to predict how things are evolving.
- Extract emerging best practices and use case studies wherever possible.
- Involve a wider group of people than CERN alone, from major institutions in at least Europe and the US.

Participants
A: Semiconductor Technology: Ian Fisk (UCSD), Alessandro Machioro (CERN), Don Petravik (Fermilab)
B: Secondary Storage: Gordon Lee (CERN), Fabien Collin (CERN), Alberto Pace (CERN)
C: Mass Storage: Charles Curran (CERN), Jean-Philippe Baud (CERN)
D: Networking Technologies: Harvey Newman (Caltech), Olivier Martin (CERN), Simon Leinen (Switch)
E: Data Management Technologies: Andrei Maslennikov (Caspur), David Foster (CERN)
F: Storage Management Solutions: Michael Ernst (Fermilab), Nick Sinanis (CERN/CMS), Martin Gasthuber (DESY)
G: High Performance Computing Solutions: Bernd Panzer (CERN), Ben Segal (CERN), Arie Van Praag (CERN)

Current Status
- Most reports are in the final stages.
- Networking is the last to complete.
- Some cosmetic treatments are needed.
- Draft reports can be found at: http://david.web.cern.ch/david/pasta/pasta2002.htm

[Table: semiconductor technology requirements by year: DRAM pitch (um), uP channel length, Tox equivalent (nm). Values not recoverable.]

- Serial ATA (150 MB/s to 600 MB/s) is expected to dominate the commodity disk connectivity market by the end of 2003.
- Fibre Channel products are still expensive.
- DVD solutions are still 2-3x as expensive as disks.
- There is no industry experience in managing large DVD libraries.

Tape Storage Technology

Tape Drive Target Applications
[Figure: application segments placed by performance need. Axis labels: Tape Capacity, Tape Access; Access, Throughput, Capacity.]
Application segments:
- Batch Processing
- Tape Transaction Processing
- Hierarchical Storage Management
- Active Archive (Check, Medical)
- Disk Extension
- Scientific / Extremely Large Files
- Backup and Restore
- Deep Archive

Tape Path and Cartridge Types
- Dual hub, internal tape path: STK 9840, IBM 3570, QIC
- Single hub, external tape path: DLT, LTO, 3490, 4490, 9490, 3590, SD-3, 9940
- Dual hub cassette, external tape path: AIT, Mammoth and other 4/8 mm helical scan

LTO Ultrium Roadmap
- Generation 1: GA start of Q3-00
- Generation 2: GA 18-24 months after Gen 1
- Generation 3: GA 18-24 months after Gen 2
[Figure labels: Media Swap (three times)]

Recent Tape Drives (STK)
[Figure: 9940 and 9840 drives]

Track Layout (STK 9840/9940)
[Figure: track layout]

Ferrofluid/MFM Images of Data Tracks and Amplitude-Based Servo Tracks (STK 9840/9940 Drives)
- 6 servo tracks with bursts of servo signal.
- The servo head positions between tracks (5 positions).
[Figure labels: direction of tape; positions 5, 4, 3, 2, 1]

9940 Operational Principle
- Single hub, external tape path

Data Access Comparison
Data access: time from cartridge insertion in the drive to the first byte of data, timed to mid-point. Robotic time is not included.

Access drives        Seconds
9840                 12
Timberline           25

Capacity drives      Seconds
9940                 59
RedWood              66
Magstar 3590         64
Magstar 3590E        90+
IBM LTO 3580         110

9940 Load Time
[Figure: 9940 load time to first data]

9840 and 9940 Tape Drive Roadmap
[Timeline: 2000, 2001, 2002, 2003, Futures]
Access drives:
- 9840A (GA 1999): 20 GB (unc.), 10 MB/sec, access = 8.0 sec, MP media, SCSI, Fibre Ch.
- 9840B (volume ship Q3-01): 20 GB (unc.), 19 MB/sec, access = 8.0 sec, MP media, SCSI, Fibre Ch., TCP/IP service port
- 9840C (2003): 70 GB (unc.), 30 MB/sec, access = 8.0 sec, MP media, ESCON, Fibre Ch., TCP/IP service port
- Future generation access drives: 100 - 300 GB (unc.), 45 - 90 MB/sec
Capacity drives:
- 9940A (GA 2000): 60 GB (unc.), 10 MB/sec, MP media, SCSI, Fibre Ch.
- 9940B (1H02): 200 GB (unc.), 30 MB/sec, MP media, Fibre Channel, TCP/IP service port
- Future generation capacity drives: 300 GB - 1 TB (unc.), 60 - 120 MB/sec

Tapes - 1
- Recent tape drive technology (9840, 9940A/B, LTO) is installed at CERN.
- The current installation is 10 STK silos capable of taking 800 new-format tape drives. Today tape performance is 15 MB/sec per drive, so the theoretical aggregate is 12 GB/sec (no way in reality!).
- Cartridge capacities are expected to increase to 1 TB before LHC startup, but it is market demand and not technical limitations that is driving this.
- Using tapes as a random access device is no longer a viable option.
- Need to consider a much larger, persistent disk cache for LHC, reducing tape activity for analysis.
[Figure labels: Random Access, Sequential Access, File Transfer, Production, Personal Analysis, Disk Cache, HSM (Hierarchical Storage Manager), Experiment-specific]

Tapes - 2
- Current costs are about $33/slot for a tape in the Powderhorn robot.
- The current tape cartridge (9940A/B, 60 GB) costs $86, with a slow decrease over time.
- Media dominates the overall cost, and a move to higher-capacity cartridges and tape units sometimes requires a complete media change.
- Storage costs of 0.4-0.7 USD/GB in 2000 could drop to 0.2 USD/GB in 2005, but this would probably require a complete media change.
- Conclusion: no major challenges for tapes at LHC startup, but the architecture has to be such that random access is avoided.
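
A quick check of the tape numbers above, as a minimal Python sketch. The inputs (800 drives, 15 MB/sec, $86 per cartridge, $33 per slot, 60 GB) come from the Tapes slides. The 200 GB capacity is the 9940B figure from the drive roadmap, and applying the same cartridge and slot price to it is an assumption made here for illustration. The function names are hypothetical.

```python
# Back-of-envelope tape figures from the "Tapes - 1" and "Tapes - 2" slides.
# Assumption: the same $86 cartridge and $33 slot apply at both capacities.


def aggregate_throughput_gb_s(drives: int, mb_per_s_per_drive: float) -> float:
    """Theoretical aggregate throughput in GB/sec (1 GB = 1000 MB)."""
    return drives * mb_per_s_per_drive / 1000.0


def media_cost_per_gb(cartridge_usd: float, slot_usd: float, capacity_gb: float) -> float:
    """Cartridge plus robot slot cost, divided by uncompressed capacity."""
    return (cartridge_usd + slot_usd) / capacity_gb


if __name__ == "__main__":
    print("aggregate GB/sec:", aggregate_throughput_gb_s(800, 15))
    for drive, capacity_gb in (("9940A", 60), ("9940B", 200)):
        cost = media_cost_per_gb(86, 33, capacity_gb)
        print(f"{drive}: {cost:.2f} USD/GB")
```

With these inputs the aggregate is 12 GB/sec, and media plus slot comes to roughly 2 USD/GB at 60 GB and roughly 0.6 USD/GB at 200 GB.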

Networking
Interregional connectivity is the key.

CMS as an example
[Figure: tiered computing model. Labels: Experiment; Online System; Offline Farm, CERN Computer Center; US Center @ FNAL; France Center; Italy Center; UK Center; Tier2 Center (five times); Institute (four times); Tier 0+1, Tier 1, Tier 2, Tier 3. Link rates shown: ~PBytes/sec, ~100 MBytes/sec, ~2.4 Gbits/sec.]
- CMS has adopted a distributed computing model to perform data analysis, event simulation, and event reconstruction, in which two-thirds of the total computing resources are located at regional centers.
- The unprecedented size of the LHC collaborations and the complexity of the computing task require that new approaches be developed to allow physicists spread around the globe to participate efficiently.

Transatlantic Net WG (HN, L. Price): Bandwidth Requirements [*]
[Table: values not recoverable]

[*] Installed BW. Maximum link occupancy of 50% assumed.

Network Progress and Issues for Major Experiments
- Network backbones are advancing rapidly to the 10 Gbps range.
- Gbps end-to-end throughput data flows will be in production soon (in 1-2 years).
- Network advances are changing the view of the nets' roles.
  - This is likely to have a profound impact on the experiments' computing models and bandwidth requirements.
- Advanced integrated applications, such as Data Grids, rely on seamless, transparent operation of our LANs and WANs
  - with reliable, quantifiable (monitored), high performance.
- Networks need to be integral parts of the Grid(s) design.
- Need new paradigms of real network and system monitoring, and of new managed global systems for HENP analysis.
  - These are starting to be developed for LHC.

Signs of the Times: Uncertainty, But No Change in Outlook
- Key providers in bankruptcy: KPNQwest, Teleglobe, Global Crossing, FLAG; Worldcom
- Switching to others, where needed and possible, e.g. T-Systems (Deutsche Telekom) for US-CERN
- Strong telecom market outlook
  - Good pricing from DT
  - MCI/Worldcom network will continue to operate (?): 20 M customers in US; UK academic & research network
  - Aggressive plans by major and startup network equipment providers
- Strong outlook in R&E nets for rapid progress
  - Abilene (US) upgrade on schedule; GEANT (Europe) and SuperSINET (Japan) plans continuing
  - ESNet backbone upgrade: 2.5 Gbps now; 10 Gbps in 2 years
  - Regional progress and visions, e.g. CALREN: 1 Gbps to every Californian by 2010

LHCnet Network: Late 2002
[Figure: network diagram. Labels: Development and tests; Abilene, MREN, ESnet, STARTAP, NASA; Linux PC for performance tests & monitoring (at both ends); CERN - Geneva; GEANT, Switch, IN2P3, WHO; Caltech/DoE PoP, StarLight Chicago; 622 Mbps (Prod.); 2.5 Gbps (R&D); Cisco 7609 CERN; Cisco 7609 Caltech (DoE); Alcatel 7770 DataTAG (CERN), twice; Cisco 7606 DataTAG (CERN); Juniper M10 DataTAG (CERN); Cisco 7606 Caltech (DoE); Juniper M10 Caltech (DoE); Optical Mux/Dmux Alcatel 1670, twice.]

The Rapid Pace of Network Technology Advances Continues
Within the next one to two years:
- 10 Gbps Ethernet on switches and servers; LAN/WAN integration at 10 Gbps
- 40 Gbps wavelengths being shown
- HFR: 100 Mpps forwarding engines, 4 and more 10 Gbps ports per slot; Terabit/sec backplanes etc.
- Broadband wireless [multiple 3G/4G alternatives]: the drive to defeat the last mile problem; 802.11 a/b, UWB, etc.

DataTAG Project
- EU-solicited project: CERN, PPARC (UK), Amsterdam (NL), and INFN (IT); and US (DOE/NSF: UIC, NWU and Caltech) partners
- Main aims:
  - Ensure maximum interoperability between US and EU Grid projects
  - Transatlantic testbed for advanced network research
- 2.5 Gbps wavelength triangle 7/02 (10 Gbps triangle in 2003)
[Figure: wave triangle. Labels: Fr Renater, GEANT, New York, STAR-TAP, STARLIGHT]

HENP Major Links: Bandwidth Roadmap (Scenario) in Gbps

HENP Lambda Grids: Fibers for Physics
- Problem: extract small data subsets of 1 to 100 Terabytes from 1 to 1000 Petabyte data stores.
- Survivability of the HENP Global Grid System, with hundreds of such transactions per day (circa 2007), requires that each transaction be completed in a relatively short time.
- Example: take 800 secs to complete the transaction. Then:

  Transaction Size (TB)    Net Throughput (Gbps)
  1                        10
  10                       100
  100                      1000 (capacity of fiber today)

- Summary: providing switching of 10 Gbps wavelengths within ~3 years, and Terabit switching within ~6-10 years, would enable Petascale Grids with Terabyte transactions within this decade, as required to fully realize the discovery potential of major HENP programs, as well as other data-intensive fields.

Time to recover from a single loss
- TCP reactivity: due to the basic multiplicative-decrease, additive-increase algorithm used to handle packet loss
  - The time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Chicago and CERN.
- A single loss is disastrous
  - A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease)
  - A TCP connection increases its bandwidth use slowly (additive increase)
  - TCP is much more sensitive to packet loss in WANs than in LANs
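
The two estimates above can be reproduced with a short Python sketch. The first function is plain unit conversion. The second is a simplified AIMD model (assumption: the window grows by one MSS per RTT and has to climb back from half the bandwidth-delay product). The slides do not state the exact formula behind their figures, so expect the same order of magnitude rather than identical values. The function names are hypothetical.

```python
def required_gbps(transaction_tb: float, seconds: float) -> float:
    """Net throughput in Gbps needed to move transaction_tb terabytes in `seconds`."""
    bits = transaction_tb * 1e12 * 8  # 1 TB = 1e12 bytes
    return bits / seconds / 1e9


def aimd_recovery_seconds(capacity_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Time to regain full rate after one loss, under the simplified model.

    The window at full rate is capacity * RTT / MSS segments. A loss halves it,
    and additive increase restores one segment per round trip.
    """
    window_segments = capacity_bps * rtt_s / (mss_bytes * 8)
    round_trips = window_segments / 2
    return round_trips * rtt_s


if __name__ == "__main__":
    for size_tb in (1, 10, 100):
        print(f"{size_tb} TB in 800 s: {required_gbps(size_tb, 800):.0f} Gbps")

    # CERN-Chicago link parameters quoted in the slides: 622 Mbps, RTT 120 ms.
    minutes = aimd_recovery_seconds(622e6, 0.120, 1460) / 60
    print(f"622 Mbps, 120 ms, MSS 1460: about {minutes:.1f} min")
```

For 1 TB in 800 s the first function gives 10 Gbps, matching the table. For the 622 Mbps CERN-Chicago link (RTT 120 ms, MSS 1460 bytes) the model gives a little over 6 minutes, in line with the figure quoted above.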

Network Protocol Issues
[Figure annotation: 6 min]

TCP Responsiveness

Case                               Capacity    RTT (ms)         MSS (Byte)           Responsiveness
Typical LAN in 1988                10 Mbps     [2 ; 20]         1460                 [1.5 ms ; 154 ms]
Typical WAN in 1988                9.6 Kbps    40               1460                 0.006 sec
Typical LAN today                  100 Mbps    5 (worst case)   1460                 0.096 sec
Current WAN link CERN-Starlight    622 Mbps    120              1460                 6 minutes
Future WAN link CERN-Starlight     10 Gbit/s   120              1460                 92 minutes
Future WAN link CERN-Starlight     10 Gbit/s   120              8960 (Jumbo Frame)   15 minutes

National Research Networks in Japan
- Proposal: two TransPacific 2.5 Gbps wavelengths, and a Japan-CERN Grid testbed by ~2003
[Figure: network map. Labels: Tokyo, Nagoya, Internet, Kyoto U, ICR Kyoto-U, Nagoya U, NIFS, NIG, KEK, Tohoku U, IMS, U-Tokyo, NAO, U Tokyo, NII Hitot., NII Chiba, ISAS. Legend: IP, WDM path, IP router, OXC.]

Networking
- Major cost reductions have taken place in wide-area bandwidth costs.
- In 1999, 2.5 Gbps was common for providers but not for academic networks. Now 10 Gbps is common for providers and 2.5 Gbps is common for academic networks.
- Wide-area data migration/replication is now feasible and affordable.
- Tests of multiple streams to the US running at the full capacity of 2 Gbps were successful.
- Local area networking is moving to 10 Gbps and this is expected to increase. The first 10 Gbps NICs are available for end systems.

Networking Trends
- Transitioning from 10 Gbit to 20-30 Gbit seems likely.
- MPLS (Multiprotocol Label Switching) has gained momentum. It provides secure VPN capability over public networks and is a possibility for tier-1 center connectivity.
- Lambda networks based on dark fiber are also becoming very popular. This is a build-it-yourself network and may also be relevant for grid and center connectivity.

Storage - Architecture
- Possibly the biggest challenge for LHC
  - Storage architecture design (seamless integration from CPU caches to deep archive required)
  - Data management: currently very poor tools and facilities for managing data and storage systems
- SAN vs. NAS debate still alive
  - SAN: scalable and highly available, but costly
  - NAS: cheaper and easier to manage
- Object storage technologies appearing
  - Intelligent storage systems able to manage the objects they store
  - Allowing light-weight filesystems

Object Storage Device Architecture
[Figure labels: Application, File Manager, Object Manager, OSD Intelligence, Storage Device; Meta Operation, Data Transfer, Security; LAN/SAN]

Storage Management
- Very little movement in the HSM space since the last PASTA report.
  - HPSS still for large-scale systems
  - A number of mid-range products (make tape look like a big disk), but limited scaling possible
- HEP still a leader in tape and data management
  - CASTOR, Enstore, JASMine
  - Will remain crucial technologies for LHC.
- Cluster file systems appearing (StorageTank, Lustre)
  - Provide unlimited (PB) file system (e.g. through LAN, SAN)
  - Scale to many 1000s of clients (CPU servers)
  - Need to be interfaced to tertiary storage systems (e.g. Enstore)

Storage - Connectivity
- The Fibre Channel market is growing at 36%/year from now to 2006 (Gartner). This is the current technology for SAN implementation.
- iSCSI or equivalent over Gigabit Ethernet is an alternative, cheaper but less performant implementation of SAN that is gaining in popularity.
  - It is expected that GigE will become a popular transport for storage networks.
- InfiniBand (up to 30 Gbps) is a full-fledged network technology that could change the landscape of cluster architectures and has much, but varying, industry support. Broad adoption could drive costs down significantly.
  - FIO (Compaq, IBM, HP) and NGIO (Intel, MS, Sun) merged to IB
  - Expect bridges between IB and legacy Ethernet and FC nets
  - Uses IPv6
  - Supports RDMA and multicast

Data Transfer Overhead
[Figure: I/O stacks for Local FS, NFS and DAFS across User, Kernel and Hardware layers. Labels: Application, Buffers, FS Switch, File System, SCSI Driver, HBA Driver, HBA; NFS, TCP/IP, NIC Driver, Buffer Cache, Packet Buffers, NIC; DAFS, VIPL, VI NIC Driver, NIC; Memory; Data, Control.]

Storage Cost

- The costs of managing storage and data are the predominant costs.

Storage Scenario - Today

Storage Scenario - Future

Some Overall Conclusions
- Tape and network trends match or exceed our initial needs.
  - Need to continue to leverage economies of scale to drive down long-term costs.
- CPU trends need to be carefully interpreted.
  - The need for new performance measures is indicated.
  - Change in the desktop market might affect the server strategy.
  - Cost of manageability is an issue.
- Disk trends continue to make a large (multi-PB) disk cache technically feasible, but ...
  - The true cost of such an object is a bit unclear, given the issues of reliability, manageability and the disk fabric chosen (NAS/SAN, iSCSI/FC, etc.).
  - File system access for a large disk cache (RFIO, DAFS, ...) is under investigation (urgent!).
- More architectural work is needed in the next 2 years for the processing and handling of LHC data.
  - NAS/SAN models are converging, there are many options for system interconnects, and new high-performance NAS products are (about to be) rolled out (Zambeel, Panasas, Maximum Throughput, Exanet, etc.).

Sounds like we are in pretty good shape ...

... but let's be careful ...
- PASTA has addressed issues exclusively on the fabric level.
- It is likely that we will get the required technology (processors, memory, secondary and tertiary storage devices, networking, basic storage management).
- Missing: solutions allowing truly distributed computing on a global scale.
- Will the Grid projects meet our expectations (in time)?

Disk Vendors Market Share in Units - 1998

Vendor             Share (%)
Seagate            20.5
Quantum            17.4
IBM                12.5
Maxtor             11.6
Toshiba            3.7
Hitachi            1.9
Western Digital    13.5
Samsung            5.9
Fujitsu            11.6
Others             1.4

HDD Vendor Market Share in Units - 2001, Desktop PC/ATA Drives

Vendor             Share (%)
Seagate            24
Quantum/Maxtor     34 (chart label: 36%)
IBM                9
Fujitsu            7
Samsung            8
Western Digital    16

HDD Vendor Market Share in Units - 2001, Enterprise Storage

Vendor             Share (%)
Seagate            47
Quantum/Maxtor     9
IBM                22
Fujitsu            19
Hitachi            3

Other charts (data not recoverable):
- Disk Shipments by Form Factor - 1998 (2.5/3.0", 3.5", 5.25"; millions of units)
- Areal Density Projected Growth (capacity per 3.5 inch platter; GB per disk platter)
- PC/Desktop Disks - Capacity Trends (sizes in GB: 1 - 1.9, 2 - 2.9, 3 - 4.9, 5 - 6.9, 7 - 9.9, 10 - 19.9, > 20; units in millions)
- Disk Shipments by Form Factor (1.8 inch, 2.5 inch, 3.5 inch, 5.25 inch; units in millions)
- Disk Drive Revenues by Drive Capacity (cartridges, < 1 GB, 1 - 2 GB, 2 - 3 GB, 3 - 5 GB, 5 - 10 GB, 10 - 20 GB, 20 - 40 GB, > 40 GB; revenue in US$ billion)

LTO Ultrium Roadmap: generations

                      Generation 1   Generation 2   Generation 3   Generation 4
Capacity              100 GB         200 GB         400 GB         800 GB
Transfer Rate         10-20 MB/s     20-40 MB/s     40-80 MB/s     80-160 MB/s
Enabling Technology   ?              ?              ?              ?
Number of Channels    8              8              16             16
Recording Method      RLL 1,7        PRML           PRML           PRML
Media Type            MP2            MP             MP             Thin Film
Tape Length           580 m          580 m          800 m          800 m
Number of Tracks      384            512            768            1024




9940 Load Time To First Data (total: 58.5 seconds); a second copy of the chart is titled TC40

Step                    Seconds
Load                    0.5
Un-slack                0.5
Thread                  2
One wrap on reel        0.5
Wrap tape path          2
Initialize              0.5
Cover leader block      9
Locate                  41
Low speed position      2.5
Total                   58.5
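
Why random access to tape is unattractive can be illustrated with the load-time breakdown above. This is a minimal Python sketch, assuming the 10 MB/sec streaming rate quoted for the 9940A on the roadmap slide and ignoring robotic time (as the Data Access Comparison does). The function names and the example file sizes are hypothetical.

```python
# Load-time components in seconds, as listed in the 9940 breakdown above.
LOAD_STEPS_S = {
    "load": 0.5,
    "un-slack": 0.5,
    "thread": 2.0,
    "one wrap on reel": 0.5,
    "wrap tape path": 2.0,
    "initialize": 0.5,
    "cover leader block": 9.0,
    "locate": 41.0,
    "low speed position": 2.5,
}


def time_to_first_data(steps: dict) -> float:
    """Total time from cartridge insertion to first byte, in seconds."""
    return sum(steps.values())


def effective_mb_s(file_mb: float, drive_mb_s: float, overhead_s: float) -> float:
    """Average rate for one file when every read pays the full load overhead."""
    return file_mb / (overhead_s + file_mb / drive_mb_s)


if __name__ == "__main__":
    overhead = time_to_first_data(LOAD_STEPS_S)
    print("time to first data:", overhead, "s")
    for file_mb in (100, 1000, 10000):
        rate = effective_mb_s(file_mb, 10.0, overhead)
        print(f"{file_mb} MB file: {rate:.1f} MB/sec effective")
```

For a 1000 MB file the effective rate is about 6.3 MB/sec, and for a 100 MB file it falls to about 1.5 MB/sec, well below the drive's streaming rate.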














HENP Major Links: Bandwidth Roadmap (Scenario) in Gbps

Year   Production               R&D / experimental         Remarks
2001   0.155                    0.622-2.5                  SONET/SDH
2002   0.622                    2.5                        SONET/SDH; DWDM; GigE integ.
2003   2.5                      10                         DWDM; 1 + 10 GigE integration
2005   10                       2-4 X 10                   λ switch; λ provisioning
2007   2-4 X 10                 ~10 X 10; 40 Gbps          1st gen. λ grids
2009   ~10 X 10 or 1-2 X 40     ~5 X 40 or ~20-50 X 10     40 Gbps λ switching
2011   ~5 X 40 or ~20 X 10      ~25 X 40 or ~100 X 10      2nd gen. λ grids; Terabit networks
2013   ~Terabit                 ~MultiTerabit              ~Fill one fiber


[Figure: TCP throughput CERN-Chicago over the 622 Mbit/s link. Axes: time (s), throughput (Mbit/s).]

