A Next-Generation Parallel File System Environment for the OLCF

Galen M. Shipman, David A. Dillow, Douglas Fuller, Raghul Gunasekaran, Jason Hill, Youngjae Kim, Sarp Oral, Doug Reitz, James Simmons, Feiyi Wang

Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

{gshipman,dillowda,fullerdj,gunasekaran,hilljj,kimy1,oralhs,reitzdm,simmonsja,fwang2}@ornl.gov

Abstract

When deployed in 2008/2009, the Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) was the world's largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Spider was designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF; the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider's original design by mid 2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth, from over 240GB/sec today to 1TB/sec, and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.

1 Introduction

The Spider file system departed from the traditionalapproach of tightly coupling the parallel file systems to

a single simulation platform. Our decoupled approach has allowed the OLCF to utilize Spider as the primary parallel file system for all major compute resources at the OLCF, providing users with a common scratch and project space across all platforms. This approach has reduced operational costs and simplified management of our storage environment. The upgrade to the Spider system will continue this decoupled approach with the deployment of a new parallel file system environment that will be run concurrently with the existing systems. This approach will allow a smooth transition of our users from the existing storage environment to the new storage environment, leaving our users a place to stand throughout the upgrade and transition to operations. A tightly coupled file system environment would not allow this flexibility.

The primary platform served by Spider is one of the world's most powerful supercomputers, Jaguar [1, 12, 5], a 3.3 Petaflop/s Cray XK6 [2]. In the Fall of 2012 Jaguar will be upgraded to a hybrid architecture coupling the AMD 16-core Opteron 6274 processor running at 2.4 GHz with an NVIDIA "Kepler" GPU. The upgraded system will be named "Titan". The OLCF also hosts an array of other computational resources such as visualization, end-to-end, and application development platforms. Each of these systems requires a reliable, high-performance and scalable file system for data storage.

This paper presents our plans for the second generation of the Spider system. Much of our planning has been based on I/O workload analysis of our current system, our operational experiences, and projections of required capabilities for the Titan system. The remainder of this paper is organized as follows: Section 2 provides an overview of the I/O workloads on the current Spider system. Operational experiences with the Spider system are presented in Section 3. A high-level system architecture is presented in Section 4, followed by our planned Lustre architecture in Section 5. Benchmarking efforts that synthesize our workload requirements are then presented in Section 6. Finally, conclusions are discussed in Section 7.

2 Workload Characterization

For characterizing workloads we collect I/O statistics from the DDN S2A9900 RAID controllers. The controllers have a custom API for querying performance and status information over the network. A custom daemon utility [11] periodically polls the controllers for data and stores the results in a MySQL database. We collect bandwidth and input/output operations per second (IOPS) for both read and write operations at 2-second intervals. We also measure the I/O workload in terms of the number of read and write requests and their sizes. The request size information is captured in 16KB intervals, with the smallest bucket covering requests of less than 16KB and the largest covering requests up to 4MB. The request size information is sampled from the controller approximately every 60 seconds. The controller maintains an aggregate count of the requests serviced, by size, since the last system boot; the difference between two consecutive samples gives the number of requests serviced during that period.
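
The cumulative-counter arithmetic is straightforward; the sketch below illustrates how per-interval request counts can be derived from two consecutive samples. It is a minimal illustration only; the field names and the reboot-handling choice are assumptions, not the actual daemon from [11].

```python
# Sketch: derive per-interval request counts from the cumulative
# per-size-bucket counters reported by the RAID controller.
# Field and variable names are illustrative, not the real daemon's schema.

def interval_counts(prev_sample, curr_sample):
    """Each sample maps a request-size bucket (bytes) to the cumulative
    number of requests serviced since the controller last booted."""
    counts = {}
    for bucket, curr_total in curr_sample.items():
        prev_total = prev_sample.get(bucket, 0)
        delta = curr_total - prev_total
        # A negative delta suggests the controller rebooted between samples;
        # in that case treat the current total as the interval count.
        counts[bucket] = delta if delta >= 0 else curr_total
    return counts

# Example: two samples taken ~60 seconds apart.
prev = {16 * 1024: 1_200_000, 512 * 1024: 950_000, 1024 * 1024: 400_000}
curr = {16 * 1024: 1_203_500, 512 * 1024: 958_200, 1024 * 1024: 401_100}
print(interval_counts(prev, curr))
# {16384: 3500, 524288: 8200, 1048576: 1100}
```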

We studied the workloads of our storage cluster using the data collected from 48 DDN "Couplets" (96 RAID controllers) over a period of thirteen months, from September 2010 to September 2011. Our storage cluster is composed of three filesystem partitions, called widow1, widow2, and widow3. Widow1 encompasses half of the 48 DDN "Couplets" and provides approximately 5 PB of capacity and 120GB/s of bandwidth. Widow2 and widow3 each encompass a quarter of the 48 DDN "Couplets" and each provide 60GB/s of bandwidth and 2.5 PB of capacity. The maximum aggregate bandwidth over all partitions is approximately 240GB/s. We characterize the data in terms of the following system metrics:

• I/O bandwidth distribution helps us understand the I/O utilization and requirements of our scientific workloads. Understanding workload patterns will help in architecting and designing storage clusters as required by scientific applications.

• Read to write ratio is a measure of the read to write requests observed in our storage cluster. This information can be used to determine the amount of partitioned area required for read caching or write buffering in a shared file cache design.

• Request size distribution is essential in understanding and optimizing device performance and overall filesystem performance. The underlying device performance is highly dependent on the size of read and write requests, and correlating request size with bandwidth utilization will help us understand the performance implications of device characteristics.

[Figure 1: Observed I/O bandwidth usage for a week in June 2011. Three panels plot read and write bandwidth (GB/s) over time, one per widow partition.]

Figure 1 shows the filesystem usage in terms of bandwidth for a week in the month of June 2011. This is representative of our normal usage patterns for a mix of scientific applications on our compute clients. We make the following observations from the figure:

• The widow1 filesystem partition shows much higher read and write bandwidths than the other filesystem partitions (widow2 and widow3). Note that the widow1 partition is composed of 48 RAID controllers (24 "Couplets") whereas the other partitions are composed of 24 RAID controllers (12 "Couplets") each; widow1 is therefore designed to offer higher aggregate I/O bandwidth than the other file systems.

• Regardless of filesystem partition, the utilized bandwidth observed is very low, with only a few sparse spikes of high bandwidth. For example, we can observe high I/O demands, in excess of 60GB/s, on June 2, June 3, and June 4, while other days show lower I/O demands. We can infer from the data that the arrival patterns of I/O requests are bursty and that I/O demands can be tremendously high for short periods of time, while overall utilization remains dramatically lower than peak usage. This is consistent with application checkpoint/restart workloads. Whereas peaks in excess of 60GB/s are common, average utilization is only 1.22GB/s and 1.95GB/s for widow2 and widow3 respectively. These results strongly motivate a tiering strategy for next-generation systems, with higher-bandwidth media such as NVRAM of smaller capacity backed by larger-capacity and relatively lower-performance hard disks.

[Figure 2: Aggregate read and write maximum bandwidths observed from the widow1, widow2, and widow3 partitions, September 2010 through September 2011. Maxima of monthly max values (GB/s): widow1 read=132.6, write=98.4; widow2 read=50.2, write=47.5; widow3 read=52.5, write=47.7.]

Figure 2 shows monthly maximum bandwidths for reads and writes. Overall, it is observed on all widow filesystem partitions that the maximum read bandwidth is higher than the maximum write bandwidth. For example, in widow1 the max read bandwidth is about 132GB/s whereas the max write bandwidth is about 99GB/s; in widow2, the max read bandwidth is 50.2GB/s whereas the max write bandwidth is 47.5GB/s. This asymmetry in performance is common in storage media.

[Figure 3: Cumulative distribution of I/O bandwidths for reads and writes for every widow partition. Panel (a) plots the distribution P(x < X) against read bandwidth (MB/s, log scale); panel (b) shows the same for write bandwidth.]

Our observations from Figure 1 further revealed the bursty properties of I/O bandwidth. One way of analyzing I/O bandwidth demands is through the use of CDF (Cumulative Distribution Function) plots. In Figure 3, we show the CDF plots of reads and writes for all widow partitions. Similar to the observations we made in our previous work [9], the bandwidth distributions for reads and writes follow heavy, long-tailed distributions, and these trends are observed across all widow filesystem partitions.
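
A minimal way to reproduce this kind of analysis, assuming the 2-second bandwidth samples have been exported from the collection database into a flat file, is an empirical CDF such as the sketch below; the file name is hypothetical.

```python
# Sketch: empirical CDF of observed bandwidth samples (MB/s), plotted on a
# log-scaled x-axis as in Figure 3. Assumes samples were exported from the
# collection database into a text file; names are illustrative.
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(samples_mb_s, label):
    x = np.sort(samples_mb_s)
    # Empirical P(X <= x) for each sorted sample.
    p = np.arange(1, len(x) + 1) / len(x)
    plt.step(x, p, where="post", label=label)

read_bw = np.loadtxt("widow1_read_bw_mb_s.txt")   # hypothetical export
plot_cdf(read_bw, "widow1")
plt.xscale("log")
plt.xlabel("Read Bandwidth (MB/s) - Log-Scale")
plt.ylabel("Distribution P(x < X)")
plt.legend()
plt.savefig("read_bw_cdf.png")
```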

For example, in Figure 3(a), we see that the read bandwidth for widow1 can exceed 100GB/s, yet it drops below 10GB/s at the 90th percentile and below 100MB/s at the 50th percentile. Similar observations can be made for widow2 and widow3: the bandwidth can exceed 10GB/s at the 99th percentile, but falls below 100MB/s at around the 65th and 55th percentiles for widow2 and widow3 respectively. Figure 3(b) illustrates the CDF plot of write bandwidth for widow1, 2, and 3. Similar observations apply as in Figure 3(a); however, the maximum write bandwidths are lower than the read bandwidths, and at the 90th percentile the write bandwidth for widow1 is much lower than the read bandwidth.

[Figure 4: Percentage of read requests observed every month for every widow partition, September 2010 through September 2011. Averages: widow1=61.6%, widow2=35.9%, widow3=35.3%.]

Typically, scientific storage systems are thought to be write dominant; this is generally attributed to the large number of checkpoints written for increased fault tolerance. However, in our observations we see a significantly high percentage of read requests.

Figure 4 presents the percentage of read requests with respect to the total number of I/O requests in the system. The plot is derived by calculating the total read and write requests observed during the 13-month period. On average, widow1 is read-dominant, with 61.6% of total requests being reads; the read percentage can even exceed 80% in some months (see October 2010 and April and May 2011 in Figure 4). Compared to widow1, widow2 and widow3 show lower read percentages, averaging around 35% in both partitions. However, the read percentage can exceed 50% there as well, and we also observe that the read percentage is increasing: the average is 41.1% in 2011 versus 24% in 2010.
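
The monthly read percentage in Figure 4 is simply reads / (reads + writes) aggregated per month; a sketch of that aggregation is shown below. The CSV export and column names are hypothetical stand-ins for the MySQL tables populated by the collection daemon.

```python
# Sketch: monthly read percentage from per-interval request-count samples.
# Column names and the CSV export are illustrative only.
import pandas as pd

df = pd.read_csv("widow1_iops.csv", parse_dates=["timestamp"])
# read_ops / write_ops hold the number of read/write requests per interval.
monthly = df.set_index("timestamp").resample("M")[["read_ops", "write_ops"]].sum()
monthly["read_pct"] = 100.0 * monthly["read_ops"] / (monthly["read_ops"] + monthly["write_ops"])
print(monthly["read_pct"])
```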

Conventional wisdom holds that HPC I/O workloads are write dominant due to frequent checkpointing operations; Spider's I/O workloads do not follow this convention. This could be attributed to the center-wide shared file system architecture of Spider, which serves an array of computational resources such as Jaguar, visualization systems, end-to-end systems, and application development systems.

As we observed in our previous work [9], I/O request sizes of 4-16KB, 512KB, and 1MB account for more than 95% of total requests. This is because the request sizes cluster near the 512KB boundaries imposed by the Linux block layer. Spider's Lustre OSSes use DM-Multipath, a virtual device driver that provides fault tolerance. To ensure it never sends requests that are too large for the underlying devices, DM-Multipath uses the smallest maximum request size among them; if all of those devices support requests larger than 512KB, it uses 512KB as its maximum request size. The lower-level devices are free to merge the 512KB requests into larger ones. Lustre also tries to send 1MB requests to storage when possible, thus providing frequent merge opportunities under load.

3 Operational Experiences

In this section we cover the reliability of the components of the storage system to date, our experience with using an advanced Lustre Networking (LNET) routing scheme to increase performance, and finally our experiences with the Cray Gemini interconnect.

3.1 Hardware Reliability

3.1.1 DDN S2A 9900

The main focus of any storage system's reliability and performance is the disk subsystem. Over the course of 4 years in operation, the DDN S2A9900 has been a very stable and productive platform at the OLCF. Its performance has met our needs throughout its lifetime so far, and it has caused very little unscheduled downtime for the filesystems. Of paramount concern as the storage system ages is component failure rates. Based on our current data from Crayport, we have experienced an average of 4 disk failures per month since the storage system was brought online in 2009. Figure ?? shows disk failures in Spider over time.

Overall, the S2A9900 has been fairly stable and has not required large quantities of component replacement. Additionally, the architecture of Spider allows portions of the S2A9900 to fail without causing an outage. Figure 5 shows a chart of failures by component, along with a comparison against the failures that resulted in FRU replacement. The majority of disk failures resulted in FRU replacement; most of the other component failure types had a lower replacement-to-failure ratio.

[Figure 5: FRU failures in Spider from January 2009 through March 2012, by component (Disk, Singlet, Enclosure, IO Module, Disk Expansion Module, Enclosure PSU, Singlet PSU, Singlet Fan Module), comparing failure counts against replacements.]

3.1.2 Dell OSS/MDS nodes

The Lustre servers for Spider are Dell PowerEdge 1950s for the OSS and MGS servers and R900s for the MDS servers. These platforms have been extremely reliable through the life of Spider to date. Our on-site hardware support team has been able to replace nodes quickly to re-enable the full features and performance of Spider. Node swaps are less common than disk failures, but on average since January 2009 we replace an OSS node every 3 months. We take periodic maintenance windows to upgrade firmware and BIOS on the machines. The most common reason to replace a node is bad memory, and it is far easier to work with Dell to get the problem resolved with the node in the spare pool rather than serving production data. The possibility exists to simply swap the memory from a spare node, but that makes case tracking much harder when interacting with the vendor.

3.1.3 Cisco and Mellanox IB switch gear

The Spider Scalable I/O Network (SION) consists of approximately 2000 ports (both host and switch HCAs) and over 3 miles of optical InfiniBand cabling. The network was designed for fault tolerance, allowing access to the storage resources even if on the order of 50% of the networking gear is down. Since Spider was placed in production in 2009 there have been only 2 service interruptions caused by issues with SION; one was related to hardware, the other to the OFED software stack.

3.2 Software Reliability

As the Spider filesystem has transitioned from Lustre version 1.6.5.1 with approximately 100 patches through several versions of 1.8.x, stability has remained very good. That code base is mature and has resulted in a very reliable and productive platform for the science objectives at the Oak Ridge Leadership Computing Facility. For calendar year 2011 (see Table 1 below), the largest filesystem in the OLCF had 100% scheduled availability in 8 of 12 months. Overall for the year it had a scheduled availability of 99.26%, something that was almost unheard of in the version 1.4 days of Lustre. Those numbers are even more impressive when one considers the scale of the system in relation to the size of a Lustre 1.4 installation. The partnership between DDN, Dell, Mellanox, and the Lustre players at Sun Microsystems, Cray, and Whamcloud has provided a roadmap for other centers to remove islands of data inside compute platforms, lower interconnection costs between compute resources, and decouple storage procurements from compute procurements.

Table 1: Spider Availability for 2011

Filesystem    Scheduled Availability    Overall Availability
widow1        99.26%                    97.95%
widow2        99.93%                    99.34%
widow3        99.95%                    99.36%
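
The gap between the two columns reflects scheduled outages. Under one common convention (an assumption here, not necessarily the OLCF's exact definition), scheduled availability excludes planned outages from the denominator while overall availability counts all wall-clock time; the sketch below shows that convention with illustrative outage hours.

```python
# Sketch: availability metrics under an assumed (common) convention:
# scheduled availability ignores planned outages in the denominator,
# overall availability counts all wall-clock time.
HOURS_IN_2011 = 365 * 24          # 8760

def availability(total_hours, scheduled_outage_h, unscheduled_outage_h):
    scheduled_time = total_hours - scheduled_outage_h
    scheduled_avail = (scheduled_time - unscheduled_outage_h) / scheduled_time
    overall_avail = (total_hours - scheduled_outage_h - unscheduled_outage_h) / total_hours
    return scheduled_avail * 100, overall_avail * 100

# Illustrative numbers only: ~115 h of planned and ~64 h of unplanned outage
# land in the neighborhood of widow1's 99.26% / 97.95%.
print(availability(HOURS_IN_2011, 115, 64))
```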

3.3 Advanced LNET Routing for Performance

In May of 2011 we applied an advanced Lustre Networking (LNET) routing technique we called Fine-Grained Routing (FGR) that allowed us to achieve much higher bandwidth utilization on the backend storage without any modifications to user application codes [4]. As can be observed in the maximum aggregate performance graph in Figure 6, there is a considerable dropoff in aggregate performance in August 2011, when we had to remove the FGR configuration in preparation for the progressive upgrade of Jaguar from XT5 to XK6. In August we saw a 20% decrease in the maximum aggregate bandwidth performance for Spider.

[Figure 6: Aggregate bandwidth to all of Spider for January 2011 through August 2011 (aggregate BW performance in MB/s).]

3.4 Gemini Experiences

3.4.1 Filesystem Interaction with Gemini Re-route

Our largest issue related to Gemini in production is documented in Cray bug 780996. Under certain circumstances the L0 component has extremely high CPU load and does not answer a routing request from the SMW, causing the entire machine to destabilize. In all cases where this bug was encountered, a reboot was required to rectify the situation. Work on this bug is ongoing, with a near-term resolution proposed. Under normal operating conditions we see this bug on the order of once per month. The dynamic routing feature of the Gemini interconnect has been our biggest win for availability of the compute resource in moving from the XT5 to the XK6.

3.4.2 LNET Self Test performance

As part of our acceptance suite for the XK6 we ran the LNET Self Test application from the Lustre testing suite. With our stated requirement of 1TB/s of sequential I/O performance for the next-generation filesystem, we needed to verify that the Gemini interconnect could pass enough traffic based on the approximate number of LNET routers we would have in the machine. The metric was at least 500GB/s of performance, and we achieved 793GB/s using 370 XIO nodes, a 2.14GB/s average per node. Additional per-node performance will be required to attain 1TB/s, as there are only 512 available service nodes in the machine and they cannot all be LNET routers. Areas where we think performance could be gained include making the checksumming routines multithreaded and possibly changing the algorithm used for computing the checksums.
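
A back-of-the-envelope check of that result, and of what it implies for the 1 TB/s target given the 384-420 I/O routers planned for Titan (Section 4.3), is sketched below; this is illustrative arithmetic only, not a performance projection.

```python
# Sketch: back-of-the-envelope LNET router bandwidth arithmetic.
measured_gb_s = 793.0                 # LNET Self Test aggregate result
routers_used = 370                    # XIO nodes used in the test
per_router = measured_gb_s / routers_used
print(f"per-router average: {per_router:.2f} GB/s")   # ~2.14 GB/s

target_gb_s = 1000.0                  # 1 TB/s requirement
max_routers = 420                     # upper bound of planned Titan I/O routers
print(f"routers needed at current per-node rate: {target_gb_s / per_router:.0f}")
print(f"aggregate with {max_routers} routers: {max_routers * per_router:.0f} GB/s")
# Roughly 467 routers would be needed at 2.14 GB/s each, but only 384-420
# are planned, hence the need for per-router improvements such as
# multithreaded checksumming.
```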


3.4.3 kgnilnd

At OLCF there is a large knowledge base of tunable parameters for the SeaStar LNET module (kptllnd) and for tuning Lustre on SeaStar in general. The transition to Gemini provided us the opportunity, and the challenge, of learning the parameters for an entirely different network technology. Documentation for kgnilnd circa October 2011 was very sparse: there was no baseline configuration that could be referenced, and no official document from Cray describing the module parameters and their recommended or default values. This made bringing up Spider on portions of the XK6 more difficult, as we would make progress, then have to send an e-mail to a developer at Cray and wait for a response. Fortunately, we had planned the schedule to allow for issues like this, and we were able to continue working on other hardware while we waited for responses.

The interaction between the HSN and SION also changes a little with Gemini. With SeaStar the HSN was the bottleneck; with Gemini the IB network is the bottleneck. Care must be taken to allow for router buffers and to monitor outstanding messages on the routers to ensure that traffic is not backing up and causing congestion.

4 System Architecture

OLCF's current center-wide file system, Spider (§4.1), represented a leap forward in I/O capability for Jaguar and the other compute platforms within the facility. Lessons learned from the Spider deployment and operations will greatly enhance the design of the next-generation file system. In addition to a new storage infrastructure, the OLCF's Scalable I/O Network, SION (§4.2), will be rearchitected to support the new storage platform.

4.1 Spider

Spider is the current storage platform supporting the OLCF. It was architected to support Jaguar XT5 along with the other computing and analysis systems within the center. The current Spider system is expected to remain at OLCF through 2013/2014.

Spider is a Lustre-based [13, 15] center-wide file system replacing multiple file systems within the OLCF. It provides centralized access to petascale data sets from all OLCF platforms, eliminating islands of data.

Spider is a large-scale shared storage cluster. 48 DDN S2A9900 [3] controller couplets provide storage which in aggregate delivers over 240GB/s of bandwidth and over 10 petabytes of formatted capacity from 13,440 1-terabyte SATA drives.

The storage is accessed through 192 Dell dual-socket quad-core Lustre OSS (object storage server) nodes providing over 14 Teraflop/s in performance and 3 Terabytes of memory in aggregate. Each OSS can provide in excess of 1.25 GB/s of file-system-level performance. Metadata is stored on 2 LSI Engenio 7900s (XBB2) [10] and is served by 3 Dell quad-socket quad-core systems.

A centralized file system requires increased redundancy and fault tolerance. Spider is designed to eliminate single points of failure and thereby maximize availability. By using fail-over pairs, multiple networking paths, and the resiliency features of the Lustre file system, Spider provides a reliable, high-performance centralized storage solution, greatly enhancing our capability to deliver scientific insight.

On Jaguar XK6, 192 Cray Service I/O (SIO) nodes are configured as Lustre routers. Each SIO is connected to SION using Mellanox ConnectX [14] host channel adapters (HCAs). These Lustre routers allow compute nodes within the Gemini torus network on the XK6 to access the Spider file system at speeds in excess of 1.25 GB/s per compute node. In aggregate, the XK6 system has over 240 GB/s of storage bandwidth.

4.2 Scalable I/O Network (SION)

In order to provide true integration among all systems hosted by the OLCF, a high-performance, large-scale InfiniBand [8] network, dubbed SION, was deployed. SION provides advanced capabilities including resource sharing and communication between the two segments of Jaguar and Spider, as well as real-time visualization, streaming data from the simulation platform to the visualization platform at extremely high data rates.

SION will be rearchitected to support the installation of the new file system while retaining connectivity for the existing storage and the other platforms currently connected to it. As new platforms are deployed at the OLCF, SION will continue to scale out, providing an integrated backplane of services. Rather than replicating infrastructure services for each new deployment, SION permits centralized, center-wide services, thereby reducing total costs, enhancing usability, and decreasing the time from initial acquisition to production readiness.

4.3 New System Architecture

OLCF's next-generation parallel file system will be required to support the I/O needs of Titan while succeeding the current storage infrastructure as the primary center-wide storage platform. While the final architecture will depend strongly on the storage solution selected and the supporting hardware required to build the file system, many necessary characteristics are known based on currently understood system requirements.

In its final configuration, Titan will contain between 384 and 420 service nodes for use as I/O routers. These nodes will connect to SION using QDR InfiniBand. Based on the final file system architecture, this QDR fabric will likely exist as a separate stage within SION with cross-connections to the Spider file system.

To fully support the I/O capabilities of Titan, the new file system will require up to 1TB/sec of sequential bandwidth. The system will employ Spider's scalable cluster philosophy to enable a flexible deployment capable of adjustment based on evolving system requirements and budget realities. Additionally, the decoupled nature of the SION network combined with the network agnosticism of LNET will permit the OLCF to procure any back-end storage technology supported by Lustre. A conceptual diagram of the new system architecture is presented in Figure 7.

[Figure 7: Conceptual diagram of the new system architecture. The current SION (IB DDR) connects Spider and the existing analysis and compute clusters; an upgraded SION (IB QDR/FDR) connects Titan, the next-generation file system, and future analysis and compute clusters.]

5 Lustre Architecture

Spider is a Lustre-based [13, 15] center-wide file system. Implementing Spider drove many enhancements to the Lustre file system in the 1.6 and 1.8 release series, and we expect the next-generation OLCF file system to make use of both OLCF and OpenSFS driven features in Lustre 2.2 and beyond.

Lustre 2.2 saw many improvements in the code base. Many of these directly affect the performance of the file system and help to improve the scientific productivity of the center:

• Parallel directory operations (PDO) address a long-standing performance issue when operating a system at large scale: each modification to a directory is serialized by a single lock, allowing only one operation to proceed at a time. File-per-process I/O models combined with ever-increasing core counts significantly increase the impact of this serialization when creating a checkpoint or analysis data set. OpenSFS funded development to split the directory lock into multiple locks, improving parallelism and reducing the time required to create thousands of files in a single directory.

• Increased maximum stripe count allows more OSTs to be used for a single file. Prior to Lustre 2.2, a file was limited to 160 OSTs; with ldiskfs's limit of 2 TB per object, a single file was therefore limited to 320 TB. This ORNL funded development allows up to 2000 OSTs per file, allowing a single file to store 4 PB. Perhaps more important than the increase in file size is the potential performance improvement from spreading the load over more OSTs. Spider currently has over 672 OSTs in its largest file system; these improvements allow a theoretical 4x bandwidth improvement for a single shared file with well-formed I/O characteristics.

• Imperative recovery helps reduce the time required to recover from the inevitable hardware failures seen at scale. In the past, recovery was allowed to take up to three times the OBD timeout to complete. At small scale this does not add significant delay to recovery, as the default timeout is 100 seconds and smaller client counts both increase the odds of each client performing I/O at the time of the failure and decrease the chance of a client failure during the recovery window. However, the scale of the OLCF has the opposite effect: it is unlikely that all of the center's resources are performing I/O during a failure, and it is much more likely that a client may die during the recovery window. This, combined with operational parameters that required a 6x increase in the OBD timeout (6 x 100 s = 600 s, giving a 3 x 600 s = 1800 s window), caused the recovery window to grow to over 30 minutes.

Imperative recovery takes advantage of a shared lock on the MGS to note when an OST is restarted, either on the same OSS or another, and causes each client to notice the new location of the OST without waiting for a timeout to occur. While this alone improves the recovery time, the recovery window is also reduced to take advantage of this active notification, since we notice dead clients much sooner. In combination, these ORNL funded changes are on track to reduce the OLCF recovery time from over 30 minutes to well under 5 minutes.

• Asynchronous glimpse locks. Lustre has been improved over the last several years with features to speed up directory traversal. Prior to asynchronous glimpse locks, the client would issue many RPCs for each file or subdirectory of a given parent directory. To improve this, statahead was added to Lustre so clients could prefetch file attributes from the MDS asynchronously. This initial design, while an improvement, still had limitations: each parallel statahead thread required access to the VFS layer, which introduced race conditions with other operations that also needed the VFS layer, and not all RPCs were processed asynchronously. The glimpse-size RPCs used to prefetch file sizes were still synchronous. The solution was to make glimpse-size RPCs asynchronous and to cache the information in the RPC reply in a statahead-local dcache on the client.

• PTLRPC thread pools. In Lustre 2.2 a generic ptlrpcd pool of threads was developed to handle all asynchronous RPCs on the client. In the past Lustre had 2 threads to handle this RPC load; often one of those threads would be idle while the CPU running the other was pegged. This clearly does not scale well to modern machines, so a pool of threads was created to take advantage of every CPU core. These threads are bound to individual cores to avoid the cache misses that degrade overall performance. Even though this helps distribute the CPU load, it is still possible for one thread to experience a delay due to an over-utilized CPU. To get the best of both worlds, two pools are created: some threads are always bound to a core, and the other threads are free to migrate to the least loaded core. This can be controlled with binding policies set at module load time.

Lustre 2.3 is expected to bring further improvements in performance:

• 4+ MB requests will help improve storage backend performance under heavy workloads from thousands of clients. While well-formed, sequential I/O is able to generate peak bandwidth from the storage system, the I/O stream presented quickly becomes random at the block level. With 1 MB write request sizes, block-level performance on Spider's DDN 9900 controllers is reduced to between 35% and 40% of peak unless I/O requests are merged within the controller. Sending 4 MB write requests to the controllers returns performance to above 80% of peak irrespective of request merging, and achieves above 90% in many cases.

Lustre 2.3 is expected to increase the maximum RPC size to 4 MB, allowing the OSSes to send larger requests through the object storage file system to the block layer. This will recover much of the performance lost to the highly random I/O patterns seen when thousands of clients simultaneously access the file system.

• SMP scalability improvements, funded by OpenSFS, will improve performance on today's multi-core systems often used for Lustre servers. Currently, Lustre has many points of contention at multiple layers of the stack. At these points, locks may be bounced from core to core, and from socket to socket, thrashing caches and dramatically increasing latency. Additionally, requests may be bounced from core to core while migrating from LNET to the RPC service that will ultimately process them, losing the benefit of any warm cache they may have generated.

The SMP scalability improvement work will split locks with high rates of contention and add per-socket and/or per-core queues for LNET and RPC services. This will improve LNET small message throughput, the RPC service rate, and the overall responsiveness and throughput of the MDS server.

• Online Object Index scrubbing will help protect the file system from metadata corrupted by non-graceful shutdowns, as well as restore the ability to perform file-level MDT backups. Lustre 2.x uses an object index to map file IDs (FIDs) to inodes in the backing store. If this index is corrupted or deleted due to a file-level backup or other storage issue, Lustre is unable to retrieve user data from storage without manual intervention. OpenSFS has funded work to perform an online scrub of this index in the background of user-initiated operations. This will verify that every FID in the index points to a correct and valid inode, and that every in-use inode has a FID and an entry in the object index. This work will form the basis for follow-on work to allow online integrity checks of the Lustre metadata between the MDT(s) and the OSTs, avoiding lengthy downtimes to detect and correct missing and/or orphaned data objects.

• The Network Request Scheduler (NRS) enhancement will give the Lustre servers more control over the order in which they process the RPCs presented by the clients, leading to better I/O patterns at the block scheduler and allowing different quality-of-service levels for different users or clients.

Currently, Lustre processes requests from clients in a first-in, first-out (FIFO) manner. This allows a large system to starve out smaller ones due to the disparate request generation rates, or allows many requests for the same resource to consume service threads waiting for a lock while requests for other resources sit idle in the network queue. NRS allows pluggable algorithms to specify the order in which requests are processed. These algorithms can implement policies that restrict the relative number of requests from a group of clients, or give preferential treatment to contiguous requests for the same data object. While the enabling technology will be introduced in Lustre 2.3, we expect to see community development of additional policy engines once the foundational mechanism is in place.

• The Object Storage Device (OSD) rework will allow Lustre to utilize different object storage systems. Currently, Lustre uses an ext4 derivative called ldiskfs to store user data in objects on disk. Lawrence Livermore National Laboratory has been working to allow Lustre to take advantage of the redundancy, management, and performance features of the ZFS filesystem. The OSD rework provides the flexibility to allow both ldiskfs and ZFS to be used as the underlying object store in the near future, and allows for future expansion to BTRFS and other new file systems longer-term.

As we look beyond the expected features in the Lustre 2.3 release, it becomes more difficult to determine when various improvements will become available to the Lustre community. However, several improvements being worked on are likely candidates for inclusion in the OLCF next-generation file system:

• Distributed Namespaces (DNE) will improve Lustre aggregate metadata performance by allowing multiple MDTs to be used in a single Lustre filesystem. This horizontal scaling will help isolate the metadata demands of different groups of users while allowing management of the storage as one aggregated pool. The initial implementation of DNE will allow directories to be created on separate MDTs, spreading the load across different portions of the namespace. Follow-on work will allow a single directory to be striped across multiple MDTs, allowing improvements when thousands of clients are performing operations in that directory, such as during large-scale job startup and checkpointing.

• Client I/O improvements to prioritize page write-back. It has been observed for some time that individual clients suffer from several I/O bottlenecks. Lustre currently uses a basic FIFO to cache dirty pages, which ties performance to the behavior of the application. This is remedied by merging sequential pages on the client side during the formation of requests. Further performance improvement will be gained by prioritizing the write-back of pages associated with a lock being canceled by the OST; this will reduce the time required to release the lock to other clients waiting to write their data. By moving the locking to a per-object basis, this will also reduce lock contention on the clients.

• Improved performance for directory scans will improve common use cases such as a user listing the names and sizes of files in a directory or applications scanning for their input deck. Currently, Lustre uses multiple requests to retrieve the names of the files in a directory, the sizes of the files, and the attributes of the files. While multiple names may be retrieved in one request, obtaining the size and attributes requires two requests per file. This load is reduced by sending all of the information about a file along with its name, reducing the number of requests (and LDLM locks) required from the MDT. While this improvement may not significantly impact application performance, it will increase the responsiveness of interactive user commands.

While looking to the future of the OLCF file systems, it is important not to forget the past and to minimize user pain as much as possible during the transition. It is possible to upgrade the existing Spider file systems from 1.8 to 2.2 (and beyond) without requiring a reformat. This has the advantage of not requiring users to move their data to archival storage and restore it to the new system, but trades off access to many of the new features supported by recent Lustre releases. There is work underway to convert the on-disk format of 1.8 file systems to 2.x, and the OLCF will monitor the progress of this effort to provide the least interruption possible to our users.

6 Systematic Benchmarking

The benchmarking suite is a comprehensive suite providing Lustre file-system-level and block-level performance metrics for a given file or storage system. The suite also provides tools for quick visualization of the results, allowing head-to-head comparison and detailed analysis of a system's response to the exercised I/O scenarios. Using bash scripts as wrappers to control, coordinate, and synchronize pre-selected I/O workloads, the suite uses obdfilter-survey at the file system level and fair-lio at the block level as workload generators. Obdfilter-survey [7] is widely used and exercises the obdfilter layer in the Lustre I/O stack for reading, writing, and rewriting Lustre objects. Fair-lio [4] is an in-house developed, libaio-based tool that generates parallel and concurrent block-level sequential and random, read and write asynchronous I/O to a set of specified local block-level targets. The benchmark suite was developed as part of the OLCF's efforts towards procuring and deploying the next-generation center-wide shared Lustre file system for the Titan supercomputer and other OLCF computing, analysis, and visualization resources. The suite is publicly available and can be obtained at the OLCF website [6].

Based on the lessons learned from our Spider file system deployment and operations, and also from our I/O workload characterization work outlined in Section 2, our benchmark suite evaluates the file and storage system I/O response in terms of various characteristics, such as performance and scalability. Our I/O workload characterization work and our experiences with the Spider file system both showed that the aggregate I/O workload observed at the file and storage system level is highly bursty, random, and a heavy mix of small (less than 512 kB) and large (512 kB and above) read and write requests. Our benchmark suite therefore tries to mimic this workload.

As stated, the benchmark suite has a block I/O section and a file system I/O section. The block I/O section consists of 4 different benchmarks: the single host scale-up test (block-io-single-host-scale-up.sh), the single host full-scale test (block-io-single-host-full-run.sh), the scalable storage unit scale-up test (block-io-ssu-scale-up.sh), and the scalable storage unit degraded-mode full-scale test (block-io-ssu-degraded.sh). Of these, the first three are used for assessing the performance and scalability of a healthy system, while the fourth is used for degraded systems. All four benchmarks require the pdsh and dshbak utilities to execute. In each of the four benchmarks, the storage system is exercised with random and sequential, small and large, read and write I/O operations. For all tests, the command line parameters, including queue size, block size, I/O test mode (sequential write, sequential read, random write, random read), and the iteration number, are generated before the actual execution and then randomized; the randomized list of tests is then fed into the benchmark I/O engine and executed. Through this randomization of test parameters and of the sequence of I/O operations and modes, we eliminate caching effects on the test nodes and the storage system, thereby obtaining much more realistic readouts (based on our Spider experiences and the results of our workload study, we know that the real I/O traffic is heavily bursty and a heavy mix of I/O modes and operations with varying block sizes). Each individual iteration of a test is run for 30 seconds to obtain statistically meaningful results.
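
The sketch below illustrates the permute-then-shuffle approach described above, in Python rather than the suite's bash wrappers. The parameter values are those listed for the single-host full-scale test; the assumption of 5 iterations (12 block sizes x 3 queue sizes x 4 modes x 5 iterations = 720) is our inference from the stated test count, not something the suite documents.

```python
# Sketch: generate and shuffle the test-parameter permutations before feeding
# them to the block-level workload generator (fair-lio). Illustrative only;
# the real suite is implemented as bash wrapper scripts.
import itertools
import random

KIB = 1024
block_sizes = [4*KIB, 8*KIB, 16*KIB, 32*KIB, 64*KIB, 128*KIB, 256*KIB,
               512*KIB, 1024*KIB, 2048*KIB, 4096*KIB, 8192*KIB]
queue_sizes = [4, 8, 16]
io_modes = ["seq_write", "seq_read", "rand_write", "rand_read"]
iterations = 5                       # assumption: 12 x 3 x 4 x 5 = 720 tests

tests = list(itertools.product(range(iterations), io_modes, queue_sizes, block_sizes))
random.shuffle(tests)                # randomize order to defeat caching effects

for it, mode, qdepth, bsize in tests:
    # Each entry would be handed to the I/O engine for a fixed-length run.
    print(f"iter={it} mode={mode} queue={qdepth} bsize={bsize}")
```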

Of these four benchmarks, the block-io-single-host-full-run test is used to characterize the performance and scalability of the underlying storage system from a single I/O server's perspective. A single SCSI disk block device (sd device) configured on the target host is exercised in this test. There are 720 individual tests in this benchmark, and the total run time of the benchmark will be (720 tests * 45 seconds), or in other words 9 hours. The script will generate a summary file capturing the STDOUT of the script with some additional information, and a .csv (comma separated values) results file capturing detailed results of the tests and derived statistics. The script will also create a subdirectory under the parent directory from which the script was launched and write individual raw test results to separate files in this new directory. This benchmark uses 4 kB, 8 kB, 16 kB, 32 kB, 64 kB, 128 kB, 256 kB, 512 kB, 1 MB, 2 MB, 4 MB, and 8 MB block I/O request sizes, sequential and random write and read I/O modes and operations, and queue sizes of 4, 8, and 16. A permutation of all these variables is generated as command line arguments before the actual execution of the benchmark, randomized, and then fed into the actual I/O workload generator engine (i.e. fair-lio), as explained above.

The block-io-single-host-scale-up test runs a randomized set of sequential and random write and read I/O benchmarks using the fair-lio binary for various block and queue sizes, for all SCSI disk block devices configured on a single I/O server node of a given scalable storage unit, for multiple iterations. Similar to the previous test, this benchmark also generates and randomizes the test parameters, modes, and operations before execution. As outlined in our workload study, the I/O block size PDF shows three spikes, at less than 16 kB, at 512 kB, and at 1 MB, for both read and write operations; these three request sizes were observed to account for more than 95% of total requests. Therefore, to speed up the benchmarking process and shorten the actual test time, we chose 4 kB, 8 kB, 512 kB, and 1 MB as block sizes. The total run time of the test will be at least log2(number of target test devices) * (144 tests * 45 seconds), or in other words at least 1.8 hours per scaling step. As an example, if every test host has 5 target test devices, the total run time will be 7.2 hours.
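
One reading of that run-time formula, consistent with the 5-device example and with the device counts shown in Figure 8 (1, 2, 4, 5), is that the device count is scaled in powers of two plus the full set; the sketch below estimates the run time under that assumption. The per-step factorization in the comment is likewise illustrative, not documented by the suite.

```python
# Sketch: estimated run time of the single-host scale-up test, assuming the
# device count is scaled in powers of two plus the full set (1, 2, 4, ..., N),
# which matches the 5-device example (1, 2, 4, 5 -> 4 steps -> 7.2 hours).
TESTS_PER_STEP = 144      # e.g. 4 block sizes x 3 queue sizes x 4 modes x 3 iterations (illustrative)
SECONDS_PER_TEST = 45

def scale_up_steps(num_devices):
    steps, n = [], 1
    while n < num_devices:
        steps.append(n)
        n *= 2
    steps.append(num_devices)
    return steps

def run_time_hours(num_devices):
    return len(scale_up_steps(num_devices)) * TESTS_PER_STEP * SECONDS_PER_TEST / 3600.0

print(scale_up_steps(5), run_time_hours(5))   # [1, 2, 4, 5] 7.2
```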

The block-io-ssu-scale-up benchmark exercises all configured SCSI block devices on all configured test hosts of the scalable storage unit to gather the maximum obtainable performance of the scalable test cluster under various I/O test modes and operations. This test again runs a randomized set of sequential and random write and read I/O benchmarks using the fair-lio binary for various block and queue sizes, for all SCSI disk block devices, for multiple iterations on all I/O servers. The block and queue sizes selected for this benchmark are identical to those of the block-io-single-host-scale-up benchmark. The total run time of the benchmark will therefore again be at least (log2(number of target test devices)) * (144 tests * 45 seconds), or at least 1.8 hours per scaling step. As an example, if every test host has 5 target test devices, the total run time will be 7.2 hours.

The degraded-mode test, block-io-ssu-degraded, is similar to the ssu scale-up test and has an identical set of block size, queue size, and I/O mode and operation parameters. This test exercises all SCSI block devices (i.e. RAID arrays or LUNs) on all test hosts and provides the performance profile of the ssu when 10% of the SCSI block devices are being rebuilt. Before running this script, the tester is expected to make sure that at least 10% of the SCSI block devices are in an active rebuild state for the entire execution of the benchmark. This script again runs a randomized set of sequential and random write and read I/O benchmarks using the fair-lio binary for various block and queue sizes for all SCSI block devices. The total run time of the test will be (144 tests * 45 seconds), or in other words 1.8 hours.

Also included in the benchmark suite are tools for parsing and plotting the obtained results of the block I/O benchmarks. These plotting tools require the gnuplot and ps2pdf utilities to execute; both are quite common in HPC environments. A sample block I/O plot is presented in Figure 8.

[Figure 8: A sample block I/O benchmark plot, showing percentage of total performance versus number of devices (1, 2, 4, 5) for a sequential-read ssu scale-up run at various block and queue sizes.]

For assessing the Lustre-level performance and scalability of a file and storage system, we developed a set of tools built around the obdfilter-survey engine. Our tools generate the required set of parameters and variables and then feed them to obdfilter-survey. Again, similar to our block I/O tests, these parameters are determined based on our experiences with Spider and our I/O workload characterization study. Our Lustre-level benchmark package includes an obdfilter-survey-olcf script. This script is different from the one that comes with Lustre distributions and includes the required set of I/O parameters and variables. The only modifiable parameter in this benchmark package is the list of OSTs to be tested; there are no other modifiable variables or parameters. This benchmark assumes a fully configured and functional Lustre file system already running on the test hardware. However, Lustre clients are NOT needed to run this benchmark suite. The benchmark is tested against Lustre version 1.8. The benchmark package requires passwordless ssh capability from the head node to the OSSes, as well as between the OSSes. A sample file-system-level benchmark plot is presented in Figure 9.

[Figure 9: A sample file-system-level benchmark plot (percentage of total performance using 20 objects).]

OLCF's benchmark suite was shared with our partners and vendors in 2011 and has been publicly available since early 2012. The initial feedback we have received is very encouraging.


7 Conclusions

To support the need for a scalable, high-performance, center-wide parallel file system environment, the OLCF architected, developed, and deployed the Spider system. In 2013, Spider will have been in operation for over 5 years, having supported Jaguar XT4, Jaguar XT5, and Jaguar XK6 throughout this time. As the OLCF transitions to the next-generation hybrid Titan system, the Spider system will undergo a major upgrade to meet the performance and capacity requirements of Titan.

The design of our next-generation Spider system will draw upon our operational experiences, I/O workload characterizations, and the performance, capacity, and resiliency requirements of the OLCF. This system will incorporate a number of major changes in Lustre to improve resiliency, performance, and scalability. Similar advances in storage system technologies will be incorporated into this system.

Given the flexible architecture of the current Spider system, this upgrade will be brought online and operated concurrently with our current storage systems. This strategy will provide a smooth transition for our users, allowing them access to our current storage systems as new storage is deployed and made accessible.

References

[1] A. Bland, R. Kendall, D. Kothe, J. Rogers, and G. Shipman. Jaguar: The world's most powerful computer. In Proceedings of the Cray User Group Conference, 2009.

[2] Cray Inc. Cray XK6. http://www.cray.com/Products/XK6/XK6.aspx.

[3] Data Direct Networks. DDN S2A9900. http://www.ddn.com/9900.

[4] D. A. Dillow, G. M. Shipman, S. Oral, and Z. Zhang. I/O congestion avoidance via routing and object placement. In Proceedings of the Cray User Group Conference (CUG 2011), 2011.

[5] J. Dongarra, H. Meuer, and E. Strohmaier. Top500 supercomputing sites. http://www.top500.org, 2009.

[6] Oak Ridge Leadership Computing Facility. OLCF I/O evaluation benchmark suite. http://www.olcf.ornl.gov/wp-content/uploads/2010/03/olcf3-benchmark-suite.tar.gz.

[7] Oracle Inc. Benchmarking Lustre performance (Lustre I/O kit). http://wiki.lustre.org/manual/LustreManual20_HTML/BenchmarkingTests.html.

[8] InfiniBand Trade Association. InfiniBand Architecture Specification Vol 1, Release 1.2, 2004.

[9] Y. Kim, R. Gunasekaran, G. M. Shipman, D. Dillow, Z. Zhang, and B. W. Settlemyer. Workload characterization of a leadership class storage cluster. In Proceedings of the 5th Petascale Data Storage Workshop (PDSW'10), held in conjunction with SC'10, November 2010.

[10] LSI Corporation. 7900 HPC Storage System. http://www.lsi.com/storage_home/high_performance_computing/7900_hpc_storage_system/index.html.

[11] R. Miller, J. Hill, D. A. Dillow, R. Gunasekaran, G. M. Shipman, and D. Maxwell. Monitoring tools for large scale systems. In Proceedings of the Cray User Group Conference (CUG 2010), 2010.

[12] Oak Ridge National Laboratory, National Center for Computational Sciences. Jaguar. http://www.nccs.gov/jaguar/.

[13] Sun Microsystems Inc. Lustre Wiki. http://wiki.lustre.org, 2009.

[14] S. Sur, M. J. Koop, L. Chai, and D. K. Panda. Performance analysis and evaluation of Mellanox ConnectX InfiniBand architecture with multi-core platforms. In HOTI '07: Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, pages 125-134, Washington, DC, USA, 2007. IEEE Computer Society.

[15] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang. Understanding Lustre filesystem internals. Technical Report ORNL/TM-2009/117, Oak Ridge National Laboratory, National Center for Computational Sciences, 2009.
