© 2014 IBM Corporation
IBM GPFS 2014 / Elastic Storage
Title: Software Defined Storage in action with GPFS v4.1
Speaker: Frank Kraemer, IBM Systems Architect, mailto:[email protected]
© 2014 IBM Corporation
Agenda:
• File Systems Market Overview
• GPFS v4.1 News & Roadmap
• ILM with TSM/HSM & LTFS-EE
• Network Attached Storage (cNFS)
• GPFS Native RAID (GNR)
  o GPFS Storage Server (Intel x86)
  o Elastic Storage Server (Power8)
• GPFS-FPO (Hadoop/MapReduce)
• Summary & Roadmap
LEGO, the LEGO logo and the Minifigure are trademarks and/or copyrights of the LEGO Group.
© 2014 IBM Corporation
IBM GPFS vs. Competitors
Why choose GPFS?
1. Stability
2. Features
3. Scalability
4. OS platform support
5. Global namespace
6. References

Competitors (some)
• Lustre (Intel, DDN, Cray, Xyratex, ...)
• StorNext FS (Quantum)
• Gluster (Red Hat)
• Panasas (NAS)
• EMC Isilon (NAS)
• NetApp ONTAP v8.x (NAS)
• HDS HNAS/BlueArc (NAS)

Open source & research projects
• Ceph (acquired by Red Hat on April 30th, 2014)
• BeeGFS (ex-Fraunhofer FS)
• dCache
• XtreemFS

More info: http://en.wikipedia.org/wiki/List_of_file_systems
© 2014 IBM Corporation
GPFS 1998
GPFS: A Shared‐Disk File System for Large Computing Clusters
Frank Schmuck and Roger Haskin, IBM Almaden Research Center, San Jose, CA
© 2014 IBM Corporation
GPFS history and milestones
(Timeline figure: GPFS evolved from the Tiger Shark research file system; GPFS 1.x shipped on the IBM SP in 1998, followed by v1.4/2.x with AIX HACMP (SSA/ESS) and Linux "lc" cluster support through 2001-2004; v2.3/3.1/3.2 added remote mount capabilities (WAN), interoperability/disaster recovery (DR) and Information Lifecycle Management (ILM) on AIX v6/7, pLinux, Linux and Windows 2008R2; v3.3/3.5 added GPFS-SNC/Hadoop, GPFS Native RAID, AFM/Panache, Windows 7 x64 and Windows 2012 R2; v4.1 shipped in 2014, with 4.1.0.3 in October. The chart also shows the separate IBM SAN File System (SFS) v1.0/v1.1.)
© 2014 IBM Corporation
Software Defined Storage for Dummies
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=BK&infotype=PM&appname=STGE_DC_ZQ_USEN&htmlfid=DCM03004USEN&attachment=DCM03004USEN.PDF
This book examines data storage and management challenges and explains software‐defined storage, an innovative solution for high‐performance, cost‐effective storage using the IBM General Parallel File System (GPFS).
http://en.wikipedia.org/wiki/Software-defined_storage
mailto:[email protected]
© 2014 IBM Corporation
GPFS = Software Defined Storage (SDS)
(Diagram: a GPFS Storage Server cluster presents Block (OpenStack Cinder), Object (Swift), File (POSIX, NFS) and Hadoop-connector access.)
• Single software-defined storage solution across all these application types: Technical Computing, Big Data & Analytics, Cloud
• Linear capacity & performance scale-out
• Enterprise storage on standard hardware
• Single namespace
© 2014 IBM Corporation
GPFS DEVELOPMENT TEAMS
Background Information
© 2014 IBM Corporation
GPFS Almaden Research, CA
Latitude: 37°12′37.53″ N / Longitude: 121°48′25.23″ W
© 2014 IBM Corporation
GPFS Lab Poughkeepsie, N.Y.
Latitude: 41°39′8.35″ N / Longitude: 73°56′5.20″ W
© 2014 IBM Corporation
GPFS Support & Lab Mainz, Germany
© 2014 IBM Corporation
GPFS CONCEPTS
Tutorial
© 2014 IBM Corporation
GPFS Architecture (Basis)
Storage Area Network (SAN), shared SAS, twin-tailed disks, etc.
LUN = Logical Unit Number / NSD = Network Shared Disk
(Diagram 1: a SAN LUN maps to a GPFS NSD in a "1:1" relation.)
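For illustration, a minimal sketch of how SAN LUNs are turned into NSDs with a stanza file and then into a file system; device paths, NSD names and server names below are made-up examples:

# nsd.stanza -- one %nsd stanza per LUN (names/devices are examples)
%nsd:
  device=/dev/mapper/lun01
  nsd=nsd01
  servers=nsdsrv01,nsdsrv02
  usage=dataAndMetadata
  failureGroup=1
  pool=system

# create the NSDs, then a file system on top of them
1> mmcrnsd -F nsd.stanza
2> mmcrfs gpfs01 -F nsd.stanza -B 1M -T /gpfs01
3> mmmount gpfs01 -a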
© 2014 IBM Corporation
GPFS Architecture (Common)
(Diagram 2: LUNs on the SAN are attached to GPFS NSD servers; GPFS NSD clients access them over the LAN.)
© 2014 IBM Corporation
GPFS Architecture (Typical)
(Diagram 3: GPFS NSD clients reach NSD servers over LAN / WAN / InfiniBand (any mix); NSD servers see disk LUNs over the FC SAN, plus twin-tailed and internal disks; GSS and Hadoop/FPO nodes can be part of the same cluster.)
(GSS = GPFS Storage Server // FPO = File Placement Optimizer)
© 2014 IBM Corporation
(Diagram 4: Remote Cluster Mount (synchronous): NSD clients in a remote cluster mount the local cluster's file system via its NSD servers over LAN / InfiniBand. NSD = Network Shared Disk.)
© 2014 IBM Corporation
(Diagram 5: GPFS Active File Management (asynchronous): a remote cluster caches data (read/write) from the local cluster over WAN / InfiniBand. NSD = Network Shared Disk.)
© 2014 IBM Corporation
GPFS System Structure
(Diagram: an application issues file system calls into the OS kernel; the OS vnode/vfs layer calls the GPFS kernel extension, which works with the multi-threaded GPFS daemon (mmfsd). The daemon implements the configuration manager, file system manager and metanode services, and accesses Network Shared Disks (NSD); GPFS administration commands (mm...) talk to the daemon. The GPFS portability layer is required for Linux only.)
© 2014 IBM Corporation
GPFS Metadata Services (multi-threaded GPFS daemon, mmfsd)
• Configuration manager: 1 per cluster, elected by the quorum nodes; selects the file system managers and drives recovery after node failure
• File system manager(s): 1 per file system; handles file system configuration, disk space allocation, token management, quota management and security services
• Metanode(s): 1 per open file; serializes file metadata updates
© 2014 IBM Corporation
Consistency control: Locks and Tokens
(Diagram: client systems hold local locks on their cached data buffers and file structures for local consistency; distributed token servers grant tokens that give cached capability with global consistency. Tokens are requested/released by clients and revoked by the servers; they come in multiple modes, are distributed via hashing across the token servers, and are a recoverable service.)
© 2014 IBM Corporation
GPFS Replicated Data and Metadata
• No designated "mirror", no fixed placement function:
  - flexible replication (e.g., replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block-by-block
  - mm<cr|ch>fs interfaces for administration
• Inode, indirect blocks, and/or data blocks may be replicated
• Each disk address is a list of pointers to replicas; each pointer is a disk id + sector number
© 2014 IBM Corporation
GPFS Failure Group (FG) concept
Failure group: a collection of disks that could become unavailable simultaneously, e.g.,
• disks attached to the same storage controller
• disks served by the same NSD server
Used for two purposes:
• Replication: replicas of the same block must be on disks in two different failure groups
• Striping: stripe across failure groups first, then across the disks within a failure group: D1, D3, D5, D7, D2, D4, D6, D8
Reason: a common point of failure is also a common resource that requires load balancing.
GPFS-FPO uses an "extended failure group" that conveys additional location information, e.g., r,n = rack, node within rack. With replication 3, the second copy is placed in a different rack and the third copy in the same rack but on a different node.
(Figure: disks D1-D8 in failure groups FG1-FG4, laid out as nodes 1,1 / 1,2 in rack 1 and 2,1 / 2,2 in rack 2.) A configuration sketch follows below.
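A minimal sketch of how failure groups drive replica placement; controller names, NSD names and device paths are illustrative:

# nsd.stanza -- disks behind controller A go to FG 1, controller B to FG 2
%nsd: device=/dev/mapper/ctlA_lun1 nsd=fg1nsd1 servers=srv01,srv02 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/mapper/ctlB_lun1 nsd=fg2nsd1 servers=srv02,srv01 usage=dataAndMetadata failureGroup=2

# two data and two metadata replicas: GPFS places each replica in a different failure group
1> mmcrnsd -F nsd.stanza
2> mmcrfs gpfs01 -F nsd.stanza -m 2 -M 2 -r 2 -R 2 -T /gpfs01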
© 2014 IBM Corporation
GPFS v3.5 has fullIPv6 Support
• IPv6 (Internet Protocol version 6) is the version of the Internet Protocol (IP) intended to succeed IPv4.
• IPv6 was developed by the Internet Engineering Task Force (IETF) to deal with long-anticipated IPv4 address exhaustion, and is described in Internet standard document RFC 2460, published in December 1998.
• While IPv4 allows 32 bits for an IP address, and therefore has 2^32 (4,294,967,296) possible addresses, IPv6 uses 128-bit addresses, for an address space of 2^128 addresses.
• IPv6 also implements additional features not present in IPv4; network security is integrated into the design of the IPv6 architecture, including IPsec.
© 2014 IBM Corporation
GPFS VERSION 4.1
What's new with GPFS
© 2014 IBM Corporation
GPFS v4.1 (announced April 22, 2014)
• IBM GPFS Concepts, Planning, and Installation Guide (GA76-0441)
• IBM GPFS Administration and Programming Reference (SA23-1452)
• IBM GPFS Advanced Administration and Programming Reference (SC23-7032)
• IBM GPFS Problem Determination Guide (GA76-0443)
• IBM GPFS Data Management API Guide (GA76-0442)
http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&appname=gpateam&supplier=897&letternum=ENUS214-079&pdf=yes
© 2014 IBM Corporation
GPFS v4.1 product structure
Socket-based licensing, server and client for each edition
• Simpler, no more PVUs

Express Edition
• gpfs.base (no ILM, AFM, cNFS)
• gpfs.docs
• gpfs.gpl
• gpfs.msg
• gpfs.gskit

Standard Edition
• Adds gpfs.ext

Advanced Edition
• Adds gpfs.crypto

New platforms
• zLinux
• Ubuntu
Features by edition (Express / Standard / Advanced):
• Basic GPFS functionality: all editions (same as v3.5)
• ILM (storage pools, policies, mmbackup): Standard and Advanced
• Active File Management (AFM): Standard and Advanced
• Clustered NFS (cNFS): Standard and Advanced
• Encryption: Advanced only (*NEW*; the Advanced Edition itself is new in v4.1)
© 2014 IBM Corporation
Encryption and NIST Compliance
Native encryption support for GPFS v4.1 file systems

Addresses critical requirements:
• Encryption of data at rest
• Secure erase is mandatory today

• User and directory blocks are fully encrypted
• Per-inode file encryption key (FEK), wrapped with one or more master encryption keys (MEK)
• MEK management is external to GPFS (TKLM)
• GPFS v4.1 is NIST SP 800-131A compliant
© 2014 IBM Corporation
Encryption and NIST Compliance
• Native: encryption is built into the “Advanced” GPFS product
• Protects data from security breaches, unauthorized access, and being lost, stolen or improperly discarded
• Cryptographic erase for fast, simple and secure file deletion
• Complies with NIST SP 800-131A and is FIPS 140-2 certified
• Supports HIPAA, Sarbanes-Oxley, EU and national data privacy law compliance
© 2014 IBM Corporation
Native Encryption and Secure Erase
Encryption of data at rest• Files are encrypted before they are stored on disk
• Keys are never written to disk
• No data leakage in case disks are stolen or improperly decommissioned
Secure deletion • Ability to destroy arbitrarily large subsets of a file system
• No “digital shredding”, no overwriting: secure deletion is a cryptographic operation
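As a rough, illustrative sketch only: file encryption in GPFS 4.1 is driven by policy rules that tie files to keys served by an external key manager. The rule names, key id and RKM id below are invented, and the exact rule grammar and key-server (ISKLM/TKLM) setup are documented in the Advanced Administration guide:

# enc.pol -- assumed example: encrypt everything in the fileset 'secure'
RULE 'encSpec' ENCRYPTION 'E1' IS
    ALGO 'DEFAULTNISTSP800131A'
    KEYS('KEY-fs1-1:RKM_1')          /* key id : remote key manager id (made up) */
RULE 'encAll' SET ENCRYPTION 'E1' WHERE FILESET_NAME LIKE 'secure'

1> mmchpolicy gpfs01 enc.pol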
© 2014 IBM Corporation
Reliability, Availability and Serviceability (RAS) #1
Automated deadlock detection, notification, and debug data collection
• Automated deadlock detection
• Automated deadlock data collection
• Automated deadlock breakup

Dump improvements
• Daemon survival under heavy loads
• Ability to dump more data

Message logging
• Send message logs to the system event logging facility

Directory enhancements to allow shrinking
• Merging of mostly empty blocks
• Allows larger directory block sizes
© 2014 IBM Corporation
Reliability, Availability and Serviceability (RAS) #2
User-defined node classes
• mmcrnodeclass, mmchnodeclass, mmdelnodeclass, mmlsnodeclass (see the sketch below)

Quota file improvements
• Quota management can be enabled without unmounting the file system

fsck() speed improvements

Support for GPT NSDs
• Adds a standard disk partition table (GPT type) to NSDs
• Disk label support for Linux

The new GPFS NSD v2 format provides the following benefits:
• Includes a partition table so that the disk is recognized as a GPFS device
• Adjusts data alignment to support disks with a 4 KB physical block size
• Adds backup copies of some key GPFS data structures
• Expands some reserved areas to allow for future growth
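A small sketch of the two new administrative features named above; class, node and fileset names are placeholders, and the mmsetquota argument form should be checked against the 4.1 man page:

# group nodes into a user-defined node class and reuse it in other commands
1> mmcrnodeclass tsmNodes -N node01,node02,node03
2> mmlsnodeclass tsmNodes
3> mmchconfig pagepool=8G -N tsmNodes

# per-fileset quota with the new mmsetquota command (soft:hard limits)
4> mmsetquota gpfs01:projectA --block 10T:12T --files 10M:12M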
© 2014 IBM Corporation
Performance & Health Monitoring
Network performance monitoring
• The GPFS daemon caches statistics relating to RPCs; a set of up to seven statistics is cached per node: channel wait time, send time TCP, send time verbs, receive time TCP, latency TCP, latency verbs, latency mixed
• GPFS RPC latency measurement: mmdiag --rpc

Ongoing enhancements in GPFS 4.1 TLs
• Disk performance monitoring
• Memory utilization monitoring
© 2014 IBM Corporation
Performance Improvements
Fine Grained Directory Locking (FGDL)

Local Read Only Cache (LROC) (configuration sketch below)
• Overflows the file data cache to local SSD storage
• Defined as an NSD with "localCache" as its usage
• Can be configured for data or metadata (inodes/directories)
• Uses SSD as an extension of the GPFS buffer pool, saving more memory for applications
• Automatic management of the local storage

Write Data Logging (WDL)
• Takes advantage of NVRAM in GPFS client nodes to reduce the latency of small and synchronous writes
• Write performance scales with the addition of GPFS client nodes
(Diagram: GPFS clients with local LROC SSDs in front of a GPFS Storage Server cluster.)
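A minimal sketch of defining a client-local SSD as an LROC device; the device path, NSD name and node name are examples, and the lrocData/lrocInodes/lrocDirectories tuning parameters are listed as assumptions to verify against the 4.1 documentation:

# lroc.stanza -- an SSD in a client node, used only as local read cache
%nsd:
  device=/dev/ssd0
  nsd=client01_lroc
  servers=client01
  usage=localCache

1> mmcrnsd -F lroc.stanza
# optionally steer what gets cached (assumed parameter names)
2> mmchconfig lrocData=yes,lrocInodes=yes,lrocDirectories=yes -N client01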
© 2014 IBM Corporation
Backup/Restore Improvements
New tool to restore from a fileset snapshot into the active file system
• Only copies the blocks and attributes that have changed since the snapshot being restored

TSM configuration verification by mmbackup
• The TSM B/A client must be installed, at the same version, on all nodes that will execute the mmbackup command
• The TSM B/A configuration is verified before the backup is executed

Automatic TSM tuning adjustments:
"The mmbackup command can be tuned to control the numbers of threads used on each node to scan the file system, perform inactive object expiration, and modified object backup. In addition, the sizes of lists of objects expired or backed up can be controlled or autonomically tuned to select these list sizes if they are not specified. List sizes are now independent for backup and expire tasks."
© 2014 IBM Corporation
GPFS v4.1 on Windows via Cygwin
http://cygwin.com/
Cygwin is:
• A large collection of GNU and open source tools which provide functionality similar to a Linux distribution on Windows
• A DLL (cygwin1.dll) which provides substantial POSIX API functionality

GPFS:
• GPFS uses Cygwin for its shell execution environment only
• All GPFS programs (executables/binaries) are native Windows binaries and have no linkage with Cygwin DLLs
• Cygwin is needed because SUA has been completely removed by Microsoft in Windows Server 2012 R2 (see http://technet.microsoft.com/en-us/library/dn303411.aspx)
© 2014 IBM Corporation
New and changed commands
Changed with GPFS v4.1:
mmaddcallback, mmafmctl, mmafmlocal, mmbackup, mmchcluster, mmchconfig, mmchfileset, mmchfs, mmcrcluster, mmcrfileset, mmcrfs, mmdiag, mmlsfs, mmlsmount, mmmigratefs, mmmount, mmrestorefs, mmsnapdir, mmumount

New with GPFS v4.1:
mmafmconfig, mmchnodeclass, mmcrnodeclass, mmdelnodeclass, mmlsnodeclass, mmsetquota
© 2014 IBM Corporation
GPFS MULTICLUSTER
Cloud File Systems via WAN (IP)
© 2014 IBM Corporation
GPFS as the Cloud 'backbone'
Why?
• Tie together multiple sets of data into a single namespace
• Allow multiple application groups to share portions or all of the data
• Secure, available and high-performance data sharing
• Support of public and private clouds
(Diagram: two SAN-attached GPFS clusters, each with its own LAN, connected via the GPFS NSD protocol on TCP/IP to create an enterprise-wide global namespace.)
© 2014 IBM Corporation
(Diagram: GPFS Multicluster in "cloud mode": Cluster A 'Europe' exports /gpfs1_clusterA, Cluster B 'US' exports /gpfs2_clusterB, and Cluster C 'Far East Asia' connects to both over the LAN / WAN via TCP/IP.)
© 2014 IBM Corporation
GPFS Multicluster - Firewall
• Bi-directional daemon communication
• Data to the file system always uses port 1191 (default) in both directions
• Optional: mmchconfig tscTcpPort=PortNumber
A setup sketch follows below.
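A condensed sketch of the usual cross-cluster mount setup, run partly on the owning and partly on the accessing cluster; cluster names, key file paths and mount points are illustrative:

# on the owning cluster: enable authentication and grant access
1> mmauth genkey new
2> mmauth update . -l AUTHONLY
3> mmauth add accessCluster.example.com -k /tmp/accessCluster_id_rsa.pub
4> mmauth grant accessCluster.example.com -f gpfs1

# on the accessing cluster: define the remote cluster and remote file system
5> mmremotecluster add owningCluster.example.com -n node1,node2 -k /tmp/owningCluster_id_rsa.pub
6> mmremotefs add rgpfs1 -f gpfs1 -C owningCluster.example.com -T /gpfs1_clusterA
7> mmmount rgpfs1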
© 2014 IBM Corporation
WIDE AREA DATA SERVICES
GPFS WAN Cache with AFM / Panache
© 2014 IBM Corporation
GPFS WAN Cache Support (AFM)
http://www.almaden.ibm.com/storagesystems/projects/panache/
http://static.usenix.org/event/fast10/tech/full_papers/eshel.pdf
© 2014 IBM Corporation
Motivation for GPFS AFM
• Data sharing across geographically distributed sites is common: while bandwidth is decent, latency is high and the network is unreliable, subject to outages
• The infrastructure needs to be scalable to move data across the WAN and to mask the latency and fluctuating performance of the network
• Applications want local performance for remote data: move data closer to the compute servers
• Traditional protocols for remote file serving are chatty and unsuitable; large files (VM images, virtual disks) are becoming predominant, and existing caching systems are primitive
© 2014 IBM Corporation
Global Namespace + AFM Cache
(Diagram: three clusters, each with its own file system store1 / store2 / store3. Each file system holds two local filesets and caches the others: store1 has local /data1,/data2 and caches /data3-/data6; store2 has local /data3,/data4; store3 has local /data5,/data6. Clients at every site see the same namespace /global/data1 ... /global/data6.)
• See all data from any cluster
• Cache as much data as required, or fetch data on demand
(Home = owning site, Cache = caching site)
© 2014 IBM Corporation
AFM Cache Modes (a creation sketch follows below)
Read Only (RO)
• The cache can only read data; no data changes are allowed.
Local Update (LU)
• Data is cached from home and changes are allowed as in SW mode, but changes are not pushed to home.
• Once data is changed, the relationship is broken, i.e., cache and home are no longer in sync for that file.
Single Writer (SW)
• Only the cache can write data; home must not change.
• Other peer caches have to be set up as read-only caches.
Independent Writer (IW)
• One or more filesets can be linked to the same home; other peer caches can point to the same home and can be set up as "iw" as well.
Change of modes
• SW and RO mode caches can be changed to any other mode.
• An LU cache can't be changed (too many complications/conflicts to deal with).
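A rough sketch of creating a single-writer cache fileset; the home export path, fileset names and the exact -p attribute spelling are assumptions to check against the mmcrfileset man page and the AFM chapter:

# on the cache cluster: create an AFM fileset backed by the home export (names are examples)
1> mmcrfileset store2 data1_cache --inode-space new -p afmMode=sw,afmTarget=nfs://homenode/gpfs/store1/data1
2> mmlinkfileset store2 data1_cache -J /gpfs/store2/data1
3> mmafmctl store2 getstate -j data1_cache     # show cache state / queue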
© 2014 IBM Corporation
AFM Technical Details
(Diagram: the home and cache sites each hold an independent directory tree (appl, data, web). The cache is itself a clustered file system with its own inode space; a cached file (e.g., inode 100) records the remote state of its home counterpart (e.g., inode 1024 with its mtime/ctime attributes), which AFM revalidates with LOOKUP/GETATTR.)
• Independent file systems, separate inode space
• SW: the home file system must not be changed; IW: home can be changed
• Cache is a clustered FS

Example (mmlsattr on a cached file, showing the pcache extended attributes):
[root@c25m4n03 fs10]# mmlsattr -d -X -L file435234
file name: file435234
metadata replication: 1 max 2
data replication: 1 max 2
immutable: no
appendOnly: no
storage pool name: system
fileset name: AFM_fs10
creation time: Fri Mar 22 10:35:06 2013
Windows attributes: ARCHIVE
gpfs.pcache.inode: 0x0000000000500003597E255F00000001
gpfs.pcache.attr:  0x0000000000036126000000000[...]
© 2014 IBM Corporation
GPFS ILM
Information Lifecycle Management (ILM)
© 2014 IBM Corporation
GPFS Storage Pools & Policies
Motivation:
• Not all storage is the same: some is faster, cheaper, more reliable, ...
• Not all data are the same: some are more valuable, important, popular, ...

Storage pool: a named collection of disks with similar attributes, intended to hold similar data
• System pool: one per file system; holds all metadata
• Data pools: zero or more; hold only data
• External pool: off-line storage (e.g., tape) for rarely accessed data

Policy: a set of user-specified rules that match data to the appropriate pool
• SQL-like syntax for selecting files based on file attributes, such as:
  - name or name pattern (e.g., *.jpg)
  - owner, file size, time stamps
  - extended attributes
(Example tiers: SSD = "gold", 10k rpm SAS = "silver", 7200 rpm SATA = "bronze"; a policy sketch follows below.)
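A small, illustrative policy file for the gold/silver/bronze tiers above (pool and rule names are examples); the SQL-like rule syntax is the GPFS policy language described in the Advanced Administration guide:

# tiering.pol
/* placement: new JPEGs start on the SSD tier, everything else on SAS */
RULE 'jpgs'    SET POOL 'gold'   WHERE LOWER(NAME) LIKE '%.jpg'
RULE 'default' SET POOL 'silver'

/* migration: when gold is 80% full, drain it to 70% by moving cold files down */
RULE 'spill' MIGRATE FROM POOL 'gold' THRESHOLD(80,70)
             WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
             TO POOL 'silver'
RULE 'age'   MIGRATE FROM POOL 'silver' TO POOL 'bronze'
             WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS

1> mmchpolicy gpfs01 tiering.pol        # install the placement rules
2> mmapplypolicy gpfs01 -P tiering.pol  # run the migration rules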
© 2014 IBM Corporation
Actionable intelligence for file storage tiering via GPFS
GPFS knows: file name, file type, I/O size, type of storage technology, latency of storage, locality of storage, time of last access, block size, time of last change, clone attributes, time of file creation, file tree location, file heat, file size, filesets, generation of a file's reuse, group owning the file, user owning the file, space efficiency of the file, access permissions, time of last metadata change
+ another 27 miscellaneous attributes & custom extended attributes (XATTR)

Block storage knows: I/O size, type of storage technology, compressible data set, latency of storage, locality of storage
© 2014 IBM Corporation
GPFS Pools
• When creating a file system or adding disks, specify the name of the pool that each disk belongs to. A pool is the collection of all disks with the same pool name.
• Pools can have attributes specified via "stanzas", e.g., allocation map layout and block size (see the stanza sketch below).
• There is a separate allocation map for each pool, so it is efficient to find space in a particular pool.
• Block size:
  - All data pools must have the same block size (this allows migrating files one data block at a time).
  - The system pool may have a different block size, but only if it is used for metadata only.
• The pool assignment is recorded in the inode of each file, so a file can only "belong" to a single pool; writes fail if the pool is full (ENOSPC).
(Example pools: system, data1, data2)
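A minimal sketch of defining the pools above in an NSD stanza file; device, server and pool names are examples:

# pools.stanza -- %pool stanzas define pool attributes, %nsd stanzas assign disks to pools
%pool: pool=data1 blockSize=1M layoutMap=scatter
%pool: pool=data2 blockSize=1M layoutMap=scatter

%nsd: device=/dev/mapper/md01 nsd=meta01 servers=srv01,srv02 usage=metadataOnly pool=system
%nsd: device=/dev/mapper/d101 nsd=d101   servers=srv01,srv02 usage=dataOnly     pool=data1
%nsd: device=/dev/mapper/d201 nsd=d201   servers=srv02,srv01 usage=dataOnly     pool=data2

1> mmcrnsd -F pools.stanza
2> mmcrfs gpfs01 -F pools.stanza -T /gpfs01
3> mmdf gpfs01        # shows free space per pool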
© 2014 IBM Corporation
GPFS Policies
Placement policy:
• Evaluated at file creation time
• Determines initial file placement and replication

Migration policy:
• Evaluated periodically or on demand
• Can move data between pools, change replication, delete data, or run arbitrary user commands

Policy engine (mmapplypolicy):
• Fast, parallel directory traversal combined with an inode scan
• Runs outside the daemon, but makes use of GPFS infrastructure and APIs (extended readdir, inode scan)
• Can be used as a powerful framework for building parallel file system utilities, e.g., fast find/grep or remote replication (see the sketch below)

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r5.gpfs100.doc%2Fbl1adm_mmapplypolicy.htm
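A sketch of using the policy engine as a parallel "find": an external LIST rule collects matching files into list files instead of moving data. Rule names, thresholds and node names are examples:

# bigfiles.pol -- list every file larger than 1 GiB that has not been read for a year
RULE EXTERNAL LIST 'big' EXEC ''
RULE 'findBig' LIST 'big'
     WHERE FILE_SIZE > 1073741824
       AND (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '365' DAYS

# -I defer only produces the list files (prefix /tmp/big); -N spreads the scan over nodes
1> mmapplypolicy gpfs01 -P bigfiles.pol -I defer -f /tmp/big -N nsdsrv01,nsdsrv02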
© 2014 IBM Corporation
GPFS Filesets & Fileset Snapshots
(Diagram: file system name space with root and filesets fset1, fset2, fset3.)
Fileset: a partition of the file system name space (a sub-directory tree)
• Allows administrative operations at finer granularity than the entire file system, e.g., disk space limits, user/group quota, snapshots, caching, ...
• Can be used to refer to a collection of files in policy rules

Independent fileset: a fileset with a reserved set of inode block ranges (an "inode space")
• Allows per-fileset inode scans
• Enables fileset snapshots (inode copy-on-write operates on inode blocks)
• Separate inode limit and inode file expansion for each inode space; the active inode file may become sparse
A command sketch follows below.
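A brief sketch of creating an independent fileset and snapshotting only that fileset; fileset and snapshot names are examples, and the exact fileset-snapshot argument form should be checked against the mmcrsnapshot man page:

# create an independent fileset, link it into the namespace, give it a quota
1> mmcrfileset gpfs01 projectA --inode-space new
2> mmlinkfileset gpfs01 projectA -J /gpfs01/projectA
3> mmsetquota gpfs01:projectA --block 5T:6T

# snapshot only this fileset (per-fileset copy-on-write)
4> mmcrsnapshot gpfs01 snap1 -j projectA
5> mmlssnapshot gpfs01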
© 2014 IBM Corporation
GPFS Fileset Snapshots
(Diagram: the active file system with root, fset1 and fset2; per-fileset snapshots of fset1 and fset2 coexist with a global snapshot; snapshots are maintained via copy-on-write with "ditto" resolution back to the active blocks.)
© 2014 IBM Corporation
GPFS BACKUP & RESTORE
GPFS Backup, Restore and HSM via TSM
© 2014 IBM Corporation
Backup/Restore via Tivoli Storage Manager
• GPFS can use multiple TSM servers in parallel (scale)
• The TSM B/A client for GPFS runs on each node; backup & restore are done in parallel
• LAN-free mode is possible
• The GPFS policy engine is used; no file tree walk is needed
(Diagram: the GPFS file system /gpfs01 is (1) backed up in parallel to TSM servers #1..#N and their disk storage pools, which (2) migrate data and (3) back up the storage pools to copy pools #1/#2; (4) restore also runs in parallel.)
© 2014 IBM Corporation
What is 'mmbackup'?
mmbackup backs up a GPFS file system to one or more TSM servers, using the policy engine's parallel scan to find changed files instead of a file tree walk.
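A small usage sketch; the node class and TSM server names are examples, and option spellings should be checked against the 4.1 mmbackup man page:

# incremental backup of /gpfs01, scanning in parallel on the nodes in class 'tsmNodes'
1> mmbackup /gpfs01 -t incremental -N tsmNodes --tsm-servers tsmsrv1,tsmsrv2

# full backup, e.g., after changing the include/exclude configuration
2> mmbackup /gpfs01 -t full -N tsmNodes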
© 2014 IBM Corporation
GPFS HSM via DMAPI
(Diagram: files are migrated out of the file system through the DMAPI; each migrated file is replaced by a stub that carries the HSM object ID / DMAPI handle.)
© 2014 IBM Corporation
GPFS Hierarchical Storage Management
(Diagram: the HSM client backs up, migrates and recalls files against the TSM server. A normal file can be premigrated (data copied, file still resident, object ID / DMAPI handle recorded) or migrated (only the stub with the DMAPI handle remains on disk). After a restore, a new DMAPI handle is created; with migstate=yes the link to the migrated copy is re-established on the restore path. Backup keeps a number of file versions on the TSM server.)
© 2014 IBM Corporation
GPFS Hierarchical Storage Management
© 2014 IBM Corporation
GPFS SOBAR Backup Procedure
1. Preparation: pre-migrate / migrate all files
2. Information collection: create the file system configuration backup file; create a file system snapshot & file system image
3. TSM backup: back up the file system configuration & file system image to TSM

Scale Out Backup and Restore (SOBAR) is a specialized mechanism for data protection against disaster, only for GPFS file systems that are managed by Tivoli Storage Manager (TSM) Hierarchical Storage Management (HSM).
© 2014 IBM Corporation
GPFS SOBAR Restore Procedure
1. TSM restore: restore the file system configuration & file system image
2. Target FS preparation: extract & apply the file system configuration; create the NSDs and the file system
3. Extract the file system image: mount the file system; recreate the file system image
4. Start production: start the HSM daemons & remount the file system; add HSM management and start recall
© 2014 IBM Corporation
GPFS ILM VIA LTFS-EE
Linear Tape File System Enterprise Edition
© 2014 IBM Corporation
GPFS ILM with LTFS-EE
• LTFS Enterprise Edition integrates LTFS with GPFS
  - LTFS represents an external tape pool to GPFS (see the policy sketch below)
  - Files can be migrated using GPFS policies or LTFS EE commands
  - Similar implementation as with TSM HSM
• LTFS EE can be configured on multiple nodes
  - Multiple instances of LTFS EE share the same tape library
(Diagram: GPFS nodes run the GPFS user file system and LTFS LE+; data flows over Fibre Channel / the LTFS EE SAN to the IBM TS3500 tape library.)
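As an illustrative sketch only: in the GPFS policy language, tape behind LTFS EE (like TSM HSM) appears as an EXTERNAL POOL whose EXEC script performs the data movement. The script path and pool/option names below are assumptions; the LTFS EE documentation gives the exact interface:

# ltfs.pol -- migrate cold files from the disk pool to the LTFS EE tape pool
RULE EXTERNAL POOL 'ltfs'
     EXEC '/opt/ibm/ltfsee/bin/ltfsee'        /* assumed migration script path */
     OPTS '-p copy_pool'                      /* assumed tape-pool option      */
RULE 'toTape' MIGRATE FROM POOL 'silver' THRESHOLD(85,70)
     WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
     TO POOL 'ltfs'
     WHERE FILE_SIZE > 5242880                /* leave small files on disk */

1> mmapplypolicy gpfs01 -P ltfs.pol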
© 2014 IBM Corporation
GPFS and LTFS EE integration
• Users access the file system on all GPFS nodes
• The user file system is the staging area for subsequent migration
• HSM integrates with the GPFS user file system and MMM to manage migration and recall
• MMM manages the workload over all LTFS EE instances
• LTFS LE+ manages tape access via the local tape drives
• A metadata file system stores the shared LTFS tape index
© 2014 IBM Corporation
GPFS HSM and LTFS EE
• The HSM client integrates with DMAPI to intercept file access
• The HSM client calls the migration driver / MMM to perform migration
  - Migration can be triggered manually or via policies
  - Migration moves the file to LTFS and leaves a stub; the stub includes a reference to the directory on the LTFS tape
  - MMM performs load balancing
• When a stub is accessed, the HSM client calls MMM
  - MMM identifies free resources and performs a recall from LTFS tape
  - After the entire file is back on disk, user access is granted
(Diagram: file access on the user file system goes through DMAPI to the HSM client; migration and recall run through MMM and the migration driver to LTFS LE+ and the tape library, on this and other LTFS nodes.)
© 2014 IBM Corporation
LTFS-EE Tape Import Feature
• Import adds the specified tape to the LTFS Enterprise Edition system
  - Adds stub files in the GPFS file system; imported files are in the migrated state
  - No file data movement; the actual file data remains on tape
  - File data can still be accessed (recalled) via the stub
• LTFS SDE and LE tapes can also be imported; they are first converted to LTFS EE tapes
• LTFS EE provides a command for tape import: ltfsee import
  - Import can be done into a specific directory in the GPFS file system
  - Options rename, overwrite or ignore can be used to manage conflicts with existing files
© 2014 IBM Corporation
LTFS-EE Tape Export Feature
• Export removes tapes from the LTFS EE system for vaulting or data exchange
  - Removes tapes from the pool; exported tapes are no longer a target for migrations or recalls
  - Files (stubs) migrated to an exported tape can be deleted or kept in GPFS
  - An export message can be added to the file stubs (64 bytes)
• Export with the offline option keeps the file stubs in GPFS: files remain visible in the GPFS namespace but are no longer accessible
• LTFS EE provides a command for tape export: ltfsee export
  - To move a tape to the I/O station use ltfsee tape move ieslot
© 2014 IBM Corporation
LTFS-EE References
LTFS EE InfoCenter: http://pic.dhe.ibm.com/infocenter/ltfsee/cust/index.jsp
GPFS InfoCenter: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp
LTFS EE Redbooks: http://www.redbooks.ibm.com/redpieces/abstracts/sg248143.html
LTFS EE Installation Demo: http://www.youtube.com/watch?v=bF5tHAjp5xA&feature=youtu.be
© 2014 IBM Corporation
NETWORK ATTACHED STORAGE (NAS)
CIFS/SMB & NFS via GPFS
© 2014 IBM Corporation
GPFS Clustered NFS (cNFS)
• Feature of GPFS on Linux (Linux NFS server nodes only!)
• Share files with non-GPFS clients using the NFS protocol; the NFS daemon (nfsd) of the Linux OS is used as normal
• All nodes can share the same data
• If an NFS server node fails, client connections are moved to another server
• NFS server node(s) need a GPFS server license; NFS clients need no GPFS license
(Diagram: NFS clients on AIX, Linux, OSX and Windows reach the cNFS server(s) over the LAN; the servers are GPFS NSD nodes of the GPFS file system.)
© 2014 IBM Corporation
GPFS Clustered NFS (cNFS) #2
# Enable cNFS on the GPFS cluster
1> mmchconfig cnfsSharedRoot=<Dir_Path_Name>

# Add each node with the correct IP interface
2> mmchnode -N <node_name> --cnfs-enable --cnfs-interface=<nfs_ip>

# Check cluster status
3> mmlscluster --cnfs

# Done!
http://www.redbooks.ibm.com/redpapers/pdfs/redp4400.pdf
© 2014 IBM Corporation
User Space NFS v4: "NFS-Ganesha"
NFS-GANESHA is an NFS server running in user space. It is available under the LGPL license.
It has been designed to meet two goals:
1. providing very large metadata and data caches (up to millions of records)
2. providing NFS exports to various file systems and namespaces (a set of data organized as a tree, with a structure similar to a file system)

NFS-GANESHA uses dedicated backend modules called FSALs (File System Abstraction Layer) that provide the product with a unique internal API to access the underlying namespace. The FSAL module is basically the "glue" between the namespace and the rest of NFS-GANESHA.

https://github.com/nfs-ganesha/nfs-ganesha/wiki
https://github.com/nfs-ganesha/nfs-ganesha/
© 2014 IBM Corporation
GANESHA NFS
File System Abstraction Layer = FSAL_GPFS
GANESHA, a multi-usage NFSv4 server with a large cache (part of SONAS / V7000U v1.5)
© 2014 IBM Corporation
SAMBA: does it work with GPFS?
Many customers use SAMBA & CTDB (the Clustered Trivial Database, which makes SAMBA clustered) to share GPFS data with SMB/CIFS clients.
(Diagram: SMB/CIFS clients connect over SMB/CIFS to the GPFS cluster nodes, where CTDB runs.)
© 2014 IBM Corporation
Samba/CTDB/GPFS Update
Find more technical details:
http://sambaxp.org/past-conferences/sambaxp-2013/archive.html
http://sambaxp.org/program/schedule.html

Reminder: SMB/CIFS support via SAMBA or other software is not provided or supported by IBM with GPFS. You're on your own!
© 2014 IBM Corporation
GPFS NATIVE RAID (GNR)
GPFS Perseus: Declustered RAID
© 2014 IBM Corporation
GPFS Native RAID (GNR)
Software RAID on the I/O servers:
• SAS-attached JBODs; special JBOD storage drawers for very dense drive packing
• Solid-state drives (SSDs) for metadata storage

Features
• Auto rebalancing
• Only ~2% rebuild performance hit
• Reed-Solomon erasure code, "8 data + 3 parity"
• ~10^5-year MTTDL (mean time to data loss) for a 100-PB file system
• End-to-end, disk-to-GPFS-client data checksums
(Diagram: NSD servers on the LAN expose vdisks built from SAS-attached JBODs.)
© 2014 IBM Corporation
GNR is a software implementation of storage RAID technologies
(Diagram: "classic" GPFS with external RAID controllers vs. GPFS Native RAID on the NSD servers.)
© 2014 IBM Corporation
GNR Fault Tolerance
2- or 3-fault-tolerant RAID
• 8 data strips + 2 or 3 parity strips
• 3- or 4-way replication
When one disk is down (the most common case):
• Rebuild slowly, with minimal impact on the client workload
When three disks are down (a rare case):
• Only the fraction of stripes that have three failures (~1%) is critical
• Quickly get back to the non-critical (2-failure) state, vs. rebuilding all stripes for conventional RAID
© 2014 IBM Corporation
GPFS GNR v2.5 (*NEW* Oct 6th, 2014)
Supported server hardware:
GPFS Storage Server V2.5, consisting of two IBM Power System S822L servers (machine type 5146), with either:
• 128 GB memory (models 21S and 22S)
• 256 GB memory (models 24S and 26S)
GPFS Native RAID for GPFS Storage Server is also supported with the Lenovo Intelligent Cluster and current GSS Lenovo x86 solutions.
© 2014 IBM Corporation
GPFS STORAGE SERVER (GSS)
Declustered Software RAID Building Block
© 2014 IBM Corporation
GPFS Storage Server (GSS)
Benefits of GSS:
• 3 years maintenance and support
• Improved storage affordability
• Delivers data integrity, end-to-end
• Faster rebuild and recovery times
• Reduces rebuild overhead by 3.5x

Features:
• Declustered RAID (8+2p, 8+3p)
• 2- and 3-fault-tolerant erasure codes
• End-to-end checksum
• Protection against lost writes
• Off-the-shelf JBODs
• Standardized in-band SES management
• SSD acceleration built in

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/doc_updates/bl1du13a.pdf
© 2014 IBM Corporation
GSS v2.0 'Runs' GPFS v4.1 (Release "R2.0")
• GUI: configuration, performance monitoring
• Hardware changes: new servers and cards, smaller trays, SSD, SAS and NL-SAS drives
• Software enhancements: enclosure protection
(Diagram: GSS building blocks GSS#1 and GSS#2.)
© 2014 IBM Corporation
GPFS Native RAID @ x86
(Figure: building block of two NSD servers (x3650-M4, each with three LSI 9201-16e PCIe gen2 x8 HBAs) attached to six disk enclosures JBOD01-06 (6 x 60 disk slots). The drives form two recovery groups, RG01 and RG02; each recovery group has three declustered arrays of 58 disks (DA1-DA3) and a LOG group with 3-way replication across 3 drives, and each enclosure also holds SSDs.)
© 2014 IBM Corporation
GPFS GSS GUI #1: top-level navigation (preview information)
• Home • Monitoring • Files • Volumes • Copy Services • Access • Configuration
© 2014 IBM Corporation
GPFS GSS GUI #2
(Preview Information)
© 2014 IBM Corporation
GPFS GSS GUI #3
(Preview Information)
© 2014 IBM Corporation
IBM System x GPFS Storage Server (GSS) 2.0
Introducing four *new* models for an entry-level, high-performance storage server:
• Model 21s: 24 SSDs
• Model 22s: 48 SAS or SSD drives
• Model 24s: 96 SAS or SSD drives
• Model 26s: 144 SAS drives

What's inside the new models?
• Server: IBM System x3650 M4
• Storage: 2U JBOD (24 slots)
  - 24 SSDs (Model 21s)
  - 1.2 TB SAS drives plus a choice of 2 x 200 GB or 2 x 800 GB SSDs (Model 22s)
• Networking: 10 / 40 Gb Ethernet and/or FDR InfiniBand
• Software: GPFS 4.1

• Balanced system: high capacity and performance, near-linear scalability
• Less hardware: more reliable, lower cost!
• Pre-integrated, shipped with one part number, 3-year support
• Fast disk rebuilds

Announce: June 10, 2014; Ship Support: June 12; GA: June 13
Owner: Scott Seal, [email protected]; Sales Kit: SSI, PartnerWorld
© 2014 IBM Corporation
Non-intrusive disk diagnostics: the GPFS/GNR Disk Hospital
Background determination of problems
• While a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error correction code
• For writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete

Advanced fault determination
• Statistical reliability and SMART monitoring
• Neighbor check, drive power cycling
• Media error detection and correction
• Supports concurrent disk firmware updates
© 2014 IBM Corporation
GPFS GSS
http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/gpfs/
© 2014 IBM Corporation
20x IBM GSS‐24 @ FZ Jülich
http://www.fz-juelich.de/ias/jsc/EN/Expertise/Datamanagement/OnlineStorage/JUST/Configuration/Configuration_node.html
(4640 Disks + 120 SSDs)
© 2014 IBM Corporation
ELASTIC STORAGE SERVER (ESS)
Declustered Software RAID Building Block
© 2014 IBM Corporation
GPFS Elastic Storage Server (ESS) (*NEW* Oct 6th, 2014)
• Power8 server hardware
• Red Hat Enterprise Linux 7 for Power
• GPFS Standard Edition v4.1
• GPFS Native RAID v4.1
• IBM Support for xCAT 2
Models: 5146-GL2, 5146-GL4, 5146-GL6, 5146-GS1, 5146-GS2, 5146-GS4, 5146-GS6
(Storage enclosures: DCS3700, EXP24S)
© 2014 IBM Corporation
IDEA and the GPFS Elastic Storage Server (ESS) (*NEW* Oct 6th, 2014)
IBM Data Engine for Analytics (IDEA) is a customized infrastructure solution with integrated software that is optimized for big data and analytics workloads.
© 2014 IBM Corporation
TSM BACKUP TO DISK
TSM Backup to a Disk Storage Pool on GPFS GSS
© 2014 IBM Corporation
Whitepaper
PoC setup:
• 2 x TSM servers: IBM x3650-M4, Red Hat Enterprise Linux Server release 6.5, IBM Tivoli Storage Manager v7.1
• 1 x IBM System x GPFS Storage Server GSS26: 6 x 4U-60 drawers with 58 x 2 TB NL-SAS disks each, 348 disks in total
• 1 x Mellanox 32-port InfiniBand FDR switch; each TSM server is connected to the GSS system with a 56 Gbit/s link
"More TSM bang for the buck than EMC Isilon..."
© 2014 IBM Corporation
PoC Hardware Setup
© 2014 IBM Corporation
GSS as Backend Disk Storage for TSM
Environment: 1 x GSS26 connected via dedicated 56 Gbit/s InfiniBand links to 2 x TSM v7.1 servers
• Peak performance for a single TSM server is 5.4 GB/s sequential write with 10 or 50 parallel sessions
• Peak performance for both TSM servers is 4.5 GB/s per server, or 9 GB/s sequential write, with 10 sessions per server (or 3.8 GB/s per server with 50 sessions per server)
• Performance for a single sequential write session starts at 100 MB/s with 100 KB file size and reaches 2.5 GB/s with 1 GB file size
• Multiple sequential write session performance starts at 12 MB/s per session (50 parallel sessions = 600 MB/s) with 100 KB file size and reaches 108 MB/s per session (50 parallel sessions = 5.4 GB/s) with 1 GB file size
© 2014 IBM Corporation
GPFS‐FPO Shared Nothing Cluster and Hadoop
© 2014 IBM Corporation
GPFS‐FPO for Hadoop, BigData & HANA
PERFORMANCE & FLEXIBILITY
IMPROVED DATA SHARING FORBETTER COLLABORATION
BUSINESS CONTINUITY AND DATA INTEGRITY
MORE EFFECTIVE MANAGEMENT OF DATA OVER ITS LIFECYCLE
AVOID EXPENSIVE DATA SILOS WITH MORE VERSATILE STORAGE
Enterprise features
© 2014 IBM Corporation
What is HDFS ?The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file‐system written in Java for the Hadoop framework.
File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command‐line interface, or browsed through the HDFS‐UI webapp over HTTP.
http://en.wikipedia.org/wiki/Apache_Hadoop
Rangers know: "Lots of yellow elephants can cause extensive damage to your IT!"
© 2014 IBM Corporation
Research Paper
“In this paper, we revisit the debate on the need of a new non‐POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that it can be built on traditional POSIX‐based cluster filesystems.“
© 2014 IBM Corporation
Cluster storage configuration for Hadoop on GPFS-FPO
Example with 4 datanodes (3 internal disks per datanode):
• 2 pools: a system pool for metadata (1 disk per node) and an FPO data pool (2 disks per node for data)
• Several filesets in the data pool to manage block replication factors:
  - root fileset, replication factor 3: /gpfs-fpo
  - mrl fileset for the MapReduce local dir, replication factor 3: /gpfs-fpo/hadoop/mapred/local/node1-4
  - tmp_set fileset for the Hadoop framework, replication factor 1: /gpfs-fpo/tmp/hadoop4
• No namenode any more: metadata is distributed across the datanodes in a dedicated storage pool (the system pool), using physical disks or disk partitions
(Diagram: each datanode's internal JBOD disks /dev/sda-/dev/sdc become NSDs nsd1-nsd12; nsd1-nsd4 form the system pool for metadata, nsd5-nsd12 form the FPO data pool.)
A stanza sketch follows below.
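A rough stanza sketch for such an FPO layout; device names, node names and the exact encoding of the extended failure group (the rack,position,node triple) are assumptions to verify against the FPO documentation:

# fpo.stanza -- metadata in the system pool, data in a write-affinity FPO pool
%pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128

%nsd: device=/dev/sda nsd=nsd1 servers=dn1 usage=metadataOnly failureGroup=1,0,1 pool=system
%nsd: device=/dev/sdb nsd=nsd5 servers=dn1 usage=dataOnly     failureGroup=1,0,1 pool=datapool
%nsd: device=/dev/sdc nsd=nsd6 servers=dn1 usage=dataOnly     failureGroup=1,0,1 pool=datapool
# ... one set of stanzas per datanode (dn2 uses failureGroup=1,0,2, and so on)

1> mmcrnsd -F fpo.stanza
2> mmcrfs gpfsfpo -F fpo.stanza -T /gpfs-fpo -m 3 -M 3 -r 3 -R 3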
© 2014 IBM Corporation
GPFS-FPO for Hadoop/BigInsights
http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/gpfs/
© 2014 IBM Corporation
GPFS-FPO new capabilities for BigInsights: file system reliability
• GPFS-FPO avoids the need for a central namenode, a common failure point in HDFS
• Avoids long recovery times in the event of a namenode failure
• Pipelined replication for efficient storage of block replicas in a GPFS-FPO environment
• Boosts performance for metadata-intensive applications, where the namenode can emerge as a bottleneck
(Diagram: in HDFS, the Namenode and Secondary Namenode are central; in an IBM BigInsights cluster with GPFS-FPO, metadata is striped across the GPFS FPO nodes, providing better reliability and avoiding the need for primary and secondary name nodes.)
© 2014 IBM Corporation
GPFS-FPO new capabilities for BigInsights: flexible storage configuration
• With distributed metadata, GPFS-FPO avoids the need for a central namenode, a common failure point in HDFS environments
• Avoids long recovery times in the event that the namenode fails and metadata needs to be recovered from the secondary name node
• Pipelined replication for efficient storage of block replicas in a GPFS-FPO environment
(Diagram: an IBM BigInsights cluster with GPFS-FPO can combine shared-nothing storage (GPFS-FPO on the GPFS servers) and shared storage (classic GPFS behind a switched fabric).)
© 2014 IBM Corporation
GPFS-FPO (File Placement Optimizer): advanced storage for MapReduce

Hadoop HDFS                                       | IBM GPFS-FPO advantages
HDFS NameNode is a single point of failure        | No single point of failure, distributed metadata
Large block sizes, poor support for small files   | Variable block sizes, suited to multiple types of data and access patterns
Non-POSIX file system, obscure commands           | POSIX file system, easy to use and manage
Difficult to ingest data, special tools required  | Policy-based data ingest
Single-purpose, Hadoop MapReduce only             | Versatile, multi-purpose
Not recommended for critical data                 | Enterprise-class advanced storage features
© 2014 IBM Corporation
SUMMARY & ROADMAP
© 2014 IBM Corporation
GPFS Elastic Storage Vision
© 2014 IBM Corporation
GPFS Wiki, FAQ & Forums
• GPFS Home Page: http://www.ibm.com/systems/gpfs
• GPFS Wiki: http://www.ibm.com/developerworks/wikis/display/hpccentral/General+Parallel+File+System+(GPFS)
• GPFS FAQ: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf
• GPFS Forum and Mailing List:
  http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=479&cat=13
  http://lists.sdsc.edu/mailman/listinfo.cgi/gpfs-general
© 2014 IBM Corporation