© 2014 IBM Corporation
IBM GPFS 2014 / Elastic Storage
Title: Software Defined Storage in action with GPFS v4.1
Speaker: Frank Kraemer, IBM Systems Architect, mailto:[email protected]
© 2014 IBM Corporation
Agenda:
• File Systems Market Overview
• GPFS v4.1 News & Roadmap
• ILM with TSM/HSM & LTFS-EE
• Network Attached Storage (cNFS)
• GPFS Native RAID (GNR)
  o GPFS Storage Server (Intel x86)
  o Elastic Storage Server (Power8)
• GPFS-FPO (Hadoop/MapReduce)
• Summary & Roadmap
LEGO, the LEGO logo and the Minifigure are trademarks and/or copyrights of the LEGO Group.
© 2014 IBM Corporation
IBM GPFS vs. Competitors
Why choose GPFS?
1. Stability
2. Features
3. Scalability
4. OS platform support
5. Global namespace
6. References

Competitors (some)
• Lustre (Intel, DDN, Cray, Xyratex, ...)
• StorNext FS (Quantum)
• Gluster (Red Hat)
• Panasas (NAS)
• EMC Isilon (NAS)
• NetApp ONTAP v8.x (NAS)
• HDS HNAS/BlueArc (NAS)

Open source & research projects
• Ceph (acquired by Red Hat on April 30th, 2014)
• BeeGFS (ex-Fraunhofer FS)
• dCache
• XtreemFS

More info: http://en.wikipedia.org/wiki/List_of_file_systems
© 2014 IBM Corporation
GPFS 1998
GPFS: A Shared‐Disk File System for Large Computing Clusters
Frank Schmuck and Roger Haskin, IBM Almaden Research Center, San Jose, CA
© 2014 IBM Corporation
GPFS history and milestones
(Timeline figure: GPFS evolved from the Tiger Shark research file system; GPFS 1.x shipped on the IBM SP in 1998, followed by v1.4/2.x with AIX HACMP (SSA/ESS) and Linux "lc" cluster support through 2001-2004; v2.3/3.1/3.2 added remote mount capabilities (WAN), interoperability/disaster recovery (DR) and Information Lifecycle Management (ILM) on AIX v6/7, pLinux, Linux and Windows 2008R2; v3.3/3.5 added GPFS-SNC/Hadoop, GPFS Native RAID, AFM/Panache, Windows 7 x64 and Windows 2012 R2; v4.1 shipped in 2014, with 4.1.0.3 in October. The chart also shows the separate IBM SAN File System (SFS) v1.0/v1.1.)
© 2014 IBM Corporation
Software Defined Storage for Dummies
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=BK&infotype=PM&appname=STGE_DC_ZQ_USEN&htmlfid=DCM03004USEN&attachment=DCM03004USEN.PDF
This book examines data storage and management challenges and explains software‐defined storage, an innovative solution for high‐performance, cost‐effective storage using the IBM General Parallel File System (GPFS).
http://en.wikipedia.org/wiki/Software-defined_storage
mailto:[email protected]
© 2014 IBM Corporation
GPFS = Software Defined Storage (SDS)
(Diagram: a GPFS Storage Server cluster presents Block (OpenStack Cinder), Object (Swift), File (POSIX, NFS) and Hadoop-connector access.)
• Single software-defined storage solution across all these application types: Technical Computing, Big Data & Analytics, Cloud
• Linear capacity & performance scale-out
• Enterprise storage on standard hardware
• Single namespace
© 2014 IBM Corporation
GPFS DEVELOPMENT TEAMS
Background Information
© 2014 IBM Corporation
GPFS Almaden Research, CA
Latitude: 37°12′37.53″ N / Longitude: 121°48′25.23″ W
© 2014 IBM Corporation
GPFS Lab Poughkeepsie, N.Y.
Latitude: 41°39′8.35″ N / Longitude: 73°56′5.20″ W
© 2014 IBM Corporation
GPFS Support & Lab Mainz, Germany
© 2014 IBM Corporation
GPFS CONCEPTS
Tutorial
© 2014 IBM Corporation
GPFS Architecture (Basis)
Storage Area Network (SAN), shared SAS, twin-tailed disks, etc.
LUN = Logical Unit Number / NSD = Network Shared Disk
(Diagram 1: a SAN LUN maps to a GPFS NSD in a "1:1" relation.)
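For illustration, a minimal sketch of how SAN LUNs are turned into NSDs with a stanza file and then into a file system; device paths, NSD names and server names below are made-up examples:

# nsd.stanza -- one %nsd stanza per LUN (names/devices are examples)
%nsd:
  device=/dev/mapper/lun01
  nsd=nsd01
  servers=nsdsrv01,nsdsrv02
  usage=dataAndMetadata
  failureGroup=1
  pool=system

# create the NSDs, then a file system on top of them
1> mmcrnsd -F nsd.stanza
2> mmcrfs gpfs01 -F nsd.stanza -B 1M -T /gpfs01
3> mmmount gpfs01 -a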
© 2014 IBM Corporation
GPFS Architecture (Common)
(Diagram 2: LUNs on the SAN are attached to GPFS NSD servers; GPFS NSD clients access them over the LAN.)
© 2014 IBM Corporation
GPFS Architecture (Typical)
(Diagram 3: GPFS NSD clients reach NSD servers over LAN / WAN / InfiniBand (any mix); NSD servers see disk LUNs over the FC SAN, plus twin-tailed and internal disks; GSS and Hadoop/FPO nodes can be part of the same cluster.)
(GSS = GPFS Storage Server // FPO = File Placement Optimizer)
© 2014 IBM Corporation
(Diagram 4: Remote Cluster Mount (synchronous): NSD clients in a remote cluster mount the local cluster's file system via its NSD servers over LAN / InfiniBand. NSD = Network Shared Disk.)
© 2014 IBM Corporation
(Diagram 5: GPFS Active File Management (asynchronous): a remote cluster caches data (read/write) from the local cluster over WAN / InfiniBand. NSD = Network Shared Disk.)
© 2014 IBM Corporation
GPFS System Structure
(Diagram: an application issues file system calls into the OS kernel; the OS vnode/vfs layer calls the GPFS kernel extension, which works with the multi-threaded GPFS daemon (mmfsd). The daemon implements the configuration manager, file system manager and metanode services, and accesses Network Shared Disks (NSD); GPFS administration commands (mm...) talk to the daemon. The GPFS portability layer is required for Linux only.)
© 2014 IBM Corporation
GPFS Metadata Services (multi-threaded GPFS daemon, mmfsd)
• Configuration manager: 1 per cluster, elected by the quorum nodes; selects the file system managers and drives recovery after node failure
• File system manager(s): 1 per file system; handles file system configuration, disk space allocation, token management, quota management and security services
• Metanode(s): 1 per open file; serializes file metadata updates
© 2014 IBM Corporation
Consistency control: Locks and Tokens
(Diagram: client systems hold local locks on their cached data buffers and file structures for local consistency; distributed token servers grant tokens that give cached capability with global consistency. Tokens are requested/released by clients and revoked by the servers; they come in multiple modes, are distributed via hashing across the token servers, and are a recoverable service.)
© 2014 IBM Corporation
GPFS Replicated Data and Metadata
• No designated "mirror", no fixed placement function:
  - flexible replication (e.g., replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block-by-block
  - mm<cr|ch>fs interfaces for administration
• Inode, indirect blocks, and/or data blocks may be replicated
• Each disk address is a list of pointers to replicas; each pointer is a disk id + sector number
© 2014 IBM Corporation
GPFS Failure Group (FG) concept
Failure group: a collection of disks that could become unavailable simultaneously, e.g.,
• disks attached to the same storage controller
• disks served by the same NSD server
Used for two purposes:
• Replication: replicas of the same block must be on disks in two different failure groups
• Striping: stripe across failure groups first, then across the disks within a failure group: D1, D3, D5, D7, D2, D4, D6, D8
Reason: a common point of failure is also a common resource that requires load balancing.
GPFS-FPO uses an "extended failure group" that conveys additional location information, e.g., r,n = rack, node within rack. With replication 3, the second copy is placed in a different rack and the third copy in the same rack but on a different node.
(Figure: disks D1-D8 in failure groups FG1-FG4, laid out as nodes 1,1 / 1,2 in rack 1 and 2,1 / 2,2 in rack 2.) A configuration sketch follows below.
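A minimal sketch of how failure groups drive replica placement; controller names, NSD names and device paths are illustrative:

# nsd.stanza -- disks behind controller A go to FG 1, controller B to FG 2
%nsd: device=/dev/mapper/ctlA_lun1 nsd=fg1nsd1 servers=srv01,srv02 usage=dataAndMetadata failureGroup=1
%nsd: device=/dev/mapper/ctlB_lun1 nsd=fg2nsd1 servers=srv02,srv01 usage=dataAndMetadata failureGroup=2

# two data and two metadata replicas: GPFS places each replica in a different failure group
1> mmcrnsd -F nsd.stanza
2> mmcrfs gpfs01 -F nsd.stanza -m 2 -M 2 -r 2 -R 2 -T /gpfs01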
© 2014 IBM Corporation
GPFS v3.5 has fullIPv6 Support
• IPv6 (Internet Protocol version 6) is the version of the Internet Protocol (IP) intended to succeed IPv4.
• IPv6 was developed by the Internet Engineering Task Force (IETF) to deal with long-anticipated IPv4 address exhaustion, and is described in Internet standard document RFC 2460, published in December 1998.
• While IPv4 allows 32 bits for an IP address, and therefore has 2^32 (4,294,967,296) possible addresses, IPv6 uses 128-bit addresses, for an address space of 2^128 addresses.
• IPv6 also implements additional features not present in IPv4; network security is integrated into the design of the IPv6 architecture, including IPsec.
© 2014 IBM Corporation
GPFS VERSION 4.1
What's new with GPFS
© 2014 IBM Corporation
GPFS v4.1 (announced April 22, 2014)
• IBM GPFS Concepts, Planning, and Installation Guide (GA76-0441)
• IBM GPFS Administration and Programming Reference (SA23-1452)
• IBM GPFS Advanced Administration and Programming Reference (SC23-7032)
• IBM GPFS Problem Determination Guide (GA76-0443)
• IBM GPFS Data Management API Guide (GA76-0442)
http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&appname=gpateam&supplier=897&letternum=ENUS214-079&pdf=yes
© 2014 IBM Corporation
GPFS v4.1 product structure
Socket-based licensing, server and client for each edition
• Simpler, no more PVUs

Express Edition
• gpfs.base (no ILM, AFM, cNFS)
• gpfs.docs
• gpfs.gpl
• gpfs.msg
• gpfs.gskit

Standard Edition
• Adds gpfs.ext

Advanced Edition
• Adds gpfs.crypto

New platforms
• zLinux
• Ubuntu
Features by edition (Express / Standard / Advanced):
• Basic GPFS functionality: all editions (same as v3.5)
• ILM (storage pools, policies, mmbackup): Standard and Advanced
• Active File Management (AFM): Standard and Advanced
• Clustered NFS (cNFS): Standard and Advanced
• Encryption: Advanced only (*NEW*; the Advanced Edition itself is new in v4.1)
© 2014 IBM Corporation
Encryption and NIST Compliance
Native encryption support for GPFS v4.1 file systems

Addresses critical requirements:
• Encryption of data at rest
• Secure erase is mandatory today

• User and directory blocks are fully encrypted
• Per-inode file encryption key (FEK), wrapped with one or more master encryption keys (MEK)
• MEK management is external to GPFS (TKLM)
• GPFS v4.1 is NIST SP 800-131A compliant
© 2014 IBM Corporation
Encryption and NIST Compliance
• Native: encryption is built into the “Advanced” GPFS product
• Protects data from security breaches, unauthorized access, and being lost, stolen or improperly discarded
• Cryptographic erase for fast, simple and secure file deletion
• Complies with NIST SP 800-131A and is FIPS 140-2 certified
• Supports HIPAA, Sarbanes-Oxley, EU and national data privacy law compliance
© 2014 IBM Corporation
Native Encryption and Secure Erase
Encryption of data at rest• Files are encrypted before they are stored on disk
• Keys are never written to disk
• No data leakage in case disks are stolen or improperly decommissioned
Secure deletion • Ability to destroy arbitrarily large subsets of a file system
• No “digital shredding”, no overwriting: secure deletion is a cryptographic operation
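As a rough, illustrative sketch only: file encryption in GPFS 4.1 is driven by policy rules that tie files to keys served by an external key manager. The rule names, key id and RKM id below are invented, and the exact rule grammar and key-server (ISKLM/TKLM) setup are documented in the Advanced Administration guide:

# enc.pol -- assumed example: encrypt everything in the fileset 'secure'
RULE 'encSpec' ENCRYPTION 'E1' IS
    ALGO 'DEFAULTNISTSP800131A'
    KEYS('KEY-fs1-1:RKM_1')          /* key id : remote key manager id (made up) */
RULE 'encAll' SET ENCRYPTION 'E1' WHERE FILESET_NAME LIKE 'secure'

1> mmchpolicy gpfs01 enc.pol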
© 2014 IBM Corporation
Reliability, Availability and Serviceability (RAS) #1
Automated deadlock detection, notification, and debug data collection
• Automated deadlock detection
• Automated deadlock data collection
• Automated deadlock breakup

Dump improvements
• Daemon survival under heavy loads
• Ability to dump more data

Message logging
• Send message logs to the system event logging facility

Directory enhancements to allow shrinking
• Merging of mostly empty blocks
• Allows larger directory block sizes
© 2014 IBM Corporation
Reliability, Availability and Serviceability (RAS) #2
User-defined node classes
• mmcrnodeclass, mmchnodeclass, mmdelnodeclass, mmlsnodeclass (see the sketch below)

Quota file improvements
• Quota management can be enabled without unmounting the file system

fsck() speed improvements

Support for GPT NSDs
• Adds a standard disk partition table (GPT type) to NSDs
• Disk label support for Linux

The new GPFS NSD v2 format provides the following benefits:
• Includes a partition table so that the disk is recognized as a GPFS device
• Adjusts data alignment to support disks with a 4 KB physical block size
• Adds backup copies of some key GPFS data structures
• Expands some reserved areas to allow for future growth
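A small sketch of the two new administrative features named above; class, node and fileset names are placeholders, and the mmsetquota argument form should be checked against the 4.1 man page:

# group nodes into a user-defined node class and reuse it in other commands
1> mmcrnodeclass tsmNodes -N node01,node02,node03
2> mmlsnodeclass tsmNodes
3> mmchconfig pagepool=8G -N tsmNodes

# per-fileset quota with the new mmsetquota command (soft:hard limits)
4> mmsetquota gpfs01:projectA --block 10T:12T --files 10M:12M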
© 2014 IBM Corporation
Performance & Health Monitoring
Network performance monitoring
• The GPFS daemon caches statistics relating to RPCs; a set of up to seven statistics is cached per node: channel wait time, send time TCP, send time verbs, receive time TCP, latency TCP, latency verbs, latency mixed
• GPFS RPC latency measurement: mmdiag --rpc

Ongoing enhancements in GPFS 4.1 TLs
• Disk performance monitoring
• Memory utilization monitoring
© 2014 IBM Corporation
Performance Improvements
Fine Grained Directory Locking (FGDL)

Local Read Only Cache (LROC) (configuration sketch below)
• Overflows the file data cache to local SSD storage
• Defined as an NSD with "localCache" as its usage
• Can be configured for data or metadata (inodes/directories)
• Uses SSD as an extension of the GPFS buffer pool, saving more memory for applications
• Automatic management of the local storage

Write Data Logging (WDL)
• Takes advantage of NVRAM in GPFS client nodes to reduce the latency of small and synchronous writes
• Write performance scales with the addition of GPFS client nodes
(Diagram: GPFS clients with local LROC SSDs in front of a GPFS Storage Server cluster.)
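A minimal sketch of defining a client-local SSD as an LROC device; the device path, NSD name and node name are examples, and the lrocData/lrocInodes/lrocDirectories tuning parameters are listed as assumptions to verify against the 4.1 documentation:

# lroc.stanza -- an SSD in a client node, used only as local read cache
%nsd:
  device=/dev/ssd0
  nsd=client01_lroc
  servers=client01
  usage=localCache

1> mmcrnsd -F lroc.stanza
# optionally steer what gets cached (assumed parameter names)
2> mmchconfig lrocData=yes,lrocInodes=yes,lrocDirectories=yes -N client01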
© 2014 IBM Corporation
Backup/Restore Improvements
New tool to restore from a fileset snapshot into the active file system
• Only copies the blocks and attributes that have changed since the snapshot being restored

TSM configuration verification by mmbackup
• The TSM B/A client must be installed, at the same version, on all nodes that will execute the mmbackup command
• The TSM B/A configuration is verified before the backup is executed

Automatic TSM tuning adjustments:
"The mmbackup command can be tuned to control the numbers of threads used on each node to scan the file system, perform inactive object expiration, and modified object backup. In addition, the sizes of lists of objects expired or backed up can be controlled or autonomically tuned to select these list sizes if they are not specified. List sizes are now independent for backup and expire tasks."
© 2014 IBM Corporation
GPFS v4.1 on Windows via Cygwin
http://cygwin.com/
Cygwin is:
• A large collection of GNU and open source tools which provide functionality similar to a Linux distribution on Windows
• A DLL (cygwin1.dll) which provides substantial POSIX API functionality

GPFS:
• GPFS uses Cygwin for its shell execution environment only
• All GPFS programs (executables/binaries) are native Windows binaries and have no linkage with Cygwin DLLs
• Cygwin is needed because SUA has been completely removed by Microsoft in Windows Server 2012 R2 (see http://technet.microsoft.com/en-us/library/dn303411.aspx)
© 2014 IBM Corporation
New and changed commands
Changed with GPFS v4.1:
mmaddcallback, mmafmctl, mmafmlocal, mmbackup, mmchcluster, mmchconfig, mmchfileset, mmchfs, mmcrcluster, mmcrfileset, mmcrfs, mmdiag, mmlsfs, mmlsmount, mmmigratefs, mmmount, mmrestorefs, mmsnapdir, mmumount

New with GPFS v4.1:
mmafmconfig, mmchnodeclass, mmcrnodeclass, mmdelnodeclass, mmlsnodeclass, mmsetquota
© 2014 IBM Corporation
GPFS MULTICLUSTER
Cloud File Systems via WAN (IP)
© 2014 IBM Corporation
GPFS as the Cloud 'backbone'
Why?
• Tie together multiple sets of data into a single namespace
• Allow multiple application groups to share portions or all of the data
• Secure, available and high-performance data sharing
• Support of public and private clouds
(Diagram: two SAN-attached GPFS clusters, each with its own LAN, connected via the GPFS NSD protocol on TCP/IP to create an enterprise-wide global namespace.)
© 2014 IBM Corporation
(Diagram: GPFS Multicluster in "cloud mode": Cluster A 'Europe' exports /gpfs1_clusterA, Cluster B 'US' exports /gpfs2_clusterB, and Cluster C 'Far East Asia' connects to both over the LAN / WAN via TCP/IP.)
© 2014 IBM Corporation
GPFS Multicluster - Firewall
• Bi-directional daemon communication
• Data to the file system always uses port 1191 (default) in both directions
• Optional: mmchconfig tscTcpPort=PortNumber
A setup sketch follows below.
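A condensed sketch of the usual cross-cluster mount setup, run partly on the owning and partly on the accessing cluster; cluster names, key file paths and mount points are illustrative:

# on the owning cluster: enable authentication and grant access
1> mmauth genkey new
2> mmauth update . -l AUTHONLY
3> mmauth add accessCluster.example.com -k /tmp/accessCluster_id_rsa.pub
4> mmauth grant accessCluster.example.com -f gpfs1

# on the accessing cluster: define the remote cluster and remote file system
5> mmremotecluster add owningCluster.example.com -n node1,node2 -k /tmp/owningCluster_id_rsa.pub
6> mmremotefs add rgpfs1 -f gpfs1 -C owningCluster.example.com -T /gpfs1_clusterA
7> mmmount rgpfs1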
© 2014 IBM Corporation
WIDE AREA DATA SERVICES
GPFS WAN Cache with AFM / Panache
© 2014 IBM Corporation
GPFS WAN Cache Support (AFM)
http://www.almaden.ibm.com/storagesystems/projects/panache/
http://static.usenix.org/event/fast10/tech/full_papers/eshel.pdf
© 2014 IBM Corporation
Motivation for GPFS AFM
• Data sharing across geographically distributed sites is common: while bandwidth is decent, latency is high and the network is unreliable, subject to outages
• The infrastructure needs to be scalable to move data across the WAN and to mask the latency and fluctuating performance of the network
• Applications want local performance for remote data: move data closer to the compute servers
• Traditional protocols for remote file serving are chatty and unsuitable; large files (VM images, virtual disks) are becoming predominant, and existing caching systems are primitive
© 2014 IBM Corporation
Global Namespace + AFM Cache
(Diagram: three clusters, each with its own file system store1 / store2 / store3. Each file system holds two local filesets and caches the others: store1 has local /data1,/data2 and caches /data3-/data6; store2 has local /data3,/data4; store3 has local /data5,/data6. Clients at every site see the same namespace /global/data1 ... /global/data6.)
• See all data from any cluster
• Cache as much data as required, or fetch data on demand
(Home = owning site, Cache = caching site)
© 2014 IBM Corporation
AFM Cache Modes (a creation sketch follows below)
Read Only (RO)
• The cache can only read data; no data changes are allowed.
Local Update (LU)
• Data is cached from home and changes are allowed as in SW mode, but changes are not pushed to home.
• Once data is changed, the relationship is broken, i.e., cache and home are no longer in sync for that file.
Single Writer (SW)
• Only the cache can write data; home must not change.
• Other peer caches have to be set up as read-only caches.
Independent Writer (IW)
• One or more filesets can be linked to the same home; other peer caches can point to the same home and can be set up as "iw" as well.
Change of modes
• SW and RO mode caches can be changed to any other mode.
• An LU cache can't be changed (too many complications/conflicts to deal with).
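A rough sketch of creating a single-writer cache fileset; the home export path, fileset names and the exact -p attribute spelling are assumptions to check against the mmcrfileset man page and the AFM chapter:

# on the cache cluster: create an AFM fileset backed by the home export (names are examples)
1> mmcrfileset store2 data1_cache --inode-space new -p afmMode=sw,afmTarget=nfs://homenode/gpfs/store1/data1
2> mmlinkfileset store2 data1_cache -J /gpfs/store2/data1
3> mmafmctl store2 getstate -j data1_cache     # show cache state / queue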
© 2014 IBM Corporation
AFM Technical Details
(Diagram: the home and cache sites each hold an independent directory tree (appl, data, web). The cache is itself a clustered file system with its own inode space; a cached file (e.g., inode 100) records the remote state of its home counterpart (e.g., inode 1024 with its mtime/ctime attributes), which AFM revalidates with LOOKUP/GETATTR.)
• Independent file systems, separate inode space
• SW: the home file system must not be changed; IW: home can be changed
• Cache is a clustered FS

Example (mmlsattr on a cached file, showing the pcache extended attributes):
[root@c25m4n03 fs10]# mmlsattr -d -X -L file435234
file name: file435234
metadata replication: 1 max 2
data replication: 1 max 2
immutable: no
appendOnly: no
storage pool name: system
fileset name: AFM_fs10
creation time: Fri Mar 22 10:35:06 2013
Windows attributes: ARCHIVE
gpfs.pcache.inode: 0x0000000000500003597E255F00000001
gpfs.pcache.attr:  0x0000000000036126000000000[...]
© 2014 IBM Corporation
GPFS ILM
Information Lifecycle Management (ILM)
© 2014 IBM Corporation
GPFS Storage Pools & Policies
Motivation:
• Not all storage is the same: some is faster, cheaper, more reliable, ...
• Not all data are the same: some are more valuable, important, popular, ...

Storage pool: a named collection of disks with similar attributes, intended to hold similar data
• System pool: one per file system; holds all metadata
• Data pools: zero or more; hold only data
• External pool: off-line storage (e.g., tape) for rarely accessed data

Policy: a set of user-specified rules that match data to the appropriate pool
• SQL-like syntax for selecting files based on file attributes, such as:
  - name or name pattern (e.g., *.jpg)
  - owner, file size, time stamps
  - extended attributes
(Example tiers: SSD = "gold", 10k rpm SAS = "silver", 7200 rpm SATA = "bronze"; a policy sketch follows below.)
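A small, illustrative policy file for the gold/silver/bronze tiers above (pool and rule names are examples); the SQL-like rule syntax is the GPFS policy language described in the Advanced Administration guide:

# tiering.pol
/* placement: new JPEGs start on the SSD tier, everything else on SAS */
RULE 'jpgs'    SET POOL 'gold'   WHERE LOWER(NAME) LIKE '%.jpg'
RULE 'default' SET POOL 'silver'

/* migration: when gold is 80% full, drain it to 70% by moving cold files down */
RULE 'spill' MIGRATE FROM POOL 'gold' THRESHOLD(80,70)
             WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
             TO POOL 'silver'
RULE 'age'   MIGRATE FROM POOL 'silver' TO POOL 'bronze'
             WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS

1> mmchpolicy gpfs01 tiering.pol        # install the placement rules
2> mmapplypolicy gpfs01 -P tiering.pol  # run the migration rules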
© 2014 IBM Corporation
Actionable intelligence for file storage tiering via GPFS
GPFS knows: file name, file type, I/O size, type of storage technology, latency of storage, locality of storage, time of last access, block size, time of last change, clone attributes, time of file creation, file tree location, file heat, file size, filesets, generation of a file's reuse, group owning the file, user owning the file, space efficiency of the file, access permissions, time of last metadata change
+ another 27 miscellaneous attributes & custom extended attributes (XATTR)

Block storage knows: I/O size, type of storage technology, compressible data set, latency of storage, locality of storage
© 2014 IBM Corporation
GPFS Pools
• When creating a file system or adding disks, specify the name of the pool that each disk belongs to. A pool is the collection of all disks with the same pool name.
• Pools can have attributes specified via "stanzas", e.g., allocation map layout and block size (see the stanza sketch below).
• There is a separate allocation map for each pool, so it is efficient to find space in a particular pool.
• Block size:
  - All data pools must have the same block size (this allows migrating files one data block at a time).
  - The system pool may have a different block size, but only if it is used for metadata only.
• The pool assignment is recorded in the inode of each file, so a file can only "belong" to a single pool; writes fail if the pool is full (ENOSPC).
(Example pools: system, data1, data2)
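A minimal sketch of defining the pools above in an NSD stanza file; device, server and pool names are examples:

# pools.stanza -- %pool stanzas define pool attributes, %nsd stanzas assign disks to pools
%pool: pool=data1 blockSize=1M layoutMap=scatter
%pool: pool=data2 blockSize=1M layoutMap=scatter

%nsd: device=/dev/mapper/md01 nsd=meta01 servers=srv01,srv02 usage=metadataOnly pool=system
%nsd: device=/dev/mapper/d101 nsd=d101   servers=srv01,srv02 usage=dataOnly     pool=data1
%nsd: device=/dev/mapper/d201 nsd=d201   servers=srv02,srv01 usage=dataOnly     pool=data2

1> mmcrnsd -F pools.stanza
2> mmcrfs gpfs01 -F pools.stanza -T /gpfs01
3> mmdf gpfs01        # shows free space per pool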
© 2014 IBM Corporation
GPFS Policies
Placement policy:
• Evaluated at file creation time
• Determines initial file placement and replication

Migration policy:
• Evaluated periodically or on demand
• Can move data between pools, change replication, delete data, or run arbitrary user commands

Policy engine (mmapplypolicy):
• Fast, parallel directory traversal combined with an inode scan
• Runs outside the daemon, but makes use of GPFS infrastructure and APIs (extended readdir, inode scan)
• Can be used as a powerful framework for building parallel file system utilities, e.g., fast find/grep or remote replication (see the sketch below)

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r5.gpfs100.doc%2Fbl1adm_mmapplypolicy.htm
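A sketch of using the policy engine as a parallel "find": an external LIST rule collects matching files into list files instead of moving data. Rule names, thresholds and node names are examples:

# bigfiles.pol -- list every file larger than 1 GiB that has not been read for a year
RULE EXTERNAL LIST 'big' EXEC ''
RULE 'findBig' LIST 'big'
     WHERE FILE_SIZE > 1073741824
       AND (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '365' DAYS

# -I defer only produces the list files (prefix /tmp/big); -N spreads the scan over nodes
1> mmapplypolicy gpfs01 -P bigfiles.pol -I defer -f /tmp/big -N nsdsrv01,nsdsrv02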
© 2014 IBM Corporation
GPFS Filesets & Fileset Snapshots
(Diagram: file system name space with root and filesets fset1, fset2, fset3.)
Fileset: a partition of the file system name space (a sub-directory tree)
• Allows administrative operations at finer granularity than the entire file system, e.g., disk space limits, user/group quota, snapshots, caching, ...
• Can be used to refer to a collection of files in policy rules

Independent fileset: a fileset with a reserved set of inode block ranges (an "inode space")
• Allows per-fileset inode scans
• Enables fileset snapshots (inode copy-on-write operates on inode blocks)
• Separate inode limit and inode file expansion for each inode space; the active inode file may become sparse
A command sketch follows below.
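A brief sketch of creating an independent fileset and snapshotting only that fileset; fileset and snapshot names are examples, and the exact fileset-snapshot argument form should be checked against the mmcrsnapshot man page:

# create an independent fileset, link it into the namespace, give it a quota
1> mmcrfileset gpfs01 projectA --inode-space new
2> mmlinkfileset gpfs01 projectA -J /gpfs01/projectA
3> mmsetquota gpfs01:projectA --block 5T:6T

# snapshot only this fileset (per-fileset copy-on-write)
4> mmcrsnapshot gpfs01 snap1 -j projectA
5> mmlssnapshot gpfs01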
© 2014 IBM Corporation
GPFS Fileset Snapshots
(Diagram: the active file system with root, fset1 and fset2; per-fileset snapshots of fset1 and fset2 coexist with a global snapshot; snapshots are maintained via copy-on-write with "ditto" resolution back to the active blocks.)
© 2014 IBM Corporation
GPFS BACKUP & RESTORE
GPFS Backup, Restore and HSM via TSM
© 2014 IBM Corporation
Backup/Restore via Tivoli Storage Manager
• GPFS can use multiple TSM servers in parallel (scale)
• The TSM B/A client for GPFS runs on each node; backup & restore are done in parallel
• LAN-free mode is possible
• The GPFS policy engine is used; no file tree walk is needed
(Diagram: the GPFS file system /gpfs01 is (1) backed up in parallel to TSM servers #1..#N and their disk storage pools, which (2) migrate data and (3) back up the storage pools to copy pools #1/#2; (4) restore also runs in parallel.)
© 2014 IBM Corporation
What is 'mmbackup'?
mmbackup backs up a GPFS file system to one or more TSM servers, using the policy engine's parallel scan to find changed files instead of a file tree walk.
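A small usage sketch; the node class and TSM server names are examples, and option spellings should be checked against the 4.1 mmbackup man page:

# incremental backup of /gpfs01, scanning in parallel on the nodes in class 'tsmNodes'
1> mmbackup /gpfs01 -t incremental -N tsmNodes --tsm-servers tsmsrv1,tsmsrv2

# full backup, e.g., after changing the include/exclude configuration
2> mmbackup /gpfs01 -t full -N tsmNodes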
© 2014 IBM Corporation
GPFS HSM via DMAPI
(Diagram: files are migrated out of the file system through the DMAPI; each migrated file is replaced by a stub that carries the HSM object ID / DMAPI handle.)
© 2014 IBM Corporation
GPFS Hierarchical Storage Management
(Diagram: the HSM client backs up, migrates and recalls files against the TSM server. A normal file can be premigrated (data copied, file still resident, object ID / DMAPI handle recorded) or migrated (only the stub with the DMAPI handle remains on disk). After a restore, a new DMAPI handle is created; with migstate=yes the link to the migrated copy is re-established on the restore path. Backup keeps a number of file versions on the TSM server.)
© 2014 IBM Corporation
GPFS Hierarchical Storage Management
© 2014 IBM Corporation
GPFS SOBAR Backup Procedure
1. Preparation: pre-migrate / migrate all files
2. Information collection: create the file system configuration backup file; create a file system snapshot & file system image
3. TSM backup: back up the file system configuration & file system image to TSM

Scale Out Backup and Restore (SOBAR) is a specialized mechanism for data protection against disaster, only for GPFS file systems that are managed by Tivoli Storage Manager (TSM) Hierarchical Storage Management (HSM).
© 2014 IBM Corporation
GPFS SOBAR Restore Procedure
1. TSM restore: restore the file system configuration & file system image
2. Target FS preparation: extract & apply the file system configuration; create the NSDs and the file system
3. Extract the file system image: mount the file system; recreate the file system image
4. Start production: start the HSM daemons & remount the file system; add HSM management and start recall
© 2014 IBM Corporation
GPFS ILM VIA LTFS-EE
Linear Tape File System Enterprise Edition
© 2014 IBM Corporation
GPFS ILM with LTFS-EE
• LTFS Enterprise Edition integrates LTFS with GPFS
  - LTFS represents an external tape pool to GPFS (see the policy sketch below)
  - Files can be migrated using GPFS policies or LTFS EE commands
  - Similar implementation as with TSM HSM
• LTFS EE can be configured on multiple nodes
  - Multiple instances of LTFS EE share the same tape library
(Diagram: GPFS nodes run the GPFS user file system and LTFS LE+; data flows over Fibre Channel / the LTFS EE SAN to the IBM TS3500 tape library.)
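As an illustrative sketch only: in the GPFS policy language, tape behind LTFS EE (like TSM HSM) appears as an EXTERNAL POOL whose EXEC script performs the data movement. The script path and pool/option names below are assumptions; the LTFS EE documentation gives the exact interface:

# ltfs.pol -- migrate cold files from the disk pool to the LTFS EE tape pool
RULE EXTERNAL POOL 'ltfs'
     EXEC '/opt/ibm/ltfsee/bin/ltfsee'        /* assumed migration script path */
     OPTS '-p copy_pool'                      /* assumed tape-pool option      */
RULE 'toTape' MIGRATE FROM POOL 'silver' THRESHOLD(85,70)
     WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
     TO POOL 'ltfs'
     WHERE FILE_SIZE > 5242880                /* leave small files on disk */

1> mmapplypolicy gpfs01 -P ltfs.pol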
© 2014 IBM Corporation
GPFS and LTFS EE integration
• Users access the file system on all GPFS nodes
• The user file system is the staging area for subsequent migration
• HSM integrates with the GPFS user file system and MMM to manage migration and recall
• MMM manages the workload over all LTFS EE instances
• LTFS LE+ manages tape access via the local tape drives
• A metadata file system stores the shared LTFS tape index
© 2014 IBM Corporation
GPFS HSM and LTFS EE
• The HSM client integrates with DMAPI to intercept file access
• The HSM client calls the migration driver / MMM to perform migration
  - Migration can be triggered manually or via policies
  - Migration moves the file to LTFS and leaves a stub; the stub includes a reference to the directory on the LTFS tape
  - MMM performs load balancing
• When a stub is accessed, the HSM client calls MMM
  - MMM identifies free resources and performs a recall from LTFS tape
  - After the entire file is back on disk, user access is granted
(Diagram: file access on the user file system goes through DMAPI to the HSM client; migration and recall run through MMM and the migration driver to LTFS LE+ and the tape library, on this and other LTFS nodes.)
© 2014 IBM Corporation
LTFS-EE Tape Import Feature
• Import adds the specified tape to the LTFS Enterprise Edition system
  - Adds stub files in the GPFS file system; imported files are in the migrated state
  - No file data movement; the actual file data remains on tape
  - File data can still be accessed (recalled) via the stub
• LTFS SDE and LE tapes can also be imported; they are first converted to LTFS EE tapes
• LTFS EE provides a command for tape import: ltfsee import
  - Import can be done into a specific directory in the GPFS file system
  - Options rename, overwrite or ignore can be used to manage conflicts with existing files
© 2014 IBM Corporation
LTFS-EE Tape Export Feature
• Export removes tapes from the LTFS EE system for vaulting or data exchange
  - Removes tapes from the pool; exported tapes are no longer a target for migrations or recalls
  - Files (stubs) migrated to an exported tape can be deleted or kept in GPFS
  - An export message can be added to the file stubs (64 bytes)
• Export with the offline option keeps the file stubs in GPFS: files remain visible in the GPFS namespace but are no longer accessible
• LTFS EE provides a command for tape export: ltfsee export
  - To move a tape to the I/O station use ltfsee tape move ieslot
© 2014 IBM Corporation
LTFS-EE References
LTFS EE InfoCenter: http://pic.dhe.ibm.com/infocenter/ltfsee/cust/index.jsp
GPFS InfoCenter: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp
LTFS EE Redbooks: http://www.redbooks.ibm.com/redpieces/abstracts/sg248143.html
LTFS EE Installation Demo: http://www.youtube.com/watch?v=bF5tHAjp5xA&feature=youtu.be
© 2014 IBM Corporation
NETWORK ATTACHED STORAGE (NAS)
CIFS/SMB & NFS via GPFS
© 2014 IBM Corporation
GPFS Clustered NFS (cNFS)
• Feature of GPFS on Linux (Linux NFS server nodes only!)
• Share files with non-GPFS clients using the NFS protocol; the NFS daemon (nfsd) of the Linux OS is used as normal
• All nodes can share the same data
• If an NFS server node fails, client connections are moved to another server
• NFS server node(s) need a GPFS server license; NFS clients need no GPFS license
(Diagram: NFS clients on AIX, Linux, OSX and Windows reach the cNFS server(s) over the LAN; the servers are GPFS NSD nodes of the GPFS file system.)
© 2014 IBM Corporation
GPFS Clustered NFS (cNFS) #2
# Enable cNFS on the GPFS cluster
1> mmchconfig cnfsSharedRoot=<Dir_Path_Name>

# Add each node with the correct IP interface
2> mmchnode -N <node_name> --cnfs-enable --cnfs-interface=<nfs_ip>

# Check cluster status
3> mmlscluster --cnfs

# Done!
http://www.redbooks.ibm.com/redpapers/pdfs/redp4400.pdf
© 2014 IBM Corporation
User Space NFS v4: "NFS-Ganesha"
NFS-GANESHA is an NFS server running in user space. It is available under the LGPL license.
It has been designed to meet two goals:
1. providing very large metadata and data caches (up to millions of records)
2. providing NFS exports to various file systems and namespaces (a set of data organized as a tree, with a structure similar to a file system)

NFS-GANESHA uses dedicated backend modules called FSALs (File System Abstraction Layer) that provide the product with a unique internal API to access the underlying namespace. The FSAL module is basically the "glue" between the namespace and the rest of NFS-GANESHA.

https://github.com/nfs-ganesha/nfs-ganesha/wiki
https://github.com/nfs-ganesha/nfs-ganesha/
© 2014 IBM Corporation
GANESHA NFS
File System Abstraction Layer = FSAL_GPFS
GANESHA, a multi-usage NFSv4 server with a large cache (part of SONAS / V7000U v1.5)
© 2014 IBM Corporation
SAMBA: does it work with GPFS?
Many customers use SAMBA & CTDB (the Clustered Trivial Database, which makes SAMBA clustered) to share GPFS data with SMB/CIFS clients.
(Diagram: SMB/CIFS clients connect over SMB/CIFS to the GPFS cluster nodes, where CTDB runs.)
© 2014 IBM Corporation
Samba/CTDB/GPFS Update
Find more technical details:
http://sambaxp.org/past-conferences/sambaxp-2013/archive.html
http://sambaxp.org/program/schedule.html

Reminder: SMB/CIFS support via SAMBA or other software is not provided or supported by IBM with GPFS. You're on your own!
© 2014 IBM Corporation
GPFS NATIVE RAID (GNR)
GPFS Perseus: Declustered RAID
© 2014 IBM Corporation
GPFS Native RAID (GNR)
Software RAID on the I/O servers:
• SAS-attached JBODs; special JBOD storage drawers for very dense drive packing
• Solid-state drives (SSDs) for metadata storage

Features
• Auto rebalancing
• Only ~2% rebuild performance hit
• Reed-Solomon erasure code, "8 data + 3 parity"
• ~10^5-year MTTDL (mean time to data loss) for a 100-PB file system
• End-to-end, disk-to-GPFS-client data checksums
(Diagram: NSD servers on the LAN expose vdisks built from SAS-attached JBODs.)
© 2014 IBM Corporation
GNR is a software implementation of storage RAID technologies
(Diagram: "classic" GPFS with external RAID controllers vs. GPFS Native RAID on the NSD servers.)
© 2014 IBM Corporation
GNR Fault Tolerance
2- or 3-fault-tolerant RAID
• 8 data strips + 2 or 3 parity strips
• 3- or 4-way replication
When one disk is down (the most common case):
• Rebuild slowly, with minimal impact on the client workload
When three disks are down (a rare case):
• Only the fraction of stripes that have three failures (~1%) is critical
• Quickly get back to the non-critical (2-failure) state, vs. rebuilding all stripes for conventional RAID
© 2014 IBM Corporation
GPFS GNR v2.5 (*NEW* Oct 6th, 2014)
Supported server hardware:
GPFS Storage Server V2.5, consisting of two IBM Power System S822L servers (machine type 5146), with either:
• 128 GB memory (models 21S and 22S)
• 256 GB memory (models 24S and 26S)
GPFS Native RAID for GPFS Storage Server is also supported with the Lenovo Intelligent Cluster and current GSS Lenovo x86 solutions.
© 2014 IBM Corporation
GPFS STORAGE SERVER (GSS)
Declustered Software RAID Building Block
© 2014 IBM Corporation
GPFS Storage Server (GSS)
Benefits of GSS:
• 3 years maintenance and support
• Improved storage affordability
• Delivers data integrity, end-to-end
• Faster rebuild and recovery times
• Reduces rebuild overhead by 3.5x

Features:
• Declustered RAID (8+2p, 8+3p)
• 2- and 3-fault-tolerant erasure codes
• End-to-end checksum
• Protection against lost writes
• Off-the-shelf JBODs
• Standardized in-band SES management
• SSD acceleration built in

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/doc_updates/bl1du13a.pdf
© 2014 IBM Corporation
GSS v2.0 'Runs' GPFS v4.1 (Release "R2.0")
• GUI: configuration, performance monitoring
• Hardware changes: new servers and cards, smaller trays, SSD, SAS and NL-SAS drives
• Software enhancements: enclosure protection
(Diagram: GSS building blocks GSS#1 and GSS#2.)
© 2014 IBM Corporation
GPFS Native RAID @ x86
(Figure: building block of two NSD servers (x3650-M4, each with three LSI 9201-16e PCIe gen2 x8 HBAs) attached to six disk enclosures JBOD01-06 (6 x 60 disk slots). The drives form two recovery groups, RG01 and RG02; each recovery group has three declustered arrays of 58 disks (DA1-DA3) and a LOG group with 3-way replication across 3 drives, and each enclosure also holds SSDs.)
© 2014 IBM Corporation
GPFS GSS GUI #1: top-level navigation (preview information)
• Home • Monitoring • Files • Volumes • Copy Services • Access • Configuration
© 2014 IBM Corporation
GPFS GSS GUI #2
(Preview Information)
© 2014 IBM Corporation
GPFS GSS GUI #3
(Preview Information)
© 2014 IBM Corporation
IBM System x GPFS Storage Server (GSS) 2.0
Introducing four *new* models for an entry-level, high-performance storage server:
• Model 21s: 24 SSDs
• Model 22s: 48 SAS or SSD drives
• Model 24s: 96 SAS or SSD drives
• Model 26s: 144 SAS drives

What's inside the new models?
• Server: IBM System x3650 M4
• Storage: 2U JBOD (24 slots)
  - 24 SSDs (Model 21s)
  - 1.2 TB SAS drives plus a choice of 2 x 200 GB or 2 x 800 GB SSDs (Model 22s)
• Networking: 10 / 40 Gb Ethernet and/or FDR InfiniBand
• Software: GPFS 4.1

• Balanced system: high capacity and performance, near-linear scalability
• Less hardware: more reliable, lower cost!
• Pre-integrated, shipped with one part number, 3-year support
• Fast disk rebuilds

Announce: June 10, 2014; Ship Support: June 12; GA: June 13
Owner: Scott Seal, [email protected]; Sales Kit: SSI, PartnerWorld
© 2014 IBM Corporation
Non-intrusive disk diagnostics: the GPFS/GNR Disk Hospital
Background determination of problems
• While a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error correction code
• For writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete

Advanced fault determination
• Statistical reliability and SMART monitoring
• Neighbor check, drive power cycling
• Media error detection and correction
• Supports concurrent disk firmware updates
© 2014 IBM Corporation
GPFS GSS
http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/gpfs/
© 2014 IBM Corporation
20x IBM GSS‐24 @ FZ Jülich
http://www.fz-juelich.de/ias/jsc/EN/Expertise/Datamanagement/OnlineStorage/JUST/Configuration/Configuration_node.html
(4640 Disks + 120 SSDs)
© 2014 IBM Corporation
ELASTIC STORAGE SERVER (ESS)
Declustered Software RAID Building Block
© 2014 IBM Corporation
GPFS Elastic Storage Server (ESS) (*NEW* Oct 6th, 2014)
• Power8 server hardware
• Red Hat Enterprise Linux 7 for Power
• GPFS Standard Edition v4.1
• GPFS Native RAID v4.1
• IBM Support for xCAT 2
Models: 5146-GL2, 5146-GL4, 5146-GL6, 5146-GS1, 5146-GS2, 5146-GS4, 5146-GS6
(Storage enclosures: DCS3700, EXP24S)
© 2014 IBM Corporation
IDEA and the GPFS Elastic Storage Server (ESS) (*NEW* Oct 6th, 2014)
IBM Data Engine for Analytics (IDEA) is a customized infrastructure solution with integrated software that is optimized for big data and analytics workloads.
© 2014 IBM Corporation
TSM BACKUP TO DISK
TSM Backup to a Disk Storage Pool on GPFS GSS
© 2014 IBM Corporation
Whitepaper
PoC setup:
• 2 x TSM servers: IBM x3650-M4, Red Hat Enterprise Linux Server release 6.5, IBM Tivoli Storage Manager v7.1
• 1 x IBM System x GPFS Storage Server GSS26: 6 x 4U-60 drawers with 58 x 2 TB NL-SAS disks each, 348 disks in total
• 1 x Mellanox 32-port InfiniBand FDR switch; each TSM server is connected to the GSS system with a 56 Gbit/s link
"More TSM bang for the buck than EMC Isilon..."
© 2014 IBM Corporation
PoC Hardware Setup
© 2014 IBM Corporation
GSS as Backend Disk Storage for TSM
Environment: 1 x GSS26 connected via dedicated 56 Gbit/s InfiniBand links to 2 x TSM v7.1 servers
• Peak performance for a single TSM server is 5.4 GB/s sequential write with 10 or 50 parallel sessions
• Peak performance for both TSM servers is 4.5 GB/s per server, or 9 GB/s sequential write, with 10 sessions per server (or 3.8 GB/s per server with 50 sessions per server)
• Performance for a single sequential write session starts at 100 MB/s with 100 KB file size and reaches 2.5 GB/s with 1 GB file size
• Multiple sequential write session performance starts at 12 MB/s per session (50 parallel sessions = 600 MB/s) with 100 KB file size and reaches 108 MB/s per session (50 parallel sessions = 5.4 GB/s) with 1 GB file size
© 2014 IBM Corporation
GPFS‐FPO Shared Nothing Cluster and Hadoop
© 2014 IBM Corporation
GPFS‐FPO for Hadoop, BigData & HANA
PERFORMANCE & FLEXIBILITY
IMPROVED DATA SHARING FORBETTER COLLABORATION
BUSINESS CONTINUITY AND DATA INTEGRITY
MORE EFFECTIVE MANAGEMENT OF DATA OVER ITS LIFECYCLE
AVOID EXPENSIVE DATA SILOS WITH MORE VERSATILE STORAGE
Enterprise features
© 2014 IBM Corporation
What is HDFS ?The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file‐system written in Java for the Hadoop framework.
File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command‐line interface, or browsed through the HDFS‐UI webapp over HTTP.
http://en.wikipedia.org/wiki/Apache_Hadoop
Rangers know: "Lots of yellow elephants can cause extensive damage to your IT!"
© 2014 IBM Corporation
Research Paper
“In this paper, we revisit the debate on the need of a new non‐POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that it can be built on traditional POSIX‐based cluster filesystems.“
© 2014 IBM Corporation
Cluster storage configuration for Hadoop on GPFS-FPO
Example with 4 datanodes (3 internal disks per datanode):
• 2 pools: a system pool for metadata (1 disk per node) and an FPO data pool (2 disks per node for data)
• Several filesets in the data pool to manage block replication factors:
  - root fileset, replication factor 3: /gpfs-fpo
  - mrl fileset for the MapReduce local dir, replication factor 3: /gpfs-fpo/hadoop/mapred/local/node1-4
  - tmp_set fileset for the Hadoop framework, replication factor 1: /gpfs-fpo/tmp/hadoop4
• No namenode any more: metadata is distributed across the datanodes in a dedicated storage pool (the system pool), using physical disks or disk partitions
(Diagram: each datanode's internal JBOD disks /dev/sda-/dev/sdc become NSDs nsd1-nsd12; nsd1-nsd4 form the system pool for metadata, nsd5-nsd12 form the FPO data pool.)
A stanza sketch follows below.
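A rough stanza sketch for such an FPO layout; device names, node names and the exact encoding of the extended failure group (the rack,position,node triple) are assumptions to verify against the FPO documentation:

# fpo.stanza -- metadata in the system pool, data in a write-affinity FPO pool
%pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128

%nsd: device=/dev/sda nsd=nsd1 servers=dn1 usage=metadataOnly failureGroup=1,0,1 pool=system
%nsd: device=/dev/sdb nsd=nsd5 servers=dn1 usage=dataOnly     failureGroup=1,0,1 pool=datapool
%nsd: device=/dev/sdc nsd=nsd6 servers=dn1 usage=dataOnly     failureGroup=1,0,1 pool=datapool
# ... one set of stanzas per datanode (dn2 uses failureGroup=1,0,2, and so on)

1> mmcrnsd -F fpo.stanza
2> mmcrfs gpfsfpo -F fpo.stanza -T /gpfs-fpo -m 3 -M 3 -r 3 -R 3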
© 2014 IBM Corporation
GPFS-FPO for Hadoop/BigInsights
http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/gpfs/
© 2014 IBM Corporation
GPFS-FPO new capabilities for BigInsights: file system reliability
• GPFS-FPO avoids the need for a central namenode, a common failure point in HDFS
• Avoids long recovery times in the event of a namenode failure
• Pipelined replication for efficient storage of block replicas in a GPFS-FPO environment
• Boosts performance for metadata-intensive applications, where the namenode can emerge as a bottleneck
(Diagram: in HDFS, the Namenode and Secondary Namenode are central; in an IBM BigInsights cluster with GPFS-FPO, metadata is striped across the GPFS FPO nodes, providing better reliability and avoiding the need for primary and secondary name nodes.)
© 2014 IBM Corporation
GPFS-FPO new capabilities for BigInsights: flexible storage configuration
• With distributed metadata, GPFS-FPO avoids the need for a central namenode, a common failure point in HDFS environments
• Avoids long recovery times in the event that the namenode fails and metadata needs to be recovered from the secondary name node
• Pipelined replication for efficient storage of block replicas in a GPFS-FPO environment
(Diagram: an IBM BigInsights cluster with GPFS-FPO can combine shared-nothing storage (GPFS-FPO on the GPFS servers) and shared storage (classic GPFS behind a switched fabric).)
© 2014 IBM Corporation
GPFS-FPO (File Placement Optimizer): advanced storage for MapReduce

Hadoop HDFS                                       | IBM GPFS-FPO advantages
HDFS NameNode is a single point of failure        | No single point of failure, distributed metadata
Large block sizes, poor support for small files   | Variable block sizes, suited to multiple types of data and access patterns
Non-POSIX file system, obscure commands           | POSIX file system, easy to use and manage
Difficult to ingest data, special tools required  | Policy-based data ingest
Single-purpose, Hadoop MapReduce only             | Versatile, multi-purpose
Not recommended for critical data                 | Enterprise-class advanced storage features
© 2014 IBM Corporation
SUMMARY & ROADMAP
© 2014 IBM Corporation
GPFS Elastic Storage Vision
© 2014 IBM Corporation
GPFS Wiki, FAQ & Forums
• GPFS Home Page: http://www.ibm.com/systems/gpfs
• GPFS Wiki: http://www.ibm.com/developerworks/wikis/display/hpccentral/General+Parallel+File+System+(GPFS)
• GPFS FAQ: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf
• GPFS Forum and Mailing List:
  http://www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=479&cat=13
  http://lists.sdsc.edu/mailman/listinfo.cgi/gpfs-general
© 2014 IBM Corporation