Taking back control of
HPC file systems with
RobinHood Policy Engine
PER3S – 2017, JANUARY 30TH
Thomas Leibovici, CEA
http://robinhood.sf.net
https://github.com/cea-hpc/robinhood
FROM RESEARCH TO INDUSTRY
LOSING CONTROL? (1)
Filesystem features are limited:
- df: overall usage
- Quotas (if implemented): per-user inode/volume usage

For customized statistics, scanning is often required:
- find, du, …
→ Becomes endless as filesystems grow
Monitoring filesystem contents: what we have
LOSING CONTROL? (2)
An accurate view of user data profiles:
- Size profile
- Age profile
- Per project, per group, per directory…

Aggregated statistics at will
- Available in a few seconds

Customized reports
- Multiple arbitrary criteria
Filesystem activity indicators
Monitoring filesystem contents: what we need
LOSING CONTROL? (3)
Tools: cp, rsync, backup tools, …
Policies: find <criteria> -exec <action>
→ Again, needs scanning
→ Managing multiple criteria/actions is even longer and more painful!
Data management: what we have
LOSING CONTROL? (4)
Applying mass actions to filesystem entries
- Fast
- Using various criteria and actions

Life cycle management
- Hierarchical Storage, supporting many technologies (tapes, HDD, SSD, NVMe…)
- Cleaning old/unused data…

Other mass processing
- Post-processing, compression, dedup, …
- Checking data integrity…
Data management: what we need
ROBINHOOD POLICY ENGINE
ABOUT THE PROJECT
A few words about the project
Project started in 2006
OpenSource since 2009 (CeCILL-C, LGPL equivalent)
User community: ~1200 HPC sites
Included in several storage solutions (Cray, DDN, Intel, Seagate, …)
Git repository: http://github.com/cea-hpc/robinhood
Project home page: http://robinhood.sf.net
Mailing lists: [email protected], [email protected], [email protected]
ROBINHOOD POLICY ENGINE: BIG PICTURE
find and du clones
Fine-grained statistics + web UI
ROBINHOOD V3 PLUGIN BASED ARCHITECTURE
Robinhood core made generic:
- Purpose-specific code moved out of robinhood core: now dynamic plugins loaded at run-time
- All policy behaviors made configurable
- Users can write their own plugins for specific needs

Easily implement new policies, just by writing a few lines of configuration:
- OST rebalancing
- Pool-to-pool data migration
- Data integrity checks
- Trash can mechanism
- Massive data conversions
- …
OVERVIEW: FILE CLASSES
File classes
- Filesystem entries can be categorized in arbitrary fileclasses
- Definition based on any entry attributes
- Reporting of file class statistics provides an accurate knowledge of system contents

fileclass foobar_files {
    definition {
        (size == 0 or name == '*.log')
        and owner == 'foo' and type == file
    }
}
OVERVIEW: POLICIES
Custom policy definitions
Example: cleaning old unused files

declare_policy cleanup {
    default_action = common.unlink;       # default action of the policy
    default_lru_sort_attr = last_access;  # process oldest first
    status_manager = none;                # to manage a state-machine per entry (optional)
    scope { type == file }                # scope of the policy (optimization)
}

Basic policy rules:

cleanup_rules {
    ignore_fileclass = my_whitelisted;

    rule clean15d {
        target_fileclass = my_logs;
        target_fileclass = my_tmp_files;
        condition { last_mod > 15d }
    }
    …
}
QUICK TOUR: POLICIES (CONT’D)
Customizable actions and/or parameters, for each rule, fileclass, …

rule somerule {
    …
    action = cmd('mycommand.sh {path} -o {oneparam}');
    action_params {
        oneparam = 'truc';
    }
    …
}

Policy triggers
Example: trigger when a user's usage > 100TB

cleanup_trigger {
    trigger_on = user_usage([list]);
    high_threshold_vol = 100TB;
    check_interval = 1h;
}

Other trigger types:
- Group usage, overall FS usage, per-server usage…
- More to come: triggers as plugins
RELATED WORK
CHALLENGES
Robinhood challenges:
Collecting
Processing
Storing
Aggregating
Reporting
Managing data
Predicting accesses
FOCUS: COLLECTING INFORMATION FROM STORAGE SYSTEMS
Scanning is inefficient and doesn't scale
- Scanning is fundamentally an O(N) operation
- Scanning a POSIX namespace requires reading the contents of all directories sequentially:
  - readdir returns entries one by one
  - getdents is a little better (returns chunks of entries) but is still sequential
- POSIX scanning results in a storm of syscalls
- E.g.: 1 billion entries @ 1000 entries/sec = 1 million seconds = 11+ days
Let’s look for better solutions…
THE PROBLEMS WITH SCANNING
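The arithmetic above reduces to a one-line estimate; a minimal sketch in Python (the 1000 entries/sec rate is the slide's illustrative figure, not a benchmark):

```python
# Back-of-the-envelope cost of a sequential POSIX scan.
def scan_duration_days(n_entries, entries_per_sec):
    """Wall-clock days needed to enumerate n_entries sequentially."""
    return n_entries / entries_per_sec / 86400  # 86400 seconds per day

# 1 billion entries at 1000 entries/sec:
print(round(scan_duration_days(1_000_000_000, 1000), 1))  # 11.6 days
```

Because the cost is linear in N, no tuning of the scan rate changes the asymptotic picture as the filesystem grows.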
To avoid scanning, Robinhood collects incremental changes from filesystems.
Most complete implementation today: Lustre "changelogs"
- All metadata changes (create, unlink, setattr, …) and data-related operations (open, close, mtime/atime change, …) are reported in near real-time in a transactional log
- Thus robinhood can maintain an up-to-date view of the filesystem metadata, without scanning

Other implementations:
- POSIX: mechanisms based on inotify, fanotify, …
- ???
For scalable management, think about implementing such a mechanism for your
storage system!
AVOID SCANNING
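The changelog-consumption loop can be sketched as follows. This is a simplified illustration, not robinhood's actual pipeline: the event dict fields are modeled loosely on Lustre changelog records, the reader loop is omitted, and SQLite stands in for the real database backend.

```python
import sqlite3

def apply_event(db, event):
    """Apply one change record to the local metadata mirror,
    keeping it up to date without ever scanning the namespace."""
    if event["op"] == "UNLINK":
        db.execute("DELETE FROM entries WHERE fid = ?", (event["fid"],))
    else:  # CREATE, SETATTR, CLOSE, ... all refresh the entry's attributes
        db.execute(
            "INSERT OR REPLACE INTO entries(fid, path, size) VALUES (?, ?, ?)",
            (event["fid"], event["path"], event.get("size", 0)),
        )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries(fid TEXT PRIMARY KEY, path TEXT, size INTEGER)")
apply_event(db, {"op": "CREATE", "fid": "0x200:0x1", "path": "/scratch/a", "size": 4096})
apply_event(db, {"op": "UNLINK", "fid": "0x200:0x1", "path": "/scratch/a"})
```

Each record is idempotent per entry, which is what lets the mirror stay consistent even if the log is replayed after a crash.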
SCALING EVENT PROCESSING
Processing multiple event streams: scaling
Possible configurations:
- 1 reader thread per changelog stream
- 1 reader process per changelog stream (possibly on multiple clients)
- N readers with changelog proxies
[Diagrams: three deployment layouts — (1) one robinhood instance with one CLreader thread per MDT; (2) one robinhood instance per MDT, each on its own Lustre client; (3) N readers behind "LCAP" changelog proxies]
PARALLEL SCANS
If scanning is the only solution…
Robinhood implements a multi-threaded version of depth-first traversal:
- The namespace is split into individual tasks
- Each task consists in scanning a single directory
- Depth-first traversal to limit memory usage
[Diagram: example with 4 threads — the initial task spawns new tasks as threads discover subdirectories]
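The traversal described above can be sketched with a shared task queue. This is a simplified stand-in for robinhood's scanner (the idle-timeout termination and error handling are deliberately naive); a LIFO queue approximates the depth-first order that bounds queue growth.

```python
import os
import queue
import threading

def parallel_scan(root, n_threads=4):
    """Multi-threaded traversal: one task per directory; LIFO order
    approximates depth-first and limits the size of the task queue."""
    tasks = queue.LifoQueue()
    tasks.put(root)
    entries, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                dirpath = tasks.get(timeout=0.5)  # idle threads exit
            except queue.Empty:
                return
            with os.scandir(dirpath) as it:
                for e in it:
                    with lock:
                        entries.append(e.path)
                    if e.is_dir(follow_symlinks=False):
                        tasks.put(e.path)  # each subdirectory becomes a new task
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return entries
```

Because directories are independent tasks, the thread count can be raised until the metadata server, not the client, becomes the bottleneck, which is what the scan-speed figure below measures.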
[Figure: robinhood scan speed — entries/sec (0 to 90,000) vs. number of scan threads (0 to 32), XFS filesystem scan]
Distributing scans
Scan can be distributed across multiple clients
Split the namespace between instances
This aggregates the ops/sec of the individual clients
PARALLEL SCAN (MULTI-CLIENT)
[Diagram: three robinhood instances, each on its own filesystem client, scanning distinct subtrees of the FS root and feeding a shared robinhood DB]
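One simple way to split the namespace between instances is to hash top-level directory names; each instance keeps the directories that map to its rank. The function below is a hypothetical illustration of that partitioning, not robinhood's actual assignment scheme.

```python
import os
import zlib

def my_share(fs_root, rank, n_instances):
    """Top-level directories this scan instance is responsible for:
    hash each name and keep those matching our rank. Deterministic,
    so all instances agree without coordination."""
    return [
        e.path
        for e in os.scandir(fs_root)
        if e.is_dir(follow_symlinks=False)
        and zlib.crc32(e.name.encode()) % n_instances == rank
    ]
```

The shares are disjoint by construction and cover every top-level directory, so the instances never duplicate work.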
SCANNING: PERSPECTIVES
Implement a "Bulk Scan" filesystem service
- Scanning based on POSIX calls is not efficient
- Most of the time, filesystem metadata can be dumped more efficiently
  Example: ext4 metadata can be dumped very quickly using the "e2scan" command
- Proposal: add a "bulk scan" feature to filesystems
  Need to standardize the interface to invoke such a feature
- Bonus features: server-side filtering, …
[Diagram: a metadata service streams a list of entries (generic format) to the policy engine]
FOCUS: STORING METADATA
FILESYSTEM AND DATABASE BENEFITS
Benefits of each:

Filesystem — goals:
- Optimize data access (bandwidth, data allocation)
- Optimize metadata access for POSIX: lookup/readdir/create/unlink
- Suited to data-intensive workloads
- Example: find . -user foo -size -1024

Database — goals:
- Optimize per-record access: select/insert/update
- Optimize multi-criteria searches
- Optimize aggregating/sorting information
- Suited to search & aggregate
- Example: select * from ENTRIES where user='foo' and size<1024
OPTIMIZING INGEST RATE
Optimizing database ingest rate
- Ingest rate is critical: must keep up with high filesystem activity
- No index on the entries table, or ingest performance would drop

DB transactions: possible strategies
- Batch database operations into large transactions to reduce DB IOPS
  Better for slow devices (spinning disks): up to 10k ops/sec
- Run multiple operations in parallel
  Better for SSDs: up to 35k ops/sec
- Best performance: mix the two methods, executing multiple batches of operations in parallel
  Up to 80k ops/sec
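The batching strategy above can be sketched as follows, with SQLite standing in for the real backend (the batch size is illustrative; in practice it would be tuned per device, and the parallel variant would run one such loop per worker connection):

```python
import sqlite3

BATCH_SIZE = 1000  # illustrative; tuned for the storage device in practice

def batched_ingest(db, records):
    """Group records into large transactions: one commit per batch
    instead of one per record, cutting DB IOPS dramatically."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= BATCH_SIZE:
            with db:  # one transaction (one commit) for the whole batch
                db.executemany(
                    "INSERT OR REPLACE INTO entries(fid, size) VALUES (?, ?)", batch)
            batch = []
    if batch:  # flush the final partial batch
        with db:
            db.executemany(
                "INSERT OR REPLACE INTO entries(fid, size) VALUES (?, ?)", batch)
```

The commit is the expensive synchronous step on spinning disks, which is why amortizing it over a thousand operations pays off more there than on SSDs.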
SCALING ROBINHOOD DATABASE
Statement
To exceed the ingest rate of a single DB host, the robinhood DB backend must be parallel.
Many "clustered" versions of databases (MySQL, MariaDB) are designed to scale for read operations (typical application: websites), but rarely for updates, being based on a replication bus.

Solutions
- Some popular databases are natively parallel (e.g. HBase, MongoDB, …)
  Weakness: ACID properties
- Sharding can satisfy robinhood's needs: most robinhood operations are about a single record (no need for distributed transactions)
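The sharding argument can be illustrated in a few lines: route each record to a shard by hashing its FID, so single-record operations never span shards. The class below is a toy model (plain dicts stand in for independent DB instances), not robinhood's backend.

```python
import zlib

class ShardedStore:
    """Hash-based sharding: each FID maps to exactly one shard, so
    per-record upserts and lookups need no distributed transactions."""
    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard(self, fid):
        return self.shards[zlib.crc32(fid.encode()) % len(self.shards)]

    def upsert(self, fid, attrs):
        self._shard(fid)[fid] = attrs

    def get(self, fid):
        return self._shard(fid).get(fid)
```

Only cross-shard reports (aggregations over all entries) need a scatter-gather step, and those are read-only, which matches what clustered databases already scale well.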
FOCUS: PREDICTING ACCESSES
SEQUENCE PREDICTION
Benefits of Machine Learning
Powerful OpenSource frameworks are now available: Caffe, Torch, Scikit-learn, …

Example application: predict application accesses based on past runs
→ Makes it possible to optimize data placement (caching, HSM, …)

[Diagram: past access sequences (from previous application runs) are learned; a "live" sequence is then used to predict the next operations — Create A, Remove B, Write X, Read Y, …]
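As a toy stand-in for the learn/predict loop above, a first-order transition model already captures the idea: learn operation pairs from past runs, then predict the most frequent successor of the last observed operation. Real deployments would use the ML frameworks named above; this sketch only illustrates the workflow.

```python
from collections import Counter, defaultdict

class NextAccessModel:
    """First-order model of access sequences: count op -> next-op
    transitions from past runs, predict the most frequent successor."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def learn(self, sequence):
        for cur, nxt in zip(sequence, sequence[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, last_op):
        counts = self.transitions.get(last_op)
        return counts.most_common(1)[0][0] if counts else None

model = NextAccessModel()
model.learn(["create A", "remove B", "write X", "read Y",
             "create A", "remove B", "write X"])
model.predict("remove B")  # most frequent successor: "write X"
```

A prediction like this is what would drive placement decisions: prefetch into cache, or pre-stage from tape, the data the application is most likely to touch next.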
CONCLUSION
Summary
Robinhood is a swiss-army knife to manage filesystems:
- to monitor filesystem contents
- to schedule automatic actions on filesystem entries

Adopted by the HPC community

Work in progress:
- Always make it faster and more scalable (parallel DB)
- Always make it more generic; adapt it to new generations of storage systems (object stores…)
- Complement admin-defined rules with self-determined behaviors (Machine Learning)

Conclusion: drop your old-fashioned scripts based on 'find' and 'du'!
DAM Île-de-FranceCommissariat à l’énergie atomique et aux énergies alternatives
CEA / DAM Ile-de-France| Bruyères-le-Châtel - 91297 Arpajon Cedex
T. +33 (0)1 69 26 40 00
Etablissement public à caractère industriel et commercial | RCS Paris B 775 685 019