Taking back control of
HPC file systems with
RobinHood Policy Engine
PER3S – 2017, JANUARY 30TH
Thomas Leibovici, CEA
http://robinhood.sf.net
https://github.com/cea-hpc/robinhood
FROM RESEARCH TO INDUSTRY
LOSING CONTROL? (1)
Filesystem features are limited:
- df: overall usage
- Quotas (if implemented): per-user inode/volume usage

For customized statistics, scanning is often required:
- find, du, …
→ Becomes endless as filesystems grow
Monitoring filesystem contents: what we have
LOSING CONTROL? (2)
An accurate view of user data profiles:
- Size profile
- Age profile
- Per project, per group, per directory…

Aggregated statistics at will
- Available in a few seconds

Customized reports
- Multiple arbitrary criteria
Filesystem activity indicators
Monitoring filesystem contents: what we need
LOSING CONTROL? (3)
Tools: cp, rsync, backup tools, …
Policies: find <criteria> -exec <action>
→ Again, needs scanning
→ Managing multiple criteria/actions is even longer and more painful!
Data management: what we have
LOSING CONTROL? (4)
Applying mass actions to filesystem entries
- Fast
- Using various criteria and actions

Life cycle management
- Hierarchical Storage, supporting many technologies (tapes, HDD, SSD, NVMe…)
- Cleaning old/unused data…

Other mass processing
- Post-processing, compression, dedup, …
- Checking data integrity…
Data management: what we need
ROBINHOOD POLICY ENGINE
ABOUT THE PROJECT
A few words about the project
Project started in 2006
OpenSource since 2009 (CeCILL-C, LGPL equivalent)
User community: ~1200 HPC sites
Included in several storage solutions (Cray, DDN, Intel, Seagate, …)
Git repository: http://github.com/cea-hpc/robinhood
Project home page: http://robinhood.sf.net
Mailing lists: [email protected], [email protected], [email protected]
ROBINHOOD POLICY ENGINE: BIG PICTURE
find and du clones
Fine-grained statistics + web UI
ROBINHOOD V3 PLUGIN BASED ARCHITECTURE
Robinhood core made generic:
- Purpose-specific code moved out of robinhood core: now dynamic plugins loaded at run-time
- All policy behaviors made configurable
- Users can write their own plugins for specific needs

Easily implement new policies, just by writing a few lines of configuration:
- OST rebalancing
- Pool-to-pool data migration
- Data integrity checks
- Trash can mechanism
- Massive data conversions
- …
OVERVIEW: FILE CLASSES
File classes
- Filesystem entries can be categorized in arbitrary fileclasses
- Definition based on any entry attributes
- Reporting of file class statistics provides an accurate knowledge of system contents

fileclass foobar_files {
    definition {
        (size == 0 or name == '*.log')
        and owner == 'foo' and type == file
    }
}
OVERVIEW: POLICIES
Custom policy definitions
Example: cleaning old unused files

declare_policy cleanup {
    default_action = common.unlink;       # default action of the policy
    default_lru_sort_attr = last_access;  # process oldest first
    status_manager = none;                # to manage a state-machine per entry (optional)
    scope { type == file }                # scope of the policy (optimization)
}

Basic policy rules:

cleanup_rules {
    ignore_fileclass = my_whitelisted;

    rule clean15d {
        target_fileclass = my_logs;
        target_fileclass = my_tmp_files;
        condition { last_mod > 15d }
    }
    …
}
QUICK TOUR: POLICIES (CONT’D)
Customizable actions and/or parameters, for each rule, fileclass, …

rule somerule {
    …
    action = cmd('mycommand.sh {path} -o {oneparam}');
    action_params {
        oneparam = 'truc';
    }
    …
}

Policy triggers
Example: trigger when a user's usage > 100TB

cleanup_trigger {
    trigger_on = user_usage([list]);
    high_threshold_vol = 100TB;
    check_interval = 1h;
}

Other trigger types:
- Group usage, overall FS usage, per-server usage…
- More to come: triggers as plugins
RELATED WORK
CHALLENGES
Robinhood challenges:
Collecting
Processing
Storing
Aggregating
Reporting
Managing data
Predicting accesses
FOCUS: COLLECTING INFORMATION FROM STORAGE SYSTEMS
Scanning is inefficient and doesn't scale
- Scanning is fundamentally an O(N) operation
- Scanning a POSIX namespace requires reading the contents of all directories sequentially:
  - readdir returns entries one by one
  - getdents is a little better (returns chunks of entries) but is still sequential
- POSIX scanning results in a storm of syscalls
- E.g.: 1 billion entries @ 1000 entries/sec = 1 million seconds = 11+ days
Let’s look for better solutions…
THE PROBLEMS WITH SCANNING
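The arithmetic above reduces to a one-line estimate; a minimal sketch in Python (the 1000 entries/sec rate is the slide's illustrative figure, not a benchmark):

```python
# Back-of-the-envelope cost of a sequential POSIX scan.
def scan_duration_days(n_entries, entries_per_sec):
    """Wall-clock days needed to enumerate n_entries sequentially."""
    return n_entries / entries_per_sec / 86400  # 86400 seconds per day

# 1 billion entries at 1000 entries/sec:
print(round(scan_duration_days(1_000_000_000, 1000), 1))  # 11.6 days
```

Because the cost is linear in N, no tuning of the scan rate changes the asymptotic picture as the filesystem grows.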
To avoid scanning, Robinhood collects incremental changes from filesystems.
Most complete implementation today: Lustre "changelogs"
- All metadata changes (create, unlink, setattr, …) and data-related operations (open, close, mtime/atime change, …) are reported in near real-time in a transactional log
- Thus robinhood can maintain an up-to-date view of the filesystem metadata, without scanning

Other implementations:
- POSIX: mechanisms based on inotify, fanotify, …
- ???
For scalable management, think about implementing such a mechanism for your
storage system!
AVOID SCANNING
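The changelog-consumption loop can be sketched as follows. This is a simplified illustration, not robinhood's actual pipeline: the event dict fields are modeled loosely on Lustre changelog records, the reader loop is omitted, and SQLite stands in for the real database backend.

```python
import sqlite3

def apply_event(db, event):
    """Apply one change record to the local metadata mirror,
    keeping it up to date without ever scanning the namespace."""
    if event["op"] == "UNLINK":
        db.execute("DELETE FROM entries WHERE fid = ?", (event["fid"],))
    else:  # CREATE, SETATTR, CLOSE, ... all refresh the entry's attributes
        db.execute(
            "INSERT OR REPLACE INTO entries(fid, path, size) VALUES (?, ?, ?)",
            (event["fid"], event["path"], event.get("size", 0)),
        )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries(fid TEXT PRIMARY KEY, path TEXT, size INTEGER)")
apply_event(db, {"op": "CREATE", "fid": "0x200:0x1", "path": "/scratch/a", "size": 4096})
apply_event(db, {"op": "UNLINK", "fid": "0x200:0x1", "path": "/scratch/a"})
```

Each record is idempotent per entry, which is what lets the mirror stay consistent even if the log is replayed after a crash.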
SCALING EVENT PROCESSING
Processing multiple event streams: scaling
Possible configurations:
- 1 reader thread per changelog stream
- 1 reader process per changelog stream (possibly on multiple clients)
- N readers with changelog proxies
[Diagrams: three deployment layouts — (1) one robinhood instance with one CLreader thread per MDT; (2) one robinhood instance per MDT, each on its own Lustre client; (3) N readers behind "LCAP" changelog proxies]
PARALLEL SCANS
If scanning is the only solution…
Robinhood implements a multi-threaded version of depth-first traversal:
- The namespace is split into individual tasks
- Each task consists in scanning a single directory
- Depth-first traversal to limit memory usage
[Diagram: example with 4 threads — the initial task spawns new tasks as threads discover subdirectories]
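The traversal described above can be sketched with a shared task queue. This is a simplified stand-in for robinhood's scanner (the idle-timeout termination and error handling are deliberately naive); a LIFO queue approximates the depth-first order that bounds queue growth.

```python
import os
import queue
import threading

def parallel_scan(root, n_threads=4):
    """Multi-threaded traversal: one task per directory; LIFO order
    approximates depth-first and limits the size of the task queue."""
    tasks = queue.LifoQueue()
    tasks.put(root)
    entries, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                dirpath = tasks.get(timeout=0.5)  # idle threads exit
            except queue.Empty:
                return
            with os.scandir(dirpath) as it:
                for e in it:
                    with lock:
                        entries.append(e.path)
                    if e.is_dir(follow_symlinks=False):
                        tasks.put(e.path)  # each subdirectory becomes a new task
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return entries
```

Because directories are independent tasks, the thread count can be raised until the metadata server, not the client, becomes the bottleneck, which is what the scan-speed figure below measures.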
[Figure: robinhood scan speed — entries/sec (0 to 90,000) vs. number of scan threads (0 to 32), XFS filesystem scan]
Distributing scans
Scan can be distributed across multiple clients
Split the namespace between instances
This aggregates the ops/sec of the individual clients
PARALLEL SCAN (MULTI-CLIENT)
[Diagram: three robinhood instances, each on its own filesystem client, scanning distinct subtrees of the FS root and feeding a shared robinhood DB]
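One simple way to split the namespace between instances is to hash top-level directory names; each instance keeps the directories that map to its rank. The function below is a hypothetical illustration of that partitioning, not robinhood's actual assignment scheme.

```python
import os
import zlib

def my_share(fs_root, rank, n_instances):
    """Top-level directories this scan instance is responsible for:
    hash each name and keep those matching our rank. Deterministic,
    so all instances agree without coordination."""
    return [
        e.path
        for e in os.scandir(fs_root)
        if e.is_dir(follow_symlinks=False)
        and zlib.crc32(e.name.encode()) % n_instances == rank
    ]
```

The shares are disjoint by construction and cover every top-level directory, so the instances never duplicate work.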
SCANNING: PERSPECTIVES
Implement a "Bulk Scan" filesystem service
- Scanning based on POSIX calls is not efficient
- Most of the time, filesystem metadata can be dumped more efficiently
  Example: ext4 metadata can be dumped very quickly using the "e2scan" command
- Proposal: add a "bulk scan" feature to filesystems
  Need to standardize the interface to invoke such a feature
- Bonus features: server-side filtering, …
[Diagram: a metadata service streams a list of entries (generic format) to the policy engine]
FOCUS: STORING METADATA
FILESYSTEM AND DATABASE BENEFITS
Benefits of each:

Filesystem — goals:
- Optimize data access (bandwidth, data allocation)
- Optimize metadata access for POSIX: lookup/readdir/create/unlink
- Suited to data-intensive workloads
- Example: find . -user foo -size -1024

Database — goals:
- Optimize per-record access: select/insert/update
- Optimize multi-criteria searches
- Optimize aggregating/sorting information
- Suited to search & aggregate
- Example: select * from ENTRIES where user='foo' and size<1024
OPTIMIZING INGEST RATE
Optimizing database ingest rate
- Ingest rate is critical: must keep up with high filesystem activity
- No index on the entries table, or ingest performance would drop

DB transactions: possible strategies
- Batch database operations into large transactions to reduce DB IOPS
  Better for slow devices (spinning disks): up to 10k ops/sec
- Run multiple operations in parallel
  Better for SSDs: up to 35k ops/sec
- Best performance: mix the two methods, executing multiple batches of operations in parallel
  Up to 80k ops/sec
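The batching strategy above can be sketched as follows, with SQLite standing in for the real backend (the batch size is illustrative; in practice it would be tuned per device, and the parallel variant would run one such loop per worker connection):

```python
import sqlite3

BATCH_SIZE = 1000  # illustrative; tuned for the storage device in practice

def batched_ingest(db, records):
    """Group records into large transactions: one commit per batch
    instead of one per record, cutting DB IOPS dramatically."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= BATCH_SIZE:
            with db:  # one transaction (one commit) for the whole batch
                db.executemany(
                    "INSERT OR REPLACE INTO entries(fid, size) VALUES (?, ?)", batch)
            batch = []
    if batch:  # flush the final partial batch
        with db:
            db.executemany(
                "INSERT OR REPLACE INTO entries(fid, size) VALUES (?, ?)", batch)
```

The commit is the expensive synchronous step on spinning disks, which is why amortizing it over a thousand operations pays off more there than on SSDs.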
SCALING ROBINHOOD DATABASE
Statement
To exceed the ingest rate of a single DB host, the robinhood DB backend must be parallel.
Many "clustered" versions of databases (MySQL, MariaDB) are designed to scale for read operations (typical application: websites), but rarely for updates, being based on a replication bus.

Solutions
- Some popular databases are natively parallel (e.g. HBase, MongoDB, …)
  Weakness: ACID properties
- Sharding can satisfy robinhood's needs: most robinhood operations are about a single record (no need for distributed transactions)
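The sharding argument can be illustrated in a few lines: route each record to a shard by hashing its FID, so single-record operations never span shards. The class below is a toy model (plain dicts stand in for independent DB instances), not robinhood's backend.

```python
import zlib

class ShardedStore:
    """Hash-based sharding: each FID maps to exactly one shard, so
    per-record upserts and lookups need no distributed transactions."""
    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard(self, fid):
        return self.shards[zlib.crc32(fid.encode()) % len(self.shards)]

    def upsert(self, fid, attrs):
        self._shard(fid)[fid] = attrs

    def get(self, fid):
        return self._shard(fid).get(fid)
```

Only cross-shard reports (aggregations over all entries) need a scatter-gather step, and those are read-only, which matches what clustered databases already scale well.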
FOCUS: PREDICTING ACCESSES
SEQUENCE PREDICTION
Benefits of Machine Learning
Powerful OpenSource frameworks are now available: Caffe, Torch, Scikit-learn, …

Example application: predict application accesses based on past runs
→ Makes it possible to optimize data placement (caching, HSM, …)

[Diagram: past access sequences (from previous application runs) are learned; a "live" sequence is then used to predict the next operations — Create A, Remove B, Write X, Read Y, …]
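As a toy stand-in for the learn/predict loop above, a first-order transition model already captures the idea: learn operation pairs from past runs, then predict the most frequent successor of the last observed operation. Real deployments would use the ML frameworks named above; this sketch only illustrates the workflow.

```python
from collections import Counter, defaultdict

class NextAccessModel:
    """First-order model of access sequences: count op -> next-op
    transitions from past runs, predict the most frequent successor."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def learn(self, sequence):
        for cur, nxt in zip(sequence, sequence[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, last_op):
        counts = self.transitions.get(last_op)
        return counts.most_common(1)[0][0] if counts else None

model = NextAccessModel()
model.learn(["create A", "remove B", "write X", "read Y",
             "create A", "remove B", "write X"])
model.predict("remove B")  # most frequent successor: "write X"
```

A prediction like this is what would drive placement decisions: prefetch into cache, or pre-stage from tape, the data the application is most likely to touch next.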
CONCLUSION
Summary
Robinhood is a swiss-army knife to manage filesystems:
- to monitor filesystem contents
- to schedule automatic actions on filesystem entries

Adopted by the HPC community

Work in progress:
- Always make it faster and more scalable (parallel DB)
- Always make it more generic; adapt it to new generations of storage systems (object stores…)
- Complement admin-defined rules with self-determined behaviors (Machine Learning)

Conclusion: drop your old-fashioned scripts based on 'find' and 'du'!
DAM Île-de-FranceCommissariat à l’énergie atomique et aux énergies alternatives
CEA / DAM Ile-de-France| Bruyères-le-Châtel - 91297 Arpajon Cedex
T. +33 (0)1 69 26 40 00
Etablissement public à caractère industriel et commercial | RCS Paris B 775 685 019