Post on 27-Dec-2015
Synchronizing Lustre file systems
Dénes Németh (nemeth.denes@iit.bme.hu)
Balázs Fülöp (fulop.balazs@ik.bme.hu)
Dr. János Török (torok@ik.bme.hu)
Dr. Imre Szeberényi (szebi@iit.bme.hu)
The current state of the art
• Partially solved
– Conventional local file systems
– Off-line operation (rsync)
• Problems
– Walk through the directory structure
– Have to know what will change (inotify)
– Does not work on distributed file systems
– Scalability problems
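The scalability problem of the off-line (rsync-style) approach above can be made concrete: every sync must walk the whole tree, so the cost is proportional to the total number of files even when nothing changed. A minimal sketch of such a walk (function and parameter names are ours, not from the talk):

```python
import os

def changed_paths(src_root, dst_mtimes):
    """Walk the entire source tree and report files whose mtime differs
    from the recorded destination state. The walk itself is O(total
    files) per sync, even for an empty change set -- the core
    scalability problem of off-line tools on huge file systems."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if dst_mtimes.get(path) != os.stat(path).st_mtime:
                changed.append(path)
    return changed
```

Event-driven approaches such as inotify avoid the walk, but as the bullets note, they require registering the watched paths in advance and do not work across a distributed file system.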
The environment - Lustre
• Distributed
– Stripes (parts of a file) on separate hosts
– ~100-1000 clients (reading and writing)
• Redundant
– File system and file metadata
• Fault tolerant
– Transaction-driven operations
– Rollback capability
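Striping as described above can be pictured as a round-robin mapping from a byte offset of a file to (object storage target, offset within that target's object). A toy model of that mapping; the stripe size and OST list are illustrative values, not Lustre defaults:

```python
def locate_stripe(offset, stripe_size, ost_indices):
    """Map a byte offset of a striped file to (OST index, offset inside
    the object stored on that OST), assuming plain round-robin striping."""
    stripe_no = offset // stripe_size                 # which stripe overall
    ost = ost_indices[stripe_no % len(ost_indices)]   # round-robin target
    # Offset inside that OST's object: full stripes already placed
    # there, plus the remainder within the current stripe.
    obj_off = (stripe_no // len(ost_indices)) * stripe_size + offset % stripe_size
    return ost, obj_off
```

With stripes spread over separate hosts like this, many clients can read and write disjoint parts of one file in parallel, which is what makes the ~100-1000-client workload feasible.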
Lustre – synchronization
• Distributed
– Hosts need absolute event sequencing
– Is the time accurate enough?
– Clients demand extreme efficiency
• Redundant, fault tolerant
– Pulling the plug during synchronization
• Moving and tracking events
– Roll synchronization back to transaction boundaries
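Tying synchronization to transactions, as the last bullet suggests, means replicated changes are applied only at transaction boundaries: pulling the plug mid-sync leaves the replica at the last committed transaction rather than in a half-applied state. A toy model of that idea (the data shapes and failure simulation are ours):

```python
def apply_transactions(state, transactions):
    """Apply change sets transaction by transaction. If any change in a
    transaction fails, keep the state as of the last transaction
    boundary instead of leaving a half-applied change set."""
    committed = dict(state)
    for txn in transactions:
        trial = dict(committed)        # work on a copy of committed state
        try:
            for key, value in txn:
                if value is None:      # simulated failure mid-transaction
                    raise ValueError("corrupt change")
                trial[key] = value
        except ValueError:
            break                      # stop at the failed transaction ...
        committed = trial              # ... keeping only complete ones
    return committed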
The basic Lustre concept
[Diagram: Lustre server side with a Metadata Server (with failover) and Object Storage Targets; Lustre client side with ~100-1000 clients; the metadata plays the role of the "inode".]
Moving the information - metadata
[Diagram: a kernel-space Lustre Metadata Access module on the Metadata Server feeds an Event Reporter and a Local Event Sequencer; events pass through an Event Multiplexer to a Global Event Sequencer and an Event Processor; Object Storage Targets and ~100-1000 clients on the Lustre client side.]
How to move the information
[Diagram: the Metadata Server hosts the Event Reporter and Local Event Sequencer; three candidate channels (a block device, the proc file system, and TCP/IP networking) carry events to the Event Multiplexer, Global Event Sequencer, and Event Processor.]
Block Device
• Asynchronous notification
• System calls:
– select (with timeout)
– read, write (blocking)
• Max ~100,000 events/sec
• Relatively complicated access
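The select-with-timeout plus blocking-read pattern above can be sketched in user space. For a self-contained example we read from an ordinary file descriptor (a socket pair stands in for the real block device node, which is an assumption of this sketch):

```python
import os
import select

def drain_events(fd, timeout=0.1, chunk=4096):
    """Read pending event records from a file descriptor using the
    select (with timeout) + blocking read pattern: wait until data is
    ready, read it, and return once the timeout expires or EOF is hit."""
    data = b""
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            return data        # timeout: no more pending events
        buf = os.read(fd, chunk)
        if not buf:
            return data        # EOF: writer side closed
        data += buf
```

The "relatively complicated access" bullet refers to exactly this kind of descriptor-level plumbing, compared with the proc-file interface described next.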
Proc File System
• Easy access from user-space
• Notifications through signals
• Possibility for multiple reporters
• Minimal network usage
• Usually not a bottleneck
• ER & EM can be deployed together or separately
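The signal-based notification listed above can be sketched as a handler that marks new event data pending; the actual proc-file interface of the reporter is not shown, and the use of SIGUSR1 on a Unix system is an assumption of this sketch:

```python
import os
import signal

pending = {"events": 0}

def on_event(signum, frame):
    """Signal handler: the kernel-side Event Reporter would raise
    SIGUSR1 when new event records are readable from the proc file."""
    pending["events"] += 1

signal.signal(signal.SIGUSR1, on_event)
os.kill(os.getpid(), signal.SIGUSR1)   # simulate one kernel notification
```

Because the notification is a signal and the data is read through an ordinary file in /proc, this channel is easy to use from user space, which is the advantage claimed in the bullets above.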
TCP/IP Network
• Just multiplexing events
• No problems
• No authorization or registration (fixed configuration)
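The multiplexer's job of "just" fanning events out over fixed TCP connections can be modeled as forwarding each incoming event to every downstream sink in a list that is fixed at start-up; a fixed configuration means no runtime registration protocol is needed. A minimal sketch (class and method names are ours):

```python
class EventMultiplexer:
    """Forward every incoming event to a fixed set of downstream sinks.
    No authorization or registration: the sink list is configuration,
    set once at start-up, mirroring the 'fixed configuration' bullet."""

    def __init__(self, sinks):
        self.sinks = list(sinks)   # fixed at start-up

    def publish(self, event):
        for sink in self.sinks:   # plain fan-out, no filtering
            sink(event)
```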
Global Event Sequencer
• Big difficulties
• Sequencing requires accurate timing
• Network delay
• Delay from FS overload
• Connections to all MDSs
• Can be a bottleneck
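The sequencing difficulty above is essentially a merge-with-watermark problem: an event is only safe to emit globally once every connected MDS has reported past its timestamp, otherwise a delayed stream could still deliver an earlier event. A simplified version, assuming each per-MDS stream is already in local timestamp order (the data format is ours):

```python
import heapq

def global_order(streams):
    """Merge per-MDS streams of (timestamp, event) pairs into one
    globally ordered sequence. An event can only be emitted once every
    stream has been consumed up to its timestamp -- which is why network
    delay or an overloaded FS on any one MDS stalls the whole global
    sequencer, making it a potential bottleneck."""
    return [event for _, event in heapq.merge(*streams)]
```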
Average sequencing performance
[Plot: as long as the server has enough threads, performance is OK (constant QoS); once the server needs more threads, performance drops linearly, a "graceful degradation" at roughly 5,000 events per thread.]
How to commit the changes
[Diagram: three synchronized file systems (SFS 1-3), each with an MDS and OST; Event Reporters and Event Multiplexers feed Event Processors and Committer Clients; events A4 and B3 can arrive at the committers out of order.]
How can event "3" be executed if event "4" has already happened? Unfortunately, there is no really good solution.
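Detecting the "3 after 4" situation only requires tracking, per object, the last applied sequence number (in the slide's notation, A4 means event 4 on object A). A sketch of that classification; the three labels are ours:

```python
def classify(applied, obj, seq):
    """Decide what an incoming event (obj, seq) is relative to what has
    already been committed for that object: the next event in order, a
    gap (earlier events still missing), or the problematic late event
    that should have been executed before an already committed one."""
    last = applied.get(obj, 0)
    if seq == last + 1:
        applied[obj] = seq     # in order: commit and advance
        return "apply"
    if seq > last + 1:
        return "gap"           # wait: earlier events still in flight
    return "late"              # the '3 after 4' case: no good solution
```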
Event sequence error resolution
1. Ostrich policy
• Drop all events with a conflicting sequence
2. Conflict detection
• Is the event applicable?
• Still in the design stage
3. Replaying the already committed events
• Currently lacks Lustre support
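Option 1 above (the ostrich policy) can be written as a filter over the event stream that silently drops anything out of per-object sequence order; option 2 would replace the unconditional drop with an applicability check, which the slides say is still at the design stage. A sketch of option 1 (the event representation is ours):

```python
def ostrich_filter(events):
    """Ostrich policy: keep only events that arrive in strictly
    increasing per-object sequence order; silently drop anything with a
    conflicting sequence instead of trying to resolve it."""
    applied = {}
    kept = []
    for obj, seq in events:
        if seq == applied.get(obj, 0) + 1:
            applied[obj] = seq
            kept.append((obj, seq))
        # else: dropped -- the 'ostrich' choice
    return kept
```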