What’s next for Ceph? On the future of scalable storage
Martin Gerhard Loschwitz
© 2014 hastexo Professional Services GmbH. All rights reserved.
Who?
Quick reminder:
Object Storage
[Diagram: users store binary objects onto a set of hard disks (HDD), each carrying a local file system (FS)]
Cephalopod (Wikipedia, user Nhobgood)
RADOS
Redundant Autonomic Distributed Object Store
Two components
OSDs
[Diagram: an Object Storage Daemon (OSD) runs on top of each disk's file system; users store their objects through the OSDs]
Unified Storage
[Diagram: the OSD cluster scales out seamlessly; additional OSDs simply join, and users keep storing objects into one unified store]
MONs
[Diagram: three monitoring servers (MONs) alongside the users and OSDs, maintaining the cluster state]
Data Placement
Parallelization
[Diagram: objects tagged with placement-group IDs (1, 2) are written to several OSDs in parallel]
CRUSH
Controlled Replication Under Scalable Hashing
By configuring CRUSH, you make the cluster
rack-aware.
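The idea behind CRUSH-style placement can be sketched in a few lines: every client hashes an object name to a placement group and maps that group deterministically onto distinct OSDs, so no central lookup server is needed. This is a toy illustration under assumed names (`PG_COUNT`, `OSDS`, round-robin mapping), not the real CRUSH algorithm, which walks a configurable hierarchy of racks and hosts.

```python
import hashlib

# Toy placement sketch (hypothetical; NOT the real CRUSH algorithm).
PG_COUNT = 8
OSDS = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]

def object_to_pg(name: str) -> int:
    """Hash an object name to a placement group ID."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PG_COUNT

def pg_to_osds(pg: int, replicas: int = 2) -> list:
    """Map a placement group onto `replicas` distinct OSDs (round-robin here;
    CRUSH uses the cluster map and placement rules instead)."""
    return [OSDS[(pg + i) % len(OSDS)] for i in range(replicas)]

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))
```

Because the computation is deterministic, every client arrives at the same OSDs for the same object without asking anyone.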
[Diagram: the complete RADOS cluster: users storing objects onto the OSDs, with the MONs maintaining the cluster map]
RADOS Block Device: block-level interface driver for RADOS
RADOS Gateway: RESTful API to access RADOS
CephFS: POSIX file system access to RADOS
“Booooring!”
Cool Stuff ahead:
Erasure Coding
Tiering
Multi-DC Setups
Automation
CephFS
Enterprise Support
Erasure Coding
Until now, Ceph has effectively worked like
a standard RAID 1:
every binary object exists twice.
Works great. But it also reduces the net
capacity by 50%.
At least.
That is where Erasure Coding comes in.
It makes Ceph work
like a RAID 5.
Mostly developed by Loic Dachary
Idea: Split binary objects into even smaller chunks
This reduces the amount of space required for replicas enormously!
Different coding ratios are available.
But: the lower the redundancy level, the longer it takes to recalculate
missing chunks.
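The chunk-splitting idea can be shown with the simplest possible code: two data chunks plus one XOR parity chunk, RAID-5 style. This is a toy sketch with made-up function names, not Ceph's actual erasure-code plugin (which supports configurable k/m parameters), but the recovery math is the same in spirit.

```python
# Toy erasure-coding sketch: k = 2 data chunks + 1 XOR parity chunk
# (RAID-5-like; hypothetical, not Ceph's jerasure plugin).
def encode(obj: bytes, k: int = 2):
    """Split an object into k data chunks plus one XOR parity chunk."""
    if len(obj) % k:
        obj += b"\x00" * (k - len(obj) % k)   # pad to a chunk boundary
    size = len(obj) // k
    chunks = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(a ^ b for a, b in zip(*chunks))  # XOR works for k = 2
    return chunks, parity

def recover(surviving: bytes, parity: bytes) -> bytes:
    """Rebuild the single lost data chunk from the survivor and the parity."""
    return bytes(a ^ b for a, b in zip(surviving, parity))

chunks, parity = encode(b"ceph")
# Pretend chunks[0] is lost: XOR the survivor with the parity to get it back.
assert recover(chunks[1], parity) == chunks[0]
```

With k data chunks and m coding chunks, the net capacity is k/(k+m) of the raw space (2/3 here) instead of the 1/2 that two-way replication yields.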
Available in Ceph 0.80.
Tiering
Not all data stored in Ceph is equal.
Frequently accessed, fresh data is usually expected
to be served quickly.
Also, customers may be willing to accept
slower performance in exchange for lower prices.
Until now, that wasn’t easy to implement in RADOS due to a
number of limitations.
With Ceph 0.80, pools allow data to be
stored on different hardware, based on
its performance.
Wait. Pools?
Pools are a logical unit in RADOS. A pool is a bunch of
Placement Groups.
Using tiering, pools can be tied to specific hardware components.
All replication happens intra-pool
Data may be moved from one pool to
another pool in RADOS
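The fast-pool-in-front-of-slow-pool idea can be sketched as a toy cache tier: a small "fast" pool (think SSDs) fronts a large "slow" pool (spinning disks), objects get promoted on read and evicted when the fast pool fills up. The class and its behavior are hypothetical illustrations, not the RADOS implementation.

```python
# Toy cache-tiering sketch (hypothetical; not the RADOS cache-tier code).
class TieredStore:
    def __init__(self, fast_capacity: int = 2):
        self.fast = {}                 # small pool on fast hardware (SSD)
        self.slow = {}                 # large pool on slow hardware (HDD)
        self.fast_capacity = fast_capacity

    def write(self, name: str, data: bytes) -> None:
        self.slow[name] = data         # cold writes land in the slow pool

    def read(self, name: str) -> bytes:
        if name in self.fast:          # cache hit: serve from fast pool
            return self.fast[name]
        data = self.slow[name]         # cache miss: promote to fast pool
        if len(self.fast) >= self.fast_capacity:
            self.fast.pop(next(iter(self.fast)))  # evict oldest entry
        self.fast[name] = data
        return data
```

The important property is that promotion and eviction move data between pools transparently; clients just read and write objects.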
Available in Ceph 0.80.
Multi-DC Setups
Ceph was designed for high-performance,
synchronous replication
Off-Site replication is typically asynchronous.
Bummer!
But starting with Ceph 0.67, the RADOS Gateway
supports “Federation”
[Diagram: DC 1 and DC 2, each running its own MONs and a RADOS Gateway; sync agents replicate objects between the two gateways]
In fact, the federation feature adds asynchronous
replication on top of the RADOS storage cluster
Still needs better integration with the
other Ceph components
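The sync-agent idea boils down to periodically copying whatever exists in the primary zone but not yet in the secondary one. This is a hypothetical sketch of that loop, not the actual radosgw agent:

```python
# Toy sketch of asynchronous off-site replication (hypothetical; not the
# real RADOS Gateway sync agent). Pools are modeled as plain dicts.
def sync(primary: dict, secondary: dict) -> list:
    """Copy missing or changed objects; return the names that were synced."""
    synced = []
    for name, data in primary.items():
        if secondary.get(name) != data:   # missing or stale in DC 2
            secondary[name] = data
            synced.append(name)
    return synced
```

Run in a loop, this gives eventual consistency between the sites, which is exactly the asynchronous behavior the synchronous RADOS core cannot provide on its own.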
Automation
Ceph clusters will almost always be
deployed using tools for automation
Thus, Ceph needs to play well with Chef, Puppet & Co.
Chef: Yay!
Chef cookbooks are maintained and
provided by Inktank.
Puppet: Ouch
Inktank does not provide Puppet modules
for Ceph deployment
Right now, at least six competing modules exist on GitHub, some
forks of each other.
None of these use ceph-deploy, though.
But there is hope: puppet-cephdeploy does use ceph-deploy
Needs some additional work, but generally,
looks very promising and already works
Plays together nicely even with ENCs such
as the Puppet Dashboard or the Foreman project
CephFS
Considered vaporware by some people already.
But that’s not fair!
CephFS is already available and works.
Well, mostly.
For CephFS, the really critical component is the Metadata Server (MDS)
Running CephFS today with exactly one active
MDS is fine and will most likely not cause trouble.
But Sage wants the MDS to scale-out properly so
that running several active MDSes at a time works
That’s called Subtree Partitioning. Every active MDS will be responsible for the meta-data of a certain subtree of the POSIX-compatible FS
Right now, Subtree partitioning is what’s
causing trouble.
CephFS is not Inktank’s main priority; likely to
be released as “stable” in Q4 2014
Enterprise Support
Major companies willing to run Ceph need some
type of support contract.
Inktank has started to offer that support through a
product called “Inktank Ceph Enterprise” (ICE)
Gives users Long-Term support for certain Ceph releases (such as 0.80)
and hot-fixes for problems
Also brings Calamari, Inktank’s Ceph GUI
Distribution Support
Inktank already does a lot to make installing Ceph
on different distributions as smooth as possible.
Ye olde OSes:
Ubuntu 12.04
Debian Wheezy
RHEL 6
SLES 11
Ubuntu 14.04: May 2014
RHEL 7: December 2014
Release Schedule
Firefly (0.80): May 2014, along
with ICE 1.2
Giant: Summer 2014
(Non-LTS version)
The “H”-release: December 2014,
along with ICE 2.0
Ceph Days
Ceph Days are information events run by Inktank all
over the world.
Two have happened in Europe so far:
London (October 2013)
Frankfurt (February 2014)
Ceph Days let you gather with others interested in
Ceph and exchange experiences.
And you can meet
Sage Weil
No shit. You can meet
Sage Weil!
Special thanks to
Sage Weil (Twitter: @liewegas) & crew for Ceph
Inktank (Twitter: @inktank) for the Ceph logo