Ceph Intro & Architectural Overview
Federico Lucifredi
Product Management Director, Ceph Storage
Vancouver & Guadalajara, May 18th, 2015
2
CLOUD SERVICES
COMPUTE NETWORK STORAGE
the future of storage™
3
HUMAN COMPUTER TAPE
HUMAN ROCK
HUMAN INK PAPER
4
HUMAN COMPUTER TAPE
5
YOU TECHNOLOGY YOUR DATA
6
How Much Store Things All Human History?!
carving
writing
paper
computers
distributed storage
cloud computing
gaaaaaaaaahhhh!!!!!!
7
[Diagram: three HUMANs, one COMPUTER, seven DISKs]
8
[Diagram: a crowd of HUMANs, one COMPUTER, twelve DISKs]
9
[Diagram: a crowd of HUMANs, twelve DISKs, one GIANT SPENDY COMPUTER]
10
[Diagram: three HUMANs and twelve COMPUTER+DISK nodes]
11
[Diagram: three HUMANs and twelve COMPUTER+DISK nodes]
12
[Diagram: eight COMPUTER+DISK nodes labeled “STORAGE APPLIANCE”]
13
Storage Appliance (photo: Michael Moll, Wikipedia / CC BY-SA 2.0)
14
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: four COMPUTER+DISK nodes]
34% of revenue (5.7 billion dollars)
1.3 billion in R&D spent in a year
1.6+ million square feet of manufacturing space
$NYSE:EMC, FY2014 10K
15
1010100110
1010110011
1001100101
1001101011
1001100111
1001010011
THE CLOUD
16
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: four COMPUTER+DISK nodes]
STANDARD HARDWARE
[Diagram: four COMPUTER+DISK nodes]
OPEN SOURCE SOFTWARE
ENTERPRISE SUBSCRIPTION (optional)
17
18
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
SELF-MANAGING
philosophy design
19
8 years & 20,000 commits later…
20
21
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
22
23
[Diagram: five DISKs, each with an FS (btrfs, xfs, or ext4) and an OSD on top; three monitors (M)]
24
[Diagram labels: M, M, M; HUMAN]
25
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

OSDs:
• 10s to 10000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
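
The "small, odd number" guidance for monitors follows from majority voting. A minimal sketch, assuming the monitors agree by strict majority; the function names are illustrative and not part of Ceph:

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority still remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} monitors: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Going from three monitors to four raises the quorum size without tolerating any additional failure, which is one way to read the preference for odd counts.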
26
27
[Diagram labels: APP; LIBRADOS; socket; M, M, M]
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
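
As a rough illustration of what direct access looks like from an application, the sketch below stores one object and reads it back through an I/O context. The `write_full` and `read` method names follow the `Ioctx` object of the Python `rados` binding; `FakeIoctx` is a hypothetical in-memory stand-in so the example runs without a cluster, and the object name is made up.

```python
class FakeIoctx:
    """Hypothetical in-memory stand-in for a pool I/O context."""

    def __init__(self):
        self._objects = {}

    def write_full(self, name, data):
        # Replace the whole object with the given bytes.
        self._objects[name] = bytes(data)

    def read(self, name, length=8192, offset=0):
        # Return up to `length` bytes starting at `offset`.
        return self._objects[name][offset:offset + length]


def store_and_fetch(ioctx, name, payload):
    """Write an object, then read it back through the same I/O context."""
    ioctx.write_full(name, payload)
    return ioctx.read(name, length=len(payload))


if __name__ == "__main__":
    print(store_and_fetch(FakeIoctx(), "greeting", b"hello rados"))
```

Against a real cluster, the I/O context would instead come from the `rados` module (`rados.Rados(conffile=...)`, then `connect()` and `open_ioctx(pool_name)`), with the configuration path and pool name depending on the deployment.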
29
30
[Diagram labels: APP; REST; RADOSGW; LIBRADOS; socket; M, M, M]
31
RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
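
To make the bucket-based REST model a little more concrete, here is a small sketch that builds a path-style object URL of the kind S3 clients use. The endpoint, bucket, and key are hypothetical, and a real request would also carry authentication headers, which are omitted here.

```python
from urllib.parse import quote


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style URL: <endpoint>/<bucket>/<key>."""
    # Keep "/" in keys so that pseudo-directories stay readable.
    return f"{endpoint.rstrip('/')}/{quote(bucket)}/{quote(key, safe='/')}"


if __name__ == "__main__":
    print(object_url("http://gateway.example.com", "photos", "2015/may 18.jpg"))
```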
32
33
[Diagram labels: M, M, M; VM; LIBRBD; LIBRADOS; VIRTUALIZATION CONTAINER]
34
[Diagram labels: M, M, M; VM; two CONTAINERs, each with LIBRBD and LIBRADOS]
35
[Diagram labels: M, M, M; HOST; KRBD (KERNEL MODULE)]
36
RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux Kernel (2.6.39+)
  • QEMU/KVM
  • OpenStack, CloudStack
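
The striping bullet can be pictured as simple arithmetic: an image is cut into fixed-size objects, and a byte offset maps to one object plus an offset inside it. This sketch assumes a 4 MiB object size and a made-up object naming scheme; both are illustrative assumptions, not taken from the slides.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes (4 MiB)


def locate(offset: int, object_size: int = OBJECT_SIZE):
    """Map an image byte offset to (object index, offset within object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def object_name(image: str, index: int) -> str:
    """Hypothetical naming scheme: image name plus a hex object index."""
    return f"{image}.{index:016x}"


def objects_for_range(image: str, offset: int, length: int,
                      object_size: int = OBJECT_SIZE):
    """List the objects touched by a read or write of `length` bytes."""
    if length <= 0:
        return []
    first, _ = locate(offset, object_size)
    last, _ = locate(offset + length - 1, object_size)
    return [object_name(image, i) for i in range(first, last + 1)]


if __name__ == "__main__":
    print(locate(10 * 1024 * 1024))
    print(objects_for_range("vm-disk", 3 * 1024 * 1024, 2 * 1024 * 1024))
```

Because consecutive objects can land on different storage nodes, a large sequential read or write ends up spread across the cluster rather than hitting a single disk.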
37
38
[Diagram labels: M, M, M; CLIENT; 01100110; data; metadata]
39
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
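
The split between the metadata path and the data path can be sketched as two separate lookups. Everything below is a toy model with hypothetical class and field names: one dictionary stands in for the metadata server, another for the object store, and the client asks the first for file attributes but reads bytes from the second.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """File metadata of the kind the slide lists."""
    number: int
    owner: str
    mode: int
    mtime: float


class ToyMetadataServer:
    """Answers path lookups; never returns file contents."""

    def __init__(self):
        self._tree = {}

    def create(self, path: str, inode: Inode) -> None:
        self._tree[path] = inode

    def lookup(self, path: str) -> Inode:
        return self._tree[path]


class ToyObjectStore:
    """Holds file data keyed by inode number."""

    def __init__(self):
        self._data = {}

    def write(self, inode_number: int, data: bytes) -> None:
        self._data[inode_number] = data

    def read(self, inode_number: int) -> bytes:
        return self._data[inode_number]


def read_file(mds: ToyMetadataServer, store: ToyObjectStore, path: str) -> bytes:
    """Metadata comes from the MDS; data comes straight from the store."""
    inode = mds.lookup(path)
    return store.read(inode.number)


if __name__ == "__main__":
    mds, store = ToyMetadataServer(), ToyObjectStore()
    mds.create("/home/a.txt", Inode(1, "alice", 0o644, 0.0))
    store.write(1, b"file contents")
    print(read_file(mds, store, "/home/a.txt"))
```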