Software Defined Storage: What Makes Ceph Unique - Red...

Post on 22-May-2020


Software Defined Storage: What Makes Ceph Unique

Federico Lucifredi
Product Management Director, Ceph Storage

Boston/Guadalajara, December 14th, 2015

CLOUD SERVICES

COMPUTE NETWORK STORAGE

the future of storage™

HUMAN ROCK

HUMAN INK PAPER

HUMAN COMPUTER TAPE

YOU TECHNOLOGY YOUR DATA

How Much Store Things All Human History?!

carving

writing

paper

computers

distributed storage

cloud computing

gaaaaaaaaahhhh!!!!!!

Diagram: three HUMANs, one COMPUTER, seven DISKs

Diagram: many HUMANs, one COMPUTER, twelve DISKs

Diagram: many HUMANs, one GIANT SPENDY COMPUTER, twelve DISKs

Diagram: three HUMANs, twelve COMPUTER + DISK nodes

Diagram: eight COMPUTER + DISK nodes

“STORAGE APPLIANCE”

Storage Appliance (Michael Moll, Wikipedia / CC BY-SA 2.0)

SUPPORT AND MAINTENANCE

PROPRIETARY SOFTWARE

PROPRIETARY HARDWARE

DISK COMPUTER (four nodes)

34% of revenue (5.2 billion dollars)

1.1 billion in R&D spent in a year

1.6 million square feet of manufacturing space

THE CLOUD

SUPPORT AND MAINTENANCE

PROPRIETARY SOFTWARE

PROPRIETARY HARDWARE

DISK COMPUTER (four nodes)

ENTERPRISE SUBSCRIPTION (optional)

OPEN SOURCE SOFTWARE

STANDARD HARDWARE

DISK COMPUTER (four nodes)

philosophy

OPEN SOURCE

COMMUNITY-FOCUSED

design

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

SELF-MANAGING

8 years & 20,000 commits later…

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Diagram: five OSDs, each on its own FS (btrfs, xfs, or ext4) and DISK, plus three nodes labeled M

Diagram: three nodes labeled M and a HUMAN


Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

OSDs:
• 10s to 10000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
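
Why a small, odd number of monitors? A minimal sketch, assuming plain majority voting (a strict majority of monitors must agree). The function names are illustrative and not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of a monitor set."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority still exists."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print: monitor count, quorum size, failures tolerated.
    for n in range(1, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, going from three monitors to four raises the quorum size without raising the number of failures the cluster survives, which is one way to read the "odd number" advice.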

Diagram: APP linked with LIBRADOS, connected by socket to the cluster (three nodes labeled M)

LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
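
What direct access looks like from an application: a minimal sketch using the Python librados binding (`rados` module). The helper name `store_and_fetch`, the pool and object names, and the configuration path are assumptions for illustration; running it needs a reachable cluster and an existing pool.

```python
def store_and_fetch(pool, name, payload, conffile="/etc/ceph/ceph.conf"):
    """Write one object into a pool and read it back over librados."""
    import rados  # Python librados binding, imported lazily (assumed installed)

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, payload)  # replace the whole object
            return ioctx.read(name)          # read it back (small objects only)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()


if __name__ == "__main__":
    # Hypothetical pool "bar" and object "foo".
    print(store_and_fetch("bar", "foo", b"hello"))
```

No gateway or HTTP layer is involved here: the library talks to the cluster directly.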

Diagram: APP speaks REST to RADOSGW, which uses LIBRADOS to reach the cluster (three nodes labeled M) over a socket

RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications

Diagram: VM inside a VIRTUALIZATION CONTAINER, attached through LIBRBD and LIBRADOS to the cluster (three nodes labeled M)

Diagram: one VM and two CONTAINERs, each with LIBRBD and LIBRADOS, attached to the cluster

Diagram: HOST using KRBD (KERNEL MODULE) to reach the cluster

RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux Kernel (2.6.39+)
  • QEMU/KVM
  • OpenStack, CloudStack
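
How an image ends up striped across the cluster: a minimal sketch, assuming the simplest layout in which the image is cut into fixed-size objects. The 4 MiB object size and the function names are assumptions for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def locate(offset, object_size=OBJECT_SIZE):
    """Map a byte offset in the image to (object index, offset inside it)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_for_range(offset, length, object_size=OBJECT_SIZE):
    """List the object indexes touched by an I/O of `length` bytes."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


if __name__ == "__main__":
    # A 2 MiB write starting at 3 MiB crosses an object boundary.
    print(objects_for_range(3 * 1024 * 1024, 2 * 1024 * 1024))
```

Each object is then placed independently, so one disk image is served by many storage nodes rather than by a single host.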

Diagram: CLIENT exchanging data and metadata with the cluster (three nodes labeled M)

Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

What Makes Ceph Unique?
Part one: CRUSH


Diagram: APP, a question mark, and twelve storage nodes (each a disk D and a computer C)

How Long Did It Take You To Find Your Keys This Morning? (azmeen, Flickr / CC BY 2.0)

Diagram: APP and twelve storage nodes

Dear Diary: Today I Put My Keys on the Kitchen Counter (Barnaby, Flickr / CC BY 2.0)

Diagram: APP and twelve storage nodes split into ranges A-G, H-N, O-T, U-Z, with an object F*

I Always Put My Keys on the Hook By the Door (vitamindave, Flickr / CC BY 2.0)

HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?

The Answer: CRUSH!!!!! (pasukaru76, Flickr / CC SA 2.0)

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)


CRUSH
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
  • Limited data migration on change
• Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighting
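
The "no lookup", "deterministic" and "limited data migration" properties can be illustrated in a few lines. This sketch is not CRUSH: it swaps in rendezvous (highest-random-weight) hashing as a stand-in, and it ignores topology, rules and weights. All names are illustrative.

```python
import hashlib


def _score(pg, osd):
    """Pseudo-random but repeatable score for a (placement group, OSD) pair."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg, osds, replicas=3):
    """Pick `replicas` OSDs for a placement group by calculation alone."""
    ranked = sorted(osds, key=lambda osd: _score(pg, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = list(range(12))
    print(place(0x23, osds))
    # Adding an OSD only changes placement groups that now include it.
    print(place(0x23, osds + [12]))
```

Any client that knows the list of OSDs computes the same answer without asking a central table, and changing the list only moves the placement groups that involve the OSD that was added or removed.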

Diagram: CLIENT with a question mark

NAME: "foo"
POOL: "bar"

hash("foo") % 256 = 0x23

"bar" = 3

OBJECT to PLACEMENT GROUP: 3.23

PLACEMENT GROUP 3.23, through CRUSH, to TARGET OSDs: 24, 3, 12
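
The first step of that mapping, object name to placement group, can be sketched as follows. The CRC32 hash is a stand-in chosen for illustration (it will not reproduce the 0x23 value shown on the slide), and the function names are assumptions; only the "pool id, dot, hexadecimal group number" notation is taken from the slide.

```python
import zlib


def format_pgid(pool_id, pg):
    """Render a placement group id as <pool id>.<pg in hex>, e.g. 3.23."""
    return f"{pool_id}.{pg:x}"


def placement_group(object_name, pool_id, pg_num):
    """hash(object name) % num pg, using CRC32 as a stand-in hash."""
    pg = zlib.crc32(object_name.encode()) % pg_num
    return format_pgid(pool_id, pg)


if __name__ == "__main__":
    # Pool "bar" has id 3 and 256 placement groups on the slide.
    print(format_pgid(3, 0x23))
    print(placement_group("foo", 3, 256))
```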

Diagram: CLIENT with a question mark

What Makes Ceph Unique
Part two: thin provisioning

Diagram: VM inside a VIRTUALIZATION CONTAINER, attached through LIBRBD and LIBRADOS to the cluster (three nodes labeled M)

HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

Diagram: an image of 144 objects

Diagram: instant copy; the copy holds 0 objects, total = 144

Diagram: CLIENT issues four writes to the copy, which now holds 4 objects; total = 148

Diagram: CLIENT reads through the copy; total = 148
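
The 144 and 148 object counts follow from copy-on-write. A minimal sketch, assuming whole-object granularity; the `Image` class is hypothetical and not the librbd API.

```python
class Image:
    """Toy image: a sparse map of object index to data, with an optional parent."""

    def __init__(self, objects=None, parent=None):
        self.objects = dict(objects or {})
        self.parent = parent

    def clone(self):
        """Instant copy: the child starts empty and points at its parent."""
        return Image(parent=self)

    def write(self, index, data):
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index):
        """Read locally if present, otherwise fall through to the parent."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None  # never written anywhere

    def stored(self):
        return len(self.objects)


if __name__ == "__main__":
    golden = Image({i: b"golden" for i in range(144)})
    vm_disk = golden.clone()
    for i in range(4):
        vm_disk.write(i, b"changed")
    print(golden.stored() + vm_disk.stored())
```

The clone costs nothing until it diverges, which is why many VMs can be started from one golden image at once.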

What Makes Ceph Unique?
Part three: clustered metadata

POSIX Filesystem Metadata (Barnaby, Flickr / CC BY 2.0)

Diagram: CLIENT and three nodes labeled M

one tree

three metadata servers

?

DYNAMIC SUBTREE PARTITIONING
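
One way to picture one tree shared by three metadata servers: each server is the authority for some subtrees, and a subtree can be handed to another server when load shifts. This is an illustrative sketch only, not the metadata server's actual algorithm; the `SubtreeMap` class and the server numbering are assumptions.

```python
import posixpath


class SubtreeMap:
    """Toy map from directory subtrees to the metadata server in charge."""

    def __init__(self, root_rank=0):
        self.authority = {"/": root_rank}

    def delegate(self, subtree, rank):
        """Hand a subtree (and everything below it) to another server."""
        self.authority[posixpath.normpath(subtree)] = rank

    def rank_for(self, path):
        """Walk up from the path to the nearest delegated ancestor."""
        if not path.startswith("/"):
            raise ValueError("path must be absolute")
        path = posixpath.normpath(path)
        while path not in self.authority:
            parent = posixpath.dirname(path)
            if parent == path:  # reached the top without a match
                return self.authority["/"]
            path = parent
        return self.authority[path]


if __name__ == "__main__":
    tree = SubtreeMap()
    tree.delegate("/home", 1)
    tree.delegate("/home/alice", 2)  # a busy directory gets its own server
    print(tree.rank_for("/home/alice/src/main.c"))
```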

Getting Started With Ceph

Have a working cluster up quickly.

Read about the latest version of Ceph.
• The latest stuff is always at http://ceph.com/get

Deploy a test cluster using ceph-deploy.
• Read the quick-start guide at http://ceph.com/qsg

Deploy a test cluster on the AWS free-tier using Juju.
• Read the guide at http://ceph.com/juju

Read the rest of the docs!
• Find docs for the latest release at http://ceph.com/docs

Getting Involved With Ceph

Help build the best storage system around!

Most project discussion happens on the mailing list.
• Join or view archives at http://ceph.com/list

IRC is a great place to get help (or help others!)
• Find details and historical logs at http://ceph.com/irc

The tracker manages our bugs and feature requests.
• Register and start looking around at http://ceph.com/tracker

Doc updates and suggestions are always welcome.
• Learn how to contribute docs at http://ceph.com/docwriting

Ceph Hammer (v0.94.x)

Best Ceph ever.

1. RADOS performance enhancements: all-flash environments
2. Simplified RGW deployment
3. RGW Object Versioning and Bucket Sharding
4. RBD Mandatory Locking, Object Maps, Copy on Read
5. CephFS Snapshot improvements

and many more. See https://ceph.com/releases/v0-94-hammer-released/

Questions?

Federico Lucifredi
PM Director, Ceph

federico@redhat.com
@0xF2

redhat.com | ceph.com