
Storage Developer Conference - 09/19/2012

Description:
Sage Weil's slides from SDC in Sep 2012.
Transcript
Page 1: Storage Developer Conference - 09/19/2012

2012 Storage Developer Conference. © Inktank. All Rights Reserved.

Ceph: scaling storage for the cloud and beyond

Sage Weil

Inktank

Page 2: Storage Developer Conference - 09/19/2012


outline
● why you should care
● what is it, what it does
● distributed object storage
● ceph fs
● who we are, why we do this

Page 3: Storage Developer Conference - 09/19/2012


why should you care about another storage system?

Page 4: Storage Developer Conference - 09/19/2012


requirements
● diverse storage needs

– object storage

– block devices (for VMs) with snapshots, cloning

– shared file system with POSIX, coherent caches

– structured data... files, block devices, or objects?

● scale
– terabytes, petabytes, exabytes

– heterogeneous hardware

– reliability and fault tolerance

Page 5: Storage Developer Conference - 09/19/2012


time
● ease of administration
● no manual data migration, load balancing
● painless scaling

– expansion and contraction

– seamless migration

Page 6: Storage Developer Conference - 09/19/2012


cost
● linear function of size or performance
● incremental expansion

– no fork-lift upgrades

● no vendor lock-in
– choice of hardware

– choice of software

● open

Page 7: Storage Developer Conference - 09/19/2012


what is ceph?

Page 8: Storage Developer Conference - 09/19/2012


unified storage system
● objects

– native

– RESTful

● block
– thin provisioning, snapshots, cloning

● file
– strong consistency, snapshots

Page 9: Storage Developer Conference - 09/19/2012


[architecture diagram — APP, APP, HOST/VM, and CLIENT sit on top of the stack below]

RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift

Page 10: Storage Developer Conference - 09/19/2012


open source
● LGPLv2

– copyleft

– ok to link to proprietary code

● no copyright assignment
– no dual licensing

– no “enterprise-only” feature set

● active community
● commercial support

Page 11: Storage Developer Conference - 09/19/2012


distributed storage system
● data center scale

– 10s to 10,000s of machines

– terabytes to exabytes

● fault tolerant
– no single point of failure

– commodity hardware

● self-managing, self-healing

Page 12: Storage Developer Conference - 09/19/2012


ceph object model
● pools

– 1s to 100s

– independent namespaces or object collections

– replication level, placement policy

● objects
– bazillions

– blob of data (bytes to gigabytes)

– attributes (e.g., “version=12”; bytes to kilobytes)

– key/value bundle (bytes to gigabytes)

Page 13: Storage Developer Conference - 09/19/2012


why start with objects?
● more useful than (disk) blocks

– names in a single flat namespace

– variable size

– simple API with rich semantics

● more scalable than files
– no hard-to-distribute hierarchy

– update semantics do not span objects

– workload is trivially parallel

Page 14: Storage Developer Conference - 09/19/2012


[diagram: one HUMAN, one COMPUTER, many DISKs]

Page 15: Storage Developer Conference - 09/19/2012


[diagram: several HUMANs sharing one COMPUTER and many DISKs]

Page 16: Storage Developer Conference - 09/19/2012


[diagram: many HUMANs and many DISKs — "(actually more like this…)" — with the COMPUTER shown only in parentheses]

Page 17: Storage Developer Conference - 09/19/2012


[diagram: many HUMANs, many COMPUTERs, many DISKs]

Page 18: Storage Developer Conference - 09/19/2012


[diagram: each OSD runs on a local file system (btrfs, xfs, or ext4) on a DISK; monitors (M) sit alongside]

Page 19: Storage Developer Conference - 09/19/2012


Monitors:

• Maintain cluster membership and state

• Provide consensus for distributed decision-making via Paxos

• Small, odd number

• These do not serve stored objects to clients

Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks

Page 20: Storage Developer Conference - 09/19/2012


[diagram: a HUMAN and three monitors (M M M)]

Page 21: Storage Developer Conference - 09/19/2012


data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
– ceph-osds on hosts in racks in rows in data centers
● three approaches
– pick a spot; remember where you put it
– pick a spot; write down where you put it

– calculate where to put it, where to find it

Page 22: Storage Developer Conference - 09/19/2012


CRUSH
• Pseudo-random placement algorithm

• Fast calculation, no lookup

• Repeatable, deterministic

• Ensures even distribution

• Stable mapping

• Limited data migration

• Rule-based configuration

• specifiable replication

• infrastructure topology aware

• allows weighting

Page 23: Storage Developer Conference - 09/19/2012


[diagram: objects (bit strings) hashed into placement groups, which CRUSH maps onto OSDs]

hash(object name) % num pg

CRUSH(pg, cluster state, policy)
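To make the two-step placement concrete, here is a minimal Python sketch of the idea. It is not the real CRUSH algorithm — just a deterministic, stable stand-in — and the PG count, replica count, and OSD names are made up for illustration.

import hashlib

NUM_PGS = 64                                   # pg_num for the pool (assumed)
OSDS = ['osd.%d' % i for i in range(12)]       # cluster state: 12 OSDs up/in (assumed)
REPLICAS = 3

def object_to_pg(name):
    # step 1: hash(object name) % num pg
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PGS

def pg_to_osds(pg, osds, replicas):
    # step 2: CRUSH(pg, cluster state, policy) -> ordered list of OSDs.
    # Real CRUSH walks a weighted infrastructure hierarchy with no lookup table;
    # this toy version just derives a repeatable ranking from the PG id.
    ranked = sorted(osds, key=lambda o: hashlib.md5(('%s/%d' % (o, pg)).encode()).hexdigest())
    return ranked[:replicas]

pg = object_to_pg('my-object')
print(pg, pg_to_osds(pg, OSDS, REPLICAS))

Because the mapping is pure calculation, any client holding the current cluster map can locate any object without consulting a lookup service.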

Page 24: Storage Developer Conference - 09/19/2012


[diagram: objects grouped into placement groups (continued from the previous slide)]

Page 25: Storage Developer Conference - 09/19/2012


RADOS
● monitors publish osd map that describes cluster state

– ceph-osd node status (up/down, weight, IP)

– CRUSH function specifying desired data distribution

● object storage daemons (OSDs)– safely replicate and store object

– migrate data as the cluster changes over time

– coordinate based on shared view of reality – gossip!

● decentralized, distributed approach allows– massive scales (10,000s of servers or more)

– the illusion of a single copy with consistent behavior


Page 26: Storage Developer Conference - 09/19/2012


[diagram: CLIENT with a question mark]

Page 27: Storage Developer Conference - 09/19/2012


Page 28: Storage Developer Conference - 09/19/2012


Page 29: Storage Developer Conference - 09/19/2012


[diagram: CLIENT with a question mark]

Page 30: Storage Developer Conference - 09/19/2012


[architecture diagram repeated from page 9; this slide highlights LIBRADOS]

Page 31: Storage Developer Conference - 09/19/2012


[diagram: an APP linking LIBRADOS, speaking the native protocol to the cluster (M M M)]

Page 32: Storage Developer Conference - 09/19/2012


LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
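As a rough illustration of what that direct access looks like, a minimal python-rados session might resemble the sketch below; the ceph.conf path and the pool name 'data' are assumptions, not anything specific to these slides.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # assumed config path
cluster.connect()

ioctx = cluster.open_ioctx('data')                      # hypothetical pool
ioctx.write_full('hello-object', b'hello RADOS')        # store an object (blob of bytes)
ioctx.set_xattr('hello-object', 'version', b'12')       # attach an attribute
print(ioctx.read('hello-object'))                       # read it back

ioctx.close()
cluster.shutdown()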

Page 33: Storage Developer Conference - 09/19/2012


atomic transactions
● client operations sent to the OSD cluster
– operate on a single object
– can contain a sequence of operations, e.g.
● truncate object
● write new object data
● set attribute
● atomicity
– all operations commit or do not commit atomically
● conditional
– 'guard' operations can control whether operation is performed
● verify xattr has specific value
● assert object is a specific version

– allows atomic compare-and-swap etc.
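The guarded compare-and-swap idea can be sketched as follows. This is only an illustration of the semantics using basic python-rados calls; in librados proper the guard, the write, and the attribute update are packed into one compound object operation that the OSD commits atomically, whereas the helper below (a hypothetical name) performs the steps separately and is not itself atomic.

def guarded_update(ioctx, oid, expected_version, new_data):
    # guard: proceed only if the object's 'version' xattr matches what the caller read earlier
    if ioctx.get_xattr(oid, 'version') != expected_version:
        return False                                  # guard failed; caller re-reads and retries
    ioctx.write_full(oid, new_data)                   # write new object data
    ioctx.set_xattr(oid, 'version',
                    str(int(expected_version) + 1).encode())
    return True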

Page 34: Storage Developer Conference - 09/19/2012


key/value storage
● store key/value pairs in an object
– independent from object attrs or byte data payload
● based on google's leveldb
– efficient random and range insert/query/removal
– based on BigTable SSTable design
● exposed via key/value API
– insert, update, remove
– individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
– e.g., file system directory objects
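A sketch of using the key/value (omap) interface from Python is below. It assumes the WriteOpCtx/ReadOpCtx wrappers found in current python-rados bindings (the bindings available at the time of this talk may differ), and the pool and object names are hypothetical.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metadata')                       # hypothetical pool

# insert two directory entries as omap keys on a single object
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('passwd', 'hosts'), (b'inode 100', b'inode 102'))
    ioctx.operate_write_op(op, 'dir.1000')

# read a range of keys back without fetching or rewriting the whole object
with rados.ReadOpCtx() as op:
    entries, ret = ioctx.get_omap_vals(op, '', '', 10)
    ioctx.operate_read_op(op, 'dir.1000')
    for key, value in entries:
        print(key, value)

cluster.shutdown()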

Page 35: Storage Developer Conference - 09/19/2012


watch/notify
● establish stateful 'watch' on an object
– client interest persistently registered with object
– client keeps session to OSD open
● send 'notify' messages to all watchers
– notify message (and payload) is distributed to all watchers
– variable timeout
– notification on completion
● all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
– locking, distributed coordination (ala ZooKeeper), etc.
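As a sketch of how a client might use this from Python: the watch()/notify() wrappers assumed here exist in recent python-rados bindings (the interface of the era was the C calls rados_watch/rados_notify), so treat the exact method names and callback signature as assumptions.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')                      # hypothetical pool

def on_notify(notify_id, notifier_id, watch_id, data):
    # e.g. invalidate a local cache entry or wake a waiter
    print('got notify:', data)

watch = ioctx.watch('sync-object', on_notify)           # register interest with the OSD
ioctx.notify('sync-object', 'something changed')        # any client can poke all watchers
watch.close()
cluster.shutdown()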

Page 36: Storage Developer Conference - 09/19/2012


[sequence diagram: CLIENT #1, #2, and #3 each watch an object on an OSD (watch → ack/commit); a notify is fanned out to all watchers (notify → ack), and the notifier receives complete once every watcher has acknowledged]

Page 37: Storage Developer Conference - 09/19/2012


watch/notify example
● radosgw cache consistency
– radosgw instances watch a single object (.rgw/notify)
– locally cache bucket metadata
– on bucket metadata changes (removal, ACL changes)
● write change to relevant bucket object
● send notify with bucket name to other radosgw instances
– on receipt of notify
● invalidate relevant portion of cache

Page 38: Storage Developer Conference - 09/19/2012


rados classes
● dynamically loaded .so

– /var/lib/rados-classes/*

– implement new object “methods” using existing methods

– part of I/O pipeline

– simple internal API

● reads
– can call existing native or class methods

– do whatever processing is appropriate

– return data

● writes
– can call existing native or class methods

– do whatever processing is appropriate

– generates a resulting transaction to be applied atomically

Page 39: Storage Developer Conference - 09/19/2012


class examples
● grep

– read an object, filter out individual records, and return those

● sha1
– read object, generate fingerprint, return that

● images
– rotate, resize, crop image stored in object

– remove red-eye

● crypto
– encrypt/decrypt object data with provided key

Page 40: Storage Developer Conference - 09/19/2012


[architecture diagram repeated from page 9; this slide calls out RADOSGW]

Page 41: Storage Developer Conference - 09/19/2012


[architecture diagram repeated from page 9; this slide highlights RBD]

Page 42: Storage Developer Conference - 09/19/2012


[diagram: many COMPUTERs, each with a DISK]

Page 43: Storage Developer Conference - 09/19/2012


[diagram: many COMPUTERs with DISKs, hosting VMs]

Page 44: Storage Developer Conference - 09/19/2012


RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
• Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in

• Qemu/KVM

• OpenStack, CloudStack

• Mainline Linux kernel

• Image cloning

• Copy-on-write “snapshot” of existing image
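To make the image / snapshot / clone workflow concrete, here is a minimal sketch with the Python rbd binding; the pool and image names are made up, and the parent image is assumed to be a format-2 image with layering enabled (required for cloning).

import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                        # hypothetical pool

rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, 'vm-disk', 10 * 1024**3)          # 10 GiB image, striped over RADOS objects

with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'bootloader bytes', 0)                  # block I/O at a byte offset
    image.create_snap('gold')                            # point-in-time snapshot
    image.protect_snap('gold')                           # must be protected before cloning

# copy-on-write clone of the snapshot into a new image
rbd_inst.clone(ioctx, 'vm-disk', 'gold', ioctx, 'vm-disk-clone')

cluster.shutdown()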

Page 45: Storage Developer Conference - 09/19/2012


[diagram: a VM in a VIRTUALIZATION CONTAINER; the container links LIBRBD and LIBRADOS and talks to the cluster (M M M)]

Page 46: Storage Developer Conference - 09/19/2012


[diagram: a VM and two CONTAINERs, each linking LIBRBD and LIBRADOS, against the cluster (M M M)]

Page 47: Storage Developer Conference - 09/19/2012


[diagram: a HOST using KRBD (kernel module) to talk to the cluster (M M M)]

Page 48: Storage Developer Conference - 09/19/2012


[architecture diagram repeated from page 9; this slide highlights CEPH FS]

Page 49: Storage Developer Conference - 09/19/2012


[diagram: CLIENT and the cluster (M M M), with separate data (0110) and metadata paths]

Page 50: Storage Developer Conference - 09/19/2012



Page 51: Storage Developer Conference - 09/19/2012


Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
• Directory hierarchy
• File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

Page 52: Storage Developer Conference - 09/19/2012


legacy metadata storage
● a scaling disaster

– name → inode → block list → data

– no inode table locality

– fragmentation
● inode table
● directory

● many seeks
● difficult to partition

[diagram: a conventional directory tree — /, etc (passwd, mtab, hosts), usr (lib, include, bin, …), var, home, vmlinuz — backed by an inode table]

Page 53: Storage Developer Conference - 09/19/2012


ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless

– APIs are path-based, not inode-based

– no random table access, sloppy caching

● embed inodes inside directories
– good locality, prefetching

– leverage key/value object

[diagram: the same directory tree with inode numbers (1, 100, 102) embedded alongside the directory entries]

Page 54: Storage Developer Conference - 09/19/2012


one tree

three metadata servers

?

Page 55: Storage Developer Conference - 09/19/2012


Page 56: Storage Developer Conference - 09/19/2012


Page 57: Storage Developer Conference - 09/19/2012


Page 58: Storage Developer Conference - 09/19/2012


Page 59: Storage Developer Conference - 09/19/2012


DYNAMIC SUBTREE PARTITIONING

Page 60: Storage Developer Conference - 09/19/2012


● scalable
– arbitrarily partition metadata
● adaptive
– move work from busy to idle servers
– replicate hot metadata
● efficient
– hierarchical partition preserves locality
● dynamic
– daemons can join/leave
– take over for failed nodes

dynamic subtree partitioning

Page 61: Storage Developer Conference - 09/19/2012


controlling metadata io

[diagram: journal and directories]

● view ceph-mds as cache
– reduce reads
● dir+inode prefetching
– reduce writes
● consolidate multiple writes
● large journal or log
– stripe over objects
– two tiers
● journal for short term
● per-directory for long term

– fast failure recovery

Page 62: Storage Developer Conference - 09/19/2012


what is journaled
● lots of state
– journaling is expensive up-front, cheap to recover
– non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
– client sessions
– actual fs metadata modifications
● no
– cache provenance
– open files
● lazy flush
– client modifications may not be durable until fsync() or visible by another client

Page 63: Storage Developer Conference - 09/19/2012


client protocol
● highly stateful
– consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
– when client traverses hierarchy

– when metadata is migrated between servers

● direct access to OSDs for file I/O

Page 64: Storage Developer Conference - 09/19/2012


an example
● mount -t ceph 1.2.3.4:/ /mnt
– 3 ceph-mon RT (round trips)
– 2 ceph-mds RT (1 ceph-mds to -osd RT)

● cd /mnt/foo/bar

– 2 ceph-mds RT (2 ceph-mds to -osd RT)

● ls -al

– open

– readdir

● 1 ceph-mds RT (1 ceph-mds to -osd RT)

– stat each file

– close

● cp * /tmp

– N ceph-osd RT

[diagram legend: ceph-mon, ceph-mds, ceph-osd]

Page 65: Storage Developer Conference - 09/19/2012


recursive accounting
● ceph-mds tracks recursive directory stats

– file sizes

– file and directory counts

– modification time

● virtual xattrs present full stats
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
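Those recursive statistics are exposed to clients as virtual extended attributes on each directory. A small sketch of reading them from Python (the mount point and directory are hypothetical; ceph.dir.rbytes and friends are the CephFS virtual xattr names):

import os

d = '/mnt/ceph/pomceph'                                  # any directory in the mounted ceph fs
for attr in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
    print(attr, os.getxattr(d, attr))                    # recursive bytes / files / subdirs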

Page 66: Storage Developer Conference - 09/19/2012


snapshots
● volume or subvolume snapshots unusable at petabyte scale

– snapshot arbitrary subdirectories

● simple interface– hidden '.snap' directory

– no special tools

$ mkdir foo/.snap/one         # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776            # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one         # remove snapshot

Page 67: Storage Developer Conference - 09/19/2012


multiple client implementations
● Linux kernel client

– mount -t ceph 1.2.3.4:/ /mnt

– export (NFS), Samba (CIFS)

● ceph-fuse
● libcephfs.so

– your app

– Samba (CIFS)

– Ganesha (NFS)

– Hadoop (map/reduce)

[diagram: the kernel client; ceph-fuse on libcephfs; and libcephfs linked into your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop]

Page 68: Storage Developer Conference - 09/19/2012


[architecture diagram repeated from page 9, annotated with maturity: RADOS, LIBRADOS, RBD, and RADOSGW are AWESOME; CEPH FS is NEARLY AWESOME]

Page 69: Storage Developer Conference - 09/19/2012


why we do this
● limited options for scalable open source storage
● proprietary solutions

– expensive

– don't scale (well or out)

– marry hardware and software

● industry ready for change

Page 70: Storage Developer Conference - 09/19/2012


who we are
● Ceph created at UC Santa Cruz (2004-2007)
● developed by DreamHost (2008-2011)
● supported by Inktank (2012)

– Los Angeles, Sunnyvale, San Francisco, remote

● growing user and developer community
– Linux distros, users, cloud stacks, SIs, OEMs

Page 71: Storage Developer Conference - 09/19/2012


thanks

sage weil

[email protected]

@liewegas http://github.com/ceph

http://ceph.com/

