Ceph Day SF 2015 - Big Data Applications and Tuning in Ceph

Post on 15-Jul-2015


Big Data Applications and Tuning in Ceph

Noah Watkins, Red Hat

Who am I?

● Noah Watkins
● Red Hat engineer
  ○ Big Data applications on Ceph
● PhD candidate at UC Santa Cruz
  ○ Use Ceph as a research platform
● Contact
  ○ noahwatkins@gmail.com

Today’s Agenda

● Non-technical talk about technical stuff
● Less visible projects that deserve attention
● Ceph is a big ecosystem
  ○ Running Hadoop on Ceph
  ○ Tracing and debugging features
  ○ Custom object interfaces

Big Data with Hadoop and Ceph


Big Data with Ceph and Hadoop

● Do you Hadoop?
● Are you running a Ceph cluster?
● Combined, they work. End of talk.

National System Administrator Appreciation Day

Why should you care? Consolidation

[Diagram sequence with labels: FooStore!, FooApp!, $$$]

How does it work?

● A shim layer translates file system APIs
  ○ CephFS <-> Hadoop Common File System
● Opens up the entire Hadoop ecosystem
  ○ MapReduce
  ○ Spark
  ○ Storm
  ○ Impala
  ○ HBase
  ○ The list goes on and on
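The shim idea can be illustrated with a toy adapter. This is only a sketch in Python, not the cephfs-hadoop code (the real shim is a Java plugin). `FakeCephFS`, `HadoopStyleFS`, every method name, and the `mon-host` address are hypothetical stand-ins chosen for illustration.

```python
import posixpath


class FakeCephFS:
    """Hypothetical in-memory stand-in for a POSIX-like CephFS client."""

    def __init__(self):
        self._files = {}

    def write_file(self, path, data):
        self._files[path] = bytes(data)

    def read_file(self, path):
        return self._files[path]

    def list_dir(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self._files if p.startswith(prefix))


class HadoopStyleFS:
    """Adapter exposing a Hadoop-flavored interface on top of FakeCephFS.

    Each call translates a URI-style path (ceph://host/dir/file) into the
    plain path the underlying client expects, then delegates.
    """

    SCHEME = "ceph://"

    def __init__(self, backend):
        self._backend = backend

    def _to_backend_path(self, uri):
        if not uri.startswith(self.SCHEME):
            raise ValueError("unsupported URI: " + uri)
        # Drop the scheme and authority, keep the absolute path.
        rest = uri[len(self.SCHEME):]
        _, _, path = rest.partition("/")
        return posixpath.normpath("/" + path)

    def create(self, uri, data):
        self._backend.write_file(self._to_backend_path(uri), data)

    def open(self, uri):
        return self._backend.read_file(self._to_backend_path(uri))

    def list_status(self, uri):
        return self._backend.list_dir(self._to_backend_path(uri))


fs = HadoopStyleFS(FakeCephFS())
fs.create("ceph://mon-host/user/noah/input.txt", b"hello")
print(fs.open("ceph://mon-host/user/noah/input.txt"))
print(fs.list_status("ceph://mon-host/user/noah"))
```

Because applications only talk to the adapter's interface, anything written against that interface runs unchanged on the new backend, which is why one shim opens up the whole ecosystem.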

HDFS vs CephFS, 1TB Terasort

Source: http://www.mellanox.com/related-docs/whitepapers/wp_hadoop_on_cephfs.pdf

1 Year Old Results!

What Works and What Doesn’t (yet)

● Locality-aware scheduling
  ○ The rumors aren’t true :)
● Variable replication and erasure coding
  ○ Select from existing pools
● Snapshots
● How’s the stability?
  ○ Terasort, HBase, DFSIO
  ○ Bug fixes and performance tuning (HDFS isn’t strict!)
  ○ Gets better with each new MDS update

Test driving Hadoop on Ceph

● GitHub
  ○ https://github.com/ceph/cephfs-hadoop
● Tutorial
  ○ http://ceph.com/docs/master/cephfs/hadoop/
  ○ Streamlined installation and new docs coming soon!
● Mailing list
  ○ Best resource right now
  ○ http://tracker.ceph.com

Tracing and Debugging with LTTng


Things always go according to plan


Complex Systems Can’t Be Grokked


What is tracing & why should I care?

● Tracing allows us to see exactly what happens inside the system

[11:41:53.226668003] (+0.000270968) issdm-45 librbd:aio_read_enter: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { imagectx = 0x7F929308F600, name = "kubuntu", snap_name = "", read_only = 0, offset = 2078900224, length = 31232, completion = 0x7F92935AE190 }

[11:41:53.226730019] (+0.000062016) issdm-45 librbd:aio_read_exit: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { retval = 31232 }

[11:41:53.228001617] (+0.001271598) issdm-45 librbd:aio_complete_enter: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { completion = 0x7F92935AE190, rval = 31232 }

[11:41:53.228007204] (+0.000005587) issdm-45 librbd:aio_get_return_value_enter: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { completion = 0x7F92935AE190 }

[11:41:53.228009718] (+0.000002514) issdm-45 librbd:aio_get_return_value_exit: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { retval = 31232 }

[11:41:53.228016702] (+0.000006984) issdm-45 librbd:aio_complete_exit: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { }


With tracing anything is possible

[Plots: Queue Depth over Time; Latency vs Sector Size]

Example: Ceph Request Latency

Trace processing pipeline

● Processing step examines trace events
● Typically written in Python
● Looking for pairs is a common pattern
  ○ Time spent in queue
  ○ Time spent in I/O
  ○ Client processing time
● Requires knowledge of internal workings
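The "looking for pairs" pattern can be sketched against the librbd trace lines shown earlier in this deck. This is a minimal illustration, not the tooling used for the plots. The regular expressions assume the text format of those sample lines, and the helper names (`parse`, `pair_latencies`, `request_latencies`) are made up for this example. Treating the `completion` pointer as a request identifier is also an assumption.

```python
import re

# Sample events copied from the trace slide.
TRACE = [
    '[11:41:53.226668003] (+0.000270968) issdm-45 librbd:aio_read_enter: '
    '{ cpu_id = 0 }, { pthread_id = 140267464296512 }, '
    '{ imagectx = 0x7F929308F600, name = "kubuntu", snap_name = "", '
    'read_only = 0, offset = 2078900224, length = 31232, '
    'completion = 0x7F92935AE190 }',
    '[11:41:53.226730019] (+0.000062016) issdm-45 librbd:aio_read_exit: '
    '{ cpu_id = 0 }, { pthread_id = 140267464296512 }, { retval = 31232 }',
    '[11:41:53.228001617] (+0.001271598) issdm-45 librbd:aio_complete_enter: '
    '{ cpu_id = 1 }, { pthread_id = 140266098906880 }, '
    '{ completion = 0x7F92935AE190, rval = 31232 }',
    '[11:41:53.228007204] (+0.000005587) issdm-45 '
    'librbd:aio_get_return_value_enter: { cpu_id = 1 }, '
    '{ pthread_id = 140266098906880 }, { completion = 0x7F92935AE190 }',
    '[11:41:53.228009718] (+0.000002514) issdm-45 '
    'librbd:aio_get_return_value_exit: { cpu_id = 1 }, '
    '{ pthread_id = 140266098906880 }, { retval = 31232 }',
    '[11:41:53.228016702] (+0.000006984) issdm-45 librbd:aio_complete_exit: '
    '{ cpu_id = 1 }, { pthread_id = 140266098906880 }, { }',
]

LINE_RE = re.compile(
    r"\[(\d+):(\d+):(\d+)\.(\d{9})\]\s+\(\+[\d.]+\)\s+\S+\s+"
    r"(\w+):(\w+)_(enter|exit):.*?pthread_id = (\d+)"
)
COMPLETION_RE = re.compile(r"completion = (0x[0-9A-Fa-f]+)")


def parse(line):
    """Return (timestamp_ns, provider, call, phase, thread_id) or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    h, mi, s, ns, provider, call, phase, tid = m.groups()
    ts = ((int(h) * 60 + int(mi)) * 60 + int(s)) * 1_000_000_000 + int(ns)
    return ts, provider, call, phase, int(tid)


def pair_latencies(lines):
    """Match each *_enter with the next *_exit of the same call and thread."""
    pending = {}
    results = []
    for line in lines:
        event = parse(line)
        if event is None:
            continue
        ts, provider, call, phase, tid = event
        key = (tid, provider, call)
        if phase == "enter":
            pending[key] = ts
        elif key in pending:
            results.append((provider + ":" + call, ts - pending.pop(key)))
    return results


def request_latencies(lines):
    """Time from aio_read_enter to aio_complete_enter, keyed by completion."""
    issued = {}
    results = []
    for line in lines:
        event = parse(line)
        m = COMPLETION_RE.search(line)
        if event is None or m is None:
            continue
        ts, _, call, phase, _ = event
        if (call, phase) == ("aio_read", "enter"):
            issued[m.group(1)] = ts
        elif (call, phase) == ("aio_complete", "enter") and m.group(1) in issued:
            results.append((m.group(1), ts - issued.pop(m.group(1))))
    return results


for name, nanos in pair_latencies(TRACE):
    print(name, nanos, "ns inside the call")
for completion, nanos in request_latencies(TRACE):
    print(completion, nanos, "ns from submit to completion")
```

The first function only measures time spent inside each API call. The second one spans threads, which is where knowledge of the internals comes in: you have to know which field ties the submit event to its completion.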

Zipkin, Blkin, and LTTng

● Dapper is a Google system
  ○ Traces causal relationships
● Zipkin implemented by Twitter
  ○ Look at the pretty GUI
  ○ Ignores data sources
● Huge number of raw LTTng tracepoints in Ceph
  ○ LTTng → Zipkin (Blkin)
    ■ Marios Kogias
  ○ Andrew Shewmaker
  ○ Adam Crume

Getting started with tracing!

● Lots of tracepoints exist!
  ○ Adding new points is easy :)
● RBD-Replay
  ○ Collect and replay RBD traces
  ○ http://ceph.com/docs/master/rbd/rbd-replay/
● Adding points and discussion
  ○ http://noahdesu.github.io/2014/06/01/tracing-ceph-with-lttng-ust.html

TRACEPOINT_EVENT(librados, rados_write_enter,
  TP_ARGS(
    rados_ioctx_t, ioctx,
    const char*, oid,
    const void*, buf,
    size_t, len,
    uint64_t, off),
  TP_FIELDS(
    ctf_integer_hex(rados_ioctx_t, ioctx, ioctx)
    ctf_string(oid, oid)
  )
)

[Scripting] Storage and Compute with RADOS


A different version of a better talk

● The objects in RADOS can have arbitrary code associated with them
  ○ Think: “remotely compress object ‘foo’, please.”
● “Distributed Storage and Compute with Ceph’s librados”
  ○ Great talk by Sage Weil
  ○ Check out YouTube
● Scripting compute with librados

How does an OSD handle a request?

[Diagram labels: Client, librados, OSD; read-object(foo); read-object; transaction]

● Client reads an object from an OSD
● The OSD executes a “read” operation
  ○ The read operation knows how to access data managed by the OSD
● All operations are executed in a transactional context
● Exact function can be swapped out

What are RADOS object classes?

[Diagram labels: Client, librados, OSD; get-md5(foo); MD5-Hash; transaction; out-of-band install]

● Client writes some C++ code
● Compiles it into an OSD plugin
● Once installed, this code can be invoked
● Avoids data transfer
  ○ Can cache results
● Can simplify application design
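The "avoids data transfer" point can be shown with a toy model. This is a sketch only: `ToyOSD`, `register`, and `execute` are hypothetical names, not the librados or OSD API, and the whole "cluster" is a Python dictionary. It illustrates the shape of the idea: the method runs where the object lives, and only the small result travels back.

```python
import hashlib


class ToyOSD:
    """Hypothetical in-memory stand-in for an OSD that runs registered methods."""

    def __init__(self):
        self.objects = {}
        self.methods = {}

    def register(self, name, func):
        # Plays the role of the out-of-band plugin install step.
        self.methods[name] = func

    def execute(self, oid, method, payload=b""):
        # The method runs next to the data; only its output leaves the OSD.
        return self.methods[method](self.objects[oid], payload)


def compute_md5(data, payload):
    """Toy object-class method: hash the whole object, ignore the input."""
    return hashlib.md5(data).digest()


osd = ToyOSD()
osd.objects["foo"] = b"x" * (4 * 1024 * 1024)
osd.register("md5", compute_md5)

digest = osd.execute("foo", "md5")
print(len(osd.objects["foo"]), "bytes stored,", len(digest), "bytes returned")
```

Without the installed method, the client would have to read all 4 MiB over the network and hash it locally. With it, the reply is a 16-byte digest.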

Example RADOS object class plugin

// Input is provided by the client, and output is returned to the client.
int compute_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  // Stat the object to query its size.
  uint64_t size;
  int ret = cls_cxx_stat(hctx, &size, NULL);
  if (ret < 0)
    return ret;

  // Read the entire object into a buffer.
  bufferlist data;
  ret = cls_cxx_read(hctx, 0, size, &data);
  if (ret < 0)
    return ret;

  // Pass this data buffer to the MD5 algorithm.
  byte digest[MD5::DIGESTSIZE];
  MD5().CalculateDigest(digest, (byte*)data.c_str(), data.length());

  // Return the MD5 digest to the client.
  out->append((const char*)digest, sizeof(digest));
  return 0;
}

All in a transactional context.

Dynamic object classes with Lua

[Diagram labels: Client, librados, OSD; call-lua(script, foo); LuaJIT VM; dynamically generated interface; transaction]

● Lua is great as an embedded language
● LuaJIT is a high-performance implementation
● Allow clients to construct and modify object classes without compiling or restarting OSDs

Example: Lua Thumbnail Generator

function thumb(input, output)
  -- apply thumbnail spec to original image
  local spec_string = input:str()
  local blob = get_orig_img()
  local img = assert(magick.load_image_from_blob(blob:str()))
  img = magick.thumb(img, spec_string)

  -- append thumbnail to object
  local obj_size = cls.stat()
  local img_bl = bufferlist.new()
  img_bl:append(img)
  cls.write(obj_size, #img_bl, img_bl)

  -- save location in leveldb
  local loc_spec = #img_bl .. "@" .. obj_size
  local loc_spec_bl = bufferlist.new()
  loc_spec_bl:append(loc_spec)
  cls.map_set_val(spec_string, loc_spec_bl)
end

[Diagram labels: Original, Ver.1, Ver.2, Ver.3; Thumbnail Index; App-specific Object Interface]

● Read object and apply ImageMagick transformation
● Append (cache) the new version of the image to the object
● Save the location of the version indexed by its specification
● Write a smart read function to consult the cache
● Application can dynamically alter the transformation applied
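The append-and-index layout behind the thumbnail cache, including the "smart read" step, can be sketched without a cluster. This is a toy model under stated assumptions: the object is a `bytearray`, the index is a plain dictionary standing in for the object's key/value map, and `add_version` and `read_version` are hypothetical names. The `length@offset` string mirrors the `loc_spec` built in the Lua example.

```python
def add_version(obj, index, spec, blob):
    """Append blob to the object and record '<length>@<offset>' under spec."""
    offset = len(obj)
    obj.extend(blob)
    index[spec] = "{}@{}".format(len(blob), offset)


def read_version(obj, index, spec):
    """Return the cached bytes for spec, or None if not generated yet."""
    loc = index.get(spec)
    if loc is None:
        return None
    length, offset = (int(part) for part in loc.split("@"))
    return bytes(obj[offset:offset + length])


obj = bytearray(b"ORIGINAL-IMAGE-BYTES")  # placeholder for the original image
index = {}
add_version(obj, index, "100x100", b"thumb-a")
add_version(obj, index, "50x50", b"thumb-bb")

print(index)
print(read_version(obj, index, "50x50"))
print(read_version(obj, index, "25x25"))
```

A miss (`None` here) is the point where a real smart read would generate the thumbnail, append it, and update the index before returning it.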


Getting started with scripted RADOS

● Buyer beware!
  ○ Experimental code
  ○ Works and fairly stable
● Code available on GitHub
  ○ http://github.com/ceph/ceph
  ○ branch: cls-lua
● In-depth explanation and examples
  ○ http://ceph.com/rados/dynamic-object-interfaces-with-lua/

That’s it!

● Lots of interesting development
● Ceph is a great platform for experimentation
● Q&A