Date posted: 16-Jan-2015
Uploaded by: inktank
Building AuroraObjects
Who am I?
● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009
– Wrote the Apache CloudStack integration
– libvirt RBD storage pool support
– PHP and Java bindings for librados
PCextreme?
● Founded in 2004
● Medium-sized ISP in the Netherlands
● 45,000 customers
● Started as a shared hosting company
● Datacenter in Amsterdam
What is AuroraObjects?
● Under the name “Aurora” my hosting company PCextreme B.V. has two services:
– AuroraCompute, a CloudStack-based public cloud backed by Ceph's RBD
– AuroraObjects, a public object store using Ceph's RADOS Gateway
● AuroraObjects is a public RADOS Gateway service (S3 only) running in production
The RADOS Gateway (RGW)
● Serves objects using either Amazon's S3 or OpenStack's Swift protocol
● All objects are stored in RADOS; the gateway is just an abstraction between HTTP/S3 and RADOS
The RADOS Gateway
Our ideas
● We wanted to cache frequently accessed objects using Varnish
– Only possible with anonymous clients
● SSL should be supported
● Storage shared between the Compute and Objects services
● 3x replication
Varnish
● A caching reverse HTTP proxy
– Very fast: up to 100k requests/s
– Configurable using the Varnish Configuration Language (VCL)
– Used by Facebook and eBay
● Not a part of Ceph, but can be used with the RADOS Gateway
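Since only anonymous requests can be cached (a signed S3 request must reach RGW for authentication), the cache-or-pass decision can be sketched roughly as below. This is a hypothetical VCL fragment in Varnish 3 syntax, not the production configuration:

```vcl
# Hypothetical sketch: cache only anonymous GET/HEAD requests.
sub vcl_recv {
    if (req.request != "GET" && req.request != "HEAD") {
        # Writes and other methods always go to RGW.
        return (pass);
    }
    if (req.http.Authorization) {
        # Signed S3 request: RGW must authenticate it, never cache.
        return (pass);
    }
    return (lookup);
}
```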
The Gateways
● SuperMicro 1U
– AMD Opteron 6200 series CPU
– 128GB RAM
● 20Gbit LACP trunk
● 4 nodes
● Varnish runs locally with RGW on each node
– Uses the RAM to cache objects
The Ceph cluster
● SuperMicro 2U chassis
– AMD Opteron 4334 CPU
– 32GB RAM
– Intel S3500 80GB SSD for OS
– Intel S3700 200GB SSD for journaling
– 6x Seagate 3TB 7200RPM drives for OSDs
● 2Gbit LACP trunk
● 18 nodes
● ~320TB of raw storage
Our problems
● When we cache objects in Varnish, they don't show up in the usage accounting of the RGW
– The HTTP request never reaches RGW
● When an object changes, we have to purge all caches to maintain cache consistency
– A user might change an ACL or modify an object with a PUT request
● We wanted to make cached requests cheaper than non-cached requests
Our solution: Logstash
● All requests go from Varnish into Logstash and into ElasticSearch
– From ElasticSearch we do the usage accounting
● When Logstash sees a PUT or DELETE request, it makes a local request which sends out a multicast to all other RGW nodes to purge that specific object
● We also store bucket storage usage in ElasticSearch so we have an average over the month
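On the receiving side, each node's Varnish can accept the purge as an HTTP PURGE request. A hypothetical VCL 3 sketch (the ACL and status texts are illustrative, not the production setup):

```vcl
# Hypothetical sketch: accept PURGE requests from trusted hosts only.
acl purgers {
    "127.0.0.1";
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purgers) {
            error 405 "Not allowed";
        }
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged";
    }
}
```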
LogStash and ElasticSearch
● varnishncsa → logstash → redis → elasticsearch
input {
  pipe {
    command => "/usr/local/bin/varnishncsa.logstash"
    type => "http"
  }
}
● And we simply execute varnishncsa:
varnishncsa -F '%{VCL_Log:client}x %{VCL_Log:proto}x %{VCL_Log:authorization}x %{Bucket}o %m %{Host}i %U %b %s %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'
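The slide shows the Logstash pipe input; the matching output stage, which ships events into Redis for ElasticSearch to consume, could look like the sketch below (host and key names are illustrative, not the production values):

```
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}
```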
%{Bucket}o?
● With %{<header>}o you can display the value of the response header <header>:
– %{Server}o: Apache 2
– %{Content-Type}o: text/html
● We patched RGW (merged into master) so that it can optionally return the bucket name in the response:
200 OK
Connection: close
Date: Tue, 25 Feb 2014 14:42:31 GMT
Server: AuroraObjects
Content-Length: 1412
Content-Type: application/xml
Bucket: "ceph"
X-Cache-Hit: No
● 'rgw expose bucket = true' in ceph.conf enables the Bucket response header
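Enabling it is a one-line ceph.conf change; the section name below is illustrative and depends on how your gateway instance is named:

```
[client.radosgw.gateway]
    ; Return the bucket name as a "Bucket" response header,
    ; so Varnish/Logstash can attribute cached traffic to a bucket.
    rgw expose bucket = true
```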
Usage accounting
● We only query RGW for storage usage and also store that in ElasticSearch
● ElasticSearch is used for all traffic accounting
– Allows us to differentiate between cached and non-cached traffic
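The cached/non-cached split can be computed from the indexed request records. A minimal Python sketch of that aggregation; the field names `bucket`, `hitmiss`, and `bytes` are illustrative, not the actual ElasticSearch schema:

```python
from collections import defaultdict

def aggregate_traffic(records):
    """Sum transferred bytes per bucket, split by Varnish cache hit/miss."""
    totals = defaultdict(lambda: {"cached": 0, "uncached": 0})
    for rec in records:
        # A Varnish "hit" never reached RGW, so it is billed as cached traffic.
        key = "cached" if rec["hitmiss"] == "hit" else "uncached"
        totals[rec["bucket"]][key] += rec["bytes"]
    return dict(totals)
```

In production the same split would be done with an ElasticSearch aggregation query rather than in application code.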
Back to Ceph: CRUSHMap
● A good CRUSHMap design should reflect the physical topology of your Ceph cluster
– All machines have a single power supply
– The datacenter has an A and a B powercircuit
● We use an STS (Static Transfer Switch) to create a third powercircuit
● With CRUSH we store each replica on a different powercircuit
– When a circuit fails, we keep 2/3 of the Ceph cluster running
– Each powercircuit has its own switching / network
The CRUSHMap
type 7 powerfeed
host ceph03 {
alg straw
hash 0
item osd.12 weight 1.000
item osd.13 weight 1.000
..
}
powerfeed powerfeed-a {
alg straw
hash 0
item ceph03 weight 6.000
item ceph04 weight 6.000
..
}
root ams02 {
alg straw
hash 0
item powerfeed-a
item powerfeed-b
item powerfeed-c
}
rule powerfeed {
ruleset 4
type replicated
min_size 1
max_size 3
step take ams02
step chooseleaf firstn 0 type powerfeed
step emit
}
The CRUSHMap
Testing the CRUSHMap
● With crushtool you can test your CRUSHMap
● $ crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap
● $ crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 --show-statistics
● This shows you the result of the CRUSHMap:
rule 4 (powerfeed), x = 0..1023, numrep = 3..3
CRUSH rule 4 x 0 [36,68,18]
CRUSH rule 4 x 1 [21,52,67]
..
CRUSH rule 4 x 1023 [30,41,68]
rule 4 (powerfeed) num_rep 3 result size == 3: 1024/1024
● Manually verify those locations are correct
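That manual verification can be scripted. A hypothetical Python sketch that parses the crushtool output above and flags any placement whose replicas share a powerfeed; the OSD-to-powerfeed mapping must be built from your own CRUSHMap:

```python
import re

def check_placements(crushtool_output, osd_to_feed):
    """Return the x values whose replicas do NOT all land on distinct powerfeeds."""
    bad = []
    # Lines look like: "CRUSH rule 4 x 0 [36,68,18]"
    for m in re.finditer(r"CRUSH rule \d+ x (\d+) \[([\d,]+)\]", crushtool_output):
        x = int(m.group(1))
        osds = [int(o) for o in m.group(2).split(",")]
        feeds = {osd_to_feed[o] for o in osds}
        if len(feeds) != len(osds):
            bad.append(x)
    return bad
```

With 1024 test values, an empty result means every replica set spans three different powerfeeds.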
A summary
● We cache anonymously accessed objects with Varnish
– Allows us to process thousands of requests per second
– Saves us I/O on the OSDs
● We use LogStash and ElasticSearch to store all requests and do usage accounting
● With CRUSH we store each replica on a different power circuit
Resources
● LogStash: http://www.logstash.net/
● ElasticSearch: http://www.elasticsearch.net/
● Varnish: http://www.varnish-cache.org/
● CRUSH: http://ceph.com/docs/master/
● E-Mail: [email protected]
● Twitter: @widodh