Presented by:
Leveraging open source tools to gain insight into OpenStack Swift
May 20, 2015
Michael Factor, IBM Fellow, Storage and Systems, IBM Research - Haifa
Dmitry Sotnikov, System and Storage Researcher, IBM Research - Haifa
Deep dive insights into Swift
The work was done with the help of Yaron Weinsberg and George Goldberg.
For more information contact: [email protected]
Swift Monitoring
• Monitoring Swift With StatsD
  https://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd/
• Unified Instrumentation and Metering of Swift
  https://wiki.openstack.org/wiki/UnifiedInstrumentationMetering
• Administrator's Guide
  http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring

"Once you have all this great data, what do you do with it? Well, that's going to require its own post."
— "Monitoring Swift With StatsD" by SwiftStack, Inc.
Swift Monitoring Flow
• StatsD allows deep instrumentation of the Swift code and can report over 100 metrics.
• Collectd gathers statistics about the system.
• Graphite is an enterprise-scale monitoring tool that stores and displays numeric time-series data.
• Logstash is a data pipeline that can normalize the data to a common format.
• Elasticsearch is a search server that allows indexing large amounts of data.
• Kibana is a browser-based analytics and search interface for Elasticsearch.
• Spark is a fast and general engine for large-scale data processing.
• RequestStopper catches a request and immediately returns success, which makes it possible to isolate per-component overheads in a non-production system.
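To make the StatsD piece concrete: Swift's instrumentation emits each metric as a tiny plaintext UDP datagram of the form "name:value|type". The sketch below is not Swift's actual client code; the daemon address is StatsD's conventional default, and the metric names are illustrative of the style Swift uses.

```python
import socket
import time

# Assumed address of a local StatsD daemon (StatsD's default UDP port).
STATSD_ADDR = ("127.0.0.1", 8125)

def counter_payload(name: str) -> bytes:
    # Counter increment, e.g. one tick per successful PUT.
    return f"{name}:1|c".encode()

def timer_payload(name: str, elapsed_ms: float) -> bytes:
    # Timer sample in milliseconds.
    return f"{name}:{elapsed_ms:.2f}|ms".encode()

def send(payload: bytes) -> None:
    # Fire-and-forget UDP: no connection, no ack, negligible overhead.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, STATSD_ADDR)
    finally:
        sock.close()

start = time.time()
# ... handle a PUT request here ...
send(counter_payload("proxy-server.object.PUT.201"))
send(timer_payload("proxy-server.object.PUT.timing", (time.time() - start) * 1000))
```

Because the transport is connectionless UDP, instrumented servers never block on the monitoring node, which is what makes it safe to report 100+ metrics from a busy Swift cluster.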
[Diagram: a Swift node running Proxy, Container, Object, and Account servers, instrumented with StatsD and RequestStopper, reports metrics together with CPU, RAM, and disk statistics to the monitoring, analytics, and visualization node.]
Benchmark Tool
• COSBench, Intel's Cloud Object Storage Benchmarking tool
  https://github.com/intel-cloud/cosbench
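COSBench drives a run from an XML workload file. The sketch below shows the general shape of a 100-worker PUT workload against 100 containers; the auth details, selector expressions, and all field values here are illustrative assumptions — consult the COSBench user guide for exact syntax.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<workload name="swift-put" description="100-worker PUT workload sketch">
  <storage type="swift" />
  <auth type="swauth"
        config="username=test:tester;password=testing;url=http://proxy:8080/auth/v1.0" />
  <workflow>
    <workstage name="prepare">
      <!-- create the 100 target containers first -->
      <work type="init" workers="1" config="containers=r(1,100)" />
    </workstage>
    <workstage name="main">
      <work name="put" workers="100" runtime="600">
        <operation type="write" ratio="100"
                   config="containers=u(1,100);objects=u(1,5000);sizes=c(15)KB" />
      </work>
    </workstage>
  </workflow>
</workload>
```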
Where Our Journey Starts
• Swift 1.13
• 1 container
• Half a million small objects
• 100 COSBench workers
• What should the cluster size be to run more than 1000 PUTs a second (with reasonable response time)?
Our Hardware - Story #1
• 3 proxy nodes (Proxy servers only)
• 7 storage nodes (Object, Container and Account servers)
  • 20 HDD
  • 2 SSD
  • 256 GB RAM
• 3 client machines connected to the proxies
• All network connections are 10 Gbps

520 operations per second
Swift Data Path Flow
• The PUT object request arrives at one of the proxies.
• The proxy sends the request to R (e.g., 3) storage nodes that will hold the R (e.g., 3) replicas of that object.
• Next, the container database is updated asynchronously to reflect the new object (https://swiftstack.com/openstack-swift/architecture/).
  • It is not fully asynchronous: the update is first attempted synchronously, falling back to async on a 0.5 sec timeout.
• When at least two of the three writes to the object servers return successfully, the proxy server notifies the client that the upload was successful.
[Diagram: Client PUT request and response through the Proxy Server, which writes to the Object Servers; the Container Server is then updated with the new object.]
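The "two of three" write rule described above is Swift's majority quorum for replicated writes. A minimal sketch (a simplification, not Swift's actual proxy code; the backend callables are hypothetical stand-ins for object-server connections):

```python
from concurrent.futures import ThreadPoolExecutor

REPLICAS = 3
# Majority quorum: 2 of 3 replica writes must succeed.
QUORUM = REPLICAS // 2 + 1

def put_object(object_servers, data):
    """Send the object to all replica nodes in parallel;
    succeed once a quorum of them returns a 2xx status."""
    with ThreadPoolExecutor(max_workers=REPLICAS) as pool:
        statuses = list(pool.map(lambda srv: srv(data), object_servers))
    successes = sum(1 for s in statuses if 200 <= s < 300)
    return 201 if successes >= QUORUM else 503

# Hypothetical backends: two healthy object servers and one failing disk.
ok = lambda data: 201
bad = lambda data: 507
print(put_object([ok, ok, bad], b"payload"))  # quorum of 2 reached -> 201
```

The quorum rule is why a single slow or failed object server does not stall the client, while the container update can lag behind on the async_pending path.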
Swift Data Path Flow
[Diagram: the same PUT flow, with "null" variants of the Proxy, Object, and Container Servers substituted to short-circuit each component.]

While nulling out a server is not useful for a production system, it is useful for diagnosing performance bottlenecks.
RequestStopper
https://gist.github.com/gilv/7e70ba055f24bcc472b6
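The gist above implements RequestStopper as Swift WSGI middleware. A minimal sketch of the idea (not the gist's exact code; the class name and wiring follow standard paste.deploy conventions, and the short-circuited verbs are an assumption):

```python
# Sketch of a WSGI middleware that answers requests immediately with
# success instead of forwarding them down the pipeline. Placing it in
# front of a given server isolates the overhead of everything behind it.
class RequestStopper:
    def __init__(self, app, methods=("PUT",)):
        self.app = app          # the rest of the Swift pipeline
        self.methods = methods  # which verbs to short-circuit

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") in self.methods:
            # Pretend the request succeeded without doing any work.
            start_response("201 Created", [("Content-Length", "0")])
            return [b""]
        return self.app(environ, start_response)

def filter_factory(global_conf, **local_conf):
    # paste.deploy entry point, the way Swift middleware is normally wired up
    def factory(app):
        return RequestStopper(app)
    return factory
```

In a (non-production) server config this is inserted into the pipeline just ahead of the component being measured; the difference in end-to-end response time with and without it is that component's contribution.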
Swift Data Path Flow – Put Request Response Time
[Diagram: PUT response times measured with RequestStopper inserted at successive points in the path: native path 192.47 ms; stopped at the Container Server 47.3 ms; at the Object Server 32.89 ms; at the Proxy Server 1.86 ms.]
Object PUT Operations Average Response Time per Swift Component
100 Workers, 500K Objects

[Chart: response time (ms) broken down into network RTT to proxy, Proxy Server, Object Server, and Container Server, for Swift 1.13 with 1 container vs. 100 containers; the 100-container case is ~4.7x faster.]
Object PUT Operations Average Response Time Comparison of Swift 1.13 vs Swift 2.2
100 Workers, 500K Objects

[Chart: the same per-component breakdown for Swift 1.13 and Swift 2.2, each with 1 and 100 containers; Swift 2.2 is ~3x faster with 1 container and ~1.5x faster with 100 containers.]
Mixed Workload: 1 Container, 100 Workers, 500K Objects
[Chart: read, write, and delete operations per second for the mixed workload with 1 container, Swift 1.13 vs. Swift 2.2.]

On the mixed workload, Swift 2.2 achieves a 70% performance improvement.
SWIFT Scalability – Swift 2.2
100 Containers, 100 Workers, 500K Objects
[Charts: PUT operations per second, and response-time distribution (60%/80%/90%/95%/99% percentiles, ms), as the number of storage servers grows from 2 to 7.]
This is the maximal performance that can be achieved by 100 COSBench workers on Swift 2.2, so adding a new node does not improve performance.
Influence of the Number of COSBench Workers on Performance – Swift 2.2
7 Storage Nodes, 500K Objects, 100 Containers

#Workers      Operations per second   Average Response Time (ms)
100 workers   2854.31                 38.98
200 workers   3455.17                 62.68
400 workers   4323.52                 101.96
Story #1 Conclusions: RequestStopper
• In some cases the limiting factor is not throughput but response time.
• The response time of native Swift 1.13 with 1 container is 192 ms, i.e. ~5.2 op/sec per COSBench worker, or 520 op/sec for 100 COSBench workers.
• Reducing the response time to 65 ms in Swift 2.2 yields ~1560 IOPS on the same cluster.
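The arithmetic behind these conclusions is Little's law: with closed-loop benchmark workers that each keep exactly one request in flight, throughput equals workers divided by response time. A quick check against the numbers above:

```python
def throughput(workers: int, response_time_ms: float) -> float:
    """Closed-loop throughput (op/sec): each worker keeps exactly one
    synchronous request in flight, so ops/sec = workers / response_time."""
    return workers / (response_time_ms / 1000.0)

# Swift 1.13, 1 container: 192 ms per PUT with 100 workers -> roughly 520 op/sec
print(throughput(100, 192))
# Swift 2.2: ~65 ms per PUT with the same 100 workers -> roughly 1540 op/sec
print(throughput(100, 65))
```

This is why, at a fixed worker count, cutting response time raises throughput almost proportionally even though no hardware was added.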
Story #1 Conclusions: Container Server
• The difference in the Container Server performance between Swift 2.2 and Swift 1.13 was due in large part to the container merge_items speedup patch (https://review.openstack.org/#/c/116992/)
• Container Sharding (https://review.openstack.org/#/c/139921/) still has the potential to improve performance for this workload by a factor of 1.5
System Size Influence on Performance
[Charts: average response time (ms) vs. the number of objects per container (10 to 100,000), and vs. the total number of objects in the system (5K to 500K), each for 1 container and 100 containers.]
Swift performance (response time) is influenced by the number of objects per container. In our environment we identified an optimal number of objects; it remains to evaluate what determines the optimal number of objects per container.
Where Our Journey Continues
• September 2014
• Swift 1.13
• 1 container
• Half a million small objects
• What should the cluster size be to run more than 1000 PUTs a second (with reasonable response time)?
Kibana – Put Request Response Time Percentiles
[Chart: PUT request response time percentiles (sec) as visualized in Kibana.]

Average Response Time over 1-second intervals – Swift 2.2
[Charts: average response time (ms) over 1-second intervals between ~2:23 and ~3:29, with recurring spikes.]

There is a peak every 30 seconds.
https://bugs.launchpad.net/swift/+bug/1450656 ?
Graphite – PUT Request Response Time
[Chart: PUT request response time (sec) in Graphite; outlier spikes recur every 30 seconds.]
Zoom-in on Swift Response Time Outliers (> 0.5 sec), Request Granularity
PUT workload, 500K objects, 100 Containers, 100 Workers, Swift 2.2

[Chart: per-request response times (sec) between 0:36:00 and 0:40:19; outlier clusters appear every 30 seconds.]
The effect of fs.xfs.xfssyncd_centisecs on PUT response time
PUT workload, 500K objects, 100 Containers, 100 Workers, Swift 2.2

[Chart: response time (ms) over time with the sync interval set to 300 seconds vs. 60 seconds; spikes follow the sync interval.]
PUT response time percentiles (ms) by fs.xfs.xfssyncd_centisecs value (in seconds):

Seconds   Avg-ResTime  60%-RT  80%-RT  90%-RT  95%-RT  99%-RT  100%-RT
10        83.26        30      50      350     520     700     1,620
30        43.34        30      40      50      60      530     3,690
60        38.81        30      40      50      70      270     5,900
300       31.89        30      40      50      70      220     9,530

Increasing fs.xfs.xfssyncd_centisecs improves the 99th percentile at the price of degrading the 100th percentile.
Story #2

Our Hardware - Story #2
• 2 proxy nodes (Proxy servers only)
• 4 object nodes (Object servers)
  • 15 HDD
  • 128 GB RAM
• 2 metadata nodes (Container and Account servers)
  • 2 SSD
  • 128 GB RAM
• 2 client machines connected to the proxies
• Internal network connections are 10 Gbps
Object PUT Workload
100 Workers, 100 Containers, 500K Objects
Clients Transmitted Throughput

[Chart: throughput (MB/sec) transmitted by the client machines over time.]
Clients Transmitted vs. Proxy Servers Received Throughput Comparison

[Chart: total client transmitted network throughput equals total proxy received network throughput (MB/sec) over time.]
Proxy Servers Received vs. Proxy Servers Transmitted Throughput Comparison

[Chart: the proxies transmit ~3x the throughput they receive (MB/sec), one copy per replica.]
Proxy Servers Received and Transmitted vs. Object Servers Received Throughput Comparison

[Chart: the object servers' received throughput matches the proxies' transmitted throughput, ~3x the proxies' received throughput.]
Network vs. Disks Throughput Comparison

[Chart: the object servers' disk write throughput is ~12x their received network throughput (MB/sec).]
Disks Capacity Utilization

[Chart: total disk capacity used over time, annotated with the new-object-creation part, the rewrite workload, and the disk capacity expected if the whole workload created new objects.]
Number of async_pending requests over time

[Chart: async_pending requests per second on each object server (Object1–Object4).]
Variable object size workload

[Chart: throughput (MB/sec) over time as the object size steps through 15 KB, 32 KB, 64 KB, 128 KB, 512 KB, and 1 MB; proxy received, proxy transmitted, and object servers' disk write throughput.]
Disk vs. Client Perceived Bandwidth

[Chart: ratio of disk write bandwidth to client-perceived bandwidth for object sizes from 15 KB to 1 MB.]

The overhead is not a flat 3x, but is instead a function of object size.
Our Hardware - Story #3 – without async_pendings
• 2 proxy nodes (Proxy servers only)
• 5 object nodes (Object servers)
  • 15 HDD
  • 128 GB RAM
• 3 metadata nodes (Container and Account servers)
  • 3 SSD
  • 128 GB RAM
• 2 client machines connected to the proxies
• Internal network connections are 10 Gbps
Network vs. Disks Throughput Comparison

[Chart: proxy received, proxy transmitted, and object servers' disk write throughput (MB/sec) over time.]
Disks Capacity Utilization

[Chart: disk capacity used over time.]
Number of async_pending requests over time

[Chart: async_pending requests per second over time.]
Back to Story #2
PUT Request Average Response Time (Lower is better)

[Chart: PUT response time (ms) over time for Object1–Object4 and the two proxies.]

Object server "Object1" has a much higher response time; "Object2/3/4" have lower response times than the proxies.
Disks Read and Write Throughputs

[Chart: read and write disk throughput for Object1–Object4.]
Process statistics over the object servers

[Chart: number of running vs. blocked processes over wall clock time; blocked processes are those waiting for an IO response.]
The idle CPU comparison over the object servers (Higher is better)

[Chart: idle percentage for CPU0–CPU39 (and stopped CPUs) over wall clock time.]
• Micro benchmark results (Vdbench 8k random write workload):
  • "Object1" shows an average response time of ~37 ms
  • "Object2", "Object3", "Object4" show an average response time of ~30 ms
• Our investigation revealed that the "Object1" server consists of older hardware components, even though all servers were "supposed" to be the same.
4 Object Servers vs. 3 Object Servers (without Object1):
~10% throughput improvement, despite a 25% reduction in object servers.
Effect of Object Size on Cluster Bandwidth

Object Size   Avg-Res Time   Avg-Proc Time   Throughput      Bandwidth
15 KB         38.98 ms       38.94 ms        2854.31 op/s    42.81 MB/s
1 MB          105.37 ms      103.6 ms        967.04 op/s     967.04 MB/s
10 MB         852.06 ms      578.62 ms       117.43 op/s     1.17 GB/s
Back-of-the-envelope calculation:
User Bandwidth × Number of Replicas < Total Proxy Backend Bandwidth
In our case we have 3 proxy servers and a 10 Gbit network:
User Bandwidth × 3 < 3 × 10 Gbit  ⇒  User Bandwidth < 1.25 GB/sec
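The calculation above can be checked numerically (a simple sketch; 10 Gbit/s is taken as 1.25 GB/s, ignoring protocol overhead):

```python
# Back-of-the-envelope ceiling on user-visible bandwidth: every user byte
# is replicated over the proxy backend network, so
# user_bw * replicas must fit within the total proxy backend bandwidth.
GBIT_TO_GBYTE = 1 / 8  # 10 Gbit/s ~= 1.25 GB/s, ignoring overhead

def max_user_bandwidth_gb(proxies: int, link_gbit: float, replicas: int) -> float:
    total_backend_gb = proxies * link_gbit * GBIT_TO_GBYTE
    return total_backend_gb / replicas

# 3 proxies, 10 Gbit links, 3 replicas -> 1.25 GB/s ceiling
print(max_user_bandwidth_gb(3, 10, 3))  # 1.25
```

This matches the measured 1.17 GB/s for 10 MB objects: at large object sizes the cluster is close to the proxy-network ceiling rather than being limited by response time.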