ZHT: A Fast, Reliable and Scalable Zero-hop Distributed Hash Table
Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Zhao Zhang, Ioan Raicu
Illinois Institute of Technology, Chicago, U.S.A.
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
Ken Batcher
Big problem: file system scalability
- Parallel file systems (GPFS, PVFS, Lustre): computing resources separated from storage; centralized metadata management
- Distributed file systems (GFS, HDFS): special-purpose designs (MapReduce, etc.); centralized metadata management
The bottleneck of file systems
- Metadata: concurrent file creates
[Figure: time per operation (ms, log scale 1-100,000) vs. scale (# of cores, 1-16,384) for file creates on GPFS, many directories vs. one directory]
Proposed work
- A distributed hash table (DHT) for high-end computing (HEC)
- A building block for high-performance distributed systems
- Goals: performance (latency and throughput), scalability, and reliability
Related work: Distributed Hash Tables
Many DHTs: Chord, Kademlia, Pastry, Cassandra, C-MPI, Memcached, Dynamo ...
Why another?

Name      | Impl. | Routing time | Persistence | Dynamic membership | Append operation
Cassandra | Java  | log(N)       | Yes         | Yes                | No
C-MPI     | C     | log(N)       | No          | No                 | No
Dynamo    | Java  | 0 to log(N)  | Yes         | Yes                | No
Memcached | C     | 0            | No          | No                 | No
ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes
Zero-hop hash mapping
[Diagram: clients 1...n hash each key directly to its owning node among nodes 1...n; each value (value j, value k) is stored on replicas 1-3]
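A minimal sketch of zero-hop addressing, assuming a client-side membership table indexed by hash value; the names (`membership`, `node_for`, `replica_for`) are illustrative, not ZHT's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative client-side membership table: one "ip:port" per node
// (assumed to be populated before use).
std::vector<std::string> membership;

// Zero-hop routing: the client hashes the key itself and contacts the
// owning node directly, with no per-request hops through other peers.
std::string node_for(const std::string& key) {
    uint64_t h = std::hash<std::string>{}(key);
    return membership[h % membership.size()];
}

// Replica i of a key lives on the i-th successor node.
std::string replica_for(const std::string& key, std::size_t i) {
    uint64_t h = std::hash<std::string>{}(key);
    return membership[(h + i) % membership.size()];
}
```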
2-layer hashing
[Diagram: a key is first hashed into one of the fixed partitions of the name space; the partition is then mapped to the physical node that currently hosts it]
Architecture and terms
- Name space: 2^64
- Physical node: hosts a manager and one or more ZHT instances; each instance holds one or more partitions
- Partitions: n (fixed), with n = max(k)
- The manager handles membership updates and broadcasts; instances respond to requests on their partitions
- Membership table fields: UUID (ZHT instance), key, IP, port, capacity, workload (a sketch of one row follows)
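A sketch of one membership table row, following the fields listed above; the struct layout and field types are assumptions for illustration, not ZHT's actual definitions:

```cpp
#include <cstdint>
#include <string>

// One row of the membership table (field types are assumed).
// Clients resolve a key to a partition in the fixed 2^64 name space,
// then use this table to find the instance hosting that partition.
struct MembershipEntry {
    uint64_t    uuid;      // UUID of the ZHT instance
    uint64_t    key;       // partition's position in the name space
    std::string ip;        // address of the hosting physical node
    uint16_t    port;      // port the instance listens on
    uint64_t    capacity;  // node capacity, usable for balancing
    uint64_t    workload;  // current workload on the node
};
```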
How many partitions per node can we do?
[Figure: average latency (ms, 0.6-0.78) vs. number of partitions per instance (1-1,000)]
Membership management
Static: Memcached, ZHT Dynamic
Logarithmic routing: most of DHTs Constant routing: ZHT
Membership management
- Updating membership: incremental broadcasting
- Remapping key-value pairs:
  - Traditional DHTs: rehash all affected pairs
  - ZHT: move the whole partition, since HEC has a fast local network (see the sketch below)
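A minimal sketch of the contrast, under an assumed in-memory partition type; `migrate_partition` and its bulk hand-off are illustrative, not ZHT's migration code:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

using Partition = std::unordered_map<std::string, std::string>;

// Because the number of partitions is fixed, a membership change never
// re-hashes individual key-value pairs (as a consistent-hashing DHT
// would); the affected partition moves to its new owner as one unit.
void migrate_partition(Partition& departing, Partition& new_owner) {
    // In a real deployment this would be one bulk transfer over the
    // fast HEC interconnect; here we just hand the container over.
    new_owner = std::move(departing);
    departing.clear();
}
```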
Consistency
- Updating membership tables:
  - Planned node joins and leaves: strong consistency
  - Node failures: eventual consistency
- Updating replicas: configurable (see the sketch below)
  - Strong consistency: consistent, reliable
  - Eventual consistency: fast, high availability
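A minimal sketch of the configurable replica update, assuming a hypothetical `send_update` RPC (the stub below stands in for a network round trip); this illustrates the two modes, it is not ZHT's replication code:

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical RPC: push one key/value update to a replica and return
// once that replica acknowledges. Stubbed out for illustration.
bool send_update(const std::string& addr,
                 const std::string& key, const std::string& value) {
    (void)addr; (void)key; (void)value;  // stand-in for a round trip
    return true;
}

// strong == true : wait for every replica (consistent, reliable).
// strong == false: acknowledge after the primary and propagate the
//                  rest asynchronously (fast, favors availability).
bool replicate(const std::vector<std::string>& replicas,
               const std::string& key, const std::string& value,
               bool strong) {
    if (replicas.empty() || !send_update(replicas[0], key, value))
        return false;  // primary copy must succeed
    for (size_t i = 1; i < replicas.size(); ++i) {
        if (strong) {
            if (!send_update(replicas[i], key, value)) return false;
        } else {
            std::thread([addr = replicas[i], key, value] {
                send_update(addr, key, value);  // eventual propagation
            }).detach();
        }
    }
    return true;
}
```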
Persistence: NoVoHT
- NoVoHT: a persistent in-memory hash map (sketched below)
- Append operation
- Live migration
[Figure: latency (microseconds) vs. scale (1 million to 100 million key/value pairs) for NoVoHT, NoVoHT (no persistence), KyotoCabinet, BerkeleyDB, and unordered_map]
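A minimal sketch of a NoVoHT-style store, assuming (this is not NoVoHT's actual design or code) that persistence comes from an append-only operation log replayed at startup; the log format here is purely illustrative:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// In-memory hash map backed by an append-only log (assumed design).
class PersistentMap {
public:
    explicit PersistentMap(const std::string& path)
        : log_(path, std::ios::app) {}

    void put(const std::string& k, const std::string& v) {
        map_[k] = v;
        log_ << "P " << k << ' ' << v << '\n';
        log_.flush();
    }

    // Append: grow a stored value in place, with no read-modify-write
    // of the whole record on disk -- only the delta is logged.
    void append(const std::string& k, const std::string& v) {
        map_[k] += v;
        log_ << "A " << k << ' ' << v << '\n';
        log_.flush();
    }

    const std::string* get(const std::string& k) const {
        auto it = map_.find(k);
        return it == map_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, std::string> map_;
    std::ofstream log_;  // append-only persistence log
};
```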
Failure handling
- Insert and append: send the record to the next replica and mark it as the primary copy
- Lookup: get from the next available replica (sketched below)
- Remove: mark the record on all replicas
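A minimal sketch of the failover lookup path, assuming a hypothetical `remote_lookup` RPC that returns nothing when a replica is unreachable; names and signatures are illustrative:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical RPC: value if the replica answers, nothing otherwise.
std::optional<std::string> remote_lookup(const std::string& addr,
                                         const std::string& key) {
    (void)addr; (void)key;  // stand-in for a network call
    return std::nullopt;
}

// Walk the replica list in order; the first reachable replica that
// holds the key answers the lookup, so single failures are masked.
std::optional<std::string> lookup(const std::vector<std::string>& replicas,
                                  const std::string& key) {
    for (const auto& addr : replicas)
        if (auto v = remote_lookup(addr, key))
            return v;
    return std::nullopt;  // all replicas unreachable
}
```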
Evaluation: test beds
- IBM Blue Gene/P supercomputer: up to 8,192 nodes; 32,768 instances deployed
- Commodity cluster: up to 64 nodes
- Amazon EC2: m1.medium and cc2.8xlarge instances; 96 VMs, 768 ZHT instances deployed
Latency on BG/P
[Figure: latency (ms, 0-2.5) vs. number of nodes for TCP without connection caching, TCP with connection caching, UDP, and Memcached]
Latency distribution on BG/P (percentiles, in microseconds):

Scale | 75% | 90%  | 95%  | 99%
64    | 713 | 853  | 961  | 1259
256   | 755 | 933  | 1097 | 1848
1024  | 820 | 1053 | 1289 | 3105
Throughput on BG/P
[Figure: throughput (ops/s, log scale 1,000-10,000,000) vs. scale (# of nodes, 1-8192) for TCP without connection caching, ZHT with TCP connection caching, non-blocking UDP, and Memcached]
Aggregated throughput on BG/P
[Figure: throughput (ops/s, 0-18,000,000) vs. number of nodes (1-8192) with 1, 2, 4, and 8 instances per node]
Latency on commodity cluster
[Figure: latency (ms, 0-3) vs. scale (# of nodes, 1-64) for ZHT, Cassandra, and Memcached]
ZHT on cloud: latency
[Figure: average latency (microseconds, 0-14,000) vs. node number (1-96) for ZHT on m1.medium instances (1/node), ZHT on cc2.8xlarge instances (8/node), and DynamoDB]
ZHT on cloud: latency distribution
(latency percentiles and averages in microseconds; throughput in ops/s)

DynamoDB, 8 clients/instance:
Scale | 75%   | 90%   | 95%   | 99%   | Avg   | Throughput
8     | 11942 | 13794 | 20491 | 35358 | 12169 | 83.39
32    | 10081 | 11324 | 12448 | 34173 | 9515  | 3363.11
128   | 10735 | 12128 | 16091 | 37009 | 11104 | 11527
512   | 9942  | 13664 | 30960 | 38077 | 28488 | ERROR

ZHT on cc2.8xlarge instances, 8 server-client pairs/instance:
Scale | 75% | 90% | 95% | 99%  | Avg | Throughput
8     | 186 | 199 | 214 | 260  | 172 | 46421
32    | 509 | 603 | 681 | 1114 | 426 | 75080
128   | 588 | 717 | 844 | 2071 | 542 | 236065
512   | 574 | 708 | 865 | 3568 | 608 | 841040

[Figure: latency CDFs (0.9 and 0.99 marked) for DynamoDB read, DynamoDB write, and ZHT at 4-64 nodes]
ZHT on cloud: throughput
[Figure: aggregated throughput (ops/s, log scale 10-10,000,000, left axis) and hourly cost in US dollars (0-25, right axis) vs. node number (1-96); cost series: ZHT on m1, ZHT on cc2, DynamoDB (10K ops/s provisioned)]
Amortized cost
[Figure: hourly cost in US dollars for 1K ops/s of throughput (log scale 0.01-10) vs. scale (2-96 nodes), ZHT on m1.medium instances (1/node)]
Applications
- FusionFS: a distributed file system; metadata stored in ZHT
- IStore: an information dispersal storage system; metadata stored in ZHT
- MATRIX: a distributed many-task computing execution framework; ZHT is used to submit tasks and monitor task execution status
The sketch below illustrates these usage patterns.
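A toy, self-contained illustration of how a key-value store serves these roles; `ZHTClient` here is an in-memory stand-in whose method names mirror ZHT's four operations (insert, lookup, append, remove) but are not its actual API:

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a ZHT client, backed by a local map.
struct ZHTClient {
    std::map<std::string, std::string> kv;  // stands in for the DHT
    bool insert(const std::string& k, const std::string& v) {
        kv[k] = v; return true;
    }
    bool lookup(const std::string& k, std::string& out) {
        auto it = kv.find(k);
        if (it == kv.end()) return false;
        out = it->second; return true;
    }
    bool append(const std::string& k, const std::string& v) {
        kv[k] += v; return true;
    }
    bool remove(const std::string& k) { return kv.erase(k) > 0; }
};

int main() {
    ZHTClient zht;
    // FusionFS/IStore-style use: path as key, metadata blob as value.
    zht.insert("/dir/file", "size=0;owner=li");
    // MATRIX-style use: append task-status updates under a task id.
    zht.append("task-42", "QUEUED;");
    zht.append("task-42", "RUNNING;");
    std::string meta;
    zht.lookup("/dir/file", meta);
    zht.remove("/dir/file");
    return 0;
}
```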
FusionFS result: concurrent file creates
[Figure: time per operation (ms, log scale 1-1000) vs. number of nodes (1-512) for FusionFS and GPFS]
IStore results
[Figure: throughput (chunks/sec, 0-600) vs. scale (# of nodes, 8-32) for file sizes of 10KB, 100KB, 1MB, 10MB, 100MB, and 1GB]
MATRIX results
[Figure: throughput (tasks/sec, 0-6000) vs. number of processors (1-10,000) for MATRIX (BG/P) and Falkon on a Linux cluster (C), SiCortex, BG/P, and a Linux cluster (Java)]
Future work
- Larger scales
- Active failure detection and notification
- Spanning-tree communication
- Network topology-aware routing
- Fully synchronized replicas and membership: Paxos protocol
- Support for more protocols (UDT, MPI, ...)
- Many optimizations
Conclusion
ZHT: a distributed key-value store that is
- Light-weight
- High performance
- Scalable
- Dynamic
- Fault tolerant
- Versatile: works on clusters, clouds, and supercomputers