ZHT: A Fast, Reliable and Scalable Zero-hop Distributed Hash Table
Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Zhao Zhang, Ioan Raicu
Illinois Institute of Technology, Chicago, U.S.A.
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
Ken Batcher
Big problem: file system scalability
- Parallel file systems (GPFS, PVFS, Lustre): computing resources separated from storage; centralized metadata management
- Distributed file systems (GFS, HDFS): special-purpose designs (MapReduce, etc.); centralized metadata management
The bottleneck of file systems
- Metadata: concurrent file creates
[Figure: time per operation (ms, log scale 1-100,000) vs. scale (# of cores, 1-16,384) for file creates on GPFS, many directories vs. one directory]
Proposed work
- A distributed hash table (DHT) for high-end computing (HEC)
- A building block for high-performance distributed systems
- Goals: performance (latency and throughput), scalability, and reliability
Related work: Distributed Hash Tables
Many DHTs: Chord, Kademlia, Pastry, Cassandra, C-MPI, Memcached, Dynamo ...
Why another?

Name      | Impl. | Routing time | Persistence | Dynamic membership | Append operation
Cassandra | Java  | log(N)       | Yes         | Yes                | No
C-MPI     | C     | log(N)       | No          | No                 | No
Dynamo    | Java  | 0 to log(N)  | Yes         | Yes                | No
Memcached | C     | 0            | No          | No                 | No
ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes
Zero-hop hash mapping
[Diagram: clients 1...n hash each key directly to its owning node among nodes 1...n; each value (value j, value k) is stored on replicas 1-3]
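A minimal sketch of zero-hop addressing, assuming a client-side membership table indexed by hash value; the names (`membership`, `node_for`, `replica_for`) are illustrative, not ZHT's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative client-side membership table: one "ip:port" per node
// (assumed to be populated before use).
std::vector<std::string> membership;

// Zero-hop routing: the client hashes the key itself and contacts the
// owning node directly, with no per-request hops through other peers.
std::string node_for(const std::string& key) {
    uint64_t h = std::hash<std::string>{}(key);
    return membership[h % membership.size()];
}

// Replica i of a key lives on the i-th successor node.
std::string replica_for(const std::string& key, std::size_t i) {
    uint64_t h = std::hash<std::string>{}(key);
    return membership[(h + i) % membership.size()];
}
```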
2-layer hashing
[Diagram: a key is first hashed into one of the fixed partitions of the name space; the partition is then mapped to the physical node that currently hosts it]
Architecture and terms
- Name space: 2^64
- Physical node: hosts a manager and one or more ZHT instances; each instance holds one or more partitions
- Partitions: n (fixed), with n = max(k)
- The manager handles membership updates and broadcasts; instances respond to requests on their partitions
- Membership table fields: UUID (ZHT instance), key, IP, port, capacity, workload (a sketch of one row follows)
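A sketch of one membership table row, following the fields listed above; the struct layout and field types are assumptions for illustration, not ZHT's actual definitions:

```cpp
#include <cstdint>
#include <string>

// One row of the membership table (field types are assumed).
// Clients resolve a key to a partition in the fixed 2^64 name space,
// then use this table to find the instance hosting that partition.
struct MembershipEntry {
    uint64_t    uuid;      // UUID of the ZHT instance
    uint64_t    key;       // partition's position in the name space
    std::string ip;        // address of the hosting physical node
    uint16_t    port;      // port the instance listens on
    uint64_t    capacity;  // node capacity, usable for balancing
    uint64_t    workload;  // current workload on the node
};
```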
How many partitions per node can we do?
[Figure: average latency (ms, 0.6-0.78) vs. number of partitions per instance (1-1,000)]
Membership management
Static: Memcached, ZHT Dynamic
Logarithmic routing: most of DHTs Constant routing: ZHT
Membership management
- Updating membership: incremental broadcasting
- Remapping key-value pairs:
  - Traditional DHTs: rehash all affected pairs
  - ZHT: move the whole partition, since HEC has a fast local network (see the sketch below)
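A minimal sketch of the contrast, under an assumed in-memory partition type; `migrate_partition` and its bulk hand-off are illustrative, not ZHT's migration code:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

using Partition = std::unordered_map<std::string, std::string>;

// Because the number of partitions is fixed, a membership change never
// re-hashes individual key-value pairs (as a consistent-hashing DHT
// would); the affected partition moves to its new owner as one unit.
void migrate_partition(Partition& departing, Partition& new_owner) {
    // In a real deployment this would be one bulk transfer over the
    // fast HEC interconnect; here we just hand the container over.
    new_owner = std::move(departing);
    departing.clear();
}
```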
Consistency
- Updating membership tables:
  - Planned node joins and leaves: strong consistency
  - Node failures: eventual consistency
- Updating replicas: configurable (see the sketch below)
  - Strong consistency: consistent, reliable
  - Eventual consistency: fast, high availability
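A minimal sketch of the configurable replica update, assuming a hypothetical `send_update` RPC (the stub below stands in for a network round trip); this illustrates the two modes, it is not ZHT's replication code:

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical RPC: push one key/value update to a replica and return
// once that replica acknowledges. Stubbed out for illustration.
bool send_update(const std::string& addr,
                 const std::string& key, const std::string& value) {
    (void)addr; (void)key; (void)value;  // stand-in for a round trip
    return true;
}

// strong == true : wait for every replica (consistent, reliable).
// strong == false: acknowledge after the primary and propagate the
//                  rest asynchronously (fast, favors availability).
bool replicate(const std::vector<std::string>& replicas,
               const std::string& key, const std::string& value,
               bool strong) {
    if (replicas.empty() || !send_update(replicas[0], key, value))
        return false;  // primary copy must succeed
    for (size_t i = 1; i < replicas.size(); ++i) {
        if (strong) {
            if (!send_update(replicas[i], key, value)) return false;
        } else {
            std::thread([addr = replicas[i], key, value] {
                send_update(addr, key, value);  // eventual propagation
            }).detach();
        }
    }
    return true;
}
```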
Persistence: NoVoHT
- NoVoHT: a persistent in-memory hash map (sketched below)
- Append operation
- Live migration
[Figure: latency (microseconds) vs. scale (1 million to 100 million key/value pairs) for NoVoHT, NoVoHT (no persistence), KyotoCabinet, BerkeleyDB, and unordered_map]
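A minimal sketch of a NoVoHT-style store, assuming (this is not NoVoHT's actual design or code) that persistence comes from an append-only operation log replayed at startup; the log format here is purely illustrative:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// In-memory hash map backed by an append-only log (assumed design).
class PersistentMap {
public:
    explicit PersistentMap(const std::string& path)
        : log_(path, std::ios::app) {}

    void put(const std::string& k, const std::string& v) {
        map_[k] = v;
        log_ << "P " << k << ' ' << v << '\n';
        log_.flush();
    }

    // Append: grow a stored value in place, with no read-modify-write
    // of the whole record on disk -- only the delta is logged.
    void append(const std::string& k, const std::string& v) {
        map_[k] += v;
        log_ << "A " << k << ' ' << v << '\n';
        log_.flush();
    }

    const std::string* get(const std::string& k) const {
        auto it = map_.find(k);
        return it == map_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, std::string> map_;
    std::ofstream log_;  // append-only persistence log
};
```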
Failure handling
- Insert and append: send the record to the next replica and mark it as the primary copy
- Lookup: get from the next available replica (sketched below)
- Remove: mark the record on all replicas
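A minimal sketch of the failover lookup path, assuming a hypothetical `remote_lookup` RPC that returns nothing when a replica is unreachable; names and signatures are illustrative:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical RPC: value if the replica answers, nothing otherwise.
std::optional<std::string> remote_lookup(const std::string& addr,
                                         const std::string& key) {
    (void)addr; (void)key;  // stand-in for a network call
    return std::nullopt;
}

// Walk the replica list in order; the first reachable replica that
// holds the key answers the lookup, so single failures are masked.
std::optional<std::string> lookup(const std::vector<std::string>& replicas,
                                  const std::string& key) {
    for (const auto& addr : replicas)
        if (auto v = remote_lookup(addr, key))
            return v;
    return std::nullopt;  // all replicas unreachable
}
```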
Evaluation: test beds
- IBM Blue Gene/P supercomputer: up to 8,192 nodes; 32,768 instances deployed
- Commodity cluster: up to 64 nodes
- Amazon EC2: m1.medium and cc2.8xlarge instances; 96 VMs, 768 ZHT instances deployed
Latency on BG/P
[Figure: latency (ms, 0-2.5) vs. number of nodes for TCP without connection caching, TCP with connection caching, UDP, and Memcached]
Latency distribution on BG/P (percentiles, in microseconds):

Scale | 75% | 90%  | 95%  | 99%
64    | 713 | 853  | 961  | 1259
256   | 755 | 933  | 1097 | 1848
1024  | 820 | 1053 | 1289 | 3105
Throughput on BG/P
[Figure: throughput (ops/s, log scale 1,000-10,000,000) vs. scale (# of nodes, 1-8192) for TCP without connection caching, ZHT with TCP connection caching, non-blocking UDP, and Memcached]
Aggregated throughput on BG/P
[Figure: throughput (ops/s, 0-18,000,000) vs. number of nodes (1-8192) with 1, 2, 4, and 8 instances per node]
Latency on commodity cluster
[Figure: latency (ms, 0-3) vs. scale (# of nodes, 1-64) for ZHT, Cassandra, and Memcached]
ZHT on cloud: latency
[Figure: average latency (microseconds, 0-14,000) vs. node number (1-96) for ZHT on m1.medium instances (1/node), ZHT on cc2.8xlarge instances (8/node), and DynamoDB]
ZHT on cloud: latency distribution
(latency percentiles and averages in microseconds; throughput in ops/s)

DynamoDB, 8 clients/instance:
Scale | 75%   | 90%   | 95%   | 99%   | Avg   | Throughput
8     | 11942 | 13794 | 20491 | 35358 | 12169 | 83.39
32    | 10081 | 11324 | 12448 | 34173 | 9515  | 3363.11
128   | 10735 | 12128 | 16091 | 37009 | 11104 | 11527
512   | 9942  | 13664 | 30960 | 38077 | 28488 | ERROR

ZHT on cc2.8xlarge instances, 8 server-client pairs/instance:
Scale | 75% | 90% | 95% | 99%  | Avg | Throughput
8     | 186 | 199 | 214 | 260  | 172 | 46421
32    | 509 | 603 | 681 | 1114 | 426 | 75080
128   | 588 | 717 | 844 | 2071 | 542 | 236065
512   | 574 | 708 | 865 | 3568 | 608 | 841040

[Figure: latency CDFs (0.9 and 0.99 marked) for DynamoDB read, DynamoDB write, and ZHT at 4-64 nodes]
ZHT on cloud: throughput
[Figure: aggregated throughput (ops/s, log scale 10-10,000,000, left axis) and hourly cost in US dollars (0-25, right axis) vs. node number (1-96); cost series: ZHT on m1, ZHT on cc2, DynamoDB (10K ops/s provisioned)]
Amortized cost
[Figure: hourly cost in US dollars for 1K ops/s of throughput (log scale 0.01-10) vs. scale (2-96 nodes), ZHT on m1.medium instances (1/node)]
Applications
- FusionFS: a distributed file system; metadata stored in ZHT
- IStore: an information dispersal storage system; metadata stored in ZHT
- MATRIX: a distributed many-task computing execution framework; ZHT is used to submit tasks and monitor task execution status
The sketch below illustrates these usage patterns.
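A toy, self-contained illustration of how a key-value store serves these roles; `ZHTClient` here is an in-memory stand-in whose method names mirror ZHT's four operations (insert, lookup, append, remove) but are not its actual API:

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a ZHT client, backed by a local map.
struct ZHTClient {
    std::map<std::string, std::string> kv;  // stands in for the DHT
    bool insert(const std::string& k, const std::string& v) {
        kv[k] = v; return true;
    }
    bool lookup(const std::string& k, std::string& out) {
        auto it = kv.find(k);
        if (it == kv.end()) return false;
        out = it->second; return true;
    }
    bool append(const std::string& k, const std::string& v) {
        kv[k] += v; return true;
    }
    bool remove(const std::string& k) { return kv.erase(k) > 0; }
};

int main() {
    ZHTClient zht;
    // FusionFS/IStore-style use: path as key, metadata blob as value.
    zht.insert("/dir/file", "size=0;owner=li");
    // MATRIX-style use: append task-status updates under a task id.
    zht.append("task-42", "QUEUED;");
    zht.append("task-42", "RUNNING;");
    std::string meta;
    zht.lookup("/dir/file", meta);
    zht.remove("/dir/file");
    return 0;
}
```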
FusionFS result: concurrent file creates
[Figure: time per operation (ms, log scale 1-1000) vs. number of nodes (1-512) for FusionFS and GPFS]
IStore results
[Figure: throughput (chunks/sec, 0-600) vs. scale (# of nodes, 8-32) for file sizes of 10KB, 100KB, 1MB, 10MB, 100MB, and 1GB]
MATRIX results
[Figure: throughput (tasks/sec, 0-6000) vs. number of processors (1-10,000) for MATRIX (BG/P) and Falkon on a Linux cluster (C), SiCortex, BG/P, and a Linux cluster (Java)]
Future work
- Larger scales
- Active failure detection and notification
- Spanning-tree communication
- Network topology-aware routing
- Fully synchronized replicas and membership: Paxos protocol
- Support for more protocols (UDT, MPI, ...)
- Many optimizations
Conclusion
ZHT: a distributed key-value store that is
- Light-weight
- High performance
- Scalable
- Dynamic
- Fault tolerant
- Versatile: works on clusters, clouds, and supercomputers