Efficient Bootstrapping for Decentralised Shared-nothing Key-value Stores
Han Li, Srikumar Venugopal
School of Computer Science and Engineering
Agenda
• Motivations for Node Bootstrapping
• Research Gap
• Challenges and Solutions
• Evaluation
• Conclusion
On-demand Provisioning
The Capacity versus Utilisation Curve
Key-value Stores
• The standard component for cloud data management
• Increasing workload → node bootstrapping
  – Incorporate a new, empty node as a member of the KVS
• Decreasing workload → node decommissioning
  – Remove an existing member, whose data is held redundantly elsewhere, from the KVS
Goals for Efficient Node Bootstrapping
• Minimise the overhead of data movement
  – How to partition/store data?
• Balance the load at node bootstrapping
  – Both data volume and workload
  – How to place/allocate data?
• Maintain data consistency and availability
  – How to execute data movement?
Background: Storage Model
• Shared storage
  – All nodes access the same storage
    • Distributed file systems
    • Network-attached storage
  – E.g. GFS, HDFS
  – Bootstrapping simply exchanges metadata
    • Albatross, by S. Das, UCSB
• Shared nothing
  – Each node uses its own local storage
  – Decentralised, peer-to-peer
  – E.g. Dynamo, Cassandra, Voldemort, etc.
  – Bootstrapping requires data movement
    • Lightweight solutions?
Background: Split-Move Approach
Partition at node bootstrapping
Background: Virtual-Node Approach
Partition at system startup
Data skew: e.g., the majority of data is stored in a minority of partitions. Moving such giant partitions around is expensive.
Research Gap
• Shared storage vs. shared nothing
  – Shared nothing requires data movement
• Centralised vs. decentralised
  – Decentralised stores require peer coordination
• Split-Move vs. virtual-node based
  – Partitioning at node bootstrapping is heavyweight
  – Partitioning at system startup causes data skew
• The gap: a scheme of data partitioning and placement that improves the efficiency of bootstrapping in shared-nothing KVSs
Our Solution
• Virtual-node-based movement
  – Each partition of data is stored in separate files
  – Reduced overhead of data movement
  – Many existing nodes can participate in bootstrapping
• Automatic sharding
  – Split and merge partitions at runtime
  – Each partition stores a bounded volume of data
    • Easy to reallocate data
    • Easy to balance the load
The timing for data partitioning
• Shard partitions at writes (inserts and deletes); see the sketch below
  – Split a partition when it grows too large (maintains Size(Pi) ≤ Θmax)
  – Merge adjacent partitions when they shrink too small (maintains Size(Pi) + Size(Pi+1) ≥ Θmin)
• Require Θmax ≥ 2Θmin to avoid split-merge oscillation
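A minimal sketch (not the authors' code) of how these triggers could be checked on each write, assuming per-partition size bookkeeping; the threshold values and function names are illustrative only:

```python
# Hypothetical sketch of the split/merge triggers described above.
# THETA_MAX >= 2 * THETA_MIN ensures a freshly split (or merged) partition
# does not immediately trigger the opposite operation (no oscillation).
THETA_MIN = 1 * 1024**3   # assumed: 1 GB lower bound for an adjacent pair
THETA_MAX = 2 * 1024**3   # assumed: 2 GB upper bound per partition

def on_write(sizes, i, delta_bytes):
    """Apply a size change to partition i and reshard locally if needed.

    `sizes` is the list of partition sizes (bytes) over contiguous key ranges.
    """
    sizes[i] += delta_bytes
    if sizes[i] > THETA_MAX:
        # Split: replace partition i with two halves of its key range.
        half = sizes[i] // 2
        sizes[i:i + 1] = [half, sizes[i] - half]
    elif i + 1 < len(sizes) and sizes[i] + sizes[i + 1] < THETA_MIN:
        # Merge: combine partition i with its right-hand neighbour.
        sizes[i:i + 2] = [sizes[i] + sizes[i + 1]]
```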
Challenge 1: Sharding coordination
• Issues
  – Totally decentralised
  – Each partition has multiple replicas
  – Each replica is split or merged locally
• Question
  – How to guarantee that all replicas of a given partition are sharded simultaneously?
Challenge 1: Sharding coordination
• Solution: Election-based coordination
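The slides do not spell out the election mechanism; one plausible sketch, assuming every replica knows the partition's replica set and the node with the smallest id is deterministically chosen as coordinator (names and message format are hypothetical), is:

```python
# Hypothetical illustration of election-based sharding coordination.
# Assumption: all replicas independently agree on the same coordinator,
# which then instructs every replica to shard the partition together.
def elect_coordinator(replica_node_ids):
    """Deterministic election: every replica computes the same leader."""
    return min(replica_node_ids)

def propose_split(my_id, replica_node_ids, partition_id, send):
    """If this node is the elected coordinator, ask all replicas to split."""
    if my_id == elect_coordinator(replica_node_ids):
        for node_id in replica_node_ids:
            send(node_id, {"op": "split", "partition": partition_id})
```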
Challenge 2: Node failover during sharding
Challenge 3: Data consistency during sharding
• Use two sets of replicas during sharding (both approaches are sketched below)
  – The original partition and the future partition
  – Data from different partitions is stored in separate files
• Approach 1
  – Write to the future partition; roll back on failure
  – Read from both partitions
• Approach 2
  – Write to both partitions; abandon the future partition on failure
  – Read from the original partition
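A minimal sketch of how the two approaches route reads and writes while a partition is being sharded, with `original` and `future` standing in for the original and future partition stores (illustrative dictionaries, not the authors' API):

```python
# Hypothetical sketch of the two consistency approaches during sharding.

def write_approach_1(original, future, key, value):
    # Approach 1: write only to the future partition.
    # On failure, the future partition is rolled back; the original is untouched.
    future[key] = value

def read_approach_1(original, future, key):
    # New writes live only in the future partition, so reads consult both.
    return future.get(key, original.get(key))

def write_approach_2(original, future, key, value):
    # Approach 2: write to both; on failure the future partition is abandoned.
    original[key] = value
    future[key] = value

def read_approach_2(original, future, key):
    # Reads are served from the original partition only.
    return original.get(key)
```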
Challenge 3: Data consistency during movement
• Use a pair of tokens for each partition (sketched below)
  – A Boolean token to approve or disapprove reads/writes
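The slides do not detail how the token pair is used; one plausible reading, assuming one Boolean token gates reads and the other gates writes while the partition's files are in transit (all names hypothetical), is:

```python
# Hypothetical sketch: a pair of Boolean tokens gating access to a partition
# while its data is being moved between nodes.
class PartitionTokens:
    def __init__(self):
        self.read_approved = True    # cleared if reads cannot be served safely
        self.write_approved = True   # cleared while writes cannot be applied safely

    def begin_move(self):
        # Assumption: writes are briefly disapproved while the final delta
        # is shipped to the destination node; reads keep being served.
        self.write_approved = False

    def finish_move(self):
        self.write_approved = True

def handle_request(tokens, is_write):
    """Reject a request whose token is currently disapproved."""
    if is_write and not tokens.write_approved:
        raise RuntimeError("write rejected: partition is being moved")
    if not is_write and not tokens.read_approved:
        raise RuntimeError("read rejected: partition replica not yet available")
    # otherwise, serve the request against the local partition files
```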
Replica Placement at Node Bootstrap
• Partition re-allocation and sharding are mutually exclusive
• Maintain data availability
  – Each partition has at least R replicas
• Balance the load (e.g., number of requests)
  – Heavily loaded nodes have higher priority to move data out
• Balance the data
  – Balance the number of partitions across nodes
    • Sharding keeps every partition at a similar size
• Two-phase bootstrap (sketched below)
  – Phase 1: guarantee R replicas and shift load away from heavily loaded nodes
  – Phase 2: achieve load and data balancing in low-priority threads
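A simplified sketch of how a joining node could choose partitions in the two phases, assuming per-node load and partition counts are known; the data model and function names are illustrative, not the authors' implementation:

```python
# Hypothetical two-phase bootstrap for a newly joined node.
# Assumed model: each partition is a dict with a "replicas" list of nodes,
# and each node is a dict with "id", "load", and "partition_count".

def phase1_partitions(partitions, replication_factor):
    """Phase 1: restore availability and relieve hot nodes first.

    Orders partitions so that under-replicated ones, and ones held by the
    most heavily loaded nodes, are pulled by the new node first."""
    def priority(p):
        under_replicated = len(p["replicas"]) < replication_factor
        hottest_holder = max((n["load"] for n in p["replicas"]), default=0)
        return (not under_replicated, -hottest_holder)
    return sorted(partitions, key=priority)

def phase2_quota(nodes, new_node_count=1):
    """Phase 2 (low-priority background work): balance partition counts.

    Returns how many partitions each existing node should hand over so that
    counts converge towards the cluster-wide average."""
    total = sum(n["partition_count"] for n in nodes)
    target = total // (len(nodes) + new_node_count)
    return {n["id"]: max(0, n["partition_count"] - target) for n in nodes}
```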
Evaluation Setup
• ElasCass: an implementation of our auto-sharding scheme, built on Apache Cassandra (v1.0.5), which itself uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8 GB RAM)
• Benchmark: YCSB
• System scale: start from 1 node with 100 GB of data and R = 2; scale up to 10 nodes
Evaluation – Bootstrap Time
• In Split-Move, the volume of data transferred reduces by half from 3 nodes onwards.
• In ElasCass, the volume of data transferred remains below 10 GB from 2 nodes onwards.
• Bootstrap time is determined by the volume of data transferred; ElasCass exhibits consistent performance at all scales.
Evaluation – Data Volume
• ElasCass uses the two-phase bootstrap; more data is pulled in during phase 2.
• Imbalance index = standard deviation / average; data is well balanced in ElasCass.
• ElasCass occupies less storage space than the Split-Move approach.
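For concreteness, the imbalance index stated above is simply the coefficient of variation; a tiny illustrative computation (example values are made up, not measurements from the paper):

```python
from statistics import mean, pstdev

def imbalance_index(values):
    """Imbalance index = standard deviation / average (0 = perfectly balanced)."""
    return pstdev(values) / mean(values)

# Example: data volume (GB) stored per node after bootstrapping
print(imbalance_index([9.8, 10.1, 10.3, 9.9]))  # small value -> well balanced
```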
Evaluation – Query Processing
• ElasCass is scalable, while Split-Move is not.
• Write throughput is higher than read throughput.
• ElasCass has better resource utilisation.
• ElasCass achieves balanced load.
Key Takeaways
• Using virtual nodes introduces less overhead in data movement and reduces the bootstrap time to below 10 minutes
  – Apache Cassandra v1.1 uses virtual nodes
• Consolidating the partitions into bounded ranges simplifies replica placement and facilitates load balancing
  – MySQL and MongoDB have started to auto-shard partitions
• A balanced load leads to 80% resource utilisation and throughput that scales with the number of nodes
Contributions and Acknowledgments
• We have designed and implemented a decentralised auto-sharding scheme that
  – consolidates each partition replica into a single transferable unit to provide efficient data movement;
  – automatically shards the partitions into bounded ranges to address data skew;
  – reduces the time to bootstrap nodes, achieves better load balancing, and improves query-processing performance.
• The authors would like to thank Smart Services CRC Pty Ltd for the grant for the Services Aggregation project that made this work possible.
Thank You!