Efficient Bootstrapping for Decentralised Shared-nothing Key-value Stores
Han Li, Srikumar Venugopal
School of Computer Science and Engineering
Agenda
• Motivations for Node Bootstrapping
• Research Gap
• Challenges and Solutions
• Evaluation
• Conclusion
On-demand Provisioning
The Capacity versus Utilisation Curve
Key-value Stores
• The standard component for cloud data management
• Increasing workload → node bootstrapping
  – Incorporate a new, empty node as a member of the KVS
• Decreasing workload → node decommissioning
  – Remove an existing member, whose data is held redundantly elsewhere, from the KVS
Goals for Efficient Node Bootstrapping
• Minimise the overhead of data movement
  – How to partition/store data?
• Balance the load at node bootstrapping
  – Both data volume and workload
  – How to place/allocate data?
• Maintain data consistency and availability
  – How to execute data movement?
Background: Storage Model
• Shared storage
  – All nodes access the same storage
    • Distributed file systems
    • Network-attached storage
  – E.g. GFS, HDFS
  – Bootstrapping simply exchanges metadata
    • Albatross, by S. Das, UCSB
• Shared nothing
  – Each node uses its own local storage
  – Decentralised, peer-to-peer
  – E.g. Dynamo, Cassandra, Voldemort, etc.
  – Bootstrapping requires data movement
    • Lightweight solutions?
Background: Split-Move Approach
Partition at node bootstrapping
Background: Virtual-Node Approach
Partition at system startup
Data skew: e.g., the majority of data is stored in a minority of partitions. Moving such giant partitions around is expensive.
Research Gap
• Shared storage vs. shared nothing
  – Shared nothing requires data movement
• Centralised vs. decentralised
  – Decentralised stores require peer coordination
• Split-Move vs. virtual-node based
  – Partitioning at node bootstrapping is heavyweight
  – Partitioning at system startup causes data skew
• The gap: a scheme of data partitioning and placement that improves the efficiency of bootstrapping in shared-nothing KVSs
Our Solution
• Virtual-node-based movement
  – Each partition of data is stored in separate files
  – Reduced overhead of data movement
  – Many existing nodes can participate in bootstrapping
• Automatic sharding
  – Split and merge partitions at runtime
  – Each partition stores a bounded volume of data
    • Easy to reallocate data
    • Easy to balance the load
The timing for data partitioning
• Shard partitions at writes (inserts and deletes); see the sketch below
  – Split a partition when it grows too large (maintains Size(Pi) ≤ Θmax)
  – Merge adjacent partitions when they shrink too small (maintains Size(Pi) + Size(Pi+1) ≥ Θmin)
• Require Θmax ≥ 2Θmin to avoid split-merge oscillation
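A minimal sketch (not the authors' code) of how these triggers could be checked on each write, assuming per-partition size bookkeeping; the threshold values and function names are illustrative only:

```python
# Hypothetical sketch of the split/merge triggers described above.
# THETA_MAX >= 2 * THETA_MIN ensures a freshly split (or merged) partition
# does not immediately trigger the opposite operation (no oscillation).
THETA_MIN = 1 * 1024**3   # assumed: 1 GB lower bound for an adjacent pair
THETA_MAX = 2 * 1024**3   # assumed: 2 GB upper bound per partition

def on_write(sizes, i, delta_bytes):
    """Apply a size change to partition i and reshard locally if needed.

    `sizes` is the list of partition sizes (bytes) over contiguous key ranges.
    """
    sizes[i] += delta_bytes
    if sizes[i] > THETA_MAX:
        # Split: replace partition i with two halves of its key range.
        half = sizes[i] // 2
        sizes[i:i + 1] = [half, sizes[i] - half]
    elif i + 1 < len(sizes) and sizes[i] + sizes[i + 1] < THETA_MIN:
        # Merge: combine partition i with its right-hand neighbour.
        sizes[i:i + 2] = [sizes[i] + sizes[i + 1]]
```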
Challenge 1: Sharding coordination
• Issues
  – Totally decentralised
  – Each partition has multiple replicas
  – Each replica is split or merged locally
• Question
  – How to guarantee that all replicas of a given partition are sharded simultaneously?
Challenge 1: Sharding coordination
• Solution: Election-based coordination
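The slides do not spell out the election mechanism; one plausible sketch, assuming every replica knows the partition's replica set and the node with the smallest id is deterministically chosen as coordinator (names and message format are hypothetical), is:

```python
# Hypothetical illustration of election-based sharding coordination.
# Assumption: all replicas independently agree on the same coordinator,
# which then instructs every replica to shard the partition together.
def elect_coordinator(replica_node_ids):
    """Deterministic election: every replica computes the same leader."""
    return min(replica_node_ids)

def propose_split(my_id, replica_node_ids, partition_id, send):
    """If this node is the elected coordinator, ask all replicas to split."""
    if my_id == elect_coordinator(replica_node_ids):
        for node_id in replica_node_ids:
            send(node_id, {"op": "split", "partition": partition_id})
```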
Challenge 2: Node failover during sharding
Challenge 3: Data consistency during sharding
• Use two sets of replicas during sharding (both approaches are sketched below)
  – The original partition and the future partition
  – Data from different partitions is stored in separate files
• Approach 1
  – Write to the future partition; roll back on failure
  – Read from both partitions
• Approach 2
  – Write to both partitions; abandon the future partition on failure
  – Read from the original partition
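A minimal sketch of how the two approaches route reads and writes while a partition is being sharded, with `original` and `future` standing in for the original and future partition stores (illustrative dictionaries, not the authors' API):

```python
# Hypothetical sketch of the two consistency approaches during sharding.

def write_approach_1(original, future, key, value):
    # Approach 1: write only to the future partition.
    # On failure, the future partition is rolled back; the original is untouched.
    future[key] = value

def read_approach_1(original, future, key):
    # New writes live only in the future partition, so reads consult both.
    return future.get(key, original.get(key))

def write_approach_2(original, future, key, value):
    # Approach 2: write to both; on failure the future partition is abandoned.
    original[key] = value
    future[key] = value

def read_approach_2(original, future, key):
    # Reads are served from the original partition only.
    return original.get(key)
```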
Challenge 3: Data consistency during movement
• Use a pair of tokens for each partition (sketched below)
  – A Boolean token to approve or disapprove reads/writes
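The slides do not detail how the token pair is used; one plausible reading, assuming one Boolean token gates reads and the other gates writes while the partition's files are in transit (all names hypothetical), is:

```python
# Hypothetical sketch: a pair of Boolean tokens gating access to a partition
# while its data is being moved between nodes.
class PartitionTokens:
    def __init__(self):
        self.read_approved = True    # cleared if reads cannot be served safely
        self.write_approved = True   # cleared while writes cannot be applied safely

    def begin_move(self):
        # Assumption: writes are briefly disapproved while the final delta
        # is shipped to the destination node; reads keep being served.
        self.write_approved = False

    def finish_move(self):
        self.write_approved = True

def handle_request(tokens, is_write):
    """Reject a request whose token is currently disapproved."""
    if is_write and not tokens.write_approved:
        raise RuntimeError("write rejected: partition is being moved")
    if not is_write and not tokens.read_approved:
        raise RuntimeError("read rejected: partition replica not yet available")
    # otherwise, serve the request against the local partition files
```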
Replica Placement at Node Bootstrap
• Partition re-allocation and sharding are mutually exclusive
• Maintain data availability
  – Each partition has at least R replicas
• Balance the load (e.g., number of requests)
  – Heavily loaded nodes have higher priority to move data out
• Balance the data
  – Balance the number of partitions across nodes
    • Sharding keeps every partition at a similar size
• Two-phase bootstrap (sketched below)
  – Phase 1: guarantee R replicas and shift load away from heavily loaded nodes
  – Phase 2: achieve load and data balancing in low-priority threads
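A simplified sketch of how a joining node could choose partitions in the two phases, assuming per-node load and partition counts are known; the data model and function names are illustrative, not the authors' implementation:

```python
# Hypothetical two-phase bootstrap for a newly joined node.
# Assumed model: each partition is a dict with a "replicas" list of nodes,
# and each node is a dict with "id", "load", and "partition_count".

def phase1_partitions(partitions, replication_factor):
    """Phase 1: restore availability and relieve hot nodes first.

    Orders partitions so that under-replicated ones, and ones held by the
    most heavily loaded nodes, are pulled by the new node first."""
    def priority(p):
        under_replicated = len(p["replicas"]) < replication_factor
        hottest_holder = max((n["load"] for n in p["replicas"]), default=0)
        return (not under_replicated, -hottest_holder)
    return sorted(partitions, key=priority)

def phase2_quota(nodes, new_node_count=1):
    """Phase 2 (low-priority background work): balance partition counts.

    Returns how many partitions each existing node should hand over so that
    counts converge towards the cluster-wide average."""
    total = sum(n["partition_count"] for n in nodes)
    target = total // (len(nodes) + new_node_count)
    return {n["id"]: max(0, n["partition_count"] - target) for n in nodes}
```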
Evaluation Setup
• ElasCass: an implementation of our auto-sharding scheme, built on Apache Cassandra (v1.0.5), which itself uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8 GB RAM)
• Benchmark: YCSB
• System scale: start from 1 node with 100 GB of data and R = 2; scale up to 10 nodes
Evaluation – Bootstrap Time
• In Split-Move, the volume of data transferred reduces by half from 3 nodes onwards.
• In ElasCass, the volume of data transferred remains below 10 GB from 2 nodes onwards.
• Bootstrap time is determined by the volume of data transferred; ElasCass exhibits consistent performance at all scales.
Evaluation – Data Volume
• ElasCass uses the two-phase bootstrap; more data is pulled in during phase 2.
• Imbalance index = standard deviation / average; data is well balanced in ElasCass.
• ElasCass occupies less storage space than the Split-Move approach.
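For concreteness, the imbalance index stated above is simply the coefficient of variation; a tiny illustrative computation (example values are made up, not measurements from the paper):

```python
from statistics import mean, pstdev

def imbalance_index(values):
    """Imbalance index = standard deviation / average (0 = perfectly balanced)."""
    return pstdev(values) / mean(values)

# Example: data volume (GB) stored per node after bootstrapping
print(imbalance_index([9.8, 10.1, 10.3, 9.9]))  # small value -> well balanced
```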
Evaluation – Query Processing
• ElasCass is scalable, while Split-Move is not.
• Write throughput is higher than read throughput.
• ElasCass has better resource utilisation.
• ElasCass achieves balanced load.
Key Takeaways
• Using virtual nodes introduces less overhead in data movement and reduces the bootstrap time to below 10 minutes
  – Apache Cassandra v1.1 uses virtual nodes
• Consolidating the partitions into bounded ranges simplifies replica placement and facilitates load balancing
  – MySQL and MongoDB have started to auto-shard partitions
• A balanced load leads to 80% resource utilisation and throughput that scales with the number of nodes
Contributions and Acknowledgments
• We have designed and implemented a decentralised auto-sharding scheme that
  – consolidates each partition replica into a single transferable unit to provide efficient data movement;
  – automatically shards the partitions into bounded ranges to address data skew;
  – reduces the time to bootstrap nodes, achieves better load balancing, and improves query-processing performance.
• The authors would like to thank Smart Services CRC Pty Ltd for the grant for the Services Aggregation project that made this work possible.
Thank You!