Date post: | 01-Dec-2014 |
Category: | Technology |
Upload: | accumulo-summit |
Docker + Accumulo = <3
Dynamically Scaling Accumulo with Docker
Cloud Database
Currently there are 2 primary methods for delivering a Database as a Service (DBaaS):
● Virtual Machine: Cloud platforms allow users to purchase virtual machine (VM) instances for a limited time. It is possible to run a database on these virtual machines. Users can either upload their own machine image with a database installed on it, or use ready-made machine images that already include an optimized installation of a database.
● DBaaS: In this configuration, application owners do not have to install and maintain the database on their own. Instead, the database service provider takes responsibility for installing and maintaining the database, and application owners pay according to their usage.
DBaaS
Pros
● All users hit the same API, so upgrades to the backend database should not be noticed by users
● Administration is easier, as administrators only have to maintain a single version of the database
Cons
● Resources can become scarce during load spikes, so analytics cannot be run during heavy query times
Virtualized DB
Pros
● Allows users to take full advantage of the database's features without stomping on each other's resources
● Easier for the underlying database, as no new code needs to be written to support it
Cons
● Resources are wasted on idle databases when a database is not hit very often
● Support staff must be more familiar with different versions of the DB
What is a Linux container?
● Inception: Linux running within Linux
● Provides resource isolation (CPU, network, disk I/O, and RAM) and namespacing
● Looks like a VM from inside the container
● Looks like a process from outside the container
Why Docker?
● Docker provides portable deployment across machines by bundling an application and all its dependencies into a single format
● Docker has a built-in versioning system which is similar to git
● Docker provides component reuse, as any Docker image can be used as a “base image”
● Docker images can be shared through a public repository (http://index.docker.io/)
Networking in Docker
To have two Docker containers talk to each other, simply use the --link option of docker run:
sudo docker run -d -P --name web --link db:db training/webapp python app.py
This command creates a secure tunnel between the two containers without having them expose ports to the host. Docker passes the connection details to the linked container in two ways:
● Using environment variables
● Updating /etc/hosts
Links only work between containers that are running on the same host
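As a rough illustration of the environment-variable mechanism, the sketch below reads the address and port of a linked container from the variables that Docker links set (of the form ALIAS_PORT_<port>_TCP_ADDR and ALIAS_PORT_<port>_TCP_PORT). The helper name link_endpoint, the port 5432 and the sample address are assumptions made up for this example, not part of the original deck.

```python
import os


def link_endpoint(alias, port, proto="tcp", environ=None):
    """Return (address, port) of a linked container from its link variables.

    alias is the link alias (the part after the colon in --link name:alias).
    environ defaults to the real process environment; a dict can be passed
    instead, which is convenient for testing outside a container.
    """
    env = os.environ if environ is None else environ
    prefix = "{}_PORT_{}_{}".format(alias.upper(), port, proto.upper())
    addr = env.get(prefix + "_ADDR")
    linked_port = env.get(prefix + "_PORT")
    if addr is None or linked_port is None:
        raise KeyError("no link variables found for prefix " + prefix)
    return addr, int(linked_port)


if __name__ == "__main__":
    # Sample values only; inside a linked container Docker sets these itself.
    sample = {
        "DB_NAME": "/web/db",
        "DB_PORT_5432_TCP_ADDR": "172.17.0.5",
        "DB_PORT_5432_TCP_PORT": "5432",
    }
    print(link_endpoint("db", 5432, environ=sample))
```

An application in the web container could call link_endpoint("db", 5432) at startup rather than hard-coding the database address.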
MultiHost Networking
There are numerous ways to link Docker containers across multiple hosts, including VPNs and bridges. The method we will apply is the Docker ambassador pattern. An ambassador is a container that sits between two containers and takes care of the communication between them, so that containers can move between hosts and the application never has to restart on a change; only the ambassador container does. It looks like:
(consumer) --> (redis-ambassador) ---network---> (redis-ambassador) --> (redis)
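To make the ambassador idea concrete, here is a minimal sketch of what an ambassador does at the socket level: it accepts connections on a local address and relays the bytes to an upstream address. This is only an illustration written with the Python standard library, not the tool used in the talk. The names start_ambassador, handle and pipe are hypothetical.

```python
import socket
import threading


def pipe(src, dst):
    """Copy bytes from src to dst until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client, upstream_addr):
    """Relay one client connection to the upstream address in both directions."""
    try:
        upstream = socket.create_connection(upstream_addr)
    except OSError:
        client.close()
        return
    forward = threading.Thread(target=pipe, args=(client, upstream), daemon=True)
    forward.start()
    pipe(upstream, client)
    forward.join()
    client.close()
    upstream.close()


def start_ambassador(listen_addr, upstream_addr):
    """Listen on listen_addr and forward every connection to upstream_addr.

    Returns the listening socket; closing it stops the accept loop.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(listen_addr)
    server.listen()

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                break  # listening socket was closed
            threading.Thread(
                target=handle, args=(client, upstream_addr), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

The consumer always connects to its local ambassador. When the real service moves to another host, only the ambassador is restarted with a new upstream_addr, and the consumer's configuration stays the same.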
Accumulo on Docker
● Inspired by Slider (formerly Hoya) to allow spinning up databases on the fly (check out: https://github.com/apache/incubator-slider)
● Also inspired by https://bitbucket.org/fourtwosix/bdaassrc
● Allows users to spin up their own Accumulo for their own application while sharing a global HDFS and ZooKeeper
● Currently allows scaling by adding more tablet servers manually
Advantages
● Security between data is simpler, as users literally have their own copy of the database; users do not risk data leaks by sharing the same system (nobody knows what a particular iterator is going to spit out in a log)
● Monitoring: a CIO/Program Director can easily monitor which databases are getting hit the hardest and make sure that those DBs are allocated more hardware
Advantages (ctd)
● Docker makes it trivial to back up snapshots of the running database servers
● Allows users to configure the database however they choose, which means users can have their own scheme for how they want to do compactions
● Users can figure out the peak times at which their data is being hit and can schedule analytic jobs to run during off-peak hours
Advantages (ctd)
● Allows applications to maintain different versions of their databases, and with the work on ACCUMULO-378 users can possibly create two copies of their databases: one for analytics and one for real-time query
● Lets administrators kill off application databases which are behaving badly without having to affect all the applications running on the system
Disadvantages
● Administrators potentially have to understand different versions of the database, as they can coexist on the system
● With the HDFS permission scheme, user and group management becomes difficult
● Port allocation becomes a bit tricky and iptables rules may become a bit unruly
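One way to keep port allocation predictable is to reserve a fixed block of host ports per database instance. The sketch below is an assumption-laden illustration, not the scheme used in the talk. The container-side ports are the Accumulo defaults as commonly documented and should be verified for your version. The base port 20000, the block size and the function names are invented for this example.

```python
# Assumed container-side ports (commonly documented Accumulo defaults;
# verify them against your own configuration before relying on them).
SERVICE_PORTS = {
    "master": 9999,
    "tserver": 9997,
    "gc": 50091,
    "monitor": 50095,
    "tracer": 12234,
}


def allocate_ports(instance_index, base=20000, block_size=10):
    """Map each service to a host port inside a block reserved for one instance."""
    if len(SERVICE_PORTS) > block_size:
        raise ValueError("block_size is too small for the number of services")
    start = base + instance_index * block_size
    if instance_index < 0 or start + block_size - 1 > 65535:
        raise ValueError("instance_index is outside the usable port range")
    # Sorting the service names keeps the offsets stable between runs.
    return {svc: start + offset for offset, svc in enumerate(sorted(SERVICE_PORTS))}


def publish_flags(instance_index, service):
    """Build the docker run -p host:container arguments for one service."""
    host_port = allocate_ports(instance_index)[service]
    return ["-p", "{}:{}".format(host_port, SERVICE_PORTS[service])]


if __name__ == "__main__":
    print(allocate_ports(0))
    print(publish_flags(2, "tserver"))
```

Because each instance owns a contiguous range, a firewall rule can cover the whole block for an instance instead of listing every port separately.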
Accumulo: Built To Scale
With Accumulo 1.5 and beyond, even more smarts have been built in to make scaling easier
● Load balancing built into the master to recognize when a tablet server dies
● Iterators are now stored in HDFS so they do not have to be pushed to every machine
● Accumulo allows for multiple masters which makes failover better
● The Accumulo WAL is stored in HDFS, which also makes it easier to scale, as tablet servers can read the logs from one location in HDFS
Future Improvements
● Tie this into Slider: allow YARN to be the resource allocator and Docker to be the container that things are deployed to (see YARN-1964)
● Use the JMX statistics that the Accumulo monitor gets to dynamically scale tablet servers up and down based on load
● Add container creation and deletion via Ambari and Cloudera Manager
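The load-based scaling idea could look roughly like the sketch below. It assumes that some collector (not shown) has already turned the monitor's statistics into one load percentage per tablet server. The metric, the thresholds and the function name desired_tservers are hypothetical choices for illustration.

```python
def desired_tservers(loads_pct, target=60, low=30, high=80,
                     min_servers=1, max_servers=20):
    """Suggest how many tablet servers to run.

    loads_pct holds one integer load percentage (0-100) per running tablet
    server. While the average load stays between low and high, the current
    count is kept, which avoids flapping. Otherwise the count is chosen so
    that the same total load would average out to roughly target.
    """
    current = len(loads_pct)
    if current == 0:
        return min_servers
    total = sum(loads_pct)
    if low * current <= total <= high * current:
        wanted = current  # inside the dead band: leave the cluster alone
    else:
        wanted = (total + target - 1) // target  # integer ceiling of total/target
    return max(min_servers, min(max_servers, wanted))


if __name__ == "__main__":
    print(desired_tservers([50, 60, 70, 60]))  # average inside the band
    print(desired_tservers([90, 90, 90, 90]))  # overloaded: scale up
    print(desired_tservers([10, 20, 10, 20]))  # mostly idle: scale down
```

A controller would compare the returned number with the running container count and start or stop tablet server containers to close the gap.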
Future Improvements (ctd)
● Make a GUI to make deploying containers and health monitoring of the system easier
● Make a GUI to view system health and to see what databases are deployed onto the system
● Add HBase support
Questions (???)