Apache Accumulo Overview
Bill Havanki
Solutions Architect, Cloudera Government Solutions
©2014 Cloudera, Inc. All rights reserved.
Agenda
• Quick History
• Storage Model
• Loading and Querying
• Daemons
• Getting Started, a.k.a. the Pitch
A Quick History
Google BigTable
Compressed, high-performance, scalable, distributed sorted map
Google BigTable
• Began development in 2004
• Built on Google File System
• Non-relational
• Byte-oriented and schemaless
• Stores data in the petabyte range
• Research paper published in 2006
Child(ren) of BigTable
• Apache HBase (begun 2006, top-level 2010)
• Apache Cassandra (begun 2008-ish, top-level 2010)
• Apache Accumulo ...
From Cloudbase to Accumulo
• Started in 2008 as a National Security Agency project
• Submitted to Apache Incubator in 2011 (and renamed)
• Top-level project in 2012
Storage Model
Key / Value Store
Accumulo stores tables of key / value pairs
Key / Value Store
A row is a sorted sequence of key / value pairs
Each pair is a cell
The Key
row
column: family, qualifier, visibility
timestamp
An example key
row: bhavanki
column: family personal, qualifier middle, visibility PII
timestamp: 1401041295
Another example key
row: brees
column: family employment, qualifier salary, visibility FIN
timestamp: 1401041296
It’s all bytes
All key and value data are stored as bytes, except the timestamp, which is a long
There are no built-in data types, but lexicoders help with common types
Key components are usually UTF-8 strings
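Because keys are compared as raw bytes, typed values need an encoding whose byte order matches their natural order. The following is a minimal sketch of that idea for a signed long. It is not Accumulo's own lexicoder implementation, and the class and method names are hypothetical.

```java
public class LongEncodingSketch {
    // Encode a signed long as 8 big-endian bytes with the sign bit flipped,
    // so that unsigned byte-by-byte comparison matches numeric order.
    static byte[] encode(long v) {
        long flipped = v ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) flipped;
            flipped >>>= 8;
        }
        return out;
    }

    // Reverse of encode: reassemble the big-endian bytes, then flip the sign bit back.
    static long decode(byte[] b) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (b[i] & 0xFF);
        }
        return v ^ Long.MIN_VALUE;
    }

    // Compare two byte arrays as unsigned bytes, the way a sorted byte store would.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // A negative number sorts before a positive one after encoding.
        System.out.println(compareBytes(encode(-5L), encode(3L)) < 0);
        // Encoding round-trips.
        System.out.println(decode(encode(1401041295L)));
    }
}
```

Without the sign-bit flip, a plain two's-complement encoding would sort negative numbers after positive ones.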
Some rows for you
row       cf        cq        cv       ts          value
bhavanki  job       employer           2013-09-01  Cloudera
bhavanki  personal  beer               2013-09-15  Omission
bhavanki  personal  house     NOMUGGL  2014-01-25  Ravenclaw
brees     job       employer           2013-10-01  White Cliffs
brees     personal  house     NOMUGGL  2014-01-01  Hufflepuff
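The ordering of the rows above can be sketched as a comparator: row, family, qualifier, and visibility ascending, then timestamp descending so the newest version of a cell comes first. This is an illustration only. SimpleKey is a hypothetical stand-in that uses Strings where real keys hold bytes, and the timestamps are made-up small numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class KeyOrderSketch {
    // Hypothetical stand-in for a key; not the Accumulo Key class.
    static final class SimpleKey {
        final String row, family, qualifier, visibility;
        final long timestamp;

        SimpleKey(String row, String family, String qualifier, String visibility, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.visibility = visibility;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + " " + family + ":" + qualifier + " [" + visibility + "] " + timestamp;
        }
    }

    // String.compareTo matches byte order for the ASCII data used here.
    static final Comparator<SimpleKey> ORDER = new Comparator<SimpleKey>() {
        @Override
        public int compare(SimpleKey a, SimpleKey b) {
            int c = a.row.compareTo(b.row);
            if (c == 0) c = a.family.compareTo(b.family);
            if (c == 0) c = a.qualifier.compareTo(b.qualifier);
            if (c == 0) c = a.visibility.compareTo(b.visibility);
            if (c == 0) c = Long.compare(b.timestamp, a.timestamp); // newest first
            return c;
        }
    };

    // Sorts a few unsorted sample keys and returns them as printable lines.
    static List<String> sortedSample() {
        List<SimpleKey> keys = new ArrayList<SimpleKey>(Arrays.asList(
            new SimpleKey("brees", "job", "employer", "", 1L),
            new SimpleKey("bhavanki", "personal", "house", "NOMUGGL", 3L),
            new SimpleKey("bhavanki", "job", "employer", "", 1L),
            new SimpleKey("bhavanki", "job", "employer", "", 2L)));
        Collections.sort(keys, ORDER);
        List<String> out = new ArrayList<String>();
        for (SimpleKey k : keys) out.add(k.toString());
        return out;
    }

    public static void main(String[] args) {
        for (String line : sortedSample()) System.out.println(line);
    }
}
```

The two versions of bhavanki's job:employer cell end up adjacent, newer one first, which is what lets a reader stop at the first version it finds.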
Visibility Labels
Boolean expression
Specialist | (Management & SpecTraining)
Authorizations are provided in each scan
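To make the label expression concrete, here is a small sketch of how such a Boolean expression could be evaluated against a set of authorizations. It is not Accumulo's evaluator. VisibilitySketch and canSee are hypothetical names, and the sketch assumes tokens are letters, digits, or underscores and that mixing & and | at one level requires parentheses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    // Returns true if the given authorizations satisfy the label expression.
    static boolean canSee(String label, String... auths) {
        if (label.trim().isEmpty()) return true; // an empty label is visible to everyone
        VisibilitySketch p = new VisibilitySketch(label, new HashSet<String>(Arrays.asList(auths)));
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return result;
    }

    // expr := term ('&' term)* | term ('|' term)*
    private boolean parseExpr() {
        boolean value = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char c = expr.charAt(pos++);
            if (op != 0 && op != c) {
                throw new IllegalArgumentException("mixing & and | needs parentheses");
            }
            op = c;
            boolean rhs = parseTerm(); // always parse, so the position advances
            value = (c == '&') ? (value && rhs) : (value || rhs);
        }
        return value;
    }

    // term := token | '(' expr ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean v = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return v;
        }
        int start = pos;
        while (pos < expr.length()
                && (Character.isLetterOrDigit(expr.charAt(pos)) || expr.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("expected token at position " + pos);
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        String label = "Specialist | (Management & SpecTraining)";
        System.out.println(canSee(label, "Management", "SpecTraining"));
        System.out.println(canSee(label, "Management"));
    }
}
```

With the label from the slide, a scan carrying Management and SpecTraining sees the cell, while a scan carrying only Management does not.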
Locality Groups
You can identify sets of one or more column families as locality groups
Data in a locality group is stored together for improved read performance
Tablets
A table comprises one or more tablets
[Diagram: table employees split into tablets employees;H, employees;S, and employees;~]
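One way to picture how a row finds its tablet is a sorted map keyed by each tablet's end row. The sketch below assumes the suffixes H, S, and ~ in the diagram are inclusive end rows, with ~ standing in for the last tablet. The class name, method name, and row IDs are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class TabletLookupSketch {
    // End row (inclusive) -> tablet name, kept in sorted order.
    private static final TreeMap<String, String> TABLETS = new TreeMap<String, String>();
    static {
        TABLETS.put("H", "employees;H");
        TABLETS.put("S", "employees;S");
        TABLETS.put("~", "employees;~");
    }

    // The owning tablet is the one with the smallest end row >= the requested row.
    static String tabletFor(String row) {
        Map.Entry<String, String> e = TABLETS.ceilingEntry(row);
        if (e == null) {
            throw new IllegalArgumentException("row sorts after the last tablet: " + row);
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        System.out.println(tabletFor("Alice"));
        System.out.println(tabletFor("Mallory"));
        System.out.println(tabletFor("bhavanki"));
    }
}
```

Note that lowercase rows such as bhavanki sort after every uppercase letter in byte order, so in this sketch they land in the last tablet.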
Tablets
Tablets map to data files in HDFS
[Diagram: employees;H -> rfile 1, employees;S -> rfile 2, employees;~ -> rfile 3]
Tablets
Data is also kept in write-ahead logs and the memtable
[Diagram: tablet employees;H backed by rfile 1, walogs, and the memtable]
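The relationship between the three storage pieces can be sketched as a toy write path: log first, then update the sorted in-memory map, and periodically flush that map to an immutable sorted file. This is a simplified model, not Accumulo code. All names are hypothetical, and real reads merge the sources through iterators rather than checking them one by one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TabletWritePathSketch {
    final List<String[]> walog = new ArrayList<String[]>();                 // append-only log
    final TreeMap<String, String> memtable = new TreeMap<String, String>(); // sorted, in memory
    final List<TreeMap<String, String>> rfiles =
        new ArrayList<TreeMap<String, String>>();                           // flushed files, oldest first

    // Writes go to the log first for durability, then to the memtable.
    void write(String key, String value) {
        walog.add(new String[] {key, value});
        memtable.put(key, value);
    }

    // Flush the memtable to a new sorted file; in this toy model the
    // logged entries are then no longer needed and are dropped right away.
    void flush() {
        if (memtable.isEmpty()) return;
        rfiles.add(new TreeMap<String, String>(memtable));
        memtable.clear();
        walog.clear();
    }

    // Reads check the memtable first, then files from newest to oldest.
    String read(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (int i = rfiles.size() - 1; i >= 0; i--) {
            v = rfiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    // After a crash the memtable is gone; replaying the log rebuilds it.
    void recover() {
        memtable.clear();
        for (String[] entry : walog) memtable.put(entry[0], entry[1]);
    }

    public static void main(String[] args) {
        TabletWritePathSketch tablet = new TabletWritePathSketch();
        tablet.write("bhavanki", "Cloudera");
        tablet.flush();
        tablet.write("brees", "White Cliffs");
        tablet.memtable.clear(); // simulate losing in-memory state
        tablet.recover();
        System.out.println(tablet.read("bhavanki") + ", " + tablet.read("brees"));
    }
}
```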
Loading and Querying
Java Client API
Java Client API
Read using scanners
// conn is an existing Connector; employeeIds is a Collection<Text>
Scanner s = conn.createScanner("employees", new Authorizations());
s.setRange(new Range("alice", "eve"));
s.fetchColumnFamily(new Text("personal"));
for (Entry<Key, Value> e : s)
    employeeIds.add(e.getKey().getRow());
Java Client API
Read access via iterator pattern
• server-side system iterators handle timestamps, authorization checks, and lots more
• iterators almost always wrap other iterators, forming a chain
• you can define your own, client-side or server-side
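The chaining idea can be illustrated with plain java.util.Iterator wrappers. Accumulo's own iterator interface (SortedKeyValueIterator) has more to it, so treat this as a sketch of the wrapping pattern only. The class names, filters, and sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

public class IteratorChainSketch {
    // Wraps a source iterator and passes through only the entries that match.
    static final class FilteringIterator implements Iterator<String[]> {
        private final Iterator<String[]> source;
        private final Predicate<String[]> keep;
        private String[] pending; // next matching entry, if already found

        FilteringIterator(Iterator<String[]> source, Predicate<String[]> keep) {
            this.source = source;
            this.keep = keep;
        }

        @Override
        public boolean hasNext() {
            while (pending == null && source.hasNext()) {
                String[] candidate = source.next();
                if (keep.test(candidate)) pending = candidate;
            }
            return pending != null;
        }

        @Override
        public String[] next() {
            if (!hasNext()) throw new NoSuchElementException();
            String[] result = pending;
            pending = null;
            return result;
        }
    }

    // Hypothetical sorted entries: {row, visibility label}.
    static final List<String[]> DATA = Arrays.asList(
        new String[] {"bhavanki", "FIN"},
        new String[] {"bhavanki", "PII"},
        new String[] {"brees", "FIN"},
        new String[] {"brees", "PII"});

    // Builds a two-stage chain: a label filter wrapped by a row-prefix filter.
    static List<String> scan(String label, String rowPrefix) {
        Iterator<String[]> chain = DATA.iterator();
        chain = new FilteringIterator(chain, e -> e[1].equals(label));
        chain = new FilteringIterator(chain, e -> e[0].startsWith(rowPrefix));
        List<String> rows = new ArrayList<String>();
        while (chain.hasNext()) rows.add(chain.next()[0]);
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(scan("FIN", "br"));
    }
}
```

Each wrapper only knows about the iterator directly beneath it, which is what makes it easy to add or remove a stage without touching the others.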
Java Client API
Scanners fetch sorted rows from one range
Batch scanners fetch unsorted rows from multiple ranges in parallel
Isolated scanners ensure that you do not see a row mid-change
MapReduce
AccumuloInputFormat
AccumuloOutputFormat
MapReduce
AccumuloRowInputFormat
AccumuloRowOutputFormat
Shell
Command-line / manual access to Accumulo data
• scan, insert, delete
• iterator management
• table management (creation, deletion, cloning)
• user and authorization management
• table splitting and merging
• ... more
Bulk Import
Got lots of data to import quickly?
• Use an MR job to format data using AccumuloFileOutputFormat
• Import files using shell
Trade off latency / availability for throughput
Daemons
Tablet Server
Serves tablets (table data)
• writes data to walog, memtable; deals with compaction
• serves data for reads from files, memtable
• handles recovery from walogs in case of server failure
Most client calls go to tablet servers
Master
• assigns tablets to tablet servers
• detects tablet server failures and reassigns tablets
• balances tablet assignments over time
• coordinates table operations
Multiple masters are supported for failover, but only one is active at a time
Everybody Else in Accumulo
Garbage Collector (GC) - identifies and deletes files in HDFS that are no longer needed
Tracer - listens for and stores distributed trace messages using a special table
Everybody Else in Accumulo
• Monitor - collects and serves status information
  • server status
  • log inspection
  • performance data
  • table inspection
Everybody Else outside Accumulo
• HDFS (as part of Apache Hadoop)
  • stores tablet files
  • stores write-ahead logs (1.5+)
• MapReduce (Hadoop)
  • bulk import
  • batch processing
• Apache ZooKeeper
Getting Started, a.k.a. the Pitch
Easy as 1-2-3?
1. Install Hadoop (HDFS and MapReduce)
2. Install ZooKeeper
3. Install Accumulo!
Making Steps 1 and 2 Easier
Use a complete, pre-packaged Hadoop distribution... like CDH!
a leading commercial distribution centered on Apache Hadoop
• many ecosystem components
• configured / updated to work together
Making Steps 1 and 2 Easier
Cloudera Manager
• deployment
• configuration
• operation
• security
Making Step 3 Easier
Standard Apache Accumulo installation is via tarball
• no longer shipping RPM / DEB / ...
Using CDH/CM you can use:
• a tarball, RPM or DEB with Accumulo packaged for CDH
• a parcel (like RPM / ZIP) for easier upgrades
• 1.4.4 and 1.4.5 available now
• 1.6.0 soon
Where to Go for More
• http://accumulo.apache.org/
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/accumulo.html
Accumulo Summit
Join us on June 12
Quick Thanks
• My slide reviewers
  • Sean Busbey
  • Mike Drob
• Accumulo community
• You all for listening