Securely explore your data
WHAT'S NEXT FOR BIGTABLE?
Adam Fuchs, CTO Sqrrl Data, Inc. May 22, 2014
TODAY’S TALK
• History of the World: Part 3
• Bigtable/Accumulo Technology Overview
• Accumulo Demonstration
• Database Technology Survey
© 2014 Sqrrl Data, Inc. | All Rights Reserved
TIMELINE OF RELEVANT EVENTS
• 2006: Google’s Bigtable Paper
• 2008: NSA Builds Accumulo
• 2011: NSA Open Sources Accumulo
• 2012: Sqrrl Founded
• 2013: 1st Sqrrl Release and Customers
Accumulo is:
• An Apache Software Foundation (ASF) Open-Source Software Project
• A clone of Google’s Bigtable
• A secure, sorted Key-Value Store
• A row-level ACID (locally), distributed NoSQL Database
Sqrrl is:
• A commercial software company located in Cambridge, MA
• A search and exploration platform built with Apache Accumulo
• An exciting startup with a long roadmap of challenging problems to solve
• Hiring!
BIGTABLE & ACCUMULO TECH OVERVIEW
1. Data Model & API
2. Underlying Architecture
3. Distinguishing Features
ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning
Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
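The "sorted" part of the key/value store can be sketched with a plain Python model of the 5-tuple. This is an illustration, not the Accumulo API: the `Key` class and `sort_key` function are hypothetical names. It assumes the usual Accumulo ordering, ascending on row, column family, column qualifier and visibility, with newer timestamps sorting first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    # Hypothetical stand-in for an Accumulo key (not the real client class).
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(key: Key):
    # Ascending on the first four fields; the timestamp is negated so that
    # the newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# Rows from the example table, deliberately out of order.
table = {
    Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513): "1010110110100…",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute …",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
}

for key in sorted(table, key=sort_key):
    print(key.family, "/", key.qualifier, "@", key.timestamp, "->", table[key])
```

Because all four entries share the row "John Doe", they sort next to each other, and entries in the same column family ("Test Results") are adjacent within the row.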
THE ACCUMULO CLIENT API
[Figure: client API object graph]
• Instance: new ZooKeeperInstance(...) or new MockInstance()
• Instance.getConnector(...) returns a Connector
• Connector gives access to TableOperations, InstanceOperations, and SecurityOperations
• Connector.createScanner(...) and createBatchScanner(...) return a Scanner and a BatchScanner, configured with Range and IteratorOption; iterator() yields Map.Entry pairs of Key and Value
• Connector.createBatchWriter(...) returns a BatchWriter; addMutation(...) takes a Mutation
ACCUMULO TABLETS
• Collections of KV pairs form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about other tablets, forming a 3-level hierarchy
• A Tablet is a unit of work for a Tablet Server
[Figure: tablet hierarchy]
Well-Known Location (ZooKeeper)
Root Tablet: -∞ to ∞
Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Table: Adam’s Table
• Data Tablet -∞ : thing
• Data Tablet thing : ∞
Table: Encyclopedia
• Data Tablet -∞ : Ocelot
• Data Tablet Ocelot : Yak
• Data Tablet Yak : ∞
Table: Foo
• Data Tablet -∞ to ∞
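Finding the tablet that holds a given row comes down to a binary search over the table's split points. The sketch below is a hypothetical helper, not Accumulo code. It uses the Encyclopedia split points from the figure and assumes each tablet covers a range that excludes its start row and includes its end row.

```python
from bisect import bisect_left


def tablet_for_row(split_points, row):
    """Return (prev_end_row, end_row) of the tablet holding `row`.

    None stands for -infinity as a start and +infinity as an end.
    `split_points` must be sorted.
    """
    i = bisect_left(split_points, row)
    prev_end = split_points[i - 1] if i > 0 else None
    end = split_points[i] if i < len(split_points) else None
    return prev_end, end


encyclopedia_splits = ["Ocelot", "Yak"]  # split points taken from the figure
for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", tablet_for_row(encyclopedia_splits, row))
```

The metadata tablets can be searched the same way, since their entries are themselves sorted by table and end row, which is what makes the 3-level lookup cheap.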
ACCUMULO PROCESSES
[Figure: Accumulo process architecture. Components: Applications, Tablet Servers (each hosting Tablets), Master, ZooKeeper, HDFS. Arrow labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority.]
TABLET DATA FLOW
[Figure: tablet data flow]
• Writes go to the In-Memory Map and to the Write-Ahead Log (for recovery)
• Minor Compaction writes the In-Memory Map out as a Sorted, Indexed File
• Merging / Major Compaction combines Sorted, Indexed Files
• Scan serves Tablet Reads from the In-Memory Map and the Sorted, Indexed Files
• Minor Compaction, Major Compaction, and Scan each run through an Iterator Tree
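The data flow in this figure can be approximated with a toy tablet in a few lines. This is a minimal sketch under simplifying assumptions: the `Tablet` class and `flush_threshold` parameter are hypothetical, keys are plain strings with no versions, and the "files" are in-memory sorted lists rather than real files.

```python
import heapq


class Tablet:
    """Toy model of the figure: in-memory map, write-ahead log, sorted files."""

    def __init__(self, flush_threshold=3):
        self.memtable = {}   # in-memory map: key -> value
        self.wal = []        # write-ahead log, kept only for recovery
        self.files = []      # each file is an immutable sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))   # log first, then apply in memory
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file; the logged entries
        # are now covered by that file, so the log can be dropped.
        self.files.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []

    def major_compact(self):
        # Merge all sorted files into a single sorted file.
        self.files = [list(heapq.merge(*self.files))]

    def scan(self):
        # A read merges the in-memory map with every sorted file.
        return list(heapq.merge(sorted(self.memtable.items()), *self.files))


tablet = Tablet(flush_threshold=2)
for k in ["d", "a", "c", "b", "e"]:
    tablet.write(k, k.upper())
print(len(tablet.files), tablet.scan())
```

The point of the design is that every source is already sorted, so compactions and scans are streaming merges and never need a random-access update in place.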
ITERATOR FRAMEWORK
Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
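Two of the operations listed, cell-level security and versioning, can be sketched as stacked Python generators over a sorted stream of entries. This is an illustrative model rather than the Accumulo iterator interface. All function names are hypothetical, and the visibility evaluator is a simplification that treats a label as an OR of AND clauses and does not handle parentheses. The older Cholesterol value is invented for the example.

```python
def visible(expression, authorizations):
    # Simplified label check: "A&B|C" is read as (A and B) or C.
    if not expression:
        return True
    return any(
        all(term in authorizations for term in clause.split("&"))
        for clause in expression.split("|")
    )


def visibility_filter(source, authorizations):
    # Drop entries whose visibility label the reader is not authorized to see.
    for key, value in source:
        if visible(key[3], authorizations):
            yield key, value


def versioning(source, max_versions=1):
    # Keep only the first max_versions entries of each cell; the source is
    # sorted newest-first within a cell, so these are the newest versions.
    last_cell, seen = None, 0
    for key, value in source:
        cell = key[:4]  # row, family, qualifier, visibility
        if cell != last_cell:
            last_cell, seen = cell, 0
        seen += 1
        if seen <= max_versions:
            yield key, value


def scan(entries, authorizations, max_versions=1):
    source = sorted(entries, key=lambda e: (e[0][:4], -e[0][4]))
    return list(versioning(visibility_filter(source, authorizations), max_versions))


entries = [
    (("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    # Hypothetical older version, added only to exercise versioning.
    (("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20110101), "190"),
    (("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
]
print(scan(entries, {"PCP_JD"}))
```

Each generator consumes a sorted stream and produces a sorted stream, which is why such operations can be composed freely into a tree.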
WORD COUNT: SUMMING AGGREGATING ITERATOR
[Figure: word count over an Input Corpus]
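A summing aggregation for word count can be sketched as follows. The corpus and function names are hypothetical and this is not the Accumulo combiner API. It assumes the ingester writes one entry with value 1 per word occurrence and that the store sums all values sharing a key whenever it merges sorted data.

```python
from itertools import groupby


def ingest(corpus):
    # One (word, 1) entry per occurrence; the ingester never reads old counts.
    return sorted((word, 1) for doc in corpus for word in doc.lower().split())


def summing_combiner(sorted_entries):
    # Collapse adjacent entries that share a key into a single summed entry.
    for word, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield word, sum(count for _, count in group)


corpus = ["the quick brown fox", "the lazy dog", "the fox"]  # made-up input
counts = list(summing_combiner(ingest(corpus)))
print(counts)

# Summation is associative, so partially combined data can be combined again
# with newer entries and still give the same totals.
first = list(summing_combiner(ingest(corpus[:2])))
recombined = list(summing_combiner(sorted(first + ingest(corpus[2:]))))
print(recombined == counts)
```

The second half is what lets the same logic run at minor compaction, major compaction, and scan time without changing the answer.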
ACCUMULO LATENCIES
[Figure: data path from Ingesters through Tablet Servers to Queriers. Stages: Input, Batch Writer, In-Memory Map, Compaction Iterators, RFiles, Scan Iterators, Scanner/Batch Scanner, Output. Latency labels: ~ms, ~ms, ~ms, and ms - min.]
ACCUMULO THROUGHPUT
[Figure: same data path as the latency figure, from Ingesters through Tablet Servers to Queriers, with latency labels ~ms, ~ms, ~ms, and ms - min.]
• Scan: ~1M entries/s per node
• Ingest: ~200K entries/s per node
• Read-Modify-Write latency: ~ms, so >1K entries/s is challenging with R-M-W
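The R-M-W limit follows from simple arithmetic. The sketch below assumes that "~ms" means exactly 1 ms per read-modify-write round trip and that a single client issues updates one after another, waiting for each to finish. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def sequential_rmw_rate(latency_ms):
    """Updates per second for one client that waits for each read-modify-write."""
    return 1000 / latency_ms


rmw_rate = sequential_rmw_rate(1)   # assumed 1 ms round trip
ingest_rate = 200_000               # entries/s per node, from the slide
print(rmw_rate)                     # updates per second
print(ingest_rate / rmw_rate)       # how many times faster blind ingest is
```

Under these assumptions a sequential R-M-W client tops out around the 1K entries/s mark quoted above, roughly two orders of magnitude below the per-node ingest rate.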
DEMO
R-M-W VS. COMPACTION-TIME AGGREGATION
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
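The difference between the two approaches can be made concrete by counting client operations in a toy store. This is a hypothetical model, not HBase or Accumulo code: `Store`, `increment_rmw`, `increment_blind`, and `compact` are illustrative names, and the "combiner" is a plain sum applied at read and compaction time.

```python
class Store:
    """Toy cell store that counts client reads and writes."""

    def __init__(self):
        self.cells = {}   # key -> list of values not yet combined
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return sum(self.cells.get(key, []))   # combine at read time

    def append(self, key, value):
        self.writes += 1
        self.cells.setdefault(key, []).append(value)

    def overwrite(self, key, value):
        self.writes += 1
        self.cells[key] = [value]


def increment_rmw(store, key, delta):
    current = store.read(key)                # client waits for the read...
    store.overwrite(key, current + delta)    # ...then writes the new total


def increment_blind(store, key, delta):
    store.append(key, delta)                 # write the delta, never read


def compact(store):
    # Compaction-time aggregation: collapse each cell's deltas into one value.
    for key in list(store.cells):
        store.cells[key] = [sum(store.cells[key])]


rmw, blind = Store(), Store()
for _ in range(5):
    increment_rmw(rmw, "the", 1)
    increment_blind(blind, "the", 1)
print(rmw.reads, rmw.writes, blind.reads, blind.writes)
```

Both stores end up reporting the same total, but the blind-write path moves the read and the arithmetic off the client's critical path and into the server's merge work.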
SURVEY OF DATABASE TECHNOLOGY
• Exercises in Center-Seeking
  • SQL vs. NoSQL
  • Ingest-time vs. Query-time Analytics
  • ACID vs. BASE
  • Normalized vs. Denormalized Data Models
• Primary Use Cases for Sqrrl+Accumulo
SQL VS. NOSQL
NoSQL
• Optimized for get/put operations
• Specialized for client languages
• High concurrency
• More client-side control
Hybrid
• Extend and evolve SQL
• Standardize and incorporate NoSQL paradigms
SQL
• Optimized for joins
• Strong mathematical roots in set theory
• Automatic query optimization
INGEST-TIME VS. QUERY-TIME ANALYTICS
Ingest-Time
• Optimized for online statistics
• Can reduce storage footprint
• Can be indexed for low latency
• Leverages a variety of indexes
• Requires extensive data organization at ingest
Hybrid
• Create partial summary at ingest (Question-focused datasets, knowledge bases, etc.)
• Support ad-hoc queries over summaries
• Leverage all known indexing strategies **
Query-Time
• Can compute holistic statistics, like ranking, topN, etc.
• Ad-hoc analytics: don’t know the query ahead of time
• High latency and low concurrency at scale
• Leverages block indexes, columnar layout
• Ingest can be “stream to disk”
ACID VS. BASE
ACID
• Atomicity: all or nothing for a group of operations
• Consistency and Isolation: support simple reasoning for distributed, multithreaded clients
• Durability: simple reasoning for whether data might be lost
Hybrid
• Must make some relaxations for performance at scale (under failure modes)
• Many options for “Lightweight” transaction support
• Accumulo limits atomicity, consistency, and isolation to row-level operations
BASE
• Basically Available: ensure that core operations always complete in an advertised time
• Soft-State: relaxation of referential integrity, etc.
• Eventual Consistency: relaxation of …
NORMALIZED VS. DENORMALIZED DATA MODELS
Normalized
• “Normal Form Relational Database”
• Minimizes data footprint
• Minimizes cost of data maintenance
• Can lead to expensive joins at query time
Hybrid
• Start with document store
• Introduce links/edges for quick joins
• Dynamically adapt to flexible or sparse schemas
• Similar to property graphs
Denormalized
• “Document Store”
• Flexible schema lets applications adapt quickly to changing environments
• Pre-joined to eliminate joins at query-time
• Optimized for “append-only” data
• Can inflate data sizes and slow data ingest
KNOWLEDGE-BASE USE CASE
Data sources:
• HR
• Netflow
• Proxy Logs
• Social Media
Example log record:
2014-04-14 06:36:09 429 73.105.179.202 [email protected] 500 POST application/json HTTPS "wikipedia.org:443/grouchinesses/?215=felled&297=wading&768=shimmies..." "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31" 208.80.152.201
STREAM PROCESSING USE CASE
[Figure: stream processing architecture. Labels: DATA, SPE (stream processing engine), Sqrrl, Dashboards, Actions, Interactive Analysis Tools (Discovery + Forensics).]
1. SPE queries Sqrrl to enrich streaming data
2. SPE persists results in Sqrrl for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. Sqrrl provides context for dashboards
6. Analysis tools use Sqrrl to search and manipulate historical data
SQRRL OPERATIONALIZES ACCUMULO WITH...
• Data-Centric Security
• Petabyte Scale and Operational Speeds
• Document and Graph Data Models
• SqrrlQL, including Aggregates, Secure Full-Text Search, and Secure Graph Search
• Analytics, including Real-Time Statistics and Hadoop Integrations
MODERNIZING VISUALIZATION
Sqrrl is building the next generation of operational analytics visualizations
UPCOMING EVENTS
Accumulo Summit 2014
• June 12 in College Park, MD
• http://accumulosummit.com
• Multiple tracks of talks from the leaders of the Accumulo community
IEEE HPEC Conference 2014
• September 9-11 in Waltham, MA
• http://www.ieee-hpec.org/
• Accumulo Users Group Meeting as a Special Event
• Accumulo tutorial
Watch for more meetup opportunities coming soon!