@slamdata @jdegoes
John A. De Goes — CTO SlamData Inc.
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Agenda
1. The Rise of NoSQL
2. The Dark Side of NoSQL
3. Options for Reporting
   a. Extract-Transform-Load
   b. Fat Drivers
   c. Code to NoSQL APIs
   d. Native NoSQL Analytics
4. Why NoSQL Analytics is Hard
5. NoSQL Databases: Not Equal
6. Question & Answer
The Rise of NoSQL
The Rise of NoSQL
● Massively scalable
● Operational Ease-of-Use
● Native support for rich data structures
● Native Support for heterogeneity
● Rapid Time-to-Deployment
The Dark Side of NoSQL
The Dark Side of NoSQL
Overview
The Dark Side of NoSQL
Give Me My Damn Report!
● Ad hoc analytics
● Exploratory analytics
● Operational analytics
● Analytics dashboards
● Batch reporting
● IoT / Event analytics
Need for Analytics
The Dark Side of NoSQL
SQL Analytics
The Dark Side of NoSQL
1. ETL
2. Fat Drivers
3. Code to NoSQL API
4. Native NoSQL Analytics
Choices
Options for Reporting
Extract-Transform-Load
{
  "user_id": "[email protected]",
  "profile": {
    "name": "Mary Jane",
    "addresses": [{
      "city": "London",
      "country": "UK"
    }],
    "band_plays": {
      "Squirrel Nut Zippers": 56,
      "Red Hot Tomatoes": 19,
      "Big Bad Voodoo Daddy": 102
    }
  }
}
SQL / Hadoop
Overview
Extract-Transform-Load
1. Flattening
users
user_id
...
...
band_plays
user_id           | band_name            | play_count
[email protected] | Squirrel Nut Zippers | 56
[email protected] | Red Hot Tomatoes     | 19
[email protected] | Big Bad Voodoo Daddy | 102
profiles
profile_id | user_id           | name
1          | [email protected] | Mary Jane
addresses
profile_id | city   | country
1          | London | UK
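The flattening step can be sketched in Python roughly as follows. This is only an illustration, not any particular ETL tool: the `flatten_user` helper, the placeholder `user_id` value, and the counter used to invent `profile_id` surrogate keys are all assumptions, and `band_plays` is read from inside `profile` as the sample document's text suggests.

```python
from itertools import count

# Sample document modelled on the slide; the user_id value is a placeholder.
DOC = {
    "user_id": "user-1",
    "profile": {
        "name": "Mary Jane",
        "addresses": [{"city": "London", "country": "UK"}],
        "band_plays": {
            "Squirrel Nut Zippers": 56,
            "Red Hot Tomatoes": 19,
            "Big Bad Voodoo Daddy": 102,
        },
    },
}

_profile_ids = count(1)  # hypothetical surrogate-key generator


def flatten_user(doc):
    """Split one nested document into rows for four relational tables."""
    profile = doc["profile"]
    profile_id = next(_profile_ids)
    return {
        "users": [{"user_id": doc["user_id"]}],
        "profiles": [
            {"profile_id": profile_id, "user_id": doc["user_id"], "name": profile["name"]}
        ],
        # Each array element becomes its own row, keyed back to the profile.
        "addresses": [{"profile_id": profile_id, **addr} for addr in profile["addresses"]],
        # Map keys (band names) become data in a band_name column.
        "band_plays": [
            {"user_id": doc["user_id"], "band_name": band, "play_count": plays}
            for band, plays in profile["band_plays"].items()
        ],
    }


tables = flatten_user(DOC)
for name, rows in tables.items():
    print(name, rows)
```

Every nested array or map costs one extra table plus a join key, which is part of why this approach gets brittle as documents evolve.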
Extract-Transform-Load
2. Homogenization
events
type          | user_id | genre_name | artist_name     | band_name              | play_count
“band_play”   | ...     | NULL       | NULL            | “Squirrel Nut Zippers” | 56
“artist_play” | ...     | NULL       | “Frank Sinatra” | NULL                   | 19
“genre_play”  | ...     | “New Age”  | NULL            | NULL                   | 102
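Homogenization amounts to projecting every event shape onto one fixed column list and padding the gaps with NULL. A minimal Python sketch, assuming a hypothetical `homogenize` helper and a placeholder `user_id` value:

```python
# Fixed column order of the wide "events" table from the slide.
COLUMNS = ["type", "user_id", "genre_name", "artist_name", "band_name", "play_count"]

# Heterogeneous events; the user_id value is a placeholder.
EVENTS = [
    {"type": "band_play", "user_id": "u1", "band_name": "Squirrel Nut Zippers", "play_count": 56},
    {"type": "artist_play", "user_id": "u1", "artist_name": "Frank Sinatra", "play_count": 19},
    {"type": "genre_play", "user_id": "u1", "genre_name": "New Age", "play_count": 102},
]


def homogenize(event, columns=COLUMNS):
    """Project an event onto the fixed columns; absent fields become None (SQL NULL)."""
    return tuple(event.get(col) for col in columns)


rows = [homogenize(event) for event in EVENTS]
for row in rows:
    print(row)
```

The table grows one mostly-NULL column for every field that any event type introduces.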
Extract-Transform-Load
3. Incremental ETL
1. last_modified field
2. Import changed data*
* Less relevant for Hadoop
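The two steps above can be sketched as a watermark-based load. This is a simplified illustration under stated assumptions: the `incremental_load` function, the `_id` key, and the in-memory `warehouse` dictionary standing in for the target store are all hypothetical.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Small helper to build timezone-aware timestamps for the sample data."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Source documents, each carrying a last_modified field (step 1).
SOURCE = [
    {"_id": "a", "name": "Mary Jane", "last_modified": utc(2015, 2, 3)},
    {"_id": "b", "name": "John Doe", "last_modified": utc(2015, 3, 1)},
]


def incremental_load(source, warehouse, watermark):
    """Import only documents changed after the watermark (step 2).

    Returns the new watermark to persist for the next run.
    """
    changed = [doc for doc in source if doc["last_modified"] > watermark]
    for doc in changed:
        warehouse[doc["_id"]] = doc  # upsert by primary key
    return max((doc["last_modified"] for doc in changed), default=watermark)


warehouse = {}
watermark = incremental_load(SOURCE, warehouse, utc(2015, 2, 15))
print(sorted(warehouse), watermark.date())
```

Note that this scheme does not see deletions, and it relies on every writer updating `last_modified` faithfully.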
Extract-Transform-Load
Tools
Extract-Transform-Load
Report Card
✗ Slow
✗ Painful
✗ Brittle
✓ Tunable Performance
✓ Unlimited Flexibility in Reporting / Analytics
Fat Drivers
Overview
Driver
Embedded SQL Engine
Real-Time ETL (Filtered Table Scan)
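One way to picture the fat-driver architecture is the sketch below. It is an assumption-laden toy, not how any vendor's driver is built: Python's built-in `sqlite3` stands in for the embedded SQL engine, a list of dictionaries stands in for the NoSQL collection, and `filtered_scan` and `query_via_fat_driver` are hypothetical names.

```python
import sqlite3

# Stand-in for a NoSQL collection (hypothetical data).
COLLECTION = [
    {"user_id": "u1", "country": "UK", "plays": 56},
    {"user_id": "u2", "country": "US", "plays": 19},
    {"user_id": "u3", "country": "UK", "plays": 102},
]


def filtered_scan(collection, predicate):
    """Real-time ETL: stream every document, keeping those that pass the filter."""
    for doc in collection:
        if predicate(doc):
            yield doc


def query_via_fat_driver(collection, predicate, sql):
    """Copy the scanned documents into an embedded SQL engine, then run the query there."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id TEXT, country TEXT, plays INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (:user_id, :country, :plays)",
        filtered_scan(collection, predicate),
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


result = query_via_fat_driver(
    COLLECTION,
    lambda doc: doc["country"] == "UK",
    "SELECT COUNT(*), SUM(plays) FROM users",
)
print(result)
```

Because the data is pulled out of the database and re-materialized on the client for each query, the approach works only while the scanned data stays small and flat.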
Fat Drivers
Approaches
Magic Config
Fat Drivers
Vendors
Fat Drivers
Report Card
✗ Slow
✗ Limited to Small Data
✗ Limited to Simple Analytics
✗ Limited to Simple Data
✓ Low Friction
✓ Flexibility in Analytics / Reporting
Code to NoSQL API
Overview
Code
CSV
HTML5/JavaScript
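A hand-coded report of this kind might look like the following sketch. The `fetch_users` function is a hypothetical stand-in for a call to the database's native client API, and the data and column names are placeholders.

```python
import csv
import io


def fetch_users():
    """Stand-in for a query against the database's native API (hypothetical)."""
    return [
        {
            "user_id": "user-1",
            "band_plays": {
                "Squirrel Nut Zippers": 56,
                "Red Hot Tomatoes": 19,
                "Big Bad Voodoo Daddy": 102,
            },
        }
    ]


def report_csv(docs):
    """Hand-written report logic: one CSV row per (user, band) pair."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["user_id", "band_name", "play_count"])
    for doc in docs:
        for band, plays in sorted(doc.get("band_plays", {}).items()):
            writer.writerow([doc["user_id"], band, plays])
    return buf.getvalue()


report = report_csv(fetch_users())
print(report)
```

Every new report means another function like `report_csv`, and each one breaks when the document shape changes.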
Code to NoSQL API
Report Card
✗ Slow
✗ Painful
✗ Brittle
✗ Poor Performance
✓ No ETL
Native NoSQL Analytics
Overview
Native NoSQL Analytics
Native NoSQL Analytics
Tools
SQL (+/-)
Visual Analytics
ETL (+/-) Native
ZoomData
Cloud 9 Charts
JSON Studio
Apache Drill
Quasar
SlamData
Impala
Native NoSQL Analytics
Report Card
✗ Immature
✗ Learning Curve
✗ Limited Choices
✓ No ETL
✓ Flexible & Fast
✓ Any data, Anywhere
✓ Tunable Performance
Why NoSQL Analytics Is Hard
The Eight Deadly Obstacles to NoSQL Analytics
Characteristics
1. Generic Data Model
Characteristics
2. Isomorphic Data Model
Data
{
  "userId": 8927524,
  "profile": {
    "name": "Mary Jane",
    "age": 29,
    "gender": "female"
  },
  "comments": [{
    "id": "F2372BAC",
    "text": "I concur.",
    "replyTo": [9817361, "F8ACD164F"],
    "time": "2015-02-03"
  }, {
    "id": "GH732AFC",
    "replyTo": [9654726, "A44124F"],
    "time": "2015-03-01"
  }]
}
SQL²
SELECT comments[*].replyTo[*] FROM data
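To make the query's intent concrete, here is a rough Python equivalent. It assumes that each `[*]` flattens an array into one result per element; that reading of SQL² semantics, and the `reply_targets` name, are assumptions made for illustration.

```python
# Document from the slide.
DOC = {
    "userId": 8927524,
    "profile": {"name": "Mary Jane", "age": 29, "gender": "female"},
    "comments": [
        {"id": "F2372BAC", "text": "I concur.",
         "replyTo": [9817361, "F8ACD164F"], "time": "2015-02-03"},
        {"id": "GH732AFC",
         "replyTo": [9654726, "A44124F"], "time": "2015-03-01"},
    ],
}


def reply_targets(doc):
    """Flatten comments[*].replyTo[*]: one result per element of each nested array."""
    return [
        target
        for comment in doc.get("comments", [])
        for target in comment.get("replyTo", [])
    ]


targets = reply_targets(DOC)
print(targets)
```

The path expression in the query mirrors the shape of the data, so no intermediate tables are needed to reach a doubly nested array.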
Characteristics
3. Multidimensionality
Data
{
  "user_id": 928347234,
  "email": null,
  "events": [
    {"impression": {
      "ts": 912348934,
      "page": "index.html"}}]
}
SQL²
SELECT user_id, [events[_] WHERE events[_].ts < 9347234 ...] AS events
FROM visitors
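The point of this query is that the result keeps its nested shape: each visitor row still holds an array of events, only filtered. A hedged Python sketch follows; the `event_ts` and `recent_view` helpers are hypothetical, the second event is invented so that the filter has something to keep, and the timestamp is read from inside the one-key wrapper object that the sample data uses.

```python
# Visitor modelled on the slide; the "click" event is an added, hypothetical sample.
VISITOR = {
    "user_id": 928347234,
    "email": None,
    "events": [
        {"impression": {"ts": 912348934, "page": "index.html"}},
        {"click": {"ts": 1000, "page": "about.html"}},
    ],
}


def event_ts(event):
    """Each event is a one-key object such as {"impression": {...}}; return its ts."""
    (body,) = event.values()
    return body["ts"]


def recent_view(visitor, cutoff):
    """Keep the visitor row, but filter the nested events array in place."""
    return {
        "user_id": visitor["user_id"],
        "events": [e for e in visitor["events"] if event_ts(e) < cutoff],
    }


view = recent_view(VISITOR, 9347234)
print(view)
```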
Characteristics
4. Unified Schema/Data
Data
{
  "user_id": "[email protected]",
  "band_plays": {
    "Squirrel Nut Zippers": 56,
    "Red Hot Tomatoes": 19,
    "Big Bad Voodoo Daddy": 102}
}
SQL²
SELECT band_plays{*:} AS artistName, SUM(band_plays{*}) AS votes
FROM music
GROUP BY band_plays{*:}
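Here the band names are map keys, so schema and data are the same thing. The sketch below shows one plausible reading in Python, assuming `{*:}` yields a map's keys and `{*}` its values; the `votes_by_artist` name and the second document are hypothetical.

```python
from collections import Counter

# First document follows the slide (placeholder user_id); the second is invented.
MUSIC = [
    {
        "user_id": "user-1",
        "band_plays": {
            "Squirrel Nut Zippers": 56,
            "Red Hot Tomatoes": 19,
            "Big Bad Voodoo Daddy": 102,
        },
    },
    {"user_id": "user-2", "band_plays": {"Red Hot Tomatoes": 1}},
]


def votes_by_artist(docs):
    """Group by map key (artist name) and sum the map values (play counts)."""
    totals = Counter()
    for doc in docs:
        for artist, plays in doc.get("band_plays", {}).items():
            totals[artist] += plays
    return dict(totals)


votes = votes_by_artist(MUSIC)
print(votes)
```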
Characteristics
5. Polymorphic Queries
Data
{"type": "click",
 "link": "http://foo.com",
 "timestamp": 123987172}
{"type": "impression",
 "page": "index.html",
 "timestamp": 92372}
SQL²
SELECT COUNT(*) AS count, timestamp FROM data GROUP BY timestamp
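A polymorphic query touches only the fields that the different document shapes share. In Python terms the grouping might look like this sketch, where the third event is a hypothetical addition so that one group has more than one member:

```python
from collections import Counter

# Two event shapes from the slide, plus one invented click sharing a timestamp.
DATA = [
    {"type": "click", "link": "http://foo.com", "timestamp": 123987172},
    {"type": "impression", "page": "index.html", "timestamp": 92372},
    {"type": "click", "link": "http://bar.com", "timestamp": 92372},
]

# Group on the shared field only; type-specific fields (link, page) are ignored.
counts = Counter(event["timestamp"] for event in DATA if "timestamp" in event)
print(dict(counts))
```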
Characteristics
6. Post-Relational
Data
{
  "name": "John Doe",
  "blog_posts": [
    {"post_id": "89934"},
    {"post_id": "92371"}
  ]
}
SQL²
SELECT authors.name, posts.title
FROM authors JOIN posts
ON authors.blog_posts[*].post_id = posts._id
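The join condition reaches into a nested array, which a classic relational join cannot express without flattening first. A small Python sketch of the same join, with hypothetical post titles and a hypothetical `join_author_posts` helper:

```python
AUTHORS = [
    {"name": "John Doe", "blog_posts": [{"post_id": "89934"}, {"post_id": "92371"}]},
]

# The posts collection is not shown on the slide; these titles are placeholders.
POSTS = [
    {"_id": "89934", "title": "Post A"},
    {"_id": "92371", "title": "Post B"},
    {"_id": "00000", "title": "Unreferenced post"},
]


def join_author_posts(authors, posts):
    """Inner join on authors.blog_posts[*].post_id = posts._id."""
    by_id = {post["_id"]: post for post in posts}
    return [
        (author["name"], by_id[ref["post_id"]]["title"])
        for author in authors
        for ref in author.get("blog_posts", [])
        if ref["post_id"] in by_id
    ]


joined = join_author_posts(AUTHORS, POSTS)
print(joined)
```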
Characteristics
7. Runtime Type ID & Conversion
Data
{"email": ["[email protected]",
{"email": {
  "home": "[email protected]",
  "work": "[email protected]"}}
SQL²
SELECT
  CASE TYPEOF email
    -- old: email stored in 2nd el:
    WHEN 'array' THEN email[1]
    -- new format:
    WHEN 'map' THEN email.work
    ELSE email
  END AS email
FROM users
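The same runtime type dispatch can be written in Python with `isinstance` checks. The `normalize_email` name and the example.com addresses are placeholders for illustration only.

```python
def normalize_email(email):
    """Return a single address regardless of which historical format stored it."""
    if isinstance(email, list):
        # Old format: the address lives in the 2nd element.
        return email[1] if len(email) > 1 else None
    if isinstance(email, dict):
        # New format: a map of labelled addresses; prefer the work one.
        return email.get("work")
    # Otherwise assume the field already holds a plain value.
    return email


# Placeholder documents covering all three shapes.
USERS = [
    {"email": ["old@example.com", "mary@example.com"]},
    {"email": {"home": "home@example.com", "work": "work@example.com"}},
    {"email": "plain@example.com"},
]

emails = [normalize_email(user["email"]) for user in USERS]
print(emails)
```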
Characteristics
8. Structural Pattern Matching
Data
{
  "user_id": "[email protected]",
  "events": [
    {"type": "purchase",
     "timestamp": 12392342,
     "order_id": "2ffa34aa"},
    {"type": "click",
     "timestamp": 92327123,
     "link": "http://foo.com"}]
}
SQL²
SELECT
  CASE user_events
    WHEN […, e1, e2, …] THEN
      e1.timestamp - e2.timestamp
  END AS delta
FROM users
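One plausible reading of the pattern is "for every pair of adjacent events, bind them to e1 and e2". Under that assumption, which is not confirmed by the slide, a Python sketch with a hypothetical `adjacent_deltas` helper could be:

```python
# Events from the slide.
EVENTS = [
    {"type": "purchase", "timestamp": 12392342, "order_id": "2ffa34aa"},
    {"type": "click", "timestamp": 92327123, "link": "http://foo.com"},
]


def adjacent_deltas(events):
    """For each adjacent pair (e1, e2), compute e1.timestamp - e2.timestamp."""
    return [
        e1["timestamp"] - e2["timestamp"]
        for e1, e2 in zip(events, events[1:])
    ]


deltas = adjacent_deltas(EVENTS)
print(deltas)
```

Python's own `match` statement allows only one starred sub-pattern in a sequence pattern, so the sliding window via `zip` is used here instead.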
NoSQL Databases: Not Equal
NoSQL Databases: Not Equal
Desired Characteristics
1. Dual Operations & Analytics
2. In-Database Analytics
3. General-Purpose Analytics
4. Native Report Tooling
NoSQL Databases: Not Equal
Couchbase
✓ Dual Operations / Analytics
✓ In-Database Analytics
✓ General-Purpose Analytics
✗ Native Report Tooling
Best Reporting Option: Fat Drivers
Runner-Up: Code to NoSQL APIs
NoSQL Databases: Not Equal
MarkLogic
✓ Dual Operations / Analytics
✓ In-Database Analytics
✓ General-Purpose Analytics
✗ Native Report Tooling
Best Reporting Option: ETL
Runner-Up: Code to NoSQL APIs
NoSQL Databases: Not Equal
MongoDB
✓ Dual Operations / Analytics*
✓ In-Database Analytics
✗ General-Purpose Analytics
✓ Native Report Tooling
Best Reporting Option: Native NoSQL Analytics
Runner-Up: Code to NoSQL APIs
* Further maturation needed
NoSQL Databases: Not Equal
ElasticSearch
✓ Dual Operations / Analytics
✓ In-Database Analytics
✗ General-Purpose Analytics
✗ Native Report Tooling
Best Reporting Option: Code to NoSQL APIs
Runner-Up: ETL to Hadoop
NoSQL Databases: Not Equal
Cassandra
✗ Dual Operations / Analytics*
✗ In-Database Analytics*
✗ General-Purpose Analytics
✗ Native Report Tooling
Best Reporting Option: ETL
Runner-Up: Code to NoSQL APIs*
* Real-time analytics
THE END
Questions?