Date posted: 15-Jan-2015
Category: Technology
Uploaded by: acunu
NOSQL and Big Data Analytics
Tim Moreton, Founder and CTO
In the beginning, NOSQL was about storage
Google Personalized Search, 2006
profiles
Serve customised search results using user profiles
(read only, low latency)
Collect user queries, clickstream (write only, high throughput)
user_id
searches clicks
BigTable
MapReduce via GFS
Out-of band batch analysis to produce user profiles
Discovery Analytics: Unstructured, Warehouses, Data Mining, Machine Learning
Complex, long-running; total lack of structure
Operational Intelligence: Dashboards, Real-time Decisions, Alerting
Low latency, fresh data; some structure to exploit
When NOSQL, when Hadoop?
Normalization and its limits
For each update: A few random writes
For each query: Many random reads
Denormalization
For each query: One sequential read
For each update: Many writes, sequential IO
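The trade-off above can be sketched in a few lines of Python. This is an illustration only: the `users`, `tweets` and `timeline` structures and the function names are hypothetical, and in-memory dictionaries stand in for a real store. The normalized layout pays one extra lookup per row at query time; the denormalized layout reads one ready-made list but must rewrite every copy on update.

```python
# Normalized: each fact is stored exactly once.
users = {1: {"name": "Tim"}, 2: {"name": "Ann"}}
tweets = [
    {"user_id": 1, "text": "hello"},
    {"user_id": 2, "text": "hi"},
    {"user_id": 1, "text": "again"},
]

def timeline_normalized():
    # Query: one extra lookup per tweet (the "many random reads").
    return [(users[t["user_id"]]["name"], t["text"]) for t in tweets]

def rename_normalized(user_id, name):
    # Update: a single write.
    users[user_id]["name"] = name
    return 1

# Denormalized: the author's name is copied into every timeline row.
timeline = [
    {"user_id": t["user_id"], "name": users[t["user_id"]]["name"], "text": t["text"]}
    for t in tweets
]

def timeline_denormalized():
    # Query: one pass over a single structure (the "one sequential read").
    return [(row["name"], row["text"]) for row in timeline]

def rename_denormalized(user_id, name):
    # Update: every copy must be rewritten (the "many writes").
    writes = 0
    for row in timeline:
        if row["user_id"] == user_id:
            row["name"] = name
            writes += 1
    return writes
```

Renaming user 1 costs one write in the normalized layout and one write per tweet in the denormalized one, while both layouts return the same timeline.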
Building block: Distributed counters
[Diagram: each new tweet sends +1 to several counters: total tweets, by date (e.g. 2013-08-12) and by user (e.g. @timmoreton); one counter is shown at 752.]
CASSANDRA: UPDATE table SET col = col + 1 WHERE id = 2;
RIAK: curl -i http://host:8098/buckets/x/counters/count2 -X POST -d "1"
HBASE: table.incrementColumnValue(row, cf, col, 1);
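As a rough sketch of the counter building block, the Python below fans each event out to one counter per pre-planned query (total, by date, by user). The key layout and function names are assumptions made for illustration, and an in-memory `Counter` stands in for the atomic increment that a store such as Cassandra, Riak or HBase would provide.

```python
from collections import Counter

# Stand-in for a distributed counter table; keys are hypothetical.
counters = Counter()

def counter_keys(user, date):
    # One key per query we want to answer without scanning raw events.
    return [("total",), ("by_date", date), ("by_user", user)]

def record_tweet(user, date):
    # Each event becomes several +1 increments, one per counter.
    for key in counter_keys(user, date):
        counters[key] += 1

record_tweet("@timmoreton", "2013-08-12")
record_tweet("@timmoreton", "2013-08-13")
record_tweet("@someone", "2013-08-12")
```

Each query is then a single key lookup, for example `counters[("by_user", "@timmoreton")]`. The cost is that every question must be anticipated when the keys are chosen.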
Twitter’s Rainbird
Source: Twitter
Facebook’s Puma, ODS, Claspin
Source: Facebook
"I believe firmly that ... you should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements."
C. J. Date, 2005
Denormalization and agility
‘Lambda Architecture’
http://www.josemalvarez.es/web/wp-content/uploads/2013/03/toy-lambda-arch.png
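A toy version of the lambda idea, assuming hashtag counts as the metric: an append-only log feeds a batch view that is recomputed from scratch, a real-time view covers events since the last batch run, and queries merge the two. All names here are hypothetical, and real systems would run the batch layer on something like Hadoop rather than in-process.

```python
from collections import Counter

master_log = []            # append-only record of every event
batch_view = Counter()     # recomputed from the full log on each batch run
realtime_view = Counter()  # incremented per event since the last batch run

def ingest(hashtag):
    # Every event goes to both the log and the real-time view.
    master_log.append(hashtag)
    realtime_view[hashtag] += 1

def run_batch():
    # Rebuild the batch view from scratch, then drop the real-time
    # increments it now covers.
    batch_view.clear()
    batch_view.update(master_log)
    realtime_view.clear()

def query(hashtag):
    # Merge the (possibly stale) batch view with the fresh increments.
    return batch_view[hashtag] + realtime_view[hashtag]
```

Because the batch view is always rebuilt from the raw log, a new or corrected aggregate can be introduced by changing `run_batch` alone, at the price of running two code paths.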
Acunu Analytics
Example cubes: count by day; count by hour of day; uniques by hashtag; raw events
1 Define aggregate cubes: CREATE CUBE APPROX TOP(hashtag) WHERE browser, time GROUP BY time
2 New events update cubes
3 Rich instant queries over cubes: SELECT TOP(x) FROM t WHERE .. GROUP BY d1, d2, ... JOIN ... HAVING .. ORDER BY ..
4 Drilldown to raw events
5 Backfill new cubes using historic data
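The five steps can be mimicked with a small in-memory model. This is not Acunu's implementation: the `CountCube` class, its methods and the event fields are all invented for illustration, and it handles only exact counts, not approximate TOP or uniques.

```python
from collections import Counter

class CountCube:
    """Hypothetical count cube: one counter cell per combination of dimensions."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = Counter()

    def update(self, event):
        # Step 2: each new event increments exactly one cell.
        key = tuple(event[d] for d in self.dimensions)
        self.cells[key] += 1

    def query(self, **filters):
        # Step 3: answer from pre-aggregated cells, never from raw events.
        index = {d: i for i, d in enumerate(self.dimensions)}
        return sum(
            n for key, n in self.cells.items()
            if all(key[index[d]] == v for d, v in filters.items())
        )

raw_events = []                  # kept for drilldown (step 4)
cubes = [CountCube(["day"])]     # step 1: cubes defined up front

def ingest(event):
    raw_events.append(event)
    for cube in cubes:
        cube.update(event)

def add_cube(dimensions):
    # Step 5: a cube defined later is backfilled from historic raw events.
    cube = CountCube(dimensions)
    for event in raw_events:
        cube.update(event)
    cubes.append(cube)
    return cube
```

Keeping the raw events is what makes both drilldown and backfill possible; without them, a cube added later would only see events from that point on.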
[Diagram: event stream; API; ingest; processing; event store; roll-up cubes; dashboard, queries, programmatic interface]
Cassandra stores raw events and aggregates
Acunu Analytics manages cubes and maps inserts and SQL-like queries to Cassandra reads and writes
PROCESSING AT INGEST
JSON, CSV, log ingest via RESTful HTTP API, Flume, Storm, AMQP
Acunu Dashboards provides rich, real-time, embeddable visualizations
SELECT AVG(r) FROM metrics GROUP BY host;
AQL, Alerting, Cubes
MILLISECOND QUERIES
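One way to see why a query like `SELECT AVG(r) FROM metrics GROUP BY host` can be answered quickly is to keep a running sum and count per host at ingest time. The sketch below does that and adds a simple threshold check; the function names and the alert rule are assumptions for illustration, not Acunu's API.

```python
from collections import defaultdict

# Two counters per host are enough to serve AVG without rescanning events.
sums = defaultdict(float)
counts = defaultdict(int)

def record(host, r):
    # Processing at ingest: update the aggregates as each metric arrives.
    sums[host] += r
    counts[host] += 1

def avg_by_host():
    # Equivalent in spirit to AVG(r) ... GROUP BY host over the counters.
    return {host: sums[host] / counts[host] for host in counts}

def alerts(threshold):
    # Hypothetical alert rule: hosts whose average exceeds the threshold.
    return sorted(h for h, avg in avg_by_host().items() if avg > threshold)
```

The query cost depends on the number of hosts, not on the number of events recorded.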
API for rich queries, threshold alerting
Acunu Analytics
Conclusions
NoSQL is a great fit for collecting or serving datasets with some structure at high scale, performance, availability
Real-time Big Data apps can’t use unplanned rich queries
Use atomic counters to pre-materialize quantitative results in real-time -- but think carefully about flexibility
Do analytics out-of-band if timeliness is unimportant
A lambda architecture combines real-time with richer processing, but adds complexity
Acunu Analytics offers real-time OLAP-style queries