Post on 21-Jan-2017
transcript
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Real-Time Analytics with Apache CassandraCassandra Day Berlin, 11.2.2016
Guido Schmutz
Guido Schmutz
Working for Trivadis for more than 19 yearsOracle ACE Director for Fusion Middleware and SOACo-Author of different booksConsultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast DataMember of Trivadis Architecture BoardTechnology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.comBlog: http://guidoschmutz.wordpress.comSlideshare: http://de.slideshare.net/gschmutzTwitter: gschmutz
2
Our company.
© Trivadis – The Company3 2/11/16
Trivadis is a market leader in IT consulting, system integration, solution engineeringand the provision of IT services focusing on and and Open Source technologiesin Switzerland, Germany, Austria and Denmark. We offer our services in the followingstrategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
O P E R A T I O N
COPENHAGEN
MUNICH
LAUSANNEBERN
ZURICHBRUGG
GENEVA
HAMBURG
DÜSSELDORF
FRANKFURT
STUTTGART
FREIBURG
BASEL
VIENNA
With over 600 specialists and IT experts in your region.
© Trivadis – The Company4 2/11/16
14 Trivadis branches and more than600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:CHF 5.0 million
Financially self-supporting andsustainably profitable
Experience from more than 1,900 projects per year at over 800customers
Agenda
1. Customer Use Case and Architecture2. Cassandra Data Modeling3. Cassandra for Timeseries Data4. Titan:db for Graph Data
5
Customer Use Case andArchitecture
6
Data Science Lab @ Armasuisse W&T
W+T flagship project, standing for innovation & tech transfer
Building capabilities in the areas of:• Social Media Intelligence
(SOCMINT)
• Big Data Technologies & Architectures
Invest into new, innovative and not widely-proven technology• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)• Graph Data
• …
3 Phases: June 2013 – June 2015
7
SOCMINT System – Time Dimension
Major data model: Time series (TS)
TS reflect user behaviors over time
Activities correlate with events
Anomaly detectionEvent detection & prediction
8
SOCMINT System – Social Dimension
User-user networks (social graphs);
Twitter: follower, retweet and mention graphs
Who is central in a social network?
Who has retweeted a given tweet to whom?
9
SOCMINT System - “Lambda Architecture” for Big Data
DataCollection
(Analytical)BatchDataProcessing
Batchcompute
BatchResultStoreDataSources
Channel
DataAccess
Reports
Service
AnalyticTools
AlertingTools
Social
RDBMS
Sensor
ERP
Logfiles
Mobile
Machine
(Analytical)Real-TimeDataProcessing
Stream/EventProcessing
Batchcompute
Real-TimeResultStore
Messaging
ResultStore
QueryEngine
ResultStore
ComputedInformation
RawData(Reservoir)
=DatainMotion =DataatRest10
SOCMINT System – Frameworks & Components in Use
DataCollection
(Analytical)BatchDataProcessing
Batchcompute
BatchResultStoreDataSources
Channel
DataAccess
Reports
Service
AnalyticTools
AlertingTools
Social
(Analytical)Real-TimeDataProcessing
Stream/EventProcessing
Batchcompute
Real-TimeResultStore
Messaging
ResultStore
QueryEngine
ResultStore
ComputedInformation
RawData(Reservoir)
=DatainMotion =DataatRest11
Streaming Analytics Processing Pipeline
Kafka provides reliable and efficient queuing
Storm processes (rollups, counts)
Cassandra stores results at same speed
StoringProcessingQueuing
12
TwitterSensor 1
TwitterSensor 2
TwitterSensor 3
VisualizationApplication
VisualizationApplication
Cassandra Data Modeling
13
Cassandra Data Modelling
14
• Don’t think relational !
• Denormalize, Denormalize, Denormalize ….
• Rows are gigantic and sorted = one row is stored on one node• Know your application/use cases => from query to model
• Index is not an afterthought, anymore => “index” upfront• Control physical storage structure
“Static” Tables – “Skinny Row”
15
rowkey
CREATE TABLE skinny (rowkey text, c1 text PRIMARY KEY,c2 text,c3 text,
PRIMARY KEY (rowkey));
Growsup
toBillionofRow
s
rowkey-1 c1 c2 c3value-c1 value-c2 value-c3
rowkey-2 c1 c3value-c1 value-c3
rowkey-3 c1 c2 c3value-c1 value-c2 value-c3
c1 c2 c3
PartitionKey
“Dynamic” Tables – “Wide Row”
16
rowkey
Billion
ofR
ows rowkey-1 ckey-1:c1 ckey-1:c2
value-c1 value-c2
rowkey-2
rowkey-3
CREATE TABLE wide (rowkey text, ckey text,c1 text,c2 text,
PRIMARY KEY (rowkey, ckey) WITH CLUSTERING ORDER BY (ckey ASC);
ckey-2:c1 ckey-2:c2value-c1 value-c2
ckey-3:c1 ckey-3:c2value-c1 value-c2
ckey-1:c1 ckey-1:c2value-c1 value-c2
ckey-2:c1 ckey-2:c2value-c1 value-c2
ckey-1:c1 ckey-1:c2value-c1 value-c2
ckey-2:c1 ckey-2:c2value-c1 value-c2
ckey-3:c1 ckey-3:c2value-c1 value-c2
1 2Billion
PartitionKey Clustering Key
Cassandra for Timeseries Data
17
Show Timeseries: Provide list of metrics
18
CREATE TABLE tweet_count (sensor_id text,bucket_id text,key text,time_id timestamp,count counter,
PRIMARY KEY((sensor_id, bucket_id), key, time_id))WITH CLUSTERING ORDER BY (key ASC, time_id DESC);
Use of “Static” Table
bucket-id defines buckets of values • HOUR-2015-10 = values
collected hourly in one partition for one month
ABC-001:HOUR-2015-10 dse:10:00:count1’550
ABC-001:DAY-2015-10 dse:14-OCT:count105’999
dse:13-OCT:count120’344
nosql:14-OCT:count2’532
dse:09:00:count2’299
nosql:10:00:count25
30d*24h*nkeys=n*720cols
OpenSourceTimeSeriesDBsoverCassandra:KairosDB: https://kairosdb.github.io/Heroic: http://spotify.github.io/heroicPartitionKey Clustering Key
Show Timeseries: Provide list of metrics
19
UPDATE tweet_count SET count = count + 1WHERE sensor_id = 'ABC-001’ AND bucket_id = 'HOUR-2015-10'AND key = 'ALL’ AND time_id = '2015-10-14 10:00:00';
SELECT * from tweet_countWHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00’;
sensor_id | bucket_id | key | time_id | count----------+--------------+-----+--------------------------+-------ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
PartitionKey Clustering Key
Titan:db & Cassandra for Graph Data
20
Supporting Graph Data with Titan:db and Cassandra
21
http://thinkaurelius.github.io/titan/
Gremlin in Action – Creating the Graph
22
Gremlin in Action – Graph Traversal
23
Gremlin in Action – Graph Traversal (II)
24
Summary - Know your domain
Connectedness ofDatalow high
DocumentDataStore
Key-ValueStores
Wide-ColumnStore
GraphDatabases
RelationalDatabases
Guido SchmutzEmail: guido.schmutz@trivadis.com+41 79 412 05 39
26