How jKool Analyzes Streaming Data in Real Time with DataStax
Charles Rich, VP of Product Management
jKool – jKoolcloud.com
Webinar Housekeeping
Thank you for joining. We will begin shortly.
• All attendees placed on mute
• Input questions at any time using the online interface
© 2015 jKool, All Rights Reserved.
Agenda
• jKool Overview
• jKool Technology
• Challenges
• Why We Selected Cassandra and DataStax
• Demo
jKool Overview
• jKool – Founded 2014 as a spin-off from Nastel Technologies
– Expertise in building scalable real-time analytics
• Initial Vision
– Address the big data problems we saw at customers
• Inability to analyze data fast enough to take action and address problems
• Too much data – too little time
– Provide real-time, in-memory analytics (our heritage)
– Leverage open source
– SaaS (or on-premises)
– Simplicity
What is jKool?
A solution to Find and Fix Problems Faster (operational intelligence)
DevOps can use jKool to get real-time diagnostics for entire applications: logs, metrics and transactions.
– Detect anomalies, 2 clicks to root cause
– Discover log and transaction topologies
– Analyze app behavior
– Diagnose and determine causality
• An alternative to Splunk or Elasticsearch
– Fraction of the cost of Splunk
– Much easier to use than Elasticsearch
Business Value: Instant Insight
Provide high-quality app experiences for customers – improve customer satisfaction
• Enable DevOps to:
– Fix problems faster
• Faster problem resolution, eliminate false alarms
– Deliver releases sooner
• Less time patching and more time innovating
– Be proactive
• Spot trends and prevent problems
Features
• Web-based, mobile-friendly dashboard
– Designed for simplicity and power
• Real-time & historical visualization
– Flexible, user-configurable
• Analytics immediately detect outliers
– Aggregation, summarization, comparison, including: count, min, max, avg, bucketing, filtering and Bollinger bands
• Ease of use
– Talk to your data using an English-like query language
• Scale to handle the largest volumes of data
– NoSQL architecture provides elastic scalability
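The aggregations listed above (min, max, avg, Bollinger bands) are standard windowed statistics. As a minimal sketch, a Bollinger band is just the window average plus or minus two standard deviations; the class and method names here are illustrative, not jKool's actual API:

```java
import java.util.Arrays;

/** Minimal sketch of windowed aggregations: min, max, avg and Bollinger
 *  bands (moving average +/- 2 standard deviations) over a value window.
 *  Names are illustrative, not jKool's API. */
public class WindowStats {
    public static double min(double[] w) { return Arrays.stream(w).min().getAsDouble(); }
    public static double max(double[] w) { return Arrays.stream(w).max().getAsDouble(); }
    public static double avg(double[] w) { return Arrays.stream(w).average().getAsDouble(); }

    /** Returns {lowerBand, avg, upperBand} for the window. */
    public static double[] bollinger(double[] w) {
        double mean = avg(w);
        double var = Arrays.stream(w)
                           .map(v -> (v - mean) * (v - mean))
                           .average().getAsDouble();
        double sd = Math.sqrt(var);
        return new double[] { mean - 2 * sd, mean, mean + 2 * sd };
    }
}
```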
jKool Does Machine Data
• Sequence, Order, Group, Store
• Relationships
• Compute Timing
• Summarization, comparisons
• Triggers based on continuous queries (CEP)
– Subscribe to events: min elapsedtime, avg elapsedtime, max elapsedtime where eventname="Buy" show as linechart
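The continuous query above can be sketched in a few lines: as events stream in, a subscriber filters on the event name and keeps running min/avg/max aggregates, rather than querying stored data after the fact. The `Event` and `ContinuousQuery` names below are illustrative, not jKool's API:

```java
/** Sketch of a continuous (CEP-style) query: feed events as they arrive,
 *  keep running min/avg/max elapsed time for events matching a name filter.
 *  Names are illustrative, not jKool's API. */
public class ContinuousQuery {
    public record Event(String name, long elapsedUsec) {}

    private final String filterName;
    private long count = 0, sum = 0;
    private long min = Long.MAX_VALUE, max = Long.MIN_VALUE;

    public ContinuousQuery(String filterName) { this.filterName = filterName; }

    /** Feed one event; matching events update the running aggregates. */
    public void onEvent(Event e) {
        if (!filterName.equals(e.name())) return;
        count++;
        sum += e.elapsedUsec();
        min = Math.min(min, e.elapsedUsec());
        max = Math.max(max, e.elapsedUsec());
    }

    public long min() { return min; }
    public long max() { return max; }
    public double avg() { return count == 0 ? 0 : (double) sum / count; }
}
```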
Real-time, In-Memory Analytics
jKool Analyzes Time-Series Data
Technology
• Elastic Architecture
– Linear scalability
– Highly extensible
– Fast, in-memory analysis
• Open Source
– NoSQL DB, tools and instrumentation
– No schema to maintain
• FatPipes
– Micro-services for ultimate flexibility, change and configuration
RESTful
Key to Real-time Analytics
• Process streams as they arrive while avoiding I/O
– Streams are split into a real-time queue and a persistence queue, with eventual consistency between them
• Both have to be processed in parallel
– Writing to the persistence layer first and then analyzing will not achieve near real-time processing
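The split described above can be sketched as a fan-out: every incoming event goes to both a real-time analytics queue and a persistence queue, so in-memory analysis never waits on the write path. In production each queue would be drained by its own consumer thread (or cluster); the names here are illustrative, not jKool's actual pipeline:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of splitting an incoming stream into a real-time queue and a
 *  persistence queue. Consumers drain each queue in parallel; the two
 *  views of the data are eventually consistent. Names are illustrative. */
public class StreamSplitter {
    final BlockingQueue<String> realtimeQueue = new ArrayBlockingQueue<>(1024);
    final BlockingQueue<String> persistQueue  = new ArrayBlockingQueue<>(1024);

    /** Fan one event out to both queues; returns false if either is full.
     *  Single '&' so both offers are always attempted. */
    public boolean ingest(String event) {
        return realtimeQueue.offer(event)   // analyzed in memory immediately
             & persistQueue.offer(event);   // written to the store asynchronously
    }
}
```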
Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
– Clustered way to process incoming real-time streams
• STORM handles clustering/distribution
• Kafka/JMS for messaging between grids
– Split streaming workload across the cluster
– Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
– For distributing queries and trend analysis
– Micro-batching for historical analytics
– Loading large datasets into memory (across different nodes)
– Running queries against large datasets
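The micro-batching idea above can be sketched without any cluster: instead of analyzing record by record, buffer incoming events and hand them to an analysis function in fixed-size batches (Spark Streaming does the same thing by time interval). This is a stand-alone illustration of the pattern, not Spark's or jKool's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of micro-batching: buffer events and emit them to a processor
 *  in fixed-size batches. Names are illustrative, not Spark's API. */
public class MicroBatcher {
    private final int batchSize;
    private final Consumer<List<String>> processor;
    private final List<String> buffer = new ArrayList<>();

    public MicroBatcher(int batchSize, Consumer<List<String>> processor) {
        this.batchSize = batchSize;
        this.processor = processor;
    }

    /** Add an event; a full buffer is flushed as one batch. */
    public void add(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) flush();
    }

    /** Emit whatever is buffered as one (possibly partial) batch. */
    public void flush() {
        if (buffer.isEmpty()) return;
        processor.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```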
Web Interface: DevOps Application Owner
Challenges: Meeting our Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
– Target: the business user
– Insulate us from underlying software
– Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & operational efficiency
• Ease of monitoring & management
Challenges: What we experienced
• So many technology options (…so little time…)
– Deciding on the right combination early on is key
• Cassandra/Solr deployment (it was a learning experience for us)
– Lots of configuration, memory management and replication options
• Monitoring, managing clusters
– Cassandra/Solr, STORM, ZooKeeper, messaging
– Plus we leverage our parent company's AutoPilot technology
• Achieving near real-time analytics proved extremely challenging – but we did it!
– Keeping track of latencies across the cluster
– Estimating the computational capacity required to crunch incoming streams
Challenges: DB was the bottleneck
• Needed a high-performance DB platform
• SQL (Oracle, MySQL, etc.)
– No scale. We have seen a lot of our customers' issues with this at our parent company Nastel…
– RAM was "the" bottleneck. Commits take too long, and while that is happening everything else stops
• NoSQL
– Cassandra/Solr (DSE)
– Hadoop/MapReduce
– MongoDB
• Clustered Computing Platforms
– STORM
– MapReduce
– Spark (we learned about this while building jKool)
Why we chose Cassandra/Solr?
• Pros:
– Simple to set up & scale for clustered deployments
– Scalable, resilient, fault-tolerant (easy replication)
– Ability to have data automatically expire (TTL – necessary for our pricing model)
– Configurable replication strategy
– Great for heavy write workloads
• Write performance was better than Hadoop
• Insert rate was of paramount importance for us – getting data in as fast as possible was our goal
• The Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
– Solr provides a way to index all incoming data – essential
– DSE provides a nice integration between Cassandra and Solr
• Cons:
– Susceptible to GC pauses (memory management)
• The more memory, the more GC pauses
• Less memory and more nodes seems a better approach than one big "honking" server (we see 6-8GB as optimal, so far)
– Data compaction tasks may hang
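The TTL behavior called out in the pros is worth a concrete sketch: in Cassandra, `INSERT ... USING TTL <seconds>` makes a record simply stop being readable once its expiry passes, with no pruning job to run. The toy store below mimics that semantics in plain Java; the clock is passed in explicitly to keep it testable, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch of TTL-based expiry, mimicking Cassandra's
 *  `INSERT ... USING TTL <seconds>`: each record carries an expiry time
 *  and is invisible to reads after it passes. Names are illustrative. */
public class TtlStore {
    private record Entry(String value, long expiresAtMillis) {}
    private final Map<String, Entry> data = new HashMap<>();

    /** Store a value that expires ttlSeconds after nowMillis. */
    public void put(String key, String value, long ttlSeconds, long nowMillis) {
        data.put(key, new Entry(value, nowMillis + ttlSeconds * 1000));
    }

    /** Returns the value, or null if absent or already expired at nowMillis. */
    public String get(String key, long nowMillis) {
        Entry e = data.get(key);
        if (e == null || nowMillis >= e.expiresAtMillis()) return null;
        return e.value();
    }
}
```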
Why not Hadoop MapReduce?
• MapReduce too slow for real-time workloads
– OK for batch, not so great for real-time
– Needs to be paired with other technologies for query (Hive/Pig)
– Complex to set up, run and operate
• Our goal was simplicity first…
• Opted for STORM/Spark wrapped with our own micro-services platform, FatPipes, instead of MapReduce functionality
Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
– Global write-lock performance concerns…
• Cassandra/Solr
– Java-based (our project was in Java)
– Easy to scale and replicate data
– Flexible write-consistency levels (ANY, QUORUM, ALL, etc.)
– Did we say Java? Yes. (We like Java…)
• Flexible choice of platform coverage
– Great for time-series data streams (market focus for jKool)
• Inherent query limitations in Cassandra solved via Solr integration (provided with DSE, as mentioned earlier)
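What the consistency levels above mean for a write is easy to state concretely: each level is just the number of replica acknowledgements a write waits for, given the replication factor (QUORUM is RF/2 + 1 with integer division; ANY can be satisfied by a hint alone, so no live replica ack is required). The enum below is an illustration of that arithmetic, not the driver's own type:

```java
/** Sketch of what Cassandra write-consistency levels mean: the number of
 *  replica acknowledgements a write waits for, given the replication
 *  factor. The enum is illustrative, not the DataStax driver's type. */
public class ConsistencyMath {
    public enum Level { ANY, ONE, QUORUM, ALL }

    /** Replica acks required before a write at this level succeeds. */
    public static int requiredAcks(Level level, int replicationFactor) {
        switch (level) {
            case ANY:    return 0;                       // a stored hint alone suffices
            case ONE:    return 1;
            case QUORUM: return replicationFactor / 2 + 1; // majority of replicas
            case ALL:    return replicationFactor;
            default:     throw new IllegalArgumentException("unknown level");
        }
    }
}
```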
What we learned
• Consider your application
– Read-heavy or write-heavy? Both?
• Evaluate performance of course, but consider the user
– We needed simplicity: setup and scale (for us and the end user)
– We needed reliability – we were not planning on targeting data engineers
– We needed auto-pruning (TTL)
– We needed easy search
• DSE had all of this… the others did not
– We chose DSE.
jKool in Real Time – A Live Demo
Thank you!
Input questions at any time using the online interface
More information on jKool at: jKoolCloud.com