Spark Summit - Stratio Streaming


Stratio Streaming is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine.


Stratio is the only Big Data platform able to combine, in one query, stored data with streaming data in real time (in less than 30 seconds).

We are polyglots as well: we use Spark over two NoSQL databases, Cassandra and MongoDB.

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. Internally, a DStream is a sequence of RDDs, Spark's abstraction of an immutable, distributed dataset.
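A minimal sketch of the model in Scala (the socket source on localhost:9999 and the one-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // one RDD per 1-second batch

// Each batch of lines read from the socket becomes one RDD in the DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()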

On top of the Spark core sit several libraries:

• Shark (SQL)

• Spark Streaming

• MLlib (machine learning)

• GraphX (graph processing)

Operations available on DStreams include:

• map(func), flatMap(func), filter(func), count()

• repartition(numPartitions)

• union(otherStream)

• reduce(func), countByValue(), reduceByKey(func, [numTasks])

• join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])

• transform(func)

• updateStateByKey(func)

• window(windowLength, slideInterval)

• countByWindow(windowLength, slideInterval)

• reduceByWindow(func, windowLength, slideInterval)

• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) (see the sketch after this list)

• countByValueAndWindow(windowLength, slideInterval, [numTasks])

• print()

• foreachRDD(func)

• saveAsObjectFiles(prefix, [suffix])

• saveAsTextFiles(prefix, [suffix])

• saveAsHadoopFiles(prefix, [suffix])
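For example, a sliding count per word over the last 30 seconds, recomputed every 10 seconds, combining the window and output operations above (a sketch reusing the pairs DStream from the earlier snippet; window length and slide interval must be multiples of the batch interval):

// Sliding word count: 30-second window, sliding every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce function applied within the window
  Seconds(30),               // windowLength
  Seconds(10))               // slideInterval
windowedCounts.print()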

Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances

CEP as a technique helps discover complex events by analyzing and correlating other events.

A CEP engine should provide operators over streams, keeping in mind that events and streams are first-class citizens in CEP. In CEP we think in terms of event streams: an event stream is a sequence of events that arrives over time.

Users provide queries to the CEP engine, whose main mission is to match those queries against the events coming through the event streams.

A CEP engine thus has a notion of time, and it allows temporal queries that reason in terms of concepts such as "time windows" or "before and after" event relationships, among others.

Siddhi, the CEP engine inside Stratio Streaming, supports among others:

• Filter

• Join

• Aggregation (avg, sum, min, max, custom)

• Group by

• Having

• Conditions and expressions (and, or, not, true/false, ==, !=, >=, >, <=, <)

• Data types (boolean, string, int, long, float, double)

• Pattern processing

• Sequence processing (zero to many, one to many, and zero to one)
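As a sketch of how an application embeds the engine and uses these operators, assuming the modern Siddhi API (SiddhiManager, SiddhiAppRuntime, QueryCallback), which postdates the Siddhi version used in this talk; the stream and query are illustrative:

import io.siddhi.core.SiddhiManager
import io.siddhi.core.event.Event
import io.siddhi.core.query.output.callback.QueryCallback

val siddhiManager = new SiddhiManager()

// A stream definition plus a temporal query: average reading per sensor
// over a sliding window of the last 10 events.
val app =
  """define stream sensorStream (name string, data double);
    |@info(name = 'avgQuery')
    |from sensorStream#window.length(10)
    |select name, avg(data) as data
    |group by name
    |insert into sensorAvgStream;""".stripMargin

val runtime = siddhiManager.createSiddhiAppRuntime(app)

// The engine pushes results to the callback as matching events arrive.
runtime.addCallback("avgQuery", new QueryCallback {
  override def receive(ts: Long, in: Array[Event], removed: Array[Event]): Unit =
    Option(in).getOrElse(Array.empty[Event]).foreach(e => println(e.getData.mkString(", ")))
})

val input = runtime.getInputHandler("sensorStream")
runtime.start()
input.send(Array[AnyRef]("sensor-1", Double.box(33.0)))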

Still, using the CEP engine on its own has drawbacks:

• You still have to integrate it in your code

• There is nothing like an interactive console

• If you want to do something with the streams, you guessed it, you have to code it!

• There is no way to remotely listen to a stream

• There are no solution patterns ready-to-use with the engine

• No statistics, no auditing

• Hard to integrate with other tools (dashboarding, log stream, batch processing)

With this solution you can use our API to send commands to the Stratio Streaming engine from your code.

You can also work with the interactive shell to test your queries or interact with the engine on demand.

Both tools, in fact, hide that you are sending messages to a complex engine built on Zookeeper, Kafka, Spark Streaming and the Siddhi CEP engine.
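A sketch of the API route, mirroring the shell commands shown below; the class and method names (StratioStreamingAPIFactory, createStream, insertData) follow the published Stratio Streaming Java API, but the exact signatures here should be read as assumptions:

import com.stratio.streaming.api.StratioStreamingAPIFactory
import com.stratio.streaming.commons.constants.ColumnType
import com.stratio.streaming.messaging.{ColumnNameType, ColumnNameValue}
import scala.collection.JavaConverters._

// Connect to the engine through Kafka and Zookeeper (hosts and ports are placeholders).
val api = StratioStreamingAPIFactory.create()
  .initializeWithServerConfig("kafkaHost", 9092, "zookeeperHost", 2181)

// Equivalent of: create --stream testStream --definition "name.string,data.double"
api.createStream("testStream", List(
  new ColumnNameType("name", ColumnType.STRING),
  new ColumnNameType("data", ColumnType.DOUBLE)).asJava)

// Equivalent of: insert --stream testStream --values "name.Temperature,data.33"
api.insertData("testStream", List(
  new ColumnNameValue("name", "Temperature"),
  new ColumnNameValue("data", Double.box(33.0))).asJava)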

[Architecture diagram: clients exchange requests and events with the engine through Kafka and Zookeeper; streams and events can be persisted to Cassandra.]

• create --stream testStream --definition "name.string,data.double"

• insert --stream testStream --values "name.Temperature,field.testValue,data.33"

• save cassandra start --stream testStream

• alter --stream testStream --definition "field.string"

CREATE --stream testStream --definition (name.string, data.double, data2.int, data3.float, data4.double, trueorfalse.boolean)

There are a lot of CEP operators that you can use in your queries:

• Filtering

• Projection

• In-built functions

• Windows (time and length)

• Join

• Event Sequences

• Event Patterns

• Output rate limiting

• Custom windows, custom functions

from sensor_grid#window.length(10)
select name, ind, avg(data) as data
group by name
insert into sensor_grid_avg for current-events

1. >, <, ==, >=, <=, !=

2. contains, instanceof

3. and, or, not

1. sum, avg, max, min, count: when aggregating (group by, having)

2. Field type conversion

3. Coalesce: if a field is null, take another field

4. IsMatch: true or false depending on whether the field matches a regex

from orders[price >= 20 and price < 100]…

from orders select * insert into ordersB…

from orders select client, price insert into ordersB…

1. Length window - a sliding window that keeps the last N events.

2. Time window - a sliding window that keeps events that have arrived within the last T time period.

3. Time and length batch windows - the same concepts, but events are output only at the end of the given window

4. Unique window - keeps only the latest events that are unique according to the given unique attribute.

5. First unique window - keeps the first events that are unique according to the given unique attribute.

6. External time window - a sliding window that processes events according to externally supplied timestamps

from payments[channel == 'Paypal']#window.time(1 min)

• With "on <condition>", it joins only the events that match the condition

• With "within <time>", it joins only the events that occur within the given time of each other

from errorStream#window.length(1) as errorStream
join allStream#window.length(1) as allStream
on errorStream.numberOfErrors > allStream.totalNumberOfEvents * 0.05
select * insert into alarmByThreshold;

from every (a1 = infoStock[action == "buy"] -> a2 = confirmOrder[command == "OK"])
  -> b1 = StockExchangeStream[price > infoStock.price]
within 3000
select a1.action as action, b1.price as price
insert into StockQuote

from every a1 = infoStock[action == "buy"]+,
  b1 = StockExchangeStream[price > 70]?,
  b2 = StockExchangeStream[price >= 75]
select a1[0].action as action, b1.price as priceA, b2.price as priceB
insert into StockQuote

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Stratio Ingestion is an ETL-for-Big-Data product based on Flume.

Design your workflows (WYSIWYG) with useful and improved sources and sinks, and transform your data on the fly.

• Create the stream if it doesn’t exist

• It is possible to send only filtered event flows to the streaming engine

• Built on the Stratio Streaming API.

• Call-center real-time monitoring

- Real-time detection of client churn risk
- Natural Language Processing analysis to detect incidents in real time
- Anomaly detection in the service, based on patterns

• IT services monitoring

- Real-time detection of DoS attacks, hotlinking, etc.
- Warnings in the monitoring of heterogeneous services
- Preventive detection of downtime, based on patterns

• Sensor grid monitoring

- Alarms when thresholds are reached
- Complex alarms involving several sensors
- Real-time monitoring (landing support devices in an airport, for example)

Data Machine Intelligence

• With a powerful query planner

• Able to perform mixed queries with streaming and batch data

SQL query example, mixing real-time data (coming from the Stratio Streaming engine) and batch data (stored in a NoSQL database):

SELECT sum(order.quantity), company_data.country
FROM streaming.order WITH WINDOW 15 minutes
INNER JOIN batch.company_data
ON order.company = company_data.company_name;

We are first going to use the Shell to create streams and queries.