Real-Time Streaming with Apache Spark Streaming and Apache Storm
Spark Meetup, 27.04.2015. Zagreb, Davorin Vukelić
Agenda
• Real-Time Streaming
• Apache Storm
• Apache Spark Streaming
• Demo
• Conclusion
Real-Time Streaming
• Continuous processing, aggregation and analysis of data as it is created
• Systems are modeled as a directed acyclic graph (DAG)
• Data is reduced before being centralized
• Messages are processed one at a time
• Collects structured, semi-structured and unstructured data from different sources
• A directed acyclic graph is a graphical representation of a chain of tasks
  • Displays the order of task execution
  • Models how data flows through transformations
Real-Time Streaming
• Actions in real time:
  • Monitoring trends
  • Communication
  • Recommendation
  • Searching
• Expected response time is from 500 ms to one minute
• Data messages flow through a chain of processors until the result reaches the final destination
• To guarantee reliable data processing, processing must be restarted in case of failures
• Delivery semantics (message guarantees):
  • At most once: messages may be lost but are never redelivered
  • At least once: messages are never lost but may be redelivered
  • Exactly once: messages are never lost and never redelivered (perfect message delivery)
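The three guarantees can be contrasted with a small plain-Java sketch. It is only an illustration under stated assumptions: a "crash" is modeled as a failure that happens between acknowledging and processing a message on its first delivery, message payloads double as unique IDs, and the class and method names are hypothetical (they belong to neither Storm nor Spark).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: each method returns the sequence of messages that end up processed.
final class DeliverySemanticsSketch {

    // At most once: acknowledge first, process second. A crash in between loses the message.
    static List<String> atMostOnce(List<String> messages, Set<String> crashed) {
        List<String> processed = new ArrayList<>();
        for (String m : messages) {
            if (!crashed.contains(m)) {
                processed.add(m);
            }
        }
        return processed;
    }

    // At least once: process first, acknowledge second. A crash in between leaves the
    // message unacknowledged, so it is replayed and processed a second time.
    static List<String> atLeastOnce(List<String> messages, Set<String> crashed) {
        List<String> processed = new ArrayList<>();
        for (String m : messages) {
            processed.add(m);
            if (crashed.contains(m)) {
                processed.add(m); // replay after the missing acknowledgement
            }
        }
        return processed;
    }

    // Exactly-once effect: at-least-once delivery plus de-duplication by message ID.
    static List<String> exactlyOnce(List<String> messages, Set<String> crashed) {
        Set<String> seen = new HashSet<>();
        List<String> processed = new ArrayList<>();
        for (String m : atLeastOnce(messages, crashed)) {
            if (seen.add(m)) {
                processed.add(m);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList("a", "b", "c");
        Set<String> crashed = new HashSet<>(Arrays.asList("b"));
        System.out.println(atMostOnce(messages, crashed));
        System.out.println(atLeastOnce(messages, crashed));
        System.out.println(exactlyOnce(messages, crashed));
    }
}
```

In this model the only difference between the first two guarantees is the order of "acknowledge" and "process"; the third one adds bookkeeping on top of at-least-once delivery.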
Apache Storm
• Distributed real-time computation system
• The processing logic is written against the Storm API
• The topology must be deployed to a Storm cluster to run continuously
• A Storm cluster consists of several different daemons running on separate nodes (servers)
• Event-stream processing: a one-at-a-time processing model
Storm
• Main abstractions in Storm:
  • Spout
    • Source node of streams in a computation
    • Contains the code that reads data from:
      o Twitter API, web crawlers, Facebook API
      o Queueing brokers: Kafka, Kestrel, RabbitMQ
  • Bolt
    • Node that implements the logic of a computation
    • Functions, filters, streaming joins, streaming aggregations, storing to and looking up from databases and filesystems
  • Topology
    • Network of spouts and bolts
    • Runs indefinitely once deployed
    • Each edge in the network represents a bolt subscribing to the output of another node; final results can go to a relational database, filesystem, NoSQL database, another topology or a queueing system
  • Tuple
    • Data message model for communication between nodes in a topology
    • Defined as an immutable list of objects of any type
Storm Tuple
• Storm needs to know how to serialize and deserialize objects when they are passed between tasks
• Uses Kryo for serialization
• Custom types need a registered custom serializer
• By default Storm can serialize:
  • primitive types
  • strings
  • byte arrays
  • ArrayList
  • HashMap
  • HashSet
  • Clojure collection types
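Storm Topology
// Imports assume Storm 0.9.x (backtype.storm.*); from Storm 1.0 on the package is org.apache.storm.*
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
TopologyBuilder builder = new TopologyBuilder();
// Spout with 4 executors
builder.setSpout("reader", new WordReader(), 4);
// Bolt with 2 executors and 2 tasks; tuples from the spout are distributed randomly
builder.setBolt("normalizer", new WordNormalizer(), 2).setNumTasks(2).shuffleGrouping("reader");
// Tuples with the same "word" value always reach the same counter task
builder.setBolt("counter", new WordCounter(), 2).fieldsGrouping("normalizer", new Fields("word"));
Config conf = new Config();
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 8);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("Toplogie", conf, builder.createTopology());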
Storm
• Sharding
[Figure: sharding. Source: Karthik Ramasamy, Udacity: Real-Time Analytics with Apache Storm]
Storm
• Scaling
[Figure: scaling. Source: Karthik Ramasamy, Udacity: Real-Time Analytics with Apache Storm]
Storm Grouping
• Defines how data is exchanged between nodes in a topology
• Relevant when a bolt has several parallel instances (several tasks)
• Shuffle
  • Each tuple emitted by the source is sent to a randomly chosen bolt instance
  • Guarantees that each bolt instance receives the same number of tuples
• Fields
  • Controls how tuples are routed to bolt instances based on fields in the tuple
  • Tuples with the same field value are always sent to the same bolt instance
• Partial Key
  • Also routes tuples to bolt instances based on fields in the tuple
  • Load is balanced between two downstream bolt instances, which gives better utilization of resources
  • Use when the incoming data is skewed
• All
  • Replicates each tuple and sends it to all instances of the receiving bolt
• Custom
  • Create your own custom stream grouping
• Global
  • All instances of the source send tuples to a single target instance
  • Specifically, the task with the lowest ID
Storm Grouping
• Local or shuffle grouping
  • If the target bolt has one or more instances (tasks) in the same worker process, tuples are shuffled to just those in-process tasks
  • Otherwise it acts like a normal shuffle grouping
• Direct grouping
  • The producer of the tuple decides which instance of the consuming bolt receives it
  • The target task ID must be specified; task IDs can be obtained from the TopologyContext or from the return value of the emit method of OutputCollector
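The routing rules behind three of these groupings can be sketched in plain Java. This is a simplified illustration, not Storm's implementation: the class and method names are hypothetical, fields grouping is assumed to be "hash of the field value modulo the number of tasks", and shuffle grouping is approximated by round-robin so that the result is deterministic.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: task selection rules, not Storm's actual code.
final class GroupingSketch {

    // Fields grouping: equal field values always map to the same task index.
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping, approximated by round-robin:
    // returns how many of numTuples tuples each task receives.
    static int[] shuffleCounts(int numTuples, int numTasks) {
        int[] counts = new int[numTasks];
        for (int i = 0; i < numTuples; i++) {
            counts[i % numTasks]++;
        }
        return counts;
    }

    // Global grouping: every tuple goes to the task with the lowest ID.
    static int globalGrouping(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    public static void main(String[] args) {
        System.out.println(fieldsGrouping("storm", 4) == fieldsGrouping("storm", 4));
        System.out.println(Arrays.toString(shuffleCounts(10, 4)));
        System.out.println(globalGrouping(Arrays.asList(7, 3, 5)));
    }
}
```

The fields rule is what makes a word count correct: every occurrence of a given word reaches the same counter instance, so no instance holds a partial count for it.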
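Storm Spout
// Imports assume Storm 0.9.x (backtype.storm.*); from Storm 1.0 on the package is org.apache.storm.*
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class WordReader implements IRichSpout {
    private SpoutOutputCollector collector;
    private FileReader fileReader;
    private boolean completed = false;
    private TopologyContext context;

    public boolean isDistributed() { return false; }
    // Called when the tuple tree rooted at msgId has been fully processed
    public void ack(Object msgId) { System.out.println("OK:" + msgId); }
    public void close() { }
    // Called when the tuple with msgId failed or timed out
    public void fail(Object msgId) { System.out.println("FAIL:" + msgId); }

    public void nextTuple() {
        // The file is read only once; afterwards there is nothing left to emit
        if (completed) {
            return;
        }
        String str;
        BufferedReader reader = new BufferedReader(fileReader);
        try {
            // Emit one tuple per line, using the line itself as the message ID
            while ((str = reader.readLine()) != null) {
                this.collector.emit(new Values(str), str);
            }
        } catch (Exception e) {
            throw new RuntimeException("Error reading tuple", e);
        } finally {
            completed = true;
        }
    }

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.context = context;
        this.collector = collector;
        try {
            this.fileReader = new FileReader(conf.get("wordsFile").toString());
        } catch (FileNotFoundException e) {
            throw new RuntimeException("Error reading file [" + conf.get("wordsFile") + "]");
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
    public void activate() {}
    public void deactivate() {}
    public Map<String, Object> getComponentConfiguration() { return null; }
}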
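Storm Bolt
// Imports assume Storm 0.9.x (backtype.storm.*); from Storm 1.0 on the package is org.apache.storm.*
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordNormalizer implements IRichBolt {
    private OutputCollector collector;

    public void cleanup() { }

    public void execute(Tuple input) {
        // Split the incoming line into words
        String sentence = input.getString(0);
        String[] words = sentence.split(" ");
        for (String word : words) {
            word = word.trim();
            if (!word.isEmpty()) {
                word = word.toLowerCase();
                // Unanchored emit; collector.emit(input, new Values(word)) would anchor the word to the input tuple
                collector.emit(new Values(word));
            }
        }
        // Acknowledge the input tuple once all its words have been emitted
        collector.ack(input);
    }

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public Map<String, Object> getComponentConfiguration() { return null; }
}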
Storm demo
Storm - cluster
• Master node: runs the Nimbus daemon
  • Distributes code around the cluster
  • Assigns tasks to each worker node
  • Monitors for failures
• Worker nodes: run the Supervisor daemon
  • Each executes a portion of a topology
• ZooKeeper
  • Keeps the state of all nodes in the cluster
• A supervisor's computing resources can be partitioned into multiple worker slots
• Within a worker slot, Storm can spawn multiple threads, referred to as executors
Source: Fan Jiang, Enabling Site-Aware Scheduling for Apache Storm in ExoGENI
Storm - cluster
• Tasks
  • Each spout or bolt runs as a number of tasks spread across the cluster
  • By default each task corresponds to one thread of execution (one task per executor)
  • Tasks are instances of spouts and bolts whose nextTuple() and execute() methods are called by executor threads
  • Stream groupings define how tuples are sent from one set of tasks to another set of tasks
  • ComponentConfigurationDeclarer: setNumTasks(#) sets the total number of tasks for the component; the tasks are spread across its executors (threads)
• Workers
  • Topologies execute across one or more worker processes
  • Each worker process is a JVM and executes a subset of all the tasks for the topology
  • Each worker runs executors for one specific topology
  • For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks
  • Config: setNumWorkers
• Executors
  • Java threads running within a worker JVM process
  • Each executor runs one or more tasks of the same component; multiple tasks can be assigned to a single executor
  • TopologyBuilder: setSpout(id, spout, #) sets the number of executors
  • TopologyBuilder: setBolt(id, bolt, #) sets the number of executors
• The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time
Storm - cluster
• builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
• builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2).setNumTasks(4).shuffleGrouping(SENTENCE_SPOUT_ID);
• builder.setBolt(COUNT_BOLT_ID, countBolt, 4).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
Source: P. Taylor Goetz, Brian O'Neill: Storm Blueprints: Patterns for Distributed Real-time Computation
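As a rough check of how executors and tasks add up, the three builder calls above can be described as data and summed in plain Java. This is a sketch under assumptions: the component names are hypothetical stand-ins for the ID constants, a component without setNumTasks is assumed to get one task per executor, and the per-worker figure is a simple even split.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: counts executors and tasks for a topology description.
final class ParallelismSketch {

    static final class Component {
        final String id;
        final int executors; // parallelism hint passed to setSpout/setBolt
        final int tasks;     // value passed to setNumTasks, or the executor count if not set

        Component(String id, int executors, int tasks) {
            this.id = id;
            this.executors = executors;
            this.tasks = tasks;
        }
    }

    static int totalExecutors(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.executors;
        }
        return sum;
    }

    static int totalTasks(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.tasks;
        }
        return sum;
    }

    // Even split of a total across workers, rounded up.
    static int perWorker(int total, int workers) {
        return (total + workers - 1) / workers;
    }

    public static void main(String[] args) {
        List<Component> topology = Arrays.asList(
                new Component("sentence-spout", 2, 2),
                new Component("split-bolt", 2, 4),
                new Component("count-bolt", 4, 4));
        System.out.println("executors = " + totalExecutors(topology));
        System.out.println("tasks = " + totalTasks(topology));
        System.out.println("per worker (300 over 50) = " + perWorker(300, 50));
    }
}
```

For the split bolt this gives 4 tasks on 2 executors, that is, two task instances served by each thread.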
Storm
• Scalable:
  • Can process very high message throughput with very low latency
  • One million 100-byte messages per second per node (node configuration: 2x Intel E5645 @ 2.4 GHz processors, 24 GB memory)
• Fault-tolerant:
  • Automatically restarts workers that die; if a node dies, its workers are restarted on another node
  • Failure is expected and embraced
  • The Nimbus and Supervisor daemons restart as if nothing happened
  • State is stored in ZooKeeper, so they can resume what they were doing before they died
• Guaranteed data processing (reliable):
  • Tracks the lineage of a tuple as it makes its way through the topology
  • Messages are only replayed when there are failures
  • Anchoring means specifying a link in the tuple tree; it is done at the same time a new tuple is emitted
  • At least once by default
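Storm's documentation describes tracking a tuple tree with a single 64-bit value that is XORed with every tuple ID when the tuple is emitted and again when it is acked. The following plain-Java sketch illustrates that idea for one tuple tree only; the class and method names are hypothetical, and it is a simplification rather than Storm's acker code.

```java
// Illustrative only: a single tuple tree tracked with one 64-bit value.
final class AckTrackerSketch {
    private long ackVal = 0L;

    // Called when a tuple is emitted (anchored) into the tree.
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // Called when a tuple in the tree is acknowledged.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // Once every emitted ID has also been acked, each ID has been XORed in twice
    // and the value is back to 0.
    boolean fullyProcessed() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.emitted(11L);
        tracker.emitted(22L);
        tracker.acked(11L);
        System.out.println(tracker.fullyProcessed());
        tracker.acked(22L);
        System.out.println(tracker.fullyProcessed());
    }
}
```

The appeal of this scheme is that memory use per tuple tree stays constant no matter how many tuples the tree contains.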
Storm
• Language:
  • The core of Storm is a Thrift definition for defining and submitting topologies, so topologies can be defined and submitted from any language
• Open source:
  • Large and growing ecosystem of libraries and tools to use in conjunction with Storm
  • Spouts integrate with queueing systems such as JMS, Kafka and Redis pub/sub
  • Helper bolts integrate with databases such as MongoDB, RDBMSs, Cassandra and HBase, and with filesystems such as HDFS
• Transactional:
  • Exactly-once messaging semantics can be achieved for any computation
Spark Streaming
• Open source data streaming and processing engine
• Built around speed, ease of use and sophisticated analytics
• Extension of the core Spark API
• Scalable, high-throughput, fault-tolerant processing of data streams
• Batch processing means processing data en masse; micro-batching is batch processing where the batch size is orders of magnitude smaller
• Runs on:
  • Hadoop YARN
  • Mesos
  • Spark standalone cluster
  • EC2
• Reads data from:
  • Kafka
  • Flume
  • ZeroMQ
  • TCP sockets
  • Twitter
  • Kinesis
  • HDFS
Spark Streaming
• Sends data to:
  • HDFS
  • Databases
  • Dashboards
  • Spark's machine learning algorithms
  • Spark's graph processing algorithms
• Receives the input data stream in mini-batches and performs transformations on those mini-batches or on groups of mini-batches
Source: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark: Lightning-Fast Big Data Analysis
Spark Streaming
• Abstractions in Spark Streaming:
  • DStream
    • Discretized stream
    • Arriving sequence of data
    • A continuous series of RDDs
  • RDD
    • Resilient Distributed Dataset
    • Collection of objects spread across a cluster
    • Partitioned and distributed
    • Partitions are recomputed on failure
    • Stored in RAM or on disk
    • The streaming context works with small RDDs
    • Each RDD in a DStream contains data from a certain interval
    • Any transformation available on a regular RDD can be executed on each small RDD
  • PairDStream
    • DStream of key-value pairs, which provides extra methods such as reduceByKey and join
  • StreamingContext API
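The idea that each RDD in a DStream holds the data of one interval can be pictured with plain Java collections. The sketch below is an analogy only, with no Spark involved: it assumes non-negative event timestamps in milliseconds, and the class name, method name and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: turns a sequence of timestamped events into per-interval batches.
final class MicroBatchSketch {

    // Groups event timestamps (ms) into batches; the map key is the batch start time.
    static SortedMap<Long, List<Long>> discretize(long[] eventTimesMs, long batchIntervalMs) {
        SortedMap<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMs) {
            long batchStart = (t / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] events = {0, 500, 999, 1000, 2500};
        // One entry per 1-second batch that received at least one event
        System.out.println(discretize(events, 1000));
    }
}
```

Each map entry plays the role of one small RDD; a transformation on the stream is then simply the same transformation applied to every entry in turn.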
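Spark Streaming
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
SparkConf conf = new SparkConf().setAppName("twitter-stream").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10)); // 10-second batches
jssc.checkpoint("/home/cloudera/Desktop/spark_tweets");
jssc.start(); // DStream operations must be defined before start()
jssc.awaitTermination();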
Spark Streaming
• Operations
  • Applied to each RDD in a DStream
  • Transformations
    • Create a new DStream
  • Output operations
    • Write data to other systems
    • Run periodically on each time step, producing output in batches
• Checkpointing
  • A streaming application must be resilient to system failures and JVM crashes
  • Checkpointing lets it recover from failures
  • Types of data that are checkpointed:
    • Metadata checkpointing: configuration, DStream operations, incomplete batches
    • Data checkpointing: saving the generated RDDs to reliable storage
Spark Streaming - Input
• Built-in streaming sources:
  • Multiple receivers can be created to receive multiple data streams simultaneously
• Basic sources:
  • File systems
    • Reads data from files on any file system compatible with the HDFS API
    • Monitors a directory and processes files created in that directory
    • Files must be moved into the directory, not continuously appended
  • Socket connections
    • The data stream is split into time intervals
  • Akka actors
  • Queue of RDDs as a stream
    • Each RDD pushed into the queue is treated as a batch of data in the DStream
• Advanced sources:
  • Require external non-Spark libraries
  • Kafka
  • Flume
  • Kinesis
  • Twitter
  • ZeroMQ
  • MQTT
• Custom receiver:
  • Implemented by developers
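Spark Streaming - Input
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;
import twitter4j.auth.Authorization;

// twitter is assumed to be an already configured twitter4j.Twitter instance
Authorization auth = twitter.getAuthorization();
final String[] filters = { "#KCA", "#kca" };
JavaDStream<Status> tweets = TwitterUtils.createStream(jssc, auth, filters);
// Keep only the text of each tweet
JavaDStream<String> statuses = tweets.map(new Function<Status, String>() {
    public String call(Status status) {
        return status.getText();
    }
});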
Spark Streaming - Transformations
• Transformations on DStreams: stateless transformations
  • Each batch is processed separately
  • The result does not depend on the data of previous batches
  • transform() applies any arbitrary RDD-to-RDD function to the DStream
Transformation | Meaning
map(func) | Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions.
Spark Streaming - Transformations
Transformation | Meaning
count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
reduceByKey() | Combine values with the same key in each batch. Requires a JavaPairDStream.
groupByKey() | Group values with the same key in each batch.
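Spark Streaming - Transformations
// Spark 1.x Java API: FlatMapFunction.call returns an Iterable (an Iterator from Spark 2.0 on)
import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;
JavaDStream<String> words = statuses.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String in) { return Arrays.asList(in.split(" ")); }
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String in) throws Exception { return new Tuple2<String, Integer>(in, 1); }
});
JavaPairDStream<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});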
Spark Streaming - Transformations
• Transformations on DStreams: stateful transformations
  • Data from previous batches is used to generate the results for a new batch
  • Tracks state across time
  • Checkpointing must be enabled (for fault tolerance)
  • updateStateByKey():
    • Returns a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
  • Sliding windows:
    • Transformations over a sliding window of data
    • Parameters:
      • window length: the duration of the window
      • sliding interval: the interval at which the window operation is performed
      • Both must be multiples of the batch interval of the source DStream
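How window length and sliding interval interact can be shown on a plain array of per-batch counts. The sketch below is a simplified model rather than Spark code: both parameters are expressed in number of batches, early windows are allowed to be partial, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: sliding-window sums over per-batch counts.
final class SlidingWindowSketch {

    // Every slideInterval batches, sums the last windowLength batches seen so far.
    static List<Integer> windowedSums(int[] batchCounts, int windowLength, int slideInterval) {
        List<Integer> sums = new ArrayList<>();
        for (int end = slideInterval; end <= batchCounts.length; end += slideInterval) {
            int start = Math.max(0, end - windowLength);
            int sum = 0;
            for (int i = start; i < end; i++) {
                sum += batchCounts[i];
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Window of 3 batches, evaluated every 2 batches
        System.out.println(windowedSums(new int[] {1, 2, 3, 4, 5, 6}, 3, 2));
    }
}
```

With a window of 3 and a slide of 2, consecutive windows overlap by one batch, so that batch is counted in two results.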
Spark Streaming - Transformations
Transformation | Meaning
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enter the sliding window, and "inverse reducing" the old data that leave the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable to only "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
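The incremental variant with an inverse function can be illustrated without Spark. In this sketch the window slides by one batch at a time, the reduce function is applied to the batch entering the window and the inverse function to the batch leaving it; the class and method names are hypothetical and the model ignores keys and partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Illustrative only: incremental window aggregation with an invertible reduce function.
final class IncrementalWindowSketch {

    static List<Integer> incremental(int[] batches, int windowLength,
                                     BinaryOperator<Integer> reduce,
                                     BinaryOperator<Integer> invReduce) {
        List<Integer> results = new ArrayList<>();
        Integer acc = null;
        for (int i = 0; i < batches.length; i++) {
            // Reduce in the batch that enters the window
            acc = (acc == null) ? Integer.valueOf(batches[i]) : reduce.apply(acc, batches[i]);
            // Inverse-reduce the batch that leaves the window
            if (i >= windowLength) {
                acc = invReduce.apply(acc, batches[i - windowLength]);
            }
            results.add(acc);
        }
        return results;
    }

    public static void main(String[] args) {
        // Adding counts as they enter and subtracting them as they leave a window of 3 batches
        System.out.println(incremental(new int[] {1, 2, 3, 4, 5}, 3, Integer::sum, (a, b) -> a - b));
    }
}
```

Each step touches only two batches regardless of the window length, which is where the efficiency gain over recomputing the whole window comes from.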
Spark Streaming - Transformations
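// Optional is Guava's com.google.common.base.Optional, as used by the Spark 1.x Java API
import java.util.List;
import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
        public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
            // Start from the previous running count, or 0 for a new key
            Integer newSum = state.or(0);
            for (int i : values) {
                newSum += i;
            }
            return Optional.of(newSum);
        }
    };
JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction);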
Spark Streaming - Transformations: join
• Works on JavaPairDStream
• Combines data from multiple DStreams with the transformations:
  • join()
  • leftOuterJoin()
  • rightOuterJoin()
  • fullOuterJoin()
• Merges the contents of two different DStreams:
  • union()
• Stream-stream joins
• Stream-dataset joins
Spark Streaming - Output
• Output operations store the final transformed data in an external database, in a file system, or print it on screen
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as a SequenceFile of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as a Hadoop file. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
Spark Streaming - Output
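import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import redis.clients.jedis.Jedis;
import scala.Tuple2;

// foreachRDD replaces the deprecated foreach alias
sortedCounts.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
    public Void call(JavaPairRDD<Integer, String> rdd) {
        // Runs on the driver: collect() brings the batch to the driver before publishing to Redis
        Jedis jedis = new Jedis("#.#.#.#");
        for (Tuple2<Integer, String> t : rdd.collect()) {
            jedis.publish("spark_words", t._2 + "|" + Integer.toString(t._1));
        }
        return null;
    }
});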
Spark Streaming demo
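Spark Streaming - Output
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;
// Convert to Hadoop Writable types before saving
JavaPairDStream<Text, IntWritable> writableDStream = runningCounts.mapToPair(
    new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
        public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> e) {
            return new Tuple2<Text, IntWritable>(new Text(e._1()), new IntWritable(e._2()));
        }
    });
// Concrete subclass that fixes the generic types of TextOutputFormat
class OutFormat extends TextOutputFormat<Text, IntWritable> {};
writableDStream.saveAsHadoopFiles("hdfs://#.#.#.#/user/hdfs/tweets_spark/", "",
    Text.class, IntWritable.class, OutFormat.class);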
Spark Streaming Parallelism
• Increasing the number of receivers
  • Create multiple input DStreams
  • Use union to merge them
• Explicitly repartitioning received data
  • Repartition the input stream with DStream.repartition
• Increasing parallelism in aggregation
  • The level of parallelism can be specified for operations that reduce the dataset
Spark Streaming
• Scalable
• High-throughput
• Fault-tolerant
• Guaranteed data processing (reliable):
  • Exactly once
Spark Streaming
• MLlib
  • Streaming machine learning algorithms that can simultaneously learn from the streaming data and apply the model to it
  • Streaming Linear Regression
  • Streaming KMeans
• DataFrame
  • Create a SQLContext using the SparkContext
  • Declare a JavaRow
  • Apply the model online to streaming data
Storm vs Spark - use cases
• Real-time analytics
• Online machine learning
• Continuous computation
• Distributed RPC (Remote Procedure Call)
• ETL
• Look for trends that can indicate a problem
  • Alert or provide automated corrections
• Provide an interface to visualize:
  • Current data
  • Historical data
Storm vs Spark approach
• Storm:
  • Tends to be driven by creating classes and implementing interfaces
  • Has the advantage of broader language support (code written in R or any other language not natively supported by Spark)
  • The DAG is natural to the processing model; the Tuple is a natural interface for the data passed between nodes
  • Excels at computing transformations as data is ingested, with sub-second latencies
• Spark:
  • Has more of a "functional" flavor, where working with the API is driven more by chaining successive method calls to invoke primitive operations
  • Tuples can feel awkward in Java, but they bring the benefit of compile-time type checking
  • Can use an existing Hadoop or Mesos cluster
  • Micro-batching trivially gives stateful computation, making windowing an easy task
• Neither approach is better or worse
Storm vs Spark
 | Storm | Spark
Processing model | Event-streaming | Micro-batching / batch (Spark Core)
Delivery guarantees | At most once / at least once | Exactly once
Latency | Sub-second | Seconds
Language options | Java, Clojure, Scala, Python, Ruby | Java, Scala, Python
Development | Use another tool for batch | Batching and streaming are very similar
Storm vs Spark - recommendation
• Storm:
  • Latency < 1 second (500 ms)
  • Real time:
    • Analytics
    • Budgeting
    • ML
• Spark:
  • ETL
  • Iterative machine learning
  • Interactive analytics
  • Interactive queries
  • Batch processing
  • Graph processing
Storm vs Spark
• Storm:
  • Lower-level API
  • No concept of look-back aggregations (sliding windows)
  • Combining batch with streaming requires another tool
• Spark:
  • < 1 TB size of cluster
  • Latency from 500 milliseconds to 1 second (micro-batching incurs a latency cost)
  • Streaming inputs are replicated in memory
Storm vs Spark
• Jonathan Leibiusky, Gabriel Eisbruch, Dario Simonassi: Getting Started with Storm - Continuous streaming computation with Twitter's cluster technology
• Quinton Anderson: Storm Real-time Processing Cookbook - Efficiently Process Unbounded Streams of Data in Real Time
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark - Lightning-Fast Big Data Analysis
• Apache Spark Streaming Programming Guide: https://spark.apache.org/docs/latest/streaming-programming-guide.html
• Apache Storm: https://storm.apache.org/documentation/Home.html
• P. Taylor Goetz, Brian O'Neill: Storm Blueprints - Patterns for Distributed Real-time Computation