Sandeep GiriHadoop
SPARK STREAMING
Extension of the core Spark API: high-throughput, fault-tolerant
Workflow
• Spark Streaming receives live input data streams
• Divides the data into batches
• Spark Engine generates the final stream of results in batches
Provides a discretized stream, or DStream, which represents a continuous stream of data.
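The batching step in the workflow can be pictured with a small plain-Java sketch. This is not Spark code: the `MicroBatcher` class, the `Event` type, and the `toBatches` method are hypothetical names used only to illustrate how timestamped records fall into fixed-length batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatcher {
    /** Hypothetical timestamped record; not a Spark type. */
    public static final class Event {
        final long timestampMs;
        final String value;

        public Event(long timestampMs, String value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /**
     * Groups events into batches keyed by the batch start time.
     * Assumes non-negative timestamps and a positive batch interval.
     */
    public static Map<Long, List<String>> toBatches(List<Event> events, long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            // Round the timestamp down to the start of its batch interval.
            long batchStart = (e.timestampMs / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.value);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(100, "a"), new Event(900, "b"), new Event(1200, "c"));
        // With 1-second batches this prints {0=[a, b], 1000=[c]}
        System.out.println(toBatches(events, 1000));
    }
}
```

Each map entry plays the role of one batch: in Spark Streaming the engine would then process every batch as a unit.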
SPARK STREAMING - DSTREAM
Internally represented as a continuous series of RDDs
Each RDD in a DStream contains data from a certain interval.
Window Operations
// Reduce the last 30 seconds of data, every 10 seconds
pairs.reduceByKeyAndWindow(
    reduceFunc, new Duration(30000), new Duration(10000));
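To make the window operation above more concrete, here is a minimal plain-Java sketch of a sliding-window reduce over per-batch counts. It is a simulation, not the Spark implementation: `WindowSketch` and `slidingReduce` are hypothetical names, and the window length and slide are expressed in numbers of batches (in Spark Streaming both must be multiples of the batch interval).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowSketch {
    /**
     * Merges per-batch key counts over a sliding window.
     *
     * @param batches   per-batch counts, oldest first
     * @param windowLen window length, in batches (e.g. 3 for 30 s with 10 s batches)
     * @param slide     slide interval, in batches (e.g. 1 for 10 s with 10 s batches)
     * @return one merged result per slide, each covering the last windowLen batches
     */
    public static List<Map<String, Integer>> slidingReduce(
            List<Map<String, Integer>> batches, int windowLen, int slide) {
        List<Map<String, Integer>> results = new ArrayList<>();
        for (int end = slide; end <= batches.size(); end += slide) {
            // Early windows are shorter because fewer batches have arrived.
            int start = Math.max(0, end - windowLen);
            Map<String, Integer> merged = new TreeMap<>();
            for (Map<String, Integer> batch : batches.subList(start, end)) {
                for (Map.Entry<String, Integer> e : batch.entrySet()) {
                    // The reduce function here is addition.
                    merged.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
            results.add(merged);
        }
        return results;
    }
}
```

With four batches `{a=1}`, `{a=2, b=1}`, `{b=4}`, `{c=1}`, a window of 3 and a slide of 1, the last result covers only the final three batches, so the first batch's `a=1` has dropped out of the window.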
SPARK STREAMING - EXAMPLE
Problem: You want to do a word count every second.
Step 1: Create a connection to the service
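// Local streaming context with two threads and a batch interval of 1 second
JavaStreamingContext jssc = new JavaStreamingContext(
    "local[2]", "JavaNetworkWordCount", new Duration(1000));
// Stream of text lines read from a TCP socket
JavaReceiverInputDStream<String> lines =
    jssc.socketTextStream("localhost", 9999);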
Step 2: Split each line into words
// Run a split function on each line with the help of flatMap
// Create a stream on top of the array of words
// Spark 1.x signature: call returns an Iterable (Spark 2.x expects an Iterator)
JavaDStream<String> words = lines.flatMap(
    new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) {
            return Arrays.asList(x.split(" "));
        }
    });
Step 3: With the help of the mapToPair() function, create a key-value pair for each word. The key is the word and the value is 1
// mapToPair (not map) is the call that returns a JavaPairDStream
JavaPairDStream<String, Integer> pairs = words.mapToPair(
    new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
        }
    });
Step 4: Using the reduceByKey transformation, sum the counts (the 1’s) for each word. The result is a DStream of (word, count) pairs
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
        }
    });
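Step 5: Print the counts of each batch, then start the computation (nothing runs until start() is called)
wordCounts.print();
jssc.start();
jssc.awaitTermination();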
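Steps 2 to 4 can also be traced on a single batch without a cluster. The following is a rough plain-Java sketch using `java.util.stream`, not the Spark API: `BatchWordCount` and `count` are hypothetical names, and the list of lines stands in for the data one RDD of the DStream would hold.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BatchWordCount {
    /** Counts words in the lines of one batch, mirroring steps 2 to 4. */
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            // Step 2: split each line into words and flatten into one stream
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // Step 3: pair each word with the value 1
            .map(word -> new SimpleEntry<String, Integer>(word, 1))
            // Step 4: sum the values that share the same key
            .collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(count(Arrays.asList("to be or", "not to be")));
    }
}
```

In the streaming version the same three steps run again on every one-second batch, so each printed result covers only the lines received during that second.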
SPARK STREAMING - DSTREAM