21.04.2016 Meetup: Spark vs. Flink

Spark vs. Flink: Rumble in the (Big Data) Jungle

München, 2016-04-20

Konstantin Knauf Michael Pisula

Background

The Big Data Ecosystem: Apache Top-Level Projects over Time

[Timeline: Apache top-level big data projects, 2008 to 2015]

The New Guard

Spark / Flink

Origin: Berkeley University / TU Berlin

Apache Incubator: 2013 / 04/2014

Apache Top-Level: 02/2014 / 01/2015

Company: databricks / data Artisans

Supported languages: Scala, Java, Python, R / Java, Scala, Python

Implemented in: Scala / Java

Cluster: Stand-Alone, Mesos, EC2, YARN / Stand-Alone, Mesos, EC2, YARN

Teaser: "Lightning-fast cluster computing" / "Scalable Batch and Stream Data Processing"

The Challenge

Real-Time Analysis of a Superhero Fight Club

Data Model

Fight (stream): hitter: Int, hittee: Int, hitpoints: Int

Segment (static): id: Int, name: String, segment: String

Detail (static): name: String, gender: Int, birthYear: Int, noOfAppearances: Int

Hero (Segment joined with Detail): id: Int, name: String, segment: String, gender: Int, birthYear: Int, noOfAppearances: Int

The Fight events arrive as a continuous stream; Segment and Detail are static batch inputs that are combined into Hero.
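The later snippets construct and read Hero objects on both the Spark and the Flink side. A minimal sketch of how the class might look as a plain Java POJO; the fields follow the slide, the constructor and getters follow how the later code uses them, everything else is an assumption:

import java.io.Serializable;

// Hero.java: the combined static record.
// Serializable so Spark can ship it to workers and store it via objectFile.
public class Hero implements Serializable {

    public int id;
    public String name;
    public String segment;
    public int gender;
    public int birthYear;
    public int noOfAppearances;

    // Flink's POJO handling needs a public no-argument constructor
    public Hero() {}

    public Hero(int id, String name, String segment,
                int gender, int birthYear, int noOfAppearances) {
        this.id = id;
        this.name = name;
        this.segment = segment;
        this.gender = gender;
        this.birthYear = birthYear;
        this.noOfAppearances = noOfAppearances;
    }

    public int getId() { return id; }
    public String getName() { return name; }
    public String getSegment() { return segment; }
    public int getGender() { return gender; }
    public int getBirthYear() { return birthYear; }
    public int getNoOfAppearances() { return noOfAppearances; }
}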

The Setup

[Setup diagram: a data generator produces the Segment and Detail data and a stream of Avro-encoded Fight events into a Kafka cluster; batch and stream processing run on an AWS cluster, the batch side producing the Heroes data set and the stream side combining stream and batch.]
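For orientation, a minimal sketch of how the data generator might publish Avro-encoded fight events to Kafka. The topic name matches the later snippets; the class name, broker address, and configuration are illustrative assumptions, and FightEvent is assumed to be the Avro-generated event class:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FightEventPublisher {

    private final SpecificDatumWriter<FightEvent> writer =
            new SpecificDatumWriter<>(FightEvent.class);
    private final KafkaProducer<String, byte[]> producer;

    public FightEventPublisher(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(FightEvent event) throws IOException {
        // Serialize the event with Avro's binary encoding ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(event, encoder);
        encoder.flush();
        // ... and send it to the topic the streaming jobs consume from
        producer.send(new ProducerRecord<>("FightEventTopic", out.toByteArray()));
    }
}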

Round 1: Setting up

Dependencies (Flink)

compile "org.apache.flink:flink-java:1.0.0"compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"//For Local Execution from IDEcompile "org.apache.flink:flink-clients_2.11:1.0.0"

Skeleton (Flink)

// Batch (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Stream (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Processing Logic

// For Streaming
env.execute();

Dependencies (Spark)

compile 'org.apache.spark:spark-core_2.10:1.5.0'
compile 'org.apache.spark:spark-streaming_2.10:1.5.0'

Skeleton (Spark)

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// Batch
JavaSparkContext sparkContext = new JavaSparkContext(conf);
// Stream
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Processing Logic

jssc.start(); // For Streaming

First Impressions
- Practically no boilerplate
- Easy to get started and play around
- Runs in the IDE (see the sketch below)
- Hadoop MapReduce is much harder to get into
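To illustrate the "runs in the IDE" point, here is a tiny, self-contained Flink batch job that can be started directly from a main() method without any cluster; the class name and the example data are made up:

import org.apache.flink.api.java.ExecutionEnvironment;

public class LocalPlayground {
    public static void main(String[] args) throws Exception {
        // Started from the IDE, this gives a local execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("Batman", "Superman", "Hulk")
           .filter(name -> name.startsWith("S"))
           .print();   // print() triggers execution for the DataSet API
    }
}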

Round 2: Static Data Analysis

Combine both static data parts

Read the CSV file and transform it (Spark)

JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
    .map(line -> line.split(","))
    .filter(array -> array.length == 3)
    .mapToPair((String[] parts) -> {
        int id = Integer.parseInt(parts[0]);
        String name = parts[1], segment = parts[2];
        return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
    });

Join with detail data, keep only the human heroes, and write the output

segmentTable.join(detailTable)
    .mapValues(tuple -> {
        SegmentTableRecord s = tuple._1();
        DetailTableRecord d = tuple._2();
        return new Hero(s.getId(), s.getName(), s.getSegment(),
                d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
    })
    .map(tuple -> tuple._2())
    .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
    .saveAsTextFile("s3://...");

Loading Files from S3 into POJOs (Flink)

DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
    .ignoreInvalidLines()
    .pojoType(SegmentTableRecord.class, "id", "name", "segment");
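pojoType() only works if SegmentTableRecord follows Flink's POJO rules: a public class with a public no-argument constructor and public fields (or getters/setters) for the listed field names. A minimal sketch of such a class; everything beyond the three fields is assumed:

// SegmentTableRecord.java: shaped so Flink can treat it as a POJO
public class SegmentTableRecord {

    public int id;
    public String name;
    public String segment;

    // Required by Flink's POJO type extraction
    public SegmentTableRecord() {}

    public SegmentTableRecord(int id, String name, String segment) {
        this.id = id;
        this.name = name;
        this.segment = segment;
    }
}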

Join and Filter

DataSet<Hero> humanHeros = segmentTable.join(detailTable)
    .where("name")
    .equalTo("name")
    .with((s, d) -> new Hero(s.id, s.name, s.segment,
            d.gender, d.birthYear, d.noOfAppearances))
    .filter(hero -> hero.segment.equals("Human"));

Write back to S3

humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());
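The formatter lambda h -> h.toCsv() assumes a small helper on Hero that is not shown on the slide; one possible, purely illustrative version based on the fields from the data model:

// Illustrative helper on Hero, used by writeAsFormattedText above
public String toCsv() {
    return id + "," + name + "," + segment + ","
            + gender + "," + birthYear + "," + noOfAppearances;
}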

Performance
- TeraSort 1: Flink ca. 66% of Spark's runtime
- TeraSort 2: Flink ca. 68% of Spark's runtime
- HashJoin: Flink ca. 32% of Spark's runtime
- (Iterative processes: Flink ca. 50% of Spark's runtime, ca. 7% with delta iterations)

2nd Round Points
- Generally similar abstraction and feature set
- Flink has a nicer syntax, more sugar
- Spark is pretty bare-metal
- Flink is faster

Round 3: Simple Real-Time Analysis

Total Hitpoints over Last Minute

Configuring the Environment for Event Time (Flink)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

ExecutionConfig config = env.getConfig();
config.setAutoWatermarkInterval(500);

Creating Stream from Kafka

Properties properties = new Properties();
properties.put("bootstrap.servers", KAFKA_BROKERS);
properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
properties.put("group.id", KAFKA_GROUP_ID);

DataStreamSource<FightEvent> hitStream = env.addSource(
    new FlinkKafkaConsumer08<>("FightEventTopic", new FightEventDeserializer(), properties));

Processing Logic

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumAllWindowFunction<FightEvent>() {
        @Override
        public long getSummand(FightEvent fightEvent) {
            return fightEvent.getHitPoints();
        }
    })
    .writeAsCsv("s3://...");

Example Output
3> (1448130670000,1448130730000,290789)
4> (1448130680000,1448130740000,289395)
5> (1448130690000,1448130750000,291768)
6> (1448130700000,1448130760000,292634)
7> (1448130710000,1448130770000,293869)
8> (1448130720000,1448130780000,293356)
1> (1448130730000,1448130790000,293054)
2> (1448130740000,1448130800000,294209)

Create Context and get Avro Stream from Kafka (Spark)

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "xxx:11211");
kafkaParams.put("group.id", "spark");

JavaPairInputDStream<String, FightEvent> kafkaStream = KafkaUtils.createDirectStream(
    jssc, String.class, FightEvent.class, StringDecoder.class, AvroDecoder.class,
    kafkaParams, topicsSet);
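createDirectStream instantiates the decoder classes reflectively, so AvroDecoder has to implement Kafka's Decoder interface and offer a constructor taking VerifiableProperties. The implementation used in the talk is not shown; a possible sketch:

import kafka.serializer.Decoder;
import kafka.utils.VerifiableProperties;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroDecoder implements Decoder<FightEvent> {

    private final SpecificDatumReader<FightEvent> reader =
            new SpecificDatumReader<>(FightEvent.class);

    // Spark constructs the decoder reflectively with the Kafka properties
    public AvroDecoder(VerifiableProperties props) {
    }

    @Override
    public FightEvent fromBytes(byte[] bytes) {
        try {
            return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        } catch (java.io.IOException e) {
            throw new RuntimeException("Could not decode FightEvent", e);
        }
    }
}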

Analyze number of hit points over a sliding window

kafkaStream.map(tuple -> tuple._2().getHitPoints())
    .reduceByWindow((hit1, hit2) -> hit1 + hit2,
            Durations.seconds(60), Durations.seconds(10))
    .foreachRDD((rdd, time) -> {
        rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
        LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
        return null;
    });

Output
20:19:32 Hitpoints in the last minute [80802]
20:19:42 Hitpoints in the last minute [101019]
20:19:52 Hitpoints in the last minute [141012]
20:20:02 Hitpoints in the last minute [184759]
20:20:12 Hitpoints in the last minute [215802]

3rd Round Points
- Flink supports event-time windows
- Kafka and Avro worked seamlessly in both
- Spark uses micro-batches, no real stream
- Both have at-least-once delivery guarantees
- Exactly-once depends a lot on the sink/source

Round 4: Connecting Static Data with Real-Time Data

Total Hitpoints over Last Minute Per Gender

Read static data using objectFile and map genders (Spark)

JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
    int genderIndicator = user.getGender();
    String gender;
    switch (genderIndicator) {
        case 1: gender = "MALE"; break;
        case 2: gender = "FEMALE"; break;
        default: gender = "OTHER"; break;
    }
    return new Tuple2<>(user.getId(), gender);
});

Analyze number of hit points per hitter over a sliding window

JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
    .mapToPair(tuple -> {
        FightEvent fight = tuple._2();
        return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
    })
    .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
            Durations.seconds(60), Durations.seconds(10));

Join with static data to find gender for each hitter

hitpointWindowedStream.foreachRDD((rdd, time) -> {
    JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
        .mapToPair(joinedTuple -> {
            Optional<String> maybeGender = joinedTuple._2()._2();
            Long hitpoints = joinedTuple._2()._1();
            return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
        })
        .reduceByKey((hit1, hit2) -> hit1 + hit2);
    hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
    LOGGER.info("Hitpoints per gender {}", hpg.take(5));
    return null;
});

Output
20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]

Loading Static Data in Every Map Task (Flink)

public FightEventEnricher(String bucket, String keyPrefix) {
    this.bucket = bucket;
    this.keyPrefix = keyPrefix;
}

@Override
public void open(Configuration parameters) {
    populateHeroMapFromS3(bucket, keyPrefix);
}

@Override
public EnrichedFightEvent map(FightEvent event) throws Exception {
    return new EnrichedFightEvent(event,
            idToHero.get(event.getHitterId()),
            idToHero.get(event.getHitteeId()));
}

private void populateHeroMapFromS3(String bucket, String keyPrefix) {
    // Omitted
}

Processing Logic

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .map(new FightEventEnricher("s3_bucket", "output/heros"))
    .filter(value -> value.getHittingHero() != null)
    .keyBy(enrichedFightEvent -> enrichedFightEvent.getHittingHero().getGender())
    .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
        @Override
        public long getSummand(EnrichedFightEvent value) {
            return value.getFightEvent().getHitPoints();
        }
    })

Example Output
2> (1448191350000,1448191410000,1,28478)
3> (1448191350000,1448191410000,2,264650)
2> (1448191360000,1448191420000,1,28290)
3> (1448191360000,1448191420000,2,263521)
2> (1448191370000,1448191430000,1,29327)
3> (1448191370000,1448191430000,2,265526)

4th Round Points
- Spark makes combining batch and streaming easier
- Windowing by key works well in both
- Spark's Java API can be annoying

Round 5: More Advanced Real-Time Analysis

Best Hitter over Last Minute Per Gender

Processing Logic (Flink)

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .map(new FightEventEnricher("s3_bucket", "output/heros"))
    .filter(value -> value.getHittingHero() != null)
    .keyBy(fightEvent -> fightEvent.getHittingHero().getName())
    .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumWindowFunction<EnrichedFightEvent, String>() {
        @Override
        public long getSummand(EnrichedFightEvent value) {
            return value.getFightEvent().getHitPoints();
        }
    })
    .assignTimestamps(new AscendingTimestampExtractor<...>() {
        @Override
        public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
            return tuple.f0;
        }
    })
    .timeWindowAll(Time.of(10, TimeUnit.SECONDS))
    .maxBy(3)
    .print();

Example Output
1> (1448200070000,1448200130000,Tengu,546)
2> (1448200080000,1448200140000,Louis XIV,621)
3> (1448200090000,1448200150000,Louis XIV,561)
4> (1448200100000,1448200160000,Louis XIV,552)
5> (1448200110000,1448200170000,Phil Dexter,620)
6> (1448200120000,1448200180000,Phil Dexter,552)
7> (1448200130000,1448200190000,Kalamity,648)
8> (1448200140000,1448200200000,Jakita Wagner,656)
1> (1448200150000,1448200210000,Jakita Wagner,703)

Read static data using objectFile and map names (Spark)

JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> userNameLookup = staticRdd
    .mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));

Analyze number of hit points per hitter over a sliding window

JavaPairDStream<String, Long> hitters = kafkaStream
    .mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(),
            kafkaTuple._2().getHitPoints()))
    .reduceByKeyAndWindow((accum, current) -> accum + current,
            (accum, remove) -> accum - remove,
            Durations.seconds(60), Durations.seconds(10));
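This variant of reduceByKeyAndWindow uses an inverse function ((accum, remove) -> accum - remove) so the window can be updated incrementally; Spark Streaming only allows it when checkpointing is enabled. A one-line sketch; the directory is illustrative:

// Required for window operations with an inverse reduce function
jssc.checkpoint("s3://.../checkpoints");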

Join with static data to find username for each hitter

hitters.foreachRDD((rdd, time) -> {
    JavaRDD<Tuple2<String, Long>> namedHitters = rdd
        .leftOuterJoin(userNameLookup)
        .map(joinedTuple -> {
            String username = joinedTuple._2()._2().or("No name");
            Long hitpoints = joinedTuple._2()._1();
            return new Tuple2<>(username, hitpoints);
        })
        .sortBy(Tuple2::_2, false, PARTITIONS);
    namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
    LOGGER.info("Five highest hitters (total: {}){}",
            namedHitters.count(), namedHitters.take(5));
    return null;
});

Output

15/11/25 20:34:23 Five highest hitters (total: 200) [(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
15/11/25 20:34:33 Five highest hitters (total: 378) [(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
15/11/25 20:34:43 Five highest hitters (total: 378) [(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
15/11/25 20:34:53 Five highest hitters (total: 558) [(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,

Performance: Yahoo Streaming Benchmark

5th Round Points
- Spark makes some things easier
- But Flink is real streaming
- In Spark you often have to specify partitions

The Judges' Call

Development
- Compared to Hadoop, both are awesome
- Both provide a unified programming model for diverse scenarios
- Comfort level of abstraction varies with the use case
- Spark's Java API is cumbersome compared to the Scala API
- Working with both is fun
- Docs are OK, but spotty

Testing
- Testing distributed systems will always be hard
- Functionally, both can be tested nicely (see the sketch below)
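As an example of the second point, the CSV-parsing step from Round 2 can be exercised with plain JUnit by running Spark against a local master; the class name and test data are made up:

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class SegmentParsingTest {

    @Test
    public void keepsOnlyWellFormedLines() {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Same map/filter chain as in the job, applied to in-memory test data
            long valid = sc.parallelize(Arrays.asList("1,Alice,Human", "broken line"))
                    .map(line -> line.split(","))
                    .filter(parts -> parts.length == 3)
                    .count();
            assertEquals(1, valid);
        }
    }
}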

Monitoring

Community

The Judges' Call
It depends...

Use Spark, if
- You have Cloudera, Hortonworks, etc. support and depend on it
- You want to heavily use the graph and ML libraries
- You want to use the more mature project

Use Flink, if
- Real-time processing is important for your use case
- You want more complex window operations
- You develop in Java only
- You want to support a German project

Benchmark References
[1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
[2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
[3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
[4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Thank You! Questions?

michael.pisula@tng.tech konstantin.knauf@tng.tech