Apache Flink Berlin Meetup May 2016

Stephan Ewen @stephanewen

What's coming up in Apache Flink?

Quick teaser of some of the upcoming features

Disclaimer

This list of threads is incomplete

This is not an Apache Flink roadmap!

What's coming up?

Integration Operations

Stream SQL

Queryable State

Cassandra

Deployment and Management (YARN, Mesos, Docker, …)

Dynamically Scaling Streaming Programs

Metrics

File System Sources

Side Inputs: Joining streams and static data

BigTop Integration

Kinesis

State Scalability

Stream SQL

Two definitions of Stream SQL

1. Run a continuous SQL query that reads an infinite stream and continuously produces results

2. Continuously ingest streams into a warehouse. Query the real time data in the warehouse.

Definition 1: That's Flink's Stream SQL

Definition 2: Good use case for Kafka + Flink + Druid

An Example

val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// define a JSON encoded Kafka topic as external table
val sensorSource = new KafkaJsonSource[(String, Long, Double)](
  "sensorTopic", kafkaProps, ("location", "time", "tempF"))

// register external table
tableEnv.registerTableSource("sensorData", sensorSource)

// define query on external table
val roomSensors: Table = tableEnv.sql("""
  SELECT STREAM time, location AS room, (tempF - 32) * 0.556 AS tempC
  FROM sensorData
  WHERE location LIKE 'room%'
  """)

// write the table back to Kafka as JSON
roomSensors.toSink(new KafkaJsonSink(...))
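To make the query's per-record logic concrete, here is a minimal plain-Scala sketch of what the SELECT and WHERE clauses compute, with no Flink involved. The names SensorReading, RoomReading and RoomSensorsSketch are hypothetical and chosen only for illustration. The sketch runs on a finite collection, whereas the streaming query applies the same logic continuously.

```scala
// Plain-Scala sketch of the query's per-record logic (no Flink involved).
// SensorReading and RoomReading are hypothetical names for illustration.
final case class SensorReading(location: String, time: Long, tempF: Double)
final case class RoomReading(time: Long, room: String, tempC: Double)

object RoomSensorsSketch {
  // WHERE location LIKE 'room%'  ->  prefix match on the location field
  def isRoom(r: SensorReading): Boolean = r.location.startsWith("room")

  // (tempF - 32) * 0.556 AS tempC  ->  0.556 approximates 5/9
  def toRoomReading(r: SensorReading): RoomReading =
    RoomReading(r.time, r.location, (r.tempF - 32) * 0.556)

  // Applied to a finite batch here; a stream would apply it record by record.
  def query(readings: Seq[SensorReading]): Seq[RoomReading] =
    readings.filter(isRoom).map(toRoomReading)
}
```

Because both the filter and the projection are stateless and work on one record at a time, the same query text can run over an unbounded stream without buffering.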

The Implementation

[Figure: implementation architecture, Flink 1.0 vs. Flink 1.1+]

Queryable State

Sharing State with Applications

Access to the stream aggregates with a latency bound

Write them to a key/value store

Writing to the key/value store is often the biggest bottleneck

Queryable State

Writing to the key/value store: optional, and only at the end of windows

Send queries to Flink's internal state

What does it bring?
• Fewer moving parts in the infrastructure
• Performance!

From an extension of Yahoo!'s streaming benchmark:
• With key/value store: 280,000 events/s
• Queryable state: 15,000,000 events/s

What's the secret?
• No synchronous distributed communication
• Persistence via Flink's checkpoint (async snapshots)

Dynamic Scaling

Adjust parallelism of Streaming Programs

Initial configuration

Scale Out (for load)

Scale In (save resources)

Adjusting parallelism without (significantly) interrupting the program

Initial version:
• Savepoint -> stop -> restart-with-different-parallelism

Stateless operators: Trivial

Stateful operators: Repartition state

• State reorganized by key for key/value state and windows

Consistent Hashing

Redistribution via Key Groups

Flink 1.0: Hash keys into parallel partitions. Finest granularity is a partition.

Flink 1.1: Hash keys into KeyGroups. Assign KeyGroups to parallel partitions.

Change of parallelism means change of assignment of KeyGroups to parallel partitions
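The two-step scheme can be sketched in plain Scala as follows. This is an illustrative sketch under stated assumptions, not Flink's implementation: the object name KeyGroupSketch, the fixed count of 128 key groups, the use of hashCode, and the contiguous-range assignment formula are all assumptions made for the example.

```scala
object KeyGroupSketch {
  // Number of key groups: fixed for the lifetime of the job (assumed value).
  val numKeyGroups = 128

  // Step 1: hash a key into a key group. This does not depend on the
  // parallelism, so it never changes when the job is rescaled.
  def keyGroupOf(key: Any): Int =
    Math.floorMod(key.hashCode, numKeyGroups)

  // Step 2: assign key groups to parallel partitions in contiguous ranges
  // (one possible assignment scheme, chosen here for illustration).
  def partitionOf(keyGroup: Int, parallelism: Int): Int =
    keyGroup * parallelism / numKeyGroups

  // Routing a key composes the two steps.
  def partitionOfKey(key: Any, parallelism: Int): Int =
    partitionOf(keyGroupOf(key), parallelism)
}
```

When the parallelism changes, only partitionOf yields different results. Every key stays in its key group, so state can be moved between partitions one whole key group at a time instead of being re-hashed key by key.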

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016

Early bird deadline: July 15, 2016

www.flink-forward.org

We are hiring! data-artisans.com/careers
