Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

transcript

Real-time Stream Processing on EMR: Apache Flink vs Apache Spark Streaming

Keith Steward, Ph.D.

Specialist (EMR) Solution Architect

What we’ll cover:1. The need for real-time stream processing, and challenges in

accomplishing it

2. Flink stream processor (versus Spark Streaming):• What are its aims?

• How does it address real-time stream processing challenges?

• How does it differ from Spark Streaming?

• Real-world Flink examples

• When to use Flink vs Spark Streaming?

3. Flink Demo: How to deploy & run a Flink stream processing architecture in AWS?

The Need for Real-Time Stream Processing

Increasingly, data is arriving as continuous flows of events:

• cars in motion emitting GPS signals • financial transactions • interchange of signals between cell phone towers

and people busy with their smartphones • web traffic • machine logs • measurements from industrial sensors and wearable

devices

Streaming data is a better fit for the way we live.

Challenges in Processing Streams:

• Event time (rather than data processing time); out of order events

• Consistency, fault tolerance, and high availability

• Rich forms of window queries; real-time alerts

• Low latency and high throughput

Reference architecture

COLLECT STORE CONSUMEPROCESS / ANALYZEETL

Traditional Big Data Pipeline (without Stream Processing)

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

RECORDS

DOCUMENTS

MESSAGES

STREAMS

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

KCLapps

AWS Lambda

Amazon Redshift

Amazon Machine Learning

Presto

AmazonEMR

Amazon EC2

Amazon EMR

Amazon QuickSight

Apps & Services

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

Data centers

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWS CloudTrail

AWS Direct Connect

AWS IoT

Apache Kafka

Amazon DynamoDB

Amazon ElastiCache

RECORDS

STREAMS

Streaming

KCLapps

AWS Lambda

Amazon EC2

Amazon EMR

Apps & Services

Mobile apps

Web apps

Devices

Message

Data centers

Logging

Amazon CloudWatch

AWS CloudTrail

AWS Direct Connect

AWS IoT

Stream ProcessingReference architecture

“Apache Flink is an open source platform for distributed stream and batch data processing.”

Flink is a Stream-First Architecture, that happens to also do batch processing as special case of bounded stream processing.

Apache Flink

Flink Handles Challenging Scenarios:

1. Application code upgrades

2. Flink version upgrades

3. Maintenance and migration

4. What-if simulations (reinstatements)

5. A/B testing

Flink addresses all Streaming Challenges Simultaneously

Flink Performance Tests

• Amazon Kinesis Streams

• Apache Kafka

• Elasticsearch

• Twitter Streaming API

• Cassandra

There are connectors for third-party data sources:

Where is Flink being used in production?

When to use Flink vs Spark Streaming?

Flink might be best when:• workload demands true real-time

stream processing performance with low latency, high throughput, and fault-tolerance.

• not yet heavily invested in Spark Streaming (existing systems, staff training/experience)

• want convenience of replaying and reprocessing streams after code/system changes

Spark Streaming might be best when:• Primarily do batch processing• Already invested in Spark Streaming

(existing deployments, staff)• The micro-batching is acceptable for

your workload• Need to code in Python or R• Want to “wait and see” how Flink

matures before adopting.

How to deploy & run Flink in AWS ?

Flink Demo: Analyzing NYC Taxi Rides in Real Time

Demo Event Processing Architecture

“Replayable Log”

Stream Processing Visualization

Demo Event Processing Architecture

Amazon Kinesis

Amazon EMR

EC2 instance(bastion host)

Amazon Elasticsearch

Service

Real-time streamingHigh throughput; elasticKeeps a ‘replayable log’ of your eventsEasy to useS3, Redshift, DynamoDB Integrations

Amazon Kinesis

Dynamically Scalable transient or persistent Hadoop clusters as a serviceHadoop, Hive, Spark, Presto, Hbase, … (17 applications)

Easy to use; fully managedOn demand, reserved, spot pricingHDFS, S3, and Amazon EBS filesystemsEnd to end security: access controls, firewalls, encryption

Amazon EMR

Provisions & maintains an Elasticsearch cluster (distributed index)Complete ELK stack, including KibanaFully managed service; zero adminHighly available & reliableScalable

{show demo!}

1. For predominantly stream-processing workloads but also for batch processing, Apache Flink has much to offer:• Simultaneously addresses high throughput, low latency,

and fault-tolerance• Do both stream processing & batch processing with a

single technology• Powerful windowing functions• Convenient capabilities to pause, restart, and change

Flink applications without data loss.

2. Flink is still new and adoption is not as far advanced as Spark Streaming.

3. AWS makes it easy to run streaming workloads with Amazon Kinesis and either Spark Streaming or Flink running on EMR clusters.

Questions?

stewardk@amazon.com

Keith Steward, Ph.D.Specialist (EMR) SAAWS

Additional Details

Approximate Event Time – Kinesis details• Each Amazon Kinesis record includes an

ApproximateArrivalTimestamp

• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record

• By default the event time of Flink uses this timestamp when reading from a Kinesis stream

StreamExecutionEnvironment env =StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Event Time and Watermarks• With event time the time of an event is determined by the

producer

• Flink measures progress in event time by means of Watermarks

• Watermarks must be ingested to each individual Kinesis shard

DataStream<Event> kinesis = env.addSource(new FlinkKinesisConsumer<>(...)).assignTimestampsAndWatermarks(new

PunctuatedAssigner())

Data Encryption with Amazon EMR and FlinkSecurity configuration supports encryption

• for data stored within the file system

• Hadoop Distributed File System (HDFS) block-transfer and RPC

• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)

• Local disk (except boot volumes)

• In-transit data (no Flink support yet)

env.readTextFile("s3://...")env.setStateBackend(new FsStateBackend("hdfs://..."))

Connecting to the Flink Dashboard• Use dynamic port forwarding to the Master node

ssh -D 8157 hadoop@...

• Use FoxyProxy to redirect URLs to localhost

*ec2*.amazonaws.com*

*.compute.internal*

• Navigate to the YARN Resource Manager and select the Tracking UI

Starting Flink and Submitting JobsUse steps to interact with Flink through the AWS API

Extending Flink Functionality• Flink Elasticsearch sink merely supports TCP transport

• A custom Elasticsearch sink with HTTP support requires only a few dozens lines of code using• Jest (io.searchbox)

• aws-signing-request-interceptor (vc.inreach.aws)

Amazon SQS apps

Streaming

KCLapps

AWS Lambda

Amazon Redshift

Amazon Machine Learning

Presto

AmazonEMR

Apache Kafka

Amazon SQS

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

AWS IoT

Data centersAWS Direct

Connect

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWS CloudTrail

RECORDS

DOCUMENTS

MESSAGES

STREAMS

Amazon QuickSight

Apps & Services

Amazon EMR

Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Technology