+ All Categories
Home > Technology > Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Date post: 20-Mar-2017
Category:
Upload: amazon-web-services
View: 808 times
Download: 2 times
Share this document with a friend
34
Real-time Stream Processing on EMR: Apache Flink vs Apache Spark Streaming Keith Steward, Ph.D. Specialist (EMR) Solution Architect AWS
Transcript
Page 1: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Real-time Stream Processing on EMR: Apache Flink vs Apache Spark Streaming

Keith Steward, Ph.D.

Specialist (EMR) Solution Architect

AWS

Page 2: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

What we’ll cover:1. The need for real-time stream processing, and challenges in

accomplishing it

2. Flink stream processor (versus Spark Streaming):• What are its aims?

• How does it address real-time stream processing challenges?

• How does it differ from Spark Streaming?

• Real-world Flink examples

• When to use Flink vs Spark Streaming?

3. Flink Demo: How to deploy & run a Flink stream processing architecture in AWS?

Page 3: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

The Need for Real-Time Stream Processing

Increasingly, data is arriving as continuous flows of events:

• cars in motion emitting GPS signals • financial transactions • interchange of signals between cell phone towers

and people busy with their smartphones • web traffic • machine logs • measurements from industrial sensors and wearable

devices

Streaming data is a better fit for the way we live.

Page 4: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Challenges in Processing Streams:

• Event time (rather than data processing time); out of order events

• Consistency, fault tolerance, and high availability

• Rich forms of window queries; real-time alerts

• Low latency and high throughput

Page 5: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Hot

Reference architecture

COLLECT STORE CONSUMEPROCESS / ANALYZEETL

Traditional Big Data Pipeline (without Stream Processing)

Reference architecture

Page 6: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

Hot

Hot

War

m

File

Mes

sage

Stre

amSe

arch

S

QL

N

oSQ

L

Cach

e

RECORDS

DOCUMENTS

FILES

MESSAGES

STREAMS

Reference architecture

ETL

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

KCLapps

AWS Lambda

Amazon Redshift

Amazon Machine Learning

Presto

AmazonEMR

Fast

Slow

Fast

Batc

hM

essa

geIn

tera

ctiv

eSt

ream

ML

Amazon EC2

Amazon EC2

Amazon EMR

Amazon QuickSight

Apps & Services

Anal

ysis

& v

isua

lizat

ion

Not

eboo

ks

IDE

API

Appl

icat

ions

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

Data centers

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWS CloudTrail

Logg

ing

IoT

Tran

spor

tM

essa

ging

AWS Direct Connect

AWS IoT

Reference architecture

Page 7: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Elasticsearch Service

Apache Kafka

Amazon KinesisStreams

Amazon DynamoDB

Amazon ElastiCache

Amazon DynamoDB Streams

Hot

Hot

War

m

Stre

amSe

arch

NoS

QL

Cac

he

RECORDS

STREAMS

Reference architecture

ETL

Streaming

Amazon Kinesis Analytics

KCLapps

AWS Lambda

Fast

Batc

hIn

tera

ctiv

eSt

ream

ML

Amazon EC2

Amazon EMR

Apps & Services

Anal

ysis

& v

isua

lizat

ion

Not

eboo

ks

IDE

API

Appl

icat

ions

Mobile apps

Web apps

Devices

Message

Sensors & IoT platforms

Data centers

Logging

Amazon CloudWatch

AWS CloudTrail

Logg

ing

IoT

Tran

spor

t

AWS Direct Connect

AWS IoT

Stream ProcessingReference architecture

Fast

Page 8: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks
Page 9: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

“Apache Flink is an open source platform for distributed stream and batch data processing.”

Flink is a Stream-First Architecture, that happens to also do batch processing as special case of bounded stream processing.

Page 10: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Apache Flink

Page 11: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Flink Handles Challenging Scenarios:

1. Application code upgrades

2. Flink version upgrades

3. Maintenance and migration

4. What-if simulations (reinstatements)

5. A/B testing

Page 12: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Flink addresses all Streaming Challenges Simultaneously

Page 13: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Flink Performance Tests

Page 14: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

• Amazon Kinesis Streams

• Apache Kafka

• Elasticsearch

• Twitter Streaming API

• Cassandra

There are connectors for third-party data sources:

Page 15: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Where is Flink being used in production?

Page 16: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

When to use Flink vs Spark Streaming?

Flink might be best when:• workload demands true real-time

stream processing performance with low latency, high throughput, and fault-tolerance.

• not yet heavily invested in Spark Streaming (existing systems, staff training/experience)

• want convenience of replaying and reprocessing streams after code/system changes

Spark Streaming might be best when:• Primarily do batch processing• Already invested in Spark Streaming

(existing deployments, staff)• The micro-batching is acceptable for

your workload• Need to code in Python or R• Want to “wait and see” how Flink

matures before adopting.

Page 17: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

How to deploy & run Flink in AWS ?

Page 18: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Flink Demo: Analyzing NYC Taxi Rides in Real Time

Page 19: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Demo Event Processing Architecture

“Replayable Log”

Stream Processing Visualization

Page 20: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Demo Event Processing Architecture

Amazon Kinesis

Amazon EMR

+

EC2 instance(bastion host)

Amazon Elasticsearch

Service

Page 21: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Real-time streamingHigh throughput; elasticKeeps a ‘replayable log’ of your eventsEasy to useS3, Redshift, DynamoDB Integrations

Amazon Kinesis

Page 22: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Dynamically Scalable transient or persistent Hadoop clusters as a serviceHadoop, Hive, Spark, Presto, Hbase, … (17 applications)

Easy to use; fully managedOn demand, reserved, spot pricingHDFS, S3, and Amazon EBS filesystemsEnd to end security: access controls, firewalls, encryption

Amazon EMR

Page 23: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Provisions & maintains an Elasticsearch cluster (distributed index)Complete ELK stack, including KibanaFully managed service; zero adminHighly available & reliableScalable

Page 24: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

{show demo!}

Page 25: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

1. For predominantly stream-processing workloads but also for batch processing, Apache Flink has much to offer:• Simultaneously addresses high throughput, low latency,

and fault-tolerance• Do both stream processing & batch processing with a

single technology• Powerful windowing functions• Convenient capabilities to pause, restart, and change

Flink applications without data loss.

2. Flink is still new and adoption is not as far advanced as Spark Streaming.

3. AWS makes it easy to run streaming workloads with Amazon Kinesis and either Spark Streaming or Flink running on EMR clusters.

Page 26: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Questions?

[email protected]

Keith Steward, Ph.D.Specialist (EMR) SAAWS

Page 27: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Additional Details

Page 28: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Approximate Event Time – Kinesis details• Each Amazon Kinesis record includes an

ApproximateArrivalTimestamp

• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record

• By default the event time of Flink uses this timestamp when reading from a Kinesis stream

StreamExecutionEnvironment env =StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Page 29: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Event Time and Watermarks• With event time the time of an event is determined by the

producer

• Flink measures progress in event time by means of Watermarks

• Watermarks must be ingested to each individual Kinesis shard

DataStream<Event> kinesis = env.addSource(new FlinkKinesisConsumer<>(...)).assignTimestampsAndWatermarks(new

PunctuatedAssigner())

Page 30: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Data Encryption with Amazon EMR and FlinkSecurity configuration supports encryption

• for data stored within the file system

• Hadoop Distributed File System (HDFS) block-transfer and RPC

• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)

• Local disk (except boot volumes)

• In-transit data (no Flink support yet)

env.readTextFile("s3://...")env.setStateBackend(new FsStateBackend("hdfs://..."))

Page 31: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Connecting to the Flink Dashboard• Use dynamic port forwarding to the Master node

ssh -D 8157 hadoop@...

• Use FoxyProxy to redirect URLs to localhost

*ec2*.amazonaws.com*

*.compute.internal*

• Navigate to the YARN Resource Manager and select the Tracking UI

Page 32: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Starting Flink and Submitting JobsUse steps to interact with Flink through the AWS API

Page 33: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Extending Flink Functionality• Flink Elasticsearch sink merely supports TCP transport

• A custom Elasticsearch sink with HTTP support requires only a few dozens lines of code using• Jest (io.searchbox)

• aws-signing-request-interceptor (vc.inreach.aws)

Page 34: Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

KCLapps

AWS Lambda

Amazon Redshift

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Machine Learning

Presto

AmazonEMR

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

Hot

Hot

War

m

Fast

Slow

Fast

Batc

hM

essa

geIn

tera

ctiv

eSt

ream

ML

Sear

ch

SQ

L

NoS

QL

Ca

che

File

Mes

sage

Stre

am

Amazon EC2

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

AWS IoT

Data centersAWS Direct

Connect

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWS CloudTrail

RECORDS

DOCUMENTS

FILES

MESSAGES

STREAMS

Amazon QuickSight

Apps & Services

Anal

ysis

& v

isua

lizat

ion

Not

eboo

ks

IDE

API

Reference architecture

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

ETL

Amazon EMR


Recommended