Post on 28-May-2020

Scalable Data Ingestion for Stream Processing and Beyond

Gabriel Antoniu
Joint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez

9th JLESC workshop, UTK, Knoxville, April 15, 2019

Velocity: data in motion (fluid, dynamic)

Volume: data at rest (stationary, static)

From Big Data to Fast Data

              Batch            Streaming
Correctness   Exact results    Approximate results
Latency       High-latency     Low-latency
Cost          Stateless        Stateful

State of the art until recently: Lambda Architectures

• Historical events → Batch processing → Exact historical model (periodic queries)
• Real-time events → Stream processing → Approximate real-time model (continuous queries)
• Results & Actions

What? Why?
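As a rough illustration of the Lambda pattern on this slide, the sketch below keeps a batch view that is recomputed from the full event log and a real-time view that only covers events received since the last batch run; a query merges the two. The names `LambdaCounter`, `ingest`, `run_batch` and `query` are hypothetical and do not come from the talk. In this toy the real-time view happens to be exact, whereas in practice it is often an approximation.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per key (illustrative only)."""

    def __init__(self):
        self.master_log = []          # all events ever received (historical events)
        self.batch_view = Counter()   # exact view, rebuilt by the batch layer
        self.speed_view = Counter()   # incremental view of events since the last batch run

    def ingest(self, key):
        # Every event is appended to the log and applied to the real-time view.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from the whole log,
        # then reset the real-time view it now supersedes.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge the batch result with the real-time delta.
        return self.batch_view[key] + self.speed_view[key]
```

Maintaining two code paths for the same result is the usual cost of this design, which is what motivates unified batch and stream processing.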

The streaming pipeline: latency happens

Unified batch and stream processing

• Ingest delay (write latency)
• Throughput (read latency)
• Network delay or unavailable
• Backlog
• Poor storage design
• Starved resources
• Hardware failure

Data transfer: Edge, Cloud

What is ingestion?

• Collect data from various sources → producers
• Deliver them for processing / storage → consumers
• Optionally: buffer, log, pre-process

Ingestion determines the processing performance
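The three roles listed above (collect, deliver, optionally buffer/log/pre-process) can be sketched as a small in-memory component. This is a minimal sketch under stated assumptions: the class `Ingestor`, its methods and the `capacity` parameter are hypothetical names chosen for illustration, not an API from the talk or from any ingestion system.

```python
from collections import deque


class Ingestor:
    """Toy ingestion layer: collect from producers, buffer, deliver to consumers."""

    def __init__(self, preprocess=None, capacity=1000):
        self.buffer = deque()         # records waiting to be delivered
        self.capacity = capacity      # assumed bound on buffered records
        self.preprocess = preprocess  # optional per-record transformation
        self.log = []                 # optional append-only log of accepted records

    def collect(self, record):
        # Producer side: reject the record when the buffer is full (back-pressure).
        if len(self.buffer) >= self.capacity:
            return False
        if self.preprocess is not None:
            record = self.preprocess(record)
        self.log.append(record)
        self.buffer.append(record)
        return True

    def deliver(self, max_records):
        # Consumer side: hand over up to max_records in arrival order.
        batch = []
        while self.buffer and len(batch) < max_records:
            batch.append(self.buffer.popleft())
        return batch
```

If consumers call `deliver` more slowly than producers call `collect`, the buffer fills and writes start to fail, which is one way the ingestion layer ends up bounding end-to-end performance.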

State of the art: Apache Kafka

Limitations
• Scalability
• Data duplication

400 nodes, peak 3.2M events/s
50 nodes, average 200K events/s

The KerA approach to ingestion

• Scalability → Dynamic partitioning
  • Enables seamless elasticity
• Data duplication → Unified ingestion and storage
  • Support for both
    • Streams (unbounded data)
    • Objects (bounded data)

Zoom: scalability

Each partition is statically associated with one consumer: limited scalability

Producers → Brokers (partitions) → Consumers
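To make the limitation concrete, the sketch below models static partitioning in the simplest possible way: keys are hashed to a fixed number of partitions and each partition is owned by exactly one consumer. The helper names `partition_for` and `assign_partitions` are hypothetical, and the round-robin assignment is an assumption made for illustration, not the assignment strategy of any particular system.

```python
import zlib


def partition_for(key, num_partitions):
    # Stable hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    # Static assignment: each partition belongs to exactly one consumer.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


if __name__ == "__main__":
    # With 2 partitions and 4 consumers, two consumers get nothing to read.
    print(assign_partitions(2, ["c0", "c1", "c2", "c3"]))
```

Under this model, consumer parallelism is capped by the partition count fixed when the stream was created, so adding consumers beyond that number does not increase throughput.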

KerA: dynamic partitioning

• Streamlets: logical stream containers; #streamlets > #brokers
• Groups: created and processed dynamically; maximum #active groups per broker

[Diagram: brokers hosting streamlets, each streamlet divided into groups]
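The streamlet/group idea can be sketched as follows. This is only an illustration of groups being created on demand under a cap on active groups; it is not KerA's implementation. The class `Streamlet`, the parameters `max_active_groups` and `group_capacity`, and the round-robin placement are all assumptions, and the cap is applied per streamlet here for simplicity, whereas the slide states the limit per broker.

```python
class Streamlet:
    """Hypothetical streamlet: a logical container whose groups are created on demand."""

    def __init__(self, max_active_groups=2, group_capacity=3):
        self.max_active_groups = max_active_groups
        self.group_capacity = group_capacity
        self.active = []   # groups still accepting appends
        self.closed = []   # full groups, ready to be consumed independently
        self._next = 0     # round-robin cursor over active groups

    def append(self, record):
        # Create a new group lazily while below the cap on active groups.
        if len(self.active) < self.max_active_groups:
            self.active.append([])
        idx = self._next % len(self.active)
        group = self.active[idx]
        group.append(record)
        if len(group) == self.group_capacity:
            # A full group is closed; a fresh one is created on a later append.
            self.closed.append(self.active.pop(idx))
        else:
            self._next = idx + 1
```

Because groups appear and close as data arrives, the unit that a consumer works on is no longer fixed at stream creation time, which is the property the slide contrasts with static partitions.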

Zoom: data duplication

Increased network and storage overheads

KerA: unified ingestion and storage

• Ingestion: brokers
• Storage: backups
• Acquire; push/pull data access
• Streams and objects: common data model for streams and objects

Move less data, process them faster
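One way to picture a common data model for streams and objects is a single append-only container that differs only in whether it can be completed. The sketch below is an assumption-laden illustration, not KerA's data model: the class `Log`, the `bounded` flag and the `seal` method are hypothetical names.

```python
class Log:
    """Hypothetical common container: an append-only sequence of records."""

    def __init__(self, bounded=False):
        self.records = []
        self.bounded = bounded   # True for objects, False for streams
        self.sealed = False

    def append(self, record):
        if self.sealed:
            raise ValueError("cannot append to a sealed object")
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def seal(self):
        # Only bounded data (objects) can be completed; streams stay open.
        if not self.bounded:
            raise ValueError("streams are unbounded and cannot be sealed")
        self.sealed = True

    def read(self, offset, max_records):
        # Same pull-style read path for streams and objects.
        return self.records[offset:offset + max_records]
```

With one container type, data acquired by the ingestion layer does not have to be copied into a separate storage system before it can be read back as an object.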

Evaluating scalability

[Figure: vertical scalability. Average aggregated client throughput (records/s) vs. number of clients (4, 8, 16, 32) for KerA and Kafka producers and consumers. Setup: 4 brokers, 32 partitions, 128KB request size, 100B records.]

[Figure: horizontal scalability. Average aggregated client throughput (records/s) vs. number of broker nodes (4, 8, 12, 16) for KerA and Kafka producers and consumers. Setup: 64 clients, 32 partitions, 1MB request size, 100B records.]

Both throughput axes range from 1*10^5 to 2.1*10^6 records/s. Plot annotations: 8x, 10x.

2x better throughput with 75% fewer resources

Our vision: hybrid analytics architecture

• Past data: batch processing, historical model
• Present data: stream processing, real-time model
• Computational model: simulation, future model
• Proactive control, continuous update
• In situ processing, in transit processing
• Hybrid analytics

Hybrid analytics: processing architecture

• Data from the Real World: sensor, in situ stream pre-processing of sensor data
• Data from the Hypothetical World …: simulation (e.g., digital twin), computation, in situ pre-processing of simulation data
• Hybrid (stream + batch) in transit processing (data in motion + data at rest)
• Historical data
• Data processing
• Better decision, learning

Hybrid analytics architecture

Postdoc (ANR OverFlow project)
• Investigating Edge vs. Cloud computing trade-offs for stream processing
• Methodology for benchmarking Edge processing frameworks

Ph.D. (to hire)
• Uniform Cloud and Edge stream processing for Fast Data analytics

Pedro Silva

Hybrid analytics architecture

Research Engineer (Inria ADT project)
• Enable support for in situ Big Data analytics
• Elastic allocation of dedicated resources (cores/nodes)

Ovidiu Marcu

Hybrid analytics architecture

KerA++
+ seamless integration with in situ/in transit
+ large state management

Ovidiu Marcu

Startup (ZettaFlow)
• Low and consistent latency (lightweight offset indexing, independent memory management)
• Model applications, not partitioning/stream storage

Ph.D. (Inria IPL project)
• HPC – Big Data processing convergence
• Bridge in situ/in transit and stream/batch processing

H2020 project in preparation

Thank You!