Scalable Data Ingestion for Stream Processing and Beyond

Gabriel Antoniu

Joint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez

9th JLESC workshop, UTK, Knoxville, April 15, 2019

Velocity

Data in motion

Fluid, Dynamic

Volume

Data at rest

Stationary, Static

From Big Data to Fast Data

             Batch           Streaming
Correctness  Exact results   Approximate results
Latency      High-latency    Low-latency
Cost         Stateless       Stateful

State of the art until recently: Lambda Architectures

Historical events

Real-time events

Exact historical

Approximate real-time

Periodic queries

Continuous queries

Batch processing

Stream processing

Results & Actions
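The Lambda split above can be illustrated with a toy serving step that merges an exact batch view with an approximate real-time view. This is only a sketch under assumptions: the function names, the event layout and the sampling-based approximation are hypothetical and are not taken from the talk.

```python
from collections import Counter

def batch_view(historical_events):
    """Exact per-key counts, recomputed periodically over all historical events."""
    return Counter(e["key"] for e in historical_events)

def speed_view(recent_events, sample_every=2):
    """Approximate per-key counts: count every n-th recent event and scale up."""
    sampled = recent_events[::sample_every]
    counts = Counter(e["key"] for e in sampled)
    return Counter({k: v * sample_every for k, v in counts.items()})

def serve(historical_events, recent_events):
    """Answer a query by merging the exact historical and approximate real-time views."""
    return batch_view(historical_events) + speed_view(recent_events)

historical = [{"key": "a"}, {"key": "a"}, {"key": "b"}]
recent = [{"key": "a"}, {"key": "b"}, {"key": "b"}, {"key": "b"}]
merged = serve(historical, recent)
```

The two code paths (batch and speed) compute the same quantity in different ways, which is the duplication that unified batch and stream processing tries to avoid.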

The streaming pipeline: latency happens

Unified batch and stream processing

Ingest delay (write latency)

Throughput (read latency)

Network delay or unavailable Backlog

Poor storage design

Starved resources

Hardware failure

DATA TRANSFER

Edge Cloud

What is ingestion?

•Collect data from various sources → producers

•Deliver them for processing / storage → consumers

•Optionally: buffer, log, pre-process

Ingestion determines the processing performance
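The three roles listed above (producers, an optional buffer, consumers) can be sketched with a bounded in-memory queue. This is a minimal illustration, not how any particular ingestion system is built; the names `producer`, `consumer` and `ingest` are hypothetical, and records are assumed never to be `None`, which is used as an end-of-source marker.

```python
import queue
import threading

def producer(source, buffer):
    """Collect records from one source and push them into the shared buffer."""
    for record in source:
        buffer.put(record)      # blocks while the buffer is full (backpressure)
    buffer.put(None)            # sentinel: this source is exhausted

def consumer(buffer, n_producers, sink):
    """Deliver buffered records to the sink until every producer has finished."""
    finished = 0
    while finished < n_producers:
        record = buffer.get()
        if record is None:
            finished += 1
        else:
            sink.append(record)

def ingest(sources, capacity=4):
    """Run one producer thread per source and consume in the calling thread."""
    buffer = queue.Queue(maxsize=capacity)
    sink = []
    threads = [threading.Thread(target=producer, args=(s, buffer)) for s in sources]
    for t in threads:
        t.start()
    consumer(buffer, len(sources), sink)
    for t in threads:
        t.join()
    return sink

delivered = ingest([range(3), range(10, 13)])
```

Because the buffer is bounded, a slow consumer slows the producers down, which is one simple way to see why the ingestion layer bounds the performance of everything downstream.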

State of the art: Apache Kafka

Limitations

• Scalability

• Data duplication

400 nodes, peak 3.2M events/s

50 nodes, average 200K events/s

The KerA approach to ingestion

• Scalability → Dynamic partitioning
  • Enables seamless elasticity

• Data duplication → Unified ingestion and storage
  • Support for both
    • Streams (unbounded data)
    • Objects (bounded data)

Zoom: scalability

Each partition is statically associated with one consumer: limited scalability

Producers Brokers Consumers

Partitions
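The scalability limit of static partitioning can be seen in a small model: if each partition is bound to exactly one consumer, adding consumers beyond the number of partitions gives no extra parallelism. The helper below is a hypothetical illustration and does not use any real Kafka API.

```python
def assign_static(num_partitions, consumers):
    """Bind each partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

consumers = [f"c{i}" for i in range(6)]
assignment = assign_static(num_partitions=4, consumers=consumers)

# Consumers that received no partition have nothing to read.
idle = [c for c, parts in assignment.items() if not parts]
```

With 4 partitions and 6 consumers, two consumers stay idle; the only way to use them is to repartition the stream.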

KerA: dynamic partitioning

• Streamlets: logical stream containers; #streamlets > #brokers

• Groups: created and processed dynamically; maximum #active groups per broker

[Diagram labels: Streamlets, Groups]
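A toy model can make the idea of dynamically created groups more concrete. It is an assumption-laden sketch, not KerA's implementation: the class name, the fixed group capacity, and the rule mapping producers to active slots are all hypothetical. It only shows that the number of groups open for writing stays capped while the number of closed groups, each readable independently, grows with the data.

```python
class Streamlet:
    """Toy streamlet: groups are created on demand, with a cap on active ones."""

    def __init__(self, max_active_groups=2, group_capacity=3):
        self.max_active_groups = max_active_groups
        self.group_capacity = group_capacity
        self.active = {}   # slot -> group currently open for appends
        self.closed = []   # full groups, each consumable independently

    def append(self, producer_id, record):
        # Hypothetical rule: producers are spread over a fixed number of slots.
        slot = producer_id % self.max_active_groups
        group = self.active.setdefault(slot, [])   # create a group dynamically
        group.append(record)
        if len(group) == self.group_capacity:
            self.closed.append(group)              # close the full group
            del self.active[slot]                  # next append opens a new one

streamlet = Streamlet()
for i in range(10):
    streamlet.append(producer_id=i % 4, record=i)
```

In this model the unit handed to consumers is a group rather than a whole partition, so read parallelism is not fixed when the stream is created.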

Increased network and storage overheads

Zoom: data duplication

KerA: unified ingestion and storage

INGESTION: Brokers

Acquire

Push/Pull data access

Streams

Objects

Common data model for streams and objects

STORAGE: Backups

Move less data, process them faster
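One way to picture a common data model for streams and objects is a single append-only store in which an object is simply a sequence that has been sealed. This is an illustrative assumption, not a description of KerA's data model; `UnifiedStore` and its methods are hypothetical names.

```python
class UnifiedStore:
    """Toy store: streams and objects share the same append-only record logs."""

    def __init__(self):
        self.logs = {}       # name -> list of records
        self.sealed = set()  # names of bounded (object) logs

    def append(self, name, record):
        """Append a record and return its offset; sealed objects reject appends."""
        if name in self.sealed:
            raise ValueError(f"{name} is a sealed (bounded) object")
        log = self.logs.setdefault(name, [])
        log.append(record)
        return len(log) - 1

    def seal(self, name):
        """Mark a log as bounded: from now on it behaves like an object."""
        self.sealed.add(name)

    def read(self, name, offset=0):
        """Pull records starting at an offset; works for streams and objects alike."""
        return self.logs.get(name, [])[offset:]

store = UnifiedStore()
store.append("clicks", "a")      # unbounded stream
store.append("clicks", "b")
store.append("report", "x")      # becomes a bounded object once sealed
store.seal("report")
```

Since both kinds of data live in the same logs, a record is written once and read through the same interface, instead of being copied from an ingestion system into a separate storage system.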

Evaluating scalability

[Figure: throughput of KeraProd, KeraCons, KafkaProd and KafkaCons; y-axis ticks from 1.0*10^5 to 2.1*10^6]

Vertical scalability, x-axis: # Clients (4, 8, 16, 32). 4 brokers, 32 partitions, 128KB request size, 100B records

Horizontal scalability, x-axis: # Brokers (4, 8, 12, 16). 64 clients, 32 partitions, 1MB request size, 100B records

2x better throughput with 75% less resources
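The claim "2x better throughput with 75% less resources" can be turned into a per-resource figure with a short calculation. The sketch assumes that "75% less resources" means one quarter of the resources; the function name is hypothetical.

```python
def per_resource_gain(throughput_ratio, resource_fraction):
    """Throughput per unit of resource, relative to the baseline system."""
    return throughput_ratio / resource_fraction

# 2x the throughput while using 25% of the resources (assumed reading of "75% less").
gain = per_resource_gain(throughput_ratio=2.0, resource_fraction=1 - 0.75)
```

Under that reading, each unit of resource delivers 8 times the baseline throughput.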

Our vision: hybrid analytics architecture

Present data

Stream processing

Past data

Historical model

Real-time model

Computational Model

Batch processing

Future model

Simulation

Proactive control

Continuous update

In situ processing

In transit processing

Hybrid Analytics

Hybrid analytics: processing architecture

DATA from the Real World

DATA from the Hypothetical World …

Simulation (e.g., digital twin)

Computation

In situ pre-processing of simulation data

Sensor

In situ stream pre-processing of sensor data

Hybrid (stream + batch) in transit processing

(data in-motion + data at-rest)

Historical data

Better Decision Learning

Data processing

Hybrid analytics architecture

Postdoc (ANR OverFlow project)

• Investigating Edge vs. Cloud computing trade-offs for stream processing

• Methodology for benchmarking Edge processing frameworks

Ph.D. (to hire)

• Uniform Cloud and Edge stream processing for Fast Data analytics

Pedro Silva

Research Engineer (Inria ADT project)

• Enable support for in situ Big Data analytics

• Elastic allocation of dedicated resources (cores/nodes)

Ovidiu Marcu

KerA++

+ seamless integration with in situ/in transit

+ large state management

Ovidiu Marcu

Startup (ZettaFlow)

• Low and consistent latency (lightweight offset indexing, independent memory management)

• Model applications not partitioning/stream storage

Ph.D. (Inria IPL project)

• HPC – Big Data processing convergence

• Bridge in situ/in transit and stream/batch processing

H2020 project in preparation

Thank You!
