
CS 744: SPARK STREAMING

Date post: 14-Nov-2021
Page 1: CS 744: SPARK STREAMING

CS 744: SPARK STREAMING

Shivaram Venkataraman
Fall 2020

Good morning

Page 2: CS 744: SPARK STREAMING

ADMINISTRIVIA

- Midterm grades this week → ASAP

- Course Projects feedback → HotCRP

Hopefully you are working on this!

(1) Assign grades for project proposals

(2) Mid-semester update → Nov 20

Page 3: CS 744: SPARK STREAMING

CONTINUOUS OPERATOR MODEL

- Long-lived operators

- Mutable State

- Distributed Checkpoints for Fault Recovery

- Stragglers?

[Figure: Naiad. Labels: Driver, Task, Control Message, Network Transfer]

Notes (partly illegible):

- Example: running average / count per window

- On failure, roll back all operators to the checkpoint

- Avoid stragglers

Page 4: CS 744: SPARK STREAMING

CONTINUOUS OPERATORS

Replication to provide fault tolerance

⇒ Multiple copies (say 2?) of each operator

(1) Overhead of 2x resources required

(2) Replicas need to be in sync

⇒ The state of both copies of an operator should be the same

⇒ Need to make sure replicas are synchronized during normal computation → overhead!

Page 5: CS 744: SPARK STREAMING

SPARK STREAMING: GOALS

1. Scalability to hundreds of nodes

2. Minimal cost beyond base processing (no replication)

3. Second-scale latency

4. Second-scale recovery from faults and stragglers

→ To handle high-throughput streams

[Sketch: running-average example with input and output timelines; latency = t2 - t1]

Page 6: CS 744: SPARK STREAMING

DISCRETIZED STREAMS (DSTREAMS)

- Every micro-batch (batch duration): run short, deterministic tasks to compute incremental output

- Every batch operation is stateless: it takes the input and the previous state and produces a new state

- State is stored as an immutable dataset

- Each part of the state can be recovered independently

- Non-deterministic is the opposite → if you re-run, the output might be different

- Example of a non-deterministic task: use a random value to output 0, else output 1. Such a task can be made deterministic, but that may be difficult.
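The micro-batch idea above can be sketched in plain Python, with no Spark involved. This is a minimal illustration under stated assumptions, not the system's implementation: `discretize`, `step`, and the `(count, total)` state layout are hypothetical names chosen for this example. Events are grouped by batch duration, and each step is a deterministic function from (previous state, batch) to a new immutable state, so any step can be re-run to rebuild a lost state.

```python
from collections import defaultdict


def discretize(events, batch_duration):
    """Group (timestamp, value) events into micro-batches by time interval."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_duration)].append(value)
    # Batches are returned in interval order; empty intervals are skipped.
    return [tuple(batches[i]) for i in sorted(batches)]


def step(state, batch):
    """Deterministic task: build a NEW (count, total) state, never mutate."""
    count, total = state
    return (count + len(batch), total + sum(batch))


# Hypothetical input: (timestamp in seconds, value)
events = [(0.2, 10), (0.7, 20), (1.1, 30), (2.5, 40)]
batches = discretize(events, batch_duration=1.0)

# Keep every intermediate state; each one is an immutable tuple.
states = [(0, 0)]
for batch in batches:
    states.append(step(states[-1], batch))

running_average = states[-1][1] / states[-1][0]
```

Because `step` is deterministic, calling `step(states[1], batches[1])` again always reproduces `states[2]`, which is the property that recovery by recomputation relies on.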

Page 7: CS 744: SPARK STREAMING

EXAMPLE

pageViews = readStream(http://..., "1s")

ones = pageViews.map(event => (event.url, 1))

counts = ones.runningReduce((a, b) => a + b)

Notes:

- readStream: the input can come from a filesystem, from HTTP, etc. "1s" is the micro-batch duration; each interval reads one chunk from storage into an RDD.

- map: produces pairs such as (google.com, 1)

- runningReduce: hash, shuffle by key; produces running counts such as (google.com, 5)
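The page-view example can be mimicked in plain Python to show what each interval produces. This is a sketch under assumptions: `map_ones` and `running_reduce` are hypothetical stand-ins for the slide's `map` and `runningReduce`, and the URLs are made-up sample data. Each micro-batch is mapped to (url, 1) pairs, then merged into a fresh copy of the previous counts.

```python
from collections import Counter


def map_ones(batch):
    """Stateless step: turn each page-view URL into a (url, 1) pair."""
    return [(url, 1) for url in batch]


def running_reduce(prev_counts, pairs):
    """Merge one interval's pairs into a new Counter; prev_counts is untouched."""
    new_counts = Counter(prev_counts)
    for url, one in pairs:
        new_counts[url] += one
    return new_counts


# Hypothetical input: one list of viewed URLs per 1-second micro-batch.
micro_batches = [
    ["google.com", "wisc.edu", "google.com"],
    ["google.com"],
    ["google.com", "google.com", "wisc.edu"],
]

counts = Counter()
counts_per_interval = []
for batch in micro_batches:
    counts = running_reduce(counts, map_ones(batch))
    counts_per_interval.append(dict(counts))
```

After the third interval the running count for google.com is 5, matching the (google.com, 5) pair in the notes.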

Page 8: CS 744: SPARK STREAMING

DSTREAM API

Transformations → similar to RDD API

Stateless: map, reduce, groupBy, join

Stateful: window("5s") → RDDs with data in [0,5), [1,6), [2,7)

reduceByWindow("5s", (a, b) => a + b)

Note: reduceByWindow creates a sliding window and aggregates the RDDs that belong to it.
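A small Python sketch of the windowing semantics described above, assuming one batch per second and a window of 5 batches. The helper names `window` and `reduce_by_window` are hypothetical and only mirror the slide's operators; they are not Spark calls. The sketch also assumes every window holds at least one value.

```python
from functools import reduce


def window(batches, length):
    """Return sliding windows: batches [0,5), [1,6), [2,7), ... for length 5."""
    return [batches[start:start + length]
            for start in range(len(batches) - length + 1)]


def reduce_by_window(batches, length, func):
    """Aggregate all values of all batches that belong to each window."""
    results = []
    for win in window(batches, length):
        values = [value for batch in win for value in batch]
        results.append(reduce(func, values))  # assumes values is non-empty
    return results


# Hypothetical input: batch i (covering second i) holds a single value.
batches = [[1], [2], [3], [4], [5], [6], [7]]
sums = reduce_by_window(batches, 5, lambda a, b: a + b)
```

With seven 1-second batches and a 5-second window, there are three windows, and `sums` holds one aggregate per window.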

Page 9: CS 744: SPARK STREAMING

SLIDING WINDOW

Add previous 5 each time

Notes:

- Micro-batch duration: 1s

- Window duration: 5s, so each window covers [t, t+5)

- Re-adding all batches of the window each time → overhead

- Optimization: improve reduceByWindow performance
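The notes mention an optimization without spelling it out. One common way to avoid re-adding every batch, shown here purely as an assumption about what is meant, is to update the previous window's result incrementally: add the batch that enters the window and subtract the batch that leaves it. This works only for aggregations with an inverse, such as sum. Both function names below are hypothetical.

```python
def naive_window_sums(batch_sums, length):
    """Re-add all `length` batches for every window position."""
    return [sum(batch_sums[i:i + length])
            for i in range(len(batch_sums) - length + 1)]


def incremental_window_sums(batch_sums, length):
    """Reuse the previous window: add the entering batch, subtract the leaving one."""
    if len(batch_sums) < length:
        return []
    current = sum(batch_sums[:length])
    results = [current]
    for i in range(length, len(batch_sums)):
        current += batch_sums[i] - batch_sums[i - length]
        results.append(current)
    return results


# Hypothetical per-batch partial sums, one per 1-second micro-batch.
batch_sums = [3, 1, 4, 1, 5, 9, 2, 6]
naive = naive_window_sums(batch_sums, 5)
incremental = incremental_window_sums(batch_sums, 5)
```

Both functions return the same window sums; the incremental version does a constant amount of work per new batch instead of work proportional to the window length.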

Page 10: CS 744: SPARK STREAMING

STATE MANAGEMENT

Tracking State: streams of (Key, Event) → (Key, State)

events.track((key, ev) => 1,

(key, st, ev) => ev == Exit ? null : 1,

"30s")

Notes:

- Session: all events for a user satisfying some criteria [login → logout]

- First function: initialize state

- Second function: update, given the previous state and a new event
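A rough Python rendering of the `track` call above, to make the initialize and update roles concrete. It is a sketch under assumptions: the `track` function here is a hypothetical stand-in, the "30s" timeout argument is deliberately left out, and the event names other than Exit are invented sample data. Returning `None` from a function removes the key, mirroring the `null` in the slide.

```python
EXIT = "Exit"


def track(state, batch, initialize, update):
    """Apply one micro-batch of (key, event) pairs; return a NEW state dict."""
    new_state = dict(state)  # copy, so the previous state stays immutable
    for key, ev in batch:
        if key in new_state:
            result = update(key, new_state[key], ev)
        else:
            result = initialize(key, ev)
        if result is None:
            new_state.pop(key, None)  # None means: drop this key's state
        else:
            new_state[key] = result
    return new_state


# Same two functions as on the slide: a session is 1 while it is open.
initialize = lambda key, ev: 1
update = lambda key, st, ev: None if ev == EXIT else 1

state0 = {}
state1 = track(state0, [("alice", "Login"), ("bob", "Login")], initialize, update)
state2 = track(state1, [("alice", "Click"), ("bob", EXIT)], initialize, update)
active_sessions = len(state2)
```

Counting the keys in the resulting state gives the number of open sessions at that timestep.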

Page 11: CS 744: SPARK STREAMING

SYSTEM IMPLEMENTATION

Notes (partly illegible):

- Persist data safely: (1) on local disk, (2) in remote memory (memory of 2 machines)

- Windowing and state are persisted

- The rest is inherited from Spark

Page 12: CS 744: SPARK STREAMING

OPTIMIZATIONS

Timestep Pipelining

- No barrier across timesteps unless needed

- Tasks from the next timestep scheduled before current finishes

Checkpointing

- Async I/O, as RDDs are immutable

- Forget lineage after checkpoint

Notes:

- Fuse together map operations

- Schedule tasks for the next timestep before the previous one finishes

- Checkpointing can be done by storing to remote memory

Page 13: CS 744: SPARK STREAMING

FAULT TOLERANCE: PARALLEL RECOVERY

Worker failure

- Need to recompute state RDDs stored on worker (→ these might be used for future outputs)

- Re-execute tasks running on the worker

Strategy

- Run all independent recovery tasks in parallel

- Parallelism from partitions in timestep and across timesteps
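The recovery strategy above can be illustrated with a thread pool in plain Python. This is only a sketch under assumptions: `recompute_partition`, the `(timestep, partition_id, records)` task layout, and the sum aggregation are all hypothetical. The point is that lost partitions from the same timestep and from different timesteps are independent, so they can be recomputed concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def recompute_partition(task):
    """Deterministically rebuild one lost partition from its input records."""
    timestep, partition_id, records = task
    return (timestep, partition_id), sum(records)


# Hypothetical lost partitions: two partitions in each of two timesteps.
lost = [
    (1, 0, [1, 2, 3]),
    (1, 1, [4, 5]),
    (2, 0, [6]),
    (2, 1, [7, 8, 9]),
]

# All recovery tasks are independent, so they run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    recovered = dict(pool.map(recompute_partition, lost))
```

Because each task is deterministic, it does not matter which worker reruns it or whether a partition happens to be recomputed twice.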

Page 14: CS 744: SPARK STREAMING

EXAMPLE

pageViews = readStream(http://..., "1s")

ones = pageViews.map(event => (event.url, 1))

counts = ones.runningReduce((a, b) => a + b)

Note: recovery tasks run in parallel.

Page 15: CS 744: SPARK STREAMING

FAULT TOLERANCE

Straggler Mitigation

- Use speculative execution

- Task runs more than 1.4x longer than median task → straggler

Master Recovery

- At each timestep, save graph of DStreams and Scala function objects

- Workers connect to a new master and report their RDD partitions

- Note: No problem if a given RDD is computed twice (determinism).

Notes:

- Driver → runs forever, so it needs a fallback

- MR master → retry the job on failure

- [illegible] master recovery is similar!
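The straggler rule above (more than 1.4x the median task time) is simple enough to show directly. The sketch below is illustrative only: `find_stragglers` and the task names and runtimes are hypothetical, and a real scheduler would launch speculative copies of the flagged tasks rather than just list them.

```python
from statistics import median


def find_stragglers(runtimes, threshold=1.4):
    """Return tasks running more than `threshold` times the median runtime."""
    med = median(runtimes.values())
    return sorted(task for task, seconds in runtimes.items()
                  if seconds > threshold * med)


# Hypothetical elapsed times, in seconds, for tasks of one timestep.
runtimes = {
    "task-0": 1.0,
    "task-1": 1.1,
    "task-2": 0.9,
    "task-3": 2.0,
    "task-4": 1.3,
}
stragglers = find_stragglers(runtimes)
```

Here the median is 1.1s, so only tasks above 1.54s are flagged; task-4 at 1.3s is slow but not a straggler under this rule.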

Page 16: CS 744: SPARK STREAMING

SUMMARY

Micro-batches: New approach to stream processing

Simplifies fault tolerance, straggler mitigation

Unifying batch, streaming analytics

Page 17: CS 744: SPARK STREAMING

DISCUSSION

https://forms.gle/eiqbjJTU95bMQLtm9

Page 18: CS 744: SPARK STREAMING

[Figure residue; notes on the figure, partly illegible:]

- Linear growth with cluster size

- Higher throughput for larger mini-batch size (overhead per mini-batch)

- Slope indicates overhead ⇒ more machines

- No coordination for grep → more throughput for grep!

Page 19: CS 744: SPARK STREAMING

If the latency bound was made to 100ms, how do you think the above figure would change? What could be the reasons for it?

Notes:

- Latency bound too low → so low that overheads in task scheduling, tracking dependencies, etc. dominate

- If we go to 1000 machines ⇒ overheads could be large!

- Linear scaling might not last?

Page 20: CS 744: SPARK STREAMING

Consider the pros and cons of approaches in Naiad vs Spark Streaming. What application properties would you use to decide which system to choose?

Naiad: latency-sensitive applications; iterative + streaming workflows

Spark Streaming: failures; stragglers

Page 21: CS 744: SPARK STREAMING

NEXT STEPS

Next class: Graph processing!

Midterm grades ASAP!

Notes (partly illegible):

- Batching?!

- Continuous operator: 1 event at a time = not optimal

- Very low latency → MPI-based

- → C++ Actor model

- Erlang → telephone companies

