CS 294-110: Project Suggestions
Ion Stoica and Ali Ghodsi
(http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)
September 14, 2015

Page 2: Projects

- This is a project-oriented class
- Reading papers should be a means to a great project, not a goal in itself!
- Strongly prefer groups of two students
- Today, I'll present some suggestions
  - But, you are free to come up with your own proposal
- Main goal: just do a great project

Page 3: Projects

- Many projects around Spark
  - Local expertise
  - Great platform to disseminate your work
- Short review based on log mining example to provide context

Pages 4-21: Spark Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")                       # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()            # action
    messages.filter(lambda s: "php" in s).count()

(A self-contained, runnable version of this snippet appears right after this section.)

[Diagram, built up across these slides: the Driver ships tasks to three Workers. For the first count(), each Worker reads its HDFS block (Block 1-3), processes and caches the error messages (Cache 1-3), and returns its results to the Driver. For the second count() (the "php" filter), the Driver again sends tasks, but the Workers process directly from their in-memory caches and return results without touching HDFS.]

Cache your data => Faster Results
- Full-text search of Wikipedia
- 60 GB on 20 EC2 machines
- 0.5 sec from memory vs. 20 s on disk
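For readers who want to run the example end to end, here is a minimal, self-contained PySpark sketch of the same log-mining flow. It assumes PySpark is installed and that log lines are tab-separated with the message in the third column; the local file path is a placeholder, and the "spark" object in the slides corresponds to a SparkContext here.

    # log_mining_local.py -- runnable sketch of the slides' example (assumptions: PySpark
    # installed; tab-separated log lines such as "ERROR\t2015-09-14\tmysql timed out")
    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")

    # The slides read from HDFS ("hdfs://..."); a local file works the same way for a test.
    lines = sc.textFile("file:///tmp/app.log")

    errors = lines.filter(lambda s: s.startswith("ERROR"))   # lazy transformation
    messages = errors.map(lambda s: s.split("\t")[2])        # keep only the message column
    messages.cache()                                         # pin the filtered data in memory

    print(messages.filter(lambda s: "mysql" in s).count())   # 1st action: reads the file, fills the cache
    print(messages.filter(lambda s: "php" in s).count())     # 2nd action: answered from the cache

    sc.stop()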

Page 22: Spark

[Diagram: the Spark stack. Spark Core at the base; Spark SQL, Spark Streaming, MLlib, and GraphX as libraries on top of it; DataFrames and ML Pipelines as higher-level APIs; and a pluggable Data Sources layer ({JSON}, ..., ?) feeding into the stack.]

Page 23: Pipeline Shuffle

Problem:
- Right now, shuffle senders write data to storage, after which the data is shuffled to receivers
- Shuffle is often the most expensive communication pattern, and sometimes dominates job completion time

Project:
- Start sending shuffle data as it is being produced (a toy contrast is sketched below)

Challenge:
- How do you do recovery and speculation? Could store data as it is being sent, but still not easy…
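A toy way to see the contrast, with no Spark involved: the record generator and the send callback below are purely illustrative. The blocking variant materializes every record before anything is sent, while the pipelined variant forwards each record as soon as it is produced, which is the behavior this project would bring to Spark's shuffle.

    # Illustrative only: contrasts materialize-then-send with send-as-produced.
    def produce_records(n):
        for i in range(n):
            yield ("key%d" % (i % 4), i)      # pretend each record is expensive to compute

    def shuffle_blocking(records, send):
        staged = list(records)                # sender stages everything on "storage" first...
        for r in staged:
            send(r)                           # ...then the data is shuffled to receivers

    def shuffle_pipelined(records, send):
        for r in records:
            send(r)                           # forward each record as it is produced

    received = []
    shuffle_pipelined(produce_records(8), received.append)
    print(received)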

Page 24: Fault Tolerance & Perf. Tradeoffs

Problem:
- Maintaining lineage in Spark provides fault recovery, but comes at a performance cost
- E.g., hard to support super small tasks due to lineage overhead

Project:
- Evaluate how much you can speed up Spark by ignoring fault tolerance
- Can generalize to other cluster computing engines

Challenge:
- What do you do for large jobs, and how do you treat stragglers?
- Maybe a hybrid method, i.e., just don't do lineage for small jobs?
  - Need to figure out when a job is small…

Page 25: (Eliminating) Scheduling Overhead

Problem: with Spark, the driver schedules every task
- Latency is 100s of ms or higher; cannot run millisecond-scale queries
- The driver can become a bottleneck

Project:
- Have the workers perform scheduling

Challenge:
- How do you handle faults?
- Maybe some hybrid solution across driver and workers?

Page 26: Cost-based Optimization in Spark SQL

Problem:
- Spark employs a rule-based query planner (Catalyst)
- Limited optimization opportunities, especially when operator performance varies widely based on input data
  - E.g., join and selection on skewed data

Project: a cost-based optimizer
- Estimate operators' costs, and use these costs to compute the query plan (a toy version of the idea is sketched below)
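To make the idea concrete, here is a minimal, Spark-free sketch of cost-based planning. The cardinalities, the selectivity, and the cost model are made-up placeholders; a real optimizer would estimate them from statistics on the input data and search a much larger plan space.

    # Toy cost-based planner: pick the cheapest join order for A JOIN B JOIN C.
    # All numbers and the cost model are illustrative assumptions, not Spark internals.
    from itertools import permutations

    rows = {"A": 1000000, "B": 10000, "C": 100}   # estimated input cardinalities
    selectivity = 0.001                           # assumed selectivity of each join

    def join_cost(order):
        """Cost = sum of intermediate result sizes (a crude but common proxy)."""
        left_rows, cost = rows[order[0]], 0
        for table in order[1:]:
            left_rows = left_rows * rows[table] * selectivity   # estimated join output size
            cost += left_rows
        return cost

    best = min(permutations(rows), key=join_cost)
    print("cheapest join order:", " JOIN ".join(best), "estimated cost:", join_cost(best))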

Page 27: Streaming Graph Processing

Problem:
- With GraphX, queries can be fast, but updates are typically applied in batches (slow)

Project:
- Incrementally update graphs (see the toy sketch below)
- Support window-based graph queries

Note:
- Discuss with Anand Iyer and Ankur Dave if interested
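As a minimal illustration of "incremental" versus "batch", the sketch below maintains per-vertex degrees as edges stream in, touching only the two endpoints of each new edge instead of recomputing over the whole graph. It is plain Python, not GraphX, and the edge stream is a made-up example.

    from collections import defaultdict

    degree = defaultdict(int)   # materialized view over the graph, kept up to date incrementally

    def add_edge(u, v):
        """Incremental update: O(1) work per new edge, no full recomputation."""
        degree[u] += 1
        degree[v] += 1

    def batch_degrees(edges):
        """Batch recomputation over the whole edge set (what we want to avoid per update)."""
        d = defaultdict(int)
        for u, v in edges:
            d[u] += 1
            d[v] += 1
        return d

    stream = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    for u, v in stream:
        add_edge(u, v)
    assert degree == batch_degrees(stream)   # incremental and batch answers agree
    print(dict(degree))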

Page 28: Streaming ML

Problem:
- Today, ML algorithms are typically performed on static data
- Cannot update the model in real time

Project:
- Develop on-line ML algorithms that update the model continuously as new data is streamed (a minimal example follows)

Notes:
- Also contact Joey Gonzalez if interested
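A bare-bones example of the kind of continuous update this project targets: online linear regression via stochastic gradient descent, where each arriving example nudges the model immediately instead of waiting for a batch retrain. This is plain Python with a synthetic stream; hooking it up to a live stream (e.g., Spark Streaming) and to MLlib is the actual project.

    import random

    w, b, lr = 0.0, 0.0, 0.01     # model parameters and learning rate

    def update(x, y):
        """Consume one (x, y) example and immediately adjust the model (online SGD)."""
        global w, b
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err

    # Synthetic stream: y = 3x + 1 plus noise. In the project, examples would arrive
    # continuously from a stream rather than from this loop.
    random.seed(0)
    for _ in range(5000):
        x = random.uniform(-1, 1)
        y = 3 * x + 1 + random.gauss(0, 0.1)
        update(x, y)

    print("learned w=%.2f b=%.2f (true values 3 and 1)" % (w, b))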

Page 29: Beyond JVM: Using Non-Java Libraries

Problem:
- Spark tasks are executed within JVMs
- Limits performance and the use of popular non-Java libraries

Project:
- A general way to add support for non-Java libraries
- Example: use JNI to call arbitrary libraries (a rough Python-side analogue follows)

Challenges:
- Define the interface, shared data formats, etc.

Notes:
- Contact Guanhua and Shivaram, if needed
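The slides suggest JNI from the JVM; as a rough Python-side analogue of "call an arbitrary native library from inside a task", the sketch below loads the C math library with ctypes inside a mapPartitions function, so the library is loaded once per partition rather than once per record. The library lookup assumes a Unix-like system; everything else is ordinary PySpark.

    # Rough analogue of the "call native code from a task" idea, using ctypes instead of JNI.
    import ctypes
    import ctypes.util
    from pyspark import SparkContext

    sc = SparkContext(appName="NativeCallDemo")

    def cbrt_partition(values):
        # Load libm once per partition (assumes a Unix-like system where find_library works).
        libm = ctypes.CDLL(ctypes.util.find_library("m"))
        libm.cbrt.restype = ctypes.c_double
        libm.cbrt.argtypes = [ctypes.c_double]
        for v in values:
            yield libm.cbrt(float(v))

    print(sc.parallelize([1, 8, 27, 64], 2).mapPartitions(cbrt_partition).collect())
    sc.stop()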

Page 30: Beyond JVM: Dynamic Code Generation

Problem:
- Spark tasks are executed within JVMs
- Limits performance and the use of popular non-Java libraries

Project:
- Generate non-Java code, e.g., C++, or CUDA for GPUs

Challenges:
- API and shared data format

Notes:
- Contact Guanhua and Shivaram, if needed

Page 31: Beyond JVM: Resource Management and Scheduling

Problem:
- Need to schedule the processes hosting non-Java code
- A GPU cannot be invoked by more than one process

Project:
- Develop scheduling and resource management algorithms

Challenge:
- Preserve fault tolerance and straggler mitigation

Notes:
- Contact Guanhua and Shivaram, if needed

Page 32: Time Series for DataFrames

Inspired by Pandas and R DataFrames, Spark recently introduced DataFrames.

Problem:
- Spark DataFrames don't support time series

Project:
- Develop and contribute distributed time series operations for DataFrames (a small building block is sketched below)

Challenge:
- Spark doesn't have indexes

See http://pandas.pydata.org/pandas-docs/stable/timeseries.html
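One of the simplest time-series operations, resampling into fixed buckets, can already be expressed on a Spark DataFrame by grouping on a truncated timestamp; the sketch below shows this as a starting point. It uses the SparkSession API from Spark releases newer than these slides, the column names and 5-minute bucket size are made up, and a real time-series library would add alignment, windows, interpolation, and more.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("TimeBuckets").getOrCreate()

    # Toy event data: (timestamp string, value). Column names are illustrative.
    df = spark.createDataFrame(
        [("2015-09-14 10:01:00", 1.0),
         ("2015-09-14 10:03:30", 3.0),
         ("2015-09-14 10:07:10", 5.0)],
        ["ts", "value"])

    # Resample into 5-minute buckets: truncate each timestamp to the start of its bucket.
    bucket = (F.floor(F.unix_timestamp("ts") / 300) * 300).cast("timestamp").alias("bucket")
    df.groupBy(bucket).agg(F.avg("value").alias("avg_value")).orderBy("bucket").show()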

Page 33: ACID Transactions for Spark SQL

Problem:
- Spark SQL is used for analytics and doesn't support ACID

Project:
- Develop and add row-level ACID transactions on top of Spark SQL

Challenge:
- Challenging to provide transactions and analytics in one system

See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Page 34: Typed DataFrames

Problem:
- DataFrames in Spark, unlike Spark RDDs, do not provide type safety

Project:
- Develop a typed DataFrame framework for Spark

Challenge:
- SQL-like operations are inherently dynamic (e.g., filter("col")), which makes it hard to have static typing unless fancy reflection mechanisms are used

Page 35: General Pipelines for Spark

Problem:
- Spark.ml provides a pipeline abstraction for ML; generalize it to cover all of Spark

Project:
- Develop a pipeline abstraction (similar to ML pipelines) that spans all of Spark, allowing users to perform SQL operations, GraphX operations, etc. (a toy version is sketched below)
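As a sketch of what a Spark-wide pipeline abstraction might feel like, here is a tiny, framework-free version: each stage exposes one apply method, and a Pipeline just threads a dataset through its stages. The stage names and the idea of wrapping SQL or graph steps this way are illustrative assumptions, not an existing Spark API.

    class Stage(object):
        """One step of a pipeline; subclasses could wrap SQL, graph, or ML operations."""
        def apply(self, data):
            raise NotImplementedError

    class Filter(Stage):
        def __init__(self, pred): self.pred = pred
        def apply(self, data): return [x for x in data if self.pred(x)]

    class Map(Stage):
        def __init__(self, fn): self.fn = fn
        def apply(self, data): return [self.fn(x) for x in data]

    class Pipeline(Stage):
        """A pipeline is itself a stage, so pipelines can be nested and reused."""
        def __init__(self, stages): self.stages = stages
        def apply(self, data):
            for stage in self.stages:
                data = stage.apply(data)
            return data

    cleanup = Pipeline([Filter(lambda r: r["ok"]), Map(lambda r: r["value"])])
    print(cleanup.apply([{"ok": True, "value": 1}, {"ok": False, "value": 2}]))   # -> [1]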

Page 36: Beyond BSP

Problem:
- With BSP, each worker executes the same code

Project:
- Can we extend Spark (or another cluster computing framework) to support non-BSP computation?
- How much better is this than emulating everything with BSP?

Challenge:
- Maintain simple APIs
- More complex scheduling and communication patterns

Page 37: Project Idea: Cryptography & Big Data (Alessandro Chiesa)

As data and computations scale up to larger sizes… can cryptography follow?

One direction: zero-knowledge proofs for big data.

Page 38: Classical Setting: Zero-Knowledge Proofs on One Machine

[Diagram: a server returns a result to a client. Server: "Here is the result of your computation." Client: "I don't believe you." Server: "I don't want to give you my private data." Client: "Send me a ZK proof of correctness?" Adding the crypto magic, the server additionally generates a ZK proof and sends it along with the result, and the client verifies the ZK proof.]

Page 39: New Setting for Big Data: Zero-Knowledge Proofs on Clusters

[Diagram: a cluster (rather than a single server) returns the result plus a ZK proof to the client; proof generation is split across the cluster's machines, and the client verifies the ZK proof.]

Problem:
- Cannot generate the ZK proof on one machine (as before)

Challenge:
- Generate the ZK proof over a cluster (e.g., using Spark)

End goal: "scaling up" ZK proofs to computations on big data, and exploring security applications!

Page 40: Succinct (Quick Overview)

Queries on compressed data. Basic operations:
- Search: given a substring "s", return the offsets of all occurrences of "s" within the input
- Extract: given an offset "o" and a length "l", uncompress and return "l" bytes of the original file starting at "o"
- Count: given a substring "s", return the number of occurrences of "s" within the input

A key-value store can be implemented on top of it. (A toy illustration of this interface follows.)
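To pin down the semantics of the three operations, here is a deliberately naive stand-in that answers the same queries on uncompressed bytes with linear scans; real Succinct answers them directly on a compressed representation, which this sketch does not attempt.

    class NaiveSuccinct(object):
        """Toy illustration of Succinct's query interface; no compression, linear scans."""

        def __init__(self, data):
            self.data = data                      # bytes of the original input

        def search(self, s):
            """Offsets of all occurrences of substring s."""
            offsets, start = [], self.data.find(s)
            while start != -1:
                offsets.append(start)
                start = self.data.find(s, start + 1)
            return offsets

        def count(self, s):
            """Number of occurrences of substring s."""
            return len(self.search(s))

        def extract(self, o, l):
            """Return l bytes of the original input starting at offset o."""
            return self.data[o:o + l]

    store = NaiveSuccinct(b"ERROR mysql\nWARN php\nERROR mysql\n")
    print(store.search(b"mysql"), store.count(b"ERROR"), store.extract(6, 5))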

Page 41: Succinct: Efficient Point Query Support

Problem:
- The Spark implementation is expensive, as it always queries all workers

Project:
- Implement Succinct on top of Tachyon (storage layer)
- Provide efficient key-value store lookups, i.e., look up a single worker if the key is there

Note:
- Contact Anurag and Rachit, if interested

Page 42: Succinct: External Memory Support

Problem:
- Some data grows faster than main memory
- Need to execute queries on external storage (e.g., SSDs)

Project:
- Design & implement compressed data structures for efficient external-memory execution
- There is a lot of work in the theory community that could be exploited

Note:
- Contact Anurag and Rachit, if interested

Page 43: Succinct: Updates

Problem:
- Current systems use a multi-store architecture
- Expensive to update the compressed representation

Project:
- Develop a low-overhead update solution with minimal impact on memory overhead and query performance
- Start from the multi-store architecture (see the NSDI paper)

Note:
- Contact Anurag and Rachit, if interested

Page 44: Succinct: SQL

Problem:
- Arbitrary substring search is powerful, but not that many workloads need it

Project:
- Support SQL on top of Succinct
- Start from Spark SQL and the Succinct Spark package?

Note:
- Contact Anurag and Rachit, if interested

Page 45: Succinct: Genomics

Problem:
- Genomics pipelines are still expensive

Project:
- Genome processing on a single machine (using compressed data)
- Enable queries on compressed genomes

Challenges:
- Domain-specific query optimizations

Note:
- Contact Anurag and Rachit, if interested

