
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Date post: 12-Jul-2015
Upload: databricks
Spark’s Role in the Big Data Ecosystem Matei Zaharia
Transcript
Page 1: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Spark’s Role in the Big Data Ecosystem Matei Zaharia

Page 2: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

An Exciting Year for Spark

Very fast community growth

1.0 release in May

7+ distributors, 20+ apps

Page 3: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Project Activity

                          June 2013    June 2014
total contributors        68           255
companies contributing    17           50
total lines of code       63,000       175,000

Page 5: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Compared to Other Projects

[Figure: two bar charts of activity in the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark. Left panel: commits (axis 0 to 1,400). Right panel: lines of code changed (axis 0 to 300,000).]

Page 6: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Compared to Other Projects

[Figure: two bar charts of activity in the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark. Left panel: commits (axis 0 to 1,400). Right panel: lines of code changed (axis 0 to 300,000).]

Spark is now the most active project in the Hadoop ecosystem

Page 7: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Compared to Other Projects

Spark is one of the top 3 most active projects at Apache

More active than “general” data processing projects like NumPy, matplotlib, scikit-learn

Page 8: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Continuing Growth

[Figure: contributors per month to Spark. Source: ohloh.net]

Page 9: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Major new additions

Page 10: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Last Summit

Last Summit we said we’d focus on two things:
•  Standard libraries
•  Enterprise features

New libraries: Spark SQL, MLlib (machine learning), GraphX (graph processing)

Enterprise features: security, monitoring, HA

Page 11: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Spark SQL

Enables loading & querying structured data in Spark

From Hive:

    c = HiveContext(sc)
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()

From JSON:

tweets.json:

    {"text": "hi", "user": {"name": "matei", "id": 123}}

    c.jsonFile("tweets.json").registerAsTable("tweets")
    c.sql("select text, user.name from tweets")
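As a rough illustration of what the JSON query on this slide selects, the sketch below uses only Python's standard library, not Spark. The `select_text_and_user_name` helper is hypothetical, and the one-JSON-object-per-line layout of tweets.json is an assumption; the sample record is the one shown on the slide.

```python
import json

# Sample record from the slide; one JSON object per line is an assumed layout.
TWEETS_JSON = '{"text": "hi", "user": {"name": "matei", "id": 123}}\n'


def select_text_and_user_name(json_lines):
    """Mimic `select text, user.name from tweets` over JSON lines (hypothetical helper)."""
    rows = []
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # user.name in the SQL query reaches into the nested "user" object.
        rows.append((record["text"], record["user"]["name"]))
    return rows


rows = select_text_and_user_name(TWEETS_JSON)
print(rows)
```

The point of the slide is that Spark SQL infers this nested structure itself, so the query can name `user.name` directly without hand-written parsing like the loop above.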

Page 12: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Spark SQL

Integrates closely with Spark’s language APIs

44 contributors in past year

    c.registerFunction("hasSpark", lambda text: "Spark" in text)

    c.sql("select * from tweets where hasSpark(text)")

Uniform interface for data access

Data sources: Hive, Parquet, JSON, Cassandra

Interfaces: SQL, Python, Scala, Java
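The `registerFunction` call on this slide exposes a Python function to SQL queries. The same idea can be tried without a cluster using Python's built-in `sqlite3` module, which offers a comparable `create_function` hook. This is only an analogy under stated assumptions: the table contents are made up, and nothing here is Spark's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tweets (text)")
conn.executemany(
    "insert into tweets values (?)",
    [("Spark Summit 2014",), ("hello world",)],  # made-up sample rows
)

# Register a Python function under a SQL-visible name, then call it from SQL.
# int(...) converts the boolean to 1/0, which SQLite treats as true/false.
conn.create_function("hasSpark", 1, lambda text: int("Spark" in text))

matches = conn.execute("select * from tweets where hasSpark(text)").fetchall()
print(matches)
```

In both cases the query text stays ordinary SQL while the predicate logic lives in the host language.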

Page 13: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Machine Learning Library (MLlib)

Standard library of machine learning algorithms

Now includes 15+ algorithms
•  New in 1.0: decision trees, SVD, PCA, L-BFGS
•  In development: non-negative matrix factorization, LDA, Lanczos, multiclass trees, ADMM

40 contributors in past year

    points = context.sql("select latitude, longitude from tweets")

    model = KMeans.train(points, 10)
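For readers unfamiliar with what a k-means call computes, here is a minimal single-machine sketch of the textbook algorithm (Lloyd's iterations) in plain Python. It is not MLlib's implementation: the `kmeans` function, its `iterations` parameter, the first-k-points initialization, and the sample coordinates are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; a teaching sketch, not MLlib's implementation."""
    # Assumption: seed the centers with the first k points so runs are deterministic.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


# Made-up (latitude, longitude) pairs forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers = kmeans(points, 2)
print(centers)
```

The slide's two-line version does the same kind of clustering, with the points coming straight from a SQL query and the work distributed across a cluster.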

Page 14: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Java 8 API

Enables concise programming in Java similar to Scala and Python

    JavaRDD<String> lines = sc.textFile("data.txt");

    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());

    int totalLength = lineLengths.reduce((a, b) -> a + b);
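For comparison with the Java 8 lambdas on this slide, the same map-then-reduce shape looks like this in plain Python. A list of made-up strings stands in for the lines of data.txt, so this sketch runs without Spark and shows only the programming style, not the RDD API.

```python
from functools import reduce

# A plain list stands in for the lines read from data.txt (made-up content).
lines = ["spark", "summit", "2014"]

line_lengths = map(len, lines)  # counterpart of lines.map(s -> s.length())
total_length = reduce(lambda a, b: a + b, line_lengths)  # counterpart of reduce((a, b) -> a + b)
print(total_length)
```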

Page 15: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

What is our vision for Spark?

Page 16: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

1. Unified Platform for Big Data Apps

Uniform API for diverse workloads over diverse storage systems and runtimes

Workloads: Batch, Interactive, Streaming

Storage systems and runtimes: Hadoop, Cassandra, Mesos, …, Cloud Providers

Page 17: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Why a Platform Matters

Good for developers: one system to learn

Good for users: take apps anywhere

Good for distributors: more applications

Page 18: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

2. Standard Library for Big Data

Big data apps lack libraries of common algorithms

Spark’s generality + support for multiple languages make it suitable to offer this

Libraries on top of Core: SQL, ML, graph, …

Languages: Python, Scala, Java, R

Much of future activity will be in these libraries

Page 19: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks & Spark

At Databricks, we are working to keep Spark 100% open source and compatible across vendors

All our work on Spark is at Apache

Check out project-specific talks to see what’s next!

Page 20: Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Thank You and Enjoy Spark Summit!

