
BDTC2015 databricks-辛湜-state of spark

Date post: 12-Feb-2017
State of Spark, and where it is going Reynold Xin 辛湜 @hashjoin 2015-12-10, Beijing BDTC
Transcript
Page 1: BDTC2015 databricks-辛湜-state of spark

State of Spark, and where it is going

Reynold Xin 辛湜 @hashjoin 2015-12-10, Beijing BDTC

Page 2: BDTC2015 databricks-辛湜-state of spark

About Databricks

Founded by creators of Spark in 2013

Cloud service for end-to-end data processing

• Interactive notebooks, dashboards, production jobs, security, …

Page 3: BDTC2015 databricks-辛湜-state of spark

Our Goal for Spark

Unified engine across data workloads and platforms

SQL Streaming ML Graph Batch …

Page 4: BDTC2015 databricks-辛湜-state of spark

Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE: Tencent (1PB+ /day)

LONGEST-RUNNING JOB: Alibaba (1 week on 1PB+ data)

LARGEST SHUFFLE: Databricks PB Sort (1PB)

MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)

LARGEST CLUSTER: Tencent (8000+ nodes)

Based on Reynold’s personal knowledge

Page 5: BDTC2015 databricks-辛湜-state of spark

A Great Year for Spark

Most active open source project in big data

New language: R

Widespread industry support & adoption

Page 6: BDTC2015 databricks-辛湜-state of spark
Page 7: BDTC2015 databricks-辛湜-state of spark

“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune

Page 8: BDTC2015 databricks-辛湜-state of spark

“Spark is the Angelababy of big data.” - Derrick Harris, Fortune

Page 9: BDTC2015 databricks-辛湜-state of spark

Community Growth

Summit Attendees: 1100 (2014), 3900 (2015)

Meetup Members: 12K (2014), 50K (2015)

Developers Contributing: 500 (2014), 1000 (2015)

Page 10: BDTC2015 databricks-辛湜-state of spark

Meetup Groups: December 2014

source: meetup.com

Page 11: BDTC2015 databricks-辛湜-state of spark

Meetup Groups: December 2015

source: meetup.com

Page 12: BDTC2015 databricks-辛湜-state of spark
Page 13: BDTC2015 databricks-辛湜-state of spark

Users

1000+ companies

Distributors + Apps

50+ companies

Page 14: BDTC2015 databricks-辛湜-state of spark

Spark Survey 2015

Page 15: BDTC2015 databricks-辛湜-state of spark

Databricks Survey

1400 respondents from 840 companies

Three trends:

1) Diverse applications

2) More runtime environments

3) More types of users

Page 16: BDTC2015 databricks-辛湜-state of spark

Industries Using Spark

Page 17: BDTC2015 databricks-辛湜-state of spark

Top Applications

Business Intelligence: 68%

Data Warehousing: 52%

Recommendation: 44%

Log Processing: 40%

User-Facing Services: 36%

Fraud Detection / Security: 29%

Page 18: BDTC2015 databricks-辛湜-state of spark

Spark Components Used

Spark SQL: 69%

DataFrames: 62%

Spark Streaming: 58%

MLlib + GraphX: 58%

75% of users use more than one component

Page 19: BDTC2015 databricks-辛湜-state of spark

Diverse Runtime Environments

Hadoop: combined compute + storage

• MapReduce on HDFS

Spark: independent of storage layer

• Spark on HDFS, SQL (e.g. Oracle), NoSQL (e.g. Cassandra)

Page 20: BDTC2015 databricks-辛湜-state of spark

Diverse Runtime Environments

Chart, 2014 vs. 2015: Hadoop, NoSQL, Proprietary SQL (legend: use a little / use a lot)

Values shown: 61%, 31%, 46%, 34%, 43%, 36%, 37%, 21%

Page 21: BDTC2015 databricks-辛湜-state of spark

Diverse Runtime Environments

Cluster Managers

Page 22: BDTC2015 databricks-辛湜-state of spark

Diversity of Users

Languages Used: 2014

84%, 38%, 38%

Languages Used: 2015

71%, 31%, 58%, 18%

Page 23: BDTC2015 databricks-辛湜-state of spark

Fastest Growing Components

+280% increase in Windows users

+56% production use of Streaming

+380% production use of SQL

Page 24: BDTC2015 databricks-辛湜-state of spark

Are We Done?

No! Development is faster than ever.

Biggest technical change in 2015 was DataFrames

• Moves many computations onto the relational Spark SQL optimizer

Enables both new APIs and more optimization, which is now happening through Project Tungsten

Page 25: BDTC2015 databricks-辛湜-state of spark

Traditional Spark vs. DataFrames

Traditional Spark (RDDs)

• User code: Java functions

• Storage: opaque Java objects

DataFrames

• DataFrame API, SQL: expressions

• Optimizer

• Storage: schema-aware cache, structured data sources, query pushdown
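
To make the contrast on this slide concrete, here is a minimal sketch that computes the same per-city totals twice: once with RDD functions, which Spark treats as opaque code, and once with DataFrame expressions, which the Spark SQL optimizer can inspect and plan. It is an illustration, not code from the talk. The `city` and `amount` columns, the sample data, and both helper names are assumptions, and it uses `SparkSession` from Spark 2.0 and later rather than the `SQLContext` that was current when the talk was given.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RddVsDataFrameSketch {
  // RDD style: the filter and the reduce are arbitrary JVM functions,
  // so Spark runs them as written and cannot look inside them.
  def totalsWithRdd(spark: SparkSession, sales: Seq[(String, Long)]): Map[String, Long] =
    spark.sparkContext.parallelize(sales)
      .filter { case (_, amount) => amount > 6 }
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // DataFrame style: the filter and the aggregation are expressions over
  // named columns, so they go through the Spark SQL optimizer.
  def totalsWithDataFrame(spark: SparkSession, sales: Seq[(String, Long)]): Map[String, Long] = {
    import spark.implicits._
    val totals = sales.toDF("city", "amount")   // hypothetical column names
      .filter($"amount" > 6)
      .groupBy($"city")
      .agg(sum($"amount").as("total"))
    totals.explain()                            // prints the physical plan chosen by the optimizer
    totals.as[(String, Long)].collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-df").getOrCreate()
    val sales = Seq(("beijing", 10L), ("beijing", 8L), ("shanghai", 7L), ("shanghai", 2L))
    println(totalsWithRdd(spark, sales))
    println(totalsWithDataFrame(spark, sales))
    spark.stop()
  }
}
```

Both helpers return the same totals; only the DataFrame version gives Spark a plan it can rewrite before execution.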

Page 26: BDTC2015 databricks-辛湜-state of spark

3 Things to Look Forward To

Page 27: BDTC2015 databricks-辛湜-state of spark

Dataset API in Spark 1.6 (SPARK-9999)

Typed interface over DataFrames / Tungsten

case class Person(name: String, age: Int)

val dataframe = read.json("people.json")

val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")
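
The slide's snippet is shorthand. As a rough, self-contained version of the same idea, the sketch below builds a typed `Dataset[Person]`, filters it with an ordinary Scala lambda, and then aggregates with relational operators. Several details are assumptions rather than part of the talk: it uses `SparkSession` (Spark 2.0 and later) instead of the 1.6-era `SQLContext`, it builds the data from an in-memory `Seq` instead of reading `people.json`, it declares `age` as `Long` (the type Spark infers for JSON integers), and the helper name `avgAgeOfMNames` is made up for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Record type for the typed view; age is Long to match Spark's JSON inference.
case class Person(name: String, age: Long)

object DatasetSketch {
  // Hypothetical helper: average age per name, for names starting with "M".
  def avgAgeOfMNames(spark: SparkSession, people: Seq[Person]): Map[String, Double] = {
    import spark.implicits._                    // encoders for case classes and tuples
    val ds: Dataset[Person] = people.toDS()
    ds.filter(p => p.name.startsWith("M"))      // typed lambda over Person objects
      .groupBy($"name")                         // relational grouping on a named column
      .agg(avg($"age").as("avg_age"))
      .as[(String, Double)]                     // back to a typed result
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    val sample = Seq(Person("Mary", 30), Person("Mary", 40), Person("Bob", 25))
    println(avgAgeOfMNames(spark, sample))
    spark.stop()
  }
}
```

The lambda passed to `filter` is checked against `Person` at compile time, while the grouping and averaging still run as DataFrame operations.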

Page 28: BDTC2015 databricks-辛湜-state of spark

Streaming DataFrames

Easier-to-use APIs (batch, streaming, and interactive)

And optimizations:

- Tungsten backends

- native support for out-of-order data

- data sources and sinks

val stream = read.kafka("...")

stream.window(5 mins, 10 secs)
  .agg(sum("sales"))
  .write.jdbc("mysql://...")
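
The snippet above shows the API as proposed on the slide, not a released one. For comparison, here is a minimal sketch of the same windowed aggregation written against the Structured Streaming API that later shipped in Spark 2.x. It makes several assumptions: the built-in `rate` test source stands in for Kafka, the console sink stands in for JDBC, and the `sales` column name and the `windowedSales` helper are hypothetical. The helper uses only DataFrame operations, so the same function runs on a batch input and on a stream.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, window}

object StreamingSketch {
  // Hypothetical helper: total sales per 5-minute window, sliding every 10 seconds.
  // Expects a "timestamp" column and a numeric "sales" column.
  def windowedSales(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("timestamp"), "5 minutes", "10 seconds"))
      .agg(sum(col("sales")).as("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("streaming-sketch").getOrCreate()
    import spark.implicits._

    // Batch input: a small in-memory table.
    val batch = Seq(
      (Timestamp.valueOf("2015-12-10 09:00:03"), 4L),
      (Timestamp.valueOf("2015-12-10 09:00:07"), 6L)
    ).toDF("timestamp", "sales")
    windowedSales(batch).show(5, truncate = false)

    // Streaming input: the rate source emits (timestamp, value) rows.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5L)
      .load()
      .withColumnRenamed("value", "sales")

    val query = windowedSales(stream).writeStream
      .outputMode("update")        // emit only the windows that changed in each trigger
      .format("console")
      .start()

    query.awaitTermination(15000L) // let the demo run for about 15 seconds
    query.stop()
    spark.stop()
  }
}
```

Swapping the source and sink for Kafka and JDBC would need the corresponding connector packages and is left out of this sketch.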

Page 29: BDTC2015 databricks-辛湜-state of spark
Page 30: BDTC2015 databricks-辛湜-state of spark

3D XPoint

- DRAM latency

- SSD capacity

- Byte addressable

Page 31: BDTC2015 databricks-辛湜-state of spark

Unified API, One Engine, Automatically Optimized

Language frontend: Python, Java/Scala, R, SQL, …

DataFrame Logical Plan

Tungsten backend: LLVM, JVM, SIMD, 3D XPoint

Page 32: BDTC2015 databricks-辛湜-state of spark
Page 33: BDTC2015 databricks-辛湜-state of spark

Thank you! @rxin

