About me
Committer and PMC member of Apache Spark
“Former” PhD student at Berkeley
Release manager for Spark 1.0
Background in networking and distributed systems
Today’s Talk
Spark background
About the Spark release process
The Spark 1.0 release
Looking forward to Spark 1.1
What is Spark?
A fast and expressive cluster computing engine compatible with Apache Hadoop
Efficient: general execution graphs, in-memory storage
Usable: rich APIs in Java, Scala, Python; interactive shell
2-5× less code
Up to 10× faster on disk, 100× in memory
30-Day Commit Activity
[Charts: patches, lines added, and lines removed over 30 days for MapReduce, Storm, YARN, and Spark]
Spark Philosophy
Make life easy and productive for data scientists
Well-documented, expressive APIs
Powerful domain-specific libraries
Easy integration with storage systems
... and caching to avoid data movement
Predictable releases, stable APIs
Spark Release Process
Quarterly release cycle (3 months)
2 months of general development
1 month of polishing, QA and fixes
Spark 1.0: development Feb 1 - April 8, QA and release candidates April 8+
Spark 1.1: development May 1 - July 8, QA and release candidates July 8+
Spark 1.0: By the numbers
- 3 months of development
- 639 patches
- 200+ JIRA issues
- 100+ contributors
API Stability in 1.X
APIs are stable for all non-alpha projects
Spark 1.1, 1.2, … will be compatible
@DeveloperApi
Internal API that is unstable
@Experimental
User-facing API that might stabilize later
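To make the markers concrete, here is a minimal sketch; the two classes are hypothetical, but the annotations are the real markers from org.apache.spark.annotation:

import org.apache.spark.annotation.{DeveloperApi, Experimental}

// Hypothetical example classes, shown only to illustrate the annotations.
@DeveloperApi
class CustomTaskScheduler   // internal API: may change in any release

@Experimental
class ApproximateCounter    // user-facing API: may stabilize later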
Spark Core
History server for Spark UI
Integration with YARN security model
Unified job submission tool
Java 8 support
Internal engine improvements
History Server
Configure with:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://XX
In Spark Standalone, the history server is embedded in the master.
In YARN/Mesos, run the history server as a daemon.
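A minimal sketch of enabling event logging programmatically (hdfs://XX is the placeholder path from the slide; the same keys can go in spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.eventLog.enabled", "true")   // record events for the history server
  .set("spark.eventLog.dir", "hdfs://XX")  // placeholder path from the slide
val sc = new SparkContext(conf)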
Job Submission Tool
Apps don't need to hard-code the master:

val conf = new SparkConf().setAppName("My App")
val sc = new SparkContext(conf)

./bin/spark-submit <app-jar> \
  --class my.main.Class \
  --name myAppName \
  --master local[4]   # or, e.g., --master spark://some-cluster
Java 8 Support
RDD operations can use lambda syntax.

Old:
class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}
JavaRDD<String> words = lines.flatMap(new Split());

New:
JavaRDD<String> words = lines
  .flatMap(s -> Arrays.asList(s.split(" ")));
Java 8 Support
NOTE: Minor API changes
(a) If you are extending Function classes, use implements rather than extends.
(b) Return-type-sensitive functions: mapToPair, mapToDouble
Python API Coverage
RDD operators: intersection(), take(), top(), takeOrdered()
Metadata: name(), id(), getStorageLevel()
Runtime configuration: setJobGroup(), setLocalProperty()
Integration with YARN Security
Supports Kerberos authentication in YARN environments:
spark.authenticate = true
ACL support for user interfaces:
spark.ui.acls.enable = true
spark.ui.view.acls = patrick, matei
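A minimal sketch of setting these programmatically; they can equally be placed in spark-defaults.conf:

import org.apache.spark.SparkConf

val secureConf = new SparkConf()
  .set("spark.authenticate", "true")       // authentication in YARN environments
  .set("spark.ui.acls.enable", "true")     // enforce ACLs on the web UI
  .set("spark.ui.view.acls", "patrick,matei")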
Documentation
Unified Scaladocs across modules
Expanded MLlib guide
Deployment and configuration specifics
Expanded API documentation
The Spark stack:
Spark core: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams, i.e. streams of RDDs
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-based matrices
Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ)
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
Import and Export

// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")

// Load data stored in Hive.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")
In-Memory Columnar Storage
Spark SQL can cache tables using an in-memory columnar format:
- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects best compression
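A minimal sketch of caching the people table registered earlier; sqlContext is assumed to be the SQLContext behind the sql(...) calls above:

// Cache the table in the in-memory columnar format; subsequent
// queries scan only the columns they reference.
sqlContext.cacheTable("people")
val teenCount = sql("SELECT COUNT(*) FROM people WHERE age <= 19").collect()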
Spark Streaming
Web UI for streaming
Graceful shutdown (see the sketch after this list)
User-defined input streams
Support for creating in Java
Refactored API
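A minimal sketch of a graceful shutdown, assuming the two-flag stop() overload in the 1.0 streaming API:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// ... define input streams and output operations ...
ssc.start()
// On shutdown: finish processing data already received,
// then stop the underlying SparkContext as well.
ssc.stop(stopSparkContext = true, stopGracefully = true)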
MLlib
Sparse vector support (see the sketch after this list)
Decision trees
Linear algebra
SVD and PCA
Evaluation support
3 contributors in the last 6 months
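A minimal sketch of the new sparse vector support via the mllib.linalg factory methods:

import org.apache.spark.mllib.linalg.Vectors

// Size-5 vector with 2.0 at index 1 and 4.0 at index 3.
val sparseVec = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
// The equivalent dense representation stores every entry explicitly.
val denseVec = Vectors.dense(0.0, 2.0, 0.0, 4.0, 0.0)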
MLlib
Note: Minor API change

Old:
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
val clusters = KMeans.train(parsedData, 4, 100)

New:
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)
1.1 and Beyond
Data import/export leveraging Catalyst (HBase, Cassandra, etc.)
Shark-on-Catalyst
Performance optimizations:
- External shuffle
- Pluggable storage strategies
Streaming: reliable input from Flume and Kafka
Unifying Experience
SchemaRDD represents a consistent integration point for data sources.
spark-submit abstracts the environmental details (YARN, hosted cluster, etc.).
API stability across versions of Spark
Conclusion
Visit spark.apache.org for videos, tutorials, and hands-on exercises.
Help us test a release candidate!
Spark Summit on June 30th
spark-summit.org
Meetup group: meetup.com/spark-users