Date posted: 25-May-2015
Category: Software
Uploaded by: roger-brinkley
The 10 Apache Spark Features You (Unlikely) Didn't Hear About
Roger BrinkleyTechnical Evangelist
The 10 Apache Spark Features You (Unlikely) Didn't Hear About
• Ignite Format
• 10 minutes – 10 slides
• No stopping! No going back!
• Questions? Sure, but only while time remains on the slide (otherwise, save them for later)
• Hire me, I’ll find 45 more
It’s Fast. Really Fast.
• 10–100x faster than MapReduce
• 10–100x faster than Hive
• Historical perspective
– JRuby got 2–3x faster with invokedynamic on the JVM
– Hardware rarely improves more than 10x per year
MapReduce Is Listed as the Most Recent Important Software Innovation
And Spark Blew the Lid Off of MapReduce
It’s Pure Open Source
• Commons-based Peer Production
– Apache Software Foundation Top-Level Project
– 200 people from 50 organizations contributing
– 12 organizations committing
– Peer governance
– Participative decision making
The very essence of a free government consists in considering offices as public trusts,
bestowed for the good of the country, and not for the benefit of an individual or a party
John C. Calhoun 2/13/1835
The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party
John C. Calhoun, updated for modern FOSS
Strong Enterprise Relationships
• Spark ships in every major Hadoop distribution
• Vertical enterprise use
– Internet companies, government, financials
– Churn analysis, fraud detection, risk analytics
• Used with other data stores
– DataStax (Cassandra)
– MongoDB
• Databricks has a cloud-based implementation
Enhances Other Big Data Implementations
• Hadoop – replacement for MapReduce
• Cassandra – analytics
• Hive – faster SQL processing
• SAP HANA – faster interactive analysis
API Stability
• Guaranteed stability of its core API for the 1.X series
• Spark has always been conservative with API changes
• Clearly defined annotations for future APIs
– Experimental
– Alpha
– Developer
Don’t Need to Learn a New Language
• Scala
• Java – 25%
• Python – 30%
• And soon R
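The same functional pipeline (flatMap → map to pairs → reduce by key) reads nearly identically across these languages. As a rough illustration, with no Spark involved, here is the word-count logic in plain Python (the sample lines are made up, and `Counter` stands in for `reduceByKey`):

```python
from collections import Counter

# A stand-in for an RDD of text lines (hypothetical sample data)
lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split(" ")]

# map to (word, 1) pairs + reduceByKey: Counter does both at once
counts = Counter(words)

print(counts["to"])  # 4
```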
Java 8 Lambda Support

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");
JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
       .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");
Real Time Stream Processing

val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()

val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")
Caching Interactive Algorithms
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D)  // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
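What makes this fast is that `points` is parsed once, cached, and then reused on every iteration instead of being re-read from disk. The gradient step itself is ordinary logistic-regression math; here is a minimal plain-Python sketch of the same loop over an in-memory (i.e. "cached") toy dataset, with made-up data points:

```python
import math

# Toy dataset held in memory, playing the role of the cached RDD:
# each entry is ((x0, x1), y) with label y in {-1, +1} (hypothetical data)
points = [((1.0, 2.0), 1.0), ((2.0, 0.5), -1.0), ((0.5, 1.5), 1.0)]

w = [0.0, 0.0]  # current separating plane
ITERATIONS = 10
for _ in range(ITERATIONS):
    # Same formula as the Scala snippet: logistic-loss gradient per point
    grad = [0.0, 0.0]
    for (x, y) in points:
        factor = (1.0 / (1.0 + math.exp(-y * (w[0] * x[0] + w[1] * x[1]))) - 1.0) * y
        grad[0] += factor * x[0]
        grad[1] += factor * x[1]
    w = [w[0] - grad[0], w[1] - grad[1]]

print("Final separating plane:", w)
```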
New Security Integration
• Complete integration with the Hadoop/YARN security model
– Authenticate job submissions
– Securely transfer HDFS credentials
– Authenticate communication between components
• Other deployments are supported as well:

val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")
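Conceptually, `spark.authenticate.secret` is a shared secret: both sides must hold the same value before components will talk to each other. A rough, Spark-free sketch of shared-secret authentication using Python's `hmac` module (the challenge/response shape here is illustrative, not Spark's actual wire protocol):

```python
import hashlib
import hmac
import os

secret = b"good"  # plays the role of spark.authenticate.secret

# One side issues a random challenge; the other proves it knows the secret
challenge = os.urandom(16)
response = hmac.new(secret, challenge, hashlib.sha256).hexdigest()

# The verifier recomputes the MAC with its own copy of the secret
expected = hmac.new(secret, challenge, hashlib.sha256).hexdigest()
print(hmac.compare_digest(response, expected))  # True
```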
And Lots More
• Apache Spark Website
• Databricks – making big data easy
– Introduction to Apache Spark
• Jul 28 – Austin, TX – More Info & Registration
• Aug 25 – Chicago, IL – More Info & Registration