Date posted: 14-Aug-2015
Category: Technology
Uploaded by: edureka
View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training
Spark For Fast Batch Processing
Slide 2
Objectives
Let’s talk about:
What is Big Data?
Associated Challenges
What is Spark?
Why Spark?
Spark Ecosystem
Spark With Hadoop
Spark in Industry
RDDs – A Quick Look
Spark vs. MapReduce Performance – Demo
Slide 3
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
What is Big Data?
Slide 4
IBM’s Definition – Big Data Characteristics (http://www-01.ibm.com/software/data/bigdata/)
VOLUME   VELOCITY   VARIETY   VERACITY

Variety examples: Web logs, Images, Videos, Audio, Sensor Data

Sample summary statistics:
Min   Max   Mean   SD
4.3   7.9   5.84   0.83
2.0   4.4   3.05   0.43
0.1   2.5   1.20   0.76
Associated Challenges
Slide 5
What is Spark?
Apache Spark is an open-source, parallel data processing framework that complements Apache Hadoop, making it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.
Developed at UC Berkeley
Written in Scala, a functional programming language that runs on the JVM
It generalizes the MapReduce framework
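To see what “generalizing MapReduce” means in practice, here is a minimal word count written with plain Scala collections — a sketch of the map → group → reduce pattern that Spark’s RDD API expresses with the same functional operations (the data and names here are illustrative, not from the slides):

```scala
// A minimal in-memory word count, sketching the map -> group -> reduce
// pattern that Spark's RDD API generalizes (illustrative data).
val lines = List("spark makes big data simple",
                 "big data needs fast processing")

val counts = lines
  .flatMap(_.split(" "))                  // "map": emit one record per word
  .groupBy(identity)                      // "shuffle": group identical words
  .map { case (w, ws) => (w, ws.size) }   // "reduce": count each group
```

In Spark the same pipeline is written against an RDD instead of a List, and the work is distributed across the cluster.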
Slide 6
Why Spark ?
Speed
Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use
Supports multiple languages (Scala, Java, Python) for developing Spark applications
Generality
Combine SQL, streaming, and complex analytics into one platform
Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Slide 7
MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)
To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence
Each of those jobs is high-latency, and none can start until the previous job has finished completely
The job output between each step has to be written to the distributed file system before the next step can begin
Hadoop requires the integration of several tools for different Big Data use cases (like Mahout for machine learning and Storm for streaming data processing)
Why Spark? – MapReduce Limitations
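The multi-pass problem above can be sketched with plain Scala collections (illustrative data, not from the slides): each pass below reuses the same in-memory dataset, which is what Spark does with a cached RDD — whereas with MapReduce each pass would be a separate job whose output is written out and re-read.

```scala
// Plain-Scala sketch of a multi-pass (iterative) computation.
val data = Vector(1.0, 4.0, 7.0, 10.0)

// Three passes over the same in-memory dataset: each pass pulls
// every point halfway toward the current mean. With MapReduce,
// each pass would be a full job with disk I/O in between.
val result = (1 to 3).foldLeft(data) { (d, _) =>
  val mean = d.sum / d.size
  d.map(x => (x + mean) / 2)
}
```

Iterative algorithms like this (common in machine learning) are exactly where keeping the working set in memory pays off.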
Slide 8
The ecosystem components around the Spark Core Engine:

Shark (SQL) – Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming) – Enables analytical and interactive apps for live streaming data.
MLlib (Machine Learning) – Machine learning library built on top of Spark, with support for many machine learning algorithms at speeds up to 100 times faster than MapReduce.
GraphX (Graph Computation) – Graph computation engine (similar to Giraph).
SparkR (R on Spark) – Package for the R language that enables R users to leverage Spark power from the R shell (alpha/pre-alpha).
BlinkDB (Approximate SQL) – An approximate query engine that runs over the Spark Core Engine (alpha/pre-alpha).
Spark Ecosystem
Slide 9
Spark Features
Spark takes MapReduce to the next level with less expensive shuffles during data processing, along with capabilities like in-memory data storage
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing
It is designed to be an execution engine that works both in memory and on disk
Lazy evaluation of big data queries helps with the optimization of the overall data processing workflow
Provides concise and consistent APIs in Scala, Java, and Python
Offers an interactive shell for Scala and Python (not yet available for Java)
Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
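Lazy evaluation can be sketched with a Scala collection view as an analogy (illustrative only, not Spark itself): like Spark transformations, the view operations below only build a pipeline, and no work runs until a terminal, action-like step forces it.

```scala
// Sketch of lazy evaluation using a Scala view as an analogy for
// Spark's lazy transformations.
var evaluations = 0
val pipeline = (1 to 10).view
  .map { x => evaluations += 1; x * 2 }  // not executed yet
  .filter(_ > 10)                        // still just a plan

val evalsBeforeAction = evaluations      // nothing has run so far
val results = pipeline.toList            // forcing the pipeline runs it
```

Because the whole pipeline is known before anything executes, the engine can optimize the overall workflow — the point the slide is making.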
Slide 10
Spark Architecture

Spark Streaming | Spark SQL | BlinkDB | MLlib | GraphX | SparkR
Spark Core
Cluster management (native Spark cluster, YARN, Mesos)
Distributed storage (HDFS, Cassandra, S3, HBase)
Slide 11
Spark Advantages
EASE OF DEVELOPMENT – Easier APIs: Python, Scala, Java
IN-MEMORY PERFORMANCE – RDDs, DAGs unify processing
COMBINE WORKFLOWS – Shark, MLlib, Streaming, GraphX
Slide 12
UNLIMITED SCALE – Multiple data sources, multiple applications, multiple users
ENTERPRISE PLATFORM – Reliability, multi-tenancy, security
WIDE RANGE OF APPLICATIONS – Files, databases, semi-structured data
Hadoop Advantages
Slide 13
Spark + Hadoop
UNLIMITED SCALE
WIDE RANGE OF APPLICATIONS
ENTERPRISE PLATFORM
EASE OF DEVELOPMENT
COMBINE WORKFLOWS
IN-MEMORY PERFORMANCE
Operational Applications Augmented by In-Memory Performance
Slide 14
Spark in Industry
Slide 15
Resilient Distributed Datasets – A Quick Look
RDD (Resilient Distributed Dataset)

Resilient – if data in memory is lost, it can be recreated
Distributed – stored in memory across the cluster
Dataset – initial data can come from a file or be created programmatically

RDDs are the fundamental unit of data in Spark
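The two ways a dataset can originate can be sketched with plain Scala standing in for RDD creation (the file name and contents below are illustrative):

```scala
import java.nio.file.Files
import scala.jdk.CollectionConverters._

// A dataset created programmatically (in Spark: sc.parallelize)...
val fromCode = (1 to 5).toList

// ...or loaded from a file (in Spark: sc.textFile); a temp file
// stands in for real input here.
val path = Files.createTempFile("demo", ".txt")
Files.write(path, List("alpha", "beta").asJava)
val fromFile = Files.readAllLines(path).asScala.toList
```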
Slide 16
Resilient Distributed Datasets
Core concept of the Spark framework.
RDDs can store any type of data.
Primitive types: Integer, Character, Boolean, etc. Files: text files, SequenceFiles, etc.
RDDs are fault tolerant.
RDDs are immutable.
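Immutability can be sketched with a plain Scala list as an analogy (illustrative only): as with RDDs, a transformation returns a new collection and never mutates its input.

```scala
// Immutability sketch: the "transformation" produces a new
// collection, leaving the original unchanged.
val original = List(1, 2, 3)
val doubled  = original.map(_ * 2)  // new collection; `original` untouched
```

This is what makes lost partitions easy to recreate: since inputs are never modified, re-running the transformation always yields the same result.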
Slide 17
RDDs support two types of operations:

Transformations: a transformation does not return a single value; it returns a new RDD.
Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

Actions: an action evaluates the RDD and returns a value to the driver program.
Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
Resilient Distributed Datasets
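A plain-Scala analogue of the operations named above (illustrative only — Scala 2.13 collections standing in for RDDs): transformation-like steps derive new collections, while action-like steps collapse the data to a final value.

```scala
// "Transformations" and "actions" sketched on Scala collections.
val words = List("spark", "rdd", "spark", "action")

// Transformations: map, plus a reduceByKey-style grouping
val pairs  = words.map(w => (w, 1))
val counts = pairs.groupMapReduce(_._1)(_._2)(_ + _)

// Actions: produce single values from the data
val total = counts.values.sum
val first = words.head
```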
Slide 18
Spark vs. MapReduce Performance – Demo
Slide 19
Course Topics
Module 1 » Introduction to Scala
Module 2 » Scala Essentials
Module 3 » Traits and OOPs in Scala
Module 4 » Functional Programming in Scala
Module 5 » Introduction to Big Data and Spark
Module 6 » Spark Baby Steps
Module 7 » Playing with RDDs
Module 8 » Spark with SQL – When Spark Meets Hive
Slide 20
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate
Course Features
Slide 21
Questions
Slide 22