Apache Spark Introduction


Spark Conf Taiwan 2016

Apache Spark
Rich Lee
2016/9/21

Agenda

Spark overview

Spark core

RDD

Spark Application Development

Spark Shell

Zeppelin

Application

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system.

Key Features: Fast, Ease of Use, General-purpose, Scalable, Fault tolerant

Logistic regression in Hadoop and Spark

Spark Overview: Cluster Modes

Local

Standalone

Hadoop YARN

Apache Mesos
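Each of these modes maps to a different master URL passed to Spark. A brief sketch of the Spark 1.x values (host names and ports are placeholders):

import org.apache.spark.SparkConf

new SparkConf().setMaster("local[4]")            // Local: run in-process with 4 worker threads
new SparkConf().setMaster("spark://host:7077")   // Standalone cluster manager
new SparkConf().setMaster("yarn-client")         // Hadoop YARN (Spark 1.x syntax)
new SparkConf().setMaster("mesos://host:5050")   // Apache Mesos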


Spark Overview: Spark High-Level Architecture

Driver Program

Cluster Manager

Worker Node

Executor

Task
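A rough sketch of how these pieces cooperate, seen from the driver side (using the shell's sc; the partition count and numbers are arbitrary):

// The driver program owns the SparkContext; each action is split into tasks,
// which the cluster manager schedules onto executors running on worker nodes.
val data = sc.parallelize(1 to 1000, 8)   // 8 partitions => 8 tasks per stage
val total = data.map(_ * 2).sum()         // executors compute in parallel; the result returns to the driver
println(total)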

Spark Overview: Install and Startup

Download

http://spark.apache.org/downloads.html

Start Master and Worker

./sbin/start-all.sh

http://localhost:8080

Start History server

./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory

http://localhost:18080

Start Spark-Shell

./bin/spark-shell --master "spark://RichdeMacBook-Pro.local:7077"

./bin/spark-shell --master local[4]
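The history server only shows applications that wrote event logs; a minimal conf/spark-defaults.conf sketch, assuming the same HDFS directory as above:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://localhost:9000/spark/directory
spark.history.fs.logDirectory    hdfs://localhost:9000/spark/directory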

RDD: Resilient Distributed Dataset

RDD represents a collection of partitioned data elements that can be operated on in parallel. It is the primary data abstraction mechanism in Spark.

Partitioned, Fault Tolerant, Interface, In Memory
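The partitioned, in-memory nature can be seen directly in the shell; a small sketch (the partition count is an assumption):

val rdd = sc.parallelize(1 to 10000, 4)   // explicitly split into 4 partitions
rdd.getNumPartitions                      // 4: each partition can be processed in parallel
rdd.cache()                               // keep the partitions in executor memory once computed
rdd.map(_ * 2).count()                    // Spark runs one task per partition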

RDD: Create RDD

parallelize

val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)

textFile

val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")

HDFS - "hdfs://"

Amazon S3 - "s3n://"

Cassandra, HBase
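As a hedged illustration of those URL prefixes (the paths and the S3 bucket are placeholders; S3 also needs Hadoop S3 credentials configured, and Cassandra/HBase need their own connectors):

val hdfsLines = sc.textFile("hdfs://localhost:9000/input/README.md")   // HDFS, reusing the namenode above
val s3Lines = sc.textFile("s3n://my-bucket/README.md")                 // Amazon S3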


RDD: Transformation

Creates a new RDD by performing a computation on the source RDD

map

val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length }

flatMap

val words = lines flatMap { l => l.split(" ") }

filter

val longLines = lines filter { l => l.length > 80 }
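Transformations are lazy and can be chained; nothing is computed until an action runs. A short sketch combining the three above (using the same README path as earlier):

val lines = sc.textFile("/input/README.md")
val longWords = lines
  .flatMap(l => l.split(" "))    // transformation: split lines into words
  .filter(w => w.length > 5)     // transformation: keep longer words
  .map(w => w.toUpperCase)       // transformation: still nothing executed
longWords.count()                // action: triggers the actual computation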

RDD: Action

Returns a value to the driver program

first

val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first

max

numbersRdd.max

reduce

val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
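A few more common actions on the same numbersRdd, as a quick sketch:

numbersRdd.count()     // 4 - number of elements
numbersRdd.take(2)     // Array(10, 5) - first two elements
numbersRdd.collect()   // Array(10, 5, 3, 1) - pulls all data back to the driver; use with care on large RDDs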

RDD: Filter log example:

val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR") }
val warningLogs = logs filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count

(Lineage: log RDD → error RDD → count; log RDD → warn RDD → count)

RDD: Caching

Stores an RDD in memory or on disk. When an application caches an RDD in memory, Spark stores it in the executor memory on each worker node. Each executor stores in memory the RDD partitions that it computes.

cache

persist: MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
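persist takes one of these storage levels explicitly; a small sketch (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val cachedLogs = sc.textFile("path/to/log-files")
cachedLogs.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk when they do not fit in memory
cachedLogs.count()                                 // first action materializes and stores the partitions
cachedLogs.filter(_.contains("ERROR")).count()     // later actions reuse the persisted partitions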

RDD: Cache example:

val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN") }
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR") }
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count

Spark Application Development

Spark-Shell

Zeppelin

Application (Java/Scala): spark-submit

Spark Application Development: WordCount

val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

wcData.collect().foreach(println)
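The same job can also be packaged as a standalone application and run with spark-submit; a hedged skeleton (the object, jar, and output path names are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")   // master is supplied by spark-submit
    val sc = new SparkContext(conf)

    val wcData = sc.textFile("/input/README.md")
      .flatMap(line => line.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    wcData.saveAsTextFile("/output/wordcount")   // write results instead of collecting to the driver
    sc.stop()
  }
}

Packaged into a jar, it could then be submitted with something like ./bin/spark-submit --class WordCount --master spark://RichdeMacBook-Pro.local:7077 wordcount.jar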

A quick plug: Taiwan Hadoop User Group

https://www.facebook.com/groups/hadoop.tw/

Taiwan Spark User Group

https://www.facebook.com/groups/spark.tw/