
Apache Spark under the hood


Apache Spark under the hood
Big Data and Cloud Computing (CC4053)

Eduardo R. B. Marques, DCC/FCUP

1 / 40

Introduction

2 / 40

Apache Spark

Apache Spark has the stated goal of providing a “unified platform” for big data applications (see “Apache Spark: A Unified Engine for Big Data Processing” by M. Zaharia et al.).

General graph execution model that is able to perform in-memory processing and optimize data flow, based on Resilient Distributed Datasets (RDDs), and higher-level DataFrame/SQL APIs.

Supports batch processing but also stream processing. Specialized Spark libraries exist for machine learning or graph analytics.

Language bindings for Scala, Java, Python and R.

Flexible deployment (standalone, YARN/HDFS, …) and interoperability with heterogeneous data sources (e.g., Google Cloud Storage, SQL databases, …) and formats (e.g., CSV, JSON, multiple binary formats).

3 / 40

Spark applications

A benefit of using Spark (or MapReduce) is that programmers do not need to deal with aspects such as:

how parallel execution and network communication operate during the execution of an application;
the allocation of computing resources necessary to run applications.

Spark handles these aspects automatically, as you have experienced in practice. Let us now uncover the details of Spark in terms of application architecture and execution.

4 / 40

Image credits

Some images in these slides are taken from Bill Chambers and Matei Zaharia’s book in compliance with O’Reilly’s Safari learning platform membership agreement.

5 / 40

Architecture

6 / 40

Spark architecture

A Spark application is composed of a driver process (also called driver program) and a set of executor processes. The driver is responsible for maintaining application state, running user code and handling user input, and scheduling and distributing work to executors. Executors execute code assigned by the driver and report the state and final results of the computation back to the driver.

7 / 40

The Spark session

As part of a driver process, an application must first create a Spark session. The Spark session provides user code with the primary interface for Spark functionality.

When you run your program using an interactive Spark shell (spark-shell, pyspark, sparkR) or custom notebook environments, the Spark session is conveniently created at startup and made accessible through the spark variable (and the sc “Spark context” variable). In Google Colab notebooks we must initialize these explicitly, as we have seen.

8 / 40

PySpark and SparkR

Spark is primarily written in Scala. Scala is interoperable with Java, so you can write Spark applications using separate APIs for each language. The driver program in this case will be hosted by a Java Virtual Machine (JVM), i.e., the execution engine for compiled Java bytecode.

Spark also has APIs for Python (PySpark) and R (SparkR). In this case the Python or R Spark program will spawn a JVM that hosts the actual Spark session. This happens transparently to the Python/R program(mer).

9 / 40

Spark session creation

For non-interactive program execution (outside “Spark shells”) the Spark session must be created explicitly by user code.

Here’s an example in Python for Spark in “stand-alone” mode (using the local machine):

if __name__ == "__main__":
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    spark = SparkSession\
        .builder\
        .appName("My beautiful app")\
        .master("local[*]")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

10 / 40

Spark execution

11 / 40

Execution of a Spark application

Let us present the core concepts for the execution of a Spark application.

During execution, the driver process launches jobs. A job is defined whenever the application triggers an action (e.g. RDD.collect()).

Job = sequence of stages Stage_1 → … → Stage_n. Stages for the same job do not run in parallel.

Stage = group of tasks that may execute together to compute the same operation on multiple RDD partitions / machines. Stages are separated by shuffle operations that repartition data.

Task = computation that runs in a single executor, operating transformations over blocks of data. Several instances of the same task may run in parallel (using distinct executors).

12 / 40

Logical plans and physical plans

For each job (application action), Spark assembles:

Logical execution plans, structured in terms of (RDD, data frame, …) transformations, which are independent of the cluster’s characteristics.

Physical execution plans, compiled from logical execution plans, define the actual job stages and their component tasks, and may account for the cluster characteristics.

Logical and physical execution plans take form as directed acyclic graphs (DAGs), where nodes define tasks or reshuffle operations and edges reflect execution precedence.
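As a minimal sketch of how these plans can be inspected from user code (the DataFrame and RDD names below are placeholders, not part of the slides): explain() prints a DataFrame’s logical and physical plans, while toDebugString() shows an RDD’s lineage.

## Minimal sketch, assuming an existing DataFrame `df` and RDD `rdd`.
df.explain(True)   ## parsed/analyzed/optimized logical plans and physical plan

lineage = rdd.toDebugString()   ## RDD lineage; shuffle boundaries appear as indentation shifts
print(lineage.decode() if isinstance(lineage, bytes) else lineage)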

13 / 40

Example 1

Let us consider a variant of the word count computation:

## "Word count" computation.rdd = sc.textFile(input_file)\

.flatMap(lambda line: [(word,1) \for word in line.split()])\

.reduceByKey(lambda x,y: x + y)## Action that triggers the execution of jobresults = rdd.collect()

14 / 40

Spark UI for an application

For every running application we can check its Spark UI, where we can get detailed information regarding:

Jobs: jobs executed by the application;
Stages: we can inspect stage details, including RDD-level DAGs for component tasks;
Storage: storage associated to the application;
Environment: information on environment configuration;
Executors: executors associated to the application;
SQL: high-level execution plans/DAGs that are defined for Spark SQL (data frame operations).

15 / 40

Example 1 in the Spark UI

The call to collect() triggers the execution of a new job, involving two stages and 8 tasks, as shown in the Spark UI:

In the DAG, we see that there is a data reshuffle at the end of stage 1, before stage 2 can start. In this case, each stage has 4 tasks corresponding to 4 RDD partitions.
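As a hedged sketch (reusing sc and input_file from Example 1): the number of tasks in the first stage corresponds to the number of input partitions, which we can inspect and influence.

## Minimal sketch, assuming the same sc and input_file as in Example 1.
## The partition count of the input RDD determines the tasks of stage 1.
print(sc.textFile(input_file).getNumPartitions())

## Request at least 4 partitions when reading (the actual number may be
## higher, depending on the input splits):
rdd4 = sc.textFile(input_file, minPartitions=4)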

16 / 40

Narrow vs. wide transformations

Narrow transformations like flatMap (or map, filter) map one input partition to one output partition.

Wide transformations, like reduceByKey, also called shuffles, may read from several input partitions and contribute to many output partitions. They require data to be reshuffled across executors.
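A minimal sketch of the contrast, assuming an available SparkContext sc (the data below is made up for illustration):

## Made-up pair RDD spread over 4 partitions.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)

## Narrow: each output partition depends on a single input partition,
## so no shuffle is needed.
doubled = pairs.mapValues(lambda v: 2 * v)

## Wide: values for the same key may live in different partitions,
## so data is reshuffled; the number of output partitions can be set.
totals = pairs.reduceByKey(lambda x, y: x + y, numPartitions=2)
print(totals.collect())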

17 / 40

Narrow transformations and stage pipelining

We could add other narrow transformations in sequence to flatMap that would fit in the same stage.

In the same stage, Spark will pipeline the transformations, allowing for in-memory data processing with no intermediate disk writes (up to the amount of available memory per executor).
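For instance, a minimal sketch (reusing sc and input_file from Example 1): inserting a filter after flatMap adds another narrow transformation that Spark pipelines within the same stage, so the job still has exactly one shuffle boundary.

## Minimal sketch, assuming the same sc and input_file as in Example 1.
rdd = sc.textFile(input_file)\
    .flatMap(lambda line: [(word, 1) for word in line.split()])\
    .filter(lambda pair: len(pair[0]) > 3)\
    .reduceByKey(lambda x, y: x + y)

## Still two stages: reduceByKey introduces the only shuffle boundary.
results = rdd.collect()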

18 / 40

Wide transformations and reshuffle persistence

Wide transformations require reshuffling in the network, hence two wide transformations cannot be part of the same stage and be pipelined. Spark will however try to optimize performance through shuffle persistence:

The source stage of a shuffle operation writes shuffle files to local disks (at the level of each executor).
In the sink stage, that performs the grouping and reduction, data is fetched from these shuffle files in groups of keys.

The point is to avoid recomputation of the source stage if there is a failure during the sink stage, and to be able to schedule the sink stage flexibly.
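As a hedged illustration (reusing the word-count RDD from Example 1): running a second action over the same RDD lets Spark reuse the shuffle files written by the first job, so the pre-shuffle stage typically appears as “skipped” in the Spark UI rather than being recomputed.

## Minimal sketch, assuming the word-count rdd from Example 1.
first = rdd.collect()   ## job 1: runs both stages and writes shuffle files
second = rdd.count()    ## job 2: the source stage is usually skipped,
                        ## thanks to the persisted shuffle files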

Wide transformations cannot be pipelined.

19 / 40

Example 2

We now consider a data frame (Spark SQL) execution. The following is a simple (self-explanatory) example over a MovieLens data set:

movies = ...  ## read file

## Transformations form a Spark SQL query
StarWarMovies = movies\
    .filter(movies.title.contains('Star Wars'))\
    .orderBy(movies.title)

## Action
results = StarWarMovies.collect()
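The read itself is elided in the slide; as an assumption-laden sketch, one way to load a MovieLens movies file (the file name and options are not from the slides) would be:

## Hypothetical read of the MovieLens movies file (assumed CSV with header).
movies = spark.read.csv('movies.csv', header=True, inferSchema=True)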

Spark SQL queries are executed as a set of jobs that form a graph.

20 / 40

Example 2 - Spark UI

21 / 40

Example 3

Let us inspect the execution in the Spark UI online only, as the DAG for this example is relatively complex (the image is available here).

hitchCockMovies = spark.sql('''
    SELECT title, count(rating) as num_ratings, avg(rating) as avg_rating
    FROM tags JOIN movies USING(movieId)
              JOIN ratings USING(movieId)
    WHERE tag = 'Alfred Hitchcock'
    GROUP BY title
    ORDER BY avg_rating
''')
hitchCockMovies.show()
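The query refers to movies, ratings and tags as SQL tables, so these DataFrames must first be registered as temporary views; a minimal sketch, under the assumption that the data comes from the usual MovieLens CSV files:

## Hypothetical setup: register each MovieLens DataFrame as a temp view
## so that spark.sql can refer to it by name.
for name in ('movies', 'ratings', 'tags'):
    df = spark.read.csv(name + '.csv', header=True, inferSchema=True)
    df.createOrReplaceTempView(name)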

22 / 40

Spark deployment modes

23 / 40

Application execution in local mode

In local mode (also called standalone mode) the entire application runs on a single machine.

Parallelism is only made possible by multithreading over the machine’s CPU cores (and for some operations also GPUs).

Local mode is typically employed during application development for small/“not so big” datasets, but not in production or for “really big” datasets.

24 / 40

Spark in cluster mode

A Spark cluster is composed of master and worker nodes. A master mediates access to workers in the cluster. Workers host driver or executor processes of Spark applications.

Each master/worker is normally tied to a distinct machine in a computer cluster, though other settings are also possible: e.g., a single machine hosting a master and a worker simultaneously, or multiple workers.
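As a hedged sketch of how an application attaches to such a cluster (the master host below is an assumption; 7077 is the default port of the Spark standalone master): the session builder simply points at the master URL instead of local[*].

from pyspark.sql import SparkSession

## Hypothetical master URL for a Spark standalone cluster.
spark = SparkSession\
    .builder\
    .appName("My beautiful app")\
    .master("spark://cluster-master:7077")\
    .getOrCreate()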

25 / 40

Cluster master - Spark UI

26 / 40

Cluster worker - Spark UI

27 / 40

Application - Spark UI

28 / 40

Application execution in cluster mode

In cluster mode, a Spark application’s driver and its executors all run inside the cluster, in association with workers.

29 / 40

Application execution in cluster mode (cont.)

Thus, in cluster mode, all driver/executor interaction for an application takes place within the cluster.

30 / 40

Application execution in client mode

In client mode, the application driver runs on a machine outside the cluster.

The driver’s machine is often called a “gateway machine” or “edge node”.

31 / 40

Application execution in client mode (cont.)

Client mode may be more convenient/flexible in a number of situations: if the driver program (but not the executors) requires resources not accessible within the cluster; for security reasons (user code may not be trustworthy); …

32 / 40

HDFS (Hadoop Distributed File System)

33 / 40

HDFS design goals

HDFS is commonly used in computer clusters as storage for Apache Spark, Hadoop MapReduce, and other frameworks in the Hadoop ecosystem.

HDFS, originally inspired by the Google File System (GFS), is a file system designed to store very large files across distributed machines in a large cluster with streaming data access patterns:

very large files → hundreds of megabytes, gigabytes, or terabytes in size;
distributed → data is distributed across several machines to allow for fault tolerance and parallel processing;
large clusters → clusters can be formed by thousands of machines using commodity hardware (with non-negligible failure rate);
streaming data access pattern → files are typically written once or in append mode, and read many times subsequently.
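From Spark’s perspective, reading from HDFS uses the same APIs seen earlier, just with an hdfs:// URI; a minimal sketch, where the namenode host, port and file path are assumptions:

## Hypothetical HDFS path; sc is the SparkContext created earlier.
lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
print(lines.count())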

34 / 40

HDFS is not good for …

HDFS is not a good fit for applications that require:

Multiple writers and/or random file access operations → HDFS does not support these features.
Low-latency operation in the millisecond scale → HDFS is oriented towards high throughput.
Lots of small files → HDFS is not designed with small files in mind, and a lot of small files in HDFS clusters in fact hurt performance.

35 / 40

HDFS architecture

HDFS clusters are composed of namenodes and datanodes. Namenodes manage the file system and its meta-data, and datanodes provide actual storage. Clients access namenodes to get information about HDFS files, and datanodes to read and write data.

(Image from: HDFS Architecture, Apache Hadoop documentation)

36 / 40

HDFS files

An HDFS file provides the abstraction of a single file, but is in fact divided into blocks, each with an equal size of typically 64 or 128 MB. A block is the elementary unit for read/write operations by client applications, and each block is stored and replicated independently on different machines. The host file system in datanodes stores blocks as regular files.

(Image from: HDFS Architecture, Apache Hadoop documentation)

37 / 40

HDFS files (cont.)

Splitting a file into several replicated blocks has several advantages:

Support for “really big files”: an HDFS file can be larger than any single disk in the network;
Fault tolerance / high availability: if a block becomes unavailable from a datanode, a replica can be read from another datanode in a way that is transparent to the client;
Data locality in integration with MapReduce: Hadoop MapReduce takes advantage of the block separation to move computation near to where data is located, rather than the other way round, which would be far more costly in terms of network communication/operation time. Moving computation is cheaper than moving data.

38 / 40

HDFS file reads

(Image from: Hadoop, The Definitive Guide, 4th ed.)

39 / 40

HDFS file writes

A pipeline is formed between datanodes to replicate blocks in response to a block write operation by the client, according to the desired replication level (3 in the example above).

(Image from: Hadoop, The Definitive Guide, 4th ed.)

40 / 40

