
Apache spark Intro

Date posted: 24-Jan-2017
Upload: tudor-lapusan
Apache Spark Intro workshop BigData Romania
Transcript
Page 1: Apache spark Intro

Apache Spark Intro workshop

BigData Romania

Page 2: Apache spark Intro

Apache Spark Intro

★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session

Page 4: Apache spark Intro

Where to learn Spark?

http://spark.apache.org/

http://shop.oreilly.com/product/0636920028512.do

Page 5: Apache spark Intro

Spark architecture

Page 6: Apache spark Intro

Easy ways to run Spark

★ Your IDE (e.g. Eclipse or IDEA)
★ Standalone Deploy Mode: simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)

Page 7: Apache spark Intro

Supported languages

Page 8: Apache spark Intro

Spark basics

★ RDD
★ Operations: Transformations and Actions

Page 9: Apache spark Intro

RDD

An RDD is simply an immutable distributed collection of objects!

[Figure: the objects a through q of one collection, shown split into groups]

Page 10: Apache spark Intro

Creating RDD (I)

Python
lines = sc.parallelize(["workshop", "spark"])

Scala
val lines = sc.parallelize(List("workshop", "spark"))

Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));

Page 11: Apache spark Intro

Creating RDD (II)

Python
lines = sc.textFile("/path/to/file.txt")

Scala
val lines = sc.textFile("/path/to/file.txt")

Java
JavaRDD<String> lines = sc.textFile("/path/to/file.txt");

Page 12: Apache spark Intro

RDD persistence

MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP

Page 13: Apache spark Intro

Other data structures in Spark

★ Paired RDD
★ DataFrame
★ Dataset

Page 14: Apache spark Intro

Paired RDD

Paired RDD = an RDD of key/value pairs

user1 user2 user3 user4 user5

id1/user1 id2/user2 id3/user3 id4/user4 id5/user5
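A minimal sketch of the key/value idea above, written in plain Python so it runs without a Spark installation. The helper `to_pair` and the rule for deriving the key from the user name are assumptions made for illustration. With a real SparkContext `sc`, the same function could be passed to `sc.parallelize(users).map(to_pair)` to obtain an RDD of pairs.

```python
# Plain-Python stand-in for building key/value pairs; not Spark code.
users = ["user1", "user2", "user3", "user4", "user5"]


def to_pair(user):
    """Turn 'userN' into the pair ('idN', 'userN') (hypothetical key rule)."""
    suffix = user[len("user"):]
    return ("id" + suffix, user)


# Each element becomes a (key, value) tuple, as in the slide's second row.
pairs = list(map(to_pair, users))

# Once data is keyed, values can be looked up or grouped by key.
lookup = dict(pairs)
```

Each element of `pairs` is a two-element tuple, which is the shape PySpark expects for key-based operations such as `join` or `reduceByKey`.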

Page 15: Apache spark Intro

Spark operations

[Diagram: RDD 1 through RDD 6 connected by operations, each marked as either a Transformation or an Action]

Page 16: Apache spark Intro

Transformations

Transformations describe how to transform an RDD into another RDD.

[Diagram: RDD 1 -> RDD 2]

Page 17: Apache spark Intro

Transformations

Input RDD {1,2,3,4,5,6}
map x => x + 1 -> Map RDD {2,3,4,5,6,7}
filter x => x != 4 -> Filter RDD {1,2,3,5,6}

Page 19: Apache spark Intro

Actions

Actions compute a result from an RDD!

Page 20: Apache spark Intro

Actions

Input RDD {1,2,3,4,5,6}
map x => x + 1 -> Map RDD {2,3,4,5,6,7}
filter x => x != 4 -> Filter RDD {1,2,3,5,6}
Actions: count() = 6, take(2) = {1,2}, saveAsTextFile()
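The numbers on this slide can be reproduced with a plain-Python sketch that needs no Spark installation. Python's built-in `map` and `filter` stand in for the RDD transformations, and `len` and slicing stand in for the `count()` and `take(2)` actions. It assumes the actions are applied to the input collection, which the slide does not state. The function names are illustrative.

```python
# Plain-Python stand-in for the slide's example; not Spark code.
input_data = [1, 2, 3, 4, 5, 6]


def add_one(x):
    """The slide's map function: x => x + 1."""
    return x + 1


def is_not_four(x):
    """The slide's filter predicate: x => x != 4."""
    return x != 4


# Transformations: each one derives a new collection from the input.
mapped = list(map(add_one, input_data))
filtered = list(filter(is_not_four, input_data))

# Actions: each one computes a concrete result.
count = len(input_data)      # stand-in for count()
first_two = input_data[:2]   # stand-in for take(2)

# With PySpark and a SparkContext `sc`, the equivalent calls would be
# sc.parallelize(input_data).map(add_one), .filter(is_not_four),
# .count() and .take(2).
```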

Page 22: Apache spark Intro

Transformations and Actions

[Diagram: users -> filter -> administrators -> take(3)]

Page 23: Apache spark Intro

Transformations and Actions

[Diagram: users -> filter() -> administrators, followed by the actions take(3) and saveAsTextFile()]

Page 24: Apache spark Intro

Transformations and Actions

[Diagram: users -> filter() -> administrators, with persist() and the actions take(3) and saveAsTextFile()]
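Why persist() matters when two actions share one filtered RDD can be modeled in plain Python. The class `RecomputingDataset` below is a hypothetical stand-in, not Spark's implementation: it reruns the filter for every action unless `persist()` was called, in which case the first result is kept and reused. `count()` is used as the second action instead of `saveAsTextFile()` so the sketch writes no files.

```python
class RecomputingDataset:
    """Toy model of a filtered dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, predicate):
        self._source = list(source)
        self._predicate = predicate
        self._persist = False
        self._cached = None
        self.computations = 0  # how many times the filter was run

    def persist(self):
        # Only marks the dataset; the result is stored on the first action.
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        result = [x for x in self._source if self._predicate(x)]
        if self._persist:
            self._cached = result
        return result

    def take(self, n):
        return self._compute()[:n]

    def count(self):
        return len(self._compute())


def is_administrator(user):
    return user[1] == "administrator"


# Made-up (user_id, occupation) rows for illustration only.
users = [(1, "administrator"), (2, "writer"), (3, "administrator")]

plain = RecomputingDataset(users, is_administrator)
plain.take(3)
plain.count()

persisted = RecomputingDataset(users, is_administrator).persist()
persisted.take(3)
persisted.count()
```

In this model `plain.computations` ends at 2 and `persisted.computations` at 1, mirroring how a persisted RDD avoids recomputing its lineage for each action.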

Page 25: Apache spark Intro

Lazy evaluation

[Diagram: users -> filter -> administrators -> take(3)]
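The idea on this slide is that the filter does no work until an action such as take(3) asks for results. Python's built-in `filter` is lazy in a similar way, so it can serve as a small analogy. This is plain Python, not Spark, and the user rows are made up for illustration.

```python
from itertools import islice

calls = []  # records every user the predicate is asked about


def is_administrator(user):
    calls.append(user)
    return user[1] == "administrator"


# Made-up (user_id, occupation) rows for illustration only.
users = [
    (1, "technician"),
    (2, "administrator"),
    (3, "writer"),
    (4, "administrator"),
    (5, "administrator"),
    (6, "administrator"),
    (7, "student"),
]

# "Transformation": only describes the work; the predicate has not run yet.
administrators = filter(is_administrator, users)
calls_before_action = len(calls)

# "Action": requesting three results forces evaluation, and only as far
# as needed to produce them.
first_three = list(islice(administrators, 3))
calls_after_action = len(calls)
```

Before the action the predicate has been called zero times. After it, only the first five users have been examined, because the third administrator is found at user 5.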

Page 26: Apache spark Intro

How Spark Executes Your Program

Page 27: Apache spark Intro

Hands-on session

Page 28: Apache spark Intro

MovieLens

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

Download link : http://grouplens.org/datasets/movielens/

Page 29: Apache spark Intro

MovieLens dataset

user: user_id, age, gender, occupation, zipcode

user_rating: user_id, movie_id, rating, timestamp

movie: movie_id, title, release_date, video_release, imdb_url, genres...

Page 30: Apache spark Intro

Exercises already solved!

★ Return only the users with occupation 'administrator'
★ Increase the age of each user by one
★ Join user and rating datasets by user id
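The three solved exercises can be sketched with small pure functions, shown here on a local list so the example runs without Spark. This is not the workshop's own solution. It assumes user lines are separated by `|` and rating lines by tabs, and the sample rows are invented rather than taken from the dataset. With a SparkContext `sc`, `parse_user`, `is_administrator` and `one_year_older` could be passed to `map` and `filter` on an RDD created by `sc.textFile(...)`, and the join could be done with `join` on RDDs keyed by user id.

```python
from collections import namedtuple

User = namedtuple("User", "user_id age gender occupation zipcode")
Rating = namedtuple("Rating", "user_id movie_id rating timestamp")


def parse_user(line, sep="|"):
    user_id, age, gender, occupation, zipcode = line.split(sep)
    return User(int(user_id), int(age), gender, occupation, zipcode)


def parse_rating(line, sep="\t"):
    user_id, movie_id, rating, timestamp = line.split(sep)
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


def is_administrator(user):
    return user.occupation == "administrator"


def one_year_older(user):
    # namedtuples are immutable, so a new record is returned.
    return user._replace(age=user.age + 1)


def join_by_user_id(users, ratings):
    """Inner join; results have the shape (user_id, (user, rating))."""
    users_by_id = {u.user_id: u for u in users}
    return [(r.user_id, (users_by_id[r.user_id], r))
            for r in ratings if r.user_id in users_by_id]


# Invented sample rows, not real MovieLens data.
user_lines = ["1|30|F|administrator|00000",
              "2|41|M|engineer|11111",
              "3|25|M|administrator|22222"]
rating_lines = ["1\t10\t4\t0", "3\t10\t5\t0", "3\t20\t2\t0"]

users = [parse_user(line) for line in user_lines]
ratings = [parse_rating(line) for line in rating_lines]

administrators = list(filter(is_administrator, users))
older_users = list(map(one_year_older, users))
joined = join_by_user_id(users, ratings)
```

The `(key, (left, right))` shape returned by `join_by_user_id` matches what a join on two pair RDDs produces in PySpark.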

Page 31: Apache spark Intro

Exercises to solve

★ How many men and women registered on MovieLens?
★ Age distribution of the male and female users registered on MovieLens
★ Which movie titles have rating x?
★ Average rating per movie
★ Sort users by their occupation

Page 32: Apache spark Intro

Congrats if you reached this slide!

