
A Year With Spark
Martin Goodson, VP Data Science

Skimlinks

Phase I of my big data experience

R files, Python files, Awk, Sed, a job scheduler (Sun Grid Engine), Make/bash scripts

Phase II of my big data experience

Pig + python user defined functions

Phase III of my big data experience

PySpark?

Skimlinks data

Automated affiliatization tech
140,000 publisher sites
Collect 30 TB/month of user behaviour (clicks, impressions, purchases)

Data science team

5 data scientists
Machine learning or statistical computing
Varying programming experience
Not engineers
No devops

Reality

Spark Can Be Unpredictable

Reality

Learning in depth how Spark works
Try to divide and conquer
Learning how to configure Spark properly

Learning in depth how spark works

Read all this:
https://spark.apache.org/docs/1.2.1/programming-guide.html
https://spark.apache.org/docs/1.2.1/configuration.html
https://spark.apache.org/docs/1.2.1/cluster-overview.html

And then:
https://www.youtube.com/watch?v=49Hr5xZyTEA (Spark internals)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py

Try to divide and conquer

Don't throw 30 TB of data at a Spark script and expect it to just work.

Divide the work into bite-sized chunks, aggregating and projecting as you go.

Try to divide and conquer

Use reduceByKey() not groupByKey()

Use max() and add() (cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)

Start with this:
(k1, 1) (k1, 1) (k1, 2) (k1, 1) (k1, 5) (k2, 1) (k2, 2)

Use RDD.reduceByKey(add) to get this:
(k1, 10) (k2, 3)
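As a minimal PySpark sketch of that example (the pairs and the add operator are from the slide; `sc` is assumed to be the SparkContext provided by the pyspark shell):

from operator import add

pairs = sc.parallelize([("k1", 1), ("k1", 1), ("k1", 2), ("k1", 1),
                        ("k1", 5), ("k2", 1), ("k2", 2)])

# Values are combined per key within each partition before the shuffle,
# so only one partial sum per key per partition crosses the network.
totals = pairs.reduceByKey(add)
print(totals.collect())   # [('k1', 10), ('k2', 3)]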

Key concept: reduceByKey (implemented with combineByKey)

[Diagram: pairs such as (k1, 1), (k1, 1), (k1, 2), (k1, 1), (k1, 5) are first combined locally within each partition (combineLocally), giving partial maps like {k1: 2, …}, {k1: 3, …}, {k1: 5, …}; the partial maps are then merged across partitions (_mergeCombiners) into the final result {k1: 10, …}. A second build of the slide highlights the reduceByKey(numPartitions) signature.]
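The same aggregation written with combineByKey makes the two stages from the diagram explicit (a sketch, reusing the `pairs` RDD from the previous example):

totals = pairs.combineByKey(
    lambda v: v,             # createCombiner: start a combiner for a new key in this partition
    lambda c, v: c + v,      # mergeValue: the "combineLocally" step within a partition
    lambda c1, c2: c1 + c2)  # mergeCombiners: the "_mergeCombiners" step across partitions
print(totals.collect())      # [('k1', 10), ('k2', 3)]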

PySpark Memory: worked example


10 x r3.4xlarge (122 GB RAM, 16 cores each)
Use half of each node for the executor: 60 GB
Number of cores = 120 (12 per executor)
Cache = 60% x 60 GB x 10 = 360 GB
Each Java thread: 40% x 60 GB / 12 ≈ 2 GB
Each Python process: ~4 GB
OS: ~12 GB
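A quick back-of-the-envelope check of these numbers (a sketch; the 60%/40% split is the cache-versus-JVM-working-memory split assumed on the slide, and the ~4 GB per Python worker is the slide's own estimate):

nodes = 10
node_ram_gb = 122
cores_per_executor = 12                   # 120 cores across 10 executors
executor_mem_gb = 60                      # roughly half of each node

cluster_cache_gb = 0.6 * executor_mem_gb * nodes                  # 360 GB of cache cluster-wide
jvm_per_thread_gb = 0.4 * executor_mem_gb / cores_per_executor    # ~2 GB per Java thread
python_workers_gb = 4 * cores_per_executor                        # ~48 GB if each Python worker reaches ~4 GB
leftover_gb = node_ram_gb - executor_mem_gb - python_workers_gb   # ~14 GB per node left for the OS etc.

print(cluster_cache_gb, jvm_per_thread_gb, python_workers_gb, leftover_gb)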

PySpark Memory: worked example

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g

PySpark Memory: worked example

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g

~/spark/bin/pyspark --driver-memory 60g

PySpark: other memory configuration

spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)

PySpark: other configuration

spark.shuffle.consolidateFiles=true
spark.rdd.compress=true
spark.speculation=true
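A minimal sketch of setting these options from PySpark (SparkConf and the --driver-memory flag are standard Spark mechanisms; the values are the ones from the preceding slides):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "60g")
        .set("spark.cores.max", "120")
        .set("spark.akka.frameSize", "1000")
        .set("spark.kryoserializer.buffer.max.mb", "10")
        .set("spark.shuffle.consolidateFiles", "true")
        .set("spark.rdd.compress", "true")
        .set("spark.speculation", "true"))

# spark.driver.memory must be set before the driver JVM starts, which is
# why the slide passes it on the command line instead:
#   ~/spark/bin/pyspark --driver-memory 60g
sc = SparkContext(conf=conf)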

Errors

java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc.

Errors

java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc.

All of the above are caused by memory errors!

Errors

'ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue': caused by filter() leaving very little data spread over many partitions - use coalesce()

collect() fails - increase driver memory + Akka frame size
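A sketch of the coalesce() pattern referred to above (the input path and the filter predicate are made-up placeholders):

events = sc.textFile("s3://bucket/events/*")               # placeholder path; many partitions
clicks = events.filter(lambda line: '"event": "click"' in line)

# After a heavy filter, thousands of nearly-empty partitions remain; shrinking
# the partition count keeps downstream stages (and the listener bus) manageable.
clicks = clicks.coalesce(100)                               # no shuffle by default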

Were our assumptions correct?

We have a very fast development process.
Use Spark for development and for scale-up.
Scalable data science development.

Large-scale machine learning with Spark and Python

Empowering the data scientist
by Maria Mestre

ML @ Skimlinks

● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox

Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Data collection: scraping lots of pages

This is how I would do it on my local machine…

● use of the Scrapy package
● write a function scrape() that creates a Scrapy object

urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)

Distributing over the cluster

def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)

urls = open('list_urls.txt', 'r').readlines()
urls = sc.parallelize(urls, 100)
urls.mapPartitionsWithIndex(lambda index, urls: distributed_scrape(urls, index, s3_bucket))
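Two details are needed to make this run as written (a sketch; scrape() is still the user-defined function from the previous slide): the function handed to mapPartitionsWithIndex must return an iterable, and the transformation is lazy, so an action has to trigger the scraping.

def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)          # user-defined scrape() from the previous slide
    yield output                  # yield something so Spark gets an iterable back

urls = sc.parallelize(open('list_urls.txt').readlines(), 100)
written = urls.mapPartitionsWithIndex(
    lambda index, part: distributed_scrape(part, index, s3_bucket))
written.count()                   # the action forces the lazy transformation to run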

Installing scrapy over the cluster

1/ need to use Python 2.7
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh

2/ use pssh to install packages on the slaves
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'

Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Example: filtering

● we want to find the activity of 30M users (2 GB) in 2 months of activity logs (6 TB)

○ map-side join using broadcast() ⇒ does not work with large objects!
■ e.g. input.filter(lambda x: x['user'] in user_list_b.value)

○ use of mapPartitions() (see the sketch below)
■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))
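A sketch of the mapPartitions() route (read_file_and_filter, the local path and the dict-shaped records are placeholders; the point is that the large user set is loaded once per partition rather than broadcast through the driver):

def read_file_and_filter(records):
    # Load the 30M-user list once per partition (placeholder path), then filter locally.
    users = set(line.strip() for line in open('/mnt/user_list.txt'))
    for record in records:
        if record['user'] in users:
            yield record

matches = input.mapPartitions(read_file_and_filter)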

[Pipeline figure: 6 TB, ~11B input → 35 mins → 113 GB, 529M matches → bloom filter join, 9 mins → 60 GB, 515M matches]

Example: segmenting urls

● we want to convert a URL like 'www.iloveshoes.com' to ['i', 'love', 'shoes']

● Segmentation (a distributed sketch follows below)
○ wordsegment package in Python ⇒ very slow!
○ 300M urls take 10 hours with 120 cores!
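A sketch of how the segmentation might be distributed (wordsegment is the package named on the slide; whether a load() call is needed depends on the package version, and the hostname cleanup shown is just an illustrative assumption):

from urlparse import urlparse           # Python 2.7, as used on this cluster

def segment_partition(urls):
    import wordsegment                  # import on the worker
    if hasattr(wordsegment, 'load'):    # newer versions need an explicit load()
        wordsegment.load()
    for url in urls:
        host = urlparse('http://' + url).netloc
        name = host.replace('www.', '').split('.')[0]   # 'www.iloveshoes.com' -> 'iloveshoes'
        yield (url, wordsegment.segment(name))

segments = sc.parallelize(['www.iloveshoes.com'], 1).mapPartitions(segment_partition)
print(segments.collect())               # [('www.iloveshoes.com', ['i', 'love', 'shoes'])]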

Example: getting a representative sample

Our solution in Spark!

sample = sc.parallelize([], 1)
sample_size = 1000
input.cache()

for category, proportion in stats.items():
    category_pages = input.filter(lambda x: x['category'] == category)
    category_sample = category_pages.takeSample(False, int(sample_size * proportion))
    sample = sample.union(sc.parallelize(category_sample))   # takeSample returns a list, not an RDD

MLlib offers a probabilistic alternative (the per-key sample sizes are not exact): sampleByKey(), which takes a map of per-key sampling fractions, e.g.

sample = input.sampleByKey(False, stats)   # assumes input is keyed by category

Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Grid search for hyperparameters
Problem: we have candidate values [1, 2, ..., 10000] for a hyperparameter; which one should we choose?

If the data is small enough that processing time is fine

➢ Do it in a single machine

If the data is too large to process on a single machine

➢ Use MLlib

If the data can be processed on a single machine but takes too long to train

➢ The next slide!

number of combinations = {parameters} = 2

Using cross-validation to optimise a hyperparameter

1. separate the data into k equally-sized chunks
2. for each candidate value:
   a. use (k-1) chunks to fit the classifier parameters
   b. use the remaining chunk to get a classification score
   c. report the average score over the k folds
3. at the end, select the candidate value that achieves the best average score

number of combinations = {parameters} x {folds} = 4
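The "next slide" case (data fits on one machine, but training many combinations is slow) can be sketched by shipping one task per (parameter, fold) combination to the cluster. Everything below is an illustrative assumption rather than the speakers' code: the SVC estimator, the candidate C values, the broadcast X/y arrays and the 5 folds are all placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import KFold       # scikit-learn API of that era

X_b, y_b = sc.broadcast(X), sc.broadcast(y)       # small enough to broadcast

params = [1, 10, 100, 1000]                       # candidate hyperparameter values
folds = list(KFold(len(y), n_folds=5))            # (train_idx, test_idx) pairs

def evaluate(combo):
    C, (train_idx, test_idx) = combo
    clf = SVC(C=C).fit(X_b.value[train_idx], y_b.value[train_idx])
    return (C, clf.score(X_b.value[test_idx], y_b.value[test_idx]))

# One task per (parameter, fold) combination
combos = sc.parallelize([(C, fold) for C in params for fold in folds])
avg_scores = combos.map(evaluate).groupByKey().mapValues(lambda s: float(np.mean(list(s))))
best_C = max(avg_scores.collect(), key=lambda kv: kv[1])[0]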

Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Applying the classifier to new_data: easy!

With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))

With scikit-learn but cannot broadcast:
save classifier models to files, ship to S3
use mapPartitions to read model parameters and classify

With MLlib:
(model._threshold = None)
new_labels = new_data.map(lambda x: model.predict(x))
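A sketch of the "cannot broadcast" route (the S3 location, the download_from_s3 helper and joblib persistence are illustrative assumptions, not part of the slides):

from sklearn.externals import joblib      # joblib persistence, common for scikit-learn models

def classify_partition(records):
    # Hypothetical helper: each worker fetches the pickled model once per partition.
    local_path = download_from_s3('s3://bucket/models/classifier.pkl')
    clf = joblib.load(local_path)
    for features in records:
        yield clf.predict([features])[0]

new_labels = new_data.mapPartitions(classify_partition)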

Thanks!

Apache Spark for Big Data

Spark at Scale & Performance Tuning

Sahan Bulathwela | Data Science Engineer @ Skimlinks |

Outline

● Spark at scale: Big Data Example

● Tuning and Performance

Spark at Scale: Big Data Example
● Yes, we use Spark!!
● Not just for prototypes or one-time analyses
● Run automated analyses at large scale on a daily basis
● Use-case: generating audience statistics for our customers

Before…

● We provide data products based on audience statistics to customers

● Extract event data from Datastore

● Generate Audience statistics and reports

Data
● Skimlinks records web data in terms of user events such as clicks, impressions, etc.
● Our data!!
○ Records 18M clicks (11 GB)
○ Records 203M impressions (950 GB)
○ These numbers are per day (Oct 01, 2014)
● About 1 TB of relevant events

A few days and data scientists later...

Statistics

Major pain points
● Most of the data is not relevant
○ Only 3-4 out of 30ish fields are useful for each report
● Many duplicate steps
○ Reading the data
○ Extracting relevant fields
○ Transformations such as classifying events

Solution


Aggregation doing its magic
● Mostly grouping events and summarizing

● Distribute the workload in time

● “Reduce by” instead of “Group by” (see the sketch below)

● BOTS
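To illustrate the “Reduce by” instead of “Group by” point in this pipeline, a sketch of a daily-profile-style aggregation (the events RDD, its dict fields and the (user, event type) key are invented for the example; the slides do not show the real schema):

# Count events per (user, event_type) for one day
daily_counts = (events
                .map(lambda e: ((e['user'], e['type']), 1))
                .reduceByKey(lambda a, b: a + b))   # partial sums per partition, small shuffle

# The groupByKey equivalent would ship every raw event across the network
# before counting, which is exactly what the slide advises against:
#   events.map(lambda e: ((e['user'], e['type']), 1)).groupByKey().mapValues(len)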

Deep Dive

DatastoreBuild Daily profiles

Intermediate Data Structure

(Compressed in GZIP)

Events (1 TB)

Daily Profiles1.8 GB

Build Monthly profiles

Monthly Aggregate40 GB

Generate Audience StatisticsCustomers

Statistics7 GB

● Takes 4 hours● 150 Statistics● Delivered daily to

clients


SO WHAT???

                                          Before          After
Computing daily event summary             1+ DAYS !!!     20 mins
Computing monthly aggregate                               40 mins
Storing daily event summary               100's of GBs    1.8 GB
Storing monthly aggregate                                 40 GB
Total time taken for generating stats     1+ DAYS !!!     3 hrs 30 mins
Time taken per report                     1+ DAYS !!!     1.4 mins

Parquet enabled us to reduce our storage costs by 86% and increase data loading speed by 5x
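For reference, a sketch of writing and reading such an intermediate data structure as Parquet with the Spark 1.x-era PySpark API (the SQLContext calls are the 1.x API; the paths and the profile schema are placeholders):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# Hypothetical daily-profile records
profiles = sc.parallelize([Row(user='u1', clicks=3, impressions=40)])

schema_rdd = sqlContext.inferSchema(profiles)                          # Spark 1.x API
schema_rdd.saveAsParquetFile('s3://bucket/daily-profiles.parquet')     # placeholder path

# Reading back: columnar storage means only the needed columns are scanned
daily = sqlContext.parquetFile('s3://bucket/daily-profiles.parquet')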

[Chart: storage comparison]

[Chart: performance when parsing 31 daily profiles]

Thank You !!