A Year With Spark
Martin Goodson, VP Data Science, Skimlinks
Page 1: A Year With Spark (source: files.meetup.com/13722842/Spark Meetup.pdf)

A Year With Spark
Martin Goodson, VP Data Science

Skimlinks

Page 2

Phase I of my big data experience

R files, Python files, Awk, Sed, a job scheduler (Sun Grid Engine), Make/bash scripts

Page 3

Phase II of my big data experience

Pig + Python user-defined functions

Page 4

Phase III of my big data experience

PySpark?

Page 5

Skimlinks data

Automated affiliatization tech
140,000 publisher sites
Collect 30 TB/month of user behaviour (clicks, impressions, purchases)

Page 6

Data science team

5 data scientists
Machine learning or statistical computing
Varying programming experience
Not engineers
No devops

Page 7

Reality

Page 8

Spark Can Be Unpredictable

Page 9

Reality

Learning in depth how Spark works
Trying to divide and conquer
Learning how to configure Spark properly

Page 10

Learning in depth how spark works

Read all this:
https://spark.apache.org/docs/1.2.1/programming-guide.html
https://spark.apache.org/docs/1.2.1/configuration.html
https://spark.apache.org/docs/1.2.1/cluster-overview.html

And then:
https://www.youtube.com/watch?v=49Hr5xZyTEA (Spark internals)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py

Page 11

Try to divide and conquer

Don't throw 30 TB of data at a Spark script and expect it to just work.

Divide the work into bite-sized chunks, aggregating and projecting as you go.

Page 12

Try to divide and conquer

Use reduceByKey() not groupByKey()

Use max() and add() (cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)

Page 13

Start with this:
(k1, 1) (k1, 1) (k1, 2) (k1, 1) (k1, 5) (k2, 1) (k2, 2)

Use RDD.reduceByKey(add) to get this:
(k1, 10) (k2, 3)
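Why this matters can be seen without a cluster. A plain-Python sketch (not Spark code) of what reduceByKey(add) does: combine values within each partition first, then merge the small per-partition maps.

```python
from operator import add

# Toy stand-in for reduceByKey: combine within each "partition" first,
# then merge the per-partition partial results. This mirrors what Spark
# does before shuffling, which is why reduceByKey beats groupByKey:
# only small combined maps cross the network, not every raw pair.
def reduce_by_key(partitions, func):
    partials = []
    for part in partitions:              # local combine per partition
        acc = {}
        for k, v in part:
            acc[k] = func(acc[k], v) if k in acc else v
        partials.append(acc)
    merged = {}                          # merge combiners across partitions
    for acc in partials:
        for k, v in acc.items():
            merged[k] = func(merged[k], v) if k in merged else v
    return merged

partitions = [[("k1", 1), ("k1", 1)],
              [("k1", 2), ("k1", 1)],
              [("k1", 5), ("k2", 1), ("k2", 2)]]
print(reduce_by_key(partitions, add))    # {'k1': 10, 'k2': 3}
```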

Page 14

Key concept: reduceByKey (via combineByKey)

Each partition combines its pairs locally: (k1, 1)(k1, 1) → {k1: 2}; (k1, 2)(k1, 1) → {k1: 3}; (k1, 5) → {k1: 5}. The combineLocally step produces one small map per partition, and _mergeCombiners then merges those maps into the final {k1: 10, …}.

Page 15

Key concept: reduceByKey (via combineByKey)

The same flow, with one addition: reduceByKey(numPartitions) controls how many partitions the merged combiners are written into.

Page 16

PySpark Memory: worked example

Page 17

PySpark Memory: worked example

10 x r3.4xlarge (122 GB, 16 cores)
Use half for each executor: 60 GB
Number of cores = 120
Cache = 60% x 60 GB x 10 = 360 GB
Each Java thread: 40% x 60 GB / 12 = ~2 GB
Each Python process: ~4 GB
OS: ~12 GB
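The arithmetic above can be replayed in a few lines. The 60%/40% split mirrors Spark 1.x's default storage-memory fraction; the node and core counts are this example's assumptions.

```python
# Worked memory budget for 10 x r3.4xlarge (122 GB, 16 cores each),
# following the slide's assumptions. The 60% cache share corresponds to
# Spark 1.x's default spark.storage.memoryFraction.
nodes = 10
executor_mem_gb = 60                    # roughly half of each 122 GB node
total_cores = 120                       # spark.cores.max
threads_per_executor = total_cores // nodes          # 12 JVM task threads

cache_gb = executor_mem_gb * nodes * 60 // 100       # cluster-wide cache
per_thread_gb = executor_mem_gb * 40 // 100 // threads_per_executor

print(cache_gb, "GB cache,", per_thread_gb, "GB per Java thread")
```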

Page 18

PySpark Memory: worked example

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g

Page 19

PySpark Memory: worked example

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g

~/spark/bin/pyspark --driver-memory 60g

Page 20

PySpark: other memory configuration

spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)

Page 21

PySpark: other configuration

spark.shuffle.consolidateFiles=true
spark.rdd.compress=true
spark.speculation=true

Page 22

Errors

java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc.

Page 23

Errors

java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc.

All of the above are caused by memory errors!

Page 24

Errors

'ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue': a filter() that keeps little data from many partitions; use coalesce()

collect() fails: increase driver memory and spark.akka.frameSize
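A toy illustration of the coalesce() tip, with made-up numbers: a selective filter leaves many near-empty partitions, and a coalesce-style merge collapses them into a few.

```python
# Simulate a selective filter leaving 100 near-empty partitions, then a
# coalesce-style merge into 4 (what RDD.coalesce does without a full
# shuffle). All sizes here are invented for illustration.
partitions = [[x for x in range(i, i + 100) if x % 97 == 0]
              for i in range(0, 10000, 100)]        # ~1 survivor each

def coalesce(parts, n):
    # Group consecutive source partitions into n output partitions.
    size = (len(parts) + n - 1) // n
    return [sum(parts[i:i + size], []) for i in range(0, len(parts), size)]

merged = coalesce(partitions, 4)
print(len(partitions), "->", len(merged))           # 100 -> 4
```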

Page 25

Were our assumptions correct?

We have a very fast development process.
Use Spark for development and for scale-up.
Scalable data science development.

Page 26

Large-scale machine learning with Spark and Python

Empowering the data scientist
by Maria Mestre

Page 27

ML @ Skimlinks

● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox

Page 28

Every ML system…

➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Page 29

Data collection: scraping lots of pages

This is how I would do it on my local machine…

● use of the Scrapy package
● write a function scrape() that creates a Scrapy object

urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)

Page 30

Distributing over the cluster

def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)
    return [output]  # a mapPartitions function must return an iterable

urls = open('list_urls.txt', 'r').readlines()
urls = sc.parallelize(urls, 100)
urls.mapPartitionsWithIndex(lambda index, urls: distributed_scrape(urls, index, s3_bucket)).count()  # the action forces execution
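As a sanity check of the pattern (pure Python, no Spark; scrape() is replaced by a hypothetical stand-in): parallelize splits the url list into chunks, and each chunk is handed, with its partition index, to the worker function.

```python
# Plain-Python picture of sc.parallelize(urls, 100) followed by
# mapPartitionsWithIndex: chunk the url list, then call the worker once
# per chunk with the chunk's index. scrape() is mocked out; in the real
# job it writes one JSON file per partition to S3.
def parallelize(items, num_partitions):
    size = (len(items) + num_partitions - 1) // num_partitions
    return [items[i:i + size] for i in range(0, len(items), size)]

def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    return output, len(urls)        # stand-in for scrape(urls, output)

urls = ['http://example.com/%d' % i for i in range(1000)]
results = [distributed_scrape(chunk, i, 's3://bucket/')
           for i, chunk in enumerate(parallelize(urls, 100))]
print(results[0])                   # ('s3://bucket/part0.json', 10)
```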

Page 31

Installing scrapy over the cluster

1/ need to use Python 2.7
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh

2/ use pssh to install packages on the slaves
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'

Page 32

Every ML system…

➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Page 33

Example: filtering

● we want to find the activity of 30M users in 2 months of activity: 2 GB vs 6 TB

○ map-side join using broadcast() ⇒ does not work with large objects!
■ e.g. input.filter(lambda x: x['user'] in user_list_b.value)

○ use of mapPartitions()
■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))

Page 34

bloom filter join: 6 TB, ~11B input → 35 mins → 113 GB, 529M matches → 9 mins → 60 GB, 515M matches
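A minimal sketch of the bloom-filter-join idea, assuming the standard two-pass shape (a compact filter prunes most non-matches cheaply, at the cost of a few false positives, then an exact check cleans up); all data here is invented.

```python
import hashlib

# Minimal Bloom filter illustrating the two-pass join: a small bit
# array prunes most non-matching events in the first pass (it can say
# "maybe" for a few non-members but never misses a member), then an
# exact membership check removes the false positives.
class Bloom:
    def __init__(self, bits=1 << 20, hashes=5):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

users = {"user%d" % i for i in range(0, 10000, 2)}   # the small (2 GB) side
bloom = Bloom()
for u in users:
    bloom.add(u)

events = ["user%d" % i for i in range(10000)]        # the huge (6 TB) side
maybe = [e for e in events if e in bloom]            # cheap first pass
exact = [e for e in maybe if e in users]             # exact second pass
print(len(events), len(maybe), len(exact))
```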

Page 35

Example: segmenting urls

● we want to convert a url 'www.iloveshoes.com' to ['i', 'love', 'shoes']

● Segmentation
○ wordsegment package in Python ⇒ very slow!
○ 300M urls take 10 hours with 120 cores!
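For illustration, a toy dynamic-programming segmenter in the spirit of the wordsegment package (the vocabulary here is invented; the real package scores candidate words with corpus statistics rather than just preferring fewer words).

```python
# Toy word segmenter: split a string into dictionary words using
# dynamic programming, preferring splits with fewer words. A stand-in
# for the wordsegment package; the vocabulary is made up.
def segment(s, vocab):
    best = {0: []}                       # best[i] = best split of s[:i]
    for i in range(1, len(s) + 1):
        for j in range(i):
            if j in best and s[j:i] in vocab:
                cand = best[j] + [s[j:i]]
                if i not in best or len(cand) < len(best[i]):
                    best[i] = cand
    return best.get(len(s))              # None if no full split exists

vocab = {"i", "love", "shoes", "shoe", "lo", "ve"}
print(segment("iloveshoes", vocab))      # ['i', 'love', 'shoes']
```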

Page 36

Example: getting a representative sample

Page 37

Our solution in Spark!

sample = sc.parallelize([], 1)
sample_size = 1000
input.cache()
for category, proportion in stats.items():
    category_pages = input.filter(lambda x: x['category'] == category)
    category_sample = category_pages.takeSample(False, int(sample_size * proportion))
    sample = sample.union(sc.parallelize(category_sample))  # takeSample returns a local list

MLlib offers a probabilistic solution (not an exact sample size):

sample = sampleByKey(input, stats)
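The same stratified-sampling loop in plain Python, with invented data and proportions, to show the exact-size behaviour the Spark version aims for.

```python
import random

# Plain-Python version of the stratified sampling loop: take an exact
# number of pages from each category, proportional to `stats`. Data and
# proportions are invented for the example.
random.seed(0)
pages = [{"category": c, "id": i}
         for i, c in enumerate(["news"] * 800 + ["shopping"] * 200)]
stats = {"news": 0.8, "shopping": 0.2}
sample_size = 100

sample = []
for category, proportion in stats.items():
    group = [p for p in pages if p["category"] == category]
    sample += random.sample(group, int(sample_size * proportion))

print(len(sample))   # exactly 100, unlike a probabilistic sampleByKey
```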

Page 38

Every ML system…

➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Page 39

Grid search for hyperparameters

Problem: we have some candidate values [1, 2, ..., 10000] for a hyperparameter; which one should we choose?

If the data is small enough that processing time is fine
➢ Do it on a single machine

If the data is too large to process on a single machine
➢ Use MLlib

If the data can be processed on a single machine but takes too long to train
➢ The next slide!

Page 40
Page 41
Page 42

number of combinations = {parameters} = 2

Page 43

number of combinations = {parameters} = 2

Page 44

Using cross-validation to optimise a hyperparameter

1. separate the data into k equally-sized chunks
2. for each candidate value i
   a. use (k-1) chunks to fit the classifier parameters
   b. use the remaining chunk to get a classification score
   c. report the average score
3. at the end, select the value that achieves the best average score
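The recipe above, sketched in self-contained Python with a deliberately fake scoring function standing in for a real classifier.

```python
# Minimal k-fold cross-validation over a list of candidate values,
# mirroring the numbered steps above. fake_fit_and_score is an invented
# stand-in for fitting and scoring a real classifier.
def k_fold_indices(n, k):
    size = n // k
    return [(list(range(0, i * size)) + list(range((i + 1) * size, n)),
             list(range(i * size, (i + 1) * size)))
            for i in range(k)]

def fake_fit_and_score(candidate, train, test):
    return -abs(candidate - 3)           # pretend 3 is the best value

def cross_val_score(candidate, data, k=5):
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]   # fit on k-1 chunks
        test = [data[i] for i in test_idx]     # score on the held-out chunk
        scores.append(fake_fit_and_score(candidate, train, test))
    return sum(scores) / len(scores)           # average score

data = list(range(100))
best = max([1, 2, 3, 4, 10], key=lambda c: cross_val_score(c, data))
print(best)   # 3
```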

Page 45

number of combinations = {parameters} x {folds} = 4
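One way to read these counts, consistent with the slides: the unit of parallel work is a (parameter, fold) combination rather than a slice of the data. Enumerating the combinations in plain Python (the parameter values are invented):

```python
from itertools import product

# Enumerate (hyperparameter, fold) combinations, the natural unit of
# parallel work when the data fits on each machine but training is
# slow. With 2 parameter values and 2 folds this gives the 4
# combinations counted on the slide; in PySpark one would
# sc.parallelize(combos) and train one model per element.
parameters = [0.1, 1.0]
folds = [0, 1]
combos = list(product(parameters, folds))
print(len(combos))   # 4
```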

Page 46

Every ML system…

➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Page 47

Apply the classifier over the new data: easy!

With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))

With scikit-learn, when the model cannot be broadcast:
save the classifier models to files, ship them to S3
use mapPartitions to read the model parameters and classify

With MLlib:
(model._threshold = None)
new_labels = new_data.map(lambda x: model.predict(x))
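A plain-Python analogue of the broadcast-and-map pattern (the threshold model is invented for the example; in Spark the model would be wrapped in sc.broadcast and new_data would be an RDD).

```python
# Broadcast-and-map in miniature: ship one read-only model to every
# "task", then map each record through it. ThresholdModel is a made-up
# classifier so the example stays self-contained.
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return 1 if x >= self.threshold else 0

model = ThresholdModel(5)       # in Spark: classifier_b = sc.broadcast(model)
new_data = [1, 4, 5, 9]
new_labels = [model.predict(x) for x in new_data]   # new_data.map(...)
print(new_labels)               # [0, 0, 1, 1]
```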

Page 48

Thanks!

Page 49

Apache Spark for Big Data

Spark at Scale & Performance Tuning

Sahan Bulathwela | Data Science Engineer @ Skimlinks

Page 50

Outline

● Spark at scale: Big Data Example

● Tuning and Performance

Page 51

Spark at Scale: Big Data Example

● Yes, we use Spark!!
● Not just for prototypes or one-time analyses
● Run automated analyses at a large scale on a daily basis
● Use-case: generating audience statistics for our customers

Page 52

Before…

● We provide data products based on audience statistics to customers

● Extract event data from Datastore

● Generate Audience statistics and reports

Page 53

Data

● Skimlinks records web data in terms of user events such as clicks, impressions, etc.
● Our data!!
○ Records 18M clicks (11 GB)
○ Records 203M impressions (950 GB)
○ These numbers are on a daily basis (Oct 01, 2014)
● About 1 TB of relevant events

Page 54

A few days and data scientists later...

Statistics

Page 55

Major pain points

● Most of the data is not relevant
○ Only 3-4 out of 30ish fields are useful for each report
● Many duplicate steps
○ Reading the data
○ Extracting relevant fields
○ Transformations such as classifying events

Page 56

Solution

Page 57

Page 58

Page 59

Aggregation doing its magic

● Mostly grouping events and summarizing
● Distribute the workload in time
● "Reduce by" instead of "Group by"
● BOTS

Page 60

Deep Dive

Datastore → Build Daily Profiles → Daily Profiles (1.8 GB) → Build Monthly Profiles → Monthly Aggregate (40 GB) → Generate Audience Statistics → Statistics (7 GB) → Customers

Events (1 TB) flow in from the Datastore; the intermediate data structure is compressed in GZIP.

● Takes 4 hours
● 150 statistics
● Delivered daily to clients


Page 63

SO WHAT???

                                        Before        After
Computing daily event summary           1+ DAYS!!!    20 mins
Computing monthly aggregate                           40 mins
Storing daily event summary             100s of GBs   1.8 GB
Storing monthly aggregate                             40 GB
Total time taken for generating stats   1+ DAYS!!!    3 hrs 30 mins
Time taken per report                   1+ DAYS!!!    1.4 mins

Page 64

Parquet enabled us to reduce our storage costs by 86% and increase data loading speed by 5x

Page 65

Storage

Page 66

Performance when parsing 31 daily profiles

Page 67

Thank You !!

