Spark on Dataproc - Israel Spark Meetup at taboola

Date post: 19-Jan-2017
Upload: tsliwowicz
Vadim Solovey. Google Cloud Dataproc: Spark and Hadoop with superfast start-up, easy management, and billed by the minute.
Transcript
Page 1: Spark on Dataproc - Israel Spark Meetup at taboola

Vadim Solovey

Google Cloud Dataproc

Spark and Hadoop with superfast start-up, easy management, and billed by the minute.

Page 2: Spark on Dataproc - Israel Spark Meetup at taboola

Copyright 2015 Google Inc

Vadim Solovey

Google Developer Expert & Trainer

CTO of DoIT International

Page 3: Spark on Dataproc - Israel Spark Meetup at taboola

Agenda

01 Google Dataproc Overview

02 Features

03 Demo

04 Roadmap

05 Q&A

06 Try Google Dataproc

Page 4: Spark on Dataproc - Israel Spark Meetup at taboola

Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

Cloud Dataproc

Page 5: Spark on Dataproc - Israel Spark Meetup at taboola

Confidential & Proprietary, Google Cloud Platform

Management

Mobile

Services

Compute

Big Data

Storage

Developer Tools

Page 6: Spark on Dataproc - Israel Spark Meetup at taboola

Dataproc 101

Easy to Use

Easily create and scale clusters to run native:

• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More with Initialization Actions (IAs)

Integrated

Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost

Low-cost data processing with:

• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances

Page 7: Spark on Dataproc - Israel Spark Meetup at taboola

Competitive Highlights

Cluster start time: elapsed time from cluster creation until it is ready.
  Cloud Dataproc: < 90 seconds
  Amazon EMR: ~360 seconds
  Customer impact: Faster data processing workflows because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure: increment used for billing the service when active.
  Cloud Dataproc: Minute
  Amazon EMR: Hourly
  Customer impact: Reduced costs for running Spark and Hadoop because you pay for what you actually use, not a cost that has been rounded up.

Preemptible VMs: clusters can utilize preemptible VMs.
  Cloud Dataproc: Yes
  Amazon EMR: Kind of :-)
  Customer impact: Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Job output & cancellation: job output is easy to find and jobs are cancelable without SSH.
  Cloud Dataproc: Yes
  Amazon EMR: No
  Customer impact: Higher productivity because job output does not necessitate reviewing log files and canceling jobs does not require SSH.
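
The effect of the billing unit can be sketched with a little arithmetic. The example below is only an illustration: the $10.00 hourly rate and the function name `billed_cost` are hypothetical, and it ignores any minimum billing period a provider may apply. It compares what a 35-minute job is charged when usage is rounded up to whole minutes versus whole hours.

```python
from decimal import Decimal


def billed_cost(runtime_minutes: int, hourly_rate: Decimal, increment_minutes: int) -> Decimal:
    """Cost when runtime is rounded up to whole billing increments."""
    # Integer ceiling division: number of increments needed to cover the runtime.
    increments = -(-runtime_minutes // increment_minutes)
    billed_minutes = increments * increment_minutes
    return (hourly_rate * billed_minutes / 60).quantize(Decimal("0.01"))


HOURLY_RATE = Decimal("10.00")  # hypothetical cluster rate in USD per hour

per_minute = billed_cost(35, HOURLY_RATE, increment_minutes=1)
per_hour = billed_cost(35, HOURLY_RATE, increment_minutes=60)

print(f"billed by the minute: ${per_minute}")  # 35 minutes are charged
print(f"billed by the hour:   ${per_hour}")    # a full 60 minutes are charged
```

With these assumed numbers the per-minute bill covers 35 minutes of usage, while the hourly bill charges for the 25 unused minutes as well.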

Page 8: Spark on Dataproc - Israel Spark Meetup at taboola

02 Features

Page 9: Spark on Dataproc - Israel Spark Meetup at taboola

● Spark 1.5.2 w/ PySpark & Spark SQL

● Hadoop 2.7.1

● Pig 0.15

● Hive 1.2.1

● YARN Resource Manager

● Debian 8 based O/S

● Google Connectors for Cloud Storage, BigQuery & BigTable etc.

Packaging & Versioning

Page 10: Spark on Dataproc - Israel Spark Meetup at taboola

Features

Integrated

Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling

Manually scale clusters up or down based on need, even when jobs are running.

Tools

UI, API & CLI for rapid development, including Initialization Actions & Job Output Driver.

Global Availability

Available in every Google Cloud zone in the United States, Europe, and Asia.

Page 11: Spark on Dataproc - Israel Spark Meetup at taboola

Initialization Action Example

# Only run on the master node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then

  # -y keeps apt-get from waiting for confirmation in a non-interactive run
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python

  mkdir IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default

  # Let the notebook server accept connections on any interface
  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

  # Setup script for iPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF

  # Quote the * so the shell does not expand it as a glob
  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log &
fi

Page 12: Spark on Dataproc - Israel Spark Meetup at taboola

Off-the-Shelf Initialization Actions

https://github.com/GoogleCloudPlatform/dataproc-initialization-actions

Pull Requests are Welcome!

Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper

Page 13: Spark on Dataproc - Israel Spark Meetup at taboola

Available Datastores

BigQuery, BigTable, CloudSQL, Datastore

Cloud Storage Nearline

Page 14: Spark on Dataproc - Israel Spark Meetup at taboola

GCS Connector Performance (I)

Recommendation Engine Use-Case (1 file, 500GB)

Page 15: Spark on Dataproc - Israel Spark Meetup at taboola

GCS Connector Performance (II)

Sessionization Use-Case (14,800 files, 1GB each)

Page 16: Spark on Dataproc - Israel Spark Meetup at taboola

GCS Connector Performance (III)

Document Clustering Use-Case (31,000 files, 250MB each)

Page 17: Spark on Dataproc - Israel Spark Meetup at taboola

Additional Integrations

Cloud Logging, Cloud Monitoring

Page 18: Spark on Dataproc - Israel Spark Meetup at taboola

Spark & BigQuery Integration Example

// Assumes a spark-shell session on the cluster, where `sc` is already defined.
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

// Assumption: the slide does not show where projectId and conf come from.
val projectId = "<your-project-id>"
val conf = sc.hadoopConfiguration

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

// Read the table as (row key, JSON row) pairs and peek at the first ten.
val tableData = sc.newAPIHadoopRDD(conf, classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)

Page 19: Spark on Dataproc - Israel Spark Meetup at taboola

03 Demo

Page 20: Spark on Dataproc - Israel Spark Meetup at taboola

Pricing Example

A 35-minute Spark job running on 14 x 16-core workers (224 cores)

[ Crunching 3TB TeraSort ]

Page 21: Spark on Dataproc - Israel Spark Meetup at taboola

Pricing

Pricing Example

Function                    Machine Type   # in Cluster  vCPUs  Instances Price  Dataproc Price
Master Node                 n1-standard-4             1      4  $0.2             $0.04
Worker Nodes                n1-highmem-16             4     64  $4.032           $0.64
Worker Nodes (Preemptible)  n1-highmem-16            10    160  $3.8             $1.6
Cluster Total               n/a                      15    224  $4.88

Pricing Details

Dataproc price per Compute Engine vCPU (any machine type): $0.01 per hour (USD)

35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
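
The Dataproc Price column follows from the $0.01 per vCPU per hour rate, which can be checked with a short sketch. The function names below are hypothetical, and the sketch covers only the Dataproc fee: it does not model the Compute Engine instance charges and does not attempt to reproduce the slide's $4.88 cluster total. Pro-rating to 35 minutes assumes the per-minute billing described earlier in the deck.

```python
from decimal import Decimal

# USD per vCPU per hour, taken from the "Pricing Details" line of the slide.
DATAPROC_RATE_PER_VCPU_HOUR = Decimal("0.01")

# vCPUs per node group, as listed in the pricing table.
node_groups = {
    "Master Node": 4,
    "Worker Nodes": 64,
    "Worker Nodes (Preemptible)": 160,
}


def dataproc_fee_per_hour(vcpus: int) -> Decimal:
    """Hourly Dataproc fee for a group of nodes with the given vCPU count."""
    return DATAPROC_RATE_PER_VCPU_HOUR * vcpus


def prorated(hourly_fee: Decimal, minutes: int) -> Decimal:
    """Fee for a partial hour, assuming billing by the minute."""
    return (hourly_fee * minutes / 60).quantize(Decimal("0.01"))


fees = {name: dataproc_fee_per_hour(vcpus) for name, vcpus in node_groups.items()}
hourly_total = sum(fees.values(), Decimal("0"))
job_fee = prorated(hourly_total, 35)

for name, fee in fees.items():
    print(f"{name}: ${fee} per hour")
print(f"Dataproc fee for the whole cluster: ${hourly_total} per hour")
print(f"Dataproc fee for a 35-minute job: ${job_fee}")
```

The per-group results match the $0.04, $0.64, and $1.6 entries in the table, which suggests the column is simply the vCPU count multiplied by the hourly rate.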

Page 22: Spark on Dataproc - Israel Spark Meetup at taboola

04 Roadmap

Page 23: Spark on Dataproc - Israel Spark Meetup at taboola

Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)

Mahout, Hue, Cloudera, MapR and others

Performance

Further improve performance on jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and provide 2x the performance of local HDFS (when not using LocalSSD).

More Native Datastores

Spanner, Google ML

Page 24: Spark on Dataproc - Israel Spark Meetup at taboola

06 Try Google Dataproc in 2015

Page 25: Spark on Dataproc - Israel Spark Meetup at taboola

AWS EMR Customer?

Get $1,000 to test Google Dataproc

Page 26: Spark on Dataproc - Israel Spark Meetup at taboola

Not an AWS EMR Customer?

Get $1,000* to test Google Dataproc

Page 27: Spark on Dataproc - Israel Spark Meetup at taboola

* Agree to a 1-hour meeting at Google Tel-Aviv to discuss your Big Data needs

Page 28: Spark on Dataproc - Israel Spark Meetup at taboola

goo.gl/mFwCYa

Promo code is “1K-Dataproc”

Page 29: Spark on Dataproc - Israel Spark Meetup at taboola

05 Q&A

goo.gl/mFwCYa

