
Serverless Data Architecture at scale on Google Cloud Platform

Date post: 13-Apr-2017
Upload: meetupdatascienceroma
Serverless Data Architecture at scale on Google Cloud Platform Lorenzo Ridi Machine Learning/Data Science Meetup Rome, 02-02-2017
Transcript
Page 1: Serverless Data Architecture at scale on Google Cloud Platform

Serverless Data Architecture at scale on Google Cloud Platform

Lorenzo Ridi

Machine Learning/Data Science Meetup

Rome, 02-02-2017

Page 2: Serverless Data Architecture at scale on Google Cloud Platform

Hi, I’m Lorenzo!

I’ve been a Research Fellow @UniFI

I am a Software Engineer @Noovle

I am a Google Cloud Platform Qualified Developer

I am a Google Cloud Platform Authorized Trainer

Page 3: Serverless Data Architecture at scale on Google Cloud Platform

Google’s Mission

“Organize the world’s information and make it universally accessible and useful.”

Page 4: Serverless Data Architecture at scale on Google Cloud Platform

Google’s Data Research

[Timeline figure, 2002 to 2016: GFS, MapReduce, TensorFlow, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, Millwheel, PubSub, F1]

Page 5: Serverless Data Architecture at scale on Google Cloud Platform

Google’s Data Products

[Timeline figure, 2002 to 2016: ML, PubSub, DataFlow, DataStore, Cloud Storage, BigQuery, BigTable, DataProc]

Page 6: Serverless Data Architecture at scale on Google Cloud Platform

Pre-Trained Machine Learning Models

Fully trained ML models from Google Cloud that allow a general developer to take advantage of rich machine learning capabilities with simple REST-based services.

Cloud Natural Language (GA)

Cloud Speech (Beta)

Cloud Translate (GA)

Cloud Vision (GA)

Stay tuned...

Page 7: Serverless Data Architecture at scale on Google Cloud Platform

tensorflow.org

github.com/tensorflow

Open Source Software Library for Machine Learning.

Cloud Machine Learning

Managed service that enables you to easily build machine learning models that work on any type of data, of any size.

Use your own data to train models

Page 8: Serverless Data Architecture at scale on Google Cloud Platform

Cracking Black Friday

Adding Machine Learning to a serverless data analysis pipeline

Page 9: Serverless Data Architecture at scale on Google Cloud Platform

Black Friday (ˈblæk fraɪdɪ)

noun

The day following Thanksgiving Day in the United States. Since 1932, it has been regarded as the beginning of the Christmas shopping season.

Page 10: Serverless Data Architecture at scale on Google Cloud Platform

Black Friday in the US

2012 - 2016

source: Google Trends, November 23rd 2016

Page 11: Serverless Data Architecture at scale on Google Cloud Platform

Black Friday in Italy

2012 - 2016

source: Google Trends, November 23rd 2016

Page 12: Serverless Data Architecture at scale on Google Cloud Platform

What are we doing

Tweets about Black Friday -> Processing + analytics -> insights

$$

Hashtags (blackfriday, blackfriday2016)

Brands & vendors (walmart, bestbuy)

Negative Hashtags (notonedime, blackoutblackfriday)

Page 13: Serverless Data Architecture at scale on Google Cloud Platform

How we’re gonna do it

ingest -> process -> store -> analyze

Reqs (boss):

Do it fast

Make it work

Page 14: Serverless Data Architecture at scale on Google Cloud Platform

How we’re gonna do it

ingest -> process -> store -> analyze

Reqs (boss):

Do it fast

Make it work

go serverless

Page 15: Serverless Data Architecture at scale on Google Cloud Platform

How we’re gonna do it

Pub/Sub + Container Engine (Kubernetes) -> process -> store -> analyze

Reqs (boss):

Do it fast

Make it work

go serverless

Page 16: Serverless Data Architecture at scale on Google Cloud Platform

What is Google Cloud Pub/Sub?

● Google Cloud Pub/Sub is a fully-managed real-time messaging service.

  ○ Guaranteed delivery

    ■ “At least once” semantics

  ○ Reliable at scale

    ■ Messages are replicated in different zones
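
"At least once" semantics mean a subscriber may see the same message more than once, so consumers are usually written to tolerate duplicates. A minimal sketch of that idea in plain Python, with no Pub/Sub client involved; the `DedupingConsumer` class and the `message_id` field name are hypothetical, chosen only for illustration:

```python
class DedupingConsumer:
    """Processes each message ID once, even if it is delivered repeatedly."""

    def __init__(self):
        # IDs already handled. Unbounded here; a real system would expire old IDs.
        self.seen_ids = set()
        # Payloads that were actually processed.
        self.processed = []

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        if message["message_id"] in self.seen_ids:
            return False
        self.seen_ids.add(message["message_id"])
        self.processed.append(message["data"])
        return True


consumer = DedupingConsumer()
deliveries = [
    {"message_id": "1", "data": "tweet A"},
    {"message_id": "2", "data": "tweet B"},
    {"message_id": "1", "data": "tweet A"},  # redelivery of the first message
]
results = [consumer.handle(m) for m in deliveries]
print(results)
print(consumer.processed)
```

The redelivered message is acknowledged but not processed a second time, which keeps downstream counts correct.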

Page 17: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

$ gcloud beta pubsub topics create blackfridaytweets
Created topic [blackfridaytweets].

SHELL

Page 18: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

? -> Pub/Sub Topic

Pub/Sub Topic -> Subscription A -> Consumer A

Pub/Sub Topic -> Subscription B -> Consumer B

Pub/Sub Topic -> Subscription C -> Consumer C

Reliable AND scalable delivery

Decouples producer and consumer(s)

Absorbs shocks and changes
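
The topic/subscription fan-out can be illustrated with a small in-memory model. This is only a sketch of the concept, not the Pub/Sub API: the `Topic` class and its methods are hypothetical names. Each subscription keeps its own queue, so a slow consumer does not lose messages and does not hold back the others.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic with named subscriptions."""

    def __init__(self):
        self.queues = {}  # subscription name -> pending messages

    def subscribe(self, name):
        self.queues.setdefault(name, deque())

    def publish(self, message):
        # Every subscription receives its own copy of the message.
        for queue in self.queues.values():
            queue.append(message)

    def pull(self, name):
        """Return and clear everything pending on one subscription."""
        pending = list(self.queues[name])
        self.queues[name].clear()
        return pending


topic = Topic()
topic.subscribe("A")
topic.subscribe("B")
topic.publish("t1")
topic.publish("t2")
first_a = topic.pull("A")   # consumer A keeps up with the stream
topic.publish("t3")
second_a = topic.pull("A")
all_b = topic.pull("B")     # consumer B was slow, but nothing was lost
print(first_a, second_a, all_b)
```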

Page 19: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

● Simple Python application using the Tweepy library

# somewhere in the code, track a given set of keywords
stream = Stream(auth, listener)
stream.filter(track=['blackfriday', [...]])

[...]

# somewhere else, write messages to Pub/Sub
for line in data_lines:
    pub = base64.urlsafe_b64encode(line)
    messages.append({'data': pub})
body = {'messages': messages}
resp = client.projects().topics().publish(
    topic='blackfridaytweets', body=body).execute(num_retries=NUM_RETRIES)

PYTHON

topic='blackfridaytweets': this is our Pub/Sub topic
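
The publishing loop above builds a request body whose `data` fields are base64-encoded. A self-contained sketch of just that step, assuming Python 3, where `base64.urlsafe_b64encode` takes and returns bytes; the helper name `build_publish_body` is hypothetical, and no API call is made:

```python
import base64


def build_publish_body(data_lines):
    """Wrap text lines in the {'messages': [{'data': ...}]} shape used above."""
    messages = []
    for line in data_lines:
        # Encode to bytes, base64-encode, then decode to str so the body
        # can be serialized as JSON.
        encoded = base64.urlsafe_b64encode(line.encode("utf-8")).decode("ascii")
        messages.append({"data": encoded})
    return {"messages": messages}


body = build_publish_body(['{"text": "#blackfriday deals"}'])
# Round trip: a subscriber would decode the payload the same way.
decoded = base64.urlsafe_b64decode(body["messages"][0]["data"]).decode("utf-8")
print(decoded)
```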

Page 20: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs

Page 21: Serverless Data Architecture at scale on Google Cloud Platform

VM

From Twitter to Pub/Sub

App+

Libs

Page 22: Serverless Data Architecture at scale on Google Cloud Platform

VM

From Twitter to Pub/Sub

App+

Libs

hard to scale

hard to make fault-tolerant

difficult to deploy & update

Page 23: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container

Page 24: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container

FROM google/python

# install/update libs
RUN pip install --upgrade pip
RUN pip install pyopenssl ndg-httpsclient pyasn1
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install python-dateutil

# copy scripts
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py

# execute script
CMD python twitter-to-pubsub.py

DOCKERFILE

Page 25: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container

easy to deploy but still doesn’t scale

Page 26: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container Pod

Page 27: Serverless Data Architecture at scale on Google Cloud Platform

What is Kubernetes (K8S)?

● An orchestration tool for managing a cluster of containers across multiple hosts

  ○ Scaling, rolling upgrades, A/B testing, etc.

● Declarative – not procedural

  ○ Auto-scales and self-heals to desired state

● Supports multiple container runtimes, currently Docker and CoreOS Rkt

● Open-source: github.com/kubernetes
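
"Declarative" here means you state the desired number of replicas and a control loop keeps moving the cluster toward it. A toy sketch of one reconciliation pass in plain Python; the `reconcile` function and the pod naming are hypothetical and are not Kubernetes code:

```python
def reconcile(desired_replicas, running_pods, next_id):
    """Return (pods, next_id) after one pass that moves toward the desired count."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:   # too few: start replacements
        pods.append("pod-%d" % next_id)
        next_id += 1
    while len(pods) > desired_replicas:   # too many: stop the extras
        pods.pop()
    return pods, next_id


pods, next_id = reconcile(3, [], 0)                    # initial rollout
pods.remove("pod-1")                                   # simulate a crashed pod
healed, next_id = reconcile(3, pods, next_id)          # self-healing
scaled_down, next_id = reconcile(1, healed, next_id)   # desired state changed
print(healed, scaled_down)
```

The caller never says "start a pod" or "kill a pod"; it only changes the desired count, and the loop works out the actions.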

Page 28: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container Pod

apiVersion: v1
kind: ReplicationController
metadata:
  [...]
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: twitter-stream
    spec:
      containers:
      - name: twitter-to-pubsub
        image: gcr.io/codemotion-2016-demo/pubsub_pipeline
        env:
        - name: PUBSUB_TOPIC
          value: ...

YAML

Todo: use Deployments!

Page 29: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container Pod

Page 30: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

App+

Libs Container Pod Node

Page 31: Serverless Data Architecture at scale on Google Cloud Platform

Node

From Twitter to Pub/Sub

Pod A Pod B

Container Engine manages the K8S master for us!

Page 32: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

Node 1

Node 2

Node autoscaling, courtesy of Kubernetes 1.3

Page 33: Serverless Data Architecture at scale on Google Cloud Platform

From Twitter to Pub/Sub

$ gcloud container clusters create codemotion-2016-demo-cluster
Creating cluster cluster-1...done.
Created [...projects/codemotion-2016-demo/.../clusters/codemotion-2016-demo-cluster].

$ gcloud container clusters get-credentials codemotion-2016-demo-cluster
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.

$ kubectl create -f ~/git/kube-pubsub-bq/pubsub/twitter-stream.yaml
replicationcontroller "twitter-stream" created.

SHELL

Page 35: Serverless Data Architecture at scale on Google Cloud Platform

process

store

analyze

Pub/Sub

Kubernetes

How we’re gonna do it

Page 36: Serverless Data Architecture at scale on Google Cloud Platform

store

analyze

Pub/Sub

Kubernetes

Dataflow

How we’re gonna do it

Page 37: Serverless Data Architecture at scale on Google Cloud Platform

analyze

Pub/Sub

Kubernetes

Dataflow

BigQuery

How we’re gonna do it

Page 38: Serverless Data Architecture at scale on Google Cloud Platform

What is Google Cloud Dataflow?

● Cloud Dataflow is a collection of open source SDKs (Apache Beam) to implement parallel processing pipelines.

  ○ same programming model for streaming and batch pipelines

● Cloud Dataflow is a managed service to run parallel processing pipelines on Google Cloud Platform

Page 39: Serverless Data Architecture at scale on Google Cloud Platform

What is Google BigQuery?

● Google BigQuery is a fully-managed Analytic Data Warehouse solution allowing real-time analysis of Petabyte-scale datasets.

● Enterprise-grade features

  ○ Batch and streaming (100K rows/sec) data ingestion

  ○ JDBC/ODBC connectors

  ○ Rich SQL-2011-compliant query language (new!)

  ○ Supports updates and deletes (new!)

Page 40: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

Pub/Sub Topic -> Subscription -> Dataflow Pipeline -> BigQuery Table

Dataflow Pipeline:

1. Read tweets from Pub/Sub

2. Format tweets for BigQuery

3. Write tweets to BigQuery

Page 41: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● A Dataflow pipeline is a Java program (python too!).

// TwitterProcessor.java

public static void main(String[] args) {
  Pipeline p = Pipeline.create();

  // Reads from Pub/Sub
  PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));

  // Transforms each tweet (json) into a BigQuery record
  PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));

  // Writes to BigQuery
  formattedTweets.apply(BigQueryIO.Write.to(tableReference));

  p.run();
}

JAVA

Page 42: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● A Dataflow pipeline is a Java program (python too!).

// TwitterProcessor.java

// Do Function (to be used within a ParDo)
private static final class DoFormat extends DoFn<String, TableRow> {
  private static final long serialVersionUID = 1L;

  @Override
  public void processElement(DoFn<String, TableRow>.ProcessContext c) throws IOException {
    // c.element() is the input element; c.output() writes the output
    c.output(createTableRow(c.element()));
  }
}

// Helper method
private static TableRow createTableRow(String tweet) throws IOException {
  return JacksonFactory.getDefaultInstance().fromString(tweet, TableRow.class);
}

JAVA
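
The formatting step in `DoFormat` is a per-element function: one JSON string in, one row out. A rough Python equivalent of that idea, written with the standard library only rather than the Dataflow SDK. The function names are hypothetical, and skipping malformed input is an assumption made for this sketch; the Java version above would raise instead.

```python
import json


def format_tweet(tweet_json):
    """Parse one tweet (a JSON string) into a dict usable as a table row."""
    return json.loads(tweet_json)


def format_all(tweets):
    """Apply format_tweet element by element, skipping malformed input."""
    rows = []
    for tweet in tweets:
        try:
            rows.append(format_tweet(tweet))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return rows


rows = format_all(['{"id": 1, "text": "#blackfriday"}', "not json"])
print(rows)
```

Because each element is handled independently, a runner is free to spread the work across many machines, which is what makes this style of step easy to parallelize.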

Page 43: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● Use Maven to build, deploy or update the Pipeline.

$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming"

[...]

INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------

SHELL

Page 44: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● You can monitor your pipelines from Cloud Console.

Page 45: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● Data start flowing into BigQuery tables. You can run queries from the CLI or the Web Interface.

Yay! It works!

What now?

Page 46: Serverless Data Architecture at scale on Google Cloud Platform

analyze

Pub/Sub

Kubernetes

Dataflow

BigQuery

How we’re gonna do it

Page 47: Serverless Data Architecture at scale on Google Cloud Platform

Pub/Sub

Kubernetes

Dataflow

BigQuery

DataStudio

How we’re gonna do it

Page 48: Serverless Data Architecture at scale on Google Cloud Platform
Page 49: Serverless Data Architecture at scale on Google Cloud Platform

Pub/Sub

Kubernetes

Dataflow

BigQuery

How we’re gonna do it

enrich

DataStudio

Add magic here

Page 50: Serverless Data Architecture at scale on Google Cloud Platform

Pub/Sub

Kubernetes

Dataflow

BigQuery

How we’re gonna do it

Natural Language

API

DataStudio

Page 51: Serverless Data Architecture at scale on Google Cloud Platform

Sentiment Analysis with Natural Language API

Polarity: [-1,1]

Magnitude: [0,+inf)

Text

Page 52: Serverless Data Architecture at scale on Google Cloud Platform

Sentiment Analysis with Natural Language API

Polarity: [-1,1]

Magnitude: [0,+inf)

Text

sentiment = polarity x magnitude
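
The combined score can be written as a small function. A minimal sketch: the function name and the range checks are additions for illustration, and the inputs below are sample values rather than API output.

```python
def sentiment(polarity, magnitude):
    """Combine polarity in [-1, 1] and magnitude in [0, +inf) into one score."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be in [-1, 1]")
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")
    return polarity * magnitude


strongly_negative = sentiment(-1.0, 1.5)
mildly_positive = sentiment(0.5, 2.0)
print(strongly_negative, mildly_positive)
```

The sign of the score comes from the polarity, while the magnitude scales it, so a long, strongly worded negative text ends up further from zero than a short one.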

Page 53: Serverless Data Architecture at scale on Google Cloud Platform

Sentiment Analysis with Natural Language API

Pub/Sub Topic -> Dataflow Pipeline -> BigQuery Tables

Dataflow Pipeline:

Read tweets from Pub/Sub

  Branch 1: Format tweets for BigQuery -> Write tweets to BigQuery

  Branch 2: Filter and Evaluate sentiment -> Format tweets for BigQuery -> Write tweets to BigQuery

Page 54: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● We just add the additional necessary steps.

// TwitterProcessor.java

public static void main(String[] args) {
  Pipeline p = Pipeline.create();

  PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));

  // New steps: filter tweets, evaluate sentiment, write to a second table
  PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
  PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
  formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));

  PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
  formattedTweets.apply(BigQueryIO.Write.to(tableReference));

  p.run();
}

JAVA
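
The pipeline now has two branches fed by the same input: every tweet goes to the raw table, and a filtered, scored subset goes to the sentiment table. A plain-Python sketch of that shape follows. Everything in it is an assumption for illustration: the filter (tweets with a non-empty `text` field) is hypothetical because the talk does not show what `DoFilterAndProcess` checks, and `score_text` is a placeholder standing in for the Natural Language API call.

```python
import json


def score_text(text):
    """Placeholder scorer; a real pipeline would call the Natural Language API."""
    return {"polarity": 0.0, "magnitude": 0.0}


def run_pipeline(tweets, scorer=score_text):
    """Return (raw_rows, sentiment_rows) for a list of tweet JSON strings."""
    raw_rows, sentiment_rows = [], []
    for tweet_json in tweets:
        row = json.loads(tweet_json)
        raw_rows.append(row)            # branch 1: every tweet, unchanged
        text = row.get("text", "")
        if text:                        # branch 2: hypothetical filter
            result = scorer(text)
            scored = dict(row)          # copy so branch 1 rows stay untouched
            scored["sentiment"] = result["polarity"] * result["magnitude"]
            sentiment_rows.append(scored)
    return raw_rows, sentiment_rows


def fake_scorer(text):
    """Fixed sample values, used only to exercise the sketch."""
    return {"polarity": 0.5, "magnitude": 2.0}


raw, scored = run_pipeline(['{"text": "great deals"}', '{"id": 2}'], scorer=fake_scorer)
print(len(raw), scored)
```

Passing the scorer in as a parameter keeps the branching logic testable without network access.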

Page 55: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

● The update process preserves all in-flight data.

$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming --update --jobName=twitterprocessor-lorenzo-1107222550"

[...]

INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------

SHELL

Page 56: Serverless Data Architecture at scale on Google Cloud Platform

From Pub/Sub to BigQuery

Page 57: Serverless Data Architecture at scale on Google Cloud Platform

Pub/Sub

Kubernetes

Dataflow

BigQuery

DataStudio

We did it!

Natural Language

API

Page 58: Serverless Data Architecture at scale on Google Cloud Platform

Pub/Sub

Kubernetes

Dataflow

BigQuery

DataStudio

We did it!

Natural Language

API

“To serve and protect”

Page 59: Serverless Data Architecture at scale on Google Cloud Platform

Live demo

Page 60: Serverless Data Architecture at scale on Google Cloud Platform

Polarity: -1.0
Magnitude: 1.5

Polarity: -1.0
Magnitude: 2.1

Open on Thursday? Bad idea..

Beta test your deals!

Page 61: Serverless Data Architecture at scale on Google Cloud Platform

Thank you!

[email protected]

