MongoDB Europe 2016 - Warehousing MongoDB Data using Apache Beam and BigQuery

Post on 07-Jan-2017


Warehousing MongoDB Data Using Apache Beam and BigQuery

Sandeep Parikh
Head of Solutions Architecture, Americas East
@crcsmnky

Google Cloud Platform

About Me

Agenda

MongoDB on Google Cloud Platform

What is Data Warehousing

Tools & Technologies

Example Use Case

Show, Don’t Tell


MongoDB on Google Cloud Platform


Manually Deploying MongoDB


Google Cloud Launcher


MongoDB Cloud Manager


How do you automate this?


Bootstrapping MongoDB Cloud Manager

Deployment Manager Template


Cloud Deployment Manager

Provision, configure your deployment

Configuration as code

Declarative approach to configuration

Template-driven

Supports YAML, Jinja, and Python

Use schemas to constrain parameters

References control order and dependencies
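As a rough illustration of the points above, here is a minimal sketch of a Deployment Manager template written in Python. It is not the Jinja template from the talk's repository. The property names (zone, image, mmsGroupId, mmsApiKey), the machine type, the resource names, and the metadata keys are all assumptions made for the example. Each instance attaches its data disk through a `$(ref.…)` reference, which is how a template expresses ordering and dependencies.

```python
from types import SimpleNamespace

COMPUTE_URL = "https://www.googleapis.com/compute/v1/"


def generate_config(context):
    """Build three instances, each attached to its own 500 GB PD-SSD disk."""
    project = context.env["project"]
    deployment = context.env["deployment"]
    props = context.properties  # hypothetical property names, see lead-in
    zone = props["zone"]
    zone_url = f"{COMPUTE_URL}projects/{project}/zones/{zone}"

    resources = []
    for i in range(3):
        disk_name = f"{deployment}-data-{i}"
        # Standalone SSD persistent disk for the data volume.
        resources.append({
            "name": disk_name,
            "type": "compute.v1.disk",
            "properties": {
                "zone": zone,
                "sizeGb": 500,
                "type": f"{zone_url}/diskTypes/pd-ssd",
            },
        })
        # The instance refers to the disk, so the disk is created first.
        resources.append({
            "name": f"{deployment}-node-{i}",
            "type": "compute.v1.instance",
            "properties": {
                "zone": zone,
                "machineType": f"{zone_url}/machineTypes/n1-standard-4",
                "disks": [
                    {"boot": True, "autoDelete": True,
                     "initializeParams": {"sourceImage": props["image"]}},
                    {"deviceName": "data",
                     "source": f"$(ref.{disk_name}.selfLink)"},
                ],
                "networkInterfaces": [{
                    "network": f"{COMPUTE_URL}projects/{project}"
                               "/global/networks/default",
                }],
                # Assumed metadata keys that a startup script could read.
                "metadata": {"items": [
                    {"key": "mmsGroupId", "value": props["mmsGroupId"]},
                    {"key": "mmsApiKey", "value": props["mmsApiKey"]},
                ]},
            },
        })
    return {"resources": resources}


# Local dry run with a stand-in for the context Deployment Manager supplies.
fake_context = SimpleNamespace(
    env={"project": "my-project", "deployment": "mongodb-cloud-manager"},
    properties={"zone": "us-east1-b", "image": "IMAGE_URL",
                "mmsGroupId": "MMSGROUPID", "mmsApiKey": "MMSAPIKEY"},
)
config = generate_config(fake_context)
```

The dry run only builds the resource dictionary locally; nothing is deployed.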


Bootstrapping Cloud Manager

Schema, Configuration & Template

Posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager

Three Compute Engine instances, each with 500 GB PD-SSD

MongoDB Cloud Manager automation agent pre-installed and configured

$ gcloud deployment-manager deployments create mongodb-cloud-manager \
    --config mongodb-cloud-manager.jinja \
    --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY


What’s a Data Warehouse

Data warehouses are central repositories of integrated data from one or more disparate sources.

https://en.wikipedia.org/wiki/Data_warehouse


Data Warehouse

[Diagram labels: Money; Data; Data; Data; Insights; Profit!]


Tools and Technologies


Where: BigQuery

Complex, Petabyte-scale data warehousing made simple

Scales automatically; No setup or admin

Foundation for analytics and machine learning


How: Apache Beam (incubating)

[Timeline diagram: MapReduce, Bigtable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel, Google Cloud Dataflow, Apache Beam]


Understand What, Where, When, How

1. Classic Batch

2. Windowed Batch

3. Streaming

4. Streaming + Accumulation
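To make the step from classic batch to windowed batch concrete, here is a minimal sketch in plain Python, not the Beam API. It assigns each event to a fixed five-minute window by its event timestamp and counts per window and key. The event tuples and function names are made up for the illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # fixed windows of five minutes


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % size)


def count_per_window(events, size=WINDOW_SECONDS):
    """Count events per (window start, key) instead of per key overall."""
    counts = defaultdict(int)
    for timestamp, key in events:
        counts[(window_start(timestamp, size), key)] += 1
    return dict(counts)


# (event time in seconds, key) pairs; hypothetical sample data.
events = [(0, "a"), (120, "a"), (299, "b"), (300, "a"), (610, "b")]
counts = count_per_window(events)
```

A classic batch count would report three "a" events in total; the windowed version reports two in the window starting at 0 and one in the window starting at 300.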


Pipelines in Beam

Pipeline p = Pipeline.create();
p.begin()
 .apply(TextIO.Read.from("gs://…"))
 .apply(ParDo.of(new ExtractTags()))
 .apply(Count.create())
 .apply(ParDo.of(new ExpandPrefixes()))
 .apply(Top.largestPerKey(3))
 .apply(TextIO.Write.to("gs://…"));
p.run();

Batch to Streaming

Pipeline p = Pipeline.create();
p.begin()
 .apply(PubsubIO.Read.from("input_topic"))
 .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
 .apply(ParDo.of(new ExtractTags()))
 .apply(Count.create())
 .apply(ParDo.of(new ExpandPrefixes()))
 .apply(Top.largestPerKey(3))
 .apply(PubsubIO.Write.to("output_topic"));
p.run();
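The slide does not define ExtractTags or ExpandPrefixes. Assuming they extract hashtags from a line of text and expand each tag into all of its prefixes, the data flow of the batch pipeline can be sketched in plain Python (not the Beam API) as follows. All function names and sample lines are illustrative.

```python
import heapq
import re
from collections import Counter, defaultdict


def extract_tags(line):
    """Assumed ExtractTags behaviour: pull hashtags out of a line of text."""
    return re.findall(r"#(\w+)", line.lower())


def expand_prefixes(tag, count):
    """Assumed ExpandPrefixes behaviour: emit (prefix, (count, tag)) pairs."""
    for end in range(1, len(tag) + 1):
        yield tag[:end], (count, tag)


def top_tags_per_prefix(lines, n=3):
    """Count tags, key them by prefix, keep the n largest per prefix."""
    counts = Counter(tag for line in lines for tag in extract_tags(line))
    by_prefix = defaultdict(list)
    for tag, count in counts.items():
        for prefix, entry in expand_prefixes(tag, count):
            by_prefix[prefix].append(entry)
    return {prefix: heapq.nlargest(n, entries)
            for prefix, entries in by_prefix.items()}


lines = ["#beam and #bigquery", "#beam with #mongodb", "#bigtable", "#beam"]
result = top_tags_per_prefix(lines)
```

Each step maps to one apply call in the pipeline: a per-element transform, a count, another per-element transform, and a top-N per key.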


Apache Beam Vision

Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages)

Beam Model: Fn Runners (Apache Flink, Apache Spark, Cloud Dataflow)

Execution


Running Apache Beam

Cloud Dataflow

Local Runner


Cloud Dataflow Service

A great place for executing Beam pipelines, which provides:

● Fully managed, no-ops execution environment

● Integration with Google Cloud Platform

● Java support in GA, Python in Alpha

Fully Managed: Worker Lifecycle Management

Deploy, Tear Down

Fully Managed: Dynamic Worker Scaling

Fully Managed: Dynamic Work Rebalancing

100 mins. vs. 65 mins.

Integrated: Monitoring UI

Integrated: Distributed Logging

Integrated: Part of Google Cloud Platform

[Diagram stages: Capture, Store, Process (Batch, Stream), Analyze]

[Diagram components: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Datastore, Cloud Dataflow, Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, Real time analytics and Alerts]


Example Use Case


Sensor Data
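The transcript does not show the sensor schema or the demo code. As a rough sketch of the kind of transform such a pipeline needs, the plain-Python example below flattens one hypothetical MongoDB sensor document into a flat, JSON-serializable row of the sort that could be loaded into a BigQuery table. Every field name here is an assumption, and the _id is a string stand-in for an ObjectId.

```python
import json
from datetime import datetime, timezone


def to_bigquery_row(doc):
    """Flatten one hypothetical sensor document into a flat row."""
    return {
        # ObjectId has no BigQuery equivalent, so store it as a string.
        "id": str(doc["_id"]),
        "sensor_id": doc["sensorId"],
        # Render the timestamp as a UTC "YYYY-MM-DD HH:MM:SS" string.
        "recorded_at": doc["ts"].astimezone(timezone.utc)
                                .strftime("%Y-%m-%d %H:%M:%S"),
        # Lift nested readings into top-level numeric columns.
        "temperature": float(doc["reading"]["temperature"]),
        "humidity": float(doc["reading"]["humidity"]),
    }


# Hypothetical document; a real one would come from a MongoDB collection.
doc = {
    "_id": "586f5a3c0000000000000001",
    "sensorId": "sensor-42",
    "ts": datetime(2016, 11, 15, 9, 30, 0, tzinfo=timezone.utc),
    "reading": {"temperature": 21.5, "humidity": 40},
}
row = to_bigquery_row(doc)
json_line = json.dumps(row, sort_keys=True)  # one newline-delimited JSON record
```

In a Beam pipeline a function like this would run once per document inside a per-element transform, between the MongoDB read and the BigQuery write.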


Show, Don’t Tell

Insert Demo Here