Post on 07-Jan-2017
Warehousing MongoDB Data Using Apache Beam and BigQuery
Sandeep Parikh
Head of Solutions Architecture, Americas East
@crcsmnky
Google Cloud Platform 2
About Me
Agenda
MongoDB on Google Cloud Platform
What is Data Warehousing
Tools & Technologies
Example Use Case
Show, Don’t Tell
MongoDB on Google Cloud Platform
Manually Deploying MongoDB
Google Cloud Launcher
MongoDB Cloud Manager
How do you automate this?
Bootstrapping MongoDB Cloud Manager
Deployment Manager Template
Cloud Deployment Manager
Provision and configure your deployment
Configuration as code
Declarative approach to configuration
Template-driven
Supports YAML, Jinja, and Python
Use schemas to constrain parameters
References control order and dependencies
Bootstrapping Cloud Manager
Schema, Configuration & Template
Posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
Three Compute Engine instances, each with 500 GB PD-SSD
MongoDB Cloud Manager automation agent pre-installed and configured
$ gcloud deployment-manager deployments create mongodb-cloud-manager \
--config mongodb-cloud-manager.jinja \
--properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
What’s a Data Warehouse
Data warehouses are central repositories of integrated data from one or more disparate sources.
https://en.wikipedia.org/wiki/Data_warehouse
Data Warehouse
[Diagram labels: Data (×3), Money, Insights, Profit!]
Tools and Technologies
Where: BigQuery
Complex, Petabyte-scale data warehousing made simple
Scales automatically; No setup or admin
Foundation for analytics and machine learning
How: Apache Beam (incubating)
[Diagram: lineage of Google data systems: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]
Understand What, Where, When, How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
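To make step 2 (windowed batch) concrete, here is a minimal sketch in plain Java of assigning events to fixed event-time windows and counting per window. It does not use the Beam SDK; the class and method names (FixedWindowSketch, windowStart, countPerWindow) are hypothetical and exist only for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain-Java fixed windowing, not the Beam API.
final class FixedWindowSketch {

    /** Returns the start of the fixed window that contains eventTime. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps timestamps before the epoch in the correct window.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    /** Counts events per fixed window, keyed by window start time. */
    static Map<Instant, Integer> countPerWindow(List<Instant> eventTimes, Duration size) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Instant t : eventTimes) {
            counts.merge(windowStart(t, size), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Instant> events = List.of(
            Instant.parse("2017-01-07T10:01:00Z"),
            Instant.parse("2017-01-07T10:04:59Z"),
            Instant.parse("2017-01-07T10:05:00Z"));
        // The first two events share the 10:00 window; the third opens the 10:05 window.
        System.out.println(countPerWindow(events, Duration.ofMinutes(5)));
    }
}
```

Windows here are half-open intervals aligned to the epoch, so an event at exactly 10:05:00 falls into the window that starts at 10:05, not the one that ends there.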
Pipelines in Beam
// Slide-level code in the pre-2.0 Beam Java SDK style.
// ExtractTags and ExpandPrefixes are user-defined DoFns that are not shown on the slide.

// Batch: read from and write to Cloud Storage.
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://…"))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://…"));
p.run();

// Streaming: same transforms, with Pub/Sub as source and sink plus a 5-minute fixed window.
Pipeline p = Pipeline.create();
p.begin()
    .apply(PubsubIO.Read.topic("input_topic"))
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.Write.topic("output_topic"));
p.run();
Batch to Streaming
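The slide does not show what the transforms compute, so the following is a hedged, in-memory sketch of one plausible reading: count each tag, expand every tag into its prefixes, and keep the most frequent tags per prefix (an autocomplete-style result). It uses only the Java standard library, not Beam; the class and method names are hypothetical, and the assumption that ExpandPrefixes emits all prefixes of a tag is mine, inferred from its name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: an in-memory analogue of Count -> ExpandPrefixes -> Top per key.
final class TopTagsSketch {

    /** Returns, for every prefix of every tag, the n most frequent tags with that prefix. */
    static Map<String, List<String>> topTagsPerPrefix(List<String> tags, int n) {
        // "Count": occurrences of each distinct tag.
        Map<String, Long> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1L, Long::sum);
        }

        // "ExpandPrefixes" (assumed behavior): group each tag under all of its prefixes.
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(tag);
            }
        }

        // "Top": most frequent first; ties broken alphabetically for a stable result.
        Comparator<String> byCountDescThenName =
            Comparator.comparing((String t) -> counts.get(t)).reversed()
                .thenComparing(Comparator.<String>naturalOrder());

        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : byPrefix.entrySet()) {
            result.put(entry.getKey(), entry.getValue().stream()
                .sorted(byCountDescThenName)
                .limit(n)
                .collect(Collectors.toList()));
        }
        return result;
    }
}
```

In a real pipeline each of these loops would be a distributed transform; the sketch only shows the data flow that the chained apply calls describe.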
Apache Beam Vision
[Diagram: Beam Java, Beam Python, and other languages → Beam Model: Pipeline Construction → Beam Model: Fn Runners → execution on Apache Flink, Apache Spark, or Cloud Dataflow]
Running Apache Beam
Cloud Dataflow, Local Runner
Cloud Dataflow Service
A great place for executing Beam pipelines, providing:
● Fully managed, no-ops execution environment
● Integration with Google Cloud Platform
● Java support in GA, Python in Alpha
Fully Managed: Worker Lifecycle Management
Deploy, Tear Down
Fully Managed: Dynamic Worker Scaling
Fully Managed: Dynamic Work Rebalancing
100 mins. vs. 65 mins.
Integrated: Monitoring UI
Integrated: Distributed Logging
Integrated: Part of Google Cloud Platform
[Diagram: a Capture, Store, Process (Batch, Stream), Analyze flow. Components shown: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Datastore, Cloud Dataflow, Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, real-time analytics and alerts]
Example Use Case
Sensor Data
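The transcript does not include the sensor schema or the demo code, so the following is only a sketch of one step such a pipeline might need: flattening a nested MongoDB-style document into a flat row with underscore-separated column names. The field names in the test data, the class name, and the choice to flatten at all are assumptions; BigQuery can also store nested records, so flattening is a design option, not a requirement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: hypothetical helper, not taken from the talk's demo.
final class FlattenDocumentSketch {

    /** Flattens nested maps into one row, joining nested keys with underscores. */
    static Map<String, Object> flatten(Map<String, Object> doc) {
        Map<String, Object> row = new LinkedHashMap<>();
        flattenInto("", doc, row);
        return row;
    }

    private static void flattenInto(String prefix, Map<String, Object> doc, Map<String, Object> row) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String column = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                // Nested sub-document: recurse, carrying the column prefix along.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                flattenInto(column, nested, row);
            } else {
                // Leaf value: becomes one column in the flat row.
                row.put(column, value);
            }
        }
    }
}
```

Arrays are treated as leaf values here; a real pipeline would have to decide whether to map them to repeated fields or to separate rows.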
Show, Don’t Tell
Insert Demo Here