When it comes to running your data in the public cloud, there is a range of
Database-as-a-Service (DBaaS) offerings from all three major public cloud
providers. Knowing which is best for your use case can be challenging. This
paper provides a high-level overview of the main DBaaS offerings from Amazon,
Microsoft, and Google.
After reading this white paper, you’ll have a high-level understanding of the most
popular data repositories and data analytics service offerings from each vendor,
you’ll know the key differences among them, and you’ll know which ones are best suited
to each use case. With this information, you can direct your more detailed research
to a manageable number of options.
CHOOSING A DATABASE-AS-A-SERVICE
Warner Chaves, Principal Consultant, Microsoft Certified Master, Microsoft MVP
With Contributors
Danil Zburivsky, Director of Big Data and Data Science
Vladimir Stoyak, Principal Consultant for Big Data, Certified Google Cloud
Platform Qualified Developer
Derek Downey, Practice Advocate, OpenSource Databases
Manoj Kukreja, Big Data and IT Security Specialist, CISSP, CCAH and OCP
AN OVERVIEW OF OFFERINGS BY MAJOR PUBLIC CLOUD SERVICE PROVIDERS
This white paper does not discuss private cloud providers or colocation environments,
streaming, data orchestration, or Infrastructure-as-a-Service (IaaS) offerings.
This paper is targeted to IT professionals with a good understanding of databases
and also business people who want an overview of data platforms in the cloud.
WHAT IS A DBAAS OFFERING?
A DBaaS is a database running in the public cloud. Three things define a DBaaS:
• The service provider installs and maintains the database software, including
backups and other common database administration tasks. The service
provider also owns and manages the operating system, hypervisors, and
bare metal hardware.
• Application owners pay according to their usage of the service.
• Usage of the service must be flexible—users can scale up or down on
demand and also create and destroy environments on demand. These
operations should be possible through code with no provider intervention.
FOUR CATEGORIES OF DBAAS OFFERINGS
To keep things simple, we’ve created four categories of DBaaS offerings. Your
vehicles of choice are:
• The Corollas: These are the classic RDBMS services in the cloud: Amazon
Relational Database Service (RDS), Microsoft Azure SQL Database, and
Google Cloud SQL.
• The Formula One offerings: These special-purpose offerings ingest and
query data very quickly but might not offer all the amenities of the Corollas.
Options include Amazon DynamoDB, Microsoft Azure DocumentDB, Google
Cloud Datastore, and Google Cloud Bigtable.
• The 18-wheelers: These data warehouses of structured data in the cloud
include Amazon Redshift, Microsoft Azure SQL Data Warehouse, and Google
BigQuery.
• The container ships: These Hadoop-based big-data systems can carry
anything, and include Amazon Elastic MapReduce (EMR), Microsoft Azure
HDInsight, and Google Cloud Dataproc. This category also includes the
further automated offering of Azure Data Lake.
The rest of this white paper discusses each category and the Amazon, Microsoft,
and Google offerings within each category. We describe each offering, explain
what it is well suited for, provide expert tips or additional relevant information, and
provide high-level pricing information.
COROLLAS
With the Corollas, just like with the car, you know what you’re getting, and you know
what to expect. This type of classic RDBMS service gets you from point A to point B
reliably. It’s not the flashiest or newest thing on the block, but it gets the job done.
AMAZON RDS
Amazon Relational Database Service (RDS) is the granddaddy of DBaaS offerings
available on the Internet. RDS is an automation layer that Amazon has built on
top of MySQL, MariaDB, Oracle, PostgreSQL, and SQL Server. Amazon has also
developed its own MySQL fork called Amazon Aurora, which also lives inside RDS.
RDS is an easy way to transition into DBaaS because the service mimics the on-
premises experience very closely. You simply need to provision an RDS instance,
which maps very closely to the virtual machine models that Amazon offers.
Amazon then installs bits, manages patches and backups, and can also manage
the high availability, so you do not need to plan and execute these tasks yourself.
RDS is very good for lift-and-shift types of cloud migrations. It makes it easy for
existing staff to take advantage of the service because it mimics the on-premises
experience, be it physical or virtual.
EXPERT TIP
The storage is very flexible: this is both a pro and a con. The pro is that you have a lot of
control over storage. The con is that there are so many storage options, you need the
knowledge to choose the best one for your use case.
Amazon has general storage, provisioned IOPS (input/output operations per second),
and two categories of magnetic storage. The storage method you choose will depend
on your particular use cases.
You need to be aware that Amazon does not make every patch version of all products
available on RDS. Instead, Amazon makes only some major service packs or Oracle
patch levels available. As a result, the exact patch level that you have on premises might
not map to a patch level on RDS. In this situation, do not move to a patch level that is
below the patch level you have because that may result in product regressions. Instead,
wait until Amazon has deployed a patch level higher than what you have. At this point, it
should be fairly safe to start testing if you want to migrate to RDS.
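To make the provisioning step concrete, here is a minimal, illustrative sketch of creating an RDS instance with the AWS SDK for Python (boto3). The identifiers, credentials, and sizes are placeholders, and the storage type shows the general storage versus provisioned IOPS choice described above.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Identifiers, credentials, and sizes below are illustrative placeholders.
    rds.create_db_instance(
        DBInstanceIdentifier="orders-prod",
        Engine="postgres",
        DBInstanceClass="db.m4.large",   # maps closely to an EC2 machine model
        AllocatedStorage=200,            # in GB
        StorageType="io1",               # provisioned IOPS; "gp2" is the general storage option
        Iops=2000,
        MultiAZ=True,                    # let Amazon manage high availability
        BackupRetentionPeriod=7,         # automated backups kept for seven days
        MasterUsername="dbadmin",
        MasterUserPassword="change-me",
    )

Amazon takes over from there: the software is installed, patched, and backed up without further intervention.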
HOW IT’S PRICED
The hourly rate for RDS depends on:
• whether you have your own license or if Amazon is leasing you the license;
• how much compute power you choose: The number of cores, and amount of
memory and temporary disk you want on this instance;
• the storage you require; and
• whether you pre-purchased with Reserved Instances.
MICROSOFT AZURE SQL DATABASE
Microsoft Azure SQL Database is a “cloud-first” SQL Server fork. The term “cloud-
first” means that Microsoft now tests and deploys their code continuously with Azure
SQL Database, and the code and lessons learned are implemented in the retail SQL
Server product—whether the product is on premises or on a virtual machine.
Even if you don’t have any investment in SQL Server, Azure SQL Database is an
excellent DBaaS platform because of the investments made to support the elastic
capabilities and the ease of scaling horizontally. As you need more capacity, you
just add more databases.
It’s also easy to manage the databases by pooling resources, performing elastic
queries, and performing elastic job executions. You could deploy your own code
to do something similar in Amazon RDS, but in Azure SQL Database, Microsoft has
already built it for you.
In addition, Azure SQL Database makes it easy to build an elastic application on a
relational service. This capability supports the Software-as-a-Service (SaaS) model,
wherein you have many clients and each has a database. The SaaS provider has
a data layer that is easier to manage and scale than if they were running on their
own infrastructure.
Unlike Amazon RDS, Azure SQL Database does not exactly map to a type of retail
database, such as Oracle, SQL Server, or open-source MySQL. It is closely related
to SQL Server but it’s not licensed or sold in a similar way. As a result, Azure SQL
Database does not have any licensing component.
At the same time, Azure SQL Database does not give you a lot of control over the
hardware. With Amazon RDS, you need to select CPUs, memory, and your storage
layout. Azure SQL Database does all this for you.
With Azure SQL Database the only thing that you need to choose is the service
tier. Your choice determines how much power your database has. There are three
service tiers: basic, standard, and premium. Each of these also has some sub-tiers
to increase or decrease performance. If you have many databases in Azure SQL
Database, you can also choose the elastic database pool pricing option to increase
your savings by sharing resources.
Azure SQL Database is a good choice if you already have Transact-SQL (T-SQL)
skills in-house. If you have a large investment in SQL Server, Azure SQL Database
is the most natural way to take advantage of DBaaS offerings in the cloud. It’s also a
very good web scale relational service in its own right because of all the investments
made to support the SaaS model.
EXPERT TIP
You do need to ensure that you do the proper SQL tuning to be able to choose
the right service tier for your needs. In the past, it was more difficult to scale up
because all equipment was on premises. Now, it’s very easy to increase the power
of the service and therefore pay more money. However, just because scaling up is
easy does not mean it’s always what you need to do. If you perform the proper SQL
tuning, you will not need to pay more for raw power.
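If you do decide that a database needs more (or less) power, the move between service tiers can be made on demand with a T-SQL statement. The following is a minimal, illustrative sketch using the pyodbc driver from Python; the server, database, and credential values are placeholders.

    import pyodbc

    # Server, database, and credentials are placeholders; connect to the logical server's master database.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=master;"
        "UID=dbadmin;PWD=change-me",
        autocommit=True,
    )

    # Move the database to a different service tier on demand (here, Standard S2).
    conn.execute("ALTER DATABASE [orders] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');")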
HOW IT’S PRICED
Azure SQL Database has a simple pricing model. You pay an hourly rate for the service tier
your database is running on: Basic, Standard, or Premium. Each has a different size limit for
the database and provides more performance as you go up in the tier.
GOOGLE CLOUD SQL
Google Cloud SQL is a MySQL managed database service that is very similar to
Amazon RDS for MySQL and Amazon Aurora. You select an instance and deploy it
without needing to install any software.
Cloud SQL automates all your backups, replication, patches, and updates anywhere
in the world while ensuring greater than 99.95 percent availability. Automatic failover
ensures your database will be available when you need it.
Cloud SQL Second Generation introduces per-minute, pay-per-use billing, automatic
sustained use discounts, and instance tiers to fit any budget.
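Because Cloud SQL exposes a standard MySQL endpoint, applications talk to it with any ordinary MySQL driver. The sketch below uses the PyMySQL driver from Python; the host, credentials, and table are placeholders.

    import pymysql

    # Host, credentials, and schema below are illustrative placeholders.
    conn = pymysql.connect(host="203.0.113.10", user="app", password="change-me", db="inventory")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM products WHERE stock > %s", (0,))
            for product_id, name in cur.fetchall():
                print(product_id, name)
    finally:
        conn.close()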
EXPERT TIP
Cloud SQL does have restrictions on:
• anything related to loading/dumping the database to a local file system,
• installing plugins,
• creating user-defined functions,
• performance schema,
• SUPER privileges, and
• Storage engines: InnoDB is the only one supported for Second Generation
instances
HOW IT’S PRICED
Pricing for Cloud SQL Second Generation is made up of three components: instance
pricing, storage pricing, and network pricing. The charge is based on the machine
type you choose for the instance. Storage and network pricing are separate charges.
FORMULA ONE OFFERINGS
The Formula One DBaaS offerings are fit-for-purpose offerings. They do not have
all the functionality of the mature RDBMS products but they do a limited number of
things very well.
A Formula One car is built purely for speed. It does not have a cup holder, heated seats,
or satellite radio. However, it’s fit for purpose—and that purpose is to go fast. (Admittedly,
you might miss some of the amenities that you are used to with a regular car.)
Similarly, the Formula One DBaaS offerings are built for purpose. That purpose is
to ingest and query data very quickly. Think of them as NoSQL in the cloud. The
NoSQL movement was popularized by large web applications such as Google and
Facebook as a way to differentiate their database platforms from the classic RDBMS
offerings. Usually NoSQL products handle horizontal scalability with more ease,
have more relaxed restrictions on schema (if any), and forego some of the ACID
requirements as a trade-off for more speed.
AMAZON DYNAMODB
Amazon DynamoDB is a very popular service offered through Amazon Web Services
(AWS). It’s basically a NoSQL document/key value table store. All you need to define is
the table and either its key or its key and sort order. The schema is completely flexible
and is up to you. DynamoDB is best suited for applications with known query patterns
that don’t require complex transactions and that ingest large volumes of data.
DynamoDB is built for scale-out growth of high-ingest applications because the
Amazon scale-out architecture guarantees that you will not run out of space. You don’t
need to worry about the scale-out; you just need to know that this is how Amazon
has architected the service. For example, when you specify a partition key for your
records, they will be distributed across the nodes that Amazon builds transparently
behind the scenes for your data, based on that key.
This offering does not have an optimizer, so it does not support ad hoc SQL querying
the way a relational product does. It’s more like a set of denormalized, materialized views
based on the indexes that you have created on your data.
Querying is not done with SQL, it is performed through a different type of specification.
Amazon provides SDKs in many languages, including Java, .NET, and Python. You use
these SDKs to develop queries. This process does require a bit of learning but that’s
not a major time investment.
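As a hedged sketch of that workflow with the Python SDK (boto3): you declare only the table and its keys, write items with whatever attributes you like, and query by key condition rather than with ad hoc SQL. The table and attribute names are placeholders.

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

    # Only the key schema is declared; every other attribute is free-form.
    table = dynamodb.create_table(
        TableName="SensorReadings",
        KeySchema=[
            {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
            {"AttributeName": "ts", "KeyType": "RANGE"},         # sort key
        ],
        AttributeDefinitions=[
            {"AttributeName": "device_id", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "N"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )
    table.wait_until_exists()

    # Items can carry lists, maps, and sets alongside the declared keys.
    table.put_item(Item={"device_id": "thermostat-1", "ts": 1472688000,
                         "temp_c": 22, "tags": ["lab", "floor-2"]})

    # Querying is a key-condition lookup, not an ad hoc SQL statement.
    response = table.query(KeyConditionExpression=Key("device_id").eq("thermostat-1"))
    print(response["Items"])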
Although DynamoDB does not have a fixed schema, it does support complex
schemas. For example, fields are denormalized: some fields could be lists, some could
be maps or sets. This service also exposes a stream-based API, so if you need to
replicate the data changes from DynamoDB to another system, you can do so through
the stream-based API.
EXPERT TIP
Because this service does not support ad hoc querying, your schema can have a
huge impact on what you’re allowed to do on your application. DynamoDB also has
a finite number of indexes that you can apply: five global secondary indexes per table. You
need to keep in mind the indexing limits and lack of an optimizer, and ensure that
your schema will be able to support your future application requirements.
HOW IT’S PRICED
The cost of DynamoDB is based on storage, how much data you have, and the I/O
rate: your number of requests for read units and write units. If you have any streams,
you will need to pay for the streams’ read rate.
MICROSOFT AZURE DOCUMENTDB
Microsoft Azure DocumentDB is a NoSQL document database that is basically a
repository for JSON (JavaScript Object Notation) documents. JSON documents
have no schema restrictions. They can contain almost any type of field, and they can
also have nested fields. This DBaaS is NoSQL and denormalized, with built-in support
for partitioned collections, so you can specify a field in the JSON documents and Azure
DocumentDB will partition the documents based on that field.
Azure DocumentDB also has built-in geo-replication support, so you can have, for
example, an Azure DocumentDB collection reading and writing on the east coast
of the United States and a replica of this collection that you can use for reads in the
central United States. If there’s an issue with the DocumentDB on the east coast, you
can failover to the other geo-region for very high availability.
Azure DocumentDB is a good choice for JSON-based storage, and it’s very easy
to set up and start storing documents. Retrieval is also easy because this database
supports full-blown SQL-style queries, so you don’t need to learn any new query
language.
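A minimal sketch of that flow with the Python SDK of the time (pydocumentdb) might look like the following. The endpoint, key, and resource names are placeholders, and the exact client calls may differ between SDK versions.

    import pydocumentdb.document_client as document_client

    # Endpoint, key, and resource names below are illustrative placeholders.
    client = document_client.DocumentClient(
        "https://myaccount.documents.azure.com:443/", {"masterKey": "change-me"})

    database = client.CreateDatabase({"id": "store"})
    collection = client.CreateCollection(database["_self"], {"id": "orders"})

    # JSON documents need no fixed schema; nested fields are fine.
    client.CreateDocument(collection["_self"],
                          {"id": "1001", "customer": "Acme", "items": [{"sku": "A1", "qty": 3}]})

    # Retrieval uses familiar SQL-style queries.
    for doc in client.QueryDocuments(collection["_self"],
                                     "SELECT * FROM orders o WHERE o.customer = 'Acme'"):
        print(doc["id"], doc["items"])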
EXPERT TIP
If you don’t specify any indexes, the system has some automatic indexing policies.
However, keep in mind that indexing consumes storage, so the more
indexes you have, the more storage you will consume—and you will pay for that amount
of storage. The storage could be for indexes that you do not use, so ensure that the
automatic indexing policies work for your use case.
If it doesn’t make sense to have an index on a field because you never search on it, you
can disable the index through a custom policy. Also, partitioned collections have limits:
each partition key can hold no more than 10 GB of documents. If you need more than this
amount per partition key, you will probably want to design with a very high-granularity
partition key.
HOW IT’S PRICED
Azure DocumentDB offers some pre-defined tiers for billing based on common usage
patterns. However, if you want to customize the system, you can easily select your
individual compute power, referred to as “request units”, plus the amount of storage that
you want for the collection.
GOOGLE CLOUD DATASTORE
Google Cloud Datastore is Google’s version of a NoSQL cloud service similar
to Amazon DynamoDB and Microsoft Azure DocumentDB. From an architecture
perspective, Cloud Datastore is similar to other key/value stores. The data model
is organized based on “entities”, which loosely resemble rows in a relational table.
Entities can have multiple properties but no rigid schema is imposed on entities.
Two different entities of a similar type don’t need to have the same number or
type of properties. An interesting feature of Cloud Datastore is built-in support for
hierarchical data.
In addition to all the properties you would expect from a cloud NoSQL DBaaS,
such as massive scalability, high availability, and flexible storage, Cloud Datastore
also supports some unique properties, including out-of-the-box transaction support
and encryption at rest.
Google also provides tight integration of Cloud Datastore with other Google Cloud
Platform services. Applications running in Google App Engine can use Cloud
Datastore as their default database. You can also load data from Cloud Datastore
into Google BigQuery for analytics purposes.
There are multiple ways to access data in Cloud Datastore. There are client
libraries for most popular programming languages as well as a REST interface.
Google also provides a GQL language that is roughly modelled on SQL and can
provide an easier transition from relational databases to the NoSQL world.
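The sketch below, using the google-cloud-datastore Python client library, illustrates the entity model and access paths described above; the kind and property names are placeholders.

    from google.cloud import datastore

    client = datastore.Client(project="my-project")   # project ID is a placeholder

    # An entity loosely resembles a row, but no rigid schema is imposed.
    key = client.key("Task", "review-dbaas")
    task = datastore.Entity(key=key)
    task.update({"description": "Review DBaaS options", "done": False, "priority": 4})
    client.put(task)

    # Fetching an entity by its key is the most common (and free) operation.
    print(client.get(key))

    # Simple single-property queries rely on the automatic built-in indexes.
    query = client.query(kind="Task")
    query.add_filter("done", "=", False)
    for entity in query.fetch():
        print(entity["description"])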
EXPERT TIP
Cloud Datastore automatically indexes all properties for an entity, making simple
single-property queries possible without any additional configuration. More
complex multi-property indexes can be created by defining them in a special
configuration file.
HOW IT’S PRICED
Similar to other cloud NoSQL services, Cloud Datastore is priced according to the
amount of storage the database requires and the number of operations it performs.
Google defines prices for reads, writes, and deletes per 100,000 entities. However,
simple requests, such as fetching an entity by its key (a very common operation),
are free.
GOOGLE CLOUD BIGTABLE
Google Cloud Bigtable is Google’s NoSQL big-data database service. It’s the cloud
version of the same database that powers many core Google services, including
Search, Analytics, Maps, and Gmail.
Bigtable is designed to handle massive workloads at consistent low latency
and high throughput, so it’s a great choice for both operational and analytical
applications, including Internet of Things (IoT) use cases, user analytics, and
financial data analysis.
This public cloud service gives you instant access to all the engineering effort that
was put into Bigtable at Google over the years. The Apache HBase-like database
is flexible and robust, and lacks some of the inherited HBase issues, such as Java
garbage collection (GC) stalls. In addition, Cloud Bigtable is completely managed, so you
don’t need to provision hardware, install software, or handle failures.
EXPERT TIP
Cloud Bigtable does not have strong typing; it’s basically a massive key value
table. As data comes in, it is treated as binary strings. This DBaaS also does not
have any type of querying through SQL. You have the key, then you can get the
value. Cloud Bigtable is also built for very large tables, so it’s not worth considering
this for anything less than a table of 1 terabyte.
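A hedged sketch of that key-to-value access pattern, using the google-cloud-bigtable Python client and assuming the instance, table, and a "readings" column family already exist (all names are placeholders):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("my-instance")
    table = instance.table("sensor-readings")   # table and column family assumed to exist

    # Write: values are stored as binary strings under a row key.
    row = table.row(b"device#thermostat-1#20160901")
    row.set_cell("readings", b"temp_c", b"21.5")
    row.commit()

    # Read: you have the key, you get the value back.
    row_data = table.read_row(b"device#thermostat-1#20160901")
    print(row_data.cells["readings"][b"temp_c"][0].value)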
HOW IT’S PRICED
Pricing for Cloud Bigtable is based on: the number of Cloud Bigtable nodes that
you provision in your project for each hour (you will be billed for a minimum of one
hour); the amount of storage that your tables use over a one-month period; and
the amount of network bandwidth used. Some types of network egress traffic are
subject to bandwidth charges.
18-WHEELERS
The 18-wheelers can handle the heavy load of structured data. These are basically
data warehouses in the cloud. They store and easily query large amounts of
structured data.
AMAZON REDSHIFT
Amazon Redshift is the granddaddy of the 18-wheeler DBaaS offerings. This is
Amazon’s modified PostgreSQL with columnar storage.
Other columnar storage-type offerings include HPE Vertica, Microsoft SQL
Server Parallel Data Warehouse (PDW) and SQL Server Column stores, Oracle
Exadata Database Machine (Exadata), and Oracle Database In-Memory. All these
technologies achieve excellent compression ratios through the columnar storage.
Instead of storing the data by rows, they store it by columns, which makes the
scans of the data very fast.
Redshift is a relational massively parallel processing (MPP) data warehouse, so
there are multiple nodes rather than just one big machine. The service works with
SQL queries and also allows you to write your own modules in Python.
Because Redshift is scaled per node, if you need more power you need to add
another node. This means you need to make a selection of both compute and
storage, and the service is charged per node, per hour. Redshift gives you a lot of
control over specific node configurations, so you can choose how many cores and
how much memory the nodes have. You can also decide whether to pay more and
have the fastest storage on the nodes through solid state drives (SSDs) or save
some money by instead using hard drive-based storage attached to the nodes.
Redshift is a very good warehousing solution for all your data. If you have a big
footprint on AWS, Redshift is definitely the warehousing solution that you want.
EXPERT TIP
With Redshift, you do need to watch node count and configurations. The ideal
configuration of your Redshift cluster might depend on your workload and your
workload patterns, so you need to decide if it’s better to have fewer nodes with
really high specs or more nodes with less compute or less memory. Based on
your analysis, you then need to properly tune Redshift for your workload and
warehouse design.
Also be aware of possible copy issues due to Amazon Simple Storage Service (S3)
consistency. Amazon recommends that you use manifest files to specify exactly what
you want to load, so you don't end up simply reading file names off S3 and, because of
eventual consistency, missing a file.
Finally, Redshift does require regular maintenance to keep the statistics and tables
up to date. If you do any updates or deletes, the service has an operation called
VACUUM to keep the tables optimally organized for fast retrieval.
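Because Redshift speaks the PostgreSQL protocol, both the manifest-based COPY and the routine VACUUM and ANALYZE maintenance can be scripted with a standard driver such as psycopg2. In the sketch below, the cluster endpoint, credentials, bucket, and IAM role are placeholders.

    import psycopg2

    # Connection details are placeholders; Redshift uses the PostgreSQL protocol on port 5439.
    conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="loader", password="change-me")
    conn.autocommit = True   # VACUUM cannot run inside a transaction block
    cur = conn.cursor()

    # Load via a manifest file so eventual consistency on S3 cannot silently drop files.
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/loads/sales.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        MANIFEST
        FORMAT AS CSV;
    """)

    # Routine maintenance after updates and deletes keeps tables and statistics fresh.
    cur.execute("VACUUM sales;")
    cur.execute("ANALYZE sales;")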
HOW IT’S PRICED
Redshift is billed by the hour per node. The cost of each node depends on the
configuration of cores, amount of memory, and type of storage you select.
MICROSOFT AZURE SQL DATA WAREHOUSE
Microsoft Azure SQL Data Warehouse is Microsoft’s response to Redshift. It’s fully
relational, with 100 percent SQL-type queries, and highly compatible with
T-SQL for SQL Server. If you have SQL Server investments, it would be very easy
to adopt SQL Data Warehouse.
Like Redshift, storage is columnar and the service is MPP. Data is split into storage
distributions when you load it. The architecture is distributed, so a query is sent to
all the different nodes to help resolve your questions.
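As a hedged illustration of how that distribution works: when you create a table, you declare its distribution method and columnar storage in T-SQL. The sketch below issues the statement through pyodbc from Python, with placeholder server and table names.

    import pyodbc

    # Server, database, and credentials below are illustrative placeholders.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=mydwserver.database.windows.net;DATABASE=mydw;"
        "UID=dbadmin;PWD=change-me",
        autocommit=True,
    )

    # Declare how rows spread across the MPP distributions, and store them in columnar format.
    conn.execute("""
        CREATE TABLE dbo.FactSales
        (
            SaleId      BIGINT NOT NULL,
            CustomerKey INT    NOT NULL,
            Amount      DECIMAL(18, 2)
        )
        WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);
    """)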
Azure SQL Data Warehouse scales compute and storage independently. Unlike
Redshift, where you always need to scale on a full node, Azure SQL Data
Warehouse allows you to add just more compute if you only need more compute.
You can also add more storage and keep the same amount of compute.
A very powerful capability is that you can pause compute completely. For example, if you
don’t have much load on your data warehouse during the weekend, you can decide to
shut it down and pause it completely during the weekend, for maximum savings.
Azure SQL Data Warehouse is an excellent enterprise warehousing solution,
particularly if you have a lot of data already built on Azure services. If you have a
pause-friendly workload, this service will provide very good savings.
EXPERT TIP
Unlike Redshift, which gives you a lot of control over the configuration of the
nodes, Azure SQL Data Warehouse gives you no control over hardware. It’s 100
percent Platform-as-a-Service (PaaS). You simply select a compute unit, called a
Data Warehousing Unit (DWU). The number of DWUs you select gives you an idea of the
power that you get for the data warehouse.
Be aware that at the time of publication, not all T-SQL data types are supported
yet. For example, if you need to store spatial data, you could store it now just as
binary, but you won’t have full support of all the spatial functions.
Before you start a full migration to Azure SQL Data Warehouse, ensure that you
carefully review which functionality is available. However, if you have only regular
structured types on your data warehouse, it’s definitely wise to consider this
service now.
HOW IT’S PRICED
Azure SQL Data Warehouse has two separate cost components: storage and
compute. Compute is elastic and is billed by the hour based on the number of
DWUs you provision.
GOOGLE BIGQUERY
Google BigQuery is a mix of an 18-wheeler and a container ship. A container ship
is a big-data, Hadoop-style service.
BigQuery is a hybrid because it is based on a structured schema but at the same
time allows for easy integration with Google Cloud Dataproc and applying schema on read
over storage. The service supports regular tables with data stored inside the
service as well as virtual tables where you put schema on read. It’s the same with
external tables, so you can map BigQuery to other services inside Google, such as
Google Cloud Storage, and then have those tables defined inside BigQuery to be
used for your analytic queries.
BigQuery is Google Cloud Platform’s serverless analytics data warehouse, so you
do not need to manage hardware, software or the operating system.
Google has replaced its SQL with a standards-compliant dialect that enables
more advanced query planning and optimization. There are also new data types,
additional support for timestamps, and extended JOIN support.
BigQuery also has a streaming interface, so instead of running an Extract,
Transform and Load (ETL) process based mostly on fixed-schedule batch
processing, you can also have a streaming flow that brings inserts directly into
BigQuery constantly using an API or the Cloud Dataflow engine.
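As a hedged sketch with the google-cloud-bigquery Python client library (dataset, table, and field names are placeholders), a standard-SQL query and a streaming insert look like this:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # project ID is a placeholder

    # Run a standard-SQL query; you pay for the data the query reads.
    query_job = client.query(
        "SELECT device_id, AVG(temp_c) AS avg_temp "
        "FROM `my_dataset.sensor_readings` "
        "GROUP BY device_id"
    )
    for row in query_job.result():
        print(row.device_id, row.avg_temp)

    # Stream rows in through the streaming interface instead of a batch ETL load.
    table = client.get_table("my_dataset.sensor_readings")
    errors = client.insert_rows_json(table, [{"device_id": "thermostat-1", "temp_c": 21.5}])
    assert not errors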
EXPERT TIP
BigQuery is a very good one-stop shop if you have streaming data, relational
data, and file-based data because it can put schema on read. But watch out for
high-compute queries where Google estimates that it takes too much compute
to resolve at their regular rate per query: as of August 2016, the limit is 1 terabyte.
Above this limit, the extra compute cost is $5 per terabyte. You might receive an
error message that reads, “Hey, you need to run with higher compute.” The cost of
that query will be higher, so you will need to watch out for runaway costs.
Hadoop can be attached to BigQuery tables, but it does require a temporary data
copy. The BigQuery Hadoop connector will perform a temporary data copy to
Google Cloud Storage (GCS) for Hadoop. Don’t be surprised if you incur some
GCS costs for this type of operation.
HOW IT’S PRICED
BigQuery is billed per storage and per query, with automatic lower pricing tiers
after 90 days of data being idle. Costs are based on the amount of data you have
and how much of it you read. If you are streaming data in, you will pay extra for it.
With BigQuery, you pay only for data read by queries. For example, if you have a
20-TB warehouse in BigQuery, but you’re only running 1 to 10 queries per day, you
will pay for only those queries. You do not need to pay for provisioned compute and
storage the way you do with Redshift. With Azure SQL Data Warehouse you also
pay for compute, but at least you can pause it. BigQuery goes one step beyond
by charging for only specific queries that you run. As a result, you don’t even need
to think about starting and pausing compute. You simply use compute on demand
whenever you want to run a query.
CONTAINER SHIPS
The container ships are big-data systems that carry everything, any shape or
form. They are really Hadoop-as-a-Service, and this is very attractive because
on-premises Hadoop deployments have a high cost to experiment: the high cost
of curiosity. You need to build your Hadoop service and also have enough storage
and enough nodes before you can start your data exploration.
If you instead do your data exploration in the cloud, you can let the cloud deploy
all the power you need. If you need a very large cluster, you don’t need to make
any type of capital expenditures to get up and running. You also don’t need to
make operational expenditures to manage the cluster. You simply create and
destroy as needed, and you pay for storage in the cloud. All the major cloud
providers offer this type of service.
All of the container ships follow a similar pattern. You pick a machine model for
your nodes, decide how many nodes you want, deploy the cluster at that size,
and then attach the ship to a storage service that it can read the data from.
The Amazon storage service is S3. The Microsoft Azure services are Azure Data
Lake and Azure Blob Storage. Google uses Google Cloud Storage.
After the cluster is deployed, you use it as a Hadoop installation if you need to run
MapReduce, Apache Spark, Apache Storm, or any other type of Hadoop-based service.
AMAZON ELASTIC MAPREDUCE
Amazon Elastic MapReduce (EMR) is a managed Hadoop framework that makes it
easy, fast, and cost-effective to distribute and process vast amounts of your data
across dynamically scalable Amazon EC2 instances. You can also run other popular
distributed frameworks such as Spark and Presto in Amazon EMR, and interact with
data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR releases are packaged using a system based on Apache Bigtop,
which is an open-source project associated with the Hadoop ecosystem.
In addition to Hadoop and Spark ecosystem projects, each Amazon EMR
release provides components that enable cluster and resource management,
interoperability with other AWS services, and additional configuration optimizations
for installed software.
EXPERT TIP
Amazon provides the AWS Data Pipeline service, which lets you automate recurring
clusters by implementing an orchestration layer that automatically starts the cluster,
submits jobs, handles exceptions, and tears down the cluster when the job is done.
HOW IT’S PRICED
Amazon charges per hour for EMR. One way to minimize costs is to have some of the
compute nodes deployed on Spot Instances; this provides savings of up to 90 percent.
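A hedged sketch of launching a transient EMR cluster with boto3 ties these ideas together: the cluster starts, runs one Spark step, and tears itself down, with task nodes on Spot Instances to cut compute costs. All names, paths, roles, and prices are placeholders.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="nightly-spark-aggregation",
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": 2},
                # Task nodes on Spot Instances reduce the compute bill.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "InstanceType": "m3.xlarge", "InstanceCount": 4,
                 "Market": "SPOT", "BidPrice": "0.10"},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,   # tear down when the step finishes
        },
        Steps=[{
            "Name": "spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])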
MICROSOFT AZURE HDINSIGHT
Microsoft Azure HDInsight is an Apache Hadoop distribution that deploys and
provisions managed Hadoop clusters. This service can process unstructured or
semi-structured data and has programming extensions for C#, Java, and .NET,
so you can use your programming language of choice on Hadoop to create,
configure, submit, and monitor Hadoop jobs.
HDInsight is tightly integrated with Excel, so you can visualize and analyze your
Hadoop data in compelling new ways using a tool that’s familiar to your business
users. HDInsight incorporates R Server for Hadoop, a cloud implementation of
one of the most popular programming languages for statistical computing and
machine learning. It gives the familiarity of R with the scalability and performance
of Hadoop. HDInsight also includes Apache HBase, a columnar NoSQL database
that runs on top of the Hadoop Distributed File System (HDFS). This lets you do
large transactional processing (OLTP) of non-relational data, enabling use cases
like interactive websites or having sensor data write to Azure Blob Storage.
HDInsight includes Apache Storm, an open-source stream analytics platform
that can process real-time events at large scale. It also includes Apache Spark,
an open-source project in the Apache ecosystem that can run large-scale data
analytics applications in memory.
EXPERT TIP
HDInsight includes HBase, enabling you to do large transactional processing (OLTP)
of non-relational data for use cases such as interactive websites or having sensor
data write to Azure Blob Storage. You can also run Spark and Storm in HDInsight.
HOW IT’S PRICED
HDInsight is priced based on storage and the cost of the cluster. The cost of the
cluster is an hourly rate per node of the cluster.
GOOGLE CLOUD DATAPROC
Google Cloud Dataproc is a managed Apache Hadoop, Apache Spark, Apache
Pig, and Apache Hive service that lets you use open-source data tools for batch
processing, querying, streaming, and machine learning. Dataproc helps you create
clusters quickly, manage them easily, and save money by turning clusters off when
you don’t need them. From a networking perspective, Dataproc supports subnets,
role-based access, and clusters with no public IP.
Similar to Amazon EMR, Dataproc releases are packaged using a system based
on Apache Bigtop, which is an open-source project associated with the Hadoop
ecosystem. Although some of the tools from the Hadoop ecosystem might not be
enabled by default, it is very easy to add them to the deployment.
One advantage of Dataproc over EMR is how fast the cluster can be deployed: for
most of the configurations the time is less than 90 seconds. Also, after the first 10
minutes there is by-the-minute billing, which makes Dataproc clusters great building
blocks for a more complex ETL pipeline.
Another advantage of Dataproc over other managed Hadoop services is its
integration with Google Cloud Storage as an alternative to the Hadoop Distributed
File System. This integration provides immediate consistency. By contrast, it usually
takes 1 to 3 minutes before files become visible on, for example, S3. Immediate
consistency in Dataproc means that the same storage can be accessed across
multiple clusters in a consistent manner.
EXPERT TIP
There is no global orchestration and scheduling service available from Google yet
(similar to AWS Data Pipeline), so a custom Luigi, Oozie, or Airflow setup will need to be
deployed and maintained.
Google is also still working on deeper integration of Stackdriver, Google’s
integrated monitoring, logging, and diagnostics tool, with Dataproc. An integration
at the Job level should be available soon. In the meantime, the Dataproc user
interface does provide access to the required logs.
HOW IT’S PRICED
Pricing for Dataproc is based on storage and the cost of the cluster. The cost of
the cluster is an hourly rate per node of the cluster.
MICROSOFT AZURE DATA LAKE
Microsoft Azure Data Lake is Microsoft’s one step up from Hadoop-as-a-Service.
Azure Data Lake service is separated into storage and analytics. The storage
service has no limit on size, including no limit on the size of a file. The analytics
service can run large data jobs on demand, very similar to how BigQuery runs
queries on demand.
Because Azure Data Lake is a big-data type of repository, you can mix tables,
you can mix files, and you can have external tables. Azure Data Lake does all this
through the U-SQL language, which is a mix of SQL and C#. If you have DBAs in
your company, or if you have developers who know SQL and C#, it is easy for
them to be productive very quickly with Azure Data Lake without needing to learn
all the different pieces of the Hadoop ecosystem, such as Pig and Hive.
If you do need a full Hadoop cluster, for example if you want to use some Mahout
algorithms on your data, you can attach an HDInsight cluster directly to Azure Data
Lake and then run from that. You also have the option of on-demand analytics
through U-SQL.
Analytics can also be scaled dynamically to increase compute. You simply increase
the number of analytic units, which are the nodes running your queries. Because
analytics are performed per job, you can easily control your cost of using the
service. Each time you submit a job, there’s a fixed cost.
Azure Data Lake is excellent for leveraging T-SQL and .NET skills to provide
Platform-as-a-Service (PaaS) big-data analytics. The barrier to entry for doing big
data analytics is very low in terms of learning new skills.
EXPERT TIP
Be aware that this service is still in public preview at the time of this writing. For
this reason, it has a limit of 50 analytic units when you run a job, and 3 concurrent
jobs per account. However, if you do have a strong use case, you should reach
out to Microsoft Support because they can lift these restrictions.
HOW IT’S PRICED
Azure Data Lake has two components: storage and jobs. Your total storage costs
depend on how much you store and the volume of data transfers. Jobs are billed at a
flat rate per job plus the number of Analytic Units used. These units govern how
many compute resources you get.
SUMMARY
When it comes to choosing a DBaaS, you have a variety of options. The Corollas
are the classic RDBMS services in the cloud: not flashy, but reliable. The Formula
One offerings are built for purpose. They don’t have all the functionality of
the mature RDBMS products but they ingest and query data very quickly. The
18-wheelers are data warehouses in the cloud that store and easily query large
amounts of structured data. The container ships are big-data systems that carry
everything. Think of them as Hadoop-as-a-service.
All of these offerings can improve delivery because all the management tasks are
automated. As a result, there’s less chance of human error and less chance of
quality issues during maintenance. All of the offerings also reduce time-to-market,
enable faster ROI, and reduce capital expenditures.
Before you choose a service, you need to understand all of them, then closely
consider your requirements. You don’t want to deploy DocumentDB, then realize
later that what you really needed was an RDBMS service. You don’t want to
choose Redshift, only to discover that you’d have been better served by BigQuery.
Think about your relational data, your NoSQL unstructured data, and your big
structured data requirements for warehousing. Maybe you’re also adopting big
data analytics.
With the right public cloud service for your use case, you can leverage your data
to gain insights, then use those insights to gain competitive advantages.
For more information about how Pythian can help you choose the right DBaaS for
your needs, please visit: https://www.pythian.com/solutions/
Warner Chaves
@warchav
Warner Chaves is a principal consultant at Pythian, and Microsoft Certified Master and Microsoft
MVP. Warner has been recognized by his colleagues for his ability to remain calm and collected
under pressure. His transparency and candor enable him to develop meaningful relationships
with his clients, where he welcomes the opportunity to be challenged. Originally from Costa Rica,
Warner is fluent in English and Spanish.
ABOUT THE AUTHOR
ABOUT PYTHIAN
Pythian is a global IT services company that helps businesses become more competitive by using technology to reach their business goals.
We design, implement, and manage systems that directly contribute to revenue and business success. Our services deliver increased agility
and business velocity through IT transformation, and high system availability and performance through operational excellence. Our highly
skilled technical teams work as an integrated extension of our clients’ organizations to deliver continuous transformation and uninterrupted
operational excellence using our expertise in databases, cloud, DevOps, big data, advanced analytics, and infrastructure management.
V01-092016-NA
Danil Zburivsky
@zburivsky
Danil Zburivsky is Pythian’s director of big data and data science. Danil leads a team of big data
architects and data scientists that help customers worldwide to achieve their most ambitious goals
when it comes to large scale data platforms. He is recognized for his expertise in architecting, and
building and supporting large mission-critical data platforms using MySQL, Hadoop and MongoDB.
Danil is a popular speaker at industry events, and has authored a book titled Hadoop Cluster
Deployment.
Vladimir Stoyak
Vladimir Stoyak is a principal consultant for big data. Vladimir is a certified Google Cloud Platform
Qualified Developer, and Principal Consultant for Pythian’s Big Data team. He has more than 20
years of expertise working in Big Data and machine learning technologies including Hadoop, Kafka,
Spark, Flink, Hbase, and Cassandra. Throughout his career in IT, Vladimir has been involved in
a number of startups. He was Director of Application Services for Fusepoint, which was recently
acquired by CenturyLink. He also founded AlmaLOGIC Solutions Incorporated, an e-Learning
analytics company.
Derek Downey
@derek_downey
Derek Downey is the practice advocate for the OpenSource Database practice at Pythian,
helping to align technical and business objectives for the company and for our clients. Derek
loves automating MySQL, implementing visualization strategies and creating repeatable training
environments.
Manoj Kukreja
@mkukreja
Manoj Kukreja is a big data and IT security specialist whose qualifications include a degree
in computer science, a master’s degree in engineering, along with CISSP, CCAH and OCP
designations. With more than twenty years of experience in the planning, creation and deployment
of complex and large scale infrastructures, Manoj has worked for large scale public and private
sectors organizations including US and Canadian government agencies. Manoj has expertise in
NoSQL and big data technologies including Hadoop, MySQL, MongoDB and Oracle.
CONTRIBUTORS
Pythian, The Pythian Group, “love your data”, pythian.com, and Adminiscope are trademarks of The Pythian Group Inc. Other product and company names mentioned herein may be trademarks or registered trademarks of their respective owners. The information presented is subject to change without notice. Copyright © <year> The Pythian Group Inc. All rights reserved.