When it comes to running your data in the public cloud, there is a range of
Database-as-a-Service (DBaaS) offerings from all three major public cloud
providers. Knowing which is best for your use case can be challenging. This
paper provides a high-level overview of the main DBaaS offerings from Amazon,
Microsoft, and Google.
After reading this white paper, you’ll have a high-level understanding of the most
popular data repositories and data analytics service offerings from each vendor,
you’ll know the key differences among them, and you’ll know which ones are best suited
to each use case. With this information, you can direct your more detailed research
to a manageable number of options.
CHOOSING A DATABASE-AS-A-SERVICE
Warner Chaves, Principal Consultant, Microsoft Certified Master, Microsoft MVP
With Contributors
Danil Zburivsky, Director of Big Data and Data Science
Vladimir Stoyak, Principal Consultant for Big Data, Certified Google Cloud
Platform Qualified Developer
Derek Downey, Practice Advocate, OpenSource Databases
Manoj Kukreja, Big Data and IT Security Specialist, CISSP, CCAH and OCP
AN OVERVIEW OF OFFERINGS BY MAJOR PUBLIC CLOUD SERVICE PROVIDERS
This white paper does not discuss private cloud providers or colocation environments,
streaming, data orchestration, or Infrastructure-as-a-Service (IaaS) offerings.
This paper is targeted to IT professionals with a good understanding of databases
and also business people who want an overview of data platforms in the cloud.
WHAT IS A DBAAS OFFERING?
A DBaaS is a database running in the public cloud. Three things define a DBaaS:
• The service provider installs and maintains the database software, including
backups and other common database administration tasks. The service
provider also owns and manages the operating system, hypervisors, and
bare metal hardware.
• Application owners pay according to their usage of the service.
• Usage of the service must be flexible—users can scale up or down on
demand and also create and destroy environments on demand. These
operations should be possible through code with no provider intervention.
FOUR CATEGORIES OF DBAAS OFFERINGS
To keep things simple, we’ve created four categories of DBaaS offerings. Your
vehicles of choice are:
• The Corollas: These are the classic RDBMS services in the cloud: Amazon
Relational Database Service (RDS), Microsoft Azure SQL Database, and
Google Cloud SQL.
• The Formula One offerings: These special-purpose offerings ingest and
query data very quickly but might not offer all the amenities of the Corollas.
Options include Amazon DynamoDB, Microsoft Azure DocumentDB, Google
Cloud Datastore, and Google Cloud Bigtable.
• The 18-wheelers: These data warehouses of structured data in the cloud
include Amazon Redshift, Microsoft Azure SQL Data Warehouse, and Google
BigQuery.
• The container ships: These Hadoop-based big-data systems can carry
anything, and include Amazon Elastic MapReduce (EMR), Microsoft Azure
HDInsight, and Google Cloud Dataproc. This category also includes the
further automated offering of Azure Data Lake.
The rest of this white paper discusses each category and the Amazon, Microsoft,
and Google offerings within each category. We describe each offering, explain
what it is well suited for, provide expert tips or additional relevant information, and
provide high-level pricing information.
COROLLAS
With the Corollas, just like with the car, you know what you’re getting, and you know
what to expect. This type of classic RDBMS service gets you from point A to point B
reliably. It’s not the flashiest or newest thing on the block, but it gets the job done.
AMAZON RDS
Amazon Relational Database Service (RDS) is the granddaddy of DBaaS offerings
available on the Internet. RDS is an automation layer that Amazon has built on
top of MySQL, MariaDB, Oracle, PostgreSQL, and SQL Server. Amazon has also
developed its own MySQL fork called Amazon Aurora, which also lives inside RDS.
RDS is an easy way to transition into DBaaS because the service mimics the on-
premises experience very closely. You simply need to provision an RDS instance,
which maps very closely to the virtual machine models that Amazon offers.
Amazon then installs bits, manages patches and backups, and can also manage
the high availability, so you do not need to plan and execute these tasks yourself.
RDS is very good for lift-and-shift types of cloud migrations. It makes it easy for
existing staff to take advantage of the service because it mimics the on-premises
experience, be it physical or virtual.
EXPERT TIP
The storage is very flexible: this is both a pro and a con. The pro is that you have a lot of
control over storage. The con is that there are so many storage options, you need the
knowledge to choose the best one for your use case.
Amazon has general storage, provisioned IOPS (input/output operations per second),
and two categories of magnetic storage. The storage method you choose will depend
on your particular use cases.
You need to be aware that Amazon does not make every patch version of all products
available on RDS. Instead, Amazon makes only some major service packs or Oracle
patch levels available. As a result, the exact patch level that you have on premises might
not map to a patch level on RDS. In this situation, do not move to a patch level that is
below the patch level you have because that may result in product regressions. Instead,
wait until Amazon has deployed a patch level higher than what you have. At this point, it
should be fairly safe to start testing if you want to migrate to RDS.
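To make the provisioning step concrete, here is a minimal, illustrative sketch of creating an RDS instance with the AWS SDK for Python (boto3). The identifiers, credentials, and sizes are placeholders, and the storage type shows the general storage versus provisioned IOPS choice described above.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Identifiers, credentials, and sizes below are illustrative placeholders.
    rds.create_db_instance(
        DBInstanceIdentifier="orders-prod",
        Engine="postgres",
        DBInstanceClass="db.m4.large",   # maps closely to an EC2 machine model
        AllocatedStorage=200,            # in GB
        StorageType="io1",               # provisioned IOPS; "gp2" is the general storage option
        Iops=2000,
        MultiAZ=True,                    # let Amazon manage high availability
        BackupRetentionPeriod=7,         # automated backups kept for seven days
        MasterUsername="dbadmin",
        MasterUserPassword="change-me",
    )

Amazon takes over from there: the software is installed, patched, and backed up without further intervention.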
HOW IT’S PRICED
The hourly rate for RDS depends on:
• whether you have your own license or if Amazon is leasing you the license;
• how much compute power you choose: The number of cores, and amount of
memory and temporary disk you want on this instance;
• the storage you require; and
• whether you pre-purchased with Reserved Instances.
MICROSOFT AZURE SQL DATABASE
Microsoft Azure SQL Database is a “cloud-first” SQL Server fork. The term “cloud-
first” means that Microsoft now tests and deploys their code continuously with Azure
SQL Database, and the code and lessons learned are implemented in the retail SQL
Server product—whether the product is on premises or on a virtual machine.
Even if you don’t have any investment in SQL Server, Azure SQL Database is an
excellent DBaaS platform because of the investments made to support the elastic
capabilities and the ease of scaling horizontally. As you need more capacity, you
just add more databases.
It’s also easy to manage the databases by pooling resources, performing elastic
queries, and performing elastic job executions. You could deploy your own code
to do something similar in Amazon RDS, but in Azure SQL Database, Microsoft has
already built it for you.
In addition, Azure SQL Database makes it easy to build an elastic application on a
relational service. This capability supports the Software-as-a-Service (SaaS) model,
wherein you have many clients and each has a database. The SaaS provider has
a data layer that is easier to manage and scale than if they were running on their
own infrastructure.
Unlike Amazon RDS, Azure SQL Database does not exactly map to a type of retail
database, such as Oracle, SQL Server, or open-source MySQL. It is closely related
to SQL Server but it’s not licensed or sold in a similar way. As a result, Azure SQL
Database does not have any licensing component.
At the same time, Azure SQL Database does not give you a lot of control over the
hardware. With Amazon RDS, you need to select CPUs, memory, and your storage
layout. Azure SQL Database does all this for you.
With Azure SQL Database the only thing that you need to choose is the service
tier. Your choice determines how much power your database has. There are three
service tiers: basic, standard, and premium. Each of these also has some sub-tiers
to increase or decrease performance. If you have many databases in Azure SQL
Database, you can also choose the elastic database pool pricing option to increase
your savings by sharing resources.
Azure SQL Database is a good choice if you already have Transact-SQL (T-SQL)
skills in-house. If you have a large investment in SQL Server, Azure SQL Database
is the most natural way to take advantage of DBaaS offerings in the cloud. It’s also a
very good web scale relational service in its own right because of all the investments
made to support the SaaS model.
EXPERT TIP
You do need to ensure that you do the proper SQL tuning to be able to choose
the right service tier for your needs. In the past, it was more difficult to scale up
because all equipment was on premises. Now, it’s very easy to increase the power
of the service and therefore pay more money. However, just because scaling up is
easy does not mean it’s always what you need to do. If you perform the proper SQL
tuning, you will not need to pay more for raw power.
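If you do decide that a database needs more (or less) power, the move between service tiers can be made on demand with a T-SQL statement. The following is a minimal, illustrative sketch using the pyodbc driver from Python; the server, database, and credential values are placeholders.

    import pyodbc

    # Server, database, and credentials are placeholders; connect to the logical server's master database.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=master;"
        "UID=dbadmin;PWD=change-me",
        autocommit=True,
    )

    # Move the database to a different service tier on demand (here, Standard S2).
    conn.execute("ALTER DATABASE [orders] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S2');")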
HOW IT’S PRICED
Azure SQL Database has a simple pricing model. You pay an hourly rate for the service tier
your database is running on: Basic, Standard, or Premium. Each has a different size limit for
the database and provides more performance as you go up in the tier.
GOOGLE CLOUD SQL
Google Cloud SQL is a MySQL managed database service that is very similar to
Amazon RDS for MySQL and Amazon Aurora. You select an instance and deploy it
without needing to install any software.
Cloud SQL automates all your backups, replication, patches, and updates anywhere
in the world while ensuring greater than 99.95 percent availability. Automatic failover
ensures your database will be available when you need it.
Cloud SQL Second Generation introduces per-minute, pay-per-use billing, automatic
sustained use discounts, and instance tiers to fit any budget.
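Because Cloud SQL exposes a standard MySQL endpoint, applications talk to it with any ordinary MySQL driver. The sketch below uses the PyMySQL driver from Python; the host, credentials, and table are placeholders.

    import pymysql

    # Host, credentials, and schema below are illustrative placeholders.
    conn = pymysql.connect(host="203.0.113.10", user="app", password="change-me", db="inventory")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM products WHERE stock > %s", (0,))
            for product_id, name in cur.fetchall():
                print(product_id, name)
    finally:
        conn.close()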
EXPERT TIP
Cloud SQL does have restrictions on:
• anything related to loading/dumping the database to a local file system,
• installing plugins,
• creating user-defined functions,
• performance schema,
• SUPER privileges, and
• Storage engines: InnoDB is the only one supported for Second Generation
instances
HOW IT’S PRICED
Pricing for Cloud SQL Second Generation is made up of three components: instance
pricing, storage pricing, and network pricing. The charge is based on the machine
type you choose for the instance. Storage and network pricing are separate charges.
FORMULA ONE OFFERINGS
The Formula One DBaaS offerings are fit-for-purpose offerings. They do not have
all the functionality of the mature RDBMS products but they do a limited number of
things very well.
A Formula One car is built purely for speed. It does not have a cup holder, heated seats,
or satellite radio. However, it’s fit for purpose—and that purpose is to go fast. (Admittedly,
you might miss some of the amenities that you are used to with a regular car.)
Similarly, the Formula One DBaaS offerings are built for purpose. That purpose is
to ingest and query data very quickly. Think of them as NoSQL in the cloud. The
NoSQL movement was popularized by large web applications such as Google and
Facebook as a way to differentiate their database platforms from the classic RDBMS
offerings. Usually NoSQL products handle horizontal scalability with more ease,
have more relaxed restrictions on schema (if any), and forego some of the ACID
requirements as a trade-off for more speed.
AMAZON DYNAMODB
Amazon DynamoDB is a very popular service offered through Amazon Web Services
(AWS). It’s basically a NoSQL document/key value table store. All you need to define is
the table and either its key or its key and sort order. The schema is completely flexible
and is up to you. DynamoDB is best suited for applications with known query patterns
that don’t require complex transactions and that ingest large volumes of data.
DynamoDB is built for scale-out growth of high-ingest applications because the
Amazon scale-out architecture guarantees that you will not run out of space. You don’t
need to worry about the scale-out; you just need to know that this is how Amazon
has architected the service. For example, when you specify a partition key for your
records, they will be distributed across the nodes that Amazon builds transparently
behind the scenes for your data, based on that key.
This offering does not have an optimizer, so it does not support ad hoc SQL querying
the way a relational product does. It’s more like a set of denormalized, materialized views
based on the indexes that you have created on your data.
Querying is not done with SQL, it is performed through a different type of specification.
Amazon provides SDKs in many languages, including Java, .NET, and Python. You use
these SDKs to develop queries. This process does require a bit of learning but that’s
not a major time investment.
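As a hedged sketch of that workflow with the Python SDK (boto3): you declare only the table and its keys, write items with whatever attributes you like, and query by key condition rather than with ad hoc SQL. The table and attribute names are placeholders.

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

    # Only the key schema is declared; every other attribute is free-form.
    table = dynamodb.create_table(
        TableName="SensorReadings",
        KeySchema=[
            {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
            {"AttributeName": "ts", "KeyType": "RANGE"},         # sort key
        ],
        AttributeDefinitions=[
            {"AttributeName": "device_id", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "N"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )
    table.wait_until_exists()

    # Items can carry lists, maps, and sets alongside the declared keys.
    table.put_item(Item={"device_id": "thermostat-1", "ts": 1472688000,
                         "temp_c": 22, "tags": ["lab", "floor-2"]})

    # Querying is a key-condition lookup, not an ad hoc SQL statement.
    response = table.query(KeyConditionExpression=Key("device_id").eq("thermostat-1"))
    print(response["Items"])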
Although DynamoDB does not have a fixed schema, it does support complex
schemas. For example, fields are denormalized: some fields could be lists, some could
be maps or sets. This service also exposes a stream-based API, so if you need to
replicate the data changes from DynamoDB to another system, you can do so through
the stream-based API.
EXPERT TIP
Because this service does not support ad hoc querying, your schema can have a
huge impact on what you’re allowed to do on your application. DynamoDB also has
a finite number of indexes that you can apply: five global secondary indexes per table. You
need to keep in mind the indexing limits and lack of an optimizer, and ensure that
your schema will be able to support your future application requirements.
HOW IT’S PRICED
The cost of DynamoDB is based on storage, how much data you have, and the I/O
rate: your number of requests for read units and write units. If you have any streams,
you will need to pay for the streams’ read rate.
MICROSOFT AZURE DOCUMENTDB
Microsoft Azure DocumentDB is a NoSQL document database that is basically a
repository for JSON (JavaScript Object Notation) documents. JSON documents
have no schema restrictions. They can contain almost any type of field, and they can
also have nested fields. This DBaaS is NoSQL and denormalized, with built-in support
for partitioned collections, so you can specify a field in the JSON documents and Azure
DocumentDB will partition the documents based on that field.
Azure DocumentDB also has built-in geo-replication support, so you can have, for
example, an Azure DocumentDB collection reading and writing on the east coast
of the United States and a replica of this collection that you can use for reads in the
central United States. If there’s an issue with the DocumentDB on the east coast, you
can failover to the other geo-region for very high availability.
Azure DocumentDB is a good choice for JSON-based storage, and it’s very easy
to set up and start storing documents. Retrieval is also easy because this database
supports full-blown SQL-style queries, so you don’t need to learn any new query
language.
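A minimal sketch of that flow with the Python SDK of the time (pydocumentdb) might look like the following. The endpoint, key, and resource names are placeholders, and the exact client calls may differ between SDK versions.

    import pydocumentdb.document_client as document_client

    # Endpoint, key, and resource names below are illustrative placeholders.
    client = document_client.DocumentClient(
        "https://myaccount.documents.azure.com:443/", {"masterKey": "change-me"})

    database = client.CreateDatabase({"id": "store"})
    collection = client.CreateCollection(database["_self"], {"id": "orders"})

    # JSON documents need no fixed schema; nested fields are fine.
    client.CreateDocument(collection["_self"],
                          {"id": "1001", "customer": "Acme", "items": [{"sku": "A1", "qty": 3}]})

    # Retrieval uses familiar SQL-style queries.
    for doc in client.QueryDocuments(collection["_self"],
                                     "SELECT * FROM orders o WHERE o.customer = 'Acme'"):
        print(doc["id"], doc["items"])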
EXPERT TIP
If you don’t specify any indexes, the system has some automatic indexing policies.
However, keep in mind that indexing consumes storage, so the more
indexes you have, the more storage you will consume—and you will pay for that amount
of storage. The storage could be for indexes that you do not use, so ensure that the
automatic indexing policies work for your use case.
If it doesn’t make sense to have an index on a field because you never search on it, you
can disable the index through a custom policy. Also, partitioned collections have limits:
each partition key can hold no more than 10 GB of documents. If you need more than this
amount per partition key, you will probably want to design with a very high-granularity
partition key.
HOW IT’S PRICED
Azure DocumentDB offers some pre-defined tiers for billing based on common usage
patterns. However, if you want to customize the system, you can easily select your
individual compute power, referred to as “request units”, plus the amount of storage that
you want for the collection.
GOOGLE CLOUD DATASTORE
Google Cloud Datastore is Google’s version of a NoSQL cloud service similar
to Amazon DynamoDB and Microsoft Azure DocumentDB. From an architecture
perspective, Cloud Datastore is similar to other key/value stores. The data model
is organized based on “entities”, which loosely resemble rows in a relational table.
Entities can have multiple properties but no rigid schema is imposed on entities.
Two different entities of a similar type don’t need to have the same number or
type of properties. An interesting feature of Cloud Datastore is built-in support for
hierarchical data.
In addition to all the properties you would expect from a cloud NoSQL DBaaS,
such as massive scalability, high availability, and flexible storage, Cloud Datastore
also supports some unique properties, including out-of-the-box transaction support
and encryption at rest.
Google also provides tight integration of Cloud Datastore with other Google Cloud
Platform services. Applications running in Google App Engine can use Cloud
Datastore as their default database. You can also load data from Cloud Datastore
into Google BigQuery for analytics purposes.
There are multiple ways to access data in Cloud Datastore. There are client
libraries for most popular programming languages as well as a REST interface.
Google also provides a GQL language that is roughly modelled on SQL and can
provide an easier transition from relational databases to the NoSQL world.
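The sketch below, using the google-cloud-datastore Python client library, illustrates the entity model and access paths described above; the kind and property names are placeholders.

    from google.cloud import datastore

    client = datastore.Client(project="my-project")   # project ID is a placeholder

    # An entity loosely resembles a row, but no rigid schema is imposed.
    key = client.key("Task", "review-dbaas")
    task = datastore.Entity(key=key)
    task.update({"description": "Review DBaaS options", "done": False, "priority": 4})
    client.put(task)

    # Fetching an entity by its key is the most common (and free) operation.
    print(client.get(key))

    # Simple single-property queries rely on the automatic built-in indexes.
    query = client.query(kind="Task")
    query.add_filter("done", "=", False)
    for entity in query.fetch():
        print(entity["description"])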
EXPERT TIP
Cloud Datastore automatically indexes all properties for an entity, making simple
single-property queries possible without any additional configuration. More
complex multi-property indexes can be created by defining them in a special
configuration file.
HOW IT’S PRICED
Similar to other cloud NoSQL services, Cloud Datastore is priced according to the
amount of storage the database requires and the number of operations it performs.
Google defines prices for reads, writes, and deletes per 100,000 entities. However,
simple requests, such as fetching an entity by its key (a very common operation),
are free.
GOOGLE CLOUD BIGTABLE
Google Cloud Bigtable is Google’s NoSQL big-data database service. It’s the cloud
version of the same database that powers many core Google services, including
Search, Analytics, Maps, and Gmail.
Bigtable is designed to handle massive workloads at consistent low latency
and high throughput, so it’s a great choice for both operational and analytical
applications, including Internet of Things (IoT) use cases, user analytics, and
financial data analysis.
This public cloud service gives you instant access to all the engineering effort that
was put into Bigtable at Google over the years. The Apache HBase-like database
is flexible and robust, and lacks some of the inherited HBase issues, such as Java
garbage collection (GC) stalls. In addition, Cloud Bigtable is completely managed, so you
don’t need to provision hardware, install software, or handle failures.
EXPERT TIP
Cloud Bigtable does not have strong typing; it’s basically a massive key value
table. As data comes in, it is treated as binary strings. This DBaaS also does not
have any type of querying through SQL. You have the key, then you can get the
value. Cloud Bigtable is also built for very large tables, so it’s not worth considering
this for anything less than a table of 1 terabyte.
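A hedged sketch of that key-to-value access pattern, using the google-cloud-bigtable Python client and assuming the instance, table, and a "readings" column family already exist (all names are placeholders):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("my-instance")
    table = instance.table("sensor-readings")   # table and column family assumed to exist

    # Write: values are stored as binary strings under a row key.
    row = table.row(b"device#thermostat-1#20160901")
    row.set_cell("readings", b"temp_c", b"21.5")
    row.commit()

    # Read: you have the key, you get the value back.
    row_data = table.read_row(b"device#thermostat-1#20160901")
    print(row_data.cells["readings"][b"temp_c"][0].value)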
HOW IT’S PRICED
Pricing for Cloud Bigtable is based on: the number of Cloud Bigtable nodes that
you provision in your project for each hour (you will be billed for a minimum of one
hour); the amount of storage that your tables use over a one-month period; and
the amount of network bandwidth used. Some types of network egress traffic are
subject to bandwidth charges.
18-WHEELERS
The 18-wheelers can handle the heavy load of structured data. These are basically
data warehouses in the cloud. They store and easily query large amounts of
structured data.
AMAZON REDSHIFT
Amazon Redshift is the granddaddy of the 18-wheeler DBaaS offerings. This is
Amazon’s modified PostgreSQL with columnar storage.
Other columnar storage-type offerings include HPE Vertica, Microsoft SQL
Server Parallel Data Warehouse (PDW) and SQL Server Column stores, Oracle
Exadata Database Machine (Exadata), and Oracle Database In-Memory. All these
technologies achieve excellent compression ratios through the columnar storage.
Instead of storing the data by rows, they store it by columns, which makes the
scans of the data very fast.
Redshift is a relational massively parallel processing (MPP) data warehouse, so
there are multiple nodes rather than just one big machine. The service works with
SQL queries and also allows you to write your own modules in Python.
Because Redshift is scaled per node, if you need more power you need to add
another node. This means you need to make a selection of both compute and
storage, and the service is charged per node, per hour. Redshift gives you a lot of
control over specific node configurations, so you can choose how many cores and
how much memory the nodes have. You can also decide whether to pay more and
have the fastest storage on the nodes through solid state drives (SSDs) or save
some money by instead using hard drive-based storage attached to the nodes.
Redshift is a very good warehousing solution for all your data. If you have a big
footprint on AWS, Redshift is definitely the warehousing solution that you want.
EXPERT TIP
With Redshift, you do need to watch node count and configurations. The ideal
configuration of your Redshift cluster might depend on your workload and your
workload patterns, so you need to decide if it’s better to have fewer nodes with
really high specs or more nodes with less compute or less memory. Based on
your analysis, you then need to properly tune Redshift for your workload and
warehouse design.
Also be aware of possible copy issues due to Amazon Simple Storage Service (S3)
consistency. Amazon recommends that you use manifest files to specify exactly what
you want to load, so you don't end up simply reading file names off S3 and, because of
eventual consistency, missing a file.
Finally, Redshift does require regular maintenance to keep the statistics and tables
up to date. If you do any updates or deletes, the service has an operation called
VACUUM to keep the tables optimally organized for fast retrieval.
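Because Redshift speaks the PostgreSQL protocol, both the manifest-based COPY and the routine VACUUM and ANALYZE maintenance can be scripted with a standard driver such as psycopg2. In the sketch below, the cluster endpoint, credentials, bucket, and IAM role are placeholders.

    import psycopg2

    # Connection details are placeholders; Redshift uses the PostgreSQL protocol on port 5439.
    conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="loader", password="change-me")
    conn.autocommit = True   # VACUUM cannot run inside a transaction block
    cur = conn.cursor()

    # Load via a manifest file so eventual consistency on S3 cannot silently drop files.
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/loads/sales.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        MANIFEST
        FORMAT AS CSV;
    """)

    # Routine maintenance after updates and deletes keeps tables and statistics fresh.
    cur.execute("VACUUM sales;")
    cur.execute("ANALYZE sales;")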
HOW IT’S PRICED
Redshift is billed by the hour per node. The cost of each node depends on the
configuration of cores, amount of memory, and type of storage you select.
MICROSOFT AZURE SQL DATA WAREHOUSE
Microsoft Azure SQL Data Warehouse is Microsoft’s response to Redshift. It’s fully
relational, with 100 percent SQL-type queries, and highly compatible with
T-SQL for SQL Server. If you have SQL Server investments, it would be very easy
to adopt SQL Data Warehouse.
Like Redshift, storage is columnar and the service is MPP. Data is split into storage
distributions when you load it. The architecture is distributed, so a query is sent to
all the different nodes to help resolve your questions.
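As a hedged illustration of how that distribution works: when you create a table, you declare its distribution method and columnar storage in T-SQL. The sketch below issues the statement through pyodbc from Python, with placeholder server and table names.

    import pyodbc

    # Server, database, and credentials below are illustrative placeholders.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=mydwserver.database.windows.net;DATABASE=mydw;"
        "UID=dbadmin;PWD=change-me",
        autocommit=True,
    )

    # Declare how rows spread across the MPP distributions, and store them in columnar format.
    conn.execute("""
        CREATE TABLE dbo.FactSales
        (
            SaleId      BIGINT NOT NULL,
            CustomerKey INT    NOT NULL,
            Amount      DECIMAL(18, 2)
        )
        WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);
    """)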
Azure SQL Data Warehouse scales compute and storage independently. Unlike
Redshift, where you always need to scale on a full node, Azure SQL Data
Warehouse allows you to add just more compute if you only need more compute.
You can also add more storage and keep the same amount of compute.
A very powerful capability is that you can pause compute completely. For example, if you
don’t have much load on your data warehouse during the weekend, you can decide to
shut it down and pause it completely during the weekend, for maximum savings.
Azure SQL Data Warehouse is an excellent enterprise warehousing solution,
particularly if you have a lot of data already built on Azure services. If you have a
pause-friendly workload, this service will provide very good savings.
EXPERT TIP
Unlike Redshift, which gives you a lot of control over the configuration of the
nodes, Azure SQL Data Warehouse gives you no control over hardware. It’s 100
percent Platform-as-a-Service (PaaS). You simply select a compute unit, called a
Data Warehousing Unit (DWU). The number of DWUs you select gives you an idea of the
power that you get for the data warehouse.
Be aware that at the time of publication, not all T-SQL data types are supported
yet. For example, if you need to store spatial data, you could store it now just as
binary, but you won’t have full support of all the spatial functions.
Before you start a full migration to Azure SQL Data Warehouse, ensure that you
carefully review which functionality is available. However, if you have only regular
structured types on your data warehouse, it’s definitely wise to consider this
service now.
HOW IT’S PRICED
Azure SQL Data Warehouse has two separate cost components: storage and
compute. Compute is elastic and is billed by the hour based on the number of
DWUs you provision.
GOOGLE BIGQUERY
Google BigQuery is a mix of an 18-wheeler and a container ship. A container ship
is a big-data, Hadoop-style service.
BigQuery is a hybrid because it is based on a structured schema but at the same
time allows for easy integration with Google Cloud Dataproc and applying schema on read
over storage. The service supports regular tables with data stored inside the
service as well as virtual tables where you put schema on read. It’s the same with
external tables, so you can map BigQuery to other services inside Google, such as
Google Cloud Storage, and then have those tables defined inside BigQuery to be
used for your analytic queries.
BigQuery is Google Cloud Platform’s serverless analytics data warehouse, so you
do not need to manage hardware, software or the operating system.
Google has replaced its SQL with a standards-compliant dialect that enables
more advanced query planning and optimization. There are also new data types,
additional support for timestamps, and extended JOIN support.
BigQuery also has a streaming interface, so instead of running an Extract,
Transform and Load (ETL) process based mostly on fixed-schedule batch
processing, you can also have a streaming flow that brings inserts directly into
BigQuery constantly using an API or the Cloud Dataflow engine.
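As a hedged sketch with the google-cloud-bigquery Python client library (dataset, table, and field names are placeholders), a standard-SQL query and a streaming insert look like this:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # project ID is a placeholder

    # Run a standard-SQL query; you pay for the data the query reads.
    query_job = client.query(
        "SELECT device_id, AVG(temp_c) AS avg_temp "
        "FROM `my_dataset.sensor_readings` "
        "GROUP BY device_id"
    )
    for row in query_job.result():
        print(row.device_id, row.avg_temp)

    # Stream rows in through the streaming interface instead of a batch ETL load.
    table = client.get_table("my_dataset.sensor_readings")
    errors = client.insert_rows_json(table, [{"device_id": "thermostat-1", "temp_c": 21.5}])
    assert not errors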
EXPERT TIP
BigQuery is a very good one-stop shop if you have streaming data, relational
data, and file-based data because it can put schema on read. But watch out for
high-compute queries where Google estimates that it takes too much compute
to resolve at their regular rate per query: as of August 2016, the limit is 1 terabyte.
Above this limit, the extra compute cost is $5 per terabyte. You might receive an
error message that reads, “Hey, you need to run with higher compute.” The cost of
that query will be higher, so you will need to watch out for runaway costs.
Hadoop can be attached to BigQuery tables, but it does require a temporary data
copy. The BigQuery Hadoop connector will perform a temporary data copy to
Google Cloud Storage (GCS) for Hadoop. Don’t be surprised if you incur some
GCS costs for this type of operation.
HOW IT’S PRICED
BigQuery is billed per storage and per query, with automatic lower pricing tiers
after 90 days of data being idle. Costs are based on the amount of data you have
and how much of it you read. If you are streaming data in, you will pay extra for it.
With BigQuery, you pay only for data read by queries. For example, if you have a
20-TB warehouse in BigQuery, but you’re only running 1 to 10 queries per day, you
will pay for only those queries. You do not need to pay for provisioned compute and
storage the way you do with Redshift. With Azure SQL Data Warehouse you also
pay for compute, but at least you can pause it. BigQuery goes one step beyond
by charging for only specific queries that you run. As a result, you don’t even need
to think about starting and pausing compute. You simply use compute on demand
whenever you want to run a query.
CONTAINER SHIPS
The container ships are big-data systems that carry everything, any shape or
form. They are really Hadoop-as-a-Service, and this is very attractive because
on-premises Hadoop deployments have a high cost to experiment: the high cost
of curiosity. You need to build your Hadoop service and also have enough storage
and enough nodes before you can start your data exploration.
If you instead do your data exploration in the cloud, you can let the cloud deploy
all the power you need. If you need a very large cluster, you don’t need to make
any type of capital expenditures to get up and running. You also don’t need to
make operational expenditures to manage the cluster. You simply create and
destroy as needed, and you pay for storage in the cloud. All the major cloud
providers offer this type of service.
All of the container ships follow a similar pattern. You pick a machine model for
your nodes, decide how many nodes you want, deploy the cluster at that size,
and then attach the ship to a storage service that it can read the data from.
The Amazon storage service is S3. The Microsoft Azure services are Azure Data
Lake and Azure Blob Storage. Google uses Google Cloud Storage.
After the cluster is deployed, you use it as a Hadoop installation if you need to run
MapReduce, Apache Spark, Apache Storm, or any other type of Hadoop-based service.
AMAZON ELASTIC MAPREDUCE
Amazon Elastic MapReduce (EMR) is a managed Hadoop framework that makes it
easy, fast, and cost-effective to distribute and process vast amounts of your data
across dynamically scalable Amazon EC2 instances. You can also run other popular
distributed frameworks such as Spark and Presto in Amazon EMR, and interact with
data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR releases are packaged using a system based on Apache Bigtop,
which is an open-source project associated with the Hadoop ecosystem.
In addition to Hadoop and Spark ecosystem projects, each Amazon EMR
release provides components that enable cluster and resource management,
interoperability with other AWS services, and additional configuration optimizations
for installed software.
EXPERT TIP
Amazon provides the AWS Data Pipeline service, which lets you automate recurring
clusters by implementing an orchestration layer that automatically starts the cluster,
submits jobs, handles exceptions, and tears down the cluster when the job is done.
HOW IT’S PRICED
Amazon charges per hour for EMR. One way to minimize costs is to have some of the
compute nodes deployed on Spot Instances; this provides savings of up to 90 percent.
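A hedged sketch of launching a transient EMR cluster with boto3 ties these ideas together: the cluster starts, runs one Spark step, and tears itself down, with task nodes on Spot Instances to cut compute costs. All names, paths, roles, and prices are placeholders.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="nightly-spark-aggregation",
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": 2},
                # Task nodes on Spot Instances reduce the compute bill.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "InstanceType": "m3.xlarge", "InstanceCount": 4,
                 "Market": "SPOT", "BidPrice": "0.10"},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,   # tear down when the step finishes
        },
        Steps=[{
            "Name": "spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])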
MICROSOFT AZURE HDINSIGHT
Microsoft Azure HDInsight is an Apache Hadoop distribution that deploys and
provisions managed Hadoop clusters. This service can process unstructured or
semi-structured data and has programming extensions for C#, Java, and .NET,
so you can use your programming language of choice on Hadoop to create,
configure, submit, and monitor Hadoop jobs.
HDInsight is tightly integrated with Excel, so you can visualize and analyze your
Hadoop data in compelling new ways using a tool that’s familiar to your business
users. HDInsight incorporates R Server for Hadoop, a cloud implementation of
one of the most popular programming languages for statistical computing and
machine learning. It gives the familiarity of R with the scalability and performance
of Hadoop. HDInsight also includes Apache HBase, a columnar NoSQL database
that runs on top of the Hadoop Distributed File System (HDFS). This lets you do
large transactional processing (OLTP) of non-relational data, enabling use cases
like interactive websites or having sensor data write to Azure Blob Storage.
HDInsight includes Apache Storm, an open-source stream analytics platform
that can process real-time events at large scale. It also includes Apache Spark,
an open-source project in the Apache ecosystem that can run large-scale data
analytics applications in memory.
EXPERT TIP
HDInsight includes HBase, enabling you to do large transactional processing (OLTP)
of non-relational data for use cases such as interactive websites or having sensor
data write to Azure Blob Storage. You can also run Spark and Storm in HDInsight.
HOW IT’S PRICED
HDInsight is priced based on storage and the cost of the cluster. The cost of the
cluster is an hourly rate per node of the cluster.
GOOGLE CLOUD DATAPROC
Google Cloud Dataproc is a managed Apache Hadoop, Apache Spark, Apache
Pig, and Apache Hive service that lets you use open-source data tools for batch
processing, querying, streaming, and machine learning. Dataproc helps you create
clusters quickly, manage them easily, and save money by turning clusters off when
you don’t need them. From a networking perspective, Dataproc supports subnets,
role-based access, and clusters with no public IP.
Similar to Amazon EMR, Dataproc releases are packaged using a system based
on Apache Bigtop, which is an open-source project associated with the Hadoop
ecosystem. Although some of the tools from the Hadoop ecosystem might not be
enabled by default, it is very easy to add them to the deployment.
One advantage of Dataproc over EMR is how fast the cluster can be deployed: for
most of the configurations the time is less than 90 seconds. Also, after the first 10
minutes there is by-the-minute billing, which makes Dataproc clusters great building
blocks for a more complex ETL pipeline.
Another advantage of Dataproc over other managed Hadoop services is its
integration with Google Cloud Storage as an alternative to the Hadoop Distributed
File System. This integration provides immediate consistency. By contrast, it usually
takes 1 to 3 minutes before files become visible on, for example, S3. Immediate
consistency in Dataproc means that the same storage can be accessed across
multiple clusters in a consistent manner.
EXPERT TIP
There is no global orchestration and scheduling service available from Google yet
(similar to AWS Data Pipeline), so a custom Luigi, Oozie, or Airflow setup will need to be
deployed and maintained.
Google is also still working on deeper integration of Stackdriver, Google’s
integrated monitoring, logging, and diagnostics tool, with Dataproc. An integration
at the Job level should be available soon. In the meantime, the Dataproc user
interface does provide access to the required logs.
HOW IT’S PRICED
Pricing for Dataproc is based on storage and the cost of the cluster. The cost of
the cluster is an hourly rate per node of the cluster.
MICROSOFT AZURE DATA LAKE
Microsoft Azure Data Lake is Microsoft’s one step up from Hadoop-as-a-Service.
Azure Data Lake service is separated into storage and analytics. The storage
service has no limit on size, including no limit on the size of a file. The analytics
service can run large data jobs on demand, very similar to how BigQuery runs
queries on demand.
Because Azure Data Lake is a big-data type of repository, you can mix tables,
you can mix files, and you can have external tables. Azure Data Lake does all this
through the U-SQL language, which is a mix of SQL and C#. If you have DBAs in
your company, or if you have developers who know SQL and C#, it is easy for
them to be productive very quickly with Azure Data Lake without needing to learn
all the different pieces of the Hadoop ecosystem, such as Pig and Hive.
If you do need a full Hadoop cluster, for example if you want to use some Mahout
algorithms on your data, you can attach an HDInsight cluster directly to Azure Data
Lake and then run from that. You also have the option of on-demand analytics
through U-SQL.
Analytics can also be scaled dynamically to increase compute. You simply increase
the number of analytic units, which are the nodes running your queries. Because
analytics are performed per job, you can easily control your cost of using the
service. Each time you submit a job, there’s a fixed cost.
Azure Data Lake is excellent for leveraging T-SQL and .NET skills to provide
Platform-as-a-Service (PaaS) big-data analytics. The barrier to entry for doing big
data analytics is very low in terms of learning new skills.
EXPERT TIP
Be aware that this service is still in public preview at the time of this writing. For
this reason, it has a limit of 50 analytic units when you run a job, and 3 concurrent
jobs per account. However, if you do have a strong use case, you should reach
out to Microsoft Support because they can lift these restrictions.
HOW IT’S PRICED
Azure Data Lake has two components: storage and jobs. Your total storage costs
depend on how much you store and the volume of data transfers. Jobs are billed at a
flat rate per job plus the number of Analytic Units used. These units govern how
many compute resources you get.
SUMMARY
When it comes to choosing a DBaaS, you have a variety of options. The Corollas
are the classic RDBMS services in the cloud: not flashy, but reliable. The Formula
One offerings are built for purpose. They don’t have all the functionality of
the mature RDBMS products but they ingest and query data very quickly. The
18-wheelers are data warehouses in the cloud that store and easily query large
amounts of structured data. The container ships are big-data systems that carry
everything. Think of them as Hadoop-as-a-service.
All of these offerings can improve delivery because all the management tasks are
automated. As a result, there’s less chance of human error and less chance of
quality issues during maintenance. All of the offerings also reduce time-to-market,
enable faster ROI, and reduce capital expenditures.
Before you choose a service, you need to understand all of them, then closely
consider your requirements. You don’t want to deploy DocumentDB, then realize
later that what you really needed was an RDBMS service. You don’t want to
choose Redshift, only to discover that you’d have been better served by BigQuery.
Think about your relational data, your NoSQL unstructured data, and your big
structured data requirements for warehousing. Maybe you’re also adopting big
data analytics.
With the right public cloud service for your use case, you can leverage your data
to gain insights, then use those insights to gain competitive advantages.
For more information about how Pythian can help you choose the right DBaaS for
your needs, please visit: https://www.pythian.com/solutions/
Warner Chaves
@warchav
Warner Chaves is a principal consultant at Pythian, and Microsoft Certified Master and Microsoft
MVP. Warner has been recognized by his colleagues for his ability to remain calm and collected
under pressure. His transparency and candor enable him to develop meaningful relationships
with his clients, where he welcomes the opportunity to be challenged. Originally from Costa Rica,
Warner is fluent in English and Spanish.
ABOUT THE AUTHOR
ABOUT PYTHIAN
Pythian is a global IT services company that helps businesses become more competitive by using technology to reach their business goals.
We design, implement, and manage systems that directly contribute to revenue and business success. Our services deliver increased agility
and business velocity through IT transformation, and high system availability and performance through operational excellence. Our highly
skilled technical teams work as an integrated extension of our clients’ organizations to deliver continuous transformation and uninterrupted
operational excellence using our expertise in databases, cloud, DevOps, big data, advanced analytics, and infrastructure management.
V01-092016-NA
Danil Zburivsky
@zburivsky
Danil Zburivsky is Pythian’s director of big data and data science. Danil leads a team of big data
architects and data scientists that help customers worldwide to achieve their most ambitious goals
when it comes to large scale data platforms. He is recognized for his expertise in architecting, and
building and supporting large mission-critical data platforms using MySQL, Hadoop and MongoDB.
Danil is a popular speaker at industry events, and has authored a book titled Hadoop Cluster
Deployment.
Vladimir Stoyak
Vladimir Stoyak is a principal consultant for big data. Vladimir is a certified Google Cloud Platform
Qualified Developer, and Principal Consultant for Pythian’s Big Data team. He has more than 20
years of expertise working in Big Data and machine learning technologies including Hadoop, Kafka,
Spark, Flink, Hbase, and Cassandra. Throughout his career in IT, Vladimir has been involved in
a number of startups. He was Director of Application Services for Fusepoint, which was recently
acquired by CenturyLink. He also founded AlmaLOGIC Solutions Incorporated, an e-Learning
analytics company.
Derek Downey
@derek_downey
Derek Downey is the practice advocate for the OpenSource Database practice at Pythian,
helping to align technical and business objectives for the company and for our clients. Derek
loves automating MySQL, implementing visualization strategies and creating repeatable training
environments.
Manoj Kukreja
@mkukreja
Manoj Kukreja is a big data and IT security specialist whose qualifications include a degree
in computer science, a master’s degree in engineering, along with CISSP, CCAH and OCP
designations. With more than twenty years of experience in the planning, creation and deployment
of complex and large scale infrastructures, Manoj has worked for large scale public and private
sectors organizations including US and Canadian government agencies. Manoj has expertise in
NoSQL and big data technologies including Hadoop, MySQL, MongoDB and Oracle.
CONTRIBUTORS
Pythian, The Pythian Group, “love your data”, pythian.com, and Adminiscope are trademarks of The Pythian Group Inc. Other product and company names mentioned herein may be trademarks or registered trademarks of their respective owners. The information presented is subject to change without notice. Copyright © <year> The Pythian Group Inc. All rights reserved.