CloudDB: A Data Store for all Sizes in the Cloud Hakan Hacigumus Data Management Research NEC Laboratories America http://www.nec-labs.com/dm www.nec-labs.com
Page 1: A Data Store for all Sizes in the Cloudi.stanford.edu/infoseminar/archive/WinterY2011/hacigumus-slides.pdf · A Data Store for all Sizes in the Cloud Hakan Hacigumus Data Management

CloudDB: A Data Store for all Sizes in the Cloud

Hakan Hacigumus

Data Management Research

NEC Laboratories America

http://www.nec-labs.com/dm

www.nec-labs.com

2 NEC Labs Data Management Research

What I will try to cover

Historical perspective and motivation

(Preliminary) Technical Approach

Current Status

Food for Thought


Why Data Management Research?

Many Data Management Technologies and Products have been around

Data Centers have evolved over time

Data Center hosting became a business

Database Community was successful in creating technologies and business


Why Data Management (Again)?

Amount of Data: Amount of business data doubles every 12-18 months

New Data Types: Relational databases only manage 10-15% of the available data

New Data Sources: Individual users via Web 2.0 applications, social sites, collaboration, mobile devices, sensors, etc.

New Usage Patterns: Around the clock, around the world, highly interconnected

Large Number of Users: Unprecedented increase and fluctuations

New Type of Apps: Highly integrated, extremely data intensive

(Good Old) Database


Cloud Computing

A paradigm shift in how and where a workload is generated and executed

Cloud service provider – Cloud service consumer

Market size: Data Management market ~$20B; IT Cloud Services ~$42B (by 2012) (IDC)

Cloud Provider

API



Animoto on Amazon EC2

Rapid growth: in three days, the number of users increased from 25k to 250k

Number of servers grew from 50 to 3500

Assuming $500 per machine, buying that hardware would cost $1.75M!

Instead, they used Amazon EC2

A no-infrastructure startup

Biggest piece of hardware: a (fancy) espresso machine!

Problem: It is not trivial to distribute users’ accesses to the data by just scaling out cloud computing nodes
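The $1.75M figure is simple arithmetic on the slide's own numbers (3500 servers at an assumed $500 each). A minimal sketch; the function name is illustrative and not from the talk:

```python
def upfront_hardware_cost(servers: int, price_per_server: float) -> float:
    """Capital cost of buying every server needed at peak load."""
    return servers * price_per_server


# Figures from the slide: 50 servers before the spike, 3500 at peak,
# and an assumed price of $500 per machine.
baseline_cost = upfront_hardware_cost(50, 500)
peak_cost = upfront_hardware_cost(3500, 500)
print(f"baseline: ${baseline_cost:,.0f}, peak: ${peak_cost:,.0f}")
```

The hardware bought for the peak would sit mostly idle once the spike passed, which is the case for renting capacity instead.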


Database-as-a-Service?

ICDE 2002!

Reaction: Cool, but…

Technology

Regulations

Psychological Acceptance

Business Model


Data Management in Cloud

The cloud computing model may provide a platform to address new challenges

But the problem is: Data Management Systems were not designed and implemented with the cloud computing model in mind

So the question is: What are the data management challenges we need to address before the full potential of cloud computing can be realized?


Need for New Solutions

Massive scalability to handle very large amounts of data and very large numbers of diverse users/requests

Elasticity to handle varying demand and optimize operating costs

Flexibility to handle different data and processing models

Massive multi-tenancy to achieve economies of scale

More intelligent system monitoring and management


Cloud Data Management Challenges

[Chart axes: # of queries / sec and # of records / query]

Large analytic apps (OLAP). Key challenge: scalable scan and aggregation

Large transactional apps (OLTP). Key challenge: scalable read/write

Small apps. Key challenge: scalable multi-tenant hosting

Ultimate goal. Key challenge: seamless data management

Query scalability

Data scalability

Multi-tenancy

CloudDB


Buy All Sizes?

OLTP, OLAP

? – NO!


Buy One Size?

OLTP

OLAP


Let Someone Else Do All That

OLTP, OLAP

Access and Management

Leveraging very specialized database technologies

Easier integration with applications

Easier adoption by developers (dominant force for adoption of cloud!)

Easier and more flexible deployment options in the middleware


Wish Lists

Clients

- Standard language API (e.g., SQL)

- Identifiable and verifiable Service Level Agreements

- Common DBMS maintenance tasks (e.g., backup, versioning, patching, etc.)

- Availability of value-add services, such as business analytics, information sharing, collaboration, etc.

Service Provider

- Satisfying clients’ SLAs to sustain revenue

- Great cost efficiency via high level of automation and resource sharing to ensure profitability

- Maintaining an extendable platform for value-add services


(Some) Storage Models

Store Type: Relational

Main Purpose: Transaction processing

Pro: Standardization; higher performance on Online Transaction Processing (OLTP); ACID properties

Con: Scalability

Store Type: Key/Value

Main Purpose: Scalable data storage; read/write intensive workload

Pro: Scalability

Con: Standardization; performance issues; complex query capability; ACID properties (?)

Store Type: Column-Oriented

Main Purpose: Analytics processing; read optimized, throughput oriented

Pro: Higher performance on Online Analytical Processing (OLAP); more flexible schema evolution (?)

Con: Standardization; complex query capability


Application Scenario

Application v1: Personal Profile Management

• Address, Phone, Notes, Contacts, Calendars, Reminders

• Profile Data: User 1 Data, User 2 Data, …

Application v2: Information Portal

• Online Shopping Catalogs, Product Reviews, Subscriptions, …

• Portal Data: Products, Reviews, External Sources

Stores: Relational Database, Key/Value Store

Very difficult migration

• Application developers (skills, time)

• Architects (redesign)

• Company (investment)


Data Model Decisions

Problem: Users are forced to make a decision on the data model based on the current needs of the applications

Is it possible to make the “right” decision all the time?

Problem: The developer (client) has to re-architect their application in order to take advantage of different data models

How easy is it to change the architecture and the implementation?

[Chart: axis # of queries / sec. Stages along the axis: Single RDBMS, Clustering, Sharding, Key-value store. Application versions: Ver 1.0, Ver 2.0, Ver 3.0, Ver 4.0. Caption: Workload evolves…]


Remember Data Independence?

1968

1970


Data Independence

Decouple application logic from data processing

Let them be optimized and managed independently

Enabled decades of innovation and improvement in databases


Data Independence

The application should not have to be aware of the physical organization of the data (and how it can be accessed)

All it needs is a logical (declarative) specification

CloudDB makes decisions based on application context, workload characteristics, etc.

[Diagram: the Application uses a SQL API for Data Load and Query/Update. Beneath it, "CloudDB: A layer for data independence" sits over a Relational Store, a Key/Value Store, and an Analytics Store. Axis label: # of queries / sec]
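One way to picture a data-independence layer that decides from workload characteristics is a small routing function. This is only a sketch: the `Workload` fields mirror the two chart axes used earlier in the talk (# of queries / sec, # of records / query), while the thresholds, the store names, and the decision rule are made-up assumptions, not CloudDB's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    queries_per_sec: float     # request rate seen by the application
    records_per_query: float   # typical number of records a query touches


def choose_store(w: Workload,
                 high_qps: float = 10_000,
                 large_scan: float = 100_000) -> str:
    """Pick a backing store from the workload shape (hypothetical thresholds)."""
    if w.records_per_query >= large_scan:
        return "analytics"     # scan and aggregation heavy (OLAP-like)
    if w.queries_per_sec >= high_qps:
        return "key_value"     # many small reads/writes
    return "relational"        # modest load, full SQL and ACID


# The application keeps issuing the same SQL; only the routing changes.
for w in (Workload(50, 10), Workload(50_000, 5), Workload(2, 5_000_000)):
    print(w, "->", choose_store(w))
```

The point of the sketch is where the decision lives: in the layer, not in the application code.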


Language?

New Breed Databases

CouchDB, Project Voldemort (Dynamo), Cassandra, BigTable, Tokyo Cabinet, MongoDB, SimpleDB, ….

MapReduce/Hadoop


Some Reminders about SQL

By far the most widely used data access language

It has nothing to do with

How the data is stored

How the queries are executed

How the transactions are handled

Very large number of skilled programmers

Huge amount of existing applications and tools


SQL is actually good?

Hive: SQL API on top of MapReduce

Google BigQuery: SQL over data stored in non-relational databases

….


CloudDB - Guiding Principles

Embrace heterogeneity

One size does not fit all

Leverage specialized technologies

Maintain and restore “declarative” nature of data processing

Understand and Define dimensions of scalability


CloudDB Middleware – Opaque vs. Transparent

System Independence?

The middleware would be responsible for making all the decisions regarding the choice of data stores, processing the queries, and end-to-end system optimization

While the middleware can abstract away the underlying storage systems, it should explicitly express certain essential aspects of the system, such as consistency levels and scalability of transactions

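[Diagram: Applications send SQL queries and receive results through API/Language Support (SQL). The CloudDB Middleware, with its Distributed Query Processor, sits between the applications and the Data Stores. Labels on the Opaque / Transparent split: Transaction Patterns, Consistency / Scalability]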


CloudDB Platform

[Architecture diagram. Labels, from the application side down to storage:]

(External) Applications: send SQL queries, receive results

API/Language Support (JDBC, SQL)

Distributed Query Processor

Intelligent Cloud Database Coordinator (ICDC)

Workload Analysis, Design Optimizer, System Monitor, Database Cluster Controller

Client SLAs, SLA Aware Dispatcher, Schedulers, Capacity Planner

Multi Tenancy Manager (MTM)

CloudDB Store: Relational Store, Analytics Store, Key-Value Store, each with Internal Query Processing

Auto Sharding, Auto Replication, Auto Partitioning, Data Migration


CloudDB Platform – Key Points

[Same architecture diagram as the CloudDB Platform slide, annotated with three key points:]

One Unified, Standard API

Intelligent Analysis and Decision Making

Specialized Stores for Specific Needs


Our Data Management Platform

Key Research Areas

[Same architecture diagram as the CloudDB Platform slide, annotated with research areas and key points:]

Intelligent Management

Workload Management

Data Stores

Specialized Stores for Specific Needs

Intelligent Analysis and Decision Making

One Unified, Standard API


CloudDB System Architecture -- Microsharding is a part of CloudDB

[Same architecture diagram as the CloudDB Platform slide, with one added label: Microsharding]


SQL over Key-Value Stores

Microsharding to enable SQL over key-value stores

[Diagram: Applications issue SQL to query execution nodes (relational middleware, a pool of servers), which use key access against storage nodes (storage cloud, a pool of servers) running the Key-Value Store]

Key challenge: limited access capabilities (only key-based put/get)


Microsharding

Key-Value stores are good at scaling write intensive workloads

But they don’t leverage a large body of technologies developed in databases over the decades, such as relationships, transactions, advanced query functions, etc.

Instead, these are hand-coded by developers

Microsharding aims at bringing those capabilities into key-value stores in a principled way


Key Technical Questions Addressed

How can we map relational schemas to key-value store data models?

How can we map relational tuples to key-value objects?

Once we have those mappings, how can we define transaction classes that can be supported in a scalable way in key-value stores?

What are the system implementation issues with such a middleware?


Query and Data Transformation

Physical design: mapping between relational data and K/V data

Schema (+data):

TABLE users (
  id: primary key
  …)

TABLE reviews (
  id: primary key
  user_id: foreign key to users
  …)

Query (template):

SELECT * FROM users, reviews
WHERE users.id = reviews.user_id
AND users.id = ?

Physical Design:

NEST reviews BY user_id
….

Transformed data (KV data): one “Microshard” User[Review] per user, i.e. a users object with its reviews nested inside

Query plan: GET, then UNNEST
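The NEST / GET / UNNEST idea can be sketched with a plain dictionary standing in for the key-value store. Everything here beyond the `users` / `reviews` tables and the `user_id` column is an assumption for illustration: the `users/<id>` key format, JSON as the value encoding, and both function names are hypothetical, not taken from the talk.

```python
import json
from collections import defaultdict


def nest_reviews_by_user(users, reviews):
    """Physical design step: one key-value object per user, reviews nested inside."""
    by_user = defaultdict(list)
    for r in reviews:
        by_user[r["user_id"]].append(r)
    return {
        f"users/{u['id']}": json.dumps({**u, "reviews": by_user[u["id"]]})
        for u in users
    }


def users_join_reviews(kv, user_id):
    """Query plan for the join template: a single GET followed by UNNEST."""
    raw = kv.get(f"users/{user_id}")          # GET by the key of the root relation
    if raw is None:
        return []
    shard = json.loads(raw)
    user = {k: v for k, v in shard.items() if k != "reviews"}
    return [(user, review) for review in shard["reviews"]]   # UNNEST


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
reviews = [{"id": 10, "user_id": 1, "text": "good"},
           {"id": 11, "user_id": 1, "text": "slow"}]
kv = nest_reviews_by_user(users, reviews)
for user, review in users_join_reviews(kv, 1):
    print(user["name"], review["text"])
```

Because the join partner is stored inside the same value, the query needs only the key-based get that a key-value store offers.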


Microsharding

A microshard is

a logical unit of data

a principled way to shard a database into small fragments

a unit of transactional data access

accessed by its key, the key of the root relation

[Diagram: microshards with Key = 1, Key = 2, Key = 3, …, Key = N. Two transactions on Users key = 1, one on Users key = 2, and one on Users key = 3, each directed to the microshard with that key]
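A microshard as the unit of transactional access can be illustrated with optimistic, per-key updates. This is a toy sketch under stated assumptions: `VersionedKV` is an in-memory stand-in with a compare-and-set on an integer version, and `run_microshard_txn` is a hypothetical helper. Neither reflects the API of a real key-value store or the mechanism CloudDB actually uses.

```python
import threading


class VersionedKV:
    """Toy in-memory key-value store with compare-and-set on a version number."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))      # (version, value)

    def put_if_version(self, key, expected_version, value):
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False                            # another writer got there first
            self._data[key] = (current_version + 1, value)
            return True


def run_microshard_txn(store, key, update, max_retries=10):
    """Read one microshard, apply `update`, write it back atomically; retry on conflict."""
    for _ in range(max_retries):
        version, shard = store.get(key)
        new_shard = update(shard)
        if store.put_if_version(key, version, new_shard):
            return new_shard
    raise RuntimeError("too many conflicting writers on " + key)


def add_review(text):
    """Build an update function that appends one review to a user's microshard."""
    def update(shard):
        reviews = list(shard["reviews"]) if shard else []
        return {"reviews": reviews + [text]}
    return update


store = VersionedKV()
run_microshard_txn(store, "users/1", add_review("good"))
run_microshard_txn(store, "users/1", add_review("slow"))
run_microshard_txn(store, "users/2", add_review("fine"))
print(store.get("users/1"))
```

Two transactions on the same key are ordered by the version check, while transactions on different keys never contend, which is what lets the work spread across nodes.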


Isolation Levels

No consistency guarantee on read/write outside of a microshard

[Diagram: transactions (T) are collected into transaction groups, each drawn against one microshard. Microshards: distributed on the key-value store. Transaction groups: distributed on query execution nodes]


Scale Independence

Experiment Setup

RUBiS benchmark (eBay type auction application)

Read/Write workload (transition matrix)

Short think time to saturate the system

Voldemort (Dynamo) key-value store

[Chart: throughput (1000 sessions / sec) versus number of emulated concurrent clients (thousands), with one curve each for 3, 4, 5, and 6 Voldemort nodes]

Message: Ability to automatically scale to more concurrent sessions (throughput) simply by increasing the number of key-value nodes
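Why adding key-value nodes can raise capacity: if microshard keys are spread by a hash, each node holds and serves a smaller share as the node count grows. The sketch below uses simple modulo placement purely for illustration; it is an assumption, not how the experiment's store places data (Dynamo-style stores typically use consistent hashing, which also avoids reshuffling most keys when a node is added). The function names and key format are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(key: str, num_nodes: int) -> int:
    """Map a microshard key to a node index with a stable hash (modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def load_per_node(keys, num_nodes):
    """Count how many keys land on each node."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


keys = [f"users/{i}" for i in range(6000)]
for n in (3, 4, 5, 6):          # node counts used in the experiment
    print(n, "nodes:", load_per_node(keys, n))
```

Each printed list shows the per-node share shrinking as nodes are added, so the same per-node capacity covers more concurrent sessions overall.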


Directions/Questions

Support for Specifying Relaxed Consistency

Tooling to relax consistency just to the degree that there exists a feasible solution (physical design and query plans) for the specification

Scalable Data Organization over heterogeneous data stores

Physical design over heterogeneous stores such that the service level specifications are met

Scalability vs. Consistency


The Cast

NEC Labs Researchers: Hakan Hacigumus, Yun Chi, Wang-Pin Hsiung, Hojjat Jafarpour, Hyun J. Moon, Oliver Po, Junichi Tatemura, Jagan Sankaranarayanan

Advisors/Collaborators: Michael Carey (U. of California, Irvine), Hector Garcia-Molina (Stanford), Jeff Naughton (U. of Wisconsin, Madison)


CloudDB would be…

A unified data management platform that provides capabilities to transparently and efficiently support heterogeneous workloads by leveraging specialized storage models with SLA-conscious profit optimization in the cloud.


Thank You!

