Hacigumus Slides

Date post: 07-Apr-2018
Upload: gopal-chandu

Transcript
  • 8/4/2019 Hacigumus Slides

    1/42

    CloudDB:

    A Data Store for all Sizes in the Cloud

    Hakan Hacigumus

    Data Management Research

    NEC Laboratories America

    http://www.nec-labs.com/dm

    www.nec-labs.com

  • 8/4/2019 Hacigumus Slides

    2/42

    2 NEC Labs Data Management Research

    What I will try to cover

    Historical perspective and motivation

    (Preliminary) Technical Approach

    Current Status

    Food for Thought

  • 8/4/2019 Hacigumus Slides

    3/42

    Why Data Management Research?

    Many Data Management Technologies and Products have been around

    Data Centers have evolved over time

    Data Center hosting became a business

    Database Community was successful in creating technologies and business

  • 8/4/2019 Hacigumus Slides

    4/42

    Why Data Management (Again)?

    Amount of Data: Amount of business data doubles every 12-18 months

    New Data Types: Relational databases only manage 10-15% of the available data

    New Data Sources: Individual users via Web 2.0 applications, social sites, collaboration, mobile devices, sensors, etc.

    New Usage Patterns: Around the clock, around the world, highly interconnected

    Large Number of Users: Unprecedented increase and fluctuations

    New Type of Apps: Highly integrated, extremely data intensive

    (Good Old) Database

  • 8/4/2019 Hacigumus Slides

    5/42

    Cloud Computing

    A paradigm shift in how and where a workload is generated and executed

    [Diagram labels: Cloud service provider; Cloud service consumer; Cloud Provider; API]

    Market Size

    Data Management Market: ~$20B

    IT Cloud Service: ~$42B (by 2012) (IDC)

  • 8/4/2019 Hacigumus Slides

    7/42

    Animoto on Amazon EC2

    Rapid growth: in three days, the number of users increased from 25k to 250k

    Number of servers from 50 to 3500

    Assume $500 per machine, $1.75M!

    Instead, they used Amazon EC2

    A no-infrastructure startup

    Biggest piece of hardware: a (fancy) espresso machine!

    Problem: It is not trivial to distribute users' accesses to the data by just scaling out cloud computing nodes
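The $1.75M figure on the Animoto slide follows directly from the two numbers the slide gives (3500 servers, an assumed $500 per machine). A minimal check, with variable names chosen here purely for illustration:

```python
# Numbers taken from the slide; the $500 price is the slide's own assumption.
servers_needed = 3500
cost_per_machine_usd = 500

# Up-front hardware cost if the servers had been bought instead of rented.
upfront_cost_usd = servers_needed * cost_per_machine_usd

# Growth factors quoted on the slide: users 25k -> 250k, servers 50 -> 3500.
user_growth = 250_000 / 25_000
server_growth = 3500 / 50

print(f"Up-front cost: ${upfront_cost_usd:,}")
print(f"Users grew {user_growth:.0f}x, servers grew {server_growth:.0f}x")
```

The server count grows faster than the user count in this example, which is consistent with the slide's point that scaling out compute nodes alone is not the whole problem.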

  • 8/4/2019 Hacigumus Slides

    8/42

    Database-as-a-Service?

    ICDE 2002!

    Reaction: Cool, but

    Technology

    Regulations

    Psychological Acceptance

    Business Model

  • 8/4/2019 Hacigumus Slides

    9/42

    Data Management in Cloud

    Cloud computing model may provide a platform to address new challenges

    But the problem is:

    Data Management Systems were not designed and implemented with cloud computing model in mind

    So the question is:

    What are the data management challenges we need to address before the full potential of cloud computing can be realized?

  • 8/4/2019 Hacigumus Slides

    10/42

    Need for New Solutions

    Massive scalability to handle
      - Very large amounts of data
      - Very large numbers of diverse users/requests

    Elasticity to
      - handle varying demand
      - optimize operating costs

    Flexibility to handle different data and processing models

    Massively multi-tenanted to achieve economies of scale

    More intelligent system monitoring and management

  • 8/4/2019 Hacigumus Slides

    11/42

    Cloud Data Management Challenges

    [Diagram. Axes: # of queries / sec, # of records / query]

    Large Analytic apps (OLAP). Key challenge: scalable scan and aggregation

    Large Transactional apps (OLTP). Key challenge: scalable read/write

    Small apps. Key challenge: scalable multi-tenant hosting

    Ultimate goal. Key challenge: seamless data management

    Dimensions: Query scalability, Data scalability, Multi-tenancy

    CloudDB

  • 8/4/2019 Hacigumus Slides

    12/42

    Buy All Sizes?

    OLTP OLAP

    ? NO!

  • 8/4/2019 Hacigumus Slides

    13/42

    Buy One Size?

    OLTP

    OLAP

  • 8/4/2019 Hacigumus Slides

    14/42

    Let Someone Else Do All That

    OLTP OLAP

    Access and Management

  • 8/4/2019 Hacigumus Slides

    15/42

    Let Someone Else Do All That

    OLTP OLAP

    Access and Management

    Leveraging very specialized database technologies

    Easier integration with applications

    Easier adoption by developers (dominant force for adoption of cloud!)

    Easier and more flexible deployment options in the middleware

  • 8/4/2019 Hacigumus Slides

    16/42

    Wish Lists

    Clients

    - Standard language API (e.g., SQL)

    - Identifiable and verifiable Service Level Agreements

    - Common DBMS maintenance tasks (e.g., backup, versioning, patching, etc.)

    - Availability of value-add services, such as business analytics, information sharing, collaboration, etc.

    Service Provider

    - Satisfying clients' SLAs to sustain revenue

    - Great cost efficiency via high level of automation and resource sharing to ensure profitability

    - Maintaining an extendable platform for value-add services

  • 8/4/2019 Hacigumus Slides

    17/42

    (Some) Storage Models

    Relational
      Main purpose: Transaction processing
      Pro: Standardization; higher performance on Online Transaction Processing (OLTP); ACID properties
      Con: Scalability

    Key/Value
      Main purpose: Scalable data storage; read/write intensive workload
      Pro: Scalability
      Con: Standardization; performance issues; complex query capability; ACID properties (?)

    Column-Oriented
      Main purpose: Analytics processing; read optimized, throughput oriented
      Pro: Higher performance on Online Analytical Processing (OLAP); more flexible schema evolution (?)
      Con: Standardization; complex query capability
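To make the three storage models a little more concrete, here is a toy in-memory sketch of the same three records laid out row-wise, as key/value pairs, and column-wise. The record fields and function names are invented for illustration; real stores add persistence, indexing, and distribution that this sketch leaves out.

```python
# Toy records: (id, name, rating). Field names are illustrative only.
rows = [(1, "ann", 5), (2, "bob", 3), (3, "cy", 4)]

# Relational (row-oriented): each tuple is kept together.
row_store = list(rows)

# Key/value: one opaque value per key, reachable only through that key.
kv_store = {rid: {"name": name, "rating": rating} for rid, name, rating in rows}

# Column-oriented: one array per attribute, which suits scans and aggregates.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "rating": [r[2] for r in rows],
}


def lookup_by_key(key):
    """Point access, the operation a key/value store is built around."""
    return kv_store[key]


def average_rating():
    """Aggregate over one attribute; only the 'rating' column is touched."""
    ratings = column_store["rating"]
    return sum(ratings) / len(ratings)


print(lookup_by_key(2))
print(average_rating())
```

The point lookup never scans, and the aggregate never reads names or ids, which is roughly the trade-off the Pro and Con entries above describe.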

  • 8/4/2019 Hacigumus Slides

    18/42

    Application Scenario

    Application v1: Personal Profile Management
      - Address, Phone, Notes, Contacts, Calendars, Reminders
      - Profile Data: User 1 Data, User 2 Data

    Application v2: Information Portal
      - Online Shopping, Catalogs, Product Reviews, Subscriptions
      - Portal Data: Products, Reviews, ...
      - External Sources

    Relational Database, Key/Value Store

    Very difficult migration
      - Application developers (skills, time)
      - Architects (redesign)
      - Company (investment)

  • 8/4/2019 Hacigumus Slides

    19/42

    Data Model Decisions

    Problem: Users are forced to make a decision on the data model based on the current needs of the applications
      - Is it possible to make the right decision all the time?

    Problem: The developer (client) has to re-architect their application in order to take advantage of different data models
      - How easy is it to change the architecture and the implementation?

    [Diagram: as the workload evolves (# of queries / sec), the application moves from Ver 1.0 through Ver 4.0 across Single RDBMS, Clustering, Sharding, and Key-value store]

  • 8/4/2019 Hacigumus Slides

    20/42

    Remember Data Independence?

    1968

    1970

  • 8/4/2019 Hacigumus Slides

    21/42

    Data Independence

    Decouple application logic from data processing

    Let them be optimized and managed independently

    Enabled decades of innovation and improvement in databases

  • 8/4/2019 Hacigumus Slides

    22/42

    Data Independence

    The application should not have to be aware of the physical organization of the data (and how it can be accessed)

    All it needs is a logical (declarative) specification

    CloudDB makes decisions based on application context, workload characteristics, etc.

    CloudDB: A layer for data independence

    [Diagram labels: Application; SQL API; Data Load; Query/Update; Relational Store; Key/Value Store; Analytics Store; # of queries / sec]
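The idea of a layer that hides the physical store behind one logical interface can be sketched as follows. This is only an illustration of data independence, not CloudDB's actual design: the class names, the `place` and `execute` methods, and the 10,000 queries/sec threshold are all hypothetical, and the stores are stand-ins that just record what they were asked to run.

```python
class Store:
    """Stand-in for a specialized store; it only records what it was asked to do."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def run(self, sql):
        self.log.append(sql)
        return f"{self.name}: {sql}"


class DataIndependenceLayer:
    """Hypothetical layer: the application sends SQL, the layer picks the store."""

    def __init__(self):
        self.stores = {
            "relational": Store("relational"),
            "key_value": Store("key_value"),
            "analytics": Store("analytics"),
        }
        self.placement = {}  # table name -> store name, decided by the layer

    def place(self, table, queries_per_sec, scan_heavy):
        # Made-up placement rule based on workload characteristics.
        if scan_heavy:
            choice = "analytics"
        elif queries_per_sec > 10_000:
            choice = "key_value"
        else:
            choice = "relational"
        self.placement[table] = choice  # a real system would also migrate data
        return choice

    def execute(self, table, sql):
        # The caller never names a store; that is the data independence.
        return self.stores[self.placement[table]].run(sql)


layer = DataIndependenceLayer()
query = "SELECT * FROM profiles WHERE id = 7"

layer.place("profiles", queries_per_sec=200, scan_heavy=False)
first = layer.execute("profiles", query)

# The workload grows; the layer re-places the table, the application code is unchanged.
layer.place("profiles", queries_per_sec=50_000, scan_heavy=False)
second = layer.execute("profiles", query)

print(first)
print(second)
```

The application issues the same query text both times; only the layer's placement decision changes, which mirrors the "workload evolves" picture on the earlier Data Model Decisions slide.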

  • 8/4/2019 Hacigumus Slides

    23/42

    Language?

    New Breed Databases: CouchDB, Project Voldemort (Dynamo), Cassandra, BigTable, Tokyo Cabinet, MongoDB, SimpleDB, ...

    MapReduce/Hadoop

  • 8/4/2019 Hacigumus Slides

    24/42

    Some Reminders about SQL

    By far the most widely used data access language

    It has nothing to do with
      - How the data is stored
      - How the queries are executed
      - How the transactions are handled

    Very large number of skilled programmers

    Huge amount of existing applications and tools

  • 8/4/2019 Hacigumus Slides

    25/42

    SQL is actually good?

    HIVE: SQL API on top of MapReduce

    Google BigQuery: SQL over data stored in non-relational databases

    ...

  • 8/4/2019 Hacigumus Slides

    26/42

    CloudDB - Guiding Principles

    Embrace heterogeneity
      - One size does not fit all

    Leverage specialized technologies

    Maintain and restore declarative nature of data processing

    Understand and define dimensions of scalability

  • 8/4/2019 Hacigumus Slides

    27/42

    CloudDB Middleware: Opaque vs. Transparent

    System Independence?

    The middleware would be responsible for making all the decisions regarding the choice of data stores, processing the queries, and end-to-end system optimization

    While the middleware can abstract away the underlying storage systems, it should explicitly express certain essential aspects of the system, such as consistency levels and scalability of transactions

    [Diagram labels: Applications; SQL Queries; Results; API/Language Support (SQL); CloudDB Middleware; Distributed Query Processor; Data Stores; Transaction Patterns; Consistency / Scalability; Opaque; Transparent]

  • 8/4/2019 Hacigumus Slides

    28/42

    CloudDB Platform

    [Architecture diagram. Component labels:]

    (External) Applications: SQL Queries, Results

    API/Language Support (JDBC, SQL)

    Distributed Query Processor

    Intelligent Cloud Database Coordinator (ICDC)

    Workload Analysis, Design Optimizer, System Monitor, Database Cluster Controller

    Client SLAs, SLA Aware Dispatcher, Schedulers, Capacity Planner, Multi Tenancy Manager (MTM)

    CloudDB Store: Relational Store, Analytics Store, Key-Value Store

    Store functions: Auto Sharding, Auto Replication, Auto Partitioning, Internal Query Processing, Data Migration

  • 8/4/2019 Hacigumus Slides

    29/42

    CloudDB Platform Key Points

    [Same architecture diagram as on slide 28, annotated with:]

    One Unified, Standard API

    Intelligent Analysis and Decision Making

    Specialized Stores for Specific Needs

  • 8/4/2019 Hacigumus Slides

    30/42

    Our Data Management Platform

    Key Research Areas

    [Same architecture diagram as on slide 28, annotated with:]

    Intelligent Management

    Workload Management

    Data Stores

    Specialized Stores for Specific Needs

    Intelligent Analysis and Decision Making

    One Unified, Standard API

  • 8/4/2019 Hacigumus Slides

    31/42

    CloudDB System Architecture -- Microsharding is a part of CloudDB

    [Same architecture diagram as on slide 28, with an added Microsharding label]

  • 8/4/2019 Hacigumus Slides

    32/42

    SQL over Key-Value Stores

    Microsharding to enable SQL over key-value stores

    [Diagram labels: Applications; SQL; Query execution nodes (Relational middleware); Pool of Servers; Key- access; Storage nodes (Storage cloud); Pool of Servers; Key-Value Store]

    Key challenge: limited access capabilities (only key-based put/get)

  • 8/4/2019 Hacigumus Slides

    33/42

    Microsharding

    Key-Value stores are good at scaling write intensive workloads

    But, they don't leverage a large body of technologies developed in databases over the decades such as:
      - Relationships
      - Transactions
      - Advanced query functions etc.

    These are hand-coded by developers

    Microsharding aims at bringing those capabilities into key-value stores in a principled way

  • 8/4/2019 Hacigumus Slides

    34/42

    Key Technical Questions Addressed

    How can we map relational schemas to key-value store data models?

    How can we map relational tuples to key-value objects?

    Once we have those mappings, how can we define transaction classes that can be supported in a scalable way in key-value stores?

    What are the system implementation issues with such a middleware?

  • 8/4/2019 Hacigumus Slides

    35/42

    Query and Data Transformation

    Physical design: mapping between relational data and K/V data

    Schema (+data):

        TABLE users (
          id: primary key
        )

        TABLE reviews (
          id: primary key
          user_id: foreign key to users
        )

    Query (template):

        SELECT * FROM users, reviews
        WHERE users.id = reviews.user_id
          AND users.id = ?

    Physical Design:

        NEST reviews BY user_id
        ...

    Query plan: GET, UNNEST

    Transformed data (KV data): each users entry stored with its nested reviews

    Microshard: User[Review]
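The NEST / GET / UNNEST pipeline on this slide can be illustrated with a small in-memory sketch. It assumes a plain Python dict as the key-value store and invents the sample rows and function names; it shows the shape of the transformation, not the authors' implementation.

```python
from collections import defaultdict

# Sample relational data (invented for illustration).
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
reviews = [
    {"id": 10, "user_id": 1, "text": "great"},
    {"id": 11, "user_id": 1, "text": "ok"},
    {"id": 12, "user_id": 2, "text": "poor"},
]


def nest_reviews_by_user(users, reviews):
    """Physical design step (NEST reviews BY user_id): one KV object per user."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["user_id"]].append(review)
    return {u["id"]: {"user": u, "reviews": grouped[u["id"]]} for u in users}


def users_join_reviews(kv_store, user_id):
    """Query plan for the slide's template: one GET by key, then UNNEST."""
    microshard = kv_store.get(user_id)  # GET: the only access the store offers
    if microshard is None:
        return []
    user_cols = {f"users.{k}": v for k, v in microshard["user"].items()}
    # UNNEST: emit one joined row per nested review.
    return [
        {**user_cols, **{f"reviews.{k}": v for k, v in review.items()}}
        for review in microshard["reviews"]
    ]


kv_store = nest_reviews_by_user(users, reviews)
joined_rows = users_join_reviews(kv_store, 1)
for row in joined_rows:
    print(row)
```

Because the join partner is already nested under the user's key, the query needs a single key-based GET and no scan, which is what makes it feasible on a store that offers only put/get.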

  • 8/4/2019 Hacigumus Slides

    36/42

    Microsharding

    A microshard is
      - a logical unit of data
      - a principled way to shard a database into small fragments
      - a unit of transactional data access
      - accessed by its key, the key of the root relation

    [Diagram: microshards with Key = 1, Key = 2, Key = 3, ..., Key = N; two transactions on Users key = 1, one on Users key = 2, and one on Users key = 3]
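One way to picture "a unit of transactional data access" is a store where every read-modify-write is serialized per key, while transactions on different keys do not block each other. The sketch below uses in-process locks as a stand-in for whatever concurrency control a real deployment would use; `MicroshardStore` and its methods are hypothetical names, not part of CloudDB.

```python
import threading
from collections import defaultdict


class MicroshardStore:
    """Toy store: each key holds one microshard, guarded by its own lock."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def put(self, key, microshard):
        with self._lock_for(key):
            self._data[key] = microshard

    def get(self, key):
        with self._lock_for(key):
            return self._data[key]

    def transact(self, key, update):
        """Run update(microshard) atomically with respect to the same key."""
        with self._lock_for(key):
            self._data[key] = update(self._data[key])
            return self._data[key]


def add_review(review_id):
    """Build an update function that appends one review to a microshard."""
    def update(shard):
        return {**shard, "reviews": shard["reviews"] + [review_id]}
    return update


store = MicroshardStore()
store.put(1, {"user": {"id": 1}, "reviews": []})
store.put(2, {"user": {"id": 2}, "reviews": []})

# Fifty concurrent transactions, all on the microshard with key 1.
threads = [
    threading.Thread(target=store.transact, args=(1, add_review(i)))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

review_count = len(store.get(1)["reviews"])
print(review_count)
```

Nothing in this sketch coordinates a transaction that spans two keys, so a reader cannot rely on consistency across microshards.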

  • 8/4/2019 Hacigumus Slides

    37/42

    Isolation Levels

    No consistency guarantee on read/write outside of a microshard

    [Diagram: transactions (T) are collected into transaction groups, shown alongside microshards; transaction groups are distributed on query execution nodes, and microshards are distributed on the key-value store]

  • 8/4/2019 Hacigumus Slides

    38/42

    Scale Independence

    Experiment Setup
      - RUBiS benchmark (eBay type auction application)
      - Read/Write workload (transition matrix)
      - Short think time to saturate the system
      - Voldemort (Dynamo) key-value store

    [Plot: throughput (1000 sessions/sec, axis 0 to 1.6) vs. number of emulated concurrent clients (thousands, axis 0 to 20), one series each for 3, 4, 5, and 6 Voldemort nodes]

    Message: Ability to automatically scale to more concurrent sessions (throughput) simply by increasing the number of key-value nodes

  • 8/4/2019 Hacigumus Slides

    39/42

    Directions/Questions

    Support for Specifying Relaxed Consistency
      - Tooling to relax consistency just to the degree that there exists a feasible solution (physical design and query plans) for the specification

    Scalable Data Organization over heterogeneous data stores
      - Physical design over heterogeneous stores such that the service level specifications are met

    Scalability vs. Consistency

  • 8/4/2019 Hacigumus Slides

    40/42

    The Cast

    NEC Labs Researchers
      - Hakan Hacigumus
      - Yun Chi
      - Wang-Pin Hsiung
      - Hojjat Jafarpour
      - Hyun J. Moon
      - Oliver Po
      - Junichi Tatemura
      - Jagan Sankaranarayanan

    Advisors/Collaborators
      - Michael Carey (U. of California, Irvine)
      - Hector Garcia-Molina (Stanford)
      - Jeff Naughton (U. of Wisconsin, Madison)

  • 8/4/2019 Hacigumus Slides

    41/42

    CloudDB would be

    A unified data management platform that provides capabilities to transparently and efficiently support heterogeneous workloads by leveraging specialized storage models with SLA-conscious profit optimization in the cloud.

  • 8/4/2019 Hacigumus Slides

    42/42

    Thank You!

