Page 1: 1 2009-08-25 Cloud-based Data Management: Challenges & Opportunities.

1

Jiaheng Lu (陆嘉恒), 2009-08-25

Cloud-based Data Management: Challenges & Opportunities (云数据管理:挑战和机遇)

Institute of Software, Chinese Academy of Sciences; Renmin University of China


2

National University of Singapore (PhD)
• XML query processing and XML keyword search

University of California, Irvine (Postdoc)
• Approximate string processing
• Data integration and data cleaning

Renmin University of China
• Cloud data management
• XML data management

Research experience and interests


3

Outline

Motivation: cloud data management

Database future and challenges:
• Large-scale data management & transaction processing
• Cloud-based data indexing and query optimization

Recent research work:
• An efficient multi-dimensional index for cloud data management
• CIKM Workshop CloudDB 2009


4

Motivation: Internet Chatter


5

BLOG Wisdom

“If you want vast, on-demand scalability, you need a non-relational database.”

Since scalability requirements:
• Can change very quickly, and
• Can grow very rapidly.

Difficult to manage with a single in-house RDBMS server.

RDBMSs scale well:
• When limited to a single node.
• But scaling across multiple server nodes brings overwhelming complexity.


6

Current State

Most enterprise solutions are based on RDBMS technology.

Significant operational challenges:
• Provisioning for peak demand
• Resource under-utilization
• Capacity planning: too many variables
• Storage management: a massive challenge
• System upgrades: extremely time-consuming


7

Internet Search Data Analytics: A Case Study

Data analytics:
• Parsed web logs ingested into an RDBMS store.
• Hourly and daily summarization for custom reporting.

Operational nightmare:
• Keeping the live reporting system on at all costs and at all times.
• Timely completion of hourly summarization.
• Constant tension between the ad-hoc workload and the reporting workload.
• Data-driven feedback to live products.
• Temporal depth of detailed data.


8

Internet Search Data Analytics: A Case Study

Various solutions explored:
• Data warehousing appliance for fast summarization.
• Parallel RDBMS technology for fast ad-hoc queries.
• Business intelligence products (data cubes) for fast and intuitive reporting and analysis.

None of the solutions was completely satisfactory:
• Plans to migrate low-level data to a file-based system to overcome database scalability bottlenecks.


9

Paradigm Shift in Computing


10

WEB is replacing the Desktop


11

What is Cloud Computing?

Old idea: Software as a Service (SaaS)
• Def: delivering applications over the Internet

Recently: “[Hardware, Infrastructure, Platform] as a Service”
• Poorly defined, so we avoid all “X as a Service”

Utility computing: pay-as-you-go computing
• Illusion of infinite resources
• No up-front cost
• Fine-grained billing (e.g., hourly)


12

Why Now?

Experience with very large datacenters
• Unprecedented economies of scale

Other factors
• Pervasive broadband Internet
• Pay-as-you-go billing model


13

Cloud Computing Spectrum

Instruction Set VM
• Amazon EC2, 3Tera

Framework VM
• Google AppEngine, Force.com


14

Cloud Killer Apps

Mobile and web applications

Extensions of desktop software
• Matlab, Mathematica

Batch processing/MapReduce


15

Economics of Cloud Users

Pay by use instead of provisioning for peak


16

Economics of Cloud Users

Risk of over-provisioning: underutilization


17

Economics of Cloud Users

Heavy penalty for under-provisioning


18

Economics of Cloud Providers

5-7x economies of scale [Hamilton 2008]

Extra benefits
• Amazon: utilize off-peak capacity
• Microsoft: sell .NET tools
• Google: reuse existing infrastructure


19

Engineering Definition

Providing services on virtual machines allocated on top of a large physical machine pool.


20

Business Definition

A method to address scalability and availability concerns for large scale applications.


21

Data Management in the Cloud?


22

Cloud Computing Implications on DBMSs

Where do databases fit in this paradigm?

Generational reality:
• Animoto.com
  • Started with 50 servers on Amazon EC2
  • Growth of 25,000 users/hour
  • Needed to scale to 3,500 servers in 2 days
• Many similar stories:
  • RightScale
  • Joyent
  • …


23

Clouded Data?

Reality Number Ⅰ:
• Unlimited processing assumption
• Interactive page views:
  • By targeting a large number of SQL queries against MySQL
  • Still expect sub-millisecond object retrieval

Reality Number Ⅱ:
• Why can’t the database tier be replicated in the same way as the Web server and app server tiers?

→ These are the major challenges for data management in the cloud.


24

The Vision

R&D challenges at the macro level:
• Where and how does the DBMS fit into this model?

R&D challenges at the micro level:
• Specific technology components that must be developed to enable the migration of enterprise data into the cloud.


25

Data and Networks: Attempt Ⅰ

Distributed databases (1980s):
• Idealized view: unified access to distributed data
• Prohibitively expensive: global synchronization

Remained a laboratory prototype:
• Associated technology widely in use: 2PC


26

Data and Networks: Attempt Ⅱ


27

Data and Networks: Pragmatics


28

Database on S3: SIGMOD’08

Amazon’s Simple Storage Service (S3):
• Updates may not preserve initiation order
• No “force” writes
• Eventual guarantee

Proposed solution:
• Pending update queue
• Checkpoint protocol to ensure consistent ordering
• ACID: only Atomicity + Durability
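The idea of a pending update queue plus a checkpoint step can be pictured with a toy, single-process sketch. Everything in it is an assumption made for illustration: `PendingUpdateQueue`, `checkpoint`, the sequence numbers, and the in-memory dict standing in for the storage service are hypothetical names, not the protocol from the SIGMOD’08 paper and not any Amazon API. It only shows updates being buffered and then applied in one consistent order.

```python
from collections import deque


class PendingUpdateQueue:
    """Toy stand-in for a queue of pending update records (hypothetical)."""

    def __init__(self):
        self._records = deque()

    def append(self, seq, key, value):
        # Writers only append; records may arrive out of issue order.
        self._records.append((seq, key, value))

    def drain(self):
        # Hand over all buffered records and empty the queue.
        records = list(self._records)
        self._records.clear()
        return records


def checkpoint(store, queue):
    """Apply every pending update to `store` in sequence-number order."""
    for _seq, key, value in sorted(queue.drain(), key=lambda r: r[0]):
        store[key] = value
    return store


store = {}                       # hypothetical stand-in for the storage service
queue = PendingUpdateQueue()
queue.append(2, "x", "second")   # issued later, arrives first
queue.append(1, "x", "first")
checkpoint(store, queue)
```

After the checkpoint, the later-issued write wins even though it arrived first, which is the ordering property the checkpoint step is meant to restore.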


29

Unbundling Txns in the Cloud

Research results:
• CIDR’09 proposal to unbundle transaction management for cloud infrastructures
• Attempts to refit the DBMS engine onto cloud storage and computing


30

Analytical Processing


31

Architectural and System Impacts

Current state:
• MapReduce paradigm for data analysis

What is missing:
• Auxiliary structures and indexes for associative access to data (i.e., attribute-based access)
• Caveat: inherent inconsistency and approximation

Future projection:
• Eventual merger of databases (ODSs) and data warehouses as we have learned to use and implement them.


32

Underlying Principles: CIDR’2009

Business data may not always reflect the state of the world or the business:
• Inherent lack of perfect information

Secondary data need not be updated with primary data:
• Inherent latency

Transactions/events may temporarily violate integrity constraints:
• Referential integrity may need to be compromised


33

Data Security & Privacy

Data privacy remains a show-stopper in the context of database outsourcing.

Encryption-based solutions are too expensive and are projected to remain so for the foreseeable future:
• Private Information Retrieval (Sion’2008)

Other approaches:
• Information-theoretic approaches that use data partitioning for security (Emekci’2007)
• Hardware-based solutions for information security


34

Self management and self tuning in cloud-based data management

Self management and self tuning

Query optimization on thousands of nodes


35

Remarks

Data management for cloud computing poses a fundamental challenge to database researchers:
• Scalability
• Reliability
• Data consistency

Radically different approaches and solutions are warranted to overcome this challenge:
• Need to understand the nature of new applications


36

References

P. Helland, “Life Beyond Distributed Transactions: An Apostate’s Opinion,” CIDR’07

M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, “Building a Database on S3,” SIGMOD’08

D. Lomet, A. Fekete, G. Weikum, M. Zwilling, “Unbundling Transaction Services in the Cloud,” CIDR’09

S. Finkelstein, R. Brendle, D. Jacobs, “Principles of Inconsistency,” CIDR’09

VLDB Database School (China) 2009, http://www.sei.ecnu.edu.cn/~vldbschool2009/VLDBSchool2009English.htm


37

CIKM workshop CloudDB09


38

INTRODUCTION

MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE

Extended Nodes partition

• Node partition

• Cost Estimation Strategy

EVALUATION


39

Google File System

Yahoo PNUTS


40

• BigTable

• HBase

How to query on other attributes besides primary key?


41

S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp.75–82, 2009.

H.-C. Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322.

M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed B-tree,” in Proceedings of VLDB’08, Auckland, New Zealand, August 2008, pp. 598–609.


42

INTRODUCTION

MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE

Extended Nodes partition

• Node partition

• Cost Estimation Strategy

EVALUATION


43


44

An R-tree is a tree data structure similar to a B-tree, but used for spatial access methods, i.e., for indexing multi-dimensional data.


45

A kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.
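As an illustration of the space-partitioning idea, here is a minimal kd-tree sketch with a range search. It is a textbook-style construction (median split, cycling through the axes), not code from the talk; `KDNode`, `build_kdtree`, and `range_search` are names chosen for this example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode(NamedTuple):
    point: Point
    axis: int
    left: Optional["KDNode"]
    right: Optional["KDNode"]


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        ordered[mid],
        axis,
        build_kdtree(ordered[:mid], depth + 1),
        build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if node is None:
        return []
    found = []
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    a = node.axis
    if lo[a] <= node.point[a]:      # the left subtree may still contain matches
        found += range_search(node.left, lo, hi)
    if node.point[a] <= hi[a]:      # the right subtree may still contain matches
        found += range_search(node.right, lo, hi)
    return found


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = sorted(range_search(tree, (3, 1), (8, 5)))
```

The search descends only into subtrees whose side of the splitting plane can intersect the query box, which is what makes the structure useful for multi-attribute range queries.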


46

Master, with five slaves. Each slave is summarized at the master by a two-dimensional range:
• range: 0~2000, 500~1200
• range: 800~3500, 300~1300
• range: 6300~7000, 599~1400
• range: 2000~40000, 3400~8900
• range: 6800~9000, 3400~8900
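A sketch of how a master could use such range summaries to forward a query only to the slaves that may hold matching data. The five ranges are copied from the slide; the slave names, the `overlaps` and `route` helpers, and the linear scan (used here in place of whatever index the master actually keeps) are assumptions for illustration.

```python
def overlaps(a, b):
    """True if two boxes, each a list of (low, high) intervals, intersect."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


# Range summaries from the slide; the slave names are hypothetical.
SLAVE_RANGES = {
    "slave-1": [(0, 2000), (500, 1200)],
    "slave-2": [(800, 3500), (300, 1300)],
    "slave-3": [(6300, 7000), (599, 1400)],
    "slave-4": [(2000, 40000), (3400, 8900)],
    "slave-5": [(6800, 9000), (3400, 8900)],
}


def route(query):
    """Return the slaves whose summarized range intersects the query box."""
    return [name for name, box in SLAVE_RANGES.items() if overlaps(box, query)]


targets = route([(1000, 1500), (600, 700)])
```

Because slave ranges may overlap, a single query can be forwarded to more than one slave; the tighter the summaries, the fewer slaves each query touches.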


47

INTRODUCTION

MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE

Extended Nodes partition

• Node partition

• Cost Estimation Strategy

EVALUATION


48

Random cutting: Pick several random values on the attribute and cut at those points. The random method may give great performance, but it may also give poor performance.

Equal cutting: Cut the attribute into several equal intervals. This method is relatively stable, since no extreme case can occur.

Clustering-based cutting: Cluster the values on the attribute and cut between the clusters. This method can be expected to give better performance, but its time cost is also noticeably higher: the time complexity of a clustering algorithm is typically O(n log n) or higher.

Nodes partition for data summary
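The three cutting strategies can be sketched for a single attribute as follows. This is an illustration only: the function names are made up for this example, and `gap_cut` uses a deliberately simple stand-in for clustering (cutting at the widest gaps between sorted values) rather than any particular clustering algorithm from the talk.

```python
import random


def _intervals(lo, hi, cuts):
    """Turn sorted cut points into consecutive (low, high) intervals."""
    bounds = [lo, *cuts, hi]
    return list(zip(bounds[:-1], bounds[1:]))


def random_cut(lo, hi, parts, rng=random):
    """Random cutting: cut at parts - 1 random points on [lo, hi]."""
    cuts = sorted(rng.uniform(lo, hi) for _ in range(parts - 1))
    return _intervals(lo, hi, cuts)


def equal_cut(lo, hi, parts):
    """Equal cutting: cut [lo, hi] into equally wide intervals."""
    width = (hi - lo) / parts
    return _intervals(lo, hi, [lo + i * width for i in range(1, parts)])


def gap_cut(values, parts):
    """Clustering stand-in: cut in the middle of the parts - 1 widest gaps."""
    ordered = sorted(set(values))
    widest = sorted(range(len(ordered) - 1),
                    key=lambda i: ordered[i + 1] - ordered[i],
                    reverse=True)[:parts - 1]
    cuts = sorted((ordered[i] + ordered[i + 1]) / 2 for i in widest)
    return _intervals(ordered[0], ordered[-1], cuts)


equal = equal_cut(0, 100, 4)
clustered = gap_cut([1, 2, 3, 50, 51, 52, 100, 101], 3)
```

Sorting dominates `gap_cut`, so even this simple variant costs O(n log n), in line with the cost remark above, whereas `equal_cut` needs only the attribute bounds.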


49

[Figure: nodes partition under random cutting, equal cutting, and clustering-based cutting]


50


51

Update of node cube:
• Why? The data distribution in the node cube may have changed “greatly”, leaving the cube sparse or very uneven.
• How? Reorganize the nodes partition.
• When? A two-phase approach:
  • After each update, compute the minimal ΔT until the next update.
  • When ΔT expires, check whether an update is needed.

Dynamic maintenance of Indexes


52

Basic idea: benefit > cost

The volume of a node cube is defined as the number of record combinations that can be formed within the cube. It is calculated as the product of the lengths of all the intervals. We denote the volume of a cube by v.

For the cube {[1, 11], [2, 5]}, the volume is (11-1)*(5-2) = 30.

Dynamic maintenance of Indexes
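The volume definition translates directly into a few lines of code. This is a sketch; `cube_volume` and the list-of-intervals representation of a cube are choices made for this example.

```python
from math import prod


def cube_volume(cube):
    """Volume of a node cube given as a list of (low, high) intervals."""
    return prod(high - low for low, high in cube)


v = cube_volume([(1, 11), (2, 5)])  # the cube {[1, 11], [2, 5]} from the slide
```

For the slide's cube this evaluates to (11-1)*(5-2) = 30.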


53

Assumption:
• The number of queries forwarded to each slave node is proportional to the total volume of all the node cubes of the slave node.

Dynamic maintenance of Indexes
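Under this assumption, the expected share of queries per slave follows from the cube volumes alone. The sketch below uses made-up volumes and a hypothetical helper, `expected_query_share`, purely to make the proportionality concrete.

```python
def expected_query_share(volumes_by_slave):
    """Map each slave to its fraction of the total node-cube volume."""
    totals = {slave: sum(vols) for slave, vols in volumes_by_slave.items()}
    grand_total = sum(totals.values())
    return {slave: total / grand_total for slave, total in totals.items()}


# Hypothetical node-cube volumes for two slaves.
share = expected_query_share({"slave-a": [30, 10], "slave-b": [60]})
```

Shrinking a slave's cubes therefore lowers the fraction of queries forwarded to it, which is the benefit that the cost estimation weighs against the cost of an update.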


54

benefit = (Δv/v) * nq * ΔT
• Δv: decrement of volume after update
• nq: number of queries this node must process before update

cost = mt/qt
• mt: time cost of last update
• qt: time needed for processing one query

benefit > cost => ΔT > (mt * v)/(qt * Δv * nq)

Dynamic maintenance of Indexes
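The break-even condition can be checked numerically with a small sketch. The function names and the sample numbers are assumptions for illustration, and nq is treated here as a query rate per unit of time, which is what multiplying it by ΔT in the benefit formula suggests.

```python
def benefit(v, delta_v, nq, delta_t):
    """benefit = (Δv / v) * nq * ΔT, in units of saved queries."""
    return (delta_v / v) * nq * delta_t


def cost(mt, qt):
    """cost = mt / qt: the update time expressed in query-processing units."""
    return mt / qt


def min_delta_t(v, delta_v, nq, mt, qt):
    """Break-even ΔT: an update pays off only for intervals longer than this."""
    if delta_v <= 0 or nq <= 0 or qt <= 0:
        return float("inf")  # no volume reduction or no queries: never pays off
    return (mt * v) / (qt * delta_v * nq)


# Hypothetical numbers: v = 30, Δv = 6, nq = 10, mt = 4, qt = 0.5.
threshold = min_delta_t(v=30, delta_v=6, nq=10, mt=4, qt=0.5)
```

With these numbers the cost is 4/0.5 = 8 and the threshold is (4*30)/(0.5*6*10) = 4, so any ΔT above 4 time units makes the benefit exceed the cost.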


55

After ΔT expires, check whether an update is needed. This check involves the following:
• Record update frequency
• Expected benefit ratio
• Performance requirement

We leave this as future work.

Dynamic maintenance of Indexes


56

6 machines:
• 1 master
• 5 slaves: 100~1000 nodes

Each machine had a 2.33 GHz Intel Core2 Quad CPU, 4 GB of main memory, and a 320 GB disk.

Machines ran Ubuntu 9.04 Server.


57


58

Result Cover Rate: one ten thousandth


59

In this paper we presented a series of approaches to building an efficient multi-dimensional index on a cloud platform.

We used a combination of R-tree and KD-tree to support the index structure.

We developed the node partition technique to reduce query processing cost on the cloud platform.

To maintain the efficiency of the index, we proposed a cost estimation-based approach for index update.


60

Better node partition algorithms

Improve the estimation-based approach

Consider multiple replicas of data

Future work


61

Thank you! Questions and discussion are welcome.


62

Result Cover Rate: one thousandth (1‰ ~ 2‰)


63

Result Cover Rate: one thousandth (4‰ ~ 5‰)

