Michael Goshey University of Minnesota, Fall 2006 CSci 8701: Overview of Database Research

An Analysis of the Publication "An Overview of Data Warehousing and OLAP Technology" by Surajit Chaudhuri and Umeshwar Dayal

Michael Goshey
University of Minnesota, Fall 2006
CSci 8701: Overview of Database Research

Michael Goshey: 9/19/2006 2

Outline

1. Introduction

2. Problem Addressed

3. Major Contributions

4. Key Concepts

5. Validation Methodology

6. Assumptions

7. 2006 Rewrite

Introduction

Selected paper
    S. Chaudhuri and U. Dayal, An Overview of Data Warehousing and OLAP Technology, SIGMOD Record 26(1): 65-74 (1997).
Motivation
    Personal Interest

Problem Addressed

Problem Statement
    Survey: organizing the data warehousing space
    Differing requirements between OLTP and OLAP
Significance
    Growth area
    Reference work establishing consensus on terms, architectures and issues
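
The differing requirements of OLTP and OLAP can be illustrated with a minimal sketch. It assumes a hypothetical `sales` table (the table, its columns and its rows are invented for illustration, not taken from the paper) and uses Python's built-in `sqlite3` module: the OLTP-style statement is a short transaction touching one row by key, while the OLAP-style statement is a read-only aggregate over many rows.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("East", 2005, 100.0), ("East", 2006, 150.0),
     ("West", 2005, 80.0), ("West", 2006, 120.0)],
)

# OLTP-style work: a short transaction that updates a single row by its key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (1,))
oltp_row = conn.execute("SELECT amount FROM sales WHERE id = 1").fetchone()

# OLAP-style work: a read-only query that scans and summarizes many rows.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

In a real deployment the two workloads would typically run on separate systems with different schemas and indexes; the single in-memory database here only keeps the sketch self-contained.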

Major Contributions

Bridging the gulf between industry and academia
OLTP vs. OLAP: clarifying the differences
Concise survey of relevant issues, architectures and tools
Concrete list of data warehouse design and build steps

Key Concepts

Data warehouses and data marts
OLTP, OLAP (ROLAP vs. MOLAP)
Relational and dimensional data models
Bitmap Index
ETL
Metadata
Managed query vs. ad hoc environments
Materialized views
SQL extensions (cube, rollup, rank, percentile, etc.)
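
As a rough illustration of what the cube and rollup extensions compute, the sketch below emulates them in plain Python rather than SQL. The rows, the `aggregate` helper and the dimension names are assumptions made up for this example. A rollup aggregates over each prefix of the dimension list; a cube aggregates over every subset. A dimension that has been aggregated away is reported as `None`, in the way SQL reports it as NULL.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows, for illustration only.
rows = [
    {"region": "East", "year": 2005, "amount": 100},
    {"region": "East", "year": 2006, "amount": 150},
    {"region": "West", "year": 2005, "amount": 80},
]

def rollup_sets(dims):
    """Grouping sets for a rollup: every prefix of dims, longest first."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]

def cube_sets(dims):
    """Grouping sets for a cube: every subset of dims."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]

def aggregate(rows, dims, grouping_sets):
    """Sum 'amount' per grouping set; dimensions outside the set become None."""
    totals = defaultdict(int)
    for row in rows:
        for gset in grouping_sets:
            key = tuple(row[d] if d in gset else None for d in dims)
            totals[key] += row["amount"]
    return dict(totals)

dims = ("region", "year")
rollup = aggregate(rows, dims, rollup_sets(dims))
cube = aggregate(rows, dims, cube_sets(dims))

print(rollup[("East", None)])   # subtotal for East across all years
print(cube[(None, 2005)])       # subtotal for 2005 across all regions (cube only)
```

The cube result contains per-year subtotals that the rollup omits, because `("year",)` is a subset of the dimensions but not a prefix of them.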

Data Warehouse, Data Mart

[Figure: Typical Data Warehouse Architecture. Components shown: Source Systems; Data Staging Area; ETL Services; Metadata Catalog; Dimensional Data Marts including atomic data; Dimensional Data Marts with only aggregated data; Ad Hoc Query and Analysis Tools; Reporting Tools; Other uses]

Relational or Dimensional?

[Figure: relational schema diagram. Markers: PK = primary key, FK = foreign key, U = unique, I = index]

Categories: CategoryID (PK), CategoryName (U1), Description, Picture

Shippers: ShipperID (PK), CompanyName, Phone

Order Details: OrderID (PK, FK1, I2, I1), ProductID (PK, FK2, I4, I3), UnitPrice, Quantity, Discount

Customers: CustomerID (PK), CompanyName (I2), ContactName, ContactTitle, Address, City (I1), Region (I4), PostalCode (I3), Country, Phone, Fax

Suppliers: SupplierID (PK), CompanyName (I1), ContactName, ContactTitle, Address, City, Region, PostalCode (I2), Country, Phone, Fax, HomePage

Orders: OrderID (PK), CustomerID (FK1, I2, I1), EmployeeID (FK2, I3, I4), OrderDate (I5), RequiredDate, ShippedDate (I6), ShipVia (FK3, I7), Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode (I8), ShipCountry

Employees: EmployeeID (PK), LastName (I1), FirstName, Title, TitleOfCourtesy, BirthDate, HireDate, Address, City, Region, PostalCode (I2), Country, HomePhone, Extension, Photo, Notes, ReportsTo

Products: ProductID (PK), ProductName (I3), SupplierID (FK2, I5, I4), CategoryID (FK1, I1, I2), QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel, Discontinued

Relational or Dimensional?

(image from http://www.laynetworks.com)
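
To make the relational-versus-dimensional contrast concrete, here is a minimal sketch of a dimensional (star) layout, assuming a hypothetical `fact_sales` table surrounded by two dimension tables. All table names, columns and rows are invented for illustration and are not taken from the slide image. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table keyed by two dimension tables.
# Category is folded into the product dimension (denormalized) instead of
# living in its own table, as it would in a normalized relational design.
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    company_name TEXT,
    country      TEXT
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    unit_price   REAL
);
INSERT INTO dim_product  VALUES (1, 'Chai', 'Beverages'), (2, 'Tofu', 'Produce');
INSERT INTO dim_customer VALUES (1, 'Acme', 'USA'), (2, 'Globex', 'Germany');
INSERT INTO fact_sales   VALUES (1, 1, 10, 2.0), (1, 2, 5, 2.0), (2, 1, 3, 4.0);
""")

# A typical dimensional query: join the fact table to one dimension and
# aggregate a measure by one of that dimension's attributes.
revenue_by_category = conn.execute("""
    SELECT p.category_name, SUM(f.quantity * f.unit_price) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
""").fetchall()

print(revenue_by_category)
```

The design choice worth noting is that each analytical question needs only one join per dimension, at the cost of redundancy inside the dimension tables.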

Bitmap Indices

customer   age 0-10   age 11-20   age 21-30   age 31-40
Mary       1          0           0           0
John       0          1           0           0
Steve      0          0           1           0
Tom        0          0           0           1
Lisa       0          0           1           0

cardinality: unique values/total rows
B-Tree vs. bitmap: 1% rule, uniqueness
Boolean algebra directly on indices
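
The bitmap table above can be sketched in a few lines of Python, using one integer per age band as the bit vector (bit i set means row i has that value). The helper names `build_bitmaps` and `rows_of` are hypothetical; the data is the five-customer table from the slide. Boolean predicates then become bitwise operations on the bitmaps themselves, without touching the rows.

```python
# Rows from the slide's table, in order.
customers = ["Mary", "John", "Steve", "Tom", "Lisa"]
age_band = ["0-10", "11-20", "21-30", "31-40", "21-30"]

def build_bitmaps(values):
    """Return {value: int}, where bit i of the int is 1 if row i has that value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_of(bitmap):
    """Translate a bitmap back into the customer names it selects."""
    return [customers[i] for i in range(len(customers)) if (bitmap >> i) & 1]

age_index = build_bitmaps(age_band)
all_rows = (1 << len(customers)) - 1  # mask with one bit per row

# Boolean algebra directly on the index:
aged_21_to_40 = age_index["21-30"] | age_index["31-40"]   # OR
not_child = ~age_index["0-10"] & all_rows                 # NOT, masked to table size
both_bands = age_index["21-30"] & age_index["31-40"]      # AND (disjoint bands)

# Cardinality as defined on the slide: unique values / total rows.
cardinality = len(age_index) / len(age_band)

print(rows_of(aged_21_to_40))
print(rows_of(not_child))
print(cardinality)
```

With only five rows the cardinality ratio is far above the 1% rule of thumb, so this toy table shows the mechanics only; it is not a case where a bitmap index would actually be preferred over a B-tree.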

Validation Methodology

Survey paper goals
Academic and industry citations
Referencing tools, vendors
Case studies

Assumptions

Read-only environments
Shortcomings
    (occasional) transactional commitments
    the data revision problem

2006 Rewrite

Changes in terminology, tools, vendors
    Fact constellations -> conformed dimensions
    Decision support -> BI
    Vendors and tools in BI, ETL, OLAP
Multiple user constituencies
Data history difficulties
    petabyte databases -> very large warehouses common
    data expiry challenges
    slowly changing dimensions

Slowly Changing Dimensions

Before

CustomerID   Name           Status
001          Mary Johnson   Gold

After: Type 1

CustomerID   Name           Status
001          Mary Johnson   Platinum

After: Type 2

CustomerID   Name           Status
001          Mary Johnson   Gold
001          Mary Johnson   Platinum

After: Type 3

CustomerID   Name           Original Status   Current Status   Effective Date
001          Mary Johnson   Gold              Platinum         10/1/2006
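
The three ways of handling a slowly changing dimension can be sketched as small Python functions applied to the slide's customer row. The function names and the list-of-dicts representation are assumptions for illustration. The Type 2 sketch mirrors the slide by simply adding a second row; warehouse implementations usually also add a surrogate key or validity dates to tell the versions apart.

```python
from copy import deepcopy

# The "Before" row from the slide.
before = [{"CustomerID": "001", "Name": "Mary Johnson", "Status": "Gold"}]

def scd_type1(rows, customer_id, new_status):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    out = deepcopy(rows)
    for row in out:
        if row["CustomerID"] == customer_id:
            row["Status"] = new_status
    return out

def scd_type2(rows, customer_id, new_status):
    """Type 2: keep the old row and append a new row with the new value."""
    out = deepcopy(rows)
    latest = [r for r in out if r["CustomerID"] == customer_id][-1]
    out.append({**latest, "Status": new_status})
    return out

def scd_type3(rows, customer_id, new_status, effective_date):
    """Type 3: keep one row, with original and current values in separate columns."""
    out = []
    for row in rows:
        changed = row["CustomerID"] == customer_id
        out.append({
            "CustomerID": row["CustomerID"],
            "Name": row["Name"],
            "Original Status": row["Status"],
            "Current Status": new_status if changed else row["Status"],
            "Effective Date": effective_date if changed else None,
        })
    return out

type1 = scd_type1(before, "001", "Platinum")
type2 = scd_type2(before, "001", "Platinum")
type3 = scd_type3(before, "001", "Platinum", "10/1/2006")

print(type1)
print(type2)
print(type3)
```

Type 1 is the cheapest but discards history, Type 2 preserves full history at the cost of extra rows, and Type 3 keeps only one prior value per tracked attribute.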

Questions?