DW - course information · people.dsv.su.se/~petia/is5/Lectures/F1.pdf · 2004. 5. 2. · Data...

Date post: 15-Oct-2020
DW - course information

• Teachers:
  – Petia Wohed
  – Erik Perjons
  – Gudrun Jeppesen

• Literature:
  – The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Ralph Kimball & Margy Ross [K&R]
  – Compendium with extra reading material [Comp]

• Reference Literature:
  – Fundamentals of Database Systems, Elmasri & Navathe [EN]
  – Database Systems, Connolly & Begg [CB]

DW - course pedagogy

• F1 DW Introduction (A3 + Extra assignment handed out)
• F2 Multidimensional Modelling 1 (A1 handed out)
• F3 Multidimensional Modelling 2 (A2 handed out)
• F4 DW Lifecycle
• S1 Multidimensional Modelling-Theory (A1 reported)
• F5 DW Physical design (A4 handed out)
• S2 Multidimensional Modelling-Practice (A2 reported)
• F6 Data Mining
• S3 Presentation of Articles (A3 reported)
• A4 reported (individual time for each group has to be booked)
• Optional: Extra assignment handed in individually.
• Written Examination


DW - reading directions

• F1 DW Introduction
  – [Comp] article 1, [K&R] chapter 1
• F2 Multidimensional Modelling 1
  – [K&R] chapters 2, 3, 4
• F3 Multidimensional Modelling 2
  – [K&R] chapters 5, 6, 7, 8
• F4 DW Lifecycle
  – [K&R] chapter 16
• F5 DW Physical design
  – [Comp] article 2
• F6 Data Mining
  – [Comp] article 3

• A1 Multidimensional Modelling - Theory
• A2 Multidimensional Modelling - Practice
  – [K&R] chapter 9
• A4 Tool Practice
• A3 Presentation of Article
• Extra assignment (optional)
  – periodicals, e.g., ACM, IEEE
  – conf. proc., e.g., VLDB, CAiSE, ER

“We are drowning in information, but starving for knowledge”
- John Naisbitt


Lecture 1 - Introduction to DW

Reading Requirements
  – [Comp] R. Ramakrishnan and J. Gehrke, Chapter 23, Decision Support
  – [K&R] Kimball, Chapter 1
  – [EN] chapter 26
  – [CB] chapter 25

Keywords
DW, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Bitmap Index, Join Index, Data Mart

The Data Warehouse - definition

B. Inmon:

“A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions.”

S. Chaudhuri & U. Dayal:

“Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.”

A data warehouse is a business-oriented, integrated, non-volatile and time-dependent collection of data intended to support decision making at the strategic level.


Data Warehouse — Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

[Figure: operational systems (payroll system, production system, sales system) feed a DW organised by subject (product data, sales data, customer data).]

Data Warehouse — Integrated

• Constructed by integrating multiple, heterogeneous data sources
  – relational or other databases, flat files, external data
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  – When data is moved to the warehouse, it is converted.

[Figure: operational systems (order system, billing system, marketing system) are integrated into customer data in the DW.]
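The conversion step can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the field names, the gender codes and the height units are invented, and only the idea (one agreed naming, encoding and unit in the warehouse) comes from the slide.

```python
# Hypothetical source records: each operational system names and encodes
# the same customer attributes differently (all values are invented).
order_rows = [{"cust": "C1", "sex": "m", "height_cm": 180.0}]
billing_rows = [{"customer_id": "C2", "gender": 1, "height_in": 65.0}]
marketing_rows = [{"cid": "C3", "sex": "female", "height_cm": 172.0}]

# One agreed encoding for the warehouse (an assumed convention).
GENDER_CODES = {"m": "M", "male": "M", 1: "M", "f": "F", "female": "F", 0: "F"}


def to_warehouse(row, id_field, gender_field, height_field, inches=False):
    """Convert one source row to the warehouse naming, encoding and unit."""
    height = row[height_field] * 2.54 if inches else row[height_field]
    return {
        "customer_key": row[id_field],
        "gender": GENDER_CODES[row[gender_field]],
        "height_cm": round(height, 1),
    }


# Integrated customer data, as it would be loaded into the warehouse.
customer_data = (
    [to_warehouse(r, "cust", "sex", "height_cm") for r in order_rows]
    + [to_warehouse(r, "customer_id", "gender", "height_in", inches=True)
       for r in billing_rows]
    + [to_warehouse(r, "cid", "sex", "height_cm") for r in marketing_rows]
)
```

After conversion every row has the same three attributes, whichever system it came from.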


Data Warehouse — Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time
  – But the key of operational data may or may not contain a “time element”.

[Figure: the operational order system holds 60-90 days of data; customer data in the DW covers 5-10 years.]

Data Warehouse — Non-Volatile

• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only loading and access of data.

[Figure: the operational order system supports create, insert, update and delete; customer data in the DW supports only load and access.]


Decision Support and OLAP (Navathe)

• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
  – Will a 10% discount increase sales volume sufficiently?
  – Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?
  – How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
• On-Line Analytical Processing (OLAP) is an element of decision support systems (DSS).

Data Warehouse (Navathe)

• A decision support database that is maintained separately from the organisation's operational databases.
• A data warehouse is a
  – subject-oriented,
  – integrated,
  – time-varying,
  – non-volatile
  collection of data that is used primarily in organisational decision making.


Why separate data warehouse?

• Performance
  – The operational DBs are tuned to support known OLTP workloads
  – Supporting OLAP requires special data organisations, access methods and implementation methods
• Function
  – Decision support requires data that may be missing from the operational DBs
  – Decision support usually requires consolidating data from many heterogeneous sources

OLTP vs. OLAP

OLTP:
• holds current data
• stores detailed data
• data is dynamic
• repetitive processing
• high level of transaction throughput
• predictable pattern of usage
• transaction driven
• application oriented
• supports day-to-day decisions
• serves large number of operational users

OLAP:
• holds historic and integrated data
• stores detailed and summarised data
• data is largely static
• ad-hoc, unstructured and heuristic processing
• medium or low level of transaction throughput
• unpredictable pattern of usage
• analysis driven
• subject oriented
• supports strategic decisions
• serves relatively low number of managerial users


DW Architecture

[Figure: DW architecture. Data sources (operational DBs, external sources) are extracted, transformed, loaded and refreshed into the data warehouse, which is accompanied by a metadata repository, monitoring & administration, and data marts. The warehouse and data marts serve OLAP servers, which feed the tools: analysis, query/reporting and data mining.]

Data Warehouse vs. Data Mart

• Enterprise warehouse: collects all information about subjects (customer, products, sales, assets, personnel) that span the entire organisation
  – Requires extensive business modelling
  – May take years to design and build
• Data Mart: departmental subsets that focus on selected subjects, e.g. a marketing data mart: customer, product, sales
  – Faster roll-out
  – Complex integration in the long term


To Meet the Requirements within DW

• The data is organised differently, i.e. “multidimensional”
  – star-join schemas
  – snowflake schemas
• The data is viewed differently
• The data is stored differently
  – vector (array) storage
• The data is indexed differently
  – bitmap indexes
  – join indexes

From Spreadsheets to Data Cubes

[Figure: spreadsheets versus a data cube whose axes are month, country and product; sample cell values 130, 200, 2 300 and 5 024.]


“Multidimensional” view of the data

[Figure: data cubes with the dimensions month, country and product; further dimensions such as customer group and promotion campaign can be added.]

Example - Star-Join Schema

Sales Fact: LocationKey, ProductKey, TimeKey, QuantitySold, …
Location dimension: LocationKey, City, Country, …
Time dimension: TimeKey, Month, Year, …
Product dimension: ProductKey, Name, Category, …


Example

Location
rid    LocationKey  City       …
rid1   1            Stockholm  …
rid2   2            London     …
rid3   3            Paris      …

Sales
rid    LKey  PKey  TKey  Qnt
rid4   1     1     1     5
rid5   1     2     1     7
rid6   1     3     1     4
rid7   2     1     1     8
rid8   2     2     1     3
rid9   2     3     1     5
rid10  3     1     1     20
rid11  3     2     1     10
rid12  3     3     1     30
rid13  1     1     2     10
rid14  1     2     2     9
rid15  1     3     2     7
rid16  2     1     2     5
rid17  2     2     2     10
rid18  2     3     2     8
rid19  3     1     2     20
rid20  3     2     2     50
rid21  3     3     2     30

Product
rid    ProductKey  Name   …
rid22  1           # 5    …
rid23  2           Noah   …
rid24  3           Opium  …

Time
rid    TimeKey  Month  …
rid25  1        Jan    …
rid26  2        Feb    …
rid27  3        Mar    …
rid28  4        Apr    …

Star-Join Schema

• A single fact table and a single table for each dimension
• Every fact points to one tuple in each of the dimensions and has additional attributes
• The fact table is highly normalised, whereas the dimension tables are not normalised.
• Dimensions do not capture hierarchies directly
• Generated keys are used for performance and maintenance reasons
• Fact constellation: multiple fact tables that share many dimension tables
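The star-join schema and a typical star join can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch, not the course's tool setup: the table and column names follow the example slides, the elided attributes (…) are left out, and the query itself is an assumed example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (LocationKey INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE Product  (ProductKey  INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE "Time"   (TimeKey     INTEGER PRIMARY KEY, Month TEXT);
-- Fact table: one foreign key per dimension plus the measure.
CREATE TABLE Sales (
    LocationKey  INTEGER REFERENCES Location(LocationKey),
    ProductKey   INTEGER REFERENCES Product(ProductKey),
    TimeKey      INTEGER REFERENCES "Time"(TimeKey),
    QuantitySold INTEGER
);
""")
conn.executemany("INSERT INTO Location VALUES (?, ?)",
                 [(1, "Stockholm"), (2, "London"), (3, "Paris")])
conn.executemany("INSERT INTO Product VALUES (?, ?)",
                 [(1, "# 5"), (2, "Noah"), (3, "Opium")])
conn.executemany('INSERT INTO "Time" VALUES (?, ?)',
                 [(1, "Jan"), (2, "Feb"), (3, "Mar"), (4, "Apr")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
])

# Star join: the fact table is joined to the dimensions it needs,
# then the measure is aggregated per city and month.
quantity_by_city_month = conn.execute("""
SELECT l.City, t.Month, SUM(s.QuantitySold)
FROM Sales s
JOIN Location l ON s.LocationKey = l.LocationKey
JOIN "Time" t   ON s.TimeKey = t.TimeKey
GROUP BY l.LocationKey, l.City, t.TimeKey, t.Month
ORDER BY l.LocationKey, t.TimeKey
""").fetchall()
```

Each fact row points to exactly one row in every dimension, so the joins never multiply fact rows.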


Snowflake Schema

• Represents dimensional hierarchy directly by normalising the dimension tables
• Saves storage
• Reduces the effectiveness of browsing

Example - Snowflake Schema

[Figure: snowflake schema. Fact table Telephone calls (sum ($), number of calls) with the dimensions Service used (service name), Time (date), Customer (customer name, address) and Sales (seller name); the normalised hierarchy tables are Service group, Month, Quarter, Year, Region, Income group and Office.]
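What normalising a dimension means can be shown on a small scale. The sketch below snowflakes a Location dimension into Location and Country tables; the country values and the city Lyon are assumptions added so that one country repeats, and the helper `country_of` is hypothetical.

```python
# Star form: the City -> Country hierarchy is implicit and country names
# repeat on every row. Country values and "Lyon" are invented examples.
location_star = [
    {"LocationKey": 1, "City": "Stockholm", "Country": "Sweden"},
    {"LocationKey": 2, "City": "London", "Country": "UK"},
    {"LocationKey": 3, "City": "Paris", "Country": "France"},
    {"LocationKey": 4, "City": "Lyon", "Country": "France"},
]

# Snowflaking: move Country into its own table and keep only a foreign key.
country_keys = {}
for row in location_star:
    country_keys.setdefault(row["Country"], len(country_keys) + 1)

country = [{"CountryKey": key, "Country": name}
           for name, key in country_keys.items()]
location_snowflake = [
    {"LocationKey": r["LocationKey"], "City": r["City"],
     "CountryKey": country_keys[r["Country"]]}
    for r in location_star
]


def country_of(location_key):
    """Browsing now needs an extra lookup (a join) to reach the country."""
    loc = next(r for r in location_snowflake
               if r["LocationKey"] == location_key)
    return next(c["Country"] for c in country
                if c["CountryKey"] == loc["CountryKey"])
```

The country name is stored once per country instead of once per city, which saves storage, while the extra lookup in `country_of` is the cost that makes browsing less effective.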


Typical OLAP Operations

• Roll up (drill-up): summarize data
  – by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  – project and select
• Pivot (rotate):
  – reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  – drill across: involving (across) more than one fact table
  – drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
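The core operations can be tried on the example Sales data. The sketch below is a plain-Python illustration, not how an OLAP server implements them; the function names `roll_up`, `dice` and `pivot` are chosen here for readability.

```python
from collections import defaultdict

# (city, product, month, quantity) rows decoded from the example Sales table.
facts = [
    ("Stockholm", "# 5", "Jan", 5), ("Stockholm", "Noah", "Jan", 7),
    ("Stockholm", "Opium", "Jan", 4), ("London", "# 5", "Jan", 8),
    ("London", "Noah", "Jan", 3), ("London", "Opium", "Jan", 5),
    ("Paris", "# 5", "Jan", 20), ("Paris", "Noah", "Jan", 10),
    ("Paris", "Opium", "Jan", 30), ("Stockholm", "# 5", "Feb", 10),
    ("Stockholm", "Noah", "Feb", 9), ("Stockholm", "Opium", "Feb", 7),
    ("London", "# 5", "Feb", 5), ("London", "Noah", "Feb", 10),
    ("London", "Opium", "Feb", 8), ("Paris", "# 5", "Feb", 20),
    ("Paris", "Noah", "Feb", 50), ("Paris", "Opium", "Feb", 30),
]
CITY, PRODUCT, MONTH, QTY = 0, 1, 2, 3


def roll_up(rows, keep):
    """Roll up by dimension reduction: sum quantity over the kept dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in keep)] += row[QTY]
    return dict(totals)


def dice(rows, city=None, product=None, month=None):
    """Select a sub-cube; each argument is a set of allowed values or None."""
    wanted = {CITY: city, PRODUCT: product, MONTH: month}
    return [r for r in rows
            if all(vals is None or r[dim] in vals
                   for dim, vals in wanted.items())]


def pivot(totals, row_values, col_values):
    """Lay out a two-dimensional roll-up as a grid (rows x columns)."""
    return [[totals.get((r, c), 0) for c in col_values] for r in row_values]


by_city = roll_up(facts, keep=(CITY,))               # rolled up
by_city_month = roll_up(facts, keep=(CITY, MONTH))   # drilled down by month
jan_slice = dice(facts, month={"Jan"})               # slice: fix one dimension
sub_cube = dice(facts, city={"Stockholm", "London"}, product={"# 5", "Noah"})
grid = pivot(by_city_month, ["Stockholm", "London", "Paris"], ["Jan", "Feb"])
```

Going from `by_city` to `by_city_month` is a drill down; the reverse direction is a roll up.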

Approaches to OLAP Servers

• Relational OLAP (ROLAP)
  – Relational and Extended Relational DBMS to store and manage warehouse data
• Multidimensional OLAP (MOLAP)
  – Array-based storage structure (n-dimensional array)
  – Direct access to array data structure
  – Good indexing properties
  – Poor storage utilisation when the data is sparse.
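The MOLAP points can be illustrated with a flat array that stands in for a 3-dimensional one. This is a simplified sketch under the assumption of row-major layout and 1-based generated keys; real MOLAP servers use more elaborate (often compressed) structures.

```python
# Dimension sizes in the example: 3 locations x 3 products x 4 months.
N_LOC, N_PROD, N_TIME = 3, 3, 4

# (LKey, PKey, TKey, Qnt) rows of the example Sales table.
sales = [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
]

# Dense storage: one cell per (location, product, month) combination,
# whether or not a sale exists for it.
cube = [0] * (N_LOC * N_PROD * N_TIME)


def offset(lkey, pkey, tkey):
    """Map 1-based dimension keys to a position in the flat array."""
    return ((lkey - 1) * N_PROD + (pkey - 1)) * N_TIME + (tkey - 1)


for lkey, pkey, tkey, qty in sales:
    cube[offset(lkey, pkey, tkey)] = qty

# Direct access: the cell position is computed, not searched for.
paris_noah_feb = cube[offset(3, 2, 2)]

# Sparsity: cells for Mar and Apr exist but hold no data.
filled = sum(1 for cell in cube if cell != 0)
utilisation = filled / len(cube)
```

The relational (ROLAP) table stores only the 18 rows that exist, whereas the array reserves all 36 cells.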


Bitmap Indexing

• An effective indexing technique for attributes with low-cardinality domains
• There is a distinct bit vector BV for each value V of the domain
• Example: the attribute sex has the values M and F. A table of 100 million people needs 2 lists of 100 million bits.

Bitmap Index

Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    W       H

Region Index
RowId  N  S  E  W
1      1  0  0  0
2      0  1  0  0
3      0  0  0  1
4      0  0  0  1
5      0  1  0  0
6      0  0  0  1
7      0  0  0  1

Rating Index
RowId  H  M  L
1      1  0  0
2      0  1  0
3      0  0  1
4      1  0  0
5      0  0  1
6      0  0  1
7      1  0  0

SELECT Cust FROM BaseTable WHERE Region = 'W' AND Rating = 'L'
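The query can be answered from the two bitmap indexes alone. The sketch below stores each bit vector as a Python integer (bit i stands for row i, counting from 0), which is one possible representation chosen for illustration; unlike the slide, it creates vectors only for values that actually occur, so there is no vector for region E.

```python
# (Cust, Region, Rating) rows of the base table.
base_table = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "W", "H"),
]


def build_bitmap_index(rows, column):
    """Return {value: int}; bit i of the int is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


region_index = build_bitmap_index(base_table, 1)
rating_index = build_bitmap_index(base_table, 2)

# Region = W AND Rating = L becomes a single bitwise AND of two vectors.
hits = region_index["W"] & rating_index["L"]
customers = [row[0] for i, row in enumerate(base_table) if (hits >> i) & 1]
```

Only the rows whose bit survives the AND have to be fetched from the base table.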


Bitmap Index - evaluating the query

Region = W AND Rating = L: for rows 1 to 7, the bit vector for Region = W is 0011011 and the bit vector for Rating = L is 0010110. Their bitwise AND is 0010010, so rows 3 and 6 (customers C3 and C6) satisfy the query.


Join Index

• Join index, roughly: $JI(C_f, \text{R-id})$, where $D(C_d, \text{R-id}, \ldots) \bowtie_{C_d = C_f} F(C_f, \text{R-id}, \ldots)$
• Traditional indexes map the values to a list of record ids.
• In a data warehouse, a join index relates the values of the dimensions of a star schema to rows in the fact table
• Join indices can span multiple dimensions


Join Index - Ex1

A join index on city, built over the Location and Sales tables of the star-join example:

CityJI
CityK  Rid
1      rid4
1      rid5
1      rid6
1      rid13
1      rid14
1      rid15
2      rid7
2      rid8
2      rid9
2      rid16
2      rid17
2      rid18
…


Join Index - Ex2

A join index spanning two dimensions (city and product), built over the Location, Product and Sales tables of the star-join example:

City-Product JI
CityK  PrdK  Rid
1      1     rid4
1      1     rid13
1      2     rid5
1      2     rid14
1      3     rid6
1      3     rid15
…
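Both join indexes can be derived from the fact table in one pass. The sketch below is a simplified in-memory illustration: the helper `build_join_index` is a hypothetical name, and a real DBMS would store the index on disk rather than in a dictionary.

```python
from collections import defaultdict

# Sales rows in rid order (rid4 .. rid21): (LKey, PKey, TKey, Qnt).
rows = [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
]
sales = {f"rid{4 + i}": row for i, row in enumerate(rows)}


def build_join_index(fact_rows, positions):
    """Map each combination of dimension keys to the matching fact row ids."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[tuple(row[p] for p in positions)].append(rid)
    return dict(index)


city_ji = build_join_index(sales, positions=(0,))            # as in Ex1
city_product_ji = build_join_index(sales, positions=(0, 1))  # as in Ex2

# Using the index: total quantity for city key 1 without scanning
# the whole fact table.
city_1_quantity = sum(sales[rid][3] for rid in city_ji[(1,)])
```

The entries are grouped by key here, while the slides list them sorted by key; the pairs of key and rid are the same.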

Summary

• Data warehouse
  – A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• OLAP servers: ROLAP, MOLAP

