DW - course information
• Teachers:
  • Petia Wohed
  • Erik Perjons
  • Gudrun Jeppesen
• Literature:
  • The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Ralph Kimball & Margy Ross [K&R]
  • Compendium with extra reading material [Comp]
• Reference Literature:
  • Fundamentals of Database Systems, Elmasri & Navathe [EN]
  • Database Systems, Connolly & Begg [CB]
DW - course pedagogy
• F1 DW Introduction (A3 + Extra assignment handed out)
• F2 Multidimensional Modelling 1 (A1 handed out)
• F3 Multidimensional Modelling 2 (A2 handed out)
• F4 DW Lifecycle
• S1 Multidimensional Modelling-Theory (A1 reported)
• F5 DW Physical design (A4 handed out)
• S2 Multidimensional Modelling-Practice (A2 reported)
• F6 Data Mining
• S3 Presentation of Articles (A3 reported)
• A4 reported (individual time for each group has to be booked)
• Optional: Extra assignment handed in individually.
• Written Examination
DW - reading directions
• F1 DW Introduction
  – [Comp] article 1, [K&R] chapter 1
• F2 Multidimensional Modelling 1
  – [K&R] chapters 2, 3, 4
• F3 Multidimensional Modelling 2
  – [K&R] chapters 5, 6, 7, 8
• F4 DW Lifecycle
  – [K&R] chapter 16
• F5 DW Physical design
  – [Comp] article 2
• F6 Data Mining
  – [Comp] article 3
A1 Multidimensional Modelling - Theory
A2 Multidimensional Modelling - Practice
- [K&R] chapter 9
A4 Tool Practice
A3 Presentation of Article, Extra assignment (optional)
- periodicals, e.g., ACM, IEEE
- conference proceedings, e.g., VLDB, CAiSE, ER
"We are drowning in information, but starving for knowledge"
- John Naisbitt
Lecture 1 - Introduction to DW
Reading Requirements
[Comp] R. Ramakrishnan and J. Gehrke, Chapter 23, Decision Support
[K&R] Kimball, Chapter 1
[EN] chapter 26
[CB] chapter 25
Keywords
DW, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Bitmap Index, Join Index, Data Mart
The Data Warehouse - definition
B. Inmon:
"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions."
S. Chaudhuri & U. Dayal:
"Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions."
Swedish formulation (translated): A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data intended to support decision making at the strategic level.
Data Warehouse — Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
[Figure: Operational Systems: Payroll System, Production System, Sales System. DW: Product Data, Sales Data, Customer Data.]
Data Warehouse — Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – relational or other databases, flat files, external data
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  – When data is moved to the warehouse, it is converted.
[Figure: Operational Systems: Order System, Billing System, Marketing System. DW: Customer Data.]
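The conversion step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source record layouts, the field names, and the encodings are hypothetical and do not come from the course material. It only shows the idea of mapping different naming conventions, encodings and units onto one warehouse format.

```python
# Hypothetical records from two operational systems (assumed layouts).
order_system = [{"cust_id": 17, "sex": "m", "height_cm": 177.8}]
billing_system = [{"customer": "17", "gender": 1, "height_in": 70}]

# One agreed warehouse encoding for the attribute, whatever the source used.
SEX_CODES = {"m": "M", "f": "F", "male": "M", "female": "F", 1: "M", 0: "F"}


def from_order_system(rec):
    """Convert an order-system record to the warehouse format."""
    return {
        "customer_key": int(rec["cust_id"]),
        "sex": SEX_CODES[rec["sex"]],
        "height_cm": float(rec["height_cm"]),
    }


def from_billing_system(rec):
    """Convert a billing-system record: other names, codes and units."""
    return {
        "customer_key": int(rec["customer"]),
        "sex": SEX_CODES[rec["gender"]],
        "height_cm": rec["height_in"] * 2.54,  # inches to centimetres
    }


warehouse_rows = [from_order_system(r) for r in order_system]
warehouse_rows += [from_billing_system(r) for r in billing_system]
```

After conversion both rows use the same key name, the same code for sex, and the same unit, so they can be compared or merged in the warehouse.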
Data Warehouse — Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time
  – But the key of operational data may or may not contain a "time element".
[Figure: Operational Systems (Order System): 60-90 days. DW (Customer Data): 5-10 years.]
Data Warehouse — Non-Volatile
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only: loading and access of data.
[Figure: Operational Systems (Order System): Create, Insert, Update, Delete. DW (Customer Data): Load, Access.]
Decision Support and OLAP (Navathe)
• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
  – Will a 10% discount increase sales volume sufficiently?
  – Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?
  – How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
• On-Line Analytical Processing (OLAP) is an element of decision support systems (DSS).
Data Warehouse (Navathe)
• A decision support database that is maintained separately from the organisation's operational databases.
• A data warehouse is a
  – subject oriented,
  – integrated,
  – time-varying,
  – non-volatile
  collection of data that is used primarily in organisational decision making.
Why separate data warehouse?
• Performance
  – The operational DBs are tuned to support known OLTP workloads
  – Supporting OLAP requires special data organisations, access methods and implementation methods
• Function
  – Decision support requires data that may be missing from the operational DBs
  – Decision support usually requires consolidating data from many heterogeneous sources
OLTP vs. OLAP
OLTP:
• holds current data
• stores detailed data
• data is dynamic
• repetitive processing
• high level of transaction throughput
• predictable pattern of usage
• transaction driven
• application oriented
• supports day-to-day decisions
• serves large number of operational users
OLAP:
• holds historic and integrated data
• stores detailed and summarised data
• data is largely static
• ad-hoc, unstructured and heuristic processing
• medium or low level of transaction throughput
• unpredictable pattern of usage
• analysis driven
• subject oriented
• supports strategic decisions
• serves relatively low number of managerial users
DW Architecture
[Figure: DW architecture. Data sources (Operational DBs, External sources) pass through Extract, Transform, Load, Refresh into the Data warehouse, which is accompanied by a Metadata repository and Monitoring & Administration. The Data warehouse and Data marts serve OLAP servers, which are used by Tools for Analysis, Query/Reporting and Data mining.]
Data Warehouse vs. Data Mart
• Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organisation
  – Requires extensive business modelling
  – May take years to design and build
• Data Mart: Departmental subsets that focus on selected subjects, e.g. a Marketing data mart: customer, product, sales
  – Faster roll-out
  – Complex integration in the long term
To Meet the Requirements within DW
• The data is organised differently, i.e. "multidimensional"
  – star-join schemas
  – snowflake schemas
• The data is viewed differently
• The data is stored differently
  – vector (array) storage
• The data is indexed differently
  – bitmap indexes
  – join indexes
From Spreadsheets to Data Cubes
[Figure: spreadsheets with the axes month, country and product, next to a data cube with the dimensions product, country and month. Sample cell values: 2 300, 200, 130, 5 024.]
"Multidimensional" view of the data
[Figure: data cubes with the dimensions month, country and product, extended with the further dimensions customer group and promotion campaign.]
Example - Star-Join Schema
Sales Fact: LocationKey, ProductKey, TimeKey, QuantitySold, …
Location dimension: LocationKey, City, Country, …
Time dimension: TimeKey, Month, Year, …
Product dimension: ProductKey, Name, Category, …
11
Example
Location
LocationKey  City       …
1            Stockholm  …
2            London     …
3            Paris      …

Product
Rid    ProductKey  Name   …
rid22  1           # 5    …
rid23  2           Noah   …
rid24  3           Opium  …

Time
Rid    TimeKey  Month  …
rid25  1        Jan    …
rid26  2        Feb    …
rid27  3        Mar    …
rid28  4        Apr    …

Sales
Rid    LKey  PKey  TKey  Qnt
rid4   1     1     1     5
rid5   1     2     1     7
rid6   1     3     1     4
rid7   2     1     1     8
rid8   2     2     1     3
rid9   2     3     1     5
rid10  3     1     1     20
rid11  3     2     1     10
rid12  3     3     1     30
rid13  1     1     2     10
rid14  1     2     2     9
rid15  1     3     2     7
rid16  2     1     2     5
rid17  2     2     2     10
rid18  2     3     2     8
rid19  3     1     2     20
rid20  3     2     2     50
rid21  3     3     2     30
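As a rough illustration of how such a schema is queried, the sketch below loads the example rows into an in-memory SQLite database and runs a star join from the fact table to two dimensions. The table and column names (location, product, time_dim, sales, quantity) are assumptions chosen for the sketch; the slides only give the attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table plus one table per dimension (names are assumed).
conn.executescript("""
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE time_dim (time_key     INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE sales (
    location_key INTEGER REFERENCES location(location_key),
    product_key  INTEGER REFERENCES product(product_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    quantity     INTEGER
);
""")

conn.executemany("INSERT INTO location VALUES (?, ?)",
                 [(1, "Stockholm"), (2, "London"), (3, "Paris")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "# 5"), (2, "Noah"), (3, "Opium")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?)",
                 [(1, "Jan"), (2, "Feb"), (3, "Mar"), (4, "Apr")])

# Fact rows (LKey, PKey, TKey, Qnt) in rid4..rid21 order from the example.
sales_rows = [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", sales_rows)

# Star join: the fact table is joined to each dimension it needs.
star_join = """
SELECT l.city, t.month, SUM(s.quantity)
FROM sales s
JOIN location l ON l.location_key = s.location_key
JOIN time_dim t ON t.time_key = s.time_key
GROUP BY l.location_key, l.city, t.time_key, t.month
ORDER BY l.location_key, t.time_key
"""
quantity_by_city_month = conn.execute(star_join).fetchall()

totals_by_city = dict(conn.execute("""
SELECT l.city, SUM(s.quantity)
FROM sales s JOIN location l ON l.location_key = s.location_key
GROUP BY l.location_key, l.city
""").fetchall())
```

With the example data, the totals per city come out as Stockholm 42, London 39 and Paris 160.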
Star-Join Schema
• A single fact table and a single table for each dimension
• Every fact points to one tuple in each of the dimensions and has additional attributes
• The fact table is highly normalised, whereas the dimension tables are not normalised.
• Dimensions do not capture hierarchies directly
• Generated keys are used for performance and maintenance reasons
• Fact constellation: Multiple fact tables that share many dimension tables
12
Snowflake Schema
• Represents dimensional hierarchies directly by normalising the dimension tables
• Saves storage
• Reduces the effectiveness of browsing
Example - Snowflake Schema
[Figure: snowflake schema.
Fact table: Telephone calls (sum ($), number of calls).
Dimension tables: Service used (service name), Time (date), Customer (customer name, address), Sales Dimension (seller name).
Normalised hierarchy tables: Service group, Month, Quarter, Year, Region, Income group, Office.]
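A small sketch may help to show what normalising a dimension means in practice. It models only the time hierarchy of the telephone example, and the exact links (date to month to quarter), the table and column names, and all data values are assumptions made for illustration. The point is that a roll-up to quarter now needs one join per hierarchy level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Time dimension snowflaked into separate hierarchy tables (assumed layout).
conn.executescript("""
CREATE TABLE quarter (
    quarter_key INTEGER PRIMARY KEY, quarter_name TEXT, year INTEGER);
CREATE TABLE month (
    month_key INTEGER PRIMARY KEY, month_name TEXT,
    quarter_key INTEGER REFERENCES quarter(quarter_key));
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, date TEXT,
    month_key INTEGER REFERENCES month(month_key));
CREATE TABLE telephone_calls (
    time_key INTEGER REFERENCES time_dim(time_key),
    sum_usd REAL, number_of_calls INTEGER);
""")

# Hypothetical rows, only to make the query runnable.
conn.executemany("INSERT INTO quarter VALUES (?, ?, ?)",
                 [(1, "Q1", 2024), (2, "Q2", 2024)])
conn.executemany("INSERT INTO month VALUES (?, ?, ?)",
                 [(1, "Jan", 1), (2, "Feb", 1), (4, "Apr", 2)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "2024-01-15", 1), (2, "2024-02-10", 2),
                  (3, "2024-04-05", 4)])
conn.executemany("INSERT INTO telephone_calls VALUES (?, ?, ?)",
                 [(1, 12.5, 3), (2, 20.0, 4), (3, 7.5, 5)])

# Rolling up to quarter walks the normalised hierarchy join by join.
calls_per_quarter = conn.execute("""
SELECT q.quarter_name, SUM(c.number_of_calls)
FROM telephone_calls c
JOIN time_dim t ON t.time_key = c.time_key
JOIN month m    ON m.month_key = t.month_key
JOIN quarter q  ON q.quarter_key = m.quarter_key
GROUP BY q.quarter_key, q.quarter_name
ORDER BY q.quarter_key
""").fetchall()
```

In a star schema the quarter would simply be a column of the time dimension, so the same question would need a single join. That is the browsing cost the slide refers to, traded against less redundant storage.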
13
Typical OLAP Operations
• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  – project and select
• Pivot (rotate):
  – reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  – drill across: involving (across) more than one fact table
  – drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
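Some of these operations can be sketched on a tiny in-memory cube. The sketch below stores the Sales example as a dictionary keyed by (city, product, month); the function names roll_up, slice_cube and dice are hypothetical helpers written for this illustration, not the API of any OLAP product.

```python
from collections import defaultdict

DIMS = ("city", "product", "month")

cities = ["Stockholm", "London", "Paris"]
products = ["# 5", "Noah", "Opium"]
months = ["Jan", "Feb"]

# Fact rows (LKey, PKey, TKey, Qnt) from the Sales example.
sales_rows = [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
]

# The cube: one cell per (city, product, month) holding the quantity sold.
cube = {(cities[l - 1], products[p - 1], months[t - 1]): q
        for l, p, t, q in sales_rows}


def roll_up(cells, keep):
    """Roll up by dimension reduction: sum over all dimensions not kept."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, qty in cells.items():
        out[tuple(key[i] for i in idx)] += qty
    return dict(out)


def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: q for k, q in cells.items() if k[i] == value}


def dice(cells, **allowed):
    """Dice: keep the sub-cube whose coordinates are in the allowed sets."""
    return {k: q for k, q in cells.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}


by_city = roll_up(cube, ["city"])
january = slice_cube(cube, "month", "Jan")
paris_sub_cube = dice(cube, city={"Paris"}, product={"Noah", "Opium"})
```

Drill down is the reverse direction: going from by_city back to roll_up(cube, ["city", "month"]) reintroduces the month dimension.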
Approaches to OLAP Servers
• Relational OLAP (ROLAP)
  – Relational and Extended Relational DBMS to store and manage warehouse data
• Multidimensional OLAP (MOLAP)
  – Array-based storage structure (n-dimensional array)
  – Direct access to array data structure
  – Good indexing properties
  – Poor storage utilisation when the data is sparse.
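The MOLAP trade-off can be seen in a minimal sketch of array-based storage. The class DenseCube below is a hypothetical illustration, not how any particular OLAP server is implemented: cells live in one flat array, a cell is addressed by computing its offset from the coordinates, and empty cells still occupy space.

```python
class DenseCube:
    """n-dimensional array stored as one flat list (row-major order)."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        size = 1
        for n in self.shape:
            size *= n
        self.cells = [None] * size  # every cell is allocated up front

    def _offset(self, coords):
        # Direct access: the position follows from the coordinates alone.
        offset = 0
        for c, n in zip(coords, self.shape):
            if not 0 <= c < n:
                raise IndexError("coordinate out of range")
            offset = offset * n + c
        return offset

    def set(self, coords, value):
        self.cells[self._offset(coords)] = value

    def get(self, coords):
        return self.cells[self._offset(coords)]

    def utilisation(self):
        """Fraction of allocated cells that actually hold a value."""
        return sum(c is not None for c in self.cells) / len(self.cells)


# 3 cities x 3 products x 4 months, as in the example dimension tables.
molap_cube = DenseCube((3, 3, 4))

sales_rows = [
    (1, 1, 1, 5), (1, 2, 1, 7), (1, 3, 1, 4),
    (2, 1, 1, 8), (2, 2, 1, 3), (2, 3, 1, 5),
    (3, 1, 1, 20), (3, 2, 1, 10), (3, 3, 1, 30),
    (1, 1, 2, 10), (1, 2, 2, 9), (1, 3, 2, 7),
    (2, 1, 2, 5), (2, 2, 2, 10), (2, 3, 2, 8),
    (3, 1, 2, 20), (3, 2, 2, 50), (3, 3, 2, 30),
]
for lkey, pkey, tkey, qty in sales_rows:
    molap_cube.set((lkey - 1, pkey - 1, tkey - 1), qty)

paris_noah_feb = molap_cube.get((2, 1, 1))
fill_ratio = molap_cube.utilisation()
```

Because the example has sales only for January and February, half of the 36 allocated cells stay empty. With more dimensions and realistic cardinalities the empty share is typically far larger, which is the sparsity problem named above.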
14
Bitmap Indexing
• An effective indexing technique for attributes with low-cardinality domains
• There is a distinct bit vector BV for each value V of the domain
• Example: the attribute sex has the values M and F. A table of 100 million people needs 2 lists of 100 million bits.
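A minimal sketch of the idea, assuming each bit vector is held as a Python integer in which bit i stands for row i. The helper names and the five-row sample column are hypothetical; the size calculation just restates the arithmetic of the example (one bit per row and per distinct value).

```python
def build_bitmap_index(column):
    """Return {value: bit vector}; bit i is set when row i has that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def index_size_bytes(n_rows, n_values):
    """Uncompressed size: n_values bit vectors of n_rows bits each."""
    return n_rows * n_values // 8


sex_column = ["M", "F", "F", "M", "F"]  # hypothetical sample rows
sex_index = build_bitmap_index(sex_column)

# 2 vectors of 100 million bits = 200 million bits = 25 million bytes.
size_for_example = index_size_bytes(100_000_000, 2)
```

The size grows linearly with the number of distinct values, which is why the technique suits low-cardinality attributes.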
Bitmap Index
Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    W       H

Region Index
RowId  N  S  E  W
1      1  0  0  0
2      0  1  0  0
3      0  0  0  1
4      0  0  0  1
5      0  1  0  0
6      0  0  0  1
7      0  0  0  1

Rating Index
RowId  H  M  L
1      1  0  0
2      0  1  0
3      0  0  1
4      1  0  0
5      0  0  1
6      0  0  1
7      1  0  0

SELECT Cust FROM BaseTable
WHERE Region = 'W' AND Rating = 'L'
Bitmap Index
(Base Table, Region Index and Rating Index as on the previous slide)
Region = W AND Rating = L
RowId  Region = W  Rating = L  AND
1      0           0           0
2      0           0           0
3      1           1           1
4      1           0           0
5      0           1           0
6      1           1           1
7      1           0           0
Rows 3 and 6 qualify, i.e. customers C3 and C6.
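The evaluation of the query through the two indexes can be sketched as follows. Bit vectors are plain Python lists of 0/1 here to keep the row positions readable, and the helper name bit_vector is an assumption of this sketch.

```python
# Base table columns from the example, row 1 first.
customers = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
region = ["N", "S", "W", "W", "S", "W", "W"]
rating = ["H", "M", "L", "H", "L", "L", "H"]


def bit_vector(column, value):
    """Bit vector of one index column: 1 where the row holds the value."""
    return [1 if v == value else 0 for v in column]


region_w = bit_vector(region, "W")
rating_l = bit_vector(rating, "L")

# The WHERE clause becomes a bitwise AND of the two vectors.
hits = [a & b for a, b in zip(region_w, rating_l)]

# Only rows whose bit survived the AND are fetched from the base table.
result = [cust for cust, bit in zip(customers, hits) if bit]
```

The base table is touched only for the qualifying rows; the selection itself is done on the bit vectors.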
Join Index
• Join index, roughly: JI(Cf, R-id), where D(Cd, R-id, …) ⋈ Cd=Cf F(Cf, R-id, …), with D a dimension table and F the fact table
• Traditional indexes map the values to a list of record ids.
• In a data warehouse, a join index relates the values of the dimensions of a star schema to rows in the fact table
• Join indices can span multiple dimensions
Join Index - Ex
Location
Rid   LocationKey  City       …
rid1  1            Stockholm  …
rid2  2            London     …
rid3  3            Paris      …

Product
Rid    ProductKey  Name   …
rid22  1           # 5    …
rid23  2           Noah   …
rid24  3           Opium  …

Time
Rid    TimeKey  Month  …
rid25  1        Jan    …
rid26  2        Feb    …
rid27  3        Mar    …
rid28  4        Apr    …

Sales
Rid    LKey  PKey  TKey  Qnt
rid4   1     1     1     5
rid5   1     2     1     7
rid6   1     3     1     4
rid7   2     1     1     8
rid8   2     2     1     3
rid9   2     3     1     5
rid10  3     1     1     20
rid11  3     2     1     10
rid12  3     3     1     30
rid13  1     1     2     10
rid14  1     2     2     9
rid15  1     3     2     7
rid16  2     1     2     5
rid17  2     2     2     10
rid18  2     3     2     8
rid19  3     1     2     20
rid20  3     2     2     50
rid21  3     3     2     30
Join Index - Ex1
(Base tables as in Join Index - Ex.)
CityJI
CityK  Rid
1      rid4
1      rid5
1      rid6
1      rid13
1      rid14
1      rid15
2      rid7
2      rid8
2      rid9
2      rid16
2      rid17
2      rid18
…
Join Index - Ex2
(Base tables as in Join Index - Ex.)
City-Product JI
CityK  PrdK  Rid
1      1     rid4
1      1     rid13
1      2     rid5
1      2     rid14
1      3     rid6
1      3     rid15
…
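Both example indexes can be reproduced with a short sketch. It assumes the fact rows are available as a mapping from record id to (LKey, PKey, TKey, Qnt); the function build_join_index and the dictionary layout are hypothetical choices for this illustration.

```python
from collections import defaultdict

# Sales fact rows keyed by record id: (LKey, PKey, TKey, Qnt).
sales = {
    "rid4": (1, 1, 1, 5), "rid5": (1, 2, 1, 7), "rid6": (1, 3, 1, 4),
    "rid7": (2, 1, 1, 8), "rid8": (2, 2, 1, 3), "rid9": (2, 3, 1, 5),
    "rid10": (3, 1, 1, 20), "rid11": (3, 2, 1, 10), "rid12": (3, 3, 1, 30),
    "rid13": (1, 1, 2, 10), "rid14": (1, 2, 2, 9), "rid15": (1, 3, 2, 7),
    "rid16": (2, 1, 2, 5), "rid17": (2, 2, 2, 10), "rid18": (2, 3, 2, 8),
    "rid19": (3, 1, 2, 20), "rid20": (3, 2, 2, 50), "rid21": (3, 3, 2, 30),
}


def build_join_index(fact, key_positions):
    """Map each combination of dimension keys to the matching fact rids."""
    index = defaultdict(list)
    for rid, row in fact.items():
        index[tuple(row[i] for i in key_positions)].append(rid)
    return dict(index)


# Single-dimension join index on the city (location) key, as in Ex1.
city_ji = build_join_index(sales, [0])

# Join index spanning two dimensions, city and product, as in Ex2.
city_product_ji = build_join_index(sales, [0, 1])

# Using the index: total quantity for London (key 2) without scanning
# the fact rows of other cities.
london_quantity = sum(sales[rid][3] for rid in city_ji[(2,)])
```

The entries for city key 1 match the first six rows of CityJI above, and the entries for (1, 1), (1, 2) and (1, 3) match the City-Product JI rows.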
Summary
• Data warehouse
  – A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• OLAP servers: ROLAP, MOLAP