7/30/2019 Datawarehouse Intro Ch1 Ch2

A producer wants to know ...

Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?


Data, Data everywhere yet ...

I can't find the data I need
  data is scattered over the network
  many versions, subtle differences
I can't get the data I need
  need an expert to get the data
I can't understand the data I found
  available data poorly documented
I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to other


What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

[Barry Devlin]


What are the users saying...

Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required


What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

[Figure labels: Data, Information]


Evolution

60s: Batch reports
  hard to find and analyze information
  inflexible and expensive, reprogram every new request
70s: Terminal-based DSS and EIS (executive information systems)
  still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools
  query tools, spreadsheets, GUIs
  easier to use, but only access operational databases
90s: Data warehousing with integrated OLAP engines and tools


Warehouses are Very Large Databases

[Figure: bar chart of respondents (0% to 35%) by warehouse size (5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB); series: Initial, Projected 2Q96. Source: META Group, Inc.]


Very Large Data Bases

Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos


Data Warehousing --

It is a process
It is a product
It is an environment


Data Warehousing -- It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
A decision support database maintained separately from the organization's operational database


Data Warehouse (2nd Chapter)

A data warehouse is a
  subject-oriented
  integrated
  time-varying
  non-volatile
collection of data that is used primarily in organizational decision making.


Data Warehouse

Subject Oriented

The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.

Data Warehouse

Integrated

The data warehouse contains data from most or all of an organization's operational systems, and this data is made consistent.

Data Warehouse (2nd Chapter)

Non-volatile Data

Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only, and retained for future reporting.

Data Warehouse

Time Variant

In a data warehouse environment, the decision makers can view the data across the field of time at whichever level of detail they may wish.


Data Granularity

Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the "extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet."


Cont.

Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.


Example:

Say we have a data mart with a single fact (Sales) and three dimensions (Time, Organization and Product). The fact table contains three metrics (Unit Price, Units Sold and Total Sale Amount). The Time dimension consists of four hierarchical elements (Year, Quarter, Month and Day). The Organization dimension consists of three hierarchical elements (Region, District and Store). The Product dimension consists of two hierarchical elements (Product Family and SKU).

Cont.

As always, the metrics in the Sales fact table must be stored at some intersection of the dimensions (i.e., Time, Organization and Product). Hence, in this data mart, the highest granularity at which we can store Sales metrics is by Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy). Conversely, the lowest granularity to which we can aggregate Sales metrics in this data mart is by Year/Region/Product Family (i.e., the highest level in each dimensional hierarchy). We may also (for a variety of performance reasons) choose to store Sales metrics at some intermediate level of granularity (e.g., by Month/District/SKU).
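The Sales example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, store names and SKU codes are made up, and roll_up is a hypothetical helper, not part of any warehouse product. It stores facts at the Day/Store/SKU grain and aggregates Total Sale Amount to coarser grains.

```python
from collections import defaultdict

# Hypothetical Sales facts stored at the finest grain: Day / Store / SKU.
SALES = [
    {"year": 2019, "month": 1, "day": 5, "region": "West", "district": "D1",
     "store": "S1", "family": "Snacks", "sku": "SKU-1", "units_sold": 3, "unit_price": 2.0},
    {"year": 2019, "month": 1, "day": 5, "region": "West", "district": "D1",
     "store": "S2", "family": "Snacks", "sku": "SKU-1", "units_sold": 1, "unit_price": 2.0},
    {"year": 2019, "month": 2, "day": 9, "region": "West", "district": "D2",
     "store": "S3", "family": "Snacks", "sku": "SKU-2", "units_sold": 4, "unit_price": 5.0},
    {"year": 2019, "month": 4, "day": 1, "region": "East", "district": "D3",
     "store": "S4", "family": "Drinks", "sku": "SKU-3", "units_sold": 2, "unit_price": 1.5},
]


def roll_up(rows, levels):
    """Sum the total sale amount at the grain named by `levels`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["units_sold"] * row["unit_price"]
    return dict(totals)


# Intermediate grain: Month / District / SKU.
by_month_district_sku = roll_up(SALES, ("year", "month", "district", "sku"))
# Coarsest grain in this mart: Year / Region / Product Family.
by_year_region_family = roll_up(SALES, ("year", "region", "family"))
```

The coarser the grain, the fewer rows remain, which is the trade-off the slide describes.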


The information flow mechanism

[Figure labels: Extract, Transform, Load, Operational data store]


Data extraction from source

Identify the source
Finalize the filters for each source
Produce automatic extract file from operational data
Generate intermediate file
Render automated job control services for creating extract files
Reformat and standardize input
Produce common application code for data extraction
Resolve inconsistencies for common data that will be extracted from multiple source systems


Metadata in warehouse

Metadata is one of the important keys to the success of the data warehousing and business intelligence effort.
Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and business intelligence system:

What is Metadata?

Reports
Cubes
Tables (Records, Segments, Entities, etc.)
Columns (Fields, Attributes, Data Elements, etc.)
Keys
Indexes

Metadata is often used to control the handling of data and describes:

Rules
Transformations
Aggregations
Mappings

Data Warehouse Metadata

Data warehousing has specific metadata requirements. Metadata that describes tables typically includes:

Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage
DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
Location
Definition
Notes
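As a rough sketch, the table-level metadata listed above could be held in a small record type. The class name, field names, the sample values and the problems() check are all assumptions made for illustration; the slides do not prescribe any particular representation.

```python
from dataclasses import dataclass

# Allowed values taken from the slide; everything else here is hypothetical.
TABLE_TYPES = {"Fact", "Dimension", "Bridge"}
ROLES = {"Legacy", "OLTP", "Stage"}


@dataclass
class TableMetadata:
    physical_name: str
    logical_name: str
    table_type: str
    role: str
    dbms: str
    location: str
    definition: str
    notes: str = ""

    def problems(self):
        """Return a list of human-readable problems with this entry."""
        found = []
        if self.table_type not in TABLE_TYPES:
            found.append(f"unknown type: {self.table_type}")
        if self.role not in ROLES:
            found.append(f"unknown role: {self.role}")
        return found


sales_fact = TableMetadata(
    physical_name="SLS_FCT",
    logical_name="Sales Fact",
    table_type="Fact",
    role="Stage",
    dbms="Oracle",
    location="dw01.sales",
    definition="One row per day, store and SKU",
)
```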


Data Warehouse for Decision Support & OLAP

Putting information technology to help the knowledge worker make faster and better decisions

Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over last 10 years?


Decision Support

Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements


Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence


We want to know ...

Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?

Data Mining helps extract such information


Application Areas

Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis


Data Mining in Use

The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers


What makes data mining possible?

Advances in the following areas are making data mining deployable:
  data warehousing
  better and more data (i.e., operational, behavioral, and demographic)
  the emergence of easily deployed data mining tools and
  the advent of new data mining techniques.


Why Separate Data Warehouse?

Performance
  Operational databases are designed and tuned for known transactions and workloads.
  Complex OLAP queries would degrade performance for operational transactions.
  Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
  Missing data: Decision support requires historical data, which operational databases do not typically maintain.
  Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.


What are Operational Systems?

They are OLTP systems
Run mission critical applications
Need to work with stringent performance requirements for routine tasks
Used to run a business!


RDBMS used for OLTP

Database Systems have been used traditionally for OLTP
  clerical data processing tasks
  detailed, up to date data
  structured repetitive tasks
  read/update a few records
  isolation, recovery and integrity are critical


Operational Systems

Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers


Examples of Operational Data

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium


So, what's different?

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity


OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, while workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
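The example query above can be written against a toy calls table. This sketch uses Python's built-in sqlite3 module; the table name, column names and sample rows are assumptions made for illustration, and "9AM-5PM" is read here as calls starting at 09:00 or later and before 17:00.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per phone call.
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2018-12-03 10:15:00", 12.0),
        ("Pune", "2018-12-03 16:45:00", 8.0),
        ("Pune", "2018-12-04 20:00:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2018-11-20 11:00:00", 50.0),    # not December
        ("Mumbai", "2018-12-05 12:00:00", 40.0),  # not Pune
    ],
)

# The query constrains three dimensions at once: location, time of day, month.
average_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND CAST(strftime('%H', started_at) AS INTEGER) BETWEEN 9 AND 16
    """
).fetchone()[0]
```

A query like this scans and filters many rows across several dimensions, which is why it would be a poor fit for a system tuned for short, predefined transactions.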


OLTP vs Data Warehouse

OLTP | Data Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets


To summarize ...

OLTP Systems are used to run a business
The Data Warehouse helps to optimize the business


Why Now?

Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available


Myths surrounding OLAP Servers and Data Marts

Data marts and OLAP servers are departmental solutions supporting a handful of users
Million dollar massively parallel hardware is needed to deliver fast time for complex queries
OLAP servers require massive and unwieldy indices
Complex OLAP queries clog the network with data
Data warehouses must be at least 100 GB to be effective


Wal*Mart Case Study

Founded by Sam Walton
One of the largest Super Market Chains in the US
Wal*Mart: 2000+ Retail Stores
SAM's Clubs: 100+ Wholesalers Stores

This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database Seminar


Old Retail Paradigm

Wal*Mart
  Inventory Management
  Merchandise Accounts Payable
  Purchasing
  Supplier Promotions: National, Region, Store Level
Suppliers
  Accept Orders
  Promote Products
  Provide Special Incentives
  Monitor and Track the Incentives
  Bill and Collect Receivables
  Estimate Retailer Demands


New (Just-In-Time) Retail Paradigm

No more deals
Shelf-Pass Through (POS Application)
  One Unit Price
    Suppliers paid once a week on ACTUAL items sold
  Wal*Mart Manager
    Daily Inventory Restock
    Suppliers (sometimes Same Day) ship to Wal*Mart
Warehouse-Pass Through
  Stock some Large Items
    Delivery may come from supplier
  Distribution Center
    Supplier's merchandise unloaded directly onto Wal*Mart Trucks


Wal*Mart System

NCR 5100M 96 Nodes: 24 TB Raw Disk; 700-1000 Pentium CPUs
Number of Rows: > 5 Billion
Historical Data: 65 weeks (5 Quarters)
New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million+
Number of Users: Thousands
Number of Queries: 60,000 per week


Course Overview

0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead

Demos and Labs

I. Data Warehouses: Architecture, Design & Construction

DW Architecture
Loading, refreshing
Structuring/Modeling
DWs and Data Marts
Query Processing


Data Warehouse Architecture

[Figure labels: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]


Characteristics of data warehouse architecture

Different objectives and scope (analytical)
Data content (read only)
Complex analysis and quick response
Flexible and dynamic
Metadata driven

Goal

Architecture of data warehouse becomes the framework for product selection
It is a collection of documents, plans, models, drawings, and specifications
Architecture has to be driven by the business

DW architecture

It is a way of representing the overall structure of the data, processing and presentation that exists for end-user computing within the organization
It has a number of interconnected components

Components

Operational database layer
Information access layer
Data access layer
Data directory layer
Process management layer
Application messaging layer
Data warehouse (physical) layer
Data staging layer


Components of the Warehouse

Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Data Mining tools
ETL (extract, transform, load)

Loading the Warehouse

Cleaning the data before it is loaded


Source Data

Typically host based, legacy applications
  Customized applications, COBOL, 3GL, 4GL
Point of Contact Devices
  POS (point of sale), ATM, call switches (a call switch manages inbound telephone calls)

[Figure labels: Operational/Source Data: Sequential, Legacy, Relational, External]

External Sources

Nielsen's (Nielsen monitors and measures more than 90% of global Internet activity and provides insights about the online universe, including audiences and advertising)
Acxiom (provides a range of information services and products geared towards enterprise data management and retrieval)
CMIE (Centre for Monitoring Indian Economy)
Vendors, Partners


Data Quality - The Reality

Tempting to think creating a data warehouse is simply extracting operational data and entering into a data warehouse
Nothing could be farther from the truth
Warehouse data comes from disparate questionable sources

Data Quality - The Reality

Legacy systems no longer documented
Outside sources with questionable quality procedures
Production systems with no built-in integrity checks and no integration
  Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
  "And get it done quickly, we do not have time to worry about corporate standards..."


Data Integration Across Sources

[Figure labels: sources Trust, Credit card, Savings, Loans]

Same data, different name
Different data, same name
Data found here, nowhere else
Different keys, same data


Data Transformation Example

Field names
  appl A - balance
  appl B - bal
  appl C - currbal
  appl D - balcurr
Units
  appl A - pipeline - cm
  appl B - pipeline - in
  appl C - pipeline - feet
  appl D - pipeline - yds
Encoding
  appl A - m, f
  appl B - 1, 0
  appl C - x, y
  appl D - male, female

All four feed the Data Warehouse
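A minimal sketch of the transformation above: each application's record is mapped to one warehouse convention. The target conventions (a field called balance, lengths in cm, gender spelled out) are assumptions, and so is the direction of the codes for applications B and C (1 and x are taken to mean male), since the slide does not say.

```python
# Source field that holds the balance in each application (from the slide).
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

# Centimetres per source unit: A uses cm, B inches, C feet, D yards.
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Assumed meaning of each application's gender codes.
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}


def to_warehouse(app, record):
    """Convert one source record into the (assumed) warehouse format."""
    return {
        "balance": record[BALANCE_FIELD[app]],
        "pipeline_cm": record["pipeline"] * CM_PER_UNIT[app],
        "gender": GENDER_CODES[app][str(record["gender"]).lower()],
    }


row_b = to_warehouse("B", {"bal": 120.0, "pipeline": 10, "gender": 1})
row_d = to_warehouse("D", {"balcurr": 75.5, "pipeline": 2, "gender": "female"})
```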


Data Integrity Problems

Same person, different spellings
  Agarwal, Agrawal, Aggarwal etc.
Multiple ways to denote company name
  Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
  mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
  manual entry leads to mistakes
  "in case of a problem use 9999999"


Data Transformation Terms

Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating

Extracting
  Capture of data from operational source in "as is" status
  Sources for data generally in legacy mainframes in VSAM (virtual storage access method), IMS (information management system), IDMS (integrated DBMS), DB2; more data today in relational databases on Unix
Conditioning
  The conversion of data types from the source to the target data store (warehouse) --


Householding

Identifying all members of a household (living at the same address)
Ensures only one mail is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.
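A small sketch of householding, assuming that two customers belong to the same household when their addresses match after trivial normalization (lower-casing, dropping commas and extra spaces). Real tools use far more robust address matching; the names, addresses and helper functions here are hypothetical. The last lines reproduce the savings arithmetic from the slide.

```python
from collections import defaultdict


def normalize_address(address):
    """Crude normalization: lower-case, drop commas, collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


def households(customers):
    """Group customer names by normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)


CUSTOMERS = [
    ("Anand", "12 MG Road, Pune"),
    ("Seshadri", "12  mg road Pune"),
    ("Rao", "7 Park Street, Mumbai"),
]

by_household = households(CUSTOMERS)
mailings_needed = len(by_household)  # one catalogue per household

# Savings arithmetic from the slide: 1 lakh = 100,000.
catalogues = 100_000
cost_per_catalogue = 50                               # Rs.
total_cost = catalogues * cost_per_catalogue          # Rs. 50 lakhs
savings_at_two_percent = total_cost * 2 // 100        # Rs. 1 lakh
```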


Enrichment
  Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A.C. Nielsen, CMIE, IMRA (provides an extensive digest of media, polls, and significant interviews and events), etc.
Scoring
  Computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI (American telecom company), chance that a customer is likely to buy a new product


Loads

After extracting, scrubbing, cleaning, validating etc., need to load the data into the warehouse
Issues
  huge volumes of data to be loaded
  small time window available when warehouse can be taken off line (usually nights)
  when to build index and summary tables
  allow system administrators to monitor, cancel, resume, change load rates
  recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques

Use SQL to append or insert new data
  record at a time interface
  will lead to random disk I/Os
Use batch load utility


Load Taxonomy

Incremental versus Full loads
Online versus Offline loads

Refresh

Propagate updates on source data to the warehouse
Issues:
  when to refresh
  how to refresh -- refresh techniques

When to Refresh?

periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse users require current data (e.g., up-to-the-minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources

Refresh Techniques

Full Extract from base tables
  read entire source table: too expensive
  maybe the only choice for legacy systems


How To Detect Changes

Create a snapshot log table to record ids of updated rows of source data and timestamp
Detect changes by:
  Defining after row triggers to update snapshot log when source table changes
  Using regular transaction log to detect changes to source data
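The trigger-based approach can be sketched with Python's built-in sqlite3 module. The account source table, the snapshot_log table and the trigger name are hypothetical; a production source system would use its own DBMS's trigger syntax, which may differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source table and snapshot log.
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE snapshot_log (row_id INTEGER, changed_at TEXT);

    -- After-row trigger: record the id and timestamp of every updated row.
    CREATE TRIGGER account_after_update AFTER UPDATE ON account
    FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
    END;
    """
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 250.0), (3, 75.0)]
)
conn.execute("UPDATE account SET balance = balance + 10 WHERE id = 2")

# A refresh job only needs to re-extract the rows named in the log.
changed_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT row_id FROM snapshot_log ORDER BY row_id"
    )
]
```

Only the logged rows have to be propagated to the warehouse, instead of re-reading the entire source table.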


Data Extraction and Cleansing

Extract data from existing operational and legacy data
Issues:
  Sources of data for the warehouse
  Data quality at the sources
  Merging different data sources
  Data Transformation
  How to propagate updates (on the sources) to the warehouse
  Terabytes of data to be loaded

Scrubbing Data

Sophisticated transformation tools
Used for cleaning the quality of data
Clean data is vital for the success of the warehouse
Example
  Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person

Scrubbing Tools

Apertus -- Enterprise/Integrator
Vality -- IPE
Postal Soft


Structuring/Modeling Issues

Data -- Heart of the Data Warehouse

Heart of the data warehouse is the data itself!
Single version of the truth
Corporate memory
Data is organized in a way that represents business -- subject orientation

Data Warehouse Structure

Subject Orientation -- customer, product, policy, account etc. A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Warehouse Structure

base customer (1985-87)
  custid, from date, to date, name, phone, dob
base customer (1988-90)
  custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89)
  custid, activity date, amount, clerk id, order no
customer activity detail (1990-91)
  custid, activity date, amount, line item no, order no

Time is part of key of each table


Data Granularity in Warehouse

Summarized data stored
  reduce storage costs
  reduce cpu usage
  increases performance since smaller number of records to be processed
  design around traditional high level reporting needs
  tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse

Cannot answer some questions with summarized data
  Did Anand call Seshadri last month? Not possible to answer if total duration of calls by Anand over a month is only maintained and individual call details are not.
Detailed data too voluminous

Granularity in Warehouse

Tradeoff is to have dual level of granularity
  Store summary data on disks
    95% of DSS processing done against this data
  Store detail on tapes
    5% of DSS processing against this data


Vertical Partitioning

Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address

Frequently accessed: Acct. No, Balance
  Smaller table and so less I/O
Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
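The split shown above can be illustrated with plain Python dictionaries standing in for table rows. The sample accounts and the vertical_partition helper are made up for this sketch; in a real warehouse the two partitions would be separate tables sharing the account number as key.

```python
ACCOUNTS = [
    {"acct_no": 101, "name": "A. Rao", "balance": 2500.0,
     "date_opened": "2015-04-01", "interest_rate": 3.5, "address": "Pune"},
    {"acct_no": 102, "name": "S. Iyer", "balance": 900.0,
     "date_opened": "2017-09-15", "interest_rate": 4.0, "address": "Mumbai"},
]


def vertical_partition(rows, key, hot_columns):
    """Split rows into a narrow 'hot' table and a wide 'cold' table.

    Both partitions keep the key column so they can be joined back.
    """
    hot = [{col: row[col] for col in (key, *hot_columns)} for row in rows]
    cold = [
        {col: val for col, val in row.items() if col not in hot_columns}
        for row in rows
    ]
    return hot, cold


balances, details = vertical_partition(ACCOUNTS, "acct_no", ["balance"])
```

Queries that only need the balance read the narrow partition, which holds far fewer bytes per row.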


Derived Data

Introduction of derived (calculated) data may often help
Have seen this in the context of dual levels of granularity
Can keep auxiliary views and indexes to speed up query processing

Schema Design

Database organization
  must look like business
  must be recognizable by business user
  approachable by business user
  must be simple
Schema Types
  Star Schema
  Fact Constellation Schema
  Snowflake schema


Dimension Tables

Dimension tables
  Define business in terms already familiar to users
  Wide rows with lots of descriptive text
  Small tables (about a million rows)
  Joined to fact table by a foreign key
  heavily indexed
  typical dimensions
    time periods, geographic region (markets, cities), products, customers, salesperson, etc.

In data warehousing, a dimension table is one of the set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.

Fact Table

Central table
  mostly raw numeric items
  narrow rows, a few columns at most
  large number of rows (millions to a billion)
  Access via dimensions

In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed.


Star Schema

A single fact table and for each dimension one dimension table
Does not capture hierarchies directly

[Figure: dimension tables time, prod, cust, city around a central fact table (date, custno, prodno, cityname, ...)]
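A toy star schema along the lines of the figure, built with Python's sqlite3 module. The column lists beyond date, custno, prodno and cityname (for example amount and region), the name sale_date, and all sample rows are assumptions for illustration. Note that region is kept as a plain column of the city dimension, which is what "does not capture hierarchies directly" means in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per dimension.
    CREATE TABLE time_dim (sale_date TEXT PRIMARY KEY, month INTEGER, year INTEGER);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);

    -- Central fact table: foreign keys to each dimension plus a measure.
    CREATE TABLE fact (
        sale_date TEXT REFERENCES time_dim(sale_date),
        custno INTEGER REFERENCES cust(custno),
        prodno INTEGER REFERENCES prod(prodno),
        cityname TEXT REFERENCES city(cityname),
        amount REAL
    );
    """
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Pune", "West"), ("Mumbai", "West"), ("Chennai", "South")],
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
    [
        ("2019-01-05", 1, 10, "Pune", 10.0),
        ("2019-01-06", 2, 11, "Mumbai", 20.0),
        ("2019-01-06", 1, 10, "Chennai", 5.0),
    ],
)

# Constrain and group through a dimension, aggregate the fact measure.
sales_by_region = conn.execute(
    """
    SELECT c.region, SUM(f.amount)
    FROM fact f JOIN city c ON c.cityname = f.cityname
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
```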


Snowflake schema

Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage

[Figure: dimension tables time, prod, cust, city around a central fact table (date, custno, prodno, cityname, ...); city is further normalized into region]

A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape.

Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension being represented by a single table.
When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables and not the fact tables.


Fact Constellation

Fact Constellation
  Multiple fact tables that share many dimension tables
  Booking and Checkout may share many dimension tables in the hotel industry

[Figure: fact tables Booking and Checkout sharing dimension tables Hotels, Travel Agents, Promotion, Room Type, Customer]

De-normalization

Normalization in a data warehouse may lead to lots of small tables
Can lead to excessive I/Os since many tables have to be accessed
De-normalization is the answer, especially since updates are rare


Creating Arrays

Many times each occurrence of a sequence of data is in a different physical location
Beneficial to collect all occurrences together and store as an array in a single row
Makes sense only if there are a stable number of occurrences which are accessed together
In a data warehouse, such situations arise naturally due to time based orientation
  can create an array by month

Selective Redundancy

Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table
Updates have to be careful


Partitioning

Breaking data into several physical units that can be handled separately
Not a question of whether to do it in data warehouses but how to do it
Granularity and partitioning are key to effective implementation of a warehouse

Why Partition?

Flexibility in managing data
Smaller physical units allow
  easy restructuring
  free indexing
  sequential scans if needed
  easy reorganization
  easy recovery
  easy monitoring

Criterion for Partitioning

Typically partitioned by
  date
  line of business
  geography
  organizational unit
  any combination of above

Where to Partition?

Application level or DBMS level
Makes sense to partition at application level
  Allows different definition for each year
    Important since warehouse spans many years and as business evolves definition changes
  Allows data to be moved between processing complexes easily


Data Warehouse vs. Data Marts

What comes first

From the Data Warehouse to Data Marts

[Figure labels: Data Warehouse (Organizationally Structured), Departmentally Structured, Individually Structured; axes: Less/More; History, Normalized, Detailed; Data, Information]

Data Warehouse and Data Marts

OLAP Data Mart
  Lightly summarized
  Departmentally structured
Data Warehouse
  Organizationally structured
  Atomic
  Detailed Data

Characteristics of the Departmental Data Mart

OLAP
Small
Flexible
Customized by Department
Source is departmentally structured data warehouse

Techniques for Creating Departmental Data Mart

OLAP
Subset
Summarized
Superset
Indexed
Arrayed

[Figure labels: Sales, Mktg., Finance]

Data Mart Centric

[Figure labels: Data Sources, Data Marts, Data Warehouse]

Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem

True Warehouse

[Figure labels: Data Sources, Data Warehouse, Data Marts]


Query Processing (end)

Indexing
Pre-computed views/aggregates
SQL extensions

Indexing Techniques

Exploiting indexes to reduce scanning of data is of crucial importance
Bitmap Indexes
Join Indexes
Other Issues
  Text indexing
  Parallelizing and sequencing of index builds and incremental updates

Indexing Techniques

Bitmap index:
  A collection of bitmaps -- one for each distinct value of the column
  Each bitmap has N bits where N is the number of rows in the table
  A bit corresponding to a value v for a row r is set if and only if r has the value v for the indexed attribute


120/193

123

BitMap Indexes

An alternative representation of RID-listSpecially advantageous for low-cardinalitydomains

Represent each row of a table by a bitand the table as a bit vectorThere is a distinct bit vector Bv for eachvalue v for the domainExample: the attribute sex has values Mand F. A table of 100 million peopleneeds 2 lists of 100 million bits

Bitmap Index


121/193

124

Bitmap Index

Customer

Query: select * from customer where gender = 'F' and vote = 'Y'

[Figure: customer table with a gender column (M, F, F, F, F, M) and a vote column (Y, Y, Y, N, N, N), shown alongside their bit vectors]

Bit Map Index


122/193

125

Bit Map Index

Base Table

Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index

Row ID  N  S  E  W
1       1  0  0  0
2       0  1  0  0
3       0  0  0  1
4       0  0  0  1
5       0  1  0  0
6       0  0  0  1
7       1  0  0  0

Rating Index

Row ID  H  M  L
1       1  0  0
2       0  1  0
3       0  0  1
4       1  0  0
5       0  0  1
6       0  0  1
7       1  0  0

Customers where Region = W And Rating = M
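A conjunctive selection over two indexed columns reduces to one bitwise AND. The sketch below uses the seven customers of the base table; the helper names `bitmap` and `select` are assumptions, each bitmap is packed into a Python integer, and the bitmaps are rebuilt per call only to keep the example short (a real index would store them).

```python
customers = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
region = ["N", "S", "W", "W", "S", "W", "N"]
rating = ["H", "M", "L", "H", "L", "L", "H"]

def bitmap(column, value):
    """Pack the bitmap for `value` into an int; bit i stands for row i."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

def select(region_value, rating_value):
    """Evaluate Region = x AND Rating = y with a single bitwise AND."""
    hits = bitmap(region, region_value) & bitmap(rating, rating_value)
    return [c for i, c in enumerate(customers) if (hits >> i) & 1]

print(select("W", "M"))  # []
print(select("W", "L"))  # ['C3', 'C6']
```

With this base table no customer has Region = W and Rating = M, so the AND of the two bitmaps is all zeros.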

BitMap Indexes


123/193

126

BitMap Indexes

Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time
Significant reduction in space and I/O (30:1)
Adapted for higher cardinality domains as well
Compression (e.g., run-length encoding) exploited
Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3
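Run-length encoding is what makes bitmaps practical for higher cardinality domains, where each bitmap is mostly zeros. A minimal sketch, assuming a bitmap held as a list of 0/1 values; the function names are illustrative and real products use more elaborate word-aligned schemes.

```python
from itertools import groupby

def rle_encode(bits):
    """Collapse a bit list into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit list."""
    return [bit for bit, length in runs for _ in range(length)]

# Hypothetical sparse bitmap: 3,503 bits, only three of them set.
sparse = [0] * 1000 + [1] + [0] * 2000 + [1, 1] + [0] * 500
runs = rle_encode(sparse)
print(runs)  # [(0, 1000), (1, 1), (0, 2000), (1, 2), (0, 500)]
```

Five pairs stand in for 3,503 bits here, and decoding restores the bitmap exactly.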

Join Indexes


124/193

127

Pre-computed joins
A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
e.g., a join index on city dimension of calls fact table correlates for each city the calls (in the calls table) from that city
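The city example can be sketched as a mapping from each dimension key to the positions of the matching fact rows. The table contents, column layout and the name `build_join_index` are hypothetical; the point is only that the join is computed once, ahead of query time.

```python
from collections import defaultdict

# Hypothetical sample data: a city dimension and a calls fact table.
city_dim = {1: "Pune", 2: "Mumbai"}  # city_id -> city name
calls = [                            # (call_id, city_id, minutes)
    (100, 1, 5), (101, 2, 12), (102, 1, 3), (103, 2, 7),
]

def build_join_index(fact_rows, key_position):
    """Map each dimension key to the positions of the matching fact rows."""
    index = defaultdict(list)
    for row_id, row in enumerate(fact_rows):
        index[row[key_position]].append(row_id)
    return dict(index)

join_index = build_join_index(calls, key_position=1)
print(join_index)  # {1: [0, 2], 2: [1, 3]}

# Calls from Pune (city_id 1) are fetched without scanning the fact table.
pune_calls = [calls[r] for r in join_index[1]]
print(pune_calls)  # [(100, 1, 5), (102, 1, 3)]
```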

Join Indexes


125/193

128

Join Indexes

Join indexes can also span multiple dimension tables
e.g., a join index on city and time dimension of calls fact table

Star Join Processing


126/193

129

Use join indexes to join dimension and fact table

[Figure: Calls joined with Time gives C+T, then with Location gives C+T+L, then with Plan gives C+T+L+P]

Optimized Star Join Processing


127/193

130

[Figure labels: Time, Location, Plan, Calls; Apply Selections; Virtual Cross Product of T, L and P]

Bitmapped Join Processing


128/193

131

[Figure: Time, Location and Plan are each paired with Calls, producing bitmaps that are combined with AND]

Intelligent Scan


129/193

132

Piggyback multiple scans of a relation (Redbrick)
piggybacking also done if second scan starts a little while after the first scan

Parallel Query Processing


130/193

133

Three forms of parallelism
  Independent
  Pipelined
  Partitioned and "partition and replicate"
Deterrents to parallelism
  startup
  communication

Parallel Query Processing


131/193

134

Partitioned Data
  Parallel scans
  Yields I/O parallelism
Parallel algorithms for relational operators
  Joins, Aggregates, Sort
Parallel Utilities
  Load, Archive, Update, Parse, Checkpoint, Recovery
Parallel Query Optimization

Pre-computed Aggregates


132/193

135

Keep aggregated data for efficiency (pre-computed queries)
Questions
  Which aggregates to compute?
  How to update aggregates?
  How to use pre-computed aggregates in queries?

Pre-computed Aggregates


133/193

136

Pre computed Aggregates

Aggregated table can be maintained by the
  warehouse server
  middle tier
  client applications
Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
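One answer to "how to update aggregates?" is incremental maintenance: apply each change to the summary instead of recomputing it from the detail rows. The sketch below assumes a hypothetical summary of total sales per city; the class and method names are illustrative. It works because SUM can be adjusted from the changed row alone, which is not true of every aggregate (MIN and MAX may need a rescan after a delete).

```python
from collections import defaultdict

class SalesByCity:
    """A summary table of total sales per city, kept current on every change."""

    def __init__(self):
        self.total = defaultdict(float)

    def apply_insert(self, city, amount):
        self.total[city] += amount  # no rescan of the detail rows

    def apply_delete(self, city, amount):
        self.total[city] -= amount

summary = SalesByCity()
for city, amount in [("Pune", 100.0), ("Mumbai", 250.0), ("Pune", 50.0)]:
    summary.apply_insert(city, amount)
summary.apply_delete("Mumbai", 50.0)
print(dict(summary.total))  # {'Pune': 150.0, 'Mumbai': 200.0}
```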

SQL Extensions


134/193

137

Extended family of aggregate functions
  rank (top 10 customers)
  percentile (top 30% of customers)
  median, mode
Object Relational Systems allow addition of new aggregate functions

SQL Extensions


135/193

138

SQL Extensions

Reporting features
  running total, cumulative totals
Cube operator
  group by on all subsets of a set of attributes (month, city)
  redundant scan and sorting of data can be avoided
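The cube operator can be illustrated with a small sketch that totals a measure for every subset of (month, city) in one pass over the rows, which is where the saving in scans comes from. The sample rows and the function name `cube` are assumptions; this is not how any particular server implements the operator.

```python
from collections import defaultdict
from itertools import combinations

sales = [  # hypothetical rows: (month, city, amount)
    ("Jan", "Pune", 10), ("Jan", "Mumbai", 20),
    ("Feb", "Pune", 30), ("Feb", "Mumbai", 40),
]
dims = ("month", "city")

def cube(rows, dims):
    """Total the measure for every subset of dims in a single pass."""
    subsets = [s for n in range(len(dims) + 1)
               for s in combinations(range(len(dims)), n)]
    totals = {s: defaultdict(int) for s in subsets}
    for row in rows:
        *keys, amount = row
        for s in subsets:
            totals[s][tuple(keys[i] for i in s)] += amount
    return {tuple(dims[i] for i in s): dict(t) for s, t in totals.items()}

result = cube(sales, dims)
print(result[()])          # {(): 100}
print(result[("month",)])  # {('Jan',): 30, ('Feb',): 70}
print(result[("city",)])   # {('Pune',): 40, ('Mumbai',): 60}
```

Two attributes give four groupings: (), (month), (city) and (month, city).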

Red Brick has Extended set of Aggregates


136/193

139

Aggregates

select month, dollars, cume(dollars) as run_dollars,
  weight, cume(weight) as run_weights
from sales, market, product, period t
where year = 1993
  and product like 'Columbian%'
  and city like 'San Fr%'
order by t.perkey
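What `cume(dollars)` returns in the query above is a running total over the ordered rows. A minimal sketch of that behaviour, using hypothetical monthly figures that are assumed to be sorted by period key already:

```python
from itertools import accumulate

# Hypothetical monthly figures, already ordered by period key.
months = ["Jan", "Feb", "Mar", "Apr"]
dollars = [120, 80, 150, 50]

run_dollars = list(accumulate(dollars))  # running total, like cume(dollars)
for month, d, run in zip(months, dollars, run_dollars):
    print(month, d, run)
# Jan 120 120
# Feb 80 200
# Mar 150 350
# Apr 50 400
```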

RISQL (Red Brick Systems) Extensions


137/193

140

Extensions

Aggregates
  CUME
  MOVINGAVG
  MOVINGSUM
  RANK
  TERTILE
  RATIOTOREPORT
Calculating Row Subtotals
  BREAK BY
Sophisticated Date Time Support
  DATEDIFF
Using SubQueries in calculations

Using SubQueries in Calculations


138/193

141

Using SubQueries in Calculations

select product, dollars as jun97_sales,
  (select sum(s1.dollars)
   from market m1, product p1, period t1, sales s1
   where p1.product = product.product
     and t1.year = period.year
     and m1.city = market.city) as total97_sales,
  100 * dollars /
  (select sum(s1.dollars)
   from market m1, product p1, period t1, sales s1
   where p1.product = product.product
     and t1.year = period.year
     and m1.city = market.city) as percent_of_yr
from market, product, period, sales
where year = 1997
  and month = 'June'
  and city like 'Ahmed%'
order by product;

Course Overview


139/193

142

Course Overview

The course: what and how

0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead

Demos and Labs


140/193

II. On-Line Analytical Processing (OLAP)

Making Decision Support Possible

Limitations of SQL


141/193

144

A Freshman in Business needs a Ph.D. in SQL

-- Ralph Kimball

Typical OLAP Queries


142/193

145

Write a multi-table join to compare sales for each product line YTD this year vs. last year.
Repeat the above process to find the top 5 product contributors to margin.
Repeat the above process to find the sales of a product line to new vs. existing customers.
Repeat the above process to find the customers that have had negative sales growth.

What Is OLAP?


143/193

146

Online Analytical Processing - coined by E.F. Codd in 1994 paper contracted by Arbor Software *
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

The OLAP Market


144/193

147

Rapid growth in the enterprise market
  1995: $700 Million
  1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
  10/94: Sybase acquires ExpressWay
  7/95: Oracle acquires Express
  11/95: Informix acquires Metacube
  1/97: Arbor partners up with IBM
  10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical niche to mainstream DBMS category

Strengths of OLAP


145/193

148

It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools

OLAP Is FASMI


146/193

149

Nigel Pendse, Richard Creath - The OLAP Report

Fast Analysis of Shared Multidimensional Information

Multi-dimensional Data


147/193

150

Multi-dimensional Data

"Hey... I sold $100M worth of goods"

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]

Dimensions: Product, Region, Time

Hierarchical summarization paths

Product    Region    Time
Industry   Country   Year
Category   Region    Quarter
Product    City      Month, Week
           Office    Day

Data Cube Lattice


148/193

151

Cube lattice

        ABC
   AB    AC    BC
   A     B     C
        none

Can materialize some groupbys, compute others on demand
Question: which groupbys to materialize?
Question: what indices to create?
Question: how to organize data (chunks, etc.)?
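The lattice matters because any groupby can be computed from a materialized groupby above it, without going back to the detail data. A minimal sketch, assuming a hypothetical materialized groupby AB with SUM as the aggregate; the name `roll_up` and the sample values are illustrative.

```python
from collections import defaultdict

# Hypothetical materialized groupby AB: (a, b) -> total
group_ab = {("a1", "b1"): 5, ("a1", "b2"): 7, ("a2", "b1"): 3}

def roll_up(groupby, keep):
    """Derive a coarser groupby by summing out the key positions not in keep."""
    out = defaultdict(int)
    for key, total in groupby.items():
        out[tuple(key[i] for i in keep)] += total
    return dict(out)

group_a = roll_up(group_ab, keep=[0])
group_b = roll_up(group_ab, keep=[1])
group_none = roll_up(group_ab, keep=[])
print(group_a)     # {('a1',): 12, ('a2',): 3}
print(group_b)     # {('b1',): 8, ('b2',): 7}
print(group_none)  # {(): 15}
```

If only AB is materialized, then A, B and "none" are all answerable from it, while AC, BC and ABC are not; that trade-off is what the materialization question is about.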

Visualizing Neighbors is simpler


149/193

152

[Figure: the same sales data shown as a flat table with columns Month, Store, Sales (rows Apr 1, Apr 2, ... Jun 2) and as a grid of months (Apr through Mar) by stores (1 to 8)]

A Visual Operation: Pivot (Rotate)


150/193

153

[Figure: cross-tab with axes Product (Juice, Cola, Milk, Cream) and Date (3/1, 3/2, 3/3, 3/4), showing the values 10, 47, 30, 12]

Slicing and Dicing


151/193

154

[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and region (India, Far East, Europe); "The Telecomm Slice" is highlighted]

Roll-up and Drill Down


152/193

155

Sales Channel
Region
Country
State
Location Address
Sales Representative

Higher Level of Aggregation

Low-level Details

Nature of OLAP Analysis


153/193

156

Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization

Organizationally Structured Data


154/193

157

Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

marketing

manufacturing

sales

finance

Multidimensional Spreadsheets


155/193

158

Analysts need spreadsheets that support
  pivot tables (cross-tabs)
  drill-down and roll-up
  slice and dice
  sort
  selections
  derived attributes
Popular in retail domain

OLAP - Data Cube


156/193

159

Idea: analysts need to group data in many different ways
  e.g. Sales(region, product, prodtype, prodstyle, date, saleamount)
  saleamount is a measure attribute, rest are dimension attributes
  groupby every subset of the other attributes
  materialize (precompute and store) groupbys to give online response
Also: hierarchies on attributes: date -> weekday, date -> month -> quarter -> year
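The date hierarchy can be sketched as a set of functions that map a date to its key at each level; rolling up is then a groupby on that key. The sample rows and the names `levels` and `total_by` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

sales = [  # hypothetical (date, saleamount) rows
    (date(1997, 1, 15), 100), (date(1997, 2, 3), 50),
    (date(1997, 4, 20), 70), (date(1998, 1, 5), 30),
]

# One key function per level of the hierarchy date -> month -> quarter -> year.
levels = {
    "month":   lambda d: (d.year, d.month),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "year":    lambda d: (d.year,),
}

def total_by(rows, level):
    """Total saleamount at the requested level of the date hierarchy."""
    totals = defaultdict(int)
    for day, amount in rows:
        totals[levels[level](day)] += amount
    return dict(totals)

print(total_by(sales, "quarter"))  # {(1997, 1): 150, (1997, 2): 70, (1998, 1): 30}
print(total_by(sales, "year"))     # {(1997,): 220, (1998,): 30}
```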

SQL Extensions


157/193

160

Front-end tools require
  Extended Family of Aggregate Functions
    rank, median, mode
  Reporting Features
    running totals, cumulative totals
  Results of multiple group by
    total sales by month and total sales by product
  Data Cube

Relational OLAP: 3 Tier DSS


158/193

161

Database Layer: Data Warehouse
  Store atomic data in industry standard RDBMS.

Application Logic Layer: ROLAP Engine
  Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.

Presentation Layer: Decision Support Client
  Obtain multi-dimensional reports from the DSS Client.

MD-OLAP: 2 Tier DSS


159/193

162

Database Layer and Application Logic Layer: MDDB Engine
  Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.

Presentation Layer: Decision Support Client
  Obtain multi-dimensional reports from the DSS Client.

Typical OLAP Problems: Data Explosion


160/193

163

Data Explosion Syndrome

[Figure: Number of Aggregations plotted against Number of Dimensions (4 levels in each dimension)]

Source: Microsoft TechEd 98
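The growth behind the data explosion chart can be reproduced with simple arithmetic. The sketch assumes that an aggregate is defined by choosing one of the 4 levels in each dimension, so d dimensions give 4^d combinations; the slide does not state its exact formula, so the counting rule and the function name are assumptions.

```python
def aggregate_count(dimensions, levels_per_dimension=4):
    """Number of level combinations when one level is chosen per dimension."""
    return levels_per_dimension ** dimensions

for d in range(2, 9):
    print(d, aggregate_count(d))
# 2 16
# 3 64
# 4 256
# 5 1024
# 6 4096
# 7 16384
# 8 65536
```

Each added dimension multiplies the count by the number of levels, which is why pre-calculating every outcome quickly becomes impractical.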

Metadata Repository


161/193

164

Administrative metadata
  source databases and their contents
  gateway descriptions
  warehouse schema, view & derived data definitions
  dimensions, hierarchies
  pre-defined queries and reports
  data mart locations and contents
  data partitions
  data extraction, cleansing, transformation rules, defaults
  data refresh and purging rules
  user profiles, user groups
  security: user authorization, access control

Metadata Repository .. 2


162/193

165

Business data
  business terms and definitions
  ownership of data
  charging policies
Operational metadata
  data lineage: history of migrated data and sequence of transformations applied
  currency of data: active, archived, purged
  monitoring information: warehouse usage statistics, error reports, audit trails.

Recipe for a Successful Warehouse


163/193

Warehouse

For a Successful Warehouse


164/193

167

From day one establish that warehousing is a joint user/builder project
Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
Train the users one step at a time
Consider doing a high level corporate data model in no more than three weeks

From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html



165/193

168

Look closely at the data extracting, cleaning, and loading tools
Implement a user accessible automated directory to information stored in the warehouse
Determine a plan to test the integrity of the data in the warehouse
From the start get warehouse users in the habit of 'testing' complex queries



166/193

169

Coordinate system roll-out with network administration personnel
When in a bind, ask others who have done the same thing for advice
Be on the lookout for small, but strategic, projects
Market and sell your data warehousing systems

Data Warehouse Pitfalls


167/193

170

You are going to spend much time extracting, cleaning, and loading data

Despite best efforts at project management, data warehousing project scope will increase

You are going to find problems with systems feeding the data warehouse

You will find the need to store data not being captured by any existing system

You will need to validate data not being validated by transaction processing systems

Data Warehouse Pitfalls


168/193

171

Some transaction processing systems feeding the warehousing system will not contain detail

Many warehouse end users will be trained and never or seldom apply their training

After end users receive query and report tools, requests for IS written reports may increase

Your warehouse users will develop conflicting business rules

Large scale data warehousing can become an exercise in data homogenizing


169/193

DW and OLAP Research Issues


170/193

173

Data cleaning
  focus on data inconsistencies, not schema differences
  data mining techniques
Physical Design
  design of summary tables, partitions, indexes
  tradeoffs in use of different indexes
Query processing
  selecting appropriate summary tables
  dynamic optimization with feedback
  acid test for query optimization: cost estimation, use of transformations, search strategies
  partitioning query processing between OLAP server and backend server.

DW and OLAP Research Issues .. 2


171/193

174

Warehouse Management
  detecting runaway queries
  resource management
  incremental refresh techniques
  computing summary tables during load
  failure recovery during load and refresh
  process management: scheduling queries, load and refresh
  Query processing, caching
  use of workflow technology for process management

Products, References, Useful Links


172/193

Products, References, Useful Links

Reporting Tools


173/193

176

Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist, ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data

OLAP and Executive Information Systems


174/193

177

Andyne Computing -- Pablo
Arbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander OLAP
Holistic Systems -- Holos
Information Advantage -- AXSYS, WebOLAP
Informix -- Metacube
Microstrategy -- DSS/Agent
Microsoft -- Plato
Oracle -- Express
Pilot -- LightShip
Planning Sciences -- Gentium
Platinum Technology -- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS, OLAP++
Speedware -- Media

Other Warehouse Related Products


175/193

178

Data extract, clean, transform, refresh
  CA-Ingres replicator
  Carleton Passport
  Prism Warehouse Manager
  SAS Access
  Sybase Replication Server
  Platinum Inforefiner, Infopump

Extraction and Transformation Tools


176/193

179

Carleton Corporation -- Passport

Evolutionary Technologies Inc. -- Extract

Informatica -- OpenBridge

Information Builders Inc. -- EDA Copy Manager

Platinum Technology -- InfoRefiner

Prism Solutions -- Prism Warehouse Manager

Red Brick Systems -- DecisionScape Formation

Scrubbing Tools


177/193

180

Apertus -- Enterprise/Integrator
Vality -- IPE
Postal Soft

Warehouse Products


178/193

181

Computer Associates -- CA-Ingres
Hewlett-Packard -- Allbase/SQL
Informix -- Informix, Informix XPS
Microsoft -- SQL Server
Oracle -- Oracle7, Oracle Parallel Server
Red Brick -- Red Brick Warehouse
SAS Institute -- SAS
Software AG -- ADABAS
Sybase -- SQL Server, IQ, MPP

Warehouse Server Products


179/193

182

Oracle 8
Informix
  Online Dynamic Server
  XPS -- Extended Parallel Server
  Universal Server for object relational applications
Sybase
  Adaptive Server 11.5
  Sybase MPP
  Sybase IQ

Warehouse Server Products


180/193

183

Red Brick Warehouse
Tandem Nonstop
IBM
  DB2 MVS
  Universal Server
  DB2 400
Teradata



181/193

184

Connectivity to Sources
  Apertus
  Information Builders EDA/SQL
  Platinum Infohub
  SAS Connect
  IBM Data Joiner
  Oracle Open Connect
  Informix Express Gateway



182/193

185

Query/Reporting Environments
  Brio/Query
  Cognos Impromptu
  Informix Viewpoint
  CA Visual Express
  Business Objects
  Platinum Forest and Trees

4GL's, GUI Builders, and PC Databases


183/193

186

Information Builders -- Focus
Lotus -- Approach
Microsoft -- Access, Visual Basic
MITI -- SQR/Workbench
PowerSoft -- PowerBuilder
SAS Institute -- SAS/AF

Data Mining Products


184/193

187

DataMind -- neurOagent
Information Discovery -- IDIS
SAS Institute -- SAS/Neuronets

Data Warehouse


185/193

188

W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc 1997

Data Warehouse


186/193

189

W.H. Inmon, John A. Zachman, Jonathan G. Geiger, Data Stores Data Warehousing and the Zachman Framework, McGraw Hill Series on Data Warehousing and Data Management, 1997
Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996

OLAP and DSS


187/193

190

Erik Thomsen, OLAP Solutions, John Wiley and Sons 1997
Microsoft TechEd Transparencies from Microsoft TechEd 98
Essbase Product Literature
Oracle Express Product Literature
Microsoft Plato Web Site
Microstrategy Web Site

Data Mining


188/193

191

Michael J.A. Berry and Gordon Linoff, Data Mining Techniques, John Wiley and Sons 1997
Peter Adriaans and Dolf Zantinge, Data Mining, Addison Wesley Longman Ltd. 1996
KDD Conferences

Other Tutorials


189/193

192

Donovan Schneider, Data Warehousing Tutorial, Tutorial at International Conference for Management of Data (SIGMOD 1996) and International Conference on Very Large Data Bases 97
Umeshwar Dayal and Surajit Chaudhuri, Data Warehousing Tutorial at International Conference on Very Large Data Bases 1996
Anand Deshpande and S. Seshadri, Tutorial on Datawarehousing and Data Mining, CSI-97

Useful URLs


190/193

193

Ralph Kimball's home page
http://www.rkimball.com

Larry Greenfield's Data Warehouse Information Center
http://pwp.starnetinc.com/larryg/

Data Warehousing Institute
http://www.dw-institute.com/

OLAP Council
http://www.olapcouncil.com/

Data Mining Motivation


191/193

194

Changes in the Business Environment
  Customers becoming more demanding
  Markets are saturated
Databases today are huge:
  More than 1,000,000 entities/records/rows
  From 10 to 10,000 fields/attributes/variables
  Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge

Data Mining Applications: Retail


192/193

195

Performing basket analysis
  Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
Sales forecasting
  Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
Database marketing
  Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer label clothing or those who attend sales. This information can be used to focus cost effective promotions.
Merchandise planning and allocation
  When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.

Data Mining Applications: Banking


193/193

Card marketing
  By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
Cardholder pricing and profitability
  Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection
  Fraud is enormously costly. By analyzing past transactions that were later determined to be