7/30/2019 Datawarehouse Intro Ch1 Ch2
A producer wants to know...
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Data, Data everywhere, yet ...
I can't find the data I need
  data is scattered over the network
  many versions, subtle differences
I can't get the data I need
  need an expert to get the data
I can't understand the data I found
  available data poorly documented
I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
[Figure labels: Data, Information]
Evolution
60s: Batch reports
  hard to find and analyze information
  inflexible and expensive, reprogram every new request
70s: Terminal-based DSS and EIS (executive information systems)
  still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools
  query tools, spreadsheets, GUIs
  easier to use, but only access operational databases
90s: Data warehousing with integrated OLAP engines and tools
Warehouses are Very Large Databases
[Figure: bar chart of the share of respondents (0% to 35%) by warehouse size (5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB), Initial versus Projected 2Q96. Source: META Group, Inc.]
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Data Warehousing --
It is a process
It is a product
It is an environment
Data Warehousing -- It is a process
Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
A decision support database maintained separately from the organization's operational database
Data Warehouse (2nd Chapter)
A data warehouse is a
  subject-oriented
  integrated
  time-varying
  non-volatile
collection of data that is used primarily in organizational decision making.
Data Warehouse
Subject Oriented
The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Data Warehouse
Integrated
The data warehouse contains data from most or all of an organization's operational systems, and this data is made consistent.
Data Warehouse (2nd Chapter)
Non-volatile Data
Data in the data warehouse is never overwritten or deleted - once committed, the data is static, read-only, and retained for future reporting.
Data Warehouse
Time variant
In a data warehouse environment, the decision makers can view the data across the field of time at whichever level of detail they may wish.
Data Granularity
Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the "extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet."
Cont.
Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.
Example:
Say we have a data mart with a single fact (Sales) and three dimensions (Time, Organization and Product). The fact table contains three metrics (Unit Price, Units Sold and Total Sale Amount). The Time dimension consists of four hierarchical elements (Year, Quarter, Month and Day). The Organization dimension consists of three hierarchical elements (Region, District and Store). The Product dimension consists of two hierarchical elements (Product Family and SKU).
Cont.
As always, the metrics in the Sales fact table must be stored at some intersection of the dimensions (i.e., Time, Organization and Product). Hence, in this data mart, the highest granularity at which we can store Sales metrics is by Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy). Conversely, the lowest granularity to which we can aggregate Sales metrics in this data mart is by Year/Region/Product Family (i.e., the highest level in each dimensional hierarchy). We may also (for a variety of performance reasons) choose to store Sales metrics at some intermediate level of granularity (e.g., by Month/District/SKU).
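The roll-up between these levels can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table name sales_fact, its column names and the sample rows are assumptions made for the example, the Quarter level is left out for brevity, and Unit Price is omitted because it is not an additive metric.

```python
import sqlite3

# Hypothetical fact table stored at the finest grain: Day / Store / SKU.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        year INTEGER, month INTEGER, day INTEGER,
        region TEXT, district TEXT, store TEXT,
        family TEXT, sku TEXT,
        units_sold INTEGER, total_sale_amount REAL
    )
""")
rows = [
    (2019, 7, 29, "West", "D1", "S1", "Snacks", "SKU-1", 2, 10.0),
    (2019, 7, 30, "West", "D1", "S1", "Snacks", "SKU-1", 3, 15.0),
    (2019, 7, 30, "West", "D1", "S2", "Snacks", "SKU-2", 1, 4.0),
    (2019, 8, 1, "East", "D9", "S7", "Drinks", "SKU-3", 5, 20.0),
]
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

def roll_up(levels):
    """Aggregate the additive metrics to the given dimension levels."""
    cols = ", ".join(levels)
    sql = (f"SELECT {cols}, SUM(units_sold), SUM(total_sale_amount) "
           f"FROM sales_fact GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(sql).fetchall()

coarsest = roll_up(["year", "region", "family"])      # Year / Region / Product Family
intermediate = roll_up(["month", "district", "sku"])  # Month / District / SKU
```

Storing only the coarsest result would save space, but the Day/Store/SKU detail could then no longer be recovered from it.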
The information flow mechanism
[Figure labels: Extract, Transform, Load, Operational data store]
Data extraction from source
Identify the source
Finalize the filters for each source
Produce automatic extract file from operational data
Generate intermediate file
Render automated job control services for creating extract files
Reformat and standardize input
Produce common application code for data extraction
Resolve inconsistencies for common data that will be extracted from multiple source systems
Meta data in warehouse
Metadata is one of the important keys to the success of the data warehousing and business intelligence effort.
Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and business intelligence system:
What is Metadata?
Reports
Cubes
Tables (Records, Segments, Entities, etc.)
Columns (Fields, Attributes, Data Elements, etc.)
Keys
Indexes
Metadata is often used to control the handling of data and describes:
Rules
Transformations
Aggregations
Mappings
Data Warehouse Metadata
Data warehousing has specific metadata requirements. Metadata that describes tables typically includes:
Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage
DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
Location
Definition
Notes
Data Warehouse for Decision Support & OLAP
Putting information technology to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
We want to know ...
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
What makes data mining possible?
Advances in the following areas are making data mining deployable:
  data warehousing
  better and more data (i.e., operational, behavioral, and demographic)
  the emergence of easily deployed data mining tools and
  the advent of new data mining techniques.
Why Separate Data Warehouse?
Performance
  Operational databases (op dbs) are designed and tuned for known transactions (txs) and workloads.
  Complex OLAP queries would degrade performance for operational transactions.
  Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
  Missing data: Decision support requires historical data, which op dbs do not typically maintain.
  Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
  Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
What are Operational Systems?
They are OLTP systems
Run mission critical applications
Need to work with stringent performance requirements for routine tasks
Used to run a business!
RDBMS used for OLTP
Database Systems have been used traditionally for OLTP
  clerical data processing tasks
  detailed, up to date data
  structured repetitive tasks
  read/update a few records
  isolation, recovery and integrity are critical
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track Customer Details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control Production | ERP, relational databases, AS/400 | Medium
So, what's different?
Application-Orientation vs. Subject-Orientation
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
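The example query constrains three dimensions at once (time of day, city, month). A minimal sketch with Python's built-in sqlite3 module follows; the calls table, its columns and the sample rows are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call-detail table; names and rows are illustrative only.
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "1998-12-03 10:15:00", 12.0),
        ("Pune", "1998-12-10 16:59:00", 8.0),
        ("Pune", "1998-12-10 18:30:00", 50.0),    # outside 9AM-5PM
        ("Pune", "1998-11-20 11:00:00", 30.0),    # not December
        ("Mumbai", "1998-12-05 12:00:00", 40.0),  # not Pune
    ],
)
# Hours '09' through '16' cover calls started from 9:00 up to 16:59.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) BETWEEN '09' AND '16'
    """
).fetchone()[0]
```

On an OLTP schema tuned for single-call lookups, a query like this would scan a large part of the table, which is one reason the slide calls for special data organization.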
OLTP vs Data Warehouse
OLTP | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated Data | Integrated Data
Repetitive access | Ad-hoc access
Clerical User | Knowledge User (Manager)
OLTP vs Data Warehouse
OLTP | Data Warehouse
Performance Sensitive | Performance relaxed
Few Records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update Access | Mostly Read (Batch Update)
No data redundancy | Redundancy present
Database Size 100 MB - 100 GB | Database Size 100 GB - few terabytes
OLTP vs Data Warehouse
OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
To summarize ...
OLTP Systems are used to run a business
The Data Warehouse helps to optimize the business
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
Myths surrounding OLAP Servers and Data Marts
Data marts and OLAP servers are departmental solutions supporting a handful of users
Million dollar massively parallel hardware is needed to deliver fast time for complex queries
OLAP servers require massive and unwieldy indices
Complex OLAP queries clog the network with data
Data warehouses must be at least 100 GB to be effective
Wal*Mart Case Study
Founded by Sam Walton
One of the largest Super Market Chains in the US
Wal*Mart: 2000+ Retail Stores
SAM's Clubs: 100+ Wholesalers Stores
This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database Seminar
Old Retail Paradigm
Wal*Mart
  Inventory Management
  Merchandise Accounts Payable
  Purchasing
  Supplier Promotions: National, Region, Store Level
Suppliers
  Accept Orders
  Promote Products
  Provide special Incentives
  Monitor and Track The Incentives
  Bill and Collect Receivables
  Estimate Retailer Demands
New (Just-In-Time) Retail Paradigm
No more deals
Shelf-Pass Through (POS Application)
  One Unit Price
  Suppliers paid once a week on ACTUAL items sold
  Wal*Mart Manager
  Daily Inventory Restock
  Suppliers (sometimes Same Day) ship to Wal*Mart
Warehouse-Pass Through
  Stock some Large Items
  Delivery may come from supplier
  Distribution Center
  Supplier's merchandise unloaded directly onto Wal*Mart Trucks
Wal*Mart System
NCR 5100M 96 Nodes; 24 TB Raw Disk; 700-1000 Pentium CPUs
Number of Rows: > 5 Billion
Historical Data: 65 weeks (5 Quarters)
New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million +
Number of Users: Thousands
Number of Queries: 60,000 per week
Course Overview
0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead
Demos and Labs
I. Data Warehouses: Architecture, Design & Construction
DW Architecture
Loading, refreshing
Structuring/Modeling
DWs and Data Marts
Query Processing
Data Warehouse Architecture
[Figure: architecture diagram with the labels Relational Databases, Legacy Data, Purchased Data, ERP Systems (sources); Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]
Characteristics of data warehouse architecture
Different objectives and scope (analytical)
Data content (read only)
Complex analysis and quick response
Flexible and dynamic
Metadata driven
Goal
Architecture of the data warehouse becomes the framework for product selection
It is a collection of documents, plans, models, drawings, and specifications
Architecture has to be driven by the business
DW architecture
It is a way of representing the overall structure of the data, processing and presentation that exists for end-user computing within the organization
It has a number of interconnected components
Components
Operational database layer
Information access layer
Data access layer
Data directory layer
Process management layer
Application messaging layer
Data warehouse (physical) layer
Data staging layer
Components of the Warehouse
Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Data Mining tools
ETL (extract, transform, load)
Loading the Warehouse
Cleaning the data before it is loaded
Source Data
Typically host based, legacy applications
  Customized applications, COBOL, 3GL, 4GL
Point of Contact Devices
  POS (point of sale), ATM, Call switches (a call switch manages inbound telephone calls)
[Figure labels: Operational/Source Data: Sequential, Legacy, Relational, External]
External Sources
Nielsen's (Nielsen monitors and measures more than 90% of global Internet activity and provides insights about the online universe, including audiences and advertising)
Acxiom (provides a range of information services and products geared towards enterprise data management and retrieval)
CMIE (Centre for Monitoring Indian Economy)
Vendors, Partners
Data Quality - The Reality
Tempting to think creating a data warehouse is simply extracting operational data and entering it into a data warehouse
Nothing could be farther from the truth
Warehouse data comes from disparate, questionable sources
Data Quality - The Reality
Legacy systems no longer documented
Outside sources with questionable quality procedures
Production systems with no built-in integrity checks and no integration
  Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
  "And get it done quickly, we do not have time to worry about corporate standards..."
Data Integration Across Sources
[Figure: four source applications (Trust, Credit card, Savings, Loans) annotated with the integration problems below]
Same data, different name
Different data, same name
Data found here, nowhere else
Different keys, same data
Data Transformation Example
Field names:
  appl A - balance
  appl B - bal
  appl C - currbal
  appl D - balcurr
Units of measure:
  appl A - pipeline - cm
  appl B - pipeline - in
  appl C - pipeline - feet
  appl D - pipeline - yds
Encodings:
  appl A - m,f
  appl B - 1,0
  appl C - x,y
  appl D - male, female
All are transformed to one representation in the Data Warehouse
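A minimal sketch of such a transformation in Python. The target representation (a "balance" field, centimetres, "M"/"F"), the function name to_warehouse, and the assumption that each encoding pair is listed in male, female order are choices made for this example and are not stated on the slide.

```python
# Per-application conventions taken from the example above.
# Assumption: each encoding pair is listed as (male, female).
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

def to_warehouse(app, record):
    """Map one source record onto a single common representation."""
    return {
        "balance": record[BALANCE_FIELD[app]],
        "gender": GENDER_MAP[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[app], 2),
    }

row_b = to_warehouse("B", {"bal": 250.0, "gender": 1, "pipeline": 10})
row_d = to_warehouse("D", {"balcurr": 99.5, "gender": "Female", "pipeline": 2})
```

Keeping the per-application rules in lookup tables means a new source system only adds entries; the mapping function itself stays unchanged.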
Data Integrity Problems
Same person, different spellings
  Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
  Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
  mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
  manual entry leads to mistakes
  "in case of a problem use 9999999"
Data Transformation Terms
Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating
Data Transformation Terms
Extracting
  Capture of data from operational source in "as is" status
  Sources for data generally in legacy mainframes in VSAM (virtual storage access method), IMS (information management system), IDMS (integrated DBMS), DB2; more data today in relational databases on Unix
Conditioning
  The conversion of data types from the source to the target data store (warehouse) --
Data Transformation Terms
Householding
  Identifying all members of a household (living at the same address)
  Ensures only one mail is sent to a household
  Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.
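A small sketch of householding and of the savings arithmetic above, in Python. The sample customers and the crude address normalization (lower-casing, dropping commas and extra spaces) are assumptions for illustration; real householding needs much more careful address matching.

```python
from collections import defaultdict

def households(customers):
    """Group customer names by a crudely normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().replace(",", " ").split())
        groups[key].append(name)
    return dict(groups)

# Hypothetical sample data.
customers = [
    ("A. Rao", "12 MG Road, Pune"),
    ("S. Rao", "12 mg road,  pune"),
    ("K. Shah", "7 Hill Street, Mumbai"),
]
groups = households(customers)
mailings_saved = len(customers) - len(groups)  # one mail per household

# The slide's arithmetic: 1 lakh (100,000) catalogues at Rs. 50 each.
total_cost = 100_000 * 50          # Rs. 50 lakhs
savings = total_cost * 2 // 100    # 2% of the cost = Rs. 1 lakh
```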
Data Transformation Terms
Enrichment
  Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A.C. Nielsen, CMIE, IMRA (provides an extensive digest of media, polls, and significant interviews and events) etc...
Scoring
  Computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI (American telecom company), chance that a customer is likely to buy a new product
Loads
After extracting, scrubbing, cleaning, validating etc. need to load the data into the warehouse
Issues
  huge volumes of data to be loaded
  small time window available when warehouse can be taken off line (usually nights)
  when to build index and summary tables
  allow system administrators to monitor, cancel, resume, change load rates
  recover gracefully -- restart after failure from where you were and without loss of data integrity
Load Techniques
Use SQL to append or insert new data
  record at a time interface
  will lead to random disk I/Os
Use batch load utility
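The contrast between the two techniques can be sketched with Python's built-in sqlite3 module. This is only an analogy: a single-transaction executemany stands in for a vendor's batch load utility, and the sales table and its rows are made up for the example.

```python
import sqlite3

def load_row_at_a_time(conn, rows):
    """Insert and commit each record separately (record-at-a-time interface)."""
    for row in rows:
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
        conn.commit()

def load_in_batch(conn, rows):
    """Insert all records in one transaction, standing in for a batch load utility."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

rows = [(i, i * 1.5) for i in range(1000)]
counts = []
for loader in (load_row_at_a_time, load_in_batch):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    loader(conn, rows)
    counts.append(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
    conn.close()
```

Both loaders end with the same rows; the difference is how many separate commits are issued along the way, which matters once the target is a disk-based warehouse rather than an in-memory database.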
Load Taxonomy
Incremental versus Full loads
Online versus Offline loads
Refresh
Propagate updates on source data to the warehouse
Issues:
  when to refresh
  how to refresh -- refresh techniques
When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
Refresh Techniques
Full Extract from base tables
  read entire source table: too expensive
  maybe the only choice for legacy systems
How To Detect Changes
Create a snapshot log table to record ids of updated rows of source data and timestamp
Detect changes by:
  Defining after row triggers to update snapshot log when source table changes
  Using regular transaction log to detect changes to source data
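The trigger-based variant can be sketched with Python's built-in sqlite3 module. The table names account and snapshot_log, the trigger name, and the sample data are assumptions for illustration; only UPDATE is covered here, and a fuller version would also log inserts and deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE snapshot_log (row_id INTEGER, changed_at TEXT);

    CREATE TRIGGER account_after_update AFTER UPDATE ON account
    FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
    END;

    INSERT INTO account VALUES (1, 100.0), (2, 200.0), (3, 300.0);
""")
conn.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")

# Refresh step: only rows named in the snapshot log need to be propagated.
changed = conn.execute(
    "SELECT a.id, a.balance FROM account a "
    "WHERE a.id IN (SELECT row_id FROM snapshot_log) ORDER BY a.id"
).fetchall()
```

After a refresh has propagated the changed rows, the snapshot log would be cleared so that the next refresh starts from an empty log.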
Data Extraction and Cleansing
Extract data from existing operational and legacy data
Issues:
  Sources of data for the warehouse
  Data quality at the sources
  Merging different data sources
  Data Transformation
  How to propagate updates (on the sources) to the warehouse
  Terabytes of data to be loaded
Scrubbing Data
Sophisticated transformation tools
Used for cleaning the quality of data
Clean data is vital for the success of the warehouse
Example
  Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
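A rough sketch of how such spelling variants might be matched, using difflib from the Python standard library. The 0.8 similarity threshold and the token-by-token comparison are arbitrary assumptions for this example; commercial scrubbing tools use far richer rules.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Return True when two names are close enough to be treated as the same."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

canonical = "Seshadri"
variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.",
            "Srinivasan Seshadri", "Sharma"]

# Compare each whitespace-separated token so that initials and first names
# do not hide the surname.
matches = [v for v in variants
           if any(similar(tok, canonical) for tok in v.replace(".", "").split())]
```

With these inputs the five spellings from the slide match and the unrelated name "Sharma" does not.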
Scrubbing Tools
Apertus -- Enterprise/Integrator
Vality -- IPE
Postal Soft
Structuring/Modeling Issues
Data -- Heart of the Data Warehouse
Heart of the data warehouse is the data itself!
Single version of the truth
Corporate memory
Data is organized in a way that represents business -- subject orientation
Data Warehouse Structure
Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables
Data Warehouse Structure
base customer (1985-87)
  custid, from date, to date, name, phone, dob
base customer (1988-90)
  custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89)
  custid, activity date, amount, clerk id, order no
customer activity detail (1990-91)
  custid, activity date, amount, line item no, order no
Time is part of key of each table
Data Granularity in Warehouse
Summarized data stored
  reduce storage costs
  reduce cpu usage
  increases performance since smaller number of records to be processed
  design around traditional high level reporting needs
  tradeoff with volume of data to be stored and detailed usage of data
Granularity in Warehouse
Cannot answer some questions with summarized data
  Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
Detailed data too voluminous
Granularity in Warehouse
Tradeoff is to have dual level of granularity
  Store summary data on disks
    95% of DSS processing done against this data
  Store detail on tapes
    5% of DSS processing against this data
Vertical Partitioning
Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
Frequently accessed: Acct. No, Balance (smaller table and so less I/O)
Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
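The split shown above can be sketched with Python's built-in sqlite3 module. The table names account_hot and account_cold and the sample rows are made up for this example; the column split follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        acct_no INTEGER PRIMARY KEY, name TEXT, balance REAL,
        date_opened TEXT, interest_rate REAL, address TEXT
    );
    INSERT INTO account VALUES
        (1, 'Anand', 500.0, '1995-01-10', 4.5, 'Pune'),
        (2, 'Seshadri', 750.0, '1996-03-22', 4.0, 'Mumbai');

    -- Frequently accessed columns go into a narrow table.
    CREATE TABLE account_hot AS SELECT acct_no, balance FROM account;
    -- Rarely accessed columns go into a second table with the same key.
    CREATE TABLE account_cold AS
        SELECT acct_no, name, date_opened, interest_rate, address FROM account;
""")

# The common query touches only the narrow table.
balance = conn.execute(
    "SELECT balance FROM account_hot WHERE acct_no = 2").fetchone()[0]

# The full row is still available by joining on the shared key.
rejoined = conn.execute(
    "SELECT h.acct_no, c.name, h.balance FROM account_hot h "
    "JOIN account_cold c ON c.acct_no = h.acct_no ORDER BY h.acct_no"
).fetchall()
```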
Derived Data
Introduction of derived (calculated) data may often help
Have seen this in the context of dual levels of granularity
Can keep auxiliary views and indexes to speed up query processing
Schema Design
Database organization
  must look like business
  must be recognizable by business user
  approachable by business user
  must be simple
Schema Types
  Star Schema
  Fact Constellation Schema
  Snowflake schema
Dimension Tables
Dimension tables
  Define business in terms already familiar to users
  Wide rows with lots of descriptive text
  Small tables (about a million rows)
  Joined to fact table by a foreign key
  heavily indexed
  typical dimensions
    time periods, geographic region (markets, cities), products, customers, salesperson, etc.
In data warehousing, a dimension table is one of the set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.
Fact Table
Central table
  mostly raw numeric items
  narrow rows, a few columns at most
  large number of rows (millions to a billion)
  access via dimensions
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed.
Star Schema
A single fact table and for each dimension one dimension table
Does not capture hierarchies directly
[Figure: fact table (date, custno, prodno, cityname, ...) in the centre, surrounded by the dimension tables time, prod, cust and city]
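A minimal star schema along these lines, sketched with Python's built-in sqlite3 module. The fact-table keys follow the figure (date, custno, prodno, cityname); the table names, the extra attribute columns and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, name TEXT, family TEXT);
    CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE sales_fact (
        date TEXT REFERENCES time_dim(date),
        custno INTEGER REFERENCES cust_dim(custno),
        prodno INTEGER REFERENCES prod_dim(prodno),
        cityname TEXT REFERENCES city_dim(cityname),
        amount REAL
    );
    INSERT INTO time_dim VALUES ('2019-07-30', '2019-07', 2019);
    INSERT INTO cust_dim VALUES (1, 'Anand');
    INSERT INTO prod_dim VALUES (10, 'Chips', 'Snacks'), (11, 'Cola', 'Drinks');
    INSERT INTO city_dim VALUES ('Pune', 'West'), ('Delhi', 'North');
    INSERT INTO sales_fact VALUES
        ('2019-07-30', 1, 10, 'Pune', 20.0),
        ('2019-07-30', 1, 11, 'Pune', 15.0),
        ('2019-07-30', 1, 10, 'Delhi', 5.0);
""")

# Constrain and group by a dimension attribute, then sum the additive fact.
by_region = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM sales_fact f JOIN city_dim c ON c.cityname = f.cityname
    GROUP BY c.region ORDER BY c.region
""").fetchall()
```

Note that region is stored as a plain column of the city dimension: the city-to-region hierarchy is flattened into one table rather than modelled directly, which is the sense in which a star schema does not capture hierarchies.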
Snowflake schema
Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage
[Figure: fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables time, prod, cust and city, with city linked to a further region table]
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape.
Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension being represented by a single table.
When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables and not the fact tables.
Fact Constellation
Fact Constellation
  Multiple fact tables that share many dimension tables
  Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout sharing the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]
De-normalization
Normalization in a data warehouse may lead to lots of small tables
Can lead to excessive I/Os since many tables have to be accessed
De-normalization is the answer, especially since updates are rare
Creating Arrays
Many times each occurrence of a sequence of data is in a different physical location
Beneficial to collect all occurrences together and store as an array in a single row
Makes sense only if there are a stable number of occurrences which are accessed together
In a data warehouse, such situations arise naturally due to time based orientation
  can create an array by month
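A small Python sketch of the idea: monthly records that would otherwise be stored separately are collected into one fixed-length array per account. The record layout (account, month, amount), the function name to_arrays and the sample values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical source rows: (account, month 1-12, amount), one record per month.
monthly_rows = [("A1", 1, 100.0), ("A1", 2, 120.0), ("A2", 1, 80.0), ("A1", 12, 90.0)]

def to_arrays(rows, slots=12):
    """Collect each account's monthly amounts into one fixed-length array."""
    arrays = defaultdict(lambda: [0.0] * slots)
    for account, month, amount in rows:
        arrays[account][month - 1] = amount
    return dict(arrays)

by_account = to_arrays(monthly_rows)
```

The fixed length of twelve reflects the slide's condition that the number of occurrences be stable; months without data simply keep the default value.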
Selective Redundancy
Description of an item can be stored redundantly with the order table -- most often the item description is also accessed with the order table
Updates have to be handled carefully, since every redundant copy must be kept consistent
Partitioning
Breaking data into several physical units that can be handled separately
Not a question of whether to do it in data warehouses but how to do it
Granularity and partitioning are key to effective implementation of a warehouse
Why Partition?
Flexibility in managing data
Smaller physical units allow
  easy restructuring
  free indexing
  sequential scans if needed
  easy reorganization
  easy recovery
  easy monitoring
Criterion for Partitioning
Typically partitioned by
  date
  line of business
  geography
  organizational unit
  any combination of above
Where to Partition?
Application level or DBMS level
Makes sense to partition at application level
  Allows different definition for each year
    Important since warehouse spans many years and as business evolves definition changes
  Allows data to be moved between processing complexes easily
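A rough sketch of application-level partitioning by date is given below. The record fields and function names are hypothetical, and the extra "channel" field on the 1997 record is only there to illustrate that each yearly partition can carry its own definition.

```python
from collections import defaultdict

# Hypothetical sales records; the 1997 record has an extra "channel"
# field, standing in for a definition that changed as the business evolved.
records = [
    {"date": "1996-03-14", "amount": 10.0},
    {"date": "1996-11-02", "amount": 25.0},
    {"date": "1997-01-20", "amount": 40.0, "channel": "retail"},
]

def partition_by_year(records):
    """Route each record to a partition keyed by the year of its date."""
    partitions = defaultdict(list)
    for rec in records:
        year = int(rec["date"][:4])
        partitions[year].append(rec)
    return dict(partitions)

def total_for_year(partitions, year):
    """Scan only the partition for the requested year."""
    return sum(rec["amount"] for rec in partitions.get(year, []))

partitions = partition_by_year(records)
print(sorted(partitions))
print(total_for_year(partitions, 1996))
```

Because the routing logic lives in the application, a whole year's partition can be moved, archived or restructured without touching the others.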
Data Warehouse vs. Data Marts
What comes first
From the Data Warehouse to Data Marts
[Figure labels: Individually Structured; Departmentally Structured; Data Warehouse, Organizationally Structured; a scale from Less to More (History, Normalized, Detailed); a scale from Data to Information]
Data Warehouse and Data Marts
OLAP Data Mart
  Lightly summarized
  Departmentally structured
Data Warehouse Data
  Organizationally structured
  Atomic
  Detailed
Characteristics of the Departmental Data Mart
OLAP
Small
Flexible
Customized by Department
Source is departmentally structured data warehouse
Techniques for Creating Departmental Data Mart
OLAP
Subset
Summarized
Superset
Indexed
Arrayed
[Figure: departmental data marts for Sales, Mktg. and Finance]
Data Mart Centric
[Figure: architecture diagram with Data Sources, Data Marts and Data Warehouse]
Problems with Data Mart Centric Solution
If you end up creating multiple warehouses, integrating them is a problem
True Warehouse
[Figure: architecture diagram with Data Sources, Data Warehouse and Data Marts]
Query Processing (end)
Indexing
Pre-computed views/aggregates
SQL extensions
Indexing Techniques
Exploiting indexes to reduce scanning of data is of crucial importance
Bitmap Indexes
Join Indexes
Other Issues
  Text indexing
  Parallelizing and sequencing of index builds and incremental updates
Indexing Techniques
Bitmap index:
  A collection of bitmaps -- one for each distinct value of the column
  Each bitmap has N bits where N is the number of rows in the table
  A bit corresponding to a value v for a row r is set if and only if r has the value v for the indexed attribute
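The definition above can be sketched in a few lines. This is an illustrative in-memory version only; the function name `build_bitmap_index` and the sample column are assumptions, and real products store bitmaps far more compactly.

```python
def build_bitmap_index(column):
    """Return {value: bitmap}, where each bitmap is a list of N bits."""
    n = len(column)
    index = {}
    for row, value in enumerate(column):
        bitmap = index.setdefault(value, [0] * n)
        bitmap[row] = 1   # set the bit for this row in the bitmap of its value
    return index

# Hypothetical sample column with three distinct values.
region = ["N", "S", "W", "W", "S", "W", "N"]
index = build_bitmap_index(region)
for value, bitmap in sorted(index.items()):
    print(value, bitmap)
```

Each row sets exactly one bit across all the bitmaps, which is a quick sanity check on any bitmap index.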
BitMap Indexes
An alternative representation of RID-list
Specially advantageous for low-cardinality domains
Represent each row of a table by a bit and the table as a bit vector
There is a distinct bit vector Bv for each value v for the domain
Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits (about 12.5 MB each)
Bitmap Index
Customer Query: select * from customer where gender = 'F' and vote = 'Y'
[Figure: customer table with columns gender (M, F, F, F, F, M) and vote (Y, Y, Y, N, N, N); the bitmaps for gender = F and vote = Y are ANDed to find the matching rows]
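The query on this slide can be answered with a single bitwise AND, as the sketch below suggests. The column values are the ones recoverable from the slide's figure; pairing them row by row in the listed order is an assumption, and the helper names are hypothetical. Each bitmap is packed into a Python integer, with bit i standing for row i.

```python
# Column values as listed on the slide; row-by-row pairing is assumed.
gender = ["M", "F", "F", "F", "F", "M"]
vote = ["Y", "Y", "Y", "N", "N", "N"]

def bitmap(column, value):
    """Pack the rows where column == value into an int (bit i = row i)."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits

def rows_set(bits):
    """List the row numbers whose bit is set."""
    return [row for row in range(bits.bit_length()) if (bits >> row) & 1]

female = bitmap(gender, "F")
voted = bitmap(vote, "Y")
matches = rows_set(female & voted)   # one bitwise AND answers the query
print(matches)
```

Only the rows in `matches` then need to be fetched from the table.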
Bit Map Index
Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index
Row ID  N  S  E  W
1       1  0  0  0
2       0  1  0  0
3       0  0  0  1
4       0  0  0  1
5       0  1  0  0
6       0  0  0  1
7       1  0  0  0

Rating Index
Row ID  H  M  L
1       1  0  0
2       0  1  0
3       0  0  1
4       1  0  0
5       0  0  1
6       0  0  1
7       1  0  0

Customers where Region = W And Rating = M
BitMap Indexes
Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time
Significant reduction in space and I/O (30:1)
Adapted for higher cardinality domains as well.
Compression (e.g., run-length encoding) exploited
Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3
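Run-length encoding is what makes bitmaps practical for higher-cardinality columns, where most bits are zero. The sketch below shows the basic idea only; the function names and the sample bitmap are hypothetical, and the products listed above use their own compressed formats.

```python
from itertools import groupby

def rle_encode(bits):
    """Encode a bit list as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the bit list."""
    return [bit for bit, length in runs for _ in range(length)]

# Hypothetical sparse bitmap: 1,000 rows, only two of them set.
bits = [0] * 1000
bits[10] = bits[500] = 1

runs = rle_encode(bits)
print(runs)
```

Here a thousand bits collapse into five runs, and decoding restores the original bitmap exactly.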
Join Indexes
Pre-computed joins
A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  e.g., a join index on city dimension of calls fact table
  correlates for each city the calls (in the calls table) from that city
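The city/calls example can be sketched as a mapping from each dimension key to the positions of the matching fact rows. The table contents, the `city_id` attribute and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dimension and fact rows; city_id is the common attribute.
cities = {1: "Pune", 2: "Mumbai"}
calls = [
    {"call_id": 100, "city_id": 1, "minutes": 3},
    {"call_id": 101, "city_id": 2, "minutes": 7},
    {"call_id": 102, "city_id": 1, "minutes": 5},
]

def build_join_index(fact_rows, key):
    """Map each dimension key to the positions of the matching fact rows."""
    index = defaultdict(list)
    for pos, row in enumerate(fact_rows):
        index[row[key]].append(pos)
    return dict(index)

join_index = build_join_index(calls, "city_id")

def calls_from(city_id):
    """Fetch a city's calls through the index instead of scanning all calls."""
    return [calls[pos] for pos in join_index.get(city_id, [])]

print(join_index)
print([c["call_id"] for c in calls_from(1)])
```

The join is paid for once, when the index is built, rather than on every query.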
Join Indexes
Join indexes can also span multiple dimension tables
  e.g., a join index on city and time dimension of calls fact table
Star Join Processing
Use join indexes to join dimension and fact table
[Figure: Calls is joined with Time (C+T), then with Location (C+T+L), then with Plan (C+T+L+P)]
Optimized Star Join Processing
[Figure: selections are applied to Time, Location and Plan; the virtual cross product of T, L and P is then joined with Calls]
Bitmapped Join Processing
[Figure: each of Time, Location and Plan is joined with Calls to produce a bitmap; the resulting bitmaps are ANDed]
Intelligent Scan
Piggyback multiple scans of a relation (Redbrick)
  piggybacking also done if second scan starts a little while after the first scan
Parallel Query Processing
Three forms of parallelism
  Independent
  Pipelined
  Partitioned and partition and replicate
Deterrents to parallelism
  startup
  communication
Parallel Query Processing
Partitioned Data
  Parallel scans
  Yields I/O parallelism
Parallel algorithms for relational operators
  Joins, Aggregates, Sort
Parallel Utilities
  Load, Archive, Update, Parse, Checkpoint, Recovery
Parallel Query Optimization
Pre-computed Aggregates
Keep aggregated data for efficiency (pre-computed queries)
Questions
  Which aggregates to compute?
  How to update aggregates?
  How to use pre-computed aggregates in queries?
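Two of the questions above, how to update an aggregate and how to use it in a query, can be illustrated with a small sketch. The class name, the month keys and the figures are hypothetical, and the incremental update shown works for SUM but not for every aggregate function.

```python
from collections import defaultdict

class SalesByMonth:
    """A pre-computed aggregate: total sales per month, kept up to date."""

    def __init__(self):
        self.totals = defaultdict(float)

    def load(self, rows):
        """Incremental update: fold newly loaded rows into stored totals."""
        for month, amount in rows:
            self.totals[month] += amount

    def total(self, month):
        """Answer a query from the aggregate without touching detail rows."""
        return self.totals.get(month, 0.0)

agg = SalesByMonth()
agg.load([("1997-01", 100.0), ("1997-01", 50.0), ("1997-02", 70.0)])
agg.load([("1997-02", 30.0)])   # a later refresh
print(agg.total("1997-01"), agg.total("1997-02"))
```

Which aggregates are worth keeping this way remains a design decision that depends on the query workload.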
Pre-computed Aggregates
Aggregated table can be maintained by the
  warehouse server
  middle tier
  client applications
Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
SQL Extensions
Extended family of aggregate functions
  rank (top 10 customers)
  percentile (top 30% of customers)
  median, mode
  Object Relational Systems allow addition of new aggregate functions
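To make these aggregate functions concrete, here is a small sketch of what rank, percentile, median and mode compute. It is written in Python rather than in any vendor's SQL dialect, and the customer names, revenue figures and helper names are hypothetical.

```python
import statistics

# Hypothetical customer revenue figures.
revenue = {"C1": 500, "C2": 120, "C3": 900, "C4": 120, "C5": 300}

def top_n(values, n):
    """Rank customers by revenue, highest first, and keep the first n."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

def top_fraction(values, fraction):
    """Keep the top fraction of customers (e.g. 0.4 for the top 40%)."""
    count = max(1, int(len(values) * fraction))
    return top_n(values, count)

print(top_n(revenue, 2))
print(top_fraction(revenue, 0.4))
print(statistics.median(revenue.values()))
print(statistics.mode(revenue.values()))
```

All four need the data ordered or counted as a whole, which is why they are awkward to express with plain SUM and COUNT.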
SQL Extensions
Reporting features
  running total, cumulative totals
Cube operator
  group by on all subsets of a set of attributes (month, city)
  redundant scan and sorting of data can be avoided
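The cube operator can be sketched as a group-by over every subset of the attributes, computed in one pass over the data. The sample rows and the function name `cube` are hypothetical, and this is an illustration of the idea rather than how any particular server implements it.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (month, city, sales).
rows = [
    ("Jan", "Pune", 10),
    ("Jan", "Mumbai", 20),
    ("Feb", "Pune", 5),
]
DIMS = ("month", "city")

def cube(rows, dims):
    """Aggregate sales for every subset of dims in a single pass over rows."""
    subsets = [s for r in range(len(dims) + 1)
               for s in combinations(range(len(dims)), r)]
    result = {s: defaultdict(int) for s in subsets}
    for *keys, sales in rows:
        for s in subsets:
            result[s][tuple(keys[i] for i in s)] += sales
    return {tuple(dims[i] for i in s): dict(groups)
            for s, groups in result.items()}

totals = cube(rows, DIMS)
for group_by, groups in totals.items():
    print(group_by, groups)
```

With two attributes there are four group-bys: (month, city), (month), (city) and the grand total, all produced from one scan.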
Red Brick has Extended set of Aggregates
select month, dollars, cume(dollars) as run_dollars, weight, cume(weight) as run_weights
from sales, market, product, period t
where year = 1993
and product like 'Columbian%'
and city like 'San Fr%'
order by t.perkey
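As the alias run_dollars suggests, cume produces a running total in the order of the result rows. A minimal sketch of that computation, assuming a plain running sum and using hypothetical monthly figures:

```python
from itertools import accumulate

# Hypothetical monthly figures, already ordered by period key.
months = ["Jan", "Feb", "Mar", "Apr"]
dollars = [100, 250, 50, 300]

run_dollars = list(accumulate(dollars))   # running (cumulative) total

for month, d, run in zip(months, dollars, run_dollars):
    print(month, d, run)
```

The last running value equals the total for the whole period.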
RISQL (Red Brick Systems) Extensions
Aggregates
  CUME
  MOVINGAVG
  MOVINGSUM
  RANK
  TERTILE
  RATIOTOREPORT
Calculating Row Subtotals
  BREAK BY
Sophisticated Date Time Support
  DATEDIFF
Using SubQueries in calculations
Using SubQueries in Calculations
select product, dollars as jun97_sales,
  (select sum(s1.dollars)
   from market m1, product p1, period t1, sales s1
   where p1.product = product.product
   and t1.year = period.year
   and m1.city = market.city) as total97_sales,
  100 * dollars /
  (select sum(s1.dollars)
   from market m1, product p1, period t1, sales s1
   where p1.product = product.product
   and t1.year = period.year
   and m1.city = market.city) as percent_of_yr
from market, product, period, sales
where year = 1997
and month = 'June'
and city like 'Ahmed%'
order by product;
Course Overview
The course: what and how
0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead
Demos and Labs
II. On-Line Analytical Processing (OLAP)
Making Decision Support Possible
Limitations of SQL
"A Freshman in Business needs a Ph.D. in SQL"
-- Ralph Kimball
Typical OLAP Queries
Write a multi-table join to compare sales for each product line YTD this year vs. last year.
Repeat the above process to find the top 5 product contributors to margin.
Repeat the above process to find the sales of a product line to new vs. existing customers.
Repeat the above process to find the customers that have had negative sales growth.
What Is OLAP?
Online Analytical Processing - coined by E.F. Codd in a 1994 paper contracted by Arbor Software *
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
The OLAP Market
Rapid growth in the enterprise market
  1995: $700 Million
  1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
  10/94: Sybase acquires ExpressWay
  7/95: Oracle acquires Express
  11/95: Informix acquires Metacube
  1/97: Arbor partners up with IBM
  10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical niche to mainstream DBMS category
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
Nigel Pendse, Richard Creeth - The OLAP Report
Multi-dimensional Data
"Hey, I sold $100M worth of goods"
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]
Dimensions: Product, Region, Time
Hierarchical summarization paths
  Product: Industry -> Category -> Product
  Region: Country -> Region -> City -> Office
  Time: Year -> Quarter -> Month or Week -> Day
Data Cube Lattice
Cube lattice
  ABC
  AB AC BC
  A B C
  none
Can materialize some groupbys, compute others on demand
Question: which groupbys to materialize?
Question: what indices to create
Question: how to organize data (chunks, etc)
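"Compute others on demand" relies on the lattice: a groupby lower in the lattice can be derived from any materialized groupby above it. The sketch below derives A and the grand total from a materialized AB. The data and names are hypothetical, and the approach assumes a distributive aggregate such as SUM.

```python
from collections import defaultdict

# Hypothetical materialized groupby on (A, B): maps (a, b) -> sum of measure.
ab = {
    ("a1", "b1"): 10,
    ("a1", "b2"): 5,
    ("a2", "b1"): 7,
}

def derive(groupby, keep):
    """Derive a coarser groupby by keeping only the key positions in keep."""
    out = defaultdict(int)
    for key, value in groupby.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

a = derive(ab, keep=(0,))            # groupby A, computed from AB on demand
grand_total = derive(ab, keep=())    # the "none" node of the lattice
print(a, grand_total)
```

The cost of deriving from AB instead of from the base data is what drives the choice of which groupbys to materialize.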
Visualizing Neighbors is simpler
[Figure: the same sales data shown two ways: as a grid of months (Apr to Mar) by stores (1 to 8), and as a flat table with columns Month, Store and Sales (Apr 1, Apr 2, ..., Jun 2, ...)]
A Visual Operation: Pivot (Rotate)
[Figure: cube with a Product axis (Juice, Cola, Milk, Cream) and a Date axis (3/1, 3/2, 3/3, 3/4); cell values shown: 10, 47, 30, 12]
Slicing and Dicing
[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and a third axis (India, Far East, Europe); the Telecomm slice is highlighted]
Roll-up and Drill Down
Sales Channel
Region
Country
State
Location Address
Sales Representative
[Figure: arrow running from "Low-level Details" to "Higher Level of Aggregation"]
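Roll-up and drill-down move along a hierarchy like the one listed above. The sketch below uses just two of its levels, State and Country; the state names, the mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: each state belongs to one country.
STATE_TO_COUNTRY = {"Maharashtra": "India", "Gujarat": "India", "Texas": "USA"}

# Hypothetical sales at the State level.
sales_by_state = {"Maharashtra": 40, "Gujarat": 25, "Texas": 30}

def roll_up(detail, parent_of):
    """Aggregate detail values up one level of the hierarchy."""
    totals = defaultdict(int)
    for child, value in detail.items():
        totals[parent_of[child]] += value
    return dict(totals)

def drill_down(detail, parent_of, parent):
    """Show the lower-level rows that make up one higher-level total."""
    return {c: v for c, v in detail.items() if parent_of[c] == parent}

sales_by_country = roll_up(sales_by_state, STATE_TO_COUNTRY)
print(sales_by_country)
print(drill_down(sales_by_state, STATE_TO_COUNTRY, "India"))
```

Drilling down into India returns exactly the rows whose sum produced India's rolled-up total.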
Nature of OLAP Analysis
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
Organizationally Structured Data
Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data
marketing
manufacturing
sales
finance
Multidimensional Spreadsheets
Analysts need spreadsheets that support
  pivot tables (cross-tabs)
  drill-down and roll-up
  slice and dice
  sort
  selections
  derived attributes
Popular in retail domain
OLAP - Data Cube
Idea: analysts need to group data in many different ways
  e.g. Sales(region, product, prodtype, prodstyle, date, saleamount)
  saleamount is a measure attribute, rest are dimension attributes
  groupby every subset of the other attributes
  materialize (precompute and store) groupbys to give online response
Also: hierarchies on attributes: date -> weekday, date -> month -> quarter -> year
SQL Extensions
Front-end tools require
  Extended Family of Aggregate Functions
    rank, median, mode
  Reporting Features
    running totals, cumulative totals
  Results of multiple group by
    total sales by month and total sales by product
  Data Cube
Relational OLAP: 3 Tier DSS
Data Warehouse (Database Layer): Store atomic data in industry standard RDBMS.
ROLAP Engine (Application Logic Layer): Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Decision Support Client (Presentation Layer): Obtain multi-dimensional reports from the DSS Client.
MD-OLAP: 2 Tier DSS
MDDB Engine (Database Layer and Application Logic Layer): Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Decision Support Client (Presentation Layer): Obtain multi-dimensional reports from the DSS Client.
Typical OLAP Problems
Data Explosion
[Figure: "Data Explosion Syndrome" chart of number of aggregations against number of dimensions (4 levels in each dimension)]
Source: Microsoft TechEd98
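The growth behind the chart can be approximated with a short calculation. The sketch assumes one aggregation per combination of hierarchy levels, so that d dimensions with 4 levels each give 4 to the power d combinations; the chart's actual figures are not recoverable from the slide text, so this counting rule is an assumption.

```python
def aggregation_count(dimensions, levels=4):
    """Count level combinations: one choice of level per dimension."""
    return levels ** dimensions

for d in range(2, 9):
    print(d, aggregation_count(d))
```

Under this assumption each added dimension multiplies the number of possible aggregations by four, which is why pre-calculating every outcome quickly becomes impractical.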
Metadata Repository
Administrative metadata
  source databases and their contents
  gateway descriptions
  warehouse schema, view & derived data definitions
  dimensions, hierarchies
  pre-defined queries and reports
  data mart locations and contents
  data partitions
  data extraction, cleansing, transformation rules, defaults
  data refresh and purging rules
  user profiles, user groups
  security: user authorization, access control
Metadata Repository .. 2
Business data
  business terms and definitions
  ownership of data
  charging policies
Operational metadata
  data lineage: history of migrated data and sequence of transformations applied
  currency of data: active, archived, purged
  monitoring information: warehouse usage statistics, error reports, audit trails.
Recipe for a Successful Warehouse
For a Successful Warehouse
From day one establish that warehousing is a joint user/builder project
Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
Train the users one step at a time
Consider doing a high level corporate data model in no more than three weeks
From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html
For a Successful Warehouse
Look closely at the data extracting, cleaning, and loading tools
Implement a user accessible automated directory to information stored in the warehouse
Determine a plan to test the integrity of the data in the warehouse
From the start get warehouse users in the habit of 'testing' complex queries
For a Successful Warehouse
Coordinate system roll-out with network administration personnel
When in a bind, ask others who have done the same thing for advice
Be on the lookout for small, but strategic, projects
Market and sell your data warehousing systems
Data Warehouse Pitfalls
You are going to spend much time extracting, cleaning, and loading data
Despite best efforts at project management, data warehousing project scope will increase
You are going to find problems with systems feeding the data warehouse
You will find the need to store data not being captured by any existing system
You will need to validate data not being validated by transaction processing systems
Data Warehouse Pitfalls
Some transaction processing systems feeding the warehousing system will not contain detail
Many warehouse end users will be trained and never or seldom apply their training
After end users receive query and report tools, requests for IS written reports may increase
Your warehouse users will develop conflicting business rules
Large scale data warehousing can become an exercise in data homogenizing
DW and OLAP Research Issues
Data cleaning
  focus on data inconsistencies, not schema differences
  data mining techniques
Physical Design
  design of summary tables, partitions, indexes
  tradeoffs in use of different indexes
Query processing
  selecting appropriate summary tables
  dynamic optimization with feedback
  acid test for query optimization: cost estimation, use of transformations, search strategies
  partitioning query processing between OLAP server and backend server.
DW and OLAP Research Issues .. 2
Warehouse Management
  detecting runaway queries
  resource management
  incremental refresh techniques
  computing summary tables during load
  failure recovery during load and refresh
  process management: scheduling queries, load and refresh
  Query processing, caching
  use of workflow technology for process management
Products, References, Useful Links
Reporting Tools
Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist, ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data
OLAP and Executive Information Systems
Andyne Computing -- Pablo
Arbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander OLAP
Holistic Systems -- Holos
Information Advantage -- AXSYS, WebOLAP
Informix -- Metacube
Microstrategy -- DSS/Agent
Microsoft -- Plato
Oracle -- Express
Pilot -- LightShip
Planning Sciences -- Gentium
Platinum Technology -- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS, OLAP++
Speedware -- Media
Other Warehouse Related Products
Data extract, clean, transform, refresh
  CA-Ingres replicator
  Carleton Passport
  Prism Warehouse Manager
  SAS Access
  Sybase Replication Server
  Platinum Inforefiner, Infopump
Extraction and Transformation Tools
Carleton Corporation -- Passport
Evolutionary Technologies Inc. -- Extract
Informatica -- OpenBridge
Information Builders Inc. -- EDA Copy Manager
Platinum Technology -- InfoRefiner
Prism Solutions -- Prism Warehouse Manager
Red Brick Systems -- DecisionScape Formation
Scrubbing Tools
Apertus -- Enterprise/Integrator
Vality -- IPE
Postal Soft
Warehouse Products
Computer Associates -- CA-Ingres
Hewlett-Packard -- Allbase/SQL
Informix -- Informix, Informix XPS
Microsoft -- SQL Server
Oracle -- Oracle7, Oracle Parallel Server
Red Brick -- Red Brick Warehouse
SAS Institute -- SAS
Software AG -- ADABAS
Sybase -- SQL Server, IQ, MPP
Warehouse Server Products
Oracle 8
Informix
  Online Dynamic Server
  XPS -- Extended Parallel Server
  Universal Server for object relational applications
Sybase
  Adaptive Server 11.5
  Sybase MPP
  Sybase IQ
Warehouse Server Products
Red Brick Warehouse
Tandem Nonstop
IBM
  DB2 MVS
  Universal Server
  DB2 400
Teradata
Other Warehouse Related Products
Connectivity to Sources
  Apertus
  Information Builders EDA/SQL
  Platinum Infohub
  SAS Connect
  IBM Data Joiner
  Oracle Open Connect
  Informix Express Gateway
Other Warehouse Related Products
Query/Reporting Environments
  Brio/Query
  Cognos Impromptu
  Informix Viewpoint
  CA Visual Express
  Business Objects
  Platinum Forest and Trees
4GL's, GUI Builders, and PC Databases
Information Builders -- Focus
Lotus -- Approach
Microsoft -- Access, Visual Basic
MITI -- SQR/Workbench
PowerSoft -- PowerBuilder
SAS Institute -- SAS/AF
Data Mining Products
DataMind -- neurOagent
Information Discovery -- IDIS
SAS Institute -- SAS/Neuronets
Data Warehouse
W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc 1997
Data Warehouse
W.H. Inmon, John A. Zachman, Jonathan G. Geiger, Data Stores Data Warehousing and the Zachman Framework, McGraw Hill Series on Data Warehousing and Data Management, 1997
Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996
OLAP and DSS
Erik Thomsen, OLAP Solutions, John Wiley and Sons 1997
Microsoft TechEd Transparencies from Microsoft TechEd 98
Essbase Product Literature
Oracle Express Product Literature
Microsoft Plato Web Site
Microstrategy Web Site
Data Mining
Michael J.A. Berry and Gordon Linoff, Data Mining Techniques, John Wiley and Sons 1997
Peter Adriaans and Dolf Zantinge, Data Mining, Addison Wesley Longman Ltd. 1996
KDD Conferences
Other Tutorials
Donovan Schneider, Data Warehousing Tutorial, Tutorial at International Conference for Management of Data (SIGMOD 1996) and International Conference on Very Large Data Bases 97
Umeshwar Dayal and Surajit Chaudhuri, Data Warehousing Tutorial at International Conference on Very Large Data Bases 1996
Anand Deshpande and S. Seshadri, Tutorial on Datawarehousing and Data Mining, CSI-97
Useful URLs
Ralph Kimball's home page
  http://www.rkimball.com
Larry Greenfield's Data Warehouse Information Center
  http://pwp.starnetinc.com/larryg/
Data Warehousing Institute
  http://www.dw-institute.com/
OLAP Council
  http://www.olapcouncil.com/
Data Mining Motivation
Changes in the Business Environment
  Customers becoming more demanding
  Markets are saturated
Databases today are huge:
  More than 1,000,000 entities/records/rows
  From 10 to 10,000 fields/attributes/variables
  Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge
Data Mining Applications: Retail
Performing basket analysis
  Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
Sales forecasting
  Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
Database marketing
  Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer label clothing or those who attend sales. This information can be used to focus cost effective promotions.
Merchandise planning and allocation
  When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.
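Of the retail applications above, basket analysis is the easiest to sketch: count how often each pair of items appears in the same basket. The baskets and the function name below are hypothetical, and this simple pair count stands in for the more elaborate association-rule methods used in practice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n, n / len(baskets))   # pair, count, share of baskets
```

Pairs with a high share of baskets are candidates for joint promotions or adjacent shelf placement.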
Data Mining Applications: Banking
Card marketing
  By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
Cardholder pricing and profitability
  Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection
  Fraud is enormously costly. By analyzing past transactions that were later determined to be