Data Warehousing: A PerspectivebyHemant Kirpekar 10/18/201
Data Warehousing: A Perspective
by Hemant Kirpekar
Introduction
  The Need for proper understanding of Data Warehousing
  The Key Issues
  The Definition of a Data Warehouse
  The Lifecycle of a Data Warehouse
  The Goals of a Data Warehouse
Why Data Warehousing is different from OLTP
E/R Modeling Vs Dimension Tables
Two Sample Data Warehouse Designs
  Designing a Product-Oriented Data Warehouse
  Designing a Customer-Oriented Data Warehouse
Mechanics of the Design
  Interviewing End-Users and DBAs
  Assembling the team
  Choosing Hardware/Software platforms
  Handling Aggregates
  Server-Side activities
  Client-Side activities
Conclusions
A Checklist for an Ideal Data Warehouse
Introduction
The need for proper understanding of Data Warehousing
The following is an extract from "Knowledge Asset Management and Corporate Memory", a White Paper to be published on the WWW, possibly via the Hispacom site, in the third week of August 1996.
Data Warehousing may well leverage the rising tide of technologies that everyone will want or need; however, the current trend in Data Warehousing marketing leaves a lot to be desired.
In many organizations there still exists an enormous divide that separates Information Technology from a manager's need for Knowledge and Information. It is common currency that there is a whole host of available tools and techniques for locating, scrubbing, sorting, storing, structuring, documenting, processing and presenting information. Unfortunately, tools are tangible and business information and knowledge are not, so they tend to get confused.
So why do we still have this confusion? First consider how certain companies market Data Warehousing. There are companies that sell database technologies, other companies that sell the platforms [...] projects.
In the main, most RDBMS vendors seem to see Data Warehouse projects as a challenge to provide greater performance, greater capacity and greater divergence. With this excuse, most RDBMS products carry functionality that makes them about as truly "open" as a UNIVAC 90/30, i.e. no standards for View Partitioning, Bit Mapped Indexing, Histograms, Object Partitioning, SQL query decomposition or SQL evaluation strategies etc. This however is not really the important issue. The real issue is that some vendors sell Data Warehousing as if it just provided a big dumping ground for massive amounts of data with which users are allowed to do anything they like, whilst at the same time freeing up Operational Systems from the need to support end-user informational requirements.
Some hardware vendors have a similar approach, i.e. a Data Warehouse platform must inherently have a lot of disks, a lot of memory and a lot of CPUs. However, one of the most successful Data Warehouse projects we have worked on used COMPAQ hardware, which provides an excellent cost/benefit ratio.
Some Technical Consultancy Services providers tend to dwell on the performance aspects of Data Warehousing. They see Data Warehousing as a technical challenge, rather than a business opportunity, but the biggest performance payoffs will be brought about when there is a full understanding of how the user wishes to use the information.
The Key Issues
Organizations are swimming in data. However, most will have to create new data with improved quality, to meet strategic business planning requirements.
So:
How should IS plan for the mass of end user information demand?
What vendors and tools will emerge to help IS build and maintain a data warehouse architecture?
What strategies can users deploy to develop a successful data warehouse architecture?
What technology breakthroughs will occur to empower knowledge workers and reduce operational data access requirements?
These are some of the key questions outlined by the Gartner Group in their 1995 report on Data Warehousing.
I will try to answer some of these questions in this report.
The Definition of a Data Warehouse
A Data Warehouse is a:
- subject-oriented
- integrated
- time-variant
- non-volatile
collection of data in support of management decisions.
The first salient characteristic of the data warehouse is that it is subject-oriented: it is organized around the major subject areas of the corporation that have been defined in the data model. Examples of subject areas are: customer, product, activity, policy, claim, account. The major subject areas end up being physically implemented as a series of related tables in the data warehouse.
Personal Note: Could these be objects? No one to my knowledge has explored this possibility as yet.
The second salient characteristic of the data warehouse is that it is integrated. This is the most important aspect of a data warehouse. The different design decisions that the application designers have made over the years show up in a thousand different ways. Generally, there is no application consistency in encoding, naming conventions, physical attributes, measurements of attributes, key structure and physical characteristics of the data. Each application has most likely been designed independently. As data is entered into the data warehouse, inconsistencies of the application level are undone.
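As a rough illustration of undoing application-level inconsistencies, the sketch below maps two differently encoded records onto one convention. The field names, the gender codes and the unit table are all hypothetical, chosen only to show inconsistent encoding and inconsistent measurement of attributes; they are not taken from the paper.

```python
# Hypothetical encodings used by different source applications (assumptions).
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
# Hypothetical conversion factors to a single unit of measurement (centimetres).
TO_CENTIMETRES = {"cm": 1.0, "in": 2.54, "m": 100.0}


def integrate(record):
    """Return the record re-expressed in the warehouse's single convention."""
    gender = GENDER_CODES[str(record["gender"]).strip().lower()]
    length_cm = record["length"] * TO_CENTIMETRES[record["length_unit"]]
    return {"gender": gender, "length_cm": round(length_cm, 2)}


# Two applications describing the same kind of fact in different ways.
app_a = {"gender": "male", "length": 10, "length_unit": "in"}
app_b = {"gender": 1, "length": 25.4, "length_unit": "cm"}
integrated = [integrate(app_a), integrate(app_b)]
```

After integration both records carry the same gender code and the same unit, so they can sit side by side in one warehouse table.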
The third salient characteristic of the data warehouse is that it is time-variant. A 5 to 10 year time horizon of data is normal for the data warehouse. Data Warehouse data is a sophisticated series of snapshots taken at one moment in time and the key structure always contains some time element.
The last important characteristic of the data warehouse is that it is nonvolatile. Unlike operational data, warehouse data is loaded en masse and is then accessed. Update of the data does not occur in the data warehouse environment.
The lifecycle of the Data Warehouse
Data flows into the data warehouse from the operational environment. Usually a significant amount of transformation of data occurs at the passage from the operational level to the data warehouse level.
Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data and then on to summarized data.
At some point in time data is purged from the warehouse. There are several ways in which this can be made to happen:
- Data is added to a rolling summary file where the detail is lost.
- Data is transferred to a bulk medium from a high-performance medium such as DASD.
- Data is transferred from one level of the architecture to another.
- Data is actually purged from the system at the DBA's request.
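The first purge route, a rolling summary where the detail is lost, might be sketched as follows. This is a minimal example under stated assumptions: the table names sales_detail and sales_monthly, the sample rows and the cutoff date are all hypothetical, and SQLite stands in for whatever DBMS hosts the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (sale_date TEXT, product TEXT, dollars REAL);
    CREATE TABLE sales_monthly (month TEXT, product TEXT, dollars REAL);
    INSERT INTO sales_detail VALUES
        ('1995-01-10', 'Framis', 100.0),
        ('1995-01-20', 'Framis', 50.0),
        ('1995-02-05', 'Framis', 70.0),
        ('1996-03-01', 'Framis', 30.0);
""")


def roll_up_and_purge(conn, cutoff_date):
    """Summarize detail older than cutoff_date by month, then purge that detail."""
    with conn:  # one transaction: the summary and the purge succeed or fail together
        conn.execute(
            """INSERT INTO sales_monthly (month, product, dollars)
               SELECT substr(sale_date, 1, 7), product, SUM(dollars)
               FROM sales_detail
               WHERE sale_date < ?
               GROUP BY substr(sale_date, 1, 7), product""",
            (cutoff_date,),
        )
        conn.execute("DELETE FROM sales_detail WHERE sale_date < ?", (cutoff_date,))


roll_up_and_purge(conn, "1996-01-01")
monthly = conn.execute(
    "SELECT month, product, dollars FROM sales_monthly ORDER BY month"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM sales_detail").fetchone()[0]
```

Only the 1996 detail row survives; the 1995 rows live on as two monthly totals, which is exactly the loss of detail the list item describes.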
The following diagram is from "Building a Data Warehouse", 2nd Ed., by W.H. Inmon, Wiley '96.
[Diagram: levels of data in the warehouse, including lightly summarized and highly summarized data]
The Goals of a Data Warehouse
According to Ralph Kimball [...]
Why Data Warehousing is different from OLTP
On-line transaction processing is profoundly different from data warehousing. The users are different, the data content is different, the data structures are different, the hardware is different, the software is different, the administration is different, the management of the systems is different, and the daily rhythms are different. The design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for information warehousing.
OLTP Transactional Properties
In OLTP a transaction is defined by its ACID properties.
A Transaction is a user-defined sequence of instructions that maintains consistency across a persistent set of values. It is a sequence of operations that is atomic with respect to recovery.
To remain valid, a transaction must maintain its ACID properties.
Atomicity is a condition that states that for a transaction to be valid the effects of all its instructions must be enforced or none at all.
Consistency is a property of the persistent data and must be preserved by the execution of a complete transaction.
Isolation is a property that states that the effect of running transactions concurrently must be that of serializability, i.e. as if each of the transactions were run in isolation.
Durability is the ability of a transaction to preserve its effects if it has committed, in the presence of media and system failures.
A serious data warehouse will often process only one transaction per day, but this transaction will contain thousands or even millions of records. This kind of transaction has a special name in data warehousing. It is called a production data load.
In a data warehouse, consistency is measured globally. We do not care about an individual transaction, but we care enormously that the current load of new data is a full and consistent set of data. What we care about is the consistent state of the system we started with before the production data load, and the consistent state of the system we ended up with after a successful production data load. The most practical frequency of this production data load is once per day, usually in the early hours of the morning. So, instead of a microscopic perspective, we have a quality assurance manager's judgment of data consistency.
OLTP systems are driven by performance and reliability concerns. Users of a data warehouse almost never deal with one account at a time, usually requiring hundreds or thousands of records to be searched and compressed into a small answer set. Users of a data warehouse change the kinds of questions they ask constantly. Although the templates of their requests may be similar, the impact of these queries will vary wildly on the database system. Small single table queries, called browses, need to be instantaneous whereas large multitable queries, called join queries, are expected to run for seconds or minutes.
Reporting is the primary activity in a data warehouse. Users consume information in human-sized chunks of one or two pages. Blinking numbers on a page can be clicked on to answer why questions. Negatives below are blinking numbers.
Example of a Data Warehouse Report
Product  Region   Sales This  Growth in Sales  Sales as %   Change in Sales as % of  Change in Sales as % of
                  Month       vs Last Month    of Category  Cat vs Last Month        Cat YTD vs Last Yr YTD
Framis   Central  110         12%              31%          3%                       7%
Framis   Eastern  179         (3%)             28%          (1%)                     3%
Framis   Western  55          5%               44%          1%                       5%
Total Framis      344         6%               33%          1%                       5%
Widget   Central  66          2%               18%          2%                       10%
Widget   Eastern  102         4%               12%          5%                       13%
Widget   Western  39          (9%)             9%           (1%)                     8%
Total Widget      207         1%               13%          4%                       11%
Grand Total       551         4%               20%          2%                       8%
The twinkling nature of OLTP databases [...] burden on that system to correctly depict old history. We have a long series of transactions that incrementally alter history and it is close to impossible to quickly reconstruct the snapshot of a business at a specified point in time.
We make a data warehouse a specific time series. We move snapshots of the OLTP systems over to the data warehouse as a series of data layers, like geologic layers. By bringing static snapshots to the warehouse only on a regular basis, we solve both of the time representation problems we had on the OLTP system. No updates during the day, so no twinkling. By storing snapshots, we represent prior points in time correctly. This allows us to ask comparative queries easily. The snapshot is called the production data extract, and we migrate this extract to the data warehouse system at regular time intervals. This process gives rise to the two phases of the data warehouse: loading and querying.
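The layering idea can be sketched in a few lines. The example below assumes a hypothetical balance_snapshot table and made-up account rows; each production data extract is appended as a new layer keyed by its snapshot date, and a comparative query simply joins two layers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE balance_snapshot (
           snapshot_date TEXT,
           account       TEXT,
           balance       REAL,
           PRIMARY KEY (snapshot_date, account)  -- the key always carries a time element
       )"""
)


def load_snapshot(conn, snapshot_date, extract):
    """Append one static layer; earlier layers are never updated."""
    with conn:
        conn.executemany(
            "INSERT INTO balance_snapshot VALUES (?, ?, ?)",
            [(snapshot_date, account, balance) for account, balance in extract],
        )


# Loading phase: two hypothetical daily extracts.
load_snapshot(conn, "1996-07-01", [("A-1", 100.0), ("A-2", 250.0)])
load_snapshot(conn, "1996-07-02", [("A-1", 80.0), ("A-2", 300.0)])

# Querying phase: compare two points in time.
changes = conn.execute(
    """SELECT cur.account, cur.balance - prev.balance
       FROM balance_snapshot cur
       JOIN balance_snapshot prev ON prev.account = cur.account
       WHERE prev.snapshot_date = '1996-07-01'
         AND cur.snapshot_date = '1996-07-02'
       ORDER BY cur.account"""
).fetchall()
```

Because no layer is ever rewritten, the same comparison gives the same answer whenever it is run.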
E/R Modeling Vs Dimension Tables
Entity/Relationship modeling seeks to drive all the redundancy out of the data. If there is no redundancy in the data, then a transaction that changes any data only needs to touch the database in one place. This is the secret behind the phenomenal improvement in transaction processing speed since the early 80s. E/R modeling works by dividing the data into many discrete entities, each of which becomes a table in the OLTP database. A simple E/R diagram looks like the map of a large metropolitan area where the entities are the cities and the relationships are the connecting freeways. This diagram is very symmetric. For queries that span many records or many tables, E/R diagrams are too complex for users to understand and too complex for software to navigate. E/R MODELS CANNOT BE USED AS THE BASIS FOR ENTERPRISE DATA WAREHOUSES.
In data warehousing, 80% of the queries are single-table browses, and 20% are multitable joins. This allows for a tremendously simple data structure. This structure is the dimensional model or the star join schema.
This name is chosen because the E/R diagram looks like a star with one large central table called the fact table and a set of smaller attendant tables called dimensional tables, displayed in a radial pattern around the fact table. This structure is very asymmetric. The fact table in the schema is the only one that participates in multiple joins with the dimension tables. The dimension tables all have a single join to this central fact table.
Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
A typical dimensional model
The above is an example of a star schema for a typical grocery store chain. The Sales Fact table contains daily item totals of all the products sold. This is called the grain of the fact table. Each record in the fact table represents the total sales of a specific product in a market on a day. Any other combination generates a different record in the fact table. The fact table of a typical grocery retailer with 500 stores, each carrying 50,000 products on the shelves [...]
The fact table is where the numerical measurements of the business are stored. These measurements are taken at the intersection of all the dimensions. The best and most useful facts are continuously [...]
[...] <=== join constraint
and f.productkey = p.productkey <=== join constraint
and t.quarter = '1 Q 1995' <=== application constraint
group by p.brand <=== group by clause
order by p.brand <=== order by clause
Virtually every query like this one contains row headers and aggregated facts in the select list. The row headers are not summed, the aggregated facts are.
The from clause lists the tables involved in the join.
The join constraints join on the primary key from the dimension table and the foreign key in the fact table. Referential integrity is extremely important in data warehousing and is enforced by the database management system.
This fact table key is a composite key consisting of concatenated foreign keys.
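The opening lines of the query do not survive in the text above, so the following is a hedged reconstruction rather than the author's exact statement. It assumes a select list of the brand plus summed dollars and units, the underscored column names from the dimensional model diagram, hypothetical table names (sales_fact, product_dim, time_dim), a fact table reduced to two dimensions, and invented sample rows. SQLite is used only so that the sketch runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce referential integrity
conn.executescript("""
    CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        dollars_sold REAL,
        units_sold   INTEGER,
        PRIMARY KEY (time_key, product_key)   -- composite key of foreign keys
    );
    INSERT INTO time_dim VALUES (1, '1 Q 1995'), (2, '2 Q 1995');
    INSERT INTO product_dim VALUES (1, 'Framis'), (2, 'Widget'), (3, 'Framis');
    INSERT INTO sales_fact VALUES
        (1, 1, 100.0, 10), (1, 2, 40.0, 4), (1, 3, 60.0, 6), (2, 1, 999.0, 99);
""")

rows = conn.execute("""
    SELECT p.brand, SUM(f.dollars_sold), SUM(f.units_sold)  -- row header + aggregated facts
    FROM sales_fact f, product_dim p, time_dim t            -- tables involved in the join
    WHERE f.time_key = t.time_key                           -- join constraint
      AND f.product_key = p.product_key                     -- join constraint
      AND t.quarter = '1 Q 1995'                            -- application constraint
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()

# A fact row with an unknown product key is rejected by the DBMS.
try:
    conn.execute("INSERT INTO sales_fact VALUES (1, 42, 5.0, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The brand column is the row header and is only grouped; the two SUM columns are the aggregated facts.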
In OLTP applications joins are usually among artificially generated numeric keys that have little administrative significance elsewhere in the company. In data warehousing one job function maintains the master product file and oversees the generation of new product keys and another job function makes sure that every sales record contains valid product keys. These joins are therefore called MIS joins.
Application constraints apply to individual dimension tables. Browsing the dimension tables, the user specifies application constraints. It rarely makes sense to apply an application constraint simultaneously across two dimensions, thereby linking the two dimensions. The dimensions are linked only through the fact table. It is possible to directly apply an application constraint to a fact in the fact table. This can be thought of as a filter on the records that would otherwise be retrieved by the rest of the query.
The group by clause summarizes records in the row headers. The order by clause determines the sort order of the answer set when it is presented to the user.
From a performance viewpoint then, the SQL query should be evaluated as follows:
First, the application constraints are evaluated dimension by dimension. Each dimension thus produces a set of candidate keys. The candidate keys are then assembled from each dimension into trial composite keys to be searched for in the fact table. All the "hits" in the fact table are then grouped and summed according to the specifications in the select list and group by clause.
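The evaluation order described above can be sketched in plain Python. The dictionaries standing in for the dimension and fact tables, their contents, and the function names are all illustrative assumptions; a real DBMS would use indexes rather than in-memory dictionaries.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical dimension tables: key -> attributes.
time_dim = {10: {"quarter": "1 Q 1995"}, 11: {"quarter": "2 Q 1995"}}
product_dim = {1: {"brand": "Framis"}, 2: {"brand": "Widget"}, 3: {"brand": "Framis"}}
# Hypothetical fact table: composite key (time_key, product_key) -> dollars sold.
sales_fact = {(10, 1): 100.0, (10, 2): 40.0, (10, 3): 60.0, (11, 1): 999.0}


def candidate_keys(dimension, constraint):
    """Step 1: evaluate the application constraint on one dimension."""
    return [key for key, attrs in dimension.items() if constraint(attrs)]


def star_join(time_constraint, product_constraint):
    time_keys = candidate_keys(time_dim, time_constraint)
    product_keys = candidate_keys(product_dim, product_constraint)
    totals = defaultdict(float)
    # Step 2: assemble trial composite keys from the candidate keys.
    for trial_key in cartesian(time_keys, product_keys):
        # Step 3: search the fact table; only the "hits" contribute.
        if trial_key in sales_fact:
            brand = product_dim[trial_key[1]]["brand"]  # row header
            totals[brand] += sales_fact[trial_key]      # aggregated fact
    # Step 4: return the grouped and summed answer set in sorted order.
    return sorted(totals.items())


result = star_join(lambda t: t["quarter"] == "1 Q 1995", lambda p: True)
```

The fact table is touched only by key lookup, which is why constraining the small dimension tables first pays off.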
Attributes' Role in Data Warehousing
Attributes are the drivers of the Data Warehouse. The user begins by placing application constraints on the dimensions through the process of browsing the dimension tables one at a time. The browse queries are always on single-dimension tables and are usually fast acting and lightweight. Browsing is to allow the user to assemble the correct constraints on each dimension. The user launches several queries in this phase. The user also drags row headers from the dimension tables and additive facts from the fact table to the answer staging area (the report). The user then launches a multitable join. Finally, the DBMS groups and summarizes millions of low-level records from the fact table into the small answer set and returns the answer to the user.
Two Sample Data Warehouse Designs
Designing a Product-Oriented Data Warehouse
Time Dimension: time_key, day_of_week, day_no_in_month, other time dimension attributes
Promotion Dimension: promotion_key, promotion_name, price_reduction_type, other promotion attributes
Product Dimension: product_key, SKU_no, SKU_desc, other product attributes
Store Dimension: store_key, store_name, store_number, store_addr, other store attributes
Sales Fact: time_key, product_key, store_key, promotion_key, dollar_sales, units_sales, dollar_cost, customer_count
The Grocery Store Schema
Background
The above schema is for a grocery chain with 500 large grocery stores spread over a three-state area. Each store has a full complement of departments including grocery, frozen foods, dairy, meat, produce, bakery, floral, hard goods, liquor and drugs. Each store has about 60,000 individual products on its shelves. The individual products are called Stock Keeping Units or SKUs. About 40,000 of the SKUs come from outside manufacturers and have bar codes imprinted on the product package. These bar codes, called Universal Product Codes or UPCs, are at the same grain as individual SKUs. The remaining 20,000 SKUs come from departments like meat, produce, bakery or floral and do not have nationally recognized UPC codes.
Management is concerned with the logistics of ordering, stocking the shelves and selling the products as well as maximizing the profit at each store. The most significant management decision has to do with pricing and promotions. Promotions include temporary price reductions, ads in newspapers, displays in the grocery store including shelf displays and end aisle displays, and coupons.
Identifying the Processes to Model
The first step in the design is to decide what business processes to model, by combining an understanding of the business with an understanding of what data is available. The second step is to decide on the grain of the fact table in each business process.
A data warehouse always demands data expressed at the lowest possible grain of each dimension, not for the queries to see individual low-level records, but for the queries to be able to cut through the database in very precise ways. The best grain for the grocery store data warehouse is daily item movement or SKU by store by promotion by day.
Dimension Table Modeling
A careful grain statement determines the primary dimensionality of the fact table. It is then possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this additional dimension. The grain of the grocery store table allows the primary dimensions of time, product and store to fall out immediately.
Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension table is needed to describe fiscal periods, seasons, holidays, weekends and other calendar calculations that are difficult to get from the SQL date machinery.
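One way to populate an explicit time dimension is to generate a row per calendar day ahead of time. The sketch below is only an illustration: the holiday set, the weekend rule and the column names beyond those shown in the schema diagrams are assumptions, and fiscal periods and seasons are left out.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Hypothetical company holiday calendar (assumption).
HOLIDAYS = {date(1996, 1, 1), date(1996, 12, 25)}


def build_time_dimension(start, end):
    """Return one time dimension row per day from start to end inclusive."""
    rows = []
    day = start
    time_key = 1
    while day <= end:
        rows.append({
            "time_key": time_key,
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_no_in_month": day.day,
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "weekend_flag": day.weekday() >= 5,
            "holiday_flag": day in HOLIDAYS,
        })
        day += timedelta(days=1)
        time_key += 1
    return rows


time_dim = build_time_dimension(date(1996, 1, 1), date(1996, 12, 31))
```

With the flags stored as ordinary columns, a constraint such as "weekends in the fourth quarter" becomes a simple browse of this table instead of date arithmetic inside the query.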
Time is usually the first dimension in the underlying sort order in the database because when it is the first in the sort order, the successive loading of time intervals of data will load data into virgin territory on the disk.
The product dimension is one of the two or three primary dimensions in nearly every data warehouse. This type of dimension has a great many attributes, and in general can go above 50 attributes.
The other two dimensions are an artifact of the grocery store example.
A note of caution
Product Dimension: product_key, SKU_desc, SKU_number, package_size_key, package_type, diet_type, weight, weight_unit_of_measure, storage_type_key, units_per_retail_case, brand_key, etc.
package_size_key: package_size
brand_key: brand, subcategory_key
subcategory_key: subcategory, category_key
category_key: category, department_key
department_key: department
storage_type_key: storage_type, shelf_life_type_key
shelf_life_type_key: shelf_life_type
A snowflaked product dimension
Browsing is the act of navigating around in a dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. If a large product dimension table is split apart into a snowflake, and robust browsing is attempted among widely separated attributes, possibly lying along various tree structures, it is inevitable that browsing performance will be compromised.
Fact Table Modeling
The sales fact table records only the SKUs actually sold. No record is kept of the SKUs that did not sell.
Two Sample Data Warehouse Designs
Designing a Customer-Oriented Data Warehouse
I will outline an insurance application as an example of a customer-oriented data warehouse.
In this example the insurance company is a [...] billion property and casualty insurer for automobiles, home fire protection, and personal liability. There are two main production data sources: all transactions relating to the formulation of policies, and all transactions involved in processing claims. The insurance company wants to analyze both the written policies and claims. It wants to see which coverages are most profitable and which are the least. It wants to measure profits over time by covered item type [...]
The following four schemas outline the star schema for the insurance application.
Claims Transaction Schema
Date dimension: date_key, day_of_week, fiscal_period
Employee dimension: employee_key, name, employee_type, department
Covered item dimension: covered_item_key, covered_item_desc, covered_item_type, automobile_attributes ...
Claimant dimension: claimant_key, claimant_name, claimant_address, claimant_type
Third party dimension: third_party_key, third_party_name, third_party_addr, third_party_type
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile_attributes ...
Policy dimension: policy_key, risk_grade
Claim dimension: claim_key, claim_desc, claim_type, automobile_attributes ...
Transaction dimension: transaction_key, transaction_description, reason
Claims transaction fact: transaction_date, effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, claimant_key, claim_key, third_party_key, transaction_key, amount
Policy Transaction Schema
Date dimension: date_key, day_of_week, fiscal_period
Employee dimension: employee_key, name, employee_type, department
Covered item dimension: covered_item_key, covered_item_description, covered_item_type, automobile_attributes
Transaction dimension: transaction_key, transaction_description, reason
Insured party dimension: insured_party_key, name, address, type, demographic_attributes ...
Coverage dimension: coverage_key, coverage_description, market_segment, line_of_business, annual_statement_line, automobile_attributes
Policy dimension: policy_key, risk_grade ...
Policy transaction fact: transaction_date, effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, transaction_key, amount
Policy Snapshot Schema
Date dimension: date_key, fiscal_period
Agent dimension: agent_key, agent_name, agent_location, agent_type
Covered item dimension: covered_item_key, covered_item_description, covered_item_type, automobile_attributes ...
Status dimension: status_key, status_description
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile_attributes ...
Policy dimension: policy_key, risk_grade
Policy snapshot fact: snapshot_date, effective_date, insured_party_key, agent_key, coverage_key, covered_item_key, policy_key, status_key, written_premium, earned_premium, primary_limit, primary_deductible, number_transactions, automobile_facts ...
Claims Snapshot Schema
Date dimension: date_key, day_of_week, fiscal_period
Covered item dimension: covered_item_key, covered_item_desc, covered_item_type, automobile_attributes ...
Agent dimension: agent_key, agent_name, agent_type, agent_location
Claim dimension: claim_key, claim_desc, claim_type, automobile_attributes ...
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile_attributes ...
Policy dimension: policy_key, risk_grade
Status dimension: status_key, status_description
Claims snapshot fact: transaction_date, effective_date, insured_party_key, agent_key, employee_key, coverage_key, covered_item_key, policy_key, claim_key, status_key, reserve_amount, paid_this_month, received_this_month, number_transactions, automobile_facts ...
An appropriate design for a property and casualty insurance data warehouse is a short value chain consisting of policy creation and claims processing, where these two major processes are represented both by transaction fact tables and monthly snapshot fact tables.
This data warehouse will need to represent a number of heterogeneous coverage types with appropriate combinations of core and custom dimension tables and fact tables.
The large insured party and covered item dimensions will need to be decomposed into one or more minidimensions in order to provide reasonable browsing performance and in order to accurately track these slowly changing dimensions.
Database Sizing for the Insurance Application
Policy Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages
Policy Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages
Mechanics of the Design
There are nine decision points that need to be resolved for a complete data warehouse design:
1. The processes, and hence the identity of the fact tables
2. The grain of each fact table
3. The dimensions of each fact table
4. The facts, including precalculated facts
5. The dimension attributes with complete descriptions and proper terminology
6. How to track slowly changing dimensions
7. The aggregations, heterogeneous dimensions, minidimensions, query models and other physical storage decisions
8. The historical duration of the database
9. The urgency with which the data is extracted and loaded into the data warehouse
Interviewing End-Users and DBAs
Choosing the Hardware/Software platforms
These choices boil down to two primary concerns:
1. Does the proposed system actually work?
2. Is this a vendor relationship that we want to have for a long time?
Question the vendor:
1. Can the system query, store, load, index, and alter a billion-row fact table with a dozen dimensions?
2. Can the system rapidly browse a 100,000 row dimension table?
Benchmark the system to simulate fact and dimension table loading.
Conduct a query test for:
1. Average browse query response time
2. Average browse query delay compared with unloaded system
3. Ratio between longest and shortest browse query time
4. Average join query response time
5. Average join query delay compared with unloaded system
6. Ratio between longest and shortest join query time
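The three measures in the query test can be computed from recorded response times once the benchmark has been run. The sketch below assumes the timings are already available as lists of seconds and interprets "delay compared with unloaded system" as the difference between the two averages; both the function name and that interpretation are assumptions, and the sample timings are invented.

```python
from statistics import mean


def summarize_query_test(loaded_times, unloaded_times):
    """Summarize response times (in seconds) for one class of query."""
    average_loaded = mean(loaded_times)
    average_unloaded = mean(unloaded_times)
    return {
        "average_response": average_loaded,
        # Assumed meaning of "delay": extra time taken on the loaded system.
        "average_delay_vs_unloaded": average_loaded - average_unloaded,
        "longest_to_shortest_ratio": max(loaded_times) / min(loaded_times),
    }


# Hypothetical browse query timings on a loaded and an unloaded system.
browse = summarize_query_test([0.5, 1.0, 1.5], [0.25, 0.5, 0.75])
```

The same function would be called a second time with the join query timings to cover items 4 to 6.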
Server-Side activities
A bulk data loader should allow for:
- The parallelization of the bulk data load across a number of processors in either SMP or MPP environments
- Selectively turning off and then on the master index pre and post bulk loads
- Insert and update modes selectable by the DBA
- Referential integrity handling options
It is a good idea, as mentioned earlier, to think of the load process as one transaction. If the load is corrupted, a rollback and load in the next load window should be tried.
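Treating the load as one transaction might look like the sketch below. It is not a real bulk loader: SQLite stands in for the warehouse DBMS, the sales_fact layout and the sample records are hypothetical, and corruption is simulated by a record that violates a NOT NULL constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sales_fact (
           time_key     INTEGER NOT NULL,
           product_key  INTEGER NOT NULL,
           dollars_sold REAL    NOT NULL
       )"""
)


def production_data_load(conn, records):
    """Load all records as one transaction.

    Returns True if the whole load committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if any insert raises
            conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)
        return True
    except sqlite3.Error:
        return False


good_load = [(1, 1, 100.0), (1, 2, 40.0)]
corrupt_load = [(2, 1, 75.0), (2, None, 10.0)]  # second record is corrupt

first_ok = production_data_load(conn, good_load)
second_ok = production_data_load(conn, corrupt_load)
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

After the failed load the table still holds only the two rows from the good load, so the warehouse stays in its previous consistent state until the next load window.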
Client-Side activities
Conclusions
The data warehousing market is moving quickly as all major DBMS and tool vendors try to satisfy IS needs. The industry needs to be driven by the users as opposed to by the software/hardware vendors, as has been the case up to now.
Software is the key. Although there have been several advances in hardware, such as parallel processing, the main impact will still be felt through software.
Here are a few software issues:
- Optimization of the execution of star join queries
- Indexing of dimension tables for browsing and constraining, especially multi-million-row dimension tables
- Indexing of composite keys of fact tables
- Syntax extensions for SQL to handle aggregations and comparisons
- Support for low-level data compression
- Support for parallel processing
- Database Design tools for star schemas
- Extract, administration and QA tools for star schemas
- End user query tools
A Checklist for an Ideal Data Warehouse
The following checklist is from Ralph Kimball's The Data Warehouse Toolkit, Wiley '96.
- Preliminary complete list of affected user groups prior to interviews
- Preliminary complete list of legacy data sources prior to interviews
- Data warehouse implementation team identified
- Data warehouse manager identified
- Interview leader identified
- Extract programming manager identified
- End user groups to be interviewed identified
- Data warehouse kickoff meeting with all affected end user groups
- End user interviews
- Marketing interviews
- Finance interviews
- Logistics interviews
- Field management interviews
- Senior management interviews
- Six-inch stack of existing management reports representing all interviewed groups
- Legacy system DBA interviews
- Copy books obtained for candidate legacy systems
- Data dictionary explaining meaning of each candidate table and field
- High-level description of which tables and fields are populated with quality data
- Interview findings report distributed
- Prioritized information needs as expressed by end user community
- Data audit performed showing what data is available to support information needs
- Data warehousing design meeting
- Major processes identified and fact tables laid out
- Grain for each fact table chosen
- Choice of transaction grain vs time period accumulating snapshot grain
- Dimensions for each fact table identified
- Facts for each fact table with legacy source fields identified
- Dimension attributes with legacy source fields identified
- Core and custom heterogeneous product tables identified
- Slowly changing dimension attributes identified
- Demographic minidimensions identified
- Initial aggregated dimensions identified
- Duration of each fact table
- Drill across that allows multiple fact tables to appear in same report
- Correctly calculated break rows
- Red-Green exception highlighting with interface to drill down
- Ability to use network aggregate navigator with every atomic query issued by tool
- Sequential operations on the answer set such as numbering top N, and rolling
- Ability to extend query syntax for DBMS special functions
- Ability to define very large behavioral groups of customers or products
- Ability to graph data or hand off data to third-party graphics package
- Ability to pivot data or to hand off data to third-party pivot package
- Ability to support OLE hot links with other OLE aware applications
- Ability to place answer set in clipboard or TXT file in Lotus or Excel formats
- Ability to print horizontal and vertical tiled report
- Batch operation
- Graphical user interface user development facilities
- Ability to build a startup screen for the end user
- Ability to define pull down menu items
- Ability to define buttons for running reports and invoking the browser
- Consultants
- Consultant team qualified
- Consultant team has implemented a similar data warehouse
- Consultant team agrees with the dimensional approach
- Consultant team demonstrates competence in prototype test
Bibliography
1. Building a Data Warehouse, Second Edition, by W.H. Inmon, Wiley, 1996
2. The Data Warehouse Toolkit, by Dr. Ralph Kimball, Wiley, 1996
3. Strategic Database Technology: Management for the year 2000, by Alan Simon, Morgan Kaufmann, 1995
4. Applied Decision Support, by Michael W. Davis, Prentice Hall, 1988
5. Data Warehousing: Passing Fancy or Strategic Imperative, white paper by the Gartner Group, 1995
6. Knowledge Asset Management and Corporate Memory, white paper by the Hispacom Group, to be published in Aug 1996
The End