Dw HK WhitePaper

    Data Warehousing: A Perspective by Hemant Kirpekar 10/18/201

    Data Warehousing: A Perspective

    by Hemant Kirpekar

    Introduction

        The need for proper understanding of Data Warehousing
        The Key Issues
        The Definition of a Data Warehouse
        The Lifecycle of a Data Warehouse
        The Goals of a Data Warehouse

    Why Data Warehousing is different from OLTP

    E/R Modeling Vs Dimension Tables

    Two Sample Data Warehouse Designs

        Designing a Product-Oriented Data Warehouse
        Designing a Customer-Oriented Data Warehouse

    Mechanics of the Design

        Interviewing End-Users and DBAs
        Assembling the team
        Choosing Hardware/Software platforms
        Handling Aggregates
        Server-Side activities
        Client-Side activities

    Conclusions

    A Checklist for an Ideal Data Warehouse


    Introduction

    The need for proper understanding of Data Warehousing

    The following is an extract from "Knowledge Asset Management and Corporate Memory", a White Paper to be published on the WWW, possibly via the Hispacom site, in the third week of August 1996...

    Data Warehousing may well leverage the rising tide technologies that everyone will want or need; however, the current trend in Data Warehousing marketing leaves a lot to be desired.

    In many organizations there still exists an enormous divide that separates Information Technology and a manager's need for Knowledge and Information. It is common currency that there is a whole host of available tools and techniques for locating, scrubbing, sorting, storing, structuring, documenting, processing and presenting information. Unfortunately, tools are tangible and business information and knowledge are not, so they tend to get confused.

    So why do we still have this confusion? First consider how certain companies market Data Warehousing. There are companies that sell database technologies, other companies that sell the platforms [...] projects.

    In the main, most RDBMS vendors seem to see Data Warehouse projects as a challenge to provide greater performance, greater capacity and greater divergence. With this excuse, most RDBMS products carry functionality that makes them about as truly "open" as a UNIVAC 90/0, i.e. no standards for View Partitioning, Bit Mapped Indexing, Histograms, Object Partitioning, SQL query decomposition or SQL evaluation strategies etc. This however is not really the important issue. The real issue is that some vendors sell Data Warehousing as if it just provided a big dumping ground for massive amounts of data with which users are allowed to do anything they like, whilst at the same time freeing up Operational Systems from the need to support end-user informational requirements.

    Some hardware vendors have a similar approach, i.e. a Data Warehouse platform must inherently have a lot of disks, a lot of memory and a lot of CPUs. However, one of the most successful Data Warehouse projects I have worked on used COMPAQ hardware, which provides an excellent cost/benefit ratio.

    Some Technical Consultancy Services providers tend to dwell on the performance aspects of Data Warehousing. They see Data Warehousing as a technical challenge, rather than a business opportunity, but the biggest performance payoffs will be brought about when there is a full understanding of how the user wishes to use the information.


    The Key Issues

    Organizations are swimming in data. However, most will have to create new data with improved quality, to meet strategic business planning requirements.

    So:

    How should IS plan for the mass of end user information demand?

    What vendors and tools will emerge to help IS build and maintain a data warehouse architecture?

    What strategies can users deploy to develop a successful data warehouse architecture?

    What technology breakthroughs will occur to empower knowledge workers and reduce operational data access requirements?

    These are some of the key questions outlined by the Gartner Group in their 1995 report on Data Warehousing.

    I will try to answer some of these questions in this report.

    The Definition of a Data Warehouse

    A Data Warehouse is a:

    • subject-oriented

    • integrated

    • time-variant

    • non-volatile

    collection of data in support of management decisions.

    [...] major subject areas of the corporation that have been defined in the data model. Examples of subject areas are: customer, product, activity, policy, claim, account. The major subject areas end up being physically implemented as a series of related tables in the data warehouse.

    Personal Note: Could these be objects? No one to my knowledge has explored this possibility as yet.

    The second salient characteristic of the data warehouse is that it is integrated. This is the most important aspect of a data warehouse. The different design decisions that the application designers have made over the years show up in a thousand different ways. Generally, there is no application consistency in encoding, naming conventions, physical attributes, measurements of attributes, key structure and physical characteristics of the data. Each application has most likely been designed independently. As data is entered into the data warehouse, inconsistencies of the application level are undone.

    The third salient characteristic of the data warehouse is that it is time-variant. A 5 to 10 year time horizon of data is normal for the data warehouse. Data Warehouse data is a sophisticated series of snapshots taken at one moment in time, and the key structure always contains some time element.

    The last important characteristic of the data warehouse is that it is nonvolatile. Unlike operational data, warehouse data is loaded en masse and is then accessed. Update of the data does not occur in the data warehouse environment.


    The lifecycle of the Data Warehouse

    Data flows into the data warehouse from the operational environment. Usually a significant amount of transformation of data occurs at the passage from the operational level to the data warehouse level.

    Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data and then on to summarized data.

    At some point in time data is purged from the warehouse. There are several ways in which this can be made to happen:

    • Data is added to a rolling summary file where the detail is lost.

    • Data is transferred to a bulk medium from a high-performance medium such as DASD.

    • Data is transferred from one level of the architecture to another.

    • Data is actually purged from the system at the DBA's request.

    The following diagram is from "Building the Data Warehouse", 2nd Ed, by W.H. Inmon, Wiley '96

    [Diagram: levels of data in the warehouse, including lightly summarized and highly summarized data]


    The Goals of a Data Warehouse

    According to Ralph Kimball [...]


    Why Data Warehousing is different from OLTP

    On-line transaction processing is profoundly different from data warehousing. The users are different, the data content is different, the data structures are different, the hardware is different, the software is different, the administration is different, the management of the systems is different, and the daily rhythms are different. The design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for information warehousing.

    OLTP Transactional Properties

    In OLTP a transaction is defined by its ACID properties.

    A Transaction is a user-defined sequence of instructions that maintains consistency across a persistent set of values. It is a sequence of operations that is atomic with respect to recovery.

    To remain valid, a transaction must maintain its ACID properties.

    Atomicity is a condition that states that for a transaction to be valid the effects of all its instructions must be enforced or none at all.

    Consistency is a property of the persistent data and must be preserved by the execution of a complete transaction.

    Isolation is a property that states that the effect of running transactions concurrently must be that of serializability, i.e. as if each of the transactions were run in isolation.

    Durability is the ability of a transaction to preserve its effects if it has committed, in the presence of media and system failures.

    A serious data warehouse will often process only one transaction per day, but this transaction will contain thousands or even millions of records. This kind of transaction has a special name in data warehousing. It is called a production data load.

    In a data warehouse, consistency is measured globally. We do not care about an individual transaction, but we care enormously that the current load of new data is a full and consistent set of data. What we care about is the consistent state of the system we started with before the production data load, and the consistent state of the system we ended up with after a successful production data load. The most practical frequency of this production data load is once per day, usually in the early hours of the morning. So, instead of a microscopic perspective, we have a quality assurance manager's judgment of data consistency.
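The idea of treating the whole production data load as one all-or-nothing transaction can be sketched as follows. This is a minimal illustration and not part of the original paper: it assumes Python's built-in `sqlite3` module, and the table `sales_fact`, the functions `make_warehouse` and `production_data_load`, and the simple row-count check standing in for the quality assurance judgment are all hypothetical.

```python
import sqlite3


def make_warehouse():
    """Create an in-memory warehouse with one (hypothetical) fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact (load_date TEXT, product_key INTEGER, dollars_sold REAL)"
    )
    return conn


def production_data_load(conn, load_date, rows, expected_count):
    """Load one day's extract as a single transaction.

    Returns True if the full set was committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(load_date, product_key, dollars) for product_key, dollars in rows],
            )
            loaded = conn.execute(
                "SELECT COUNT(*) FROM sales_fact WHERE load_date = ?", (load_date,)
            ).fetchone()[0]
            # Global consistency check: the load must be a full set of data.
            if loaded != expected_count:
                raise ValueError("incomplete production data extract")
    except (sqlite3.Error, ValueError):
        return False
    return True
```

A failed load leaves the warehouse exactly as it was before the load started, which is the "consistent state we started with" described above.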

    OLTP systems are driven by performance and reliability concerns. Users of a data warehouse almost never deal with one account at a time, usually requiring hundreds or thousands of records to be searched and compressed into a small answer set. Users of a data warehouse change the kinds of questions they ask constantly. Although the templates of their requests may be similar, the impact of these queries will vary wildly on the database system. Small single table queries, called browses, need to be instantaneous, whereas large multitable queries, called join queries, are expected to run for seconds or minutes.

    Reporting is the primary activity in a data warehouse. Users consume information in human-sized chunks of one or two pages. Blinking numbers on a page can be clicked on to answer why questions. Negatives below are blinking numbers.


    Example of a Data Warehouse Report

    Columns: Product, Region, Sales This Month, Growth in Sales vs Last Month, Sales as % of Category, Change in Sales as % of Cat vs Last Mt, Change in Sales as % of Cat YTD vs Last Yr YTD

    Framis   Central   110 12 31 3

    Framis   Eastern   19 6738 2 6718 3

    Framis   Western   55 5 44 1 5

    Total Framis       344 + 33 1 5

    Widget   Central   ++ 2 1 2 10

    Widget   Eastern   102 4 12 5 13

    Widget   Western   39 6798 9 6718

    Total Widget       20 1 13 4 11

    Grand Total        551 4 20 2

    The twinkling nature of OLTP databases or [...] burden on that system to correctly depict old history. We have a long series of transactions that incrementally alter history and it is close to impossible to quickly reconstruct the snapshot of a business at a specified point in time.

    We make a data warehouse a specific time series. We move snapshots of the OLTP systems over to the data warehouse as a series of data layers, like geologic layers. By bringing static snapshots to the warehouse only on a regular basis, we solve both of the time representation problems we had on the OLTP system. No updates during the day - so no twinkling. By storing snapshots, we represent prior points in time correctly. This allows us to ask comparative queries easily. The snapshot is called the production data extract, and we migrate this extract to the data warehouse system at regular time intervals. This process gives rise to the two phases of the data warehouse: loading and querying.
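The two phases, loading static snapshots and then querying across them, might look like the sketch below. It is only an illustration under assumptions: the table `balance_snapshot` and the functions `make_snapshot_store`, `add_snapshot` and `compare` are hypothetical names, and Python's built-in `sqlite3` module stands in for the warehouse DBMS.

```python
import sqlite3


def make_snapshot_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE balance_snapshot (snapshot_date TEXT, account TEXT, balance INTEGER)"
    )
    return conn


def add_snapshot(conn, snapshot_date, balances):
    """Loading phase: append one static layer; earlier layers are never updated."""
    with conn:
        conn.executemany(
            "INSERT INTO balance_snapshot VALUES (?, ?, ?)",
            [(snapshot_date, account, balance) for account, balance in balances.items()],
        )


def compare(conn, earlier, later):
    """Querying phase: change in balance per account between two snapshot dates."""
    return conn.execute(
        """
        SELECT a.account, b.balance - a.balance
        FROM balance_snapshot a
        JOIN balance_snapshot b ON a.account = b.account
        WHERE a.snapshot_date = ? AND b.snapshot_date = ?
        ORDER BY a.account
        """,
        (earlier, later),
    ).fetchall()
```

Because every layer carries its own date, a comparative query is just a self-join on two dates, with no need to replay transactions to rebuild history.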


    E/R Modeling Vs Dimension Tables

    Entity/Relationship modeling seeks to drive all the redundancy out of the data. If there is no redundancy in the data, then a transaction that changes any data only needs to touch the database in one place. This is the secret behind the phenomenal improvement in transaction processing speed since the early 80s. E/R modeling works by dividing the data into many discrete entities, each of which becomes a table in the OLTP database. A simple E/R diagram looks like the map of a large metropolitan area where the entities are the cities and the relationships are the connecting freeways. This diagram is very symmetric. For queries that span many records or many tables, E/R diagrams are too complex for users to understand and too complex for software to navigate. SO, E/R MODELS CANNOT BE USED AS THE BASIS FOR ENTERPRISE DATA WAREHOUSES.

    In data warehousing, 80% of the queries are single-table browses, and 20% are multitable joins. This allows for a tremendously simple data structure. This structure is the dimensional model or the star join schema.

    This name is chosen because the E/R diagram looks like a star with one large central table called the fact table and a set of smaller attendant tables called dimensional tables, displayed in a radial pattern around the fact table. This structure is very asymmetric. The fact table in the schema is the only one that participates in multiple joins with the dimension tables. The dimension tables all have a single join to this central fact table.

    Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag

    Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost

    Product Dimension: product_key, description, brand, category

    Store Dimension: store_key, store_name, address, floor_plan_type

    A typical dimensional model

    The above is an example of a star schema for a typical grocery store chain. The Sales Fact table contains daily item totals of all the products sold. This is called the grain of the fact table. Each record in the fact table represents the total sales of a specific product in a market on a day. Any other combination generates a different record in the fact table. The fact table of a typical grocery retailer with 500 stores, each carrying 50,000 products on the shelves [...]


    The fact table is where the numerical measurements of the business are stored. These measurements are taken at the intersection of all the dimensions. The best and most useful facts are continuously [...]

    [...]                                  <--- join constraint

    and f.productkey = p.productkey        <--- join constraint

    and t.quarter = '1 Q 1995'             <--- application constraint

    group by p.brand                       <--- group by clause

    order by p.brand                       <--- order by clause

    Virtually every query like this one contains row headers and aggregated facts in the select list. The row headers are not summed, the aggregated facts are.

    The from clause lists the tables involved in the join.

    The join constraints join on the primary key from the dimension table and the foreign key in the fact table. Referential integrity is extremely important in data warehousing and is enforced by the database management system.

    This fact table key is a composite key consisting of concatenated foreign keys.
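To make the pieces of such a query concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. It is an illustration under assumptions, not the paper's own code: the table names `time_dim`, `product_dim` and `sales_fact`, the select list, and the sample rows are hypothetical, and the store dimension is left out to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    dollars_sold REAL,
    units_sold INTEGER,
    PRIMARY KEY (time_key, product_key)  -- composite key of concatenated foreign keys
);
"""

QUERY = """
SELECT p.brand, SUM(f.dollars_sold), SUM(f.units_sold)  -- row header + aggregated facts
FROM sales_fact f, product_dim p, time_dim t            -- tables involved in the join
WHERE f.time_key = t.time_key                           -- join constraint
  AND f.product_key = p.product_key                     -- join constraint
  AND t.quarter = '1 Q 1995'                            -- application constraint
GROUP BY p.brand
ORDER BY p.brand
"""


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # let the DBMS enforce referential integrity
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO time_dim VALUES (?, ?)", [(1, "1 Q 1995"), (2, "2 Q 1995")])
    conn.executemany(
        "INSERT INTO product_dim VALUES (?, ?)", [(10, "Framis"), (11, "Widget"), (12, "Widget")]
    )
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
        [(1, 10, 100.0, 4), (1, 11, 50.0, 2), (1, 12, 25.0, 1), (2, 10, 999.0, 9)],
    )
    conn.commit()
    return conn
```

With foreign keys switched on, an attempt to insert a fact row whose `product_key` is missing from `product_dim` is rejected by the database itself.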

    In OLTP applications joins are usually among artificially generated numeric keys that have little administrative significance elsewhere in the company. In data warehousing one job function maintains the master product file and oversees the generation of new product keys, and another job function makes sure that every sales record contains valid product keys. These joins are therefore called MIS joins.


    Application constraints apply to individual dimension tables. Browsing the dimension tables, the user specifies application constraints. It rarely makes sense to apply an application constraint simultaneously across two dimensions, thereby linking the two dimensions. The dimensions are linked only through the fact table. It is possible to directly apply an application constraint to a fact in the fact table. This can be thought of as a filter on the records that would otherwise be retrieved by the rest of the query.

    The group by clause summarizes records in the row headers. The order by clause determines the sort order of the answer set when it is presented to the user.

    From a performance viewpoint then, the SQL query should be evaluated as follows:

    First, the application constraints are evaluated dimension by dimension. Each dimension thus produces a set of candidate keys. The candidate keys are then assembled from each dimension into trial composite keys to be searched for in the fact table. All the "hits" in the fact table are then grouped and summed according to the specifications in the select list and group by clause.
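The evaluation order described above can be sketched in plain Python. This is a toy model under stated assumptions, not how any particular DBMS is implemented: dimensions are dictionaries keyed by primary key, the fact table is a dictionary keyed by composite key, and `evaluate_star_query` and its parameters are hypothetical names.

```python
from collections import defaultdict
from itertools import product


def evaluate_star_query(dimensions, constraints, fact, group_dim, group_attr):
    """Evaluate a star join by candidate keys.

    dimensions:  {dim_name: {key: {attribute: value}}}, in fact-key order
    constraints: {dim_name: predicate(row) -> bool}, the application constraints
    fact:        {(key_1, ..., key_n): additive_fact}
    """
    names = list(dimensions)

    # Step 1: evaluate the application constraints dimension by dimension.
    candidates = [
        [key for key, row in dimensions[name].items()
         if constraints.get(name, lambda r: True)(row)]
        for name in names
    ]

    # Step 2: assemble trial composite keys and search for them in the fact table.
    group_pos = names.index(group_dim)
    totals = defaultdict(float)
    for composite in product(*candidates):
        if composite in fact:  # a "hit"
            # Step 3: group and sum the hits under the row header.
            header = dimensions[group_dim][composite[group_pos]][group_attr]
            totals[header] += fact[composite]
    return dict(sorted(totals.items()))
```

The sketch probes every trial composite key, which is only reasonable when the constraints leave few candidate keys per dimension; that is exactly why tight application constraints on the dimensions matter for performance.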

    Attributes Role in Data Warehousing

    Attributes are the drivers of the Data Warehouse. The user begins by placing application constraints on the dimensions through the process of browsing the dimension tables one at a time. The browse queries are always on single-dimension tables and are usually fast acting and lightweight. Browsing is to allow the user to assemble the correct constraints on each dimension. The user launches several queries in this phase. The user also drags row headers from the dimension tables and additive facts from the fact table to the answer staging area (the report). The user then launches a multitable join. Finally, the DBMS groups and summarizes millions of low-level records from the fact table into the small answer set and returns the answer to the user.

    Two Sample Data Warehouse Designs

    Designing a Product-Oriented Data Warehouse

    The Grocery Store Schema

    Time Dimension: time_key, day_of_week, day_no_in_month, other time dimension attributes

    Promotion Dimension: promotion_key, promotion_name, price_reduction_type, other promotion attributes

    Product Dimension: product_key, SKU_no, SKU_desc, other product attributes

    Store Dimension: store_key, store_name, store_number, store_addr, other store attributes

    Sales Fact: time_key, product_key, store_key, promotion_key, dollar_sales, units_sales, dollar_cost, customer_count


    Background

    The above schema is for a grocery chain with 500 large grocery stores spread over a three-state area. Each store has a full complement of departments including grocery, frozen foods, dairy, meat, produce, bakery, floral, hard goods, liquor and drugs. Each store has about 60,000 individual products on its shelves. The individual products are called Stock Keeping Units or SKUs. About 40,000 of the SKUs come from outside manufacturers and have bar codes imprinted on the product package. These bar codes, called Universal Product Codes or UPCs, are at the same grain as individual SKUs. The remaining 20,000 SKUs come from departments like meat, produce, bakery or floral departments and do not have nationally recognized UPC codes.

    Management is concerned with the logistics of ordering, stocking the shelves and selling the products as well as maximizing the profit at each store. The most significant management decision has to do with pricing and promotions. Promotions include temporary price reductions, ads in newspapers, displays in the grocery store including shelf displays and end aisle displays, and coupons.

    Identifying the Processes to Model

    The first step in the design is to decide what business processes to model, by combining an understanding of the business with an understanding of what data is available. The second step is to decide on the grain of the fact table in each business process.

    A data warehouse always demands data expressed at the lowest possible grain of each dimension, not for the queries to see individual low-level records, but for the queries to be able to cut through the database in very precise ways. The best grain for the grocery store data warehouse is daily item movement, or SKU by store by promotion by day.

    Dimension Table Modeling

    A careful grain statement determines the primary dimensionality of the fact table. It is then possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this additional dimension. The grain of the grocery store table allows the primary dimensions of time, product and store to fall out immediately.

    Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension table is needed to describe fiscal periods, seasons, holidays, weekends and other calendar calculations that are difficult to get from the SQL date machinery.
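As a rough illustration of why an explicit time dimension is useful, the sketch below precomputes one row per day with the calendar attributes already worked out. The function `build_time_dimension`, its parameters, and the column names `fiscal_period` and `weekend_flag` are assumptions made for this example; the fiscal calendar here is simply a month offset, which a real business may not use.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_time_dimension(start, days, holidays=frozenset(), fiscal_year_start_month=1):
    """Return one time-dimension row (a dict) per calendar day."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "time_key": offset + 1,                      # surrogate key
            "date": d.isoformat(),
            "day_of_week": DAY_NAMES[d.weekday()],       # weekday(): Monday is 0
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
            # Fiscal period counted in months from the start of the fiscal year.
            "fiscal_period": (d.month - fiscal_year_start_month) % 12 + 1,
            "weekend_flag": d.weekday() >= 5,
            "holiday_flag": d in holidays,
        })
    return rows
```

Once these rows are loaded, a constraint such as "weekends in fiscal period 1" becomes a simple browse on the dimension table instead of date arithmetic inside every query.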

    Time is usually the first dimension in the underlying sort order in the database because when it is the first in the sort order, the successive loading of time intervals of data will load data into virgin territory on the disk.

    The product dimension is one of the two or three primary dimensions in nearly every data warehouse. This type of dimension has a great many attributes, and in general can go above 50 attributes.

    The other two dimensions are an artifact of the grocery store example.

    A note of caution


    Product Dimension: product_key, SKU_desc, SKU_number, package_size_key, package_type, diet_type, weight, weight_unit_of_measure, storage_type_key, units_per_retail_case, brand_key, etc.

    Package size: package_size_key, package_size

    Brand: brand_key, brand, subcategory_key

    Subcategory: subcategory_key, subcategory, category_key

    Category: category_key, category, department_key

    Department: department_key, department

    Storage type: storage_type_key, storage_type, shelf_life_type_key

    Shelf life type: shelf_life_type_key, shelf_life_type

    A snowflaked product dimension

    Browsing is the act of navigating around in a dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. If a large product dimension table is split apart into a snowflake, and robust browsing is attempted among widely separated attributes, possibly lying along various tree structures, it is inevitable that browsing performance will be compromised.


    Fact Table Modeling

    The sales fact table records only the SKUs actually sold. No record is kept of the SKUs that did not sell.


    Two Sample Data Warehouse Designs

    Designing a Customer-Oriented Data Warehouse

    I will outline an insurance application as an example of a customer-oriented data warehouse.

    In this example the insurance company is a [...] billion property and casualty insurer for automobiles, home fire protection, and personal liability. There are two main production data sources: all transactions relating to the formulation of policies, and all transactions involved in processing claims. The insurance company wants to analyze both the written policies and claims. It wants to see which coverages are most profitable and which are the least. It wants to measure profits over time by covered item type [...]


    The following four schemas outline the star schema for the insurance application

    Claims Transaction Schema

    Date: date_key, day_of_week, fiscal_period

    Employee: employee_key, name, employee_type, department

    Covered item: covered_item_key, covered_item_desc, covered_item_type, automobile attributes ...

    Claimant: claimant_key, claimant_name, claimant_address, claimant_type

    Third party: third_party_key, third_party_name, third_party_addr, third_party_type

    Insured party: insured_party_key, name, address, type, demographic attributes

    Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile attributes ...

    Policy: policy_key, risk_grade

    Claim: claim_key, claim_desc, claim_type, automobile attributes ...

    Transaction: transaction_key, transaction_description, reason

    Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, claimant_key, claim_key, third_party_key, transaction_key, amount

    Policy Transaction Schema

    Date: date_key, day_of_week, fiscal_period

    Employee: employee_key, name, employee_type, department

    Covered item: covered_item_key, covered_item_description, covered_item_type, automobile attributes

    Transaction: transaction_key, transaction_description, reason

    Insured party: insured_party_key, name, address, type, demographic_attributes ...

    Coverage: coverage_key, coverage_description, market_segment, line_of_business, annual_statement_line, automobile_attributes

    Policy: policy_key, risk_grade, ...

    Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, transaction_key, amount


    Policy Snapshot Schema

    Date: date_key, fiscal_period

    Agent: agent_key, agent_name, agent_location, agent_type

    Covered item: covered_item_key, covered_item_description, covered_item_type, automobile attributes ...

    Status: status_key, status_description

    Insured party: insured_party_key, name, address, type, demographic attributes

    Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile attributes ...

    Policy: policy_key, risk_grade

    Fact table: snapshot_date, effective_date, insured_party_key, agent_key, coverage_key, covered_item_key, policy_key, status_key, written_permission, earned_premium, primary_limit, primary_deductible, number_transactions, automobile facts ...

    Claims Snapshot Schema

    Date: date_key, day_of_week, fiscal_period

    Covered item: covered_item_key, covered_item_desc, covered_item_type, automobile attributes ...

    Agent: agent_key, agent_name, agent_type, agent_location

    Claim: claim_key, claim_desc, claim_type, automobile attributes ...

    Insured party: insured_party_key, name, address, type, demographic attributes

    Coverage: coverage_key, coverage_desc, market_segment, line_of_business, annual_statement_line, automobile attributes ...

    Policy: policy_key, risk_grade

    Status: status_key, status_description

    Fact table: transaction_date, effective_date, insured_party_key, agent_key, employee_key, coverage_key, covered_item_key, policy_key, claim_key, status_key, reserve_amount, paid_this_month, received_this_month, number_transactions, automobile facts ...


    An appropriate design for a property and casualty insurance data warehouse is a short value chain consisting of policy creation and claims processing, where these two major processes are represented both by transaction fact tables and monthly snapshot fact tables.

    This data warehouse will need to represent a number of heterogeneous coverage types with appropriate combinations of core and custom dimension tables and fact tables.

    The large insured party and covered item dimensions will need to be decomposed into one or more minidimensions in order to provide reasonable browsing performance and in order to accurately track these slowly changing dimensions.

    Database Sizing for the Insurance Application

    Policy Transaction Fact Table Sizing

    Number of policies: 2,000,000

    Number of covered item coverages [...]


    Policy Snapshot Fact Table Sizing

    Number of policies: 2,000,000

    Number of covered item coverages [...]


    Mechanics of the Design

    !here are nine #ecision points that nee# to be reso(ve# $or a comp(ete #ata -arehouse #esign:

    1% !he processes7 an# hence the i#entity o$ the $act tab(es

    2% !he grain o$ each $act tab(e

    % !he #imensions o$ each $act tab(e

    % !he $acts7 inc(u#ing preca(cu(ate# $acts%

    C% !he #imension attributes -ith comp(ete #escriptions an# proper termino(ogy

    6% Ho- to track s(o-(y changing #imensions

    J% !he aggregations7 heterogeneous #imensions7 mini#imensions7 9uery mo#e(s an# other physica( storage#ecisions

    8% !he historica( #uration o$ the #atabase

    5% !he urgency -ith -hich the #ata is etracte# an# (oa#e# into the #ata -arehouse

    Interviewing End-Users and DBAs


    Choosing the Hardware/Software Platforms

    These choices boil down to two primary concerns:

    1. Does the proposed system actually work?

    2. Is this a vendor relationship that we want to have for a long time?

    Ask the vendor:

    1. Can the system query, store, load, index, and alter a billion-row fact table with a dozen dimensions?

    2. Can the system rapidly browse a 100,000-row dimension table?

    Benchmark the system to simulate fact and dimension table loading.

    Conduct a query test for:

    1. Average browse query response time

    2. Average browse query delay compared with unloaded system

    3. Ratio between longest and shortest browse query time

    4. Average join query response time

    5. Average join query delay compared with unloaded system

    6. Ratio between longest and shortest join query time
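
The three measures listed for each query class could be computed from raw timings along the following lines. This is a minimal sketch: the function name, the dictionary keys, and the reading of "delay compared with unloaded system" as a simple difference of averages are assumptions for illustration, and the timings are made-up sample values.

```python
from statistics import mean


def summarize(loaded_times, unloaded_times):
    """Summarize one query class (browse or join) from timings in seconds."""
    avg_loaded = mean(loaded_times)
    avg_unloaded = mean(unloaded_times)
    return {
        # Average response time while the system is under load.
        "avg_response": avg_loaded,
        # Assumed definition: extra time relative to the unloaded system.
        "avg_delay_vs_unloaded": avg_loaded - avg_unloaded,
        # Spread between the slowest and fastest query under load.
        "longest_to_shortest": max(loaded_times) / min(loaded_times),
    }


# Made-up sample timings, in seconds, for a browse query test.
browse = summarize(loaded_times=[2.0, 4.0, 6.0], unloaded_times=[1.0, 2.0, 3.0])
```

Running the same function over the join query timings gives the remaining three measures, so the browse and join results can be compared side by side across candidate platforms.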


    Ser


    A bulk data loader should allow for:

    The parallelization of the bulk data load across a number of processors in either SMP or MPP environments

    Selectively turning off and then on the master index pre and post bulk loads

    Insert and update modes selectable by the DBA

    Referential integrity handling options

    It is a good idea, as mentioned earlier, to think of the load process as one transaction. If the load is corrupted, a rollback and load in the next load window should be tried.
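
The "load as one transaction" idea can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for whatever DBMS and bulk loader are actually chosen. The table name, columns, and function name are hypothetical.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load all rows as one transaction; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO policy_fact (policy_key, premium) VALUES (?, ?)",
                rows,
            )
        return True
    except sqlite3.Error:
        # The whole batch was rolled back; retry in the next load window.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy_fact (policy_key INTEGER PRIMARY KEY, premium REAL NOT NULL)"
)

# The second row is corrupt (missing premium), so nothing from this batch is kept.
first_attempt = bulk_load(conn, [(1, 100.0), (2, None)])
rows_after_failure = conn.execute("SELECT COUNT(*) FROM policy_fact").fetchone()[0]

# A clean batch in the "next load window" commits in full.
second_attempt = bulk_load(conn, [(1, 100.0), (2, 250.0)])
rows_after_success = conn.execute("SELECT COUNT(*) FROM policy_fact").fetchone()[0]
```

The point of the design is that readers of the warehouse never see a half-loaded batch: either every row of the load is visible or none is.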

    #lient6Side acti


    Conclusions

    The data warehousing market is moving quickly as all major DBMS and tool vendors try to satisfy IS needs. The industry needs to be driven by the users, as opposed to by the software/hardware vendors as has been the case up to now.

    Software is the key. Although there have been several advances in hardware, such as parallel processing, the main impact will still be felt through software.

    Here are a few software issues:

    Optimization of the execution of star join queries

    Indexing of dimension tables for browsing and constraining, especially multi-million-row dimension tables

    Indexing of composite keys of fact tables

    Syntax extensions for SQL to handle aggregations and comparisons

    Support for low-level data compression

    Support for parallel processing

    Database design tools for star schemas

    Extract, administration and QA tools for star schemas

    End user query tools
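
For readers unfamiliar with the term, a star join query constrains one or more dimension tables and joins them to the fact table through its composite key. The sketch below shows the shape of such a query using Python's sqlite3 module; the schema, table names, and sample rows are invented for illustration and are not taken from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    product_key INTEGER REFERENCES product_dim,
    dollars REAL,
    PRIMARY KEY (time_key, product_key)  -- composite key of the fact table
);
INSERT INTO time_dim VALUES (1, 'Jan'), (2, 'Feb');
INSERT INTO product_dim VALUES (10, 'Auto'), (20, 'Home');
INSERT INTO sales_fact VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 70.0);
""")

# Constrain a dimension (category), join to the fact table, then aggregate.
STAR_JOIN = """
SELECT t.month, SUM(f.dollars)
FROM sales_fact AS f
JOIN time_dim AS t ON t.time_key = f.time_key
JOIN product_dim AS p ON p.product_key = f.product_key
WHERE p.category = ?
GROUP BY t.time_key, t.month
ORDER BY t.time_key
"""

rows = conn.execute(STAR_JOIN, ("Auto",)).fetchall()
```

At warehouse scale the fact table holds far more rows than the dimensions, which is why the execution plan for this query shape and the indexing of the composite key appear on the list of software issues.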


    A Checklist for an Ideal Data Warehouse

    The following checklist is from Ralph Kimball's The Data Warehouse Toolkit, Wiley, 1996.

    Preliminary complete list of affected user groups prior to interviews

    Preliminary complete list of legacy data sources prior to interviews

    Data warehouse implementation team identified

    Data warehouse manager identified

    Interview leader identified

    Extract programming manager identified

    End user groups to be interviewed identified

    Data warehouse kickoff meeting with all affected end user groups

    End user interviews

    Marketing interviews

    Finance interviews

    Logistics interviews

    Field management interviews

    Senior management interviews

    Six-inch stack of existing management reports representing all interviewed groups

    Legacy system DBA interviews

    Copy books obtained for candidate legacy systems

    Data dictionary explaining meaning of each candidate table and field

    High-level description of which tables and fields are populated with quality data

    Interview findings report distributed

    Prioritized information needs as expressed by end user community

    Data audit performed showing what data is available to support information needs

    Data warehousing design meeting

    Major processes identified and fact tables laid out

    Grain for each fact table chosen

    Choice of transaction grain vs. time period accumulating snapshot grain

    Dimensions for each fact table identified

    Facts for each fact table with legacy source fields identified


    Dimension attributes with legacy source fields identified

    Core and custom heterogeneous product tables identified

    Slowly changing dimension attributes identified

    Demographic minidimensions identified

    Initial aggregated dimensions identified

    Duration of each fact table


    Drill across that allows multiple fact tables to appear in same report

    Correctly calculated break rows

    Red-Green exception highlighting with interface to drill down

    Ability to use network aggregate navigator with every atomic query issued by tool

    Sequential operations on the answer set such as numbering top N, and rolling

    Ability to extend query syntax for DBMS special functions

    Ability to define very large behavioral groups of customers or products

    Ability to graph data or hand off data to third-party graphics package

    Ability to pivot data or to hand off data to third-party pivot package

    Ability to support OLE hot links with other OLE aware applications

    Ability to place answer set in clipboard or TXT file in Lotus or Excel formats

    Ability to print horizontal and vertical tiled report

    Batch operation

    Graphical user interface user development facilities

    Ability to build a startup screen for the end user

    Ability to define pull down menu items

    Ability to define buttons for running reports and invoking the browser

    Consultants

    Consultant team qualified

    Consultant team has implemented a similar data warehouse

    Consultant team agrees with the dimensional approach

    Consultant team demonstrates competence in prototype test


    Bibliography

    1. Building a Data Warehouse, Second Edition, by W.H. Inmon, Wiley, 1996

    2. The Data Warehouse Toolkit, by Dr. Ralph Kimball, Wiley, 1996

    3. Strategic Database Technology: Management for the Year 2000, by Alan Simon, Morgan Kaufmann, 1995

    4. Applied Decision Support, by Michael W. Davis, Prentice Hall, 1988

    5. Data Warehousing: Passing Fancy or Strategic Imperative, white paper by the Gartner Group, 1995

    6. Knowledge Asset Management and Corporate Memory, white paper by the Hispacom Group, to be published in Aug 1996

    The End

