DATA WAREHOUSING
A REWARDING CAREER
RALPH KIMBALL
MARCH 2015
© Ralph Kimball, 2015
A Classy Problem
• Challenge worthy of the best minds
• Durable, permanent: no quick technical fixes
• Highly visible and important
• Constant new challenges
• Enormous investments in people and technology
• Good salaries
• Interesting careers
Successful Qualities
You need to be interested in three things:
• The business
• The technology
• And what is #3? People!
The mission of the data warehouse is to deliver information most effectively to decision makers, who
• Are not technology enthusiasts
• Do not read the manuals
• But are VERY motivated to use information to make decisions
You need to love business users in spite of their frailties
Deliver Most Effectively
• Simple
  • Obvious, Recognizable
  • Relevant, Actionable
  • Minimize number of cognitive subgoals: count the clicks…
• Fast
  • Keep your hand on the mouse
  • Don’t leave your desk
  • Remember the lesson of Google
  • Don’t give me the flimsy excuse that it’s OK for the query to run for 10 minutes because the answer is “complex” or “important”
  • We just searched billions of web pages in less than a second
What’s a Meta For?
• The Restaurant Metaphor
  • You have a kitchen and a dining room
Building the Presentation Server: The Platform for BI
• Dimensional models (star schemas)
• Driven from business process SOURCES, not reports
• Inherently distributed, but we will INTEGRATE
• Faithfully maintains history
• Gracefully extensible, agile compatible
• Development built on standard techniques
Expose the Star Schema in the UI: Platform for BI
[Figure: star schema with the sales fact table joined to four dimensions]
Sales fact table: Time Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), Dollars, Units, Cost
Time dimension: Time Key (PK), SQL date, Day of Week, Week Number, Month
Product dimension: Product Key (PK), SKU, Description, Brand, Category, Package Type, Size, Flavor
Store dimension: Store Key (PK), Store ID, Store Name, Address, District, Region
Promotion dimension: Promotion Key (PK), Promotion Name, Promotion Type, Price Treatment, Ad Treatment, Display Treatment, Coupon Type
Report built in the UI (drag and drop! ×4, then compute...):
District   Brand        Total Dollars   Total Cost   Gross Profit
Atherton   Clean Fast   $1,233          $1,058       $175
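The drag-and-drop report above corresponds to a simple join and group-by against the star schema. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names are lowercase versions of the slide's labels, only the Product and Store dimensions are built, and the two fact rows are invented so that they add up to the totals shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku TEXT, brand TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, district TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, product_key INTEGER, store_key INTEGER, promotion_key INTEGER,
    dollars INTEGER, units INTEGER, cost INTEGER
);
""")

# Illustrative rows only (assumed values, not from the source).
conn.execute("INSERT INTO product_dim VALUES (1, 'SKU-1', 'Clean Fast')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Store 1', 'Atherton')")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 700, 70, 600),
        (2, 1, 1, 1, 533, 53, 458),
    ],
)

# "Drag and drop" District and Brand as row headers, Dollars and Cost as
# summed facts; Gross Profit is the computed column.
rows = conn.execute("""
    SELECT s.district,
           p.brand,
           SUM(f.dollars)               AS total_dollars,
           SUM(f.cost)                  AS total_cost,
           SUM(f.dollars) - SUM(f.cost) AS gross_profit
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    GROUP BY s.district, p.brand
""").fetchall()

print(rows)
```

The dimension attributes supply the row headers and constraints, and the fact table supplies the numbers being summed.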
Basic Modeling Techniques
• Four steps in the design
  • Choose the process: the data source
  • Choose the grain: business definition of the measurement
  • Choose the dimensions: single valued in presence of the grain
  • Choose the facts: true to the grain
Dimensions are the Soul of the DW
• Wide, verbose, denormalized
• Ideal for bitmapped indexes
• Attributes are the source of constraints and groupings
Resist the Urge to Snowflake the UI
• Denormalized dimension has exactly the same content!
Facts are 1-to-1 with Measurements
• Fact record = event; Event = fact record
Keeping the Pledge: Track History
• Track History with Slowly Changing Dimensions
  • Type 1 SCD: Overwrite
  • Type 2 SCD: The Primary Workhorse
    • Add a row and time stamps for each change
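The difference between the two SCD types can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL routine. The column names (`product_id`, `row_start`, `row_end`), the far-future "current row" date, and the sample attribute values are all assumptions made for the example.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "this row is still current"


def apply_type1(rows, product_id, attr, value):
    """Type 1: overwrite the attribute in every row for this product; history is lost."""
    for row in rows:
        if row["product_id"] == product_id:
            row[attr] = value


def apply_type2(rows, product_id, attr, value, change_date):
    """Type 2: expire the current row, then add a new row with a new surrogate key."""
    current = next(
        r for r in rows
        if r["product_id"] == product_id and r["row_end"] == HIGH_DATE
    )
    current["row_end"] = change_date
    new_row = dict(
        current,
        product_key=max(r["product_key"] for r in rows) + 1,
        row_start=change_date,
        row_end=HIGH_DATE,
    )
    new_row[attr] = value
    rows.append(new_row)
    return new_row


# Hypothetical product dimension with one row.
product_dim = [{
    "product_key": 1, "product_id": "SKU-1",
    "description": "Framis", "category": "Tools",
    "row_start": date(2014, 1, 1), "row_end": HIGH_DATE,
}]

apply_type1(product_dim, "SKU-1", "description", "Framis Deluxe")
new_row = apply_type2(product_dim, "SKU-1", "category", "Hardware", date(2015, 3, 1))
```

After the Type 2 change the dimension holds two rows for the same product: the old row keeps the old category and gets an end date, and the new row carries the new category under a new surrogate key. Fact rows loaded before the change still point at the old key, which is how history is preserved.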
The Chemistry of Fact Tables
• Three types are all you ever need
• Transaction Grain
  • Single point in space and time: an event
• Periodic Snapshot Grain
  • Behavior that has occurred in a repeating period
• Accumulating Snapshot Grain
  • Behavior during a short lived process with a beginning and an end, possibly not finished yet
Retail Sales Fact Table
• Short list of facts, unpredictably sparse or dense
Bank Account Periodic Snapshot
• Open ended list of facts, predictably dense
Inventory Accumulating Snapshot
• Open ended facts, many milepost dates & updates
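The three grains can be contrasted with a small Python sketch. Everything here is illustrative: the account, the amounts, the month list, and the milepost names (`received`, `inspected`, `shipped`) are assumptions, not taken from the source.

```python
from datetime import date

# Transaction grain: one row per event, present only when something happens.
transactions = [
    {"account": "A1", "date": date(2015, 1, 5), "amount": 100},
    {"account": "A1", "date": date(2015, 1, 20), "amount": -30},
    {"account": "A1", "date": date(2015, 2, 3), "amount": 50},
]


def periodic_snapshot(transactions, months):
    """Periodic snapshot grain: one row per account per month, even with no activity.

    Assumes `months` is a consecutive list of (year, month) pairs that starts
    at or before the first transaction.
    """
    rows = []
    for account in sorted({t["account"] for t in transactions}):
        balance = 0
        for year, month in months:
            in_month = [
                t for t in transactions
                if t["account"] == account
                and (t["date"].year, t["date"].month) == (year, month)
            ]
            balance += sum(t["amount"] for t in in_month)
            rows.append({
                "account": account,
                "month": (year, month),
                "ending_balance": balance,
                "transaction_count": len(in_month),
            })
    return rows


def record_milepost(row, milepost, when):
    """Accumulating snapshot grain: the same row is revisited and updated."""
    key = milepost + "_date"
    if key not in row:
        raise KeyError(f"unknown milepost: {milepost}")
    row[key] = when


snapshot = periodic_snapshot(transactions, [(2015, 1), (2015, 2), (2015, 3)])

# One row per inventory lot; dates stay empty until the milepost is reached.
lot = {"lot_id": "L-1", "received_date": None, "inspected_date": None, "shipped_date": None}
record_milepost(lot, "received", date(2015, 3, 2))
record_milepost(lot, "inspected", date(2015, 3, 4))
```

The snapshot has a row for March even though no transaction occurred, which is what "predictably dense" means. The lot row still has an empty `shipped_date`, reflecting a process that is "possibly not finished yet".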
The Most Powerful Result: Conformed Dimensions
• The payload:

Product   Manufacturing Shipments   Warehouse Inventory   Retail Sales   Turns
Framis    2940                      1887                  761            21
Toggle    13338                     9376                  2448           14
Widget    7566                      5748                  2559           23

• What is special about the 4th column?
• Creating and publishing conformed dimensions and facts is 50% technical and 50% political
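A report like the one above is typically produced by "drilling across": each business process is queried separately, grouped by the same conformed Product attribute, and the answer sets are merged on that row header. The sketch below uses the figures from the table. The Turns formula is an assumption: the source does not state one, but the numbers shown are consistent with 52 × retail sales ÷ warehouse inventory, rounded.

```python
# Each dict stands in for the answer set of a separate query against one
# fact table, grouped by the conformed Product dimension.
shipments = {"Framis": 2940, "Toggle": 13338, "Widget": 7566}
inventory = {"Framis": 1887, "Toggle": 9376, "Widget": 5748}
retail_sales = {"Framis": 761, "Toggle": 2448, "Widget": 2559}


def drill_across(*answer_sets):
    """Merge answer sets on their shared row header (the conformed attribute)."""
    products = sorted(set().union(*answer_sets))
    return {p: tuple(a.get(p) for a in answer_sets) for p in products}


def turns(inventory_level, weekly_sales, periods_per_year=52):
    """Assumed formula: annualized sales divided by inventory on hand."""
    return round(periods_per_year * weekly_sales / inventory_level)


merged = drill_across(shipments, inventory, retail_sales)
report = {
    product: (ship, inv, sales, turns(inv, sales))
    for product, (ship, inv, sales) in merged.items()
}

for product, row in report.items():
    print(product, *row)
```

The merge only works because all three processes label their rows with identical Product values. The computed column combines facts from two different fact tables, which is only possible when the dimension is conformed.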
Advanced Techniques
• Factless fact tables
• Hybrid SCD types
• Many valued dimensions and bridge tables
• Ragged hierarchies of indeterminate depth
• Rapidly changing monster dimensions
• Full list of 72 techniques in latest data modeling book, full chapter on website
Manage with the DW Bus Matrix
Drive architecture, project management, communication
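A bus matrix is a grid of business processes (rows) against dimensions (columns). As a rough sketch, it can be held as a mapping from each process to the set of dimensions it uses. The process and dimension names below are borrowed from earlier slides, but which dimensions apply to which process is a hypothetical assignment made only for illustration.

```python
# Hypothetical bus matrix: business process -> dimensions it uses.
bus_matrix = {
    "Manufacturing Shipments": {"Time", "Product"},
    "Warehouse Inventory": {"Time", "Product"},
    "Retail Sales": {"Time", "Product", "Store", "Promotion"},
}


def conformed_candidates(matrix):
    """Dimensions shared by more than one process: these must be conformed."""
    counts = {}
    for dimensions in matrix.values():
        for dimension in dimensions:
            counts[dimension] = counts.get(dimension, 0) + 1
    return sorted(d for d, n in counts.items() if n > 1)


def processes_using(matrix, dimension):
    """Processes affected when a given dimension changes."""
    return sorted(p for p, dims in matrix.items() if dimension in dims)


shared = conformed_candidates(bus_matrix)
store_users = processes_using(bus_matrix, "Store")
```

Reading down a column shows which teams must agree on a dimension, which is where the matrix helps with project planning and communication.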
Big Data Use Cases
• Behavior tracking
• Search ranking
• Ad tracking
• Location and proximity tracking
• Causal factor discovery
• Social CRM
  • Share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact
• Financial account fraud detection/intervention
• System hacking detection/intervention
• Online game gesture tracking
More Big Data Use Cases
• Non-numeric data and unique algorithms
  • Document similarity testing
  • Genomics analysis
  • Cohort group discovery
  • Satellite image comparison
  • CAT scan comparisons
  • Big science data collection
• Complex numeric data
  • Smart utility meters
  • Building sensors
  • In flight aircraft status
• Data bags: name/value pairs with ad hoc content
Houston: We Have a Problem
• The traditional pure relational data warehouse architecture can’t handle ANY of these use cases. We need:
  • Non-scalar data: vectors, arrays, data bags, structured text, free text, images, waveforms
  • Iterative logic, complex branching, advanced statistics
  • Petabyte data sources loaded at gigabytes/second
  • Analysis in place across thousands of distributed processors, data often not in database format, full data scans often needed
  • Data loaded before structure is understood
  • Analysis while loading
Hadoop for Exploratory DW/BI
• HDFS is NOT a database, it’s a file system!
[Architecture diagram; recoverable labels:]
• Sources: Transactions, Free text, Images, Machines/Sensors, Links/Networks
• HDFS Primary Files: Industry standard HW; Fault tolerant; Replicated; Write once(!); Agnostic content; Scalable to “infinity”
• Metadata (system table): HCatalog. All clients can use this to read files
• SQL Query Engines: Hive, Impala, Others… These are query engines, not databases!
• BI Tools: Tableau, Bus Obj, Cognos, QlikView, Others…
• Purpose built for EXTREME I/O speeds; Use ETL tool or Sqoop
• EDW Overflow
Starting a Data Warehouse Career
• Join a DW/BI project (leverage your contacts)
  • Experience trumps everything
  • ETL tools, BI tools, databases, Java, SQL, …
• Migrate over time among
  • Business (end user department)
  • Technology (IT architecture team, ETL development)
  • People (user interface design, BI development)
• Read! For instance:
The Kimball Group Resource
• www.kimballgroup.com
• Best selling data warehouse books
  • NEW BOOK! The Classic “Toolkit” 3rd Ed.
• White Papers on Integration, Data Quality, and Big Data Analytics
• Cloudera Webinars (www.Cloudera.com)
  • Hadoop 101 for EDW Professionals
  • EDW 101 for Hadoop Professionals