Post on 20-Dec-2015
The Data Warehouse (DW) and Business Intelligence (BI) 9.1
COT5230 Data Mining
Week 9
The Data Warehouse (DW) and Business Intelligence (BI)
MONASH - Australia’s International University
Lecture Outline
Overview of Data Warehousing
Data Warehouse Architecture
Overview of Business Intelligence (BI)
OLAP
What is a DW?
A data store to support data analysis or decision support
– Decision support: a methodology to extract information from data
– Decision support system: an arrangement of computerized tools to assist in managerial decision making
Answers questions by combining historical operational data with a business data model that reflects business activity
Data may come from both operational and external sources
– external data - e.g. industry average salaries
Data Warehouse Definitions - 1
The information in a DW is subject-oriented, non-volatile, and of an historic nature, and so DWs tend to contain extremely large datasets
The purpose of the DW is to provide the tools and facilities to manage and deliver complete, timely, accurate, and understandable business information to authorized individuals for effective business decision making
DW implementation needs a company-wide effort that requires user involvement and commitment at all levels
A successful DW implementation tracks return on investment
Data Warehouse Definitions - 2
A DW is a concept, not a product
– It is the compiling, assembling, and consolidating of application data common to user communities at a single logical point
Typical use includes ad hoc queries, “what if”, data matching, trend analysis and other sophisticated information functions
Warehouse data is typically extracted from OLTP systems
A DW can be described as a read-only database that provides users with access to consolidated, historic, or static data extracted from operational databases, usually augmented with external data
Operational Data vs. the DW - 1
Integration
– Data found within the DW is ALWAYS integrated, e.g.
» encoding, measurements of attributes, etc. are standardized
Normalized vs. denormalized
– Operational data is normalized
Timespan
– Operational data is current
– DW data is historical
Granularity
– Operational data is at transaction level
– DW data is at an aggregation level
Operational Data vs. the DW - 2
Dimensionality
– data is clustered according to functional requirements, e.g. all orders to be delivered to a particular suburb
– data analyst requires access to all dimensions
Use
– DW is read only
MIS, or Before the DW
MIS: Management Information System
required detailed knowledge of the operational systems
no Business Information Directory
data quality is ad hoc
limited data integration from source systems
integration and querying performed by MIS specialists using 3+GL tools such as SAS
or at best performing queries using SQL against images of unintegrated operational databases
Inmon’s 12 Rules - 1
DW and operational environments are separated
Integrated DW data
DW contains historical data
DW is snapshot data captured at a particular point in time
DW data is subject-oriented
Inmon’s 12 Rules - 2
No online update
DW SDLC is data-driven
DW contains several levels of data - raw to summarized
Data sources are traced
Meta-data is a critical component
DW contains a charge back mechanism
DW Architecture
[Figure: DW architecture. Layers and their labels:]
Authoritative Source: Source Systems, External systems
Extract / Enhance / Transform Layer: Copy mgt, Extract, Transform; Process once; Business rules; Consistency & controls; Value add
Data Warehouse: Enterprise single image data view; Separates data from application; Fully modelled & documented
Load: Build data for appropriate datamart; Parallel process; Denormalize for specific use; Customise
DataMarts: Meets specific OLAP requirements
Delivery to user: Industry standard tools; Tailored applications where appropriate
Business Information Directory
Source Systems/Authoritative Source
must first identify authoritative source data
Authoritative Source
– atomic data from the creating/owning source system
data propagation must be subject to a delivery contract
data propagation is asynchronous
– no reverse propagation
– no periodic synchronization
delivery must have minimal impact on operational systems
Extract/Enhance/Transform Layer
must create integrated and standardized data
deduping process happens here
denormalize into a format for direct loading into the DW
cleanse
– must remove semantic and syntactic inconsistencies
– return invalid data to the source system for repair
requires a data quality process
simple business transformations
addition of surrogate keys and time variance
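The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production ETL job: the record layout, the `STATE_CODES` mapping, and the `transform` function are hypothetical names introduced for this example.

```python
from datetime import date

# Hypothetical mapping used to standardize inconsistent state encodings.
STATE_CODES = {"vic": "VIC", "victoria": "VIC", "tas": "TAS", "tasmania": "TAS"}


def transform(source_rows, load_date):
    """Cleanse, dedupe and key source rows; return (clean, rejected)."""
    clean, rejected, seen = [], [], set()
    next_key = 1
    for row in source_rows:
        state = STATE_CODES.get(row["state"].strip().lower())
        if state is None:
            # Invalid data goes back to the source system for repair.
            rejected.append(row)
            continue
        natural_key = (row["cust_no"], state)
        if natural_key in seen:
            # Deduping: skip rows already seen in this batch.
            continue
        seen.add(natural_key)
        clean.append({
            "cust_key": next_key,        # surrogate key
            "cust_no": row["cust_no"],   # key from the source system
            "state": state,              # standardized encoding
            "load_date": load_date,      # time variance
        })
        next_key += 1
    return clean, rejected


rows = [
    {"cust_no": "C100", "state": "Victoria"},
    {"cust_no": "C100", "state": "vic "},
    {"cust_no": "C200", "state": "Tas"},
    {"cust_no": "C300", "state": "??"},
]
clean, rejected = transform(rows, date(1999, 12, 1))
```

Here the two spellings of Victoria collapse into one cleansed row, and the row with an unrecognized state is set aside for the source system rather than loaded.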
Handling Inserts/Deltas - 1
Scenarios
– additions to a (1) New or (2) Existing partition
– partitions are (1) Atomic or (2) Aggregates
New partition - atomic or aggregate
– work off-line
– do summation outside of the database and use efficient tools, e.g. Syncsort or C
– then load with SQL*Loader
Handling Inserts/Deltas - 2
Updates to an existing partition
– Atomic Partition
» Unload, Sort, Reload or
» Insert directly into DB - concurrency issues
– Aggregate Partition
» example: records R1 (X, 1) and R2 (X, 2) are stored in the database as the aggregate (X, 3); a new record R3 (X, 1) then arrives
» Update directly to DW
» Unload and update out of the database
» Keep source data and re-sort and re-sum
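The aggregate-partition case can be illustrated with the slide's own numbers. The sketch below assumes a simple in-memory representation; the function names `update_in_place` and `rebuild_from_source` are hypothetical and stand in for "update directly" and "keep source data and re-sum".

```python
from collections import defaultdict

# Source (atomic) records behind the stored aggregate: R1 and R2 for key X.
source = [("R1", "X", 1), ("R2", "X", 2)]
aggregate = {"X": 3}            # summary row currently stored in the DW
delta = [("R3", "X", 1)]        # newly arrived record


def update_in_place(agg, new_rows):
    """Add each delta straight onto a copy of the stored aggregate."""
    result = dict(agg)
    for _rec, key, value in new_rows:
        result[key] = result.get(key, 0) + value
    return result


def rebuild_from_source(kept_rows, new_rows):
    """Keep the source data and re-sum everything from scratch."""
    totals = defaultdict(int)
    for _rec, key, value in kept_rows + new_rows:
        totals[key] += value
    return dict(totals)


updated = update_in_place(aggregate, delta)
rebuilt = rebuild_from_source(source, delta)
```

Both routes arrive at the same total for X. Updating in place is cheaper but touches the live aggregate, while rebuilding needs the source rows to be retained.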
The Data Warehouse
contains atomic data
Star Schema structure
– contains
» Facts
» Dimensions
» Attributes - Surrogate keys
» Attribute Hierarchies
Key Issues
– size
– data retention period - YTD
– backup and recovery
– security
Star Schemas
a data modeling technique used to map decision support data into a relational database
this structure is based on the premise that highly normalized data structures do not serve advanced data analysis requirements well
[Figure: star schema. Fact table SALES at the centre, joined to Dim A Customer (Cust#), Dim B Product (Prod#), Dim C Salesrep (SalesrepID), and Dim D Location (Loc#)]
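A star schema like the one in the figure could be declared as follows, using Python's built-in `sqlite3` module. The table and column names and the sample rows are assumptions made for illustration; only the SALES fact table and its four dimensions come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no TEXT PRIMARY KEY, cust_name TEXT);
CREATE TABLE product  (prod_no TEXT PRIMARY KEY, prod_name TEXT);
CREATE TABLE salesrep (salesrep_id TEXT PRIMARY KEY, rep_name TEXT);
CREATE TABLE location (loc_no TEXT PRIMARY KEY, state TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales (
    cust_no     TEXT REFERENCES customer(cust_no),
    prod_no     TEXT REFERENCES product(prod_no),
    salesrep_id TEXT REFERENCES salesrep(salesrep_id),
    loc_no      TEXT REFERENCES location(loc_no),
    amount      REAL
);
INSERT INTO customer VALUES ('C100', 'Example Customer');
INSERT INTO product  VALUES ('P100', 'Nuts');
INSERT INTO salesrep VALUES ('S1', 'Rep One'), ('S2', 'Rep Two');
INSERT INTO location VALUES ('L1', 'VIC'), ('L2', 'TAS');
INSERT INTO sales VALUES ('C100', 'P100', 'S1', 'L1', 510),
                         ('C100', 'P100', 'S2', 'L2', 490);
""")

# Every dimension is a single join away from the fact table.
rows = conn.execute("""
    SELECT l.state, p.prod_name, SUM(s.amount)
    FROM sales s
    JOIN product  p ON p.prod_no = s.prod_no
    JOIN location l ON l.loc_no  = s.loc_no
    GROUP BY l.state, p.prod_name
    ORDER BY l.state
""").fetchall()
```

Because each dimension table joins directly to the fact table, analytical queries need only one join per dimension they constrain or group by.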
Snowflake Schemas
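[Figure: snowflake schema. Fact table SALES at the centre, joined to Dim A Customer, Dim B Product (Prod#), Dim C Salesrep (SalesrepID), and Dim D Location; further tables Customer Category, Customer Address, and Customer State hang off the Customer dimension]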
Fact Tables
Facts measure something of interest to an enterprise
– atomic level or transactional data
– summarization will reduce volume but may lose information

Summarized:
CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000

Atomic:
CUST#  PROD#  SALESREP  DATE  COST
C100   P100   S1        1/12  $510
C100   P100   S2        2/12  $490
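The trade-off can be made concrete with the two atomic rows shown for customer C100 and product P100. This short sketch assumes a plain tuple layout for the rows and simply sums them by customer and product.

```python
from collections import defaultdict

# Atomic fact rows from the slide: (cust, prod, salesrep, date, cost).
atomic = [
    ("C100", "P100", "S1", "1/12", 510),
    ("C100", "P100", "S2", "2/12", 490),
]

# Summarize by customer and product; salesrep and date are dropped.
summary = defaultdict(int)
for cust, prod, _rep, _date, cost in atomic:
    summary[(cust, prod)] += cost

summary = dict(summary)
```

Two rows become one, which reduces volume, but the summary can no longer say which sales representative made each sale or on which date.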
Dimensions
drill down to atomic data from dimensions or reference tables
A Query
– List sales of Product P100 for each State for each Month of 1999?

Product      Location     Time
P#=P100      State=Each   Year=1999
PName Nuts   Region       Month=Each
PCat
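One way to express this query against a star schema is sketched below with `sqlite3`. The table names, surrogate key columns, and sample rows are hypothetical; the constraints (Product P100, Year 1999) and the grouping (each State, each Month) follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prod_key INTEGER PRIMARY KEY, prod_no TEXT);
CREATE TABLE location (loc_key  INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales (prod_key INTEGER, loc_key INTEGER, time_key INTEGER, amount REAL);
INSERT INTO product  VALUES (1, 'P100'), (2, 'P200');
INSERT INTO location VALUES (1, 'VIC'), (2, 'TAS');
INSERT INTO time_dim VALUES (1, 1999, 1), (2, 1999, 2), (3, 1998, 12);
INSERT INTO sales VALUES (1, 1, 1, 100), (1, 1, 1, 50), (1, 2, 2, 70),
                         (1, 1, 3, 999), (2, 1, 1, 30);
""")

rows = conn.execute("""
    SELECT l.state, t.month, SUM(s.amount) AS total
    FROM sales s
    JOIN product  p ON p.prod_key = s.prod_key
    JOIN location l ON l.loc_key  = s.loc_key
    JOIN time_dim t ON t.time_key = s.time_key
    WHERE p.prod_no = 'P100' AND t.year = 1999   -- constrain Product and Time
    GROUP BY l.state, t.month                    -- each State, each Month
    ORDER BY l.state, t.month
""").fetchall()
```

The 1998 sale and the P200 sale are filtered out by the dimension constraints, and the remaining facts are grouped by state and month.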
Attributes & Attribute Hierarchies
each dimension table contains attributes
surrogate keys are commonly added to improve performance of joins between Fact tables and their associated Dimensions
attributes are used to search, filter or classify facts
Attribute Hierarchies: classification attributes, e.g.
SALES_REGION: VIC, TAS
Datamarts/Customization/Cubes
customization - select only the attributes and rows of interest for export to a datamart or data cube
apply coding techniques to the attributes of interest that suit the search algorithm to be used
each cell of a cube is a view consisting of an aggregation of interest
– e.g. TOTAL_SALES
used as a performance-improving technique to
– pre-aggregate group-by cells
– remove data not required for the problem at hand from the search algorithm
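Pre-aggregating group-by cells can be sketched as follows. This is a naive illustration, assuming three hypothetical dimensions and a `"*"` marker for "aggregated over"; real OLAP engines use far more efficient algorithms.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "state", "month")   # hypothetical dimension names
ALL = "*"                              # marker for "aggregated over"

facts = [
    {"product": "P100", "state": "VIC", "month": 1, "sales": 100},
    {"product": "P100", "state": "TAS", "month": 1, "sales": 70},
    {"product": "P200", "state": "VIC", "month": 2, "sales": 30},
]


def build_cube(rows):
    """Pre-aggregate TOTAL_SALES for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(DIMS) + 1):
        for kept in combinations(DIMS, size):
            for row in rows:
                # Dimensions not in `kept` are collapsed to ALL.
                cell = tuple(row[d] if d in kept else ALL for d in DIMS)
                cube[cell] += row["sales"]
    return dict(cube)


cube = build_cube(facts)
```

A lookup such as `cube[("P100", "*", "*")]` then returns total sales for product P100 across all states and months without rescanning the fact rows.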
Business Intelligence & The DW
most enterprises have a data repository to allow data analysis to occur
databases provide enabling techniques
– efficient data storage and access
– query optimization
80% of knowledge discovery in databases (KDD) is the preparation of the data - this is the data warehouse
the evolution of the desktop, database, networks and AI/search has made it possible to perform KDD in commercial databases
The BI Process - 1
Understand and define the process
Perform data collection and extraction
Perform Data Cleaning and Exploration
Data Engineering
– select attributes of interest
– select records of interest
– map attributes to suit DM algorithms
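The three data engineering steps can be illustrated with a small sketch. The record fields, the one-hot encoding of `state`, and the scaling of `age` are assumptions chosen for the example; the right mapping depends on the data mining algorithm in use.

```python
records = [
    {"cust_no": "C100", "state": "VIC", "age": 34, "phone": "n/a", "bought_x": True},
    {"cust_no": "C200", "state": "TAS", "age": 51, "phone": "n/a", "bought_x": False},
    {"cust_no": "C300", "state": "VIC", "age": None, "phone": "n/a", "bought_x": True},
]

ATTRIBUTES = ("state", "age", "bought_x")        # attributes of interest
STATES = ("TAS", "VIC")                          # categories to one-hot encode


def engineer(rows):
    """Select attributes and records, then map them to numeric vectors."""
    vectors = []
    for row in rows:
        selected = {name: row[name] for name in ATTRIBUTES}
        if selected["age"] is None:
            continue                             # drop incomplete records
        one_hot = [1.0 if selected["state"] == s else 0.0 for s in STATES]
        vectors.append(one_hot + [selected["age"] / 100.0,
                                  1.0 if selected["bought_x"] else 0.0])
    return vectors


vectors = engineer(records)
```

Each surviving record becomes a purely numeric vector, which is the form many algorithms (neural networks, for example) expect as input.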
The BI Process - 2
Algorithm Engineering
– which algorithm to use
– ability to deal with
» quality of input
» quality of output
» performance
Run the data mining algorithm
Preliminary evaluation of the results
Refine the data and the problem
Use the results to implement a business strategy
A BI Model
[Figure: a BI model. Labels: Analysis, Discovery, Pattern Recognition, Prediction/Verification, Model, Answer, Variables, Learning, Adaptive Modelling]
Return on Investment = (Profit from targeted customers buying Product X) / (Cost of Producing the Model and Predicting the Answer)
DM Techniques
Verification Driven Data Mining Techniques
– Naive evaluation - exhaustive search
– Random walk
– ad hoc query
– OLAP
– Hypothesis testing - statistics
Discovery Driven Data Mining Techniques
– Statistical Modeling (e.g. linear regression)
– Visualization
– Rule-based and inductive learning
– Neural networks
– Genetic algorithms (an optimization technique)
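As a small example of the statistical modeling entry, the sketch below fits a straight line by ordinary least squares. The data (advertising spend against sales) and the function name `fit_line` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical monthly data: advertising spend vs. sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(spend, sales)
```

The fitted line can then be used to predict sales for a spend level not present in the data.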
OLAP: On-Line Analytical Processing
an environment for the analysis of multi-dimensional data
– dice
– rotate
– drill-down
– rollup
OLAP provides advanced database support involving attribute selection, attribute encoding, row sampling, and data cleansing, and allows the use of multiple different search engines
– easy-to-use user interface
– open system architecture using local processing power
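The four multi-dimensional operations listed above can be sketched over a tiny in-memory cube. The cell layout `(region, state, month)`, the region names, and the function names are assumptions for illustration and do not correspond to any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical cells: (region, state, month) -> sales.
cells = {
    ("SOUTH", "VIC", 1): 100,
    ("SOUTH", "VIC", 2): 40,
    ("SOUTH", "TAS", 1): 70,
    ("NORTH", "QLD", 1): 20,
}


def dice(data, states, months):
    """Keep only the sub-cube for the chosen states and months."""
    return {k: v for k, v in data.items() if k[1] in states and k[2] in months}


def rollup(data):
    """Aggregate away the state level, leaving (region, month)."""
    totals = defaultdict(int)
    for (region, _state, month), sales in data.items():
        totals[(region, month)] += sales
    return dict(totals)


def drill_down(data, region, month):
    """Return the state-level cells behind one rolled-up cell."""
    return {k[1]: v for k, v in data.items() if k[0] == region and k[2] == month}


def rotate(table):
    """Swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table.items()}


by_region = rollup(cells)
pivoted = rotate(by_region)
```

Rolling up moves from states to regions, drilling down recovers the state-level detail behind one regional total, dicing restricts the cube to a sub-cube, and rotating changes which dimension is read along which axis.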
References
Rob, P. & Coronel, C. Database Systems: Design, Implementation, and Management, 3rd Ed., Nelson 1997
Inmon, W. H. - numerous. See http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example
Kimball, R. - numerous
Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii International Conference on System Sciences, 1998
Lee, A. J. and Rundensteiner, E. A. Data Warehouse Evolution: Consistent Metadata Management.
Gray, J. et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery 1, pp. 29-53, 1997
Maier, D. et al. Selected Research Issues in Decision Support Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998