Date post: | 14-Jan-2016 |
Category: |
Documents |
Upload: | jeremy-nash |
View: | 216 times |
Download: | 2 times |
ISQS 3358, Business IntelligenceISQS 3358, Business Intelligence
Supplemental Notes on the Supplemental Notes on the Term ProjectTerm ProjectZhangxi Lin
Texas Tech University
1
Term project3-4 students form a team to fulfill a data
mart development project. ◦ Stage 1 (10%): Project proposal. March 9◦ Stage 2 (20%): Data mart development. March 25◦ Stage 3 (20%): Data mart populating. Due April 6◦ Stage 4 (20%): Analysis report. Due April 20◦ Stage 5 (30%): The compilation of the previous
deliverables with modifications and enhancements. Due May 1
Detailed instructions: http://zlin.ba.ttu.edu/3358/Projects09.html
Merits of the project outcomes Carefully developed project proposal
demonstrating the understanding of the business requirements, attractive analytics themes, and clearly defined project goal and objectives
Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes
Applications of advanced ETL model or techniques, such as slowly changing dimensions, the use of containers, etc.
Advanced OLAP cube design, and/or optional MDX scripting by self-taught
Rich data analysis outcomes Well-presented final report Demonstrating the creative ideas and skillful data
warehousing ability
More about dimensional More about dimensional modelingmodeling- How to develop a more professional data mart
4
Slowly Changing Slowly Changing DimensionsDimensions The attribute values in a dimension may change over
time, which are critical to understand the dynamics of the business. The ability to track the changes of facts over time is critical to a DW/BI system.◦ Employees changed their departments◦ Home moving (16.8% American moved per year) – zip code
changes possible. The dimensions that have changeable attribute values
slowly changing dimensions (SCDs)◦ Type 1 SCD overwrites the existing attribute value with a new
value. You don’t care about keeping track of historical values◦ Type 2 SCD change tracking – ETL process creates a new row
in the dimension table to capture the new values of the changed item
◦ Type 3 SCD – Similar to Type 2 SCD but only track current state and the original state; two additional attribute: SCD Start Date, SCD Initial Value
Ask business users how they will use the information to determine which SCDs are of type 1 and which are of type 2.
5
Surrogate KeySurrogate Key A surrogate key has a unique value assigned to each row
in the dimension. It becomes primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table
Benefits of surrogate keys:◦ Protect the DW/BI system from changes in the source system◦ Allow the DW/BI system to integrate data from multiple
source system◦ Enable developers to add rows to dimensions that do not
exist in the source system◦ Provide the means for tacking changes in dimension◦ Are efficient in the relational database and analysis services
Surrogate keys are created when doing data warehousing. They are new from the keys in original database
They are also called meaningless keys, substitute keys, non-natural keys, artificial keys
Specifically, surrogate keys are used in slowly changed dimensions (SCD) management
6
Aggregate DimensionsAggregate DimensionsSituation: data at different levels of
granularities Two resolutions
◦Removing a dimension◦Rolling up a dimension’s hierarchy
and provide a new, shrunken dimension at the aggregate level
7
Many-to-many or Multivalued Many-to-many or Multivalued DimensionsDimensionsRelationship between a dimension table and
fact table is called one-to-many: one row in the dimension table may join to many rows in the fact table.
Many-to-many or Multivalued Dimensions are referred to as there are more than one row in a dimension table joining to multiple rows in a fact table
Bridge table supports many-to-many relationship: ◦ fact-dimension◦ dimension-dimension.
8
HierarchiesHierarchiesMeaningful, standard ways to group
the data within a dimension◦Variable-depth hierarchies◦Frequently changing hierarchies
Examples of hierarchy in a dimension◦ Address: street, city, state, country◦ Organization: section, division, branch,
region◦ Time: year, quarter, month, date
9
Heterogeneous ProductsHeterogeneous ProductsSeveral products with differentiated
attributesProblem: sharing one dimension or
use different dimension?Resolutions
◦One family-oriented dimension with core fact and product tables plus specific information for each line of product
10
The Three Fact Table The Three Fact Table TypesTypesTransaction fact tablePeriodic snapshot fact table
◦Cumulative performance over specific time intervals; valuable to combine data across several business processes in the value chain.
Accumulating snapshot fact table◦Constantly updated over time.
11
Junk DimensionsJunk DimensionsAlso called miscellaneous or mystery
dimensionsThey are miscellaneous attributes that don’t
belong to any existing dimension.Typically flags or indictors that describe or
categorize the transaction in some way.Contents are often importantFour alternatives for dealing with them
◦ Leave them in the fact table◦ Create a separate dimension for each attribute◦ Omit them◦ Group them into a single junk dimension
12
Degenerate DimensionsDegenerate DimensionsNo description of its ownNo joining to an actual dimension
tableNo attributesExample: transaction ID
13
Data Warehousing Data Warehousing MethodologyMethodology- Implementing data warehouse systematically
14
Dimensional Modeling Process Preparation
◦ Identify roles and participants◦ Understanding the data architecture strategy◦ Setting up the modeling environment◦ Establishing naming conventions
Data profiling and research◦ Data profiling and source system exploration◦ Interacting with source system experts◦ Identifying core business users◦ Studying existing reporting systems
Building Dimensional models◦ High-level dimensional model design◦ Identifying dimension and fact attributes
Developing the detailed dimensional model Testing the model Reviewing and validating the model
Business Dimensional Lifecycle
16
ProjectPlanning
BusinessReq’ts
definition
Technical Arch.
Design
ProductSelection &Installation
DimensionalModeling
PhysicalDesign
BI Appl.Specification
BIApplication
Development
ETL design &
DevelopmentDeployment
Maintenance
Growth
Project Management
ETL MethodologyETL MethodologyDevelop a high-level mapBuild a sandbox source system (optional)Detailed data profiling Make decisions
◦ The source-to-target mapping◦ How often loading tables◦ The strategy for partitioning the relational and
Analysis Services fact table◦ The strategy for extracting data from each source
systemDe-duplicate key data from each source
system (optional)Develop a strategy for distributing dimension
tables across multiple database servers (optional)
17
Sandbox Source SystemSandbox Source SystemSandbox
◦ A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
◦ A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
In the DW/BI context, sandbox source system is a subset of source database for analytic exploration tasks
How to create◦ Set up a static snapshot of the database◦ By sampling
18
Data ProfilingData ProfilingData profiling is a methodology for learning
about he characteristics of the data It is a hierarchical process that attempt to
build an assessment of the metadata associated with a collection of data sets.
Three levels◦ Bottom – characterizing the values associated with
individual attributes◦ Middle – the assessment looking at relationships
between multiple columns within a single table.◦ Highest level – the profile describing relationships
that exist between data attributes across different tables.
Can run a program against the sandbox source system to obtain the needed information.
19
Decision Issues in ELT System Decision Issues in ELT System DesignDesignSource-to-target mappingLoad frequencyHow much history is needed
20
Strategies for Extracting Strategies for Extracting DataDataExtracting data from packaged source
systems –self-contained data sources◦ May not be good to use their APIs◦ May not be good to use their add-on analytic
systemExtracting directly from the source
databases◦ Strategies vary depending on the nature of the
source databaseExtracting data from incremental loads
◦ How the source database records the changes of the rows
Extracting historical data
21