
ISQS 3358, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech...

Date post: 14-Jan-2016
Transcript
Page 1: ISQS 3358, Business Intelligence

Supplemental Notes on the Term Project
Zhangxi Lin
Texas Tech University

Page 2: Term project

3-4 students form a team to carry out a data mart development project.
◦ Stage 1 (10%): Project proposal. Due March 9
◦ Stage 2 (20%): Data mart development. Due March 25
◦ Stage 3 (20%): Data mart populating. Due April 6
◦ Stage 4 (20%): Analysis report. Due April 20
◦ Stage 5 (30%): The compilation of the previous deliverables with modifications and enhancements. Due May 1

Detailed instructions: http://zlin.ba.ttu.edu/3358/Projects09.html

Page 3: Merits of the project outcomes

Carefully developed project proposal demonstrating an understanding of the business requirements, attractive analytics themes, and clearly defined project goal and objectives
Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes
Applications of advanced ETL models or techniques, such as slowly changing dimensions, the use of containers, etc.
Advanced OLAP cube design, and/or optional self-taught MDX scripting
Rich data analysis outcomes
Well-presented final report
Demonstration of creative ideas and skillful data warehousing ability

Page 4: More about dimensional modeling

How to develop a more professional data mart

Page 5: Slowly Changing Dimensions

The attribute values in a dimension may change over time, and these changes are critical to understanding the dynamics of the business. The ability to track the changes of facts over time is critical to a DW/BI system.
◦ Employees change departments
◦ People move homes (16.8% of Americans move per year), so zip code changes are possible
Dimensions that have changeable attribute values are called slowly changing dimensions (SCDs).
◦ Type 1 SCD overwrites the existing attribute value with a new value. You don't care about keeping track of historical values.
◦ Type 2 SCD tracks changes: the ETL process creates a new row in the dimension table to capture the new values of the changed item.
◦ Type 3 SCD is similar to Type 2 but tracks only the current state and the original state, using two additional attributes: SCD Start Date and SCD Initial Value.
Ask business users how they will use the information to determine which SCDs are of Type 1 and which are of Type 2.
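The difference between Type 1 and Type 2 handling can be sketched in a few lines of Python. This is a minimal illustration, not the way an ETL tool implements it: the dimension is a plain list of dictionaries, and the column names (`employee_key`, `employee_id`, `start_date`, `end_date`, `is_current`) are assumptions made for the example.

```python
from datetime import date


def apply_type1(dim_rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute on every row for this natural key; history is lost."""
    for row in dim_rows:
        if row["employee_id"] == natural_key:
            row[attr] = new_value


def apply_type2(dim_rows, natural_key, attr, new_value, change_date):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(r for r in dim_rows
                   if r["employee_id"] == natural_key and r["is_current"])
    current["is_current"] = False
    current["end_date"] = change_date

    new_row = dict(current)  # copy the unchanged attributes
    new_row.update({
        "employee_key": max(r["employee_key"] for r in dim_rows) + 1,
        attr: new_value,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })
    dim_rows.append(new_row)


# Hypothetical employee dimension with one current row.
employee_dim = [
    {"employee_key": 1, "employee_id": "E100", "department": "Sales",
     "start_date": date(2008, 1, 1), "end_date": None, "is_current": True},
]

# The employee changes department: keep the old row, add a new current row.
apply_type2(employee_dim, "E100", "department", "Marketing", date(2009, 3, 1))
```

After the Type 2 change the dimension holds two rows for the same employee, so facts loaded before and after the move can join to different surrogate keys.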

Page 6: Surrogate Key

A surrogate key has a unique value assigned to each row in the dimension. It becomes the primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table.
Benefits of surrogate keys:
◦ Protect the DW/BI system from changes in the source system
◦ Allow the DW/BI system to integrate data from multiple source systems
◦ Enable developers to add rows to dimensions that do not exist in the source system
◦ Provide the means for tracking changes in dimensions
◦ Are efficient in the relational database and Analysis Services
Surrogate keys are created during data warehousing. They are new keys, distinct from the keys in the original database.
They are also called meaningless keys, substitute keys, non-natural keys, or artificial keys.
In particular, surrogate keys are used in slowly changing dimension (SCD) management.
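As a rough sketch of two of the benefits listed above (integrating multiple source systems and adding rows that do not exist in any source), the following Python assigns surrogate keys from a simple counter. The class name, the source system names, and the "Unknown" row are all hypothetical and chosen only for illustration.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out one integer key per (source system, natural key) pair."""

    def __init__(self):
        self._next_key = count(1)
        self._lookup = {}  # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._lookup:
            self._lookup[pair] = next(self._next_key)
        return self._lookup[pair]


keys = SurrogateKeyGenerator()
customer_dim = []

# Two source systems happen to use the same natural key (42) for different customers.
for source, natural_key, name in [
    ("crm", 42, "Acme Corp"),
    ("billing", 42, "Baker Ltd"),
]:
    customer_dim.append({
        "customer_key": keys.key_for(source, natural_key),
        "source_system": source,
        "source_id": natural_key,
        "name": name,
    })

# A row that exists only in the warehouse, e.g. for facts with no matching customer.
customer_dim.append({
    "customer_key": keys.key_for("dw", "unknown"),
    "source_system": "dw",
    "source_id": None,
    "name": "Unknown",
})
```

Because the fact table joins on `customer_key` only, a renumbering of customers in either source system would change the lookup, not the fact rows.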

Page 7: Aggregate Dimensions

Situation: data at different levels of granularity
Two resolutions
◦ Removing a dimension
◦ Rolling up a dimension's hierarchy and providing a new, shrunken dimension at the aggregate level
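The second resolution can be illustrated with a small Python sketch that rolls a daily date dimension up to a month-level shrunken dimension and aggregates the facts to match. The table contents and column names are invented for the example.

```python
from collections import defaultdict

# Hypothetical day-grain date dimension and sales fact table.
date_dim = [
    {"date_key": 20090105, "month_key": 200901, "month_name": "2009-01", "year": 2009},
    {"date_key": 20090120, "month_key": 200901, "month_name": "2009-01", "year": 2009},
    {"date_key": 20090203, "month_key": 200902, "month_name": "2009-02", "year": 2009},
]
sales_fact = [
    {"date_key": 20090105, "amount": 100.0},
    {"date_key": 20090120, "amount": 50.0},
    {"date_key": 20090203, "amount": 75.0},
]

# Shrunken dimension: one row per month, keeping only month-level attributes.
month_dim = {}
for row in date_dim:
    month_dim[row["month_key"]] = {
        "month_key": row["month_key"],
        "month_name": row["month_name"],
        "year": row["year"],
    }

# Aggregate fact table at the month grain, keyed by the shrunken dimension.
date_to_month = {row["date_key"]: row["month_key"] for row in date_dim}
monthly_sales = defaultdict(float)
for fact in sales_fact:
    monthly_sales[date_to_month[fact["date_key"]]] += fact["amount"]
```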

Page 8: Many-to-many or Multivalued Dimensions

The usual relationship between a dimension table and a fact table is one-to-many: one row in the dimension table may join to many rows in the fact table.
A many-to-many or multivalued dimension is one where more than one row in the dimension table joins to multiple rows in the fact table.
A bridge table supports a many-to-many relationship:
◦ fact-dimension
◦ dimension-dimension
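A bridge table can be pictured as a list of key pairs sitting between the two sides of the relationship. The sketch below uses a hypothetical banking example (joint accounts with several customers); the table names and data are assumptions, not part of the original notes.

```python
# Hypothetical customer dimension: customer_key -> name.
customer_dim = {1: "Alice", 2: "Bob", 3: "Carol"}

# Hypothetical fact table at the account grain.
balance_fact = [
    {"account_key": 10, "balance": 500.0},
    {"account_key": 11, "balance": 200.0},
]

# Bridge table: one row per (account, customer) pair.
account_customer_bridge = [(10, 1), (10, 2), (11, 2), (11, 3)]


def customers_for_account(account_key):
    """All customers attached to one fact row's account."""
    return [customer_dim[c] for a, c in account_customer_bridge if a == account_key]


def balances_for_customer(customer_key):
    """All account balances reachable from one customer through the bridge."""
    accounts = {a for a, c in account_customer_bridge if c == customer_key}
    return [f["balance"] for f in balance_fact if f["account_key"] in accounts]
```

Note that adding up `balances_for_customer` over every customer would count each joint account more than once, which is the usual caution when reporting through a bridge table.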

Page 9: Hierarchies

Meaningful, standard ways to group the data within a dimension
◦ Variable-depth hierarchies
◦ Frequently changing hierarchies
Examples of hierarchy in a dimension
◦ Address: street, city, state, country
◦ Organization: section, division, branch, region
◦ Time: year, quarter, month, date

Page 10: Heterogeneous Products

Several products with differentiated attributes
Problem: share one dimension or use different dimensions?
Resolutions
◦ One family-oriented dimension with core fact and product tables, plus specific information for each product line

Page 11: The Three Fact Table Types

Transaction fact table
Periodic snapshot fact table
◦ Cumulative performance over specific time intervals; valuable for combining data across several business processes in the value chain.
Accumulating snapshot fact table
◦ Constantly updated over time.
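To make the three types concrete, the sketch below derives a monthly periodic snapshot from transaction-grain rows and shows an accumulating snapshot row being updated as milestones occur. The account, month, and milestone names are hypothetical, and the month strings are assumed to be in "YYYY-MM" form so that they sort chronologically.

```python
from collections import defaultdict

# Transaction fact table: one row per event.
transaction_fact = [
    {"account": "A1", "month": "2009-01", "amount": 100.0},
    {"account": "A1", "month": "2009-01", "amount": -30.0},
    {"account": "A1", "month": "2009-02", "amount": 50.0},
]


def build_periodic_snapshot(transactions):
    """One row per account per month holding the cumulative end-of-month balance."""
    net_by_period = defaultdict(float)
    for t in transactions:
        net_by_period[(t["account"], t["month"])] += t["amount"]

    snapshot, running = [], defaultdict(float)
    for account, month in sorted(net_by_period):
        running[account] += net_by_period[(account, month)]
        snapshot.append({"account": account, "month": month,
                         "end_balance": running[account]})
    return snapshot


monthly_snapshot = build_periodic_snapshot(transaction_fact)

# Accumulating snapshot: a single row per order, revisited as each milestone is reached.
order_row = {"order_id": 1, "ordered_on": "2009-03-01",
             "shipped_on": None, "delivered_on": None}


def record_milestone(row, milestone, when):
    """Update the existing row in place instead of inserting a new one."""
    row[milestone] = when


record_milestone(order_row, "shipped_on", "2009-03-04")
```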

Page 12: Junk Dimensions

Also called miscellaneous or mystery dimensions
They are miscellaneous attributes that don't belong to any existing dimension.
Typically flags or indicators that describe or categorize the transaction in some way.
Contents are often important
Four alternatives for dealing with them
◦ Leave them in the fact table
◦ Create a separate dimension for each attribute
◦ Omit them
◦ Group them into a single junk dimension
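The fourth alternative, grouping the flags into a single junk dimension, might look like the following sketch. It builds one dimension row for every combination of flag values and gives each transaction a single key. The flag names and their allowed values are made up for the example.

```python
from itertools import product

# Hypothetical low-cardinality flags that belong to no other dimension.
flag_domains = {
    "payment_type": ["cash", "credit"],
    "is_gift": [False, True],
    "is_online": [False, True],
}
names = list(flag_domains)

# One junk dimension row per combination of flag values (2 * 2 * 2 = 8 rows).
junk_dim = [
    {"junk_key": key, **dict(zip(names, combo))}
    for key, combo in enumerate(product(*flag_domains.values()), start=1)
]

# Lookup used at load time to replace the flags with a single foreign key.
junk_lookup = {tuple(row[n] for n in names): row["junk_key"] for row in junk_dim}


def junk_key_for(transaction):
    """Return the junk dimension key matching this transaction's flag values."""
    return junk_lookup[tuple(transaction[n] for n in names)]
```

The fact table then carries one `junk_key` column instead of three separate flag columns.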

Page 13: Degenerate Dimensions

No description of its own
No joining to an actual dimension table
No attributes
Example: transaction ID

Page 14: Data Warehousing Methodology

Implementing a data warehouse systematically

Page 15: Dimensional Modeling Process

Preparation
◦ Identifying roles and participants
◦ Understanding the data architecture strategy
◦ Setting up the modeling environment
◦ Establishing naming conventions
Data profiling and research
◦ Data profiling and source system exploration
◦ Interacting with source system experts
◦ Identifying core business users
◦ Studying existing reporting systems
Building dimensional models
◦ High-level dimensional model design
◦ Identifying dimension and fact attributes
Developing the detailed dimensional model
Testing the model
Reviewing and validating the model

Page 16: Business Dimensional Lifecycle

[Figure: Business Dimensional Lifecycle diagram. Boxes: Project Planning; Business Requirements Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; ETL Design & Development; BI Application Specification; BI Application Development; Deployment; Maintenance; Growth; Project Management.]

Page 17: ETL Methodology

Develop a high-level map
Build a sandbox source system (optional)
Detailed data profiling
Make decisions
◦ The source-to-target mapping
◦ How often to load tables
◦ The strategy for partitioning the relational and Analysis Services fact tables
◦ The strategy for extracting data from each source system
De-duplicate key data from each source system (optional)
Develop a strategy for distributing dimension tables across multiple database servers (optional)
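One lightweight way to record a source-to-target mapping is as a table of source column, target column, and transformation. The Python sketch below expresses such a table as a dictionary and applies it to one row. All column names and transformations are hypothetical.

```python
# Source column -> (target column, transformation applied during the load).
source_to_target = {
    "cust_no":   ("customer_id", str),
    "cust_name": ("customer_name", str.strip),
    "zip":       ("postal_code", lambda v: str(v).zfill(5)),
}


def map_row(source_row):
    """Rename and transform one source row according to the mapping."""
    return {
        target: transform(source_row[source])
        for source, (target, transform) in source_to_target.items()
    }


# Example source row with stray whitespace and a zip code that lost its leading zero.
target_row = map_row({"cust_no": 17, "cust_name": "  Acme ", "zip": 7940})
```

Keeping the mapping as data rather than scattering it through load code makes it easier to review with source system experts.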

Page 18: Sandbox Source System

Sandbox
◦ A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
◦ A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks.
How to create
◦ Set up a static snapshot of the database
◦ By sampling
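Both ways of creating a sandbox can be sketched on in-memory rows. This is only an illustration under the assumption that the source table fits in a Python list; the function names and the default seed are invented for the example.

```python
import copy
import random


def snapshot_sandbox(source_rows):
    """Static snapshot: a full copy that later source changes cannot affect."""
    return copy.deepcopy(source_rows)


def sampled_sandbox(source_rows, sample_fraction, seed=0):
    """Random sample of the source rows; a fixed seed makes the sandbox repeatable."""
    rng = random.Random(seed)
    sample_size = min(len(source_rows),
                      max(1, round(len(source_rows) * sample_fraction)))
    return rng.sample(source_rows, sample_size)


# Hypothetical source table of 1,000 orders.
source_orders = [{"order_id": i, "amount": float(i % 50)} for i in range(1000)]

sandbox = sampled_sandbox(source_orders, sample_fraction=0.1)
frozen = snapshot_sandbox(source_orders)
```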

Page 19: Data Profiling

Data profiling is a methodology for learning about the characteristics of the data.
It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets.
Three levels
◦ Bottom: characterizing the values associated with individual attributes
◦ Middle: the assessment looking at relationships between multiple columns within a single table
◦ Highest level: the profile describing relationships that exist between data attributes across different tables
A program can be run against the sandbox source system to obtain the needed information.
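A profiling program of the kind mentioned above could start as small as the following sketch, with one function per level. The specific checks chosen here (null and distinct counts, a column-determines-column test, and an orphan foreign key test) are examples of what each level might look at, and the sample tables are hypothetical.

```python
def profile_column(rows, column):
    """Bottom level: basic statistics for a single attribute."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def determines(rows, determinant, dependent):
    """Middle level: does each value of one column map to a single value of another?"""
    seen = {}
    for row in rows:
        if seen.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True


def orphan_keys(child_rows, fk, parent_rows, pk):
    """Highest level: foreign key values with no matching row in the parent table."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


customers = [
    {"customer_id": 1, "zip": "79409", "state": "TX"},
    {"customer_id": 2, "zip": "79409", "state": "TX"},
    {"customer_id": 3, "zip": None, "state": "NM"},
]
orders = [{"order_id": 1, "customer_id": 1}, {"order_id": 2, "customer_id": 9}]

zip_profile = profile_column(customers, "zip")
zip_gives_state = determines(customers, "zip", "state")
missing_customers = orphan_keys(orders, "customer_id", customers, "customer_id")
```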

Page 20: Decision Issues in ETL System Design

Source-to-target mapping
Load frequency
How much history is needed

Page 21: Strategies for Extracting Data

Extracting data from packaged source systems (self-contained data sources)
◦ May not be good to use their APIs
◦ May not be good to use their add-on analytic system
Extracting directly from the source databases
◦ Strategies vary depending on the nature of the source database
Extracting data for incremental loads
◦ How the source database records the changes of the rows
Extracting historical data
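How an incremental extract works depends on how the source records row changes. As one possible case, the sketch below assumes each source row carries a `last_modified` timestamp and keeps a watermark between runs. Both the column name and the watermark approach are assumptions for illustration; other sources may use change logs or triggers instead.

```python
from datetime import datetime


def extract_changed_rows(source_rows, last_extracted_at):
    """Return rows modified after the previous run, plus the new watermark."""
    changed = [r for r in source_rows if r["last_modified"] > last_extracted_at]
    new_watermark = max((r["last_modified"] for r in changed),
                        default=last_extracted_at)
    return changed, new_watermark


# Hypothetical source table with a last-modified timestamp on every row.
source_rows = [
    {"id": 1, "last_modified": datetime(2009, 4, 1, 9, 0)},
    {"id": 2, "last_modified": datetime(2009, 4, 2, 14, 30)},
    {"id": 3, "last_modified": datetime(2009, 4, 3, 8, 15)},
]

# The previous load ran at the end of April 1, so only rows 2 and 3 are picked up.
changed, watermark = extract_changed_rows(source_rows, datetime(2009, 4, 1, 23, 59))
```

The returned watermark would be stored and passed in as `last_extracted_at` on the next run.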

