CSE-634 Data Mining Concepts and Techniques
Spring 2007
Data Warehousing and OLAP Technology, Part I
By Group 2:
Anuradha T P – 106019423
Karthik Bhade – 105840048
Maduri Narasimhan – 105791690
Sumit Chopra – 105959878
Guidance: Prof. Anita Wasilewska, Department of Computer Science, SUNY Stony Brook
References
[1] Data Mining Concepts and Techniques – Jiawei Han and Micheline Kamber
[2] Data Mining Concepts and Techniques – Jiawei Han and Micheline Kamber – Book Slides
[3] Sections 3.1, 3.2, and 3.3
[4] http://www.daneil-lemire.com
[5] http://www.kalmstrom.nu
Knowledge is the antidote to fear.
- Ralph Waldo Emerson
What is a Data Warehouse?
o Defined in many different ways.
A decision support database that is maintained separately
from the organization’s operational database.
Support information processing by providing a solid platform
of consolidated, historical data for analysis.
o “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
o Data warehousing:
The process of constructing and using data warehouses
Data Warehouse – Subject Oriented
o Organized around major subjects, such as customer,
product, sales.
o Focused on the modeling and analysis of data for decision
makers, not on daily operations
o Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
Data Mining Concepts and Techniques - Book Slides
Data Warehouse - Integrated
o Constructed by integrating multiple, heterogeneous data sources:
relational databases, flat files, on-line transaction records
o Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
When data is moved to the warehouse, it is converted.
o E.g., sales data may be in a relational database and customer information in flat files.
Data Warehouse - Time Variant
o The time horizon for the data warehouse is significantly longer than that of operational database systems
Operational database: current value
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
o Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time element”
Data Warehouse - Nonvolatile
o A physically separate store of data, transformed from the
operational environment
o Operational update of data does not occur in the data
warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
Heterogeneous Databases
o Consists of a set of interconnected, autonomous databases.
o Objects in one database may differ from objects in other databases.
o Information exchange across such databases is difficult.
Data Warehouse vs. Heterogeneous DBMS
o Heterogeneous DBMS: A query driven approach
Build wrappers/mediators on top of heterogeneous databases
A meta-dictionary is used to translate the query into queries
appropriate for individual heterogeneous sites.
The results are integrated into a global answer set.
This approach involves complex information filtering.
Inefficient and potentially expensive.
o Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
Operational DBMS
o They consist of tables, each with a set of attributes, and store a large set of tuples.
o They use the Entity-Relationship (ER) data model.
o They are used to store transactional data.
o They contain the most current information.
o Thus they are known as Online Transaction Processing (OLTP) systems.
Data Warehouse vs. Operational DBMS
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                    OLTP                                OLAP
users               clerk, IT professional              knowledge worker
function            day to day operations               decision support
DB design           application-oriented                subject-oriented
data                current, up-to-date, detailed,      historical, summarized, multidimensional,
                    flat relational, isolated           integrated, consolidated
usage               repetitive                          ad-hoc
access              read/write,                         lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction           complex query
# records accessed  tens                                millions
# users             thousands                           hundreds
DB size             100MB-GB                            100GB-TB
metric              transaction throughput              query throughput, response
Why Separate Data Warehouse?
o High performance for both systems
DBMS: tuned for Online Transaction Processing (OLTP) systems
Warehouse: tuned for Online Analytical Processing (OLAP) systems involving complex OLAP queries
Processing OLAP queries in the operational DBMS would degrade the performance of operational tasks.
o Decision support requires historical data, which operational databases do not typically maintain.
o Decision support requires consolidation of data from heterogeneous sources.
o Solution
Maintain separate database systems that support the special primitives and structures suitable for storing, accessing, and processing OLAP-specific data.
Multidimensional Data Model
o A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
o A data cube models n-D data and is defined by dimensions and facts.
Dimensions: the entities with respect to which an organization wants to keep records, such as item (item_name).
Facts: the subjects of decision-oriented analysis, such as dollars_sold or units_sold.
Facts are numerical measures: the quantities by which we want to analyze relationships between dimensions.
o A multidimensional data model is typically organized around a central theme, like sales, and is represented by a fact table.
The fact table contains the facts and keys to each of the related dimension tables.
Sales volume as a function of Product, Date, Country
[Figure: data cube with axes Product (TV, VCR, PC, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum). Highlighted cell: total annual sales of TV in U.S.A.]
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month, Week > Day
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
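The lattice above can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The following is a minimal Python sketch, assuming no concept hierarchies on the dimensions (so n dimensions give 2^n cuboids); the function name `cuboid_lattice` is hypothetical and not part of the slides.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions (one cuboid each) by its size."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

# Level 0 is the apex cuboid (all), level 4 is the base cuboid.
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

# Without hierarchies, 4 dimensions give 2**4 = 16 cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
```

The number of cuboids doubles with each added dimension, which is one reason a warehouse usually materializes only some of them.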
Schemas for Multidimensional Databases
o Star schema: A fact table in the middle connected to a set of dimension tables.
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
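To make the star schema concrete, here is a small sketch using Python's built-in sqlite3 module. It keeps only two of the four dimensions and a few columns. The table names `time_dim`, `item_dim`, and `sales_fact` and all rows are invented for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
-- The fact table holds the measures plus one key per dimension table.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2006), (2, "Q2", 2006)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "TV", "home entertainment"), (2, "PC", "computer")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 5000.0), (1, 2, 4, 4000.0),
                  (2, 1, 7, 3500.0), (2, 2, 6, 6000.0)])

# A typical star join: aggregate a measure grouped by a dimension attribute.
rows = conn.execute("""
    SELECT i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
print(rows)
```

Every analytical query follows the same pattern: join the fact table to the dimension tables it needs, then group by dimension attributes.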
o Snowflake schema: A refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
o Fact constellation: Multiple fact tables share dimension tables. The schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Cube Definition syntax in DMQL
o Cube Definition (Fact Definition)
define cube (cube_name) [dimension_list]: (measure_list)
Example: define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollar
o Dimension Definition (Dimension Table)
define dimension (dimension_name) as ((attribute_or_subdimension_list))
Example: define dimension branch as (branch_key, branch_name, branch_type)
o Special Case (shared dimension table, as in a fact constellation)
define dimension (dimension_name) as (dimension_in_first_cube) in cube (first_cube_name)
Defining Star Schema in DMQL
Example
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Defining Snowflake Schema in DMQL
Example
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
Example
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Data Mining Concepts and Techniques- Sec 3.2.4
Measures of Data cubes:
Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
E.g., count(), sum(), min(), max()
Algebraic: the measure can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg(), standard_deviation()
Holistic: there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments that characterizes the computation.
E.g., median(), mode(), rank()
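The three categories can be checked on a small example. The sketch below, in Python with made-up numbers, splits a data set into three partitions and compares aggregates combined from the partitions with aggregates over all the data.

```python
from statistics import median

# Made-up data, split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg is computed from two distributive results (sum and count).
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining partial medians does not, in general, give the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the median of the partial medians differs from the true median. This is why a holistic measure cannot be rolled up from a fixed number of stored subaggregates the way sum() and avg() can.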
Data Mining Concepts and Techniques- Fig 3.7
Concept Hierarchies
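[Figure: concept hierarchy for the location dimension]
all: all
country: Canada, USA
state: British Columbia, ..., Ontario (Canada); New York, ..., Illinois (USA)
city: Vancouver, ..., Victoria (British Columbia); Toronto, ... (Ontario); Buffalo, ..., New York (New York); Chicago (Illinois)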
Typical OLAP Operations
o Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Roll up may be performed by removing one or more dimensions
o Drill down (roll down): reverse of roll-up, from higher-level summary to lower-level summary or detailed data
Drill down may be performed by introducing new dimensions
o Slice and dice: project and select
Slice: selection on one dimension
Dice: selection on two or more dimensions
o Pivot (rotate): rotate the data axes to reorient the cube for visualization, e.g., 3D to a series of 2D planes
o Other operations
Drill across: queries involving (across) more than one fact table
Ranking: top N or bottom N items in lists
Computing moving averages, growth rates, etc.
An OLAP engine is a powerful data analysis tool.
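Roll-up by dimension reduction, slice, and dice can be sketched on a tiny cube stored as a Python dictionary. This is only an illustration under stated assumptions: the cell values, the `DIMS` tuple, and the helper names `roll_up` and `dice` are hypothetical, and a real OLAP engine works on precomputed cuboids rather than scanning cells.

```python
from collections import defaultdict

# Hypothetical base cuboid: (product, quarter, country) -> dollars_sold
DIMS = ("product", "quarter", "country")
cube = {
    ("TV", "Q1", "USA"): 100, ("TV", "Q2", "USA"): 120,
    ("TV", "Q1", "Canada"): 40, ("PC", "Q1", "USA"): 200,
    ("PC", "Q2", "Canada"): 80,
}

def roll_up(cube, drop):
    """Roll up by dimension reduction: sum out the dimension named `drop`."""
    idx = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:idx] + key[idx + 1:]] += value
    return dict(out)

def dice(cube, **allowed):
    """Keep cells whose coordinates are in the allowed sets.

    A selection on one dimension is a slice; on two or more, a dice.
    """
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

by_product_country = roll_up(cube, "quarter")            # quarter removed
q1_slice = dice(cube, quarter={"Q1"})                    # slice
usa_tv = dice(cube, product={"TV"}, country={"USA"})     # dice
```

Drill-down is the reverse of `roll_up`. It needs the more detailed cuboid to be available, since a summary cannot be expanded back into detail.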
Data Mining Concepts and Techniques- Fig 3.11
A Starnet Query Model
[Figure: starnet with one radial line per dimension]
location: continent, country, province or state, city, street
item: name, brand, category, type
time: day, month, quarter, year
- Each line represents a concept hierarchy for a dimension
- Each abstraction level is called a footprint
- The starnet forms the basis for querying a multi-D model
Data Warehouse Architecture
o Design and Construction of Data Warehouse
o Three-tier architecture
o Warehouse servers for OLAP Processing
Design – A Business Analysis Framework
Why data warehouse for business analysts?
Competitive advantage – relevant information to measure performance and make critical adjustments.
Business Productivity – quickly and efficiently gather information that accurately describes the organization.
Customer relationship management – consistent view of customers and items across all lines of business, departments and all markets.
Cost reduction – tracking trends, patterns and exceptions over long period of time in a consistent and reliable manner.
Views for Design
o Top down View
Allows the selection of the relevant information necessary for the data warehouse.
The information matches the current and coming business needs.
o Data source View
Exposes information being captured, stored and managed by operational systems.
It is documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.
Data sources are modeled using Entity-relationship model or CASE (Computer Aided Software Engineering) tools.
Contd..
o Data Warehouse View
It represents the information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date and time of origin, added to provide historical context.
o Business Query View
It is the view of data in the data warehouse from the perspective of the end user.
Skill Sets
o Business Skills
o Technology Skills
o Program Management Skills
Design Process
o Top-down Approach
Starts with overall design
Technology is mature
Business problems are clear and well understood
o Bottom-up Approach
Starts with experiments and prototypes
Early stage of business modeling and technology development
o Combined Approach
Planned and strategic nature of top-down approach
Rapid implementation and opportunistic application of bottom-up approach
Software Engineering View of Design Process
o Steps in design and construction
Planning
Requirements study
Problem analysis
Warehouse design
Data integration and testing
Deployment of data warehouse
Contd..
Development Methods
o Waterfall Method
Performs structured and systematic analysis at each step before proceeding to the next.
o Spiral Method
Involves rapid generation of functional systems with short intervals between releases.
The spiral model is a good choice for data warehouse development, especially for data marts.
General Steps in Warehouse design Process
o Choose a business process to model
o Choose the grain of the business process, e.g., individual transactions, snapshot
o Choose the dimensions that will apply to each fact table record, e.g., time, item, customer, supplier, status
o Choose the measures that will populate each fact table record, e.g., dollars_sold, units_sold
Data Warehouse Architecture
o Design and Construction of Data Warehouse
o Three-tier architecture
o Warehouse servers for OLAP Processing
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture, from left to right]
Data Sources: operational DBs, other sources
Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
Data Storage: data warehouse, data marts
OLAP Server: OLAP engine, which serves the front-end tools
Front-End Tools: analysis, query, reports, data mining
Data Warehouse Models
o Enterprise Warehouse
o Data Mart
o Virtual Warehouse
Enterprise Warehouse
o Collects all of the information about subjects spanning the entire organization.
o Provides corporate-wide data integration, from one or more operational systems or external information providers, and is cross-functional in scope.
o Can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
o Implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms.
o Requires extensive business modeling and may take years to design and build.
Data Mart
o Contains a subset of corporate-wide data that is of value to a specific group of users.
o The data in data marts tends to be summarized.
o Implemented on low-cost departmental servers that are UNIX or Windows/NT-based.
o It may involve complex integration in the long run if its design and planning were not enterprise-wide.
o Depending on the source of data, data marts are either independent or dependent.
Independent data marts: data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographical area.
Dependent data marts: sourced directly from enterprise data warehouses.
Virtual Warehouse
o It is a set of views over operational databases.
o For efficient query processing, only some of the possible summary views may be materialized.
o It is easy to build but requires excess capacity on operational database servers.
Data Warehouse Development: A Recommended Approach
[Figure: recommended development path]
Define a high-level corporate data model
Data marts and the enterprise data warehouse, each developed with model refinement
Distributed data marts
Multi-tier data warehouse
Data Warehouse Architecture
o Design and Construction of Data Warehouse
o Three-tier architecture
o Warehouse servers for OLAP Processing
Types of OLAP Servers
o Relational OLAP (ROLAP) Servers
Intermediate servers that stand between a relational back-end server and client front-end tools.
They use a relational or extended-relational DBMS to store and manage warehouse data.
They also include optimization for each DBMS back end and implementation of aggregation navigation logic.
o Multidimensional OLAP (MOLAP) Servers
Support multidimensional views of data through array-based multidimensional storage engines.
They map multidimensional views to data cube array structures.
Data cubes allow fast indexing to precomputed summarized data.
The storage utilization may be low if the data is sparse.
Dense subcubes are identified and stored as array structures.
Sparse subcubes employ compression technology for efficient storage utilization.
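The trade-off in array-based storage can be seen in a few lines of Python. In this sketch the cube shape and cell values are invented. The dictionary `cells` stands in for a sparse representation, and the flat list for an array-based one. It is not the storage format or compression scheme of any actual MOLAP server.

```python
# Hypothetical 3 x 4 x 3 cube (product x quarter x country) with few non-empty cells.
shape = (3, 4, 3)
cells = {(0, 0, 0): 100.0, (0, 1, 0): 120.0, (1, 0, 2): 80.0}

def offset(shape, i, j, k):
    """Row-major position of cell (i, j, k) in the flat array."""
    return (i * shape[1] + j) * shape[2] + k

def dense_array(shape, cells):
    """Array-based storage: one slot per possible cell, addressed by offset."""
    array = [0.0] * (shape[0] * shape[1] * shape[2])
    for (i, j, k), value in cells.items():
        array[offset(shape, i, j, k)] = value
    return array

dense = dense_array(shape, cells)

# Fast indexing: a cell is found by arithmetic on its coordinates, with no search.
value = dense[offset(shape, 1, 0, 2)]

# Low utilization when the data is sparse: 3 of 36 slots are in use.
utilization = len(cells) / len(dense)
```

Storing only the non-empty cells, as the `cells` dictionary does, avoids the wasted slots at the cost of a lookup structure. This mirrors the dense/sparse subcube split described above.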
Contd..
o Hybrid OLAP (HOLAP) Servers
Combine ROLAP’s scalability and MOLAP’s fast computation.
HOLAP may allow large volumes of detail data to be stored in a relational database.
Aggregations are kept in a separate MOLAP store.
Microsoft SQL Server 7.0 supports a hybrid OLAP server.
o Specialized SQL Servers
Provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
OLAP Reporting tool for Excel
Cited from www.kalmstrom.nu Kalmstrom.nu Outlook Solutions
[Screenshot of OLAP Reporting Tool with callouts:]
- This list contains the saved report views.
- To the right you see the current data displayed in the format defined in the report view. The views contain only layout options, no data.
- The graph part of OLAP Reporting Tool works like an Excel chart.
- Select which information you want to see.
- The pivot part of OLAP Reporting Tool works very much like an Excel pivot table.
- Saves the current graph as a .gif file.
Anywhere the dropdown symbol is displayed you can filter the information. By simply clicking the dropdown and selecting one or more checkboxes you can change what information is being displayed. In the example above it is possible to filter all of the fields in the red circles. For example, I could filter to show only the items sold in Zacatecas and Veracruz with four clicks:
1. De-select the All checkbox
2. Select the Mexico Central checkbox (all three regions within Mexico Central will be selected)
3. De-select the DF region
4. Press OK
You can very easily drill down to find data on lower levels. Both of the areas circled in red can be used to see the sales figures per type of promotion in the Sunday Paper, as in the example here. Another very common example of drilldown is to see the values per month from a per-quarter view. To drill down in the pivot view, simply click the + signs. In the graph you will need to right-click on the category you want to expand or drill into. (Only possible with Excel 2002 or later.)
These are the basic steps for creating a multi-graph.
[Steps 1 to 3: shown as screenshots]
A new area is shown. Drag fields into it to create a multi-graph.
The multi-graph feature is quite unique and is easy to create in OLAP Reporting. To do it in Excel is more complicated.
Technical Paper
Analyzing Large Collections of Electronic Text Using OLAP
Steven Keith, Owen Kaser
University of New Brunswick
July 11, 2005
- Maduri Rajan Narasimhan
WoW
Creation of user-driven tools to interface with a (Data) Warehouse of Words (WoW) is needed.
A WoW is built by an Extraction, Transformation, and Loading (ETL) procedure, which processes the text and aggregates data from different sources.
A WoW stores its data in data cubes.
A data cube can be abstracted as a k-dimensional array with several predefined operations such as slicing, dicing, rolling up and drilling down.
These operations allow the user to focus on just some subset of the data, at the desired granularity.
On-Line Analytical Processing (OLAP) provides near constant-time answers to queries over large multidimensional data sets.
OLAP
OLAP is especially applicable when many aggregate queries such as sum and average are of interest.
Thus, data warehouses and OLAP have been used widely in business applications.
The main advantage a user-driven OLAP tool would provide is flexibility.
While IR and Artificial Intelligence tools are well suited to their single function, a user-driven tool gives a wide variety of users the freedom to pursue their individual research.
A simple user-driven application is the most reasonable solution for those users not already accustomed to writing their own MDX or SQL queries.
Practical Applications
User-driven analytical tools are used in the humanities for author attribution, lexical analysis, and stylometric analysis.
Author attribution is determining the authorship of an anonymous piece of writing through various stylistic and statistical methods.
Lexical analysis includes many measurements of vocabulary usage such as Type-Token Ratio, Number of Different Words and Mean Word Frequency.
Stylometric analysis not only considers the words in use but also accounts for other statistical elements of style such as word length, sentence length, use of punctuation and many other features.
Analogies of the form A is to B as C is to D can be characterized by cooccurrences: two words connected by a joining word such as has, on, and with (64 joining words were initially proposed).
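The lexical measures named above are simple ratios over word counts. The Python sketch below uses the common definitions (type-token ratio = distinct words / total words, mean word frequency = total words / distinct words). The tokenizer is a deliberately naive assumption, and the paper's exact preprocessing is not specified here.

```python
import re
from collections import Counter

def lexical_measures(text):
    """Compute basic vocabulary measures for a non-empty text."""
    # Naive tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    return {
        "number_of_different_words": n_types,
        "type_token_ratio": n_types / n_tokens,
        "mean_word_frequency": n_tokens / n_types,
    }

m = lexical_measures("the cat sat on the mat and the dog sat too")
print(m)
```

All three values derive from a token count and a type count, so they can be recomputed after rolling up per-word occurrence counts from chapters to books or authors.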
WoW Creation
Creation involves the three stages of ETL.
Extraction: The extraction involves the plain text and XML documents of Project Gutenberg, a large corpus of literary works that is not in a suitable form for immediate analysis.
Transformation: The transformation phase will involve the calculation of all data that will be stored in the WoW, such as word frequency, punctuation frequency, and sentence lengths.
Loading: The loading phase will involve the actual creation and storage of the data cubes containing the calculated items.
Issues to be handled:
At times data, such as the author’s nationality, is missing and must be handled.
Also, new books are added to corpora daily, and a means for loading these new books into the WoW must be created.
WoW Schema
The main strength of an OLAP application is its efficient evaluation of aggregate queries across several dimensions and at different levels of granularity.
The “book” hierarchy maintains its finest detail at the level of chapters.
Contd..
The year of publication may be generalized to a literary era (e.g., Victorian); alternatively, the year may be generalized to decade and then to century.
Several natural generalizations may help word studies.
Alternatively, words can be grouped according to their final suffix.
Contd..
Finally, tools such as Signature allow user-specified word lists. Given a set of “interesting” word stems, a stemmed word can be classified as belonging to [one of] the user’s lists or belonging to no list.
These hierarchies allow for rollup queries (essentially generalizations) to be evaluated.
Instead of finding the frequent words used in a chapter or book, one might be interested in the frequent words used by an author or used in a time period.
To support the initial stylometric, analogy, and phrase-use queries, the WoW contains several cubes.
Sentence Style (Book × Word × WordCount × CommaCount × ColonSemicolonCount × StopwordCount → OccurrenceCount).
Conclusion
Each “Count” is an integer, and the Word dimension represents the first word in a sentence.
Short Phrase (Book × Word × Word × Word × Word → OccurrenceCount).
The cube records all sequences of 4 words, and it could be used to explore common (or rare) phrases by authors or time periods.
These cubes will allow for many queries to be evaluated and would aid in all of the practical applications as well as a variety of other studies.
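As an illustration of the Short Phrase cube, the Python sketch below counts every sequence of 4 consecutive words per book and then rolls up over the Book dimension. The book names and texts are made up, and splitting on whitespace while ignoring sentence boundaries is a simplifying assumption, not the paper's procedure.

```python
from collections import Counter

def short_phrase_cube(books):
    """Map (book, w1, w2, w3, w4) -> occurrence count for each 4-word sequence."""
    cube = Counter()
    for book, text in books.items():
        words = text.lower().split()
        for i in range(len(words) - 3):
            cube[(book, *words[i:i + 4])] += 1
    return cube

# Hypothetical two-book corpus.
books = {
    "book_a": "it was the best of times it was the worst of times",
    "book_b": "it was the best of all days",
}
cube = short_phrase_cube(books)

# Roll up over the Book dimension: phrase counts across the whole corpus.
rolled = Counter()
for (book, *phrase), n in cube.items():
    rolled[tuple(phrase)] += n

print(rolled.most_common(3))
```

Rolling up to authors or time periods instead would only require mapping each book to its author or era before summing, following the hierarchies described earlier.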
Thank you !