Decision support systems for E-commerce

DSS: help the knowledge worker (executive, manager, analyst) make faster and better decisions
– What were the sales volumes by region and product category for the last year?
– How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
– Will a 10% discount increase sales volume sufficiently?

Data warehousing: enables on-line analytical processing (OLAP)
– OLAP is a component of a decision support system

Data mining
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
– Data mining is a powerful, high-performance data analysis tool for decision support
Potential Applications of Data Warehousing and Mining in EC

– Analysis of user access patterns and buying patterns
– Customer segmentation and target marketing
– Cross-selling and improved Web advertisement
– Personalization
– Association (link) analysis
– Customer classification and prediction
– Time-series analysis
– Typical event sequence and user behavior pattern analysis
– Transition and trend analysis
Data Warehousing

– The term "data warehouse" was coined by William Inmon in 1990
– A data warehouse is a decision support database that is maintained separately from the organization's operational database
– Definition: a DW is a repository of integrated information from distributed, autonomous, and possibly heterogeneous information sources, maintained for query, analysis, decision support, and data mining purposes
Characteristics (cont'd)

Integrated
– Encoding, naming conventions, etc. are inconsistent among application-oriented data drawn from different legacy systems and heterogeneous data sources
– When data is moved into the warehouse, it is consolidated, converted, and consistently encoded
Non-volatile
– New data is always appended to the database rather than replacing existing data
– The database continually absorbs new data, integrating it with the previous data
– In contrast, operational data is regularly accessed and manipulated a record at a time, and updates are applied in place in the operational environment
Characteristics (cont'd)

Time-variant
– The time horizon of the data warehouse is significantly longer than that of operational systems
– Operational databases contain current-value data; warehouse data is a sophisticated series of snapshots, each taken at some moment in time. Operational data is valid only at the moment of access; within seconds, it may no longer describe current operations
– Operational data may or may not contain an element of time. Informational data always has a time dimension: each data point is associated with a point in time, and data points can be compared along that axis
Reasons to Separate DW from Operational Systems

Performance
– Special data organization, access methods, and implementation techniques are needed to support the multidimensional views and operations typical of OLAP
– Complex OLAP queries would degrade performance for operational transactions; keeping a separate DW avoids interrupting operational processing at the underlying information sources
– The concurrency control and recovery modes of OLTP are not compatible with OLAP analysis
– A separate DW provides fast access to integrated information
Reasons to Separate DW from Operational Systems (cont'd)

Decision support requires
– historical data, which operational databases do not typically maintain
– consolidation of data from heterogeneous sources (operational databases, external sources); different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
– aggregation, summarization, and annotation of raw data
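A minimal sketch of the reconciliation step above: two hypothetical operational sources encode the same attribute differently, and the warehouse load consolidates them into one consistent representation (all source names, codes, and mappings here are made up for illustration).

```python
# Hypothetical sketch: consolidating a "gender" attribute from two
# operational sources that encode the same fact inconsistently.
SOURCE_A = [{"cust": "c1", "gender": "M"}, {"cust": "c2", "gender": "F"}]
SOURCE_B = [{"cust": "c3", "sex": 0}, {"cust": "c4", "sex": 1}]

# Warehouse-side mapping tables reconcile the two representations.
A_MAP = {"M": "male", "F": "female"}
B_MAP = {0: "male", 1: "female"}

def consolidate():
    """Produce warehouse rows with one uniform encoding."""
    rows = []
    for r in SOURCE_A:
        rows.append({"cust": r["cust"], "gender": A_MAP[r["gender"]]})
    for r in SOURCE_B:
        rows.append({"cust": r["cust"], "gender": B_MAP[r["sex"]]})
    return rows
```

In a real warehouse this mapping logic lives in the ETL layer, applied once as data is moved in, so that all downstream queries see a single encoding.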
System Architecture

[Architecture diagram: heterogeneous information sources (legacy systems, flat files, RDBMSs, OODBMSs, ...) feed change detectors, which propagate data into the warehouse; end users run analysis, queries, reports, and data mining against it.]
DW Components

Underlying information sources
– Often the operational systems, providing the lowest level of data
– Designed for operational use, not for decision support, and the data reflect this fact
– Multiple data sources often come from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized
– Multiple data sources introduce a large number of issues, such as semantic conflicts
– Distributed, autonomous, and possibly heterogeneous
DW Components (cont'd)

Integrator
– Receives updates
– Makes the data conform to the conceptual schema used by the warehouse
– Integrates the changes into the warehouse, merging the data with data already present
– Resolves possible update anomalies
– Modifies warehouse views accordingly

User interface
– Tools to query the warehouse and to perform data analysis and data mining
DW Components (cont'd)

Change detectors/propagators
– Refresh the warehouse by detecting changes at an information source that are of interest to the warehouse, and propagating the updates on source data to the data stored in the warehouse
– When to refresh: determined by usage, types of data source, etc.
– How to refresh:
  • data shipping: use triggers to update a snapshot log table and propagate the updated data to the warehouse (requires defining triggers in a full-functionality DBMS)
  • transaction shipping: ship the updates recorded in the transaction log (requires examining the updates in the log file)
  • for legacy systems, write custom extraction programs
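The refresh idea above can be sketched with a snapshot-log comparison: keep the last snapshot of each source row, detect which rows changed, and propagate only those to the warehouse (function names and row shapes are illustrative, not from any particular system).

```python
# Illustrative sketch of refresh via a snapshot log: compare the
# current source state against the last recorded snapshot and
# propagate only the changed rows to the warehouse.
def detect_changes(snapshot_log, current):
    """Return rows that are new or updated since the last refresh."""
    changes = {}
    for key, row in current.items():
        if snapshot_log.get(key) != row:
            changes[key] = row
    return changes

def refresh(warehouse, snapshot_log, current):
    """Propagate detected changes and advance the snapshot log."""
    changes = detect_changes(snapshot_log, current)
    warehouse.update(changes)      # append/merge into warehouse data
    snapshot_log.update(changes)   # remember what has been shipped
    return changes
```

Trigger-based data shipping does the same thing incrementally inside the DBMS; the polling sketch above is what one falls back to for legacy sources without triggers or readable transaction logs.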
Multidimensional Data

Sales volume as a function of product, time, and geography
OLAP Servers

Relational OLAP (ROLAP)
– Extended relational DBMS that maps operations on multidimensional data to standard relational operations

Multidimensional OLAP (MOLAP)
– Special-purpose server that directly implements multidimensional data and operations

Hybrid OLAP (HOLAP)
– Gives users/system administrators the freedom to select different (relational or multidimensional) storage for different partitions of the data
Warehouse Design: Conceptual Modeling

Star schema
– A single object (fact table) in the middle connected to a number of objects (dimension tables)

Snowflake schema
– A refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables

Fact constellation schema
– Multiple fact tables share dimension tables
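A minimal star schema can be sketched with SQL (here run through Python's standard sqlite3 module so it is self-contained); the table and column names are invented for illustration, not taken from the example schemas referenced below.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) in the middle,
# joined to two dimension tables (product, region).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales   (product_id INTEGER, region_id INTEGER,
                      qtr TEXT, amount REAL);
INSERT INTO product VALUES (1, 'TV'), (2, 'PC');
INSERT INTO region  VALUES (1, 'China'), (2, 'India');
INSERT INTO sales   VALUES (1, 1, '1Qtr', 100), (1, 1, '2Qtr', 150),
                           (2, 2, '1Qtr', 80);
""")

# Typical star-join query: sales volume by product category and region.
rows = con.execute("""
SELECT p.category, r.name, SUM(s.amount)
FROM sales s JOIN product p ON s.product_id = p.product_id
             JOIN region  r ON s.region_id  = r.region_id
GROUP BY p.category, r.name
""").fetchall()
```

A snowflake schema would further normalize `product` into, say, category and brand tables; a fact constellation would add a second fact table (e.g. inventory) sharing these same dimensions.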
A Multidimensional Fact Table Schema
Example of The Star Schema
Example of the Snowflake Schema
Example of the Fact Constellation Schema
Sales Data
A Sample Data Cube
[Cube figure: dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (China, India, Japan), with sum cells along each dimension; the highlighted cell holds the total annual sales of TVs in China.]
OLAP Operations

Roll-up
– Aggregate along a specific dimension, i.e., summarize data
– Example: total sales volume last year by product category and by region

Drill-down
– Also called roll-down or drill-through
– The inverse of roll-up: go from a higher-level summary to a lower-level summary or to detailed data
– Example: for a particular product category, find the detailed sales data for each salesperson by date
OLAP Operations (cont'd)

Slicing
– Project data along a subset of dimensions, with an equality selection on the other dimensions
– Example: sales of beverages in the West for Jan 98

Dicing
– Similar to slicing, except that a range selection is used on the other dimensions instead of an equality selection
– Example: sales of beverages in the West over the last 6 months

Pivot
– Reorient the cube
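Slice and dice amount to selections over cube cells; a small sketch over a cube stored as a dictionary (cell coordinates and values are invented):

```python
# Cube stored as {(product, region, month): amount} -- illustrative data.
cube = {
    ("beverage", "West", "Jan"): 10,
    ("beverage", "West", "Feb"): 15,
    ("snacks",   "West", "Jan"): 5,
    ("beverage", "East", "Jan"): 7,
}

def slice_(cube, dim, value):
    """Equality selection on one dimension (e.g. region == 'West')."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, dim, values):
    """Range/set selection on one dimension (e.g. month in {Jan, Feb})."""
    return {k: v for k, v in cube.items() if k[dim] in values}

west = slice_(cube, dim=1, value="West")
jan = dice(cube, dim=2, values={"Jan"})
```

A pivot would leave the cells untouched and only change which dimensions are laid out as rows versus columns in the presentation.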
Cube Operation

SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer

This requires computing the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and the empty grouping ()
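The cube operator enumerates every subset of the grouping dimensions; a sketch that computes all 2^3 = 8 group-bys for a few made-up sales rows:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative rows: (date, product, customer, amount)
rows = [("Q1", "TV", "alice", 100), ("Q1", "PC", "bob", 50),
        ("Q2", "TV", "alice", 70)]
DIMS = ("date", "product", "customer")

def cube(rows):
    """Compute one aggregation table per subset of DIMS."""
    result = {}
    for r in range(len(DIMS), -1, -1):
        for subset in combinations(range(len(DIMS)), r):
            agg = defaultdict(int)
            for *key, amount in rows:
                agg[tuple(key[i] for i in subset)] += amount
            result[tuple(DIMS[i] for i in subset)] = dict(agg)
    return result

c = cube(rows)   # 8 group-bys, from (date, product, customer) down to ()
```

Each pass here rescans the base data; the array-based algorithm below is about avoiding exactly that rescanning by deriving coarser cuboids from finer ones.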
Cube Computation -- Array-Based Algorithm

An MOLAP approach: the base cuboid is stored as a multidimensional array.
Read in a number of cells at a time to compute partial cuboids.

[Lattice figure: the base cuboid {ABC} feeds the partial cuboids {AB}, {AC}, {BC}, which in turn feed {A}, {B}, {C}, down to the apex cuboid { }.]
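The core of the array-based idea can be sketched as a single scan of the base cuboid that updates several partial cuboids per cell visited (the 2x2x2 array and its values are invented; a real implementation works chunk by chunk to fit memory):

```python
from collections import defaultdict

# Base cuboid ABC as a dense 2x2x2 array: base[a][b][c] = measure.
base = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

# One scan of ABC updates the three 2-D partial cuboids simultaneously.
AB, AC, BC = defaultdict(int), defaultdict(int), defaultdict(int)
for a in range(2):
    for b in range(2):
        for c in range(2):
            v = base[a][b][c]
            AB[(a, b)] += v      # aggregate out dimension C
            AC[(a, c)] += v      # aggregate out dimension B
            BC[(b, c)] += v      # aggregate out dimension A
```

The 1-D cuboids {A}, {B}, {C} and the apex { } can then be computed from these partial results without ever rereading the base array.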
ROLAP versus MOLAP

ROLAP
– Exploits the services of the relational engine effectively
– Provides additional OLAP services
  • design tools for DSS schemas
  • performance analysis tools to pick which aggregates to materialize
– SQL gets in the way of sequential processing and column aggregation
– Some queries are hard to formulate and can be time-consuming to execute
ROLAP versus MOLAP (cont'd)

MOLAP
– The storage model is an n-dimensional array
– Front-end multidimensional queries map to server capabilities in a straightforward way
– Direct addressing abilities
– Handling sparse data in an array representation is expensive
– Poor storage utilization when the data is sparse
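The sparsity problem above can be made concrete with a back-of-the-envelope sketch: a dense array reserves a cell for every dimension combination, while only a tiny fraction of cells hold real facts (dimension sizes and cell values are invented).

```python
# Why sparse data hurts the dense-array (MOLAP) model: the array
# allocates every potential cell, while a keyed representation
# (closer to ROLAP rows) stores only the non-empty ones.
DIM_SIZES = (100, 100, 100)                      # 1,000,000 potential cells
nonempty = {(1, 2, 3): 10.0, (4, 5, 6): 7.5}     # only 2 actual facts

dense_cells = 1
for n in DIM_SIZES:
    dense_cells *= n

density = len(nonempty) / dense_cells   # fraction of cells actually used
```

At this density the dense array wastes essentially all of its storage, which is why practical MOLAP servers compress or chunk sparse regions rather than materializing the full array.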
Example

QDMS (Quality Database Management System): a designed, developed, and implemented working system
– Provides fast access to integrated production and inspection data
– Provides complex data analysis for decision support
– Isolates data analysis processing from operational systems
– Encourages manufacturers to evaluate and improve their performance
Example (cont'd)

Information sources
– Heterogeneous information sources
  • flat files
  • relational DBMSs: Oracle, Sybase, Paradox, MS Access, FoxPro
  • non-relational DBMSs: IBM IMS
  • others: Lotus Notes
– Data
  • uniform in some cases, e.g., Lot_no; product id: NSN
  • non-uniform in other cases, e.g., defect id
  • temporal ordering for production records
Actual Application

Query: "overall & detail production performance"
• manufacturer: Com1
• products: all products
• date interval: 01-Jan-94 until 01-Jan-1999
• source: USDA

[Result figure: rows for manufacturer Com.1 broken out by Lot#1–Lot#3 under Contract Numbers 1–3.]
Data Mining

Characterization and comparison
– Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions

Association
– Finding rules like: buys(x, diapers) => buys(x, milk)

Classification and prediction
– Classify data based on the values of a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage
– Predict unknown or missing attribute values based on other information
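An association rule such as buys(x, diapers) => buys(x, milk) is usually judged by its support and confidence; a sketch over a few invented market baskets:

```python
# Support = fraction of all baskets containing both items.
# Confidence = fraction of baskets with the left-hand item that
# also contain the right-hand item.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk"},
    {"diapers", "beer"},
    {"bread"},
]

def rule_stats(baskets, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs."""
    both = sum(1 for b in baskets if lhs in b and rhs in b)
    lhs_count = sum(1 for b in baskets if lhs in b)
    support = both / len(baskets)
    confidence = both / lhs_count if lhs_count else 0.0
    return support, confidence

s, conf = rule_stats(baskets, "diapers", "milk")
```

Mining algorithms such as Apriori search for all rules whose support and confidence exceed user-given thresholds, rather than evaluating a single rule as above.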
Data Mining (cont'd)

Clustering
– Group data to form new classes, e.g., cluster houses to find distribution patterns

Time-series analysis
– Trend and deviation analysis: find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., stock analysis
– Similarity-based, pattern-directed analysis: find and characterize user-specified patterns in large databases
– Cyclicity/periodicity analysis: find segment-wise or total cycles or periodic behavior in time-related data
Classification

Data categorization based on a set of training objects
– Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
– Example: classify a set of diseases and provide the symptoms that describe each class or subclass

The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
– classification of future test data,
– better understanding of each class, and
– prediction of certain properties and behaviors.
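The classification task above (build a per-class model from labeled data, then apply it to new records) can be sketched with one of the simplest possible models, a nearest-centroid classifier; the training values and class names are invented and stand in for whatever model a real system would fit.

```python
# Minimal classification sketch: summarize each class by the mean of
# its training features, then assign a new record to the nearest class.
training = [
    ((30.0,), "high-mileage"),   # cars labeled by gas mileage (made up)
    ((32.0,), "high-mileage"),
    ((14.0,), "low-mileage"),
    ((16.0,), "low-mileage"),
]

def build_model(training):
    """Per-class centroid of the feature vectors."""
    sums, counts = {}, {}
    for features, label in training:
        prev = sums.get(label, [0.0] * len(features))
        sums[label] = [a + b for a, b in zip(prev, features)]
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}

def classify(model, features):
    """Label of the centroid closest (squared distance) to `features`."""
    def dist(lbl):
        return sum((a - b) ** 2 for a, b in zip(model[lbl], features))
    return min(model, key=dist)

model = build_model(training)
label = classify(model, (29.0,))
```

Real classifiers (decision trees, Bayesian models, etc.) build richer per-class descriptions, but the workflow is the same: fit on class-labeled training data, then apply to future test data.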