Decision support systems for E-commerce

DSS: help the knowledge worker (executive, manager, analyst) make faster and better decisions
– What were the sales volumes by region and product category for the last year?
– How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
– Will a 10% discount increase sales volume sufficiently?

Data warehousing: enables on-line analytical processing (OLAP)
– OLAP is a component of a decision support system

Data mining
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
– Data mining is a powerful, high-performance data analysis tool for decision support
Potential Applications of Data Warehousing and Mining in EC

– Analysis of user access patterns and buying patterns
– Customer segmentation and target marketing
– Cross-selling and improved Web advertisement
– Personalization
– Association (link) analysis
– Customer classification and prediction
– Time-series analysis
– Typical event sequence and user behavior pattern analysis
– Transition and trend analysis
Data Warehousing

– The term "data warehouse" was coined by William Inmon in 1990
– A data warehouse is a decision support database that is maintained separately from the organization's operational database
– Definition: a DW is a repository of integrated information from distributed, autonomous, and possibly heterogeneous information sources, maintained for query, analysis, decision support, and data mining purposes
Characteristics (cont'd)

Integrated
– Encoding, naming conventions, etc. are inconsistent among application-oriented data drawn from different legacy systems and heterogeneous data sources
– When data is moved into the warehouse, it is consolidated, converted, and consistently encoded
Non-volatile
– New data is always appended to the database rather than replacing existing data
– The database continually absorbs new data, integrating it with the previous data
– In contrast, operational data is regularly accessed and manipulated a record at a time, and updates are applied in place in the operational environment
Characteristics (cont'd)

Time-variant
– The time horizon of the data warehouse is significantly longer than that of operational systems
– Operational databases contain current-value data; warehouse data is a sophisticated series of snapshots, each taken at some moment in time. Operational data is valid only at the moment of access; within seconds, it may no longer describe current operations
– Operational data may or may not contain an element of time. Informational data always has a time dimension: each data point is associated with a point in time, and data points can be compared along that axis
Reasons to Separate DW from Operational Systems

Performance
– Special data organization, access methods, and implementation techniques are needed to support the multidimensional views and operations typical of OLAP
– Complex OLAP queries would degrade performance for operational transactions; keeping a separate DW avoids interrupting operational processing at the underlying information sources
– The concurrency control and recovery modes of OLTP are not compatible with OLAP analysis
– A separate DW provides fast access to integrated information
Reasons to Separate DW from Operational Systems (cont'd)

Decision support requires
– historical data, which operational databases do not typically maintain
– consolidation of data from heterogeneous sources (operational databases, external sources); different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
– aggregation, summarization, and annotation of raw data
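A minimal sketch of the reconciliation step above: two hypothetical operational sources encode the same attribute differently, and the warehouse load consolidates them into one consistent representation (all source names, codes, and mappings here are made up for illustration).

```python
# Hypothetical sketch: consolidating a "gender" attribute from two
# operational sources that encode the same fact inconsistently.
SOURCE_A = [{"cust": "c1", "gender": "M"}, {"cust": "c2", "gender": "F"}]
SOURCE_B = [{"cust": "c3", "sex": 0}, {"cust": "c4", "sex": 1}]

# Warehouse-side mapping tables reconcile the two representations.
A_MAP = {"M": "male", "F": "female"}
B_MAP = {0: "male", 1: "female"}

def consolidate():
    """Produce warehouse rows with one uniform encoding."""
    rows = []
    for r in SOURCE_A:
        rows.append({"cust": r["cust"], "gender": A_MAP[r["gender"]]})
    for r in SOURCE_B:
        rows.append({"cust": r["cust"], "gender": B_MAP[r["sex"]]})
    return rows
```

In a real warehouse this mapping logic lives in the ETL layer, applied once as data is moved in, so that all downstream queries see a single encoding.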
System Architecture

[Architecture diagram: heterogeneous information sources (legacy systems, flat files, RDBMSs, OODBMSs, ...) feed change detectors, which propagate data into the warehouse; end users run analysis, queries, reports, and data mining against it.]
DW Components

Underlying information sources
– Often the operational systems, providing the lowest level of data
– Designed for operational use, not for decision support, and the data reflect this fact
– Multiple data sources often come from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized
– Multiple data sources introduce a large number of issues, such as semantic conflicts
– Distributed, autonomous, and possibly heterogeneous
DW Components (cont'd)

Integrator
– Receives updates
– Makes the data conform to the conceptual schema used by the warehouse
– Integrates the changes into the warehouse, merging the data with data already present
– Resolves possible update anomalies
– Modifies warehouse views accordingly

User interface
– Tools to query the warehouse and to perform data analysis and data mining
DW Components (cont'd)

Change detectors/propagators
– Refresh the warehouse by detecting changes at an information source that are of interest to the warehouse, and propagating the updates on source data to the data stored in the warehouse
– When to refresh: determined by usage, types of data source, etc.
– How to refresh:
  • data shipping: use triggers to update a snapshot log table and propagate the updated data to the warehouse (requires defining triggers in a full-functionality DBMS)
  • transaction shipping: ship the updates recorded in the transaction log (requires examining the updates in the log file)
  • for legacy systems, write custom extraction programs
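The refresh idea above can be sketched with a snapshot-log comparison: keep the last snapshot of each source row, detect which rows changed, and propagate only those to the warehouse (function names and row shapes are illustrative, not from any particular system).

```python
# Illustrative sketch of refresh via a snapshot log: compare the
# current source state against the last recorded snapshot and
# propagate only the changed rows to the warehouse.
def detect_changes(snapshot_log, current):
    """Return rows that are new or updated since the last refresh."""
    changes = {}
    for key, row in current.items():
        if snapshot_log.get(key) != row:
            changes[key] = row
    return changes

def refresh(warehouse, snapshot_log, current):
    """Propagate detected changes and advance the snapshot log."""
    changes = detect_changes(snapshot_log, current)
    warehouse.update(changes)      # append/merge into warehouse data
    snapshot_log.update(changes)   # remember what has been shipped
    return changes
```

Trigger-based data shipping does the same thing incrementally inside the DBMS; the polling sketch above is what one falls back to for legacy sources without triggers or readable transaction logs.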
Multidimensional Data

Sales volume as a function of product, time, and geography
OLAP Servers

Relational OLAP (ROLAP)
– Extended relational DBMS that maps operations on multidimensional data to standard relational operations

Multidimensional OLAP (MOLAP)
– Special-purpose server that directly implements multidimensional data and operations

Hybrid OLAP (HOLAP)
– Gives users/system administrators the freedom to select different (relational or multidimensional) storage for different partitions of the data
Warehouse Design: Conceptual Modeling

Star schema
– A single object (fact table) in the middle connected to a number of objects (dimension tables)

Snowflake schema
– A refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables

Fact constellation schema
– Multiple fact tables share dimension tables
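A minimal star schema can be sketched with SQL (here run through Python's standard sqlite3 module so it is self-contained); the table and column names are invented for illustration, not taken from the example schemas referenced below.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) in the middle,
# joined to two dimension tables (product, region).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales   (product_id INTEGER, region_id INTEGER,
                      qtr TEXT, amount REAL);
INSERT INTO product VALUES (1, 'TV'), (2, 'PC');
INSERT INTO region  VALUES (1, 'China'), (2, 'India');
INSERT INTO sales   VALUES (1, 1, '1Qtr', 100), (1, 1, '2Qtr', 150),
                           (2, 2, '1Qtr', 80);
""")

# Typical star-join query: sales volume by product category and region.
rows = con.execute("""
SELECT p.category, r.name, SUM(s.amount)
FROM sales s JOIN product p ON s.product_id = p.product_id
             JOIN region  r ON s.region_id  = r.region_id
GROUP BY p.category, r.name
""").fetchall()
```

A snowflake schema would further normalize `product` into, say, category and brand tables; a fact constellation would add a second fact table (e.g. inventory) sharing these same dimensions.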
A Multidimensional Fact Table Schema
Example of The Star Schema
Example of the Snowflake Schema
Example of the Fact Constellation Schema
Sales Data
A Sample Data Cube
[Cube figure: dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (China, India, Japan), with sum cells along each dimension; the highlighted cell holds the total annual sales of TVs in China.]
OLAP Operations

Roll-up
– Aggregate along a specific dimension, i.e., summarize data
– Example: total sales volume last year by product category and by region

Drill-down
– Also called roll-down or drill-through
– The inverse of roll-up: go from a higher-level summary to a lower-level summary or to detailed data
– Example: for a particular product category, find the detailed sales data for each salesperson by date
OLAP Operations (cont'd)

Slicing
– Project data along a subset of dimensions, with an equality selection on the other dimensions
– Example: sales of beverages in the West for Jan 98

Dicing
– Similar to slicing, except that a range selection is used on the other dimensions instead of an equality selection
– Example: sales of beverages in the West over the last 6 months

Pivot
– Reorient the cube
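Slice and dice amount to selections over cube cells; a small sketch over a cube stored as a dictionary (cell coordinates and values are invented):

```python
# Cube stored as {(product, region, month): amount} -- illustrative data.
cube = {
    ("beverage", "West", "Jan"): 10,
    ("beverage", "West", "Feb"): 15,
    ("snacks",   "West", "Jan"): 5,
    ("beverage", "East", "Jan"): 7,
}

def slice_(cube, dim, value):
    """Equality selection on one dimension (e.g. region == 'West')."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, dim, values):
    """Range/set selection on one dimension (e.g. month in {Jan, Feb})."""
    return {k: v for k, v in cube.items() if k[dim] in values}

west = slice_(cube, dim=1, value="West")
jan = dice(cube, dim=2, values={"Jan"})
```

A pivot would leave the cells untouched and only change which dimensions are laid out as rows versus columns in the presentation.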
Cube Operation

SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer

This requires computing the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and the empty grouping ()
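The cube operator enumerates every subset of the grouping dimensions; a sketch that computes all 2^3 = 8 group-bys for a few made-up sales rows:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative rows: (date, product, customer, amount)
rows = [("Q1", "TV", "alice", 100), ("Q1", "PC", "bob", 50),
        ("Q2", "TV", "alice", 70)]
DIMS = ("date", "product", "customer")

def cube(rows):
    """Compute one aggregation table per subset of DIMS."""
    result = {}
    for r in range(len(DIMS), -1, -1):
        for subset in combinations(range(len(DIMS)), r):
            agg = defaultdict(int)
            for *key, amount in rows:
                agg[tuple(key[i] for i in subset)] += amount
            result[tuple(DIMS[i] for i in subset)] = dict(agg)
    return result

c = cube(rows)   # 8 group-bys, from (date, product, customer) down to ()
```

Each pass here rescans the base data; the array-based algorithm below is about avoiding exactly that rescanning by deriving coarser cuboids from finer ones.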
Cube Computation -- Array-Based Algorithm

An MOLAP approach: the base cuboid is stored as a multidimensional array.
Read in a number of cells at a time to compute partial cuboids.

[Lattice figure: the base cuboid {ABC} feeds the partial cuboids {AB}, {AC}, {BC}, which in turn feed {A}, {B}, {C}, down to the apex cuboid { }.]
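The core of the array-based idea can be sketched as a single scan of the base cuboid that updates several partial cuboids per cell visited (the 2x2x2 array and its values are invented; a real implementation works chunk by chunk to fit memory):

```python
from collections import defaultdict

# Base cuboid ABC as a dense 2x2x2 array: base[a][b][c] = measure.
base = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

# One scan of ABC updates the three 2-D partial cuboids simultaneously.
AB, AC, BC = defaultdict(int), defaultdict(int), defaultdict(int)
for a in range(2):
    for b in range(2):
        for c in range(2):
            v = base[a][b][c]
            AB[(a, b)] += v      # aggregate out dimension C
            AC[(a, c)] += v      # aggregate out dimension B
            BC[(b, c)] += v      # aggregate out dimension A
```

The 1-D cuboids {A}, {B}, {C} and the apex { } can then be computed from these partial results without ever rereading the base array.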
ROLAP versus MOLAP

ROLAP
– Exploits the services of the relational engine effectively
– Provides additional OLAP services
  • design tools for DSS schemas
  • performance analysis tools to pick which aggregates to materialize
– SQL gets in the way of sequential processing and column aggregation
– Some queries are hard to formulate and can be time-consuming to execute
ROLAP versus MOLAP (cont'd)

MOLAP
– The storage model is an n-dimensional array
– Front-end multidimensional queries map to server capabilities in a straightforward way
– Direct addressing abilities
– Handling sparse data in an array representation is expensive
– Poor storage utilization when the data is sparse
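The sparsity problem above can be made concrete with a back-of-the-envelope sketch: a dense array reserves a cell for every dimension combination, while only a tiny fraction of cells hold real facts (dimension sizes and cell values are invented).

```python
# Why sparse data hurts the dense-array (MOLAP) model: the array
# allocates every potential cell, while a keyed representation
# (closer to ROLAP rows) stores only the non-empty ones.
DIM_SIZES = (100, 100, 100)                      # 1,000,000 potential cells
nonempty = {(1, 2, 3): 10.0, (4, 5, 6): 7.5}     # only 2 actual facts

dense_cells = 1
for n in DIM_SIZES:
    dense_cells *= n

density = len(nonempty) / dense_cells   # fraction of cells actually used
```

At this density the dense array wastes essentially all of its storage, which is why practical MOLAP servers compress or chunk sparse regions rather than materializing the full array.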
Example

QDMS (Quality Database Management System): a designed, developed, and implemented working system
– Provides fast access to integrated production and inspection data
– Provides complex data analysis for decision support
– Isolates data analysis processing from operational systems
– Encourages manufacturers to evaluate and improve their performance
Example (cont'd)

Information sources
– Heterogeneous information sources
  • flat files
  • relational DBMSs: Oracle, Sybase, Paradox, MS Access, FoxPro
  • non-relational DBMSs: IBM IMS
  • others: Lotus Notes
– Data
  • uniform in some cases, e.g., Lot_no; product id: NSN
  • non-uniform in other cases, e.g., defect id
  • temporal ordering for production records
Actual Application

Query: "overall & detail production performance"
• manufacturer: Com1
• products: all products
• date interval: 01-Jan-94 until 01-Jan-1999
• source: USDA

[Result figure: rows for manufacturer Com.1 broken out by Lot#1–Lot#3 under Contract Numbers 1–3.]
Data Mining

Characterization and comparison
– Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions

Association
– Finding rules like: buys(x, diapers) => buys(x, milk)

Classification and prediction
– Classify data based on the values of a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage
– Predict unknown or missing attribute values based on other information
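An association rule such as buys(x, diapers) => buys(x, milk) is usually judged by its support and confidence; a sketch over a few invented market baskets:

```python
# Support = fraction of all baskets containing both items.
# Confidence = fraction of baskets with the left-hand item that
# also contain the right-hand item.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk"},
    {"diapers", "beer"},
    {"bread"},
]

def rule_stats(baskets, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs."""
    both = sum(1 for b in baskets if lhs in b and rhs in b)
    lhs_count = sum(1 for b in baskets if lhs in b)
    support = both / len(baskets)
    confidence = both / lhs_count if lhs_count else 0.0
    return support, confidence

s, conf = rule_stats(baskets, "diapers", "milk")
```

Mining algorithms such as Apriori search for all rules whose support and confidence exceed user-given thresholds, rather than evaluating a single rule as above.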
Data Mining (cont'd)

Clustering
– Group data to form new classes, e.g., cluster houses to find distribution patterns

Time-series analysis
– Trend and deviation analysis: find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., stock analysis
– Similarity-based, pattern-directed analysis: find and characterize user-specified patterns in large databases
– Cyclicity/periodicity analysis: find segment-wise or total cycles or periodic behavior in time-related data
Classification

Data categorization based on a set of training objects
– Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
– Example: classify a set of diseases and provide the symptoms that describe each class or subclass

The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
– classification of future test data,
– better understanding of each class, and
– prediction of certain properties and behaviors.
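The classification task above (build a per-class model from labeled data, then apply it to new records) can be sketched with one of the simplest possible models, a nearest-centroid classifier; the training values and class names are invented and stand in for whatever model a real system would fit.

```python
# Minimal classification sketch: summarize each class by the mean of
# its training features, then assign a new record to the nearest class.
training = [
    ((30.0,), "high-mileage"),   # cars labeled by gas mileage (made up)
    ((32.0,), "high-mileage"),
    ((14.0,), "low-mileage"),
    ((16.0,), "low-mileage"),
]

def build_model(training):
    """Per-class centroid of the feature vectors."""
    sums, counts = {}, {}
    for features, label in training:
        prev = sums.get(label, [0.0] * len(features))
        sums[label] = [a + b for a, b in zip(prev, features)]
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}

def classify(model, features):
    """Label of the centroid closest (squared distance) to `features`."""
    def dist(lbl):
        return sum((a - b) ** 2 for a, b in zip(model[lbl], features))
    return min(model, key=dist)

model = build_model(training)
label = classify(model, (29.0,))
```

Real classifiers (decision trees, Bayesian models, etc.) build richer per-class descriptions, but the workflow is the same: fit on class-labeled training data, then apply to future test data.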