OLAP
On Line Analytic Processing
OLTP
• On Line Transaction Processing– support for ‘real-time’ processing of orders,
bookings, sales– typically access to single rows of tables– current data only - current products, available
flights– supports Operational level of organisational
hierarchy
Analysis
• Support for – tactical (e.g. stock ordering in view of forecast
demand– strategic (e.g. where to open a new store)
• Need to ‘understand the data’ - ‘Business Intelligence’
Requirements• Operations
– Summation– Statistical analysis - variation for standard– Ranking and percentages– Comparison over time - last year-to-date
• ‘Static data picture’ to avoid ‘inconsistent reads’ during analysis production, and comparisons between analyses
• Addition data - forecasting models, external data, structural data (location of tills, of products on shelves)
Analysis pre database
• 50% of code used to create complex bulk reports on how the company was performing
• inflexible, costly, untimely
• ‘raw data’ too large to retain, so aggregation required - restricts year-to year comparison to level of prior aggregation
OLAP/Data warehouse
• Specialised tools, data structures to support analysis ‘on line’ i.e. on demand
• Data will be a snap-shot to ensure consistency
• Hold base ‘fact’ table - raw basic facts - credit card transaction, item sale, booking
• + multiple analytic dimensions - – time, product, customer, store
SuperMarket ‘Basket’ data• Fact - a single line on a till receipt
– basket no– date– customer no– product code– till no– quantity– price
• Dimensions – Customer, Product, Time, Till/Store
Star Schema
Kinds of Attributes• Measures
– continuously varying– interval,ratio scales– able to be summed, ranked etc
• Dimensions– nominal scales - no ordering– may be classified in multiple ways
• Date/Time– interval scale - i.e ordered only– can be treated as dimension and classified
Product dimension
• Product e.g. size 9 Bata slippers– Product category - shoes
• Product group - clothing
– Size e.g. shoe9– Range e.g Bata– Product class e.g. cheap
Time Dimension
• Date e.g. 11 Dec 2002– Day of Week e.g. Wednesday– Week e.g. 49– Month e.g. December
• Qtr e.g.3– Year e.g. 2002
– Promotion period e.g. Pre-Christmas Sales– Season e.g. Autumn
Snowflake schema
Query Processing
• Consider a typical analytic query– How do sales of clothing vary by day of week in stores in
the SW region?– Select product.name,dow.name,sum(qty*price)– from sales, product, productCategory,
productGroup,date,DOW– where ( the 5 join conditions)– and (productGroup.name=‘Clothing’)– group by DOW.name,product.name– order by DOW.name,product.name
Difficulties
• Fact table is huge - must be compressed as much as possible by reducing field sizes etc
• Since dimension data is smaller and more stable, OK to denormalise to reduce joins– date - add dow.id,month.id,year,fiscalYear,…– Balance required– Denormalised dimensions result in ‘Starflake
schema’
Aggregation
• Simple case :– sales - date,product,cust,store,value– How many aggregations are possible– Store(5) Product(10) Customer(30)
Aggregation operations
• Cube - all possible aggregations– 8 for 3 dimensions
• Roll-up - aggregate in order– e.g. Product,Store,Cust
– 4 for 3 dimensions
• Slice and Dice– Takes parts or slices of a cube of aggregations
• Drill down– Given an aggregation, expand an aggregated dimension e.g. Expand
clothing sales analysis by City
Data Mining
• OLAP requires a priori assumptions about the categories of interest
• But a useful category may be ‘hidden’ - can it be discovered a posteriori ?– e.g. identify high risk motor insurance policies by
attributes of policy - gender, age, type of vehicle, job, postcode
• Rule induction (machine learning) methods can be used
• Finding relationships– looking for deviations from normal behaviour -
e.g. to identify fraudulent transactions in a credit card company
– looking for deviations from average e.g. Non-random combinations of goods in a basket - classic example is beer and nappies
• Requires heavy aggregations, statistical selection, rule induction
ETL
• Extract - Transform - Load
• OLAP databases are often said to be read only - but all need periodic updating with new data extracted from sources, validation, re-organising and load, whilst maintaining aggregations and index
• Extract– new or changed data from the OLTP– changed product and structural data– external data
• External data may be in legacy systems, remote databases, flat files
Transform
• Filtering out bad transaction data
• Validating against database
Load
• Load new facts
• Re-de-normalising if product dimensions change
• Re-aggregation