Database Management Systems

Date post: 09-Jan-2016
Upload: cyndi
Page 1: Database Management Systems

Jerry Post
Copyright © 2003

Database Management Systems

Chapter 8: Data Warehouses and Data Mining

Page 2: Database Management Systems

Sequential Storage and Indexes

We picture tables as simple rows and columns, but they cannot be stored this way.
It takes too many operations to find an item.
Insertions require reading and rewriting the entire table.

ID LastName FirstName DateHired

1 Reeves Keith 1/29/98

2 Gibson Bill 3/31/98

3 Reasoner Katy 2/17/98

4 Hopkins Alan 2/8/98

5 James Leisha 1/6/98

6 Eaton Anissa 8/23/98

7 Farris Dustin 3/28/98

8 Carpenter Carlos 12/29/98

9 O'Connor Jessica 7/23/98

10 Shields Howard 7/13/98

Page 3: Database Management Systems

Operations on Sequential Tables

Read entire table
  Easy and fast
Sequential retrieval
  Easy and fast for one order
Random Read/Sequential
  Very weak
  Probability of any row = 1/N
  Sequential retrieval: 1,000,000 rows means about 500,000 retrievals per lookup!
Delete
  Easy
Insert/Modify
  Very weak

Expected number of reads (EV) when every row is equally likely:

$$EV = \sum_{i=1}^{N} i \cdot \frac{1}{N} = \frac{1}{N} \cdot \frac{N(N+1)}{2} = \frac{N+1}{2}$$

Row   Prob.   # Reads
A     1/N     1
B     1/N     2
C     1/N     3
D     1/N     4
E     1/N     5
…     1/N     i
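The expected-value result above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name expected_reads is an assumption made for illustration.

```python
from fractions import Fraction

def expected_reads(n: int) -> Fraction:
    """Average reads to find one row when each of n rows is equally likely."""
    # Row i costs i reads and is requested with probability 1/n.
    return sum(Fraction(i, n) for i in range(1, n + 1))

ev_small = expected_reads(14)              # brute-force sum for a 14-row table
ev_formula = Fraction(1_000_000 + 1, 2)    # closed form (N + 1) / 2 for one million rows
print(ev_small, float(ev_formula))
```

The brute-force sum and the closed form agree, which is why a million-row sequential table needs roughly half a million reads per random lookup.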

Page 4: Database Management Systems

Insert into Sequential Table

Insert Inez:
  Find insert location.
  Copy top to new file.
  At insert location, add row.
  Copy rest of file.

Original file (sorted by LastName):

ID   LastName    FirstName   DateHired
8    Carpenter   Carlos      12/29/98
6    Eaton       Anissa      8/23/98
7    Farris      Dustin      3/28/98
2    Gibson      Bill        3/31/98
4    Hopkins     Alan        2/8/98
5    James       Leisha      1/6/98
9    O'Connor    Jessica     7/23/98
3    Reasoner    Katy        2/17/98
1    Reeves      Keith       1/29/98
10   Shields     Howard      7/13/98

New file, with the Inez row added at the insert location:

ID   LastName    FirstName   DateHired
8    Carpenter   Carlos      12/29/98
6    Eaton       Anissa      8/23/98
7    Farris      Dustin      3/28/98
2    Gibson      Bill        3/31/98
4    Hopkins     Alan        2/8/98
11   Inez        Maria       1/15/99
5    James       Leisha      1/6/98
9    O'Connor    Jessica     7/23/98
3    Reasoner    Katy        2/17/98
1    Reeves      Keith       1/29/98
10   Shields     Howard      7/13/98
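The four insert steps can be sketched in Python as follows. This is an illustration under assumptions: the file is modeled as an in-memory list, rows are trimmed to (ID, LastName), and the name insert_sorted is hypothetical.

```python
def insert_sorted(rows, new_row, key):
    """Return a new list with new_row placed in key order (rows already sorted)."""
    new_file = []
    position = 0
    # Find the insert location while copying the top of the file.
    while position < len(rows) and key(rows[position]) <= key(new_row):
        new_file.append(rows[position])
        position += 1
    new_file.append(new_row)            # at the insert location, add the row
    new_file.extend(rows[position:])    # copy the rest of the file
    return new_file

employees = [
    (8, "Carpenter"), (6, "Eaton"), (7, "Farris"), (2, "Gibson"), (4, "Hopkins"),
    (5, "James"), (9, "O'Connor"), (3, "Reasoner"), (1, "Reeves"), (10, "Shields"),
]
updated = insert_sorted(employees, (11, "Inez"), key=lambda row: row[1])
```

Every existing row is copied once, so a single insert costs work proportional to the size of the whole table.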

Page 5: Database Management Systems

Binary Search

Given a sorted list of names, how do you find Jones?

Sequential search
  Jones = 10 lookups
  Average = 15/2 = 7.5 lookups
  Min = 1, Max = 14
Binary search
  Find midpoint: (14 / 2) = 7
  Jones > Goetz
  Jones < Kalida
  Jones > Inez
  Jones = Jones (4 lookups)
  Max = log2(N), rounded up
    N = 1000: Max = 10
    N = 1,000,000: Max = 20

Sorted list (14 entries), with the binary search probe order in parentheses:

Adams
Brown
Cadiz
Dorfmann
Eaton
Farris
Goetz (1)
Hanson
Inez (3)
Jones (4)
Kalida (2)
Lomax
Miranda
Norman
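The probe sequence on this slide can be reproduced with a short Python sketch. The function below is a textbook binary search that also records each name it compares; the probes list is an addition for illustration only.

```python
def binary_search(sorted_names, target):
    """Return (index, probes), where probes lists every name compared."""
    low, high = 0, len(sorted_names) - 1
    probes = []
    while low <= high:
        mid = (low + high) // 2
        probes.append(sorted_names[mid])
        if sorted_names[mid] == target:
            return mid, probes
        if sorted_names[mid] < target:
            low = mid + 1      # target is in the upper half
        else:
            high = mid - 1     # target is in the lower half
    return -1, probes

names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]
index, probes = binary_search(names, "Jones")
```

With these 14 names the search compares Goetz, Kalida, Inez, and then Jones, matching the four lookups on the slide.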

Page 6: Database Management Systems

Indexed Sequential Storage

Common uses
  Large tables.
  Need many sequential lists.
  Some random search, with one or two key columns.
  Mostly replaced by B+-Tree.

Data table, with the storage address of each row:

Address   ID   LastName    FirstName   DateHired
A11       1    Reeves      Keith       1/29/98
A22       2    Gibson      Bill        3/31/98
A32       3    Reasoner    Katy        2/17/98
A42       4    Hopkins     Alan        2/8/98
A47       5    James       Leisha      1/6/98
A58       6    Eaton       Anissa      8/23/98
A63       7    Farris      Dustin      3/28/98
A67       8    Carpenter   Carlos      12/29/98
A78       9    O'Connor    Jessica     7/23/98
A83       10   Shields     Howard      7/13/98

ID index:

ID   Pointer
1    A11
2    A22
3    A32
4    A42
5    A47
6    A58
7    A63
8    A67
9    A78
10   A83

LastName index:

LastName    Pointer
Carpenter   A67
Eaton       A58
Farris      A63
Gibson      A22
Hopkins     A42
James       A47
O'Connor    A78
Reasoner    A32
Reeves      A11
Shields     A83

Indexed for ID and LastName
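One way to picture an index is as a sorted list of (key, pointer) pairs that is searched first, after which the pointer is followed to the data row. The Python sketch below is a simplified illustration with three rows from the slide; the dictionary standing in for disk storage and the lookup function are assumptions, not how any particular DBMS implements it.

```python
import bisect

# Data rows keyed by storage address (three rows from the slide).
storage = {
    "A11": (1, "Reeves", "Keith"),
    "A22": (2, "Gibson", "Bill"),
    "A67": (8, "Carpenter", "Carlos"),
}
# The index is a sorted list of (LastName, address) pairs.
last_name_index = sorted((row[1], address) for address, row in storage.items())

def lookup(index, key):
    """Binary-search the index, then follow the pointer to the data row."""
    position = bisect.bisect_left(index, (key,))
    if position < len(index) and index[position][0] == key:
        return storage[index[position][1]]
    return None

gibson = lookup(last_name_index, "Gibson")
```

The data rows stay in their original order; only the small index has to be kept sorted for each indexed column.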

Page 7: Database Management Systems

Index Options: Bitmaps and Statistics

Bitmap index
  A compressed index designed for non-primary key columns.
  Bit-wise operations can be used to quickly match WHERE criteria.

Analyze statistics
  By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables.
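To illustrate how bit-wise operations can match WHERE criteria, here is a minimal Python sketch that uses integers as uncompressed bitmaps. The column values are hypothetical, and real bitmap indexes add compression that is not shown here.

```python
# Hypothetical column values for six rows; row i corresponds to bit i.
category = ["Bird", "Cat", "Bird", "Dog", "Cat", "Bird"]
state = ["CA", "CA", "NY", "CA", "NY", "CA"]

def build_bitmaps(values):
    """Map each distinct value to an integer whose set bits mark matching rows."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

category_bits = build_bitmaps(category)
state_bits = build_bitmaps(state)

# WHERE Category = 'Bird' AND State = 'CA' becomes a single bitwise AND.
matches = category_bits["Bird"] & state_bits["CA"]
matching_rows = [row for row in range(len(category)) if (matches >> row) & 1]
```

Each additional AND or OR condition costs one more bitwise operation over the bitmaps, without touching the data rows until the final matches are known.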

Page 8: Database Management Systems

Problems with Indexes

Each index must be updated when rows are inserted, deleted or modified.

Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes.

Steps to improve performance
  Index primary keys
  Index common join columns (usually primary keys)
  Index columns that are searched regularly
  Use a performance analyzer

Page 9: Database Management Systems

Data Warehouse

[Figure: data flow among the labeled components: Operations data; OLTP database (3NF tables); Predefined reports; Flat files; Daily data transfer; Data warehouse (star configuration); Interactive data analysis.]

Page 10: Database Management Systems

Data Warehouse Goals

Existing databases are optimized for Online Transaction Processing (OLTP).

Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes.

Different goals require different storage, so build a separate data warehouse to use for queries.

Extraction, Transformation, Transportation (ETT)
Data analysis
  Ad hoc queries
  Statistical analysis
  Data mining (specialized automated tools)

Page 11: Database Management Systems

Extraction, Transformation, and Transportation (ETT)

Data warehouse: All data must be consistent.

Customers

Convert Client to Customer

Apply standard product numbers

Convert currencies

Fix region codes

Transaction data from diverse systems.

Page 12: Database Management Systems

OLTP v. OLAP

Page 13: Database Management Systems

Multidimensional Cube

[Figure: cube with three dimensions: Time (Sale Date), Customer Location, and Category. Cells hold Pet Store item sales, where Amount = Quantity * Sale Price.]

Page 14: Database Management Systems

Sales Date: Time Hierarchy

Levels
  Year
  Quarter
  Month
  Week
  Day

Roll-up: to get higher-level totals
Drill-down: to get lower-level details

Page 15: Database Management Systems

Star Design

Fact table
  Sales (Quantity, Amount = SalePrice * Quantity)

Dimension tables
  Products
  Customer Location
  Sales Date

Page 16: Database Management Systems

Snowflake Design

OLAPItems (SaleID, ItemID, Quantity, SalePrice, Amount)
Merchandise (ItemID, Description, QuantityOnHand, ListPrice, Category)
Sale (SaleID, SaleDate, EmployeeID, CustomerID, SalesTax)
Customer (CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID)
City (CityID, ZipCode, City, State)

Dimension tables can join to other dimension tables.

Page 17: Database Management Systems

OLAP Computation Issues

Compute Quantity*Price in base query, then add to get $23.00

If you use Calculated Measure in the Cube, it will add first and multiply second to get $45.00, which is wrong.
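The difference between the two totals comes down to the order of operations. The Python sketch below uses two hypothetical sale lines, chosen only because they reproduce the $23.00 and $45.00 figures; the slide's underlying data is not shown in this transcript.

```python
# Hypothetical sale lines (quantity, price), chosen to reproduce the slide's totals.
sale_lines = [(3, 5.00), (2, 4.00)]

# Correct: multiply on each row in the base query, then add.
correct_total = sum(quantity * price for quantity, price in sale_lines)

# Wrong: add each column first, then multiply the sums.
total_quantity = sum(quantity for quantity, _ in sale_lines)
total_price = sum(price for _, price in sale_lines)
wrong_total = total_quantity * total_price
```

Here the row-level calculation gives 15.00 + 8.00 = 23.00, while multiplying the column sums gives 5 * 9.00 = 45.00.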

Page 18: Database Management Systems

OLAP Data Browsing

Page 19: Database Management Systems

Microsoft Pivot Table

Page 20: Database Management Systems

OLAP in SQL 99

Category Month Amount

Bird 1 $135.00

Bird 2 $45.00

Bird 3 $202.50

Bird 6 $67.50

Bird 7 $90.00

Bird 9 $67.50

Cat 1 $396.00

Cat 2 $113.85

Cat 3 $443.70

Cat 4 $2.25

SELECT Category, Month(SaleDate) AS Month, Sum(Quantity*SalePrice) AS Amount
FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem ON Merchandise.ItemID = SaleItem.ItemID) ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);

GROUP BY two columns

Gives you totals for each month within each category.

You do not get super-aggregate totals for the category, or the month, or the overall total.

Page 21: Database Management Systems

SQL ROLLUP

SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)

Category   Month    Amount
Bird       1        135.00
Bird       2        45.00
…
Bird       (null)   607.50
Cat        1        396.00
Cat        2        113.85
…
Cat        (null)   1293.30
…
(null)     (null)   8451.79
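To show where the super-aggregate rows come from, the following Python sketch emulates ROLLUP (Category, Month) over the six Bird rows listed in the GROUP BY result earlier in the transcript. It is an illustration only: the rollup function is hypothetical and None stands in for NULL.

```python
from collections import defaultdict

# Bird rows from the GROUP BY result earlier in the transcript.
detail = [("Bird", 1, 135.00), ("Bird", 2, 45.00), ("Bird", 3, 202.50),
          ("Bird", 6, 67.50), ("Bird", 7, 90.00), ("Bird", 9, 67.50)]

def rollup(rows):
    """Emulate GROUP BY ROLLUP (Category, Month); None stands in for NULL."""
    totals = defaultdict(float)
    for category, month, amount in rows:
        totals[(category, month)] += amount   # detail level
        totals[(category, None)] += amount    # category subtotal
        totals[(None, None)] += amount        # grand total
    return dict(totals)

result = rollup(detail)
```

The Bird subtotal comes to 607.50, the same value as the Bird (null) row. The grand total here covers only the Bird rows, so it does not match the slide's 8451.79.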

Page 22: Database Management Systems

Missing Values Cause Problems

If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.

Category   Month    Amount
Bird       1        135.00
Bird       2        45.00
…
Bird       (null)   32.00     (missing date)
Bird       (null)   607.50    (super-aggregate)
Cat        1        396.00
Cat        2        113.85
…
Cat        (null)   1293.30
…
(null)     (null)   8451.79

Page 23: Database Management Systems

GROUPING Function

SELECT Category, Month…, Sum …,
  GROUPING (Category) AS Gc,
  GROUPING (Month) AS Gm
FROM …
GROUP BY ROLLUP (Category, Month...)

GROUPING returns 1 when the column has been aggregated away in a super-aggregate row and 0 otherwise, so it separates a true NULL (missing date) from a subtotal row.

Category   Month    Amount    Gc   Gm
Bird       1        135.00    0    0
Bird       2        45.00     0    0
…
Bird       (null)   32.00     0    0
Bird       (null)   607.50    0    1
Cat        1        396.00    0    0
Cat        2        113.85    0    0
…
Cat        (null)   1293.30   0    1
…
(null)     (null)   8451.79   1    1

Page 24: Database Management Systems

CUBE Option

SELECT Category, Month, Sum,
  GROUPING (Category) AS Gc,
  GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)

Category   Month    Amount    Gc   Gm
Bird       1        135.00    0    0
Bird       2        45.00     0    0
…
Bird       (null)   32.00     0    0
Bird       (null)   607.50    0    1
Cat        1        396.00    0    0
Cat        2        113.85    0    0
…
Cat        (null)   1293.30   0    1
(null)     1        1358.8    1    0
(null)     2        1508.94   1    0
(null)     3        2362.68   1    0
…
(null)     (null)   8451.79   1    1

Page 25: Database Management Systems

GROUPING SETS: Hiding Details

SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS (
  ROLLUP (Category),
  ROLLUP (Month),
  ( )
)

Category   Month    Amount
Bird       (null)   607.50
Cat        (null)   1293.30
…
(null)     1        1358.8
(null)     2        1508.94
(null)     3        2362.68
…
(null)     (null)   8451.79

Page 26: Database Management Systems

SQL OLAP Analytical Functions

VAR_POP, VAR_SAMP: variance
STDDEV_POP, STDDEV_SAMP: standard deviation
COVAR_POP, COVAR_SAMP: covariance
CORR: correlation
REGR_R2: regression r-square
REGR_SLOPE, REGR_INTERCEPT: regression data (many)

Page 27: Database Management Systems

SQL RANK Functions

SELECT Employee, SalesValue,
  RANK() OVER (ORDER BY SalesValue DESC) AS rank,
  DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;

Employee   SalesValue   rank   dense
Jones      18,000       1      1
Smith      16,000       2      2
Black      16,000       2      2
White      14,000       4      3

DENSE_RANK does not skip numbers.
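The difference between the two ranking functions can be sketched in Python using the four sales values from the table. The function name and return shape are assumptions made for illustration.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values sorted descending."""
    ordered = sorted(values, reverse=True)
    results = []
    rank = dense = 0
    previous = None
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position    # RANK jumps to the row position after ties
            dense += 1         # DENSE_RANK advances by one per distinct value
            previous = value
        results.append((value, rank, dense))
    return results

ranked = rank_and_dense_rank([18000, 16000, 16000, 14000])
```

The two tied rows share rank 2 under both functions; the next row is then rank 4 under RANK and 3 under DENSE_RANK.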

Page 28: Database Management Systems

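SQL OLAP Windows

SELECT Category, SaleMonth, MonthAmount,
  AVG(MonthAmount) OVER (PARTITION BY Category ORDER BY SaleMonth ASC ROWS 2 PRECEDING) AS MA
FROM qryOLAPSQL99
ORDER BY SaleMonth ASC;

Category   SaleMonth   MonthAmount   MA
Bird       200101      1500.00
Bird       200102      1700.00
Bird       200103      2000.00       1600.00
Bird       200104      2500.00       1850.00
…
Cat        200101      4000.00
Cat        200102      5000.00
Cat        200103      6000.00       4500.00
Cat        200104      7000.00       5500.00
…

Note: in standard SQL, ROWS 2 PRECEDING is shorthand for ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, so the frame includes the current row. The MA values shown average the two preceding months only (for example, 1600.00 = (1500.00 + 1700.00) / 2).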
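As a rough illustration of a row-based window, the Python sketch below averages the current row together with up to two rows before it, which is how standard SQL reads ROWS 2 PRECEDING. It handles a single partition (the Bird rows); the function name is hypothetical.

```python
def moving_average(amounts, preceding=2):
    """AVG over the current row and up to `preceding` rows before it."""
    averages = []
    for i in range(len(amounts)):
        window = amounts[max(0, i - preceding): i + 1]
        averages.append(sum(window) / len(window))
    return averages

bird = [1500.00, 1700.00, 2000.00, 2500.00]
bird_ma = moving_average(bird)
```

Early rows simply use a shorter window, so every row gets a value; PARTITION BY Category would correspond to calling the function once per category.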

Page 29: Database Management Systems

Ranges: OVER

SELECT SaleDate, Value,
  SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
  SUM(Value) OVER (ORDER BY SaleDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum2,
  SUM(Value) OVER (ORDER BY SaleDate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_sum
FROM …;

running_sum computes the total from the beginning through the current row.

running_sum2 does the same thing, but lists the rows explicitly.

remaining_sum computes the total from the current row through the end of the query.
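The running and remaining totals can be sketched in Python as follows. The values are hypothetical, and the sketch assumes one row per SaleDate; with duplicate dates, a RANGE frame would treat the tied rows as peers and sum them together.

```python
from itertools import accumulate

values = [1000, 1500, 2000, 2300]   # hypothetical daily values, one row per date

# Beginning through the current row.
running_sum = list(accumulate(values))

# Current row through the end: grand total minus everything before the current row.
total = sum(values)
remaining_sum = [total - before for before in [0] + running_sum[:-1]]
```

On the first row the remaining sum equals the grand total, and on the last row the running sum does.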

Page 30: Database Management Systems

LAG and LEAD Functions

SELECT SaleDate, Value,
  LAG (Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
  LEAD (Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
FROM …
ORDER BY SaleDate

LAG or LEAD: (Column, # rows, default)

SaleDate    Value   prior_day   next_day
1/1/2003    1000    0           1500
1/2/2003    1500    1000        2000
1/3/2003    2000    1500        2300
…
1/31/2003   3500    3200        0

The first prior_day is 0 because of the default value.

Not part of the standard yet? But they are available in SQL Server and Oracle.
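The (Column, # rows, default) behavior can be mimicked in a few lines of Python. This is a sketch over a plain list that is already in SaleDate order; the function names mirror the SQL functions but are otherwise hypothetical.

```python
def lag(values, offset=1, default=0):
    """Value from `offset` rows earlier, or `default` when no such row exists."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=0):
    """Value from `offset` rows later, or `default` when no such row exists."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

daily = [1000, 1500, 2000, 2300]   # values for the first four days in the table
prior_day = lag(daily)
next_day = lead(daily)
```

The default fills in only at the edges: the first prior_day and, in this truncated list, the last next_day.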

Page 31: Database Management Systems

Data Mining

Goal: To discover unknown relationships in the data that can be used to make better decisions.

Databases

Reports

Queries

OLAP

Data Mining

Transactions and operations

Specific ad hoc questions

Aggregate, compare, drill down

Unknown relationships

Page 32: Database Management Systems

Exploratory Analysis

Data mining usually works autonomously.
  Supervised/directed
  Unsupervised
  Often called a bottom-up approach that scans the data to find relationships

Some statistical routines, but they are not sufficient
  Statistics relies on averages
  Sometimes the important data lies in more detailed pairs

Page 33: Database Management Systems

Common Techniques

Classification/Prediction/Regression
Association Rules/Market Basket Analysis
Clustering
  Data points
  Hierarchies
Neural Networks
Deviation Detection
Sequential Analysis
  Time series events
  Websites
Textual Analysis
Spatial/Geographic Analysis

Page 34: Database Management Systems

Classification Examples

Examples
  Which borrowers/loans are most likely to be successful?
  Which customers are most likely to want a new item?
  Which companies are likely to file bankruptcy?
  Which workers are likely to quit in the next six months?
  Which startup companies are likely to succeed?
  Which tax returns are fraudulent?

Page 35: Database Management Systems

Classification Process

Clearly identify the outcome/dependent variable.
Identify potential variables that might affect the outcome.
  Supervised (modeler chooses)
  Unsupervised (system scans all/most)
Use sample data to test and validate the model.
System creates weights that link independent variables to outcome.

Income Married Credit History Job Stability Success

50000 Yes Good Good Yes

25000 Yes Bad Bad No

75000 No Good Good No

Page 36: Database Management Systems

Classification Techniques

Regression
Bayesian Networks
Decision Trees (hierarchical)
Neural Networks
Genetic Algorithms

Complications
  Some methods require categorical data
  Data size is still a problem

Page 37: Database Management Systems

Association/Market Basket

Examples
  What items are customers likely to buy together?
  What Web pages are closely related?
  Others?

Classic (early) example:
  Analysis of convenience store data showed customers often buy diapers and beer together.
  Importance: Consider putting the two together to increase cross-selling.

Page 38: Database Management Systems

Association Details (two items)

Rule evaluation (A implies B)
  Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
  Confidence of the rule is measured by the share of transactions with A that also contain B: P(B | A)
  Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive:
    P(A ∩ B) / ( P(A) P(B) ) = P(B | A) / P(B)

Example: Diapers implies Beer
  Support: P(D ∩ B) = .6, with P(D) = .7 and P(B) = .5
  Confidence: P(B | D) = P(D ∩ B) / P(D) = .6 / .7 = .857
  Lift: P(B | D) / P(B) = .857 / .5 = 1.714
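The three rule measures follow directly from the probabilities. The Python sketch below recomputes the Diapers implies Beer example using the slide's numbers as given; the function name rule_metrics is an assumption for illustration.

```python
def rule_metrics(p_a, p_b, p_both):
    """Support, confidence, and lift for the rule A implies B."""
    support = p_both                # P(A ∩ B)
    confidence = p_both / p_a       # P(B | A)
    lift = confidence / p_b         # P(B | A) / P(B)
    return support, confidence, lift

support, confidence, lift = rule_metrics(p_a=0.7, p_b=0.5, p_both=0.6)
```

A lift above 1 means that baskets containing diapers include beer more often than baskets in general.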

Page 39: Database Management Systems

Association Challenges

If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.
Some relationships are obvious.
  Burger and fries.
Some relationships are meaningless.
  Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?

Item       Freq.
1" nails   2%
2" nails   1%
3" nails   1%
4" nails   2%
Lumber     50%

Item            Freq.
Hardware        15%
Dim. Lumber     20%
Plywood         15%
Finish lumber   15%

Page 40: Database Management Systems

Cluster Analysis Examples

Are there groups of customers? (If so, we can cross-sell.)
Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)
Problem: Many dimensions and large datasets

Small intracluster distance

Large intercluster distance

Page 41: Database Management Systems

Geographic/Location Examples

Customer location and sales comparisons
Factory sites and cost
Environmental effects

Challenge: Map data, multiple overlays
