NPACI AHM 2001 Tutorial on Data Mining for Scientific ...

San Diego Supercomputer Center

National Partnership for Advanced Computational Infrastructure

NPACI AHM 2001

Tutorial on

Data Mining for Scientific Applications

Chaitan Baru

Tony Fountain




Tutorial Objectives

• Provide overview of the infrastructure – technologies and techniques – for:
  • data mining, database systems

• Provide some illustrative examples of how the infrastructure can be used in scientific applications

• Present plans for the SDSC Knowledge and Information Discovery Lab (SKIDL)

• Identify potential collaborations – for applications as well as infrastructure

– Our emphasis is on the infrastructure



Tutorial Outline

8:00 - 8:15 Data intensive computing in NPACI (Baru)

8:15 - 9:15 Introduction to data mining (Fountain)

9:15 - 10:15 DBMS support for analysis of large-scale data (Baru)

10:15 - 10:30 BREAK

10:30 - 12:00 Examples of data mining tools (Fountain)

Next steps...



NPACI DICE

• Focus on data, information, and knowledge management:

• Persistent archives

• Use of XML and archival storage systems (e.g. HPSS) for data storage

• Metadata-based access to data sets (Extensible Metadata Catalog, eMCAT)

• Distributed data handling (Storage Resource Broker, SRB)

• Information mediation (Mediation of Information using XML, MIX)

• Model-based mediation (NeuroMIX), use of Topic Maps



HPSS

• Capacity
  • Total >400TB
  • Current usage: >240TB stored

• Load
  • Transfer rate: 1TB/day

• SRB provides a “container” mechanism for better usage and improved efficiency



The SDSC Storage Resource Broker

• Metadata-based access to data sets stored in distributed, heterogeneous storage resources

[Figure: Application (SRB client); SRB middleware (SRB servers, MCAT); distributed storage resources (DB2, Oracle, ObjectStore; HPSS, UniTree; UNIX, ftp). Platforms: Solaris, Linux, NT, AIX, HP-UX, IRIX]



Current Usage of SRB

• Collections
  • Digital Sky: ~4TB, ~8 million files
  • Digital Embryo: ~700GB, millions of files
  • Digital library collections (ADL, UCB, Michigan): ~1 million files
  • HyperLTER – hyperspectral data
  • Particle Physics Data Grid

• Upcoming collections
  • SLAC...



Mediation of Information using XML (MIX)

[Figure: data sources (including an XML data source), each behind a Wrapper, feed XML View(s) to the MIXm Mediator, which serves XML View(s) to the Blended Browsing and Querying (BBQ) interface]

• Definition of mediated view in XML Matching And Structuring (XMAS) query language

• Lazy evaluation of XMAS queries using DOM-VXD



From data management infrastructure to knowledge discovery infrastructure

• The Affymetrix story
  • “Technology built for Wall Street helps bioinformatics companies as well…”

• The “scientist in the middle”
  • The infrastructure is a tool to help the scientist, not a replacement!

[Figure: infrastructure, KDD]



The Infrastructure Supports:

• Exploratory data analysis of large data sets
  • efficient ad hoc statistical processing

• Parallel data access, subsetting, and analysis

• Data intensive approach to model building and verification
  • including fusion of different forms of data (e.g. database tables, instrument outputs, remote sensing data, maps, …)

– Employ, and build upon, existing (commercial, freeware) tools and software packages, as much as possible



The SDSC Knowledge and Information Discovery Lab (SKIDL)

• Initial hardware platform
  • 2-processor Sun, 512MB memory, 36 GB local disk

• Upgrade to:
  • 20 processor Sun, 6 GB memory, 400 GB local disk
  • Access to additional disk storage via storage area network (SAN)

• Possible further upgrade (via CalIT2)
  • Additional 4 GB memory, 1 TB SAN disk, Gigabit Ethernet capability

• Software
  • High-performance, parallel database systems and file systems
    • DB2
    • Oracle, GPFS
  • Suite of data mining tools
    • Intelligent Miner, MineSet, Bayesian network tools
    • S-Plus, Darwin, Clementine, SAS
  • Presentation, visualization: ESRI ArcIMS, ...



Data Mining

Tony Fountain

NPACI ESS

SDSC Knowledge & Information Discovery Lab



Overview (DM101)

• Part 1:
  • Definition
  • Motivations
  • Methods, Techniques, & Tools

• Database 605 – Chaitan Baru

• Part 2:
  • Examples & Demos
  • Data Mining to Decision Support



Outline (DM101)

Part 1 – What is data mining?

1. Direct
2. Contributions from other disciplines
3. Motivations & context
4. Example applications
5. Analytical methods:
  • Association Rules
  • Classification & Prediction
  • Clustering
  • OLAP
6. MSU data set



Definition

The search for interesting patterns,

in large databases,

that were collected for other applications,

using machine learning algorithms,

and high-performance computers,

for fun and profit!

for science and society!



KDD Process: Knowledge Discovery and Data Mining

Collection

Processing/Cleansing/Correction/Formatting

Mining/Analysis/Modeling

Presentation/Visualization

Application/Decision Support

Management/Integration/Warehousing



Data Mining & Knowledge Discovery: KD, KDD, KDD(D)*

What’s in a name?
• Database
• Data Mining
• Discovery
• Derivation
• Decision Support



Contributions

[Figure: disciplines contributing to Data Mining]

• Artificial Intelligence
• High Performance Computing
• Statistics
• Database Systems
• Operations Research
• GIS
• Visualization



The Case for Data Mining: Data Reality

• Controlled experimental data collection is an ideal
• Legacy archives and independent collection activities
• Deluge from new sources
  • Remote Sensing
  • Instrumentation & Wireless Communications
  • Simulation Models
• Growth of data collections vs. analysts
• Many types of data, many uses, many types of queries
• Advances in computational infrastructure provide new opportunities for access and integration
• Paradigm shift: hypothesis-driven data collection to data mining (KDD)



The Revolution in Ecology

• Computational Ecology and Eco-Informatics
• Instrumentation & Remote Sensing
  • Amphibian urls and hyperspectral data
  • Tropical glaciers in Ohio
• Computer Simulations
  • Coupled biogeochemistry, ocean, atmosphere…

• Ecology without boots!



Classic Applications - Commercial

• Fraud Detection – credit card
• Churning – long-distance carriers
• Targeted Marketing – customer profiles
• Stock Market – futures trading
• Market Basket Analysis

• Soon to be classic: FL 2000 election



Classic Applications - Science

• Volcanoes on Venus - Classification
  • Burl, et al., NASA, Caltech

• Astronomical clustering – Autoclass, Bayesian Clustering
  • Cheeseman, Stutz, NASA

• Oil spills from remote sensing data – Decision Trees
  • Kubat, et al., Ottawa

• Biodiversity analysis – Genetic algorithms, Bayesian Nets
  • Stockwell, SDSC/UCSD

• YOUR NAME HERE!! (1800-SKIDLME)



Data Mining Tools (suites)

• SPSS - Clementine

• http://www.spss.com/clementine/

• Oracle - Darwin

• http://www.oracle.com/ip/analyze/warehouse/datamining/

• SGI - MineSet

• http://www.sgi.com/software/mineset/

• IBM - Intelligent Miner

• http://www-4.ibm.com/software/data/iminer/fordata/

• http://www.kdnuggets.com/software/index.html



Data Mining Analytical Techniques (patterns, hypotheses, models)

• Statistical Methods
  • Descriptive, Modeling, Data Reduction…

• Associations
  • Simple relations in categorical data

• Classification & Prediction
  • Model induction - Supervised learning

• Clustering
  • Concept discovery - Unsupervised learning



Association Rule Mining

• Associations
  • Simple rules in categorical data

• Sample applications
  • Market Basket Analysis
    Buys(Milk) => Buys(Eggs)
  • Transaction Processing
    Income(Hi) & Single(Y) => Owns(Computer)

• Search for Strong Rules
  • Support R(A => B) = P(A & B), the fraction of records that contain both A and B
  • Confidence R(A => B) = P(B | A) = P(A & B) / P(A)



Association Rule Mining

70 Bird Antelope Lion
80 Hyena Bird Snake
70 Tiger Snake Antelope
70 Bird Lion Hyena
70 Snake Lion Bird

R1: [70 => (Bird & Lion)]

Support: P(70 & Bird & Lion) = 3/5 = 60%

Confidence: P((Bird & Lion) | 70) = P(Bird & Lion & 70) / P(70) = (3/5) / (4/5) = 75%
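The rule R1 can be checked mechanically. The sketch below is a minimal illustration, not the tutorial's tooling: it uses the common definition in which the support of A => B is the fraction of records containing both A and B, and it treats the leading number of each record as an ordinary item. The function names `support` and `confidence` are chosen here for illustration.

```python
# The five records from the slide; the leading number is treated as an item.
transactions = [
    {"70", "Bird", "Antelope", "Lion"},
    {"80", "Hyena", "Bird", "Snake"},
    {"70", "Tiger", "Snake", "Antelope"},
    {"70", "Bird", "Lion", "Hyena"},
    {"70", "Snake", "Lion", "Bird"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"70", "Bird", "Lion"}, transactions)
rule_confidence = confidence({"70"}, {"Bird", "Lion"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Three of the five records contain 70, Bird and Lion together, and four contain 70, which gives a support of 60% and a confidence of 75% under this definition.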



Classification

• Classification and prediction
  • Create model for distinguishing concepts
  • Labeled training data
  • Metrics based on accuracy rates and cross-validation

• Numerous methods
  • Decision trees
  • Neural Nets
  • Bayesian Networks
  • Regression

• Many applications
  • Identifying credit risks
  • Predicting biological productivity
  • Medical diagnosis
  • Classifying toxic risks…



Classification – Decision Tree

Ecosystem  Precipitation
Desert     2
Forest     120
Forest     104
Desert     5
Forest     116
Prairie    63

Split 1: Precipitation < 60
  yes: Desert 2, Desert 5
  no:  Forest 120, Forest 104, Forest 116, Prairie 63

Split 2 (on the "no" branch): Precip < 100
  yes: Prairie 63
  no:  Forest 120, Forest 104, Forest 116

IF (Precip < 60) then Desert
Else If (Precip < 100) then Prairie
Else Forest



Pruned Decision Tree

Split: Precipitation < 60
  yes: Desert 2, Desert 5
  no:  Forest 120, Forest 104, Forest 116, Prairie 63

IF (Precip < 60) then Desert
Else [P(Forest) = .75] & [P(Prairie) = .25]
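The full and pruned trees can be written out directly from the six training records. This is a hand-coded sketch of the resulting rules, not a tree-induction algorithm; the helper names `classify_full` and `leaf_distribution` are hypothetical.

```python
from collections import Counter

# (ecosystem, precipitation) pairs from the slide.
samples = [("Desert", 2), ("Forest", 120), ("Forest", 104),
           ("Desert", 5), ("Forest", 116), ("Prairie", 63)]


def classify_full(precip):
    """Unpruned tree: two threshold tests on precipitation."""
    if precip < 60:
        return "Desert"
    if precip < 100:
        return "Prairie"
    return "Forest"


def leaf_distribution(samples, reaches_leaf):
    """Class probabilities among the samples that reach a given leaf."""
    counts = Counter(label for label, precip in samples if reaches_leaf(precip))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Pruned tree: the second split is dropped, so the Precip >= 60 leaf is mixed
# and reports class probabilities instead of a single label.
pruned_leaf = leaf_distribution(samples, lambda p: p >= 60)

# The unpruned tree reproduces every training label.
assert all(classify_full(p) == label for label, p in samples)
print(pruned_leaf)
```

Pruning trades exact fit on the training data for a smaller tree: the mixed leaf holds three Forest records and one Prairie record, which is where the 0.75 and 0.25 come from.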



Clustering

• Cluster Analysis – Concept Discovery
  • Create models for discovered concepts
  • No known class labels
  • Metrics based on cluster similarity

• Numerous methods
  • K-means (partitioning)
  • Bayesian Networks
  • Hierarchical clustering
  • Neural Networks

• Example applications
  • Identifying common subpopulations
  • Creating taxonomies (biological, manufacturing, commerce)
  • Discovering failure patterns in manufactured parts
  • Locating environmental risk areas…



Clustering – K-Means

Precipitation  Temperature
8              81
71             70
62             63
49             45
17             76
32             49

[Figure: scatter plot of Temperature (30 to 90) against Precipitation (0 to 80), repeated over successive k-means steps]

Cluster  Temperature  Precipitation  Ecosystem
C1       70 - 85      0 - 25         Desert
C2       35 - 60      25 - 55        Prairie
C3       50 - 80      50 - 80        Forest
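The six (precipitation, temperature) points are small enough to cluster by hand or with a few lines of code. The sketch below is a plain implementation of the standard k-means loop (Lloyd's algorithm). The choice of the first three points as initial centroids is an assumption; the slides do not say how the clusters were seeded.

```python
import math

# (precipitation, temperature) pairs from the slide.
points = [(8, 81), (71, 70), (62, 63), (49, 45), (17, 76), (32, 49)]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Assumption: seed with the first three points.
labels, centroids = kmeans(points, list(points[:3]))
print(labels)
print(centroids)
```

With this seeding the points pair up as (8, 81) with (17, 76), (71, 70) with (62, 63), and (49, 45) with (32, 49), which matches the Desert, Forest and Prairie ranges in the cluster table. A different seeding can converge to a different partition.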



On-line Analytical Processing (OLAP)

• On-line Transaction Processing (OLTP) vs. OLAP
  • Analysis & decision support are more compute intensive

• Concept hierarchies - (representing forests & trees)
  • Space: site, county, state, country…
  • Time: day, week, month….
  • Taxonomic hierarchies …

• Methods: rules, explicit specification, clustering
• Multidimensional data & efficient access/selection
• Operations: slice, dice, roll up, drill down, pivot



Concept Hierarchy for Precipitation

0-12 inches
  0-3 (low)
    0-1
    2-3
  4-8 (med)
    4-6
    7-8
  9-12 (high)
    9-10
    11-12



OLAP Examples

• Slice
  • For (precip = “4-8 inches”)

• Dice
  • For (precip = “4-8 inches” AND week = “120”)

• Drill down (specialization)
  • On time from months to weeks

• Roll up (generalization, summarization)
  • On Space from counties to states
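A small sketch may make the slice, dice and roll-up operations concrete. Everything in it is hypothetical: the fact rows, the county and state names, and the `precip_band` helper, which maps a reading onto the middle level of the precipitation hierarchy as read from the previous slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (state, county, week, precipitation in inches).
facts = [
    ("MI", "Ingham", 120, 5),
    ("MI", "Ingham", 121, 2),
    ("MI", "Eaton", 120, 7),
    ("OH", "Lucas", 120, 10),
    ("OH", "Wood", 121, 4),
]


def precip_band(inches):
    """Map a reading onto the 0-3 / 4-8 / 9-12 level of the hierarchy."""
    if inches <= 3:
        return "0-3 inches"
    if inches <= 8:
        return "4-8 inches"
    return "9-12 inches"


# Slice: fix a single dimension.
slice_rows = [r for r in facts if precip_band(r[3]) == "4-8 inches"]

# Dice: fix two or more dimensions.
dice_rows = [r for r in slice_rows if r[2] == 120]

# Roll up on space: summarize counties into their states.
by_state = defaultdict(int)
for state, _county, _week, inches in facts:
    by_state[state] += inches

print(len(slice_rows), len(dice_rows), dict(by_state))
```

Drill down is the reverse of the roll-up shown here: the same totals would be reported per county, or per week instead of per month.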



MSU Data Set

• Agricultural productivity simulation
  • Integrates land use, climate, ecosystem data
  • Remote sensing, computer simulations, field observations

• Inputs – geographic & climatic parameters
  • Max and min temperatures
  • Solar radiation
  • Precipitation ….

• Outputs – ecosystem
  • Leaf area index
  • Crop yield
  • Soil Water …



Statistics of MSU Simulation Data

• 20 years, daily records
• 1053 regions
• 5 million rows
• Approx 300MB

• Stuart Gage, MSU Computational Ecology and Visualization Lab

• http://www.cevl.msu.edu/index.html



Example: DBMS support for OLAP

• SQL support for rollup

SELECT region, week, day_of_week, sum(solrad)
FROM msu.details_table
GROUP BY ROLLUP (region, week, day_of_week)
ORDER BY region, week, day_of_week

• Output is summation of solrad by
  • (region, week, day_of_week)
  • (region, week, –)
  • (region, –, –)
  • (–, –, –)
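To see which groupings ROLLUP produces, the same aggregation can be emulated outside the database. This is an illustrative sketch over hypothetical rows, not how a DBMS evaluates the operator; `None` stands in for the "–" (all values) marker.

```python
from collections import defaultdict

# Hypothetical rows: (region, week, day_of_week, solrad).
rows = [
    (17003, 1, 1, 10.0),
    (17003, 1, 2, 20.0),
    (17003, 2, 1, 30.0),
    (17005, 1, 1, 40.0),
]


def rollup(rows, n_keys):
    """Sum the value column for every prefix of the key columns."""
    totals = defaultdict(float)
    for row in rows:
        keys, value = row[:n_keys], row[n_keys]
        # depth = n_keys is the detail level, depth = 0 is the grand total.
        for depth in range(n_keys, -1, -1):
            group = keys[:depth] + (None,) * (n_keys - depth)
            totals[group] += value
    return dict(totals)


result = rollup(rows, 3)
for group, total in result.items():
    print(group, total)
```

With three grouping columns ROLLUP yields four levels of aggregation, because it only drops columns from the right: (region, week, day_of_week), (region, week), (region), and the grand total.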



DBMS Support for Large Data Analysis

• Large database support

• Parallel processing

• OLAP functions

• New data types, object extensions, spatial data, XML…

• Distributed databases



Dealing with large databases

• In the beginning…
  • Database size ≤ max logical filesystem size (2GB) (UNIX)

• Tablespaces
  • A tablespace can have multiple tablespace containers
  • Size of tablespace container ≤ max filesystem size

[Figure: three configurations. (1) Database with tables T1, T2, T3 ... on one /filesystem; (2) Database with T1, T2, ... on /fs; (3) Database with Tablespace1 (T1, T3) and Tablespace2 (T2) spread over four /fs containers]



Tablespaces...

• Different types of tablespace containers
  • DBMS managed (“raw”)
  • File system managed (“cooked”)

• Different types of tablespaces
  • Regular data and indexes (typical max size of 64GB)
  • Large objects (LOB’s) and temporary data (typical max is 2TB)

• Larger page sizes for containers (4K to 32K)
  • Max. TS size for regular data increases to 512GB

• What if a given table is > 512GB?



Loading large databases

• The relevant industry benchmark is TPC-H (www.tpc.org)
  • Evolved from TPC-D Benchmark
  • First audited benchmark was performed in December 1995
  • 100GB database, 32-node IBM SP

• Current largest benchmark runs are for 1TB database

• Largest table in benchmark has
  • ~ 70% of data (700GB)
  • 6 billion rows

• Measures single user performance (“power metric”) and multi-user performance (“throughput metric”)



Large Database Benchmarks

• Results from IBM, April 2000
  • Loading 1TB database takes about 7.5 hours
  • Total disk = 9.7 times the raw size of database
  • Hardware configuration
    • 32 4-way IBM SP nodes, 4GB/node (128GB), 35x9GB disks/node
  • Total 5-year cost of system: $9.3M
  • Power: 12,812; QphH: 12,867; Price/perf: $725

• Results from HP, Feb. 13th, 2001
  • Loading database takes 5.25 hours
  • Total disk = 10.2 times raw size of database
  • Hardware configuration
    • 64 processor Superdome, 96GB memory, 3 disk arrays with 558 18.2GB drives
  • Total 5-year cost of system: $9.6M
  • Power: 13,730; QphH: 9,755; Price/perf: $985



Large Database Benchmarks

• IBM (cluster of SMP) vs HP (SMP) – based solely on analysis of published TPC-H numbers:
  • HP is 7.2% better in power (12,812 vs 13,730)
  • IBM is 24.2% better in throughput (12,867 vs 9,755)
  • IBM is 3.2% better in price ($9.3M vs $9.6M)
  • IBM is 36% better in price/performance ($725 vs $985)

• TPC-C Benchmark example – IBM
  • 32x4 processors, 4GB/node (128GB), 218 18GB disks/node
  • Total managed storage of ~125TB
  • 440,879 tpm-c
  • Total cost: $14.2M

• See www.tpc.org for all results
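The percentages in the IBM vs HP comparison can be reproduced from the published figures. The sketch below is only arithmetic; which system serves as the base of each percentage is inferred here from the numbers that match, and is not stated on the slide.

```python
def pct_diff(a, b, base):
    """Absolute difference between a and b as a percentage of `base`."""
    return abs(a - b) / base * 100


# Base inferred from the figures: the IBM value in each case.
power = pct_diff(13730, 12812, base=12812)      # HP ahead on power
throughput = pct_diff(12867, 9755, base=12867)  # IBM ahead on QphH
price = pct_diff(9.6, 9.3, base=9.3)            # 5-year cost in $M
price_perf = pct_diff(985, 725, base=725)       # $ per QphH

print(round(power, 1), round(throughput, 1), round(price, 1), round(price_perf))
```

Using the HP value as the base for throughput instead would give roughly 32% rather than 24.2%, so the base matters when quoting these comparisons.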



Large Database Benchmarks

• High-end database sizes
  • “several customers with 100TB of managed disk” – IBM
  • “customer has requested 1PB (that’s petabyte) of on-line storage for bioinformatics application over next 5 years” – Sun
  • “TB’s are passé, think PB’s” – IBM Life Sciences rep
  • Legacy formats are files, but newer data will be in DBMS

• Dealing with very large data sizes
  • Interfacing to archival storage
  • Parallelism



Linking DBMS to archival storage: The DB2/HPSS Project

• Joint project with IBM TJ Watson Research Center
• DB2 provides link to Tivoli ADSM
• Oracle also supports interface to archival storage

Create Tablespace HPSS-TSPACE
Managed By Database
Using FILE (HPSS <hpss-filename> <size> DISKBUF <path> <size>);

[Figure: DB2 database table; tablespace HPSS_TSPACE with containers C1 to C5; DB2 disk buffer; HPSS disk cache; HPSS]



Parallelism in Database Systems

• Example database

INPUT table:
region  int       -- spatial region, county
year    smallint  -- year (1972 - 1990)
day     smallint  -- day of year (1-366)
solrad  int       -- solar radiation
tmax    float     -- max day temp. (-33, 44)
tmin    float     -- min day temp. (-45, 29.5)
pp      float     -- precipitation (mm)
dd      float     -- degree days (heat)

OUTPUT table:
region    int
year      smallint
day       smallint
x_albers  int    -- x-coordinate
y_albers  int    -- y-coordinate
tdd10     float  -- total degree days
add       float  -- total anthesis degree days
tlai      float  -- total leaf area index
seed      float  -- total seed biomass (gr/m2)
yield     float  -- final yield (tons/ha)
twater    float  -- total soil water evaporation + total transpiration
ttsw      float  -- maximum water available



Generating query graphs

• Convert SQL queries to query execution plans consisting of low-level query operators
  • Q1: Select all regions where max temp is greater than 40 degrees, over the entire period of the study:
    SELECT distinct(region) FROM Input WHERE tmax>40
  • Q2: Select solar radiation and total leaf area index values for all days and regions in the year 1978:
    SELECT solrad, tlai FROM Input A, Output B
    WHERE A.region=B.region AND A.year=B.year AND A.day=B.day

[Figure: query graphs, read bottom-up. Q1: Read INPUT table, Apply tmax>40, Remove duplicates and format output. Q2: Read INPUT table and Read OUTPUT table, Join (region, day, year), Format output]
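The Q1 plan can be mimicked with chained generators, one per low-level operator. This is a toy sketch over hypothetical rows to show how operators compose into a pipeline; the operator names `scan`, `select` and `project_distinct` are chosen here and are not DBMS internals.

```python
# Hypothetical INPUT rows; only the columns that Q1 touches are filled in.
input_table = [
    {"region": 1, "tmax": 41.0},
    {"region": 1, "tmax": 42.5},
    {"region": 2, "tmax": 35.0},
    {"region": 3, "tmax": 40.5},
]


def scan(table):
    """Read operator: yield rows one at a time."""
    yield from table


def select(rows, predicate):
    """Filter operator: pass through rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project_distinct(rows, column):
    """Keep one column and remove duplicate values."""
    seen = set()
    for row in rows:
        if row[column] not in seen:
            seen.add(row[column])
            yield row[column]


# Q1: SELECT distinct(region) FROM Input WHERE tmax > 40
plan_q1 = project_distinct(
    select(scan(input_table), lambda r: r["tmax"] > 40), "region"
)
regions = list(plan_q1)
print(regions)
```

Because each operator pulls rows from the one below it on demand, no intermediate result is materialized; this is the pipelining that inter-operator parallelism exploits.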



Levels of Query Parallelism

• Inter-query
  • Execute multiple queries (Q1 and Q2) at the same time

• Inter-operator (intra-query)
  • Concurrently execute multiple operators in the query
  • Pipeline through the operators, e.g. read and join

[Figure: query graph, top operator: Format output]





Levels of Parallelism...

• Intra-operator
  • Data parallelism
  • Employ multiple processes for each operator

[Figure: query graph. Read table (INPUT table) and Read table (OUTPUT table), Join, Format output]



Parallel Architecture models and DBMS

• Shared-everything
  • memory, process space, disk subsystem are all common

• Shared disk
  • Separate memory/process space
  • Disk subsystem/filesystem is common

• Shared nothing
  • Separate memory, disks, OS…
  • Only communication “bus” is shared



Shared Everything

[Figure: Processor, Memory, Disk]

• SMP: Symmetric Multi-Processors
  • Provide well-balanced systems
  • Shared workload, resilient to “unexpected” workload
  • Dynamic allocation of processes to query operators (inter- as well as intra-query)
  • Expensive and don’t scale to large configurations



Shared Disk

[Figure: Processor, Memory, Disk]

• Some of the classic architectures map to this: VaxCluster, IBM mainframes (could make a comeback with SAN’s)
• Can share I/O workload, dynamic partitioning of data
• Only need to scale I/O subsystem, and not memory



Shared Nothing

[Figure: Processor, Memory, Disk]

• Highly scalable
• Static partitioning of data
• Cannot share workload
• Cluster of SMP’s provides advantages of shared-nothing and SMP’s



Combining Nodegroups and Tablespaces

[Figure: two layouts on a shared-nothing (SN) system. (1) A single Nodegroup with Tablespace1 (INPUT) and Tablespace2 (OUTPUT); (2) Nodegroup1 and Nodegroup2 with Tablespace1 (INPUT) and Tablespace2 (OUTPUT)]





The DBMS/Application bottleneck

• Serial communication between DBMS and app.

Application



The DBMS/Application bottleneck

App App App App

• Parallel communication between DBMS and app.

App



DBMS / DM software connection

Data Mining Platform

Database Platform

Extract data subsets

Generate results

Presentation (e.g. GIS, 3D)

Store session results



Performance Tuning

• Sample set of Database Manager configuration parameters:

  • CPU speed (millisec/instruction) (CPUSPEED) = 9.700848e-07
  • Comm. bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+00
  • Max number of existing agents (MAXAGENTS) = 400
  • Initial number of agents in pool (NUM_INITAGENTS) = 0
  • Max number of coord. agents (MAX_COORDAGENTS)
  • Max no. of concurrent coord. agents (MAXCAGENTS)
  • Maximum query degree of parallelism (MAX_QUERYDEGREE) = ANY
  • Enable intra-partition parallelism (INTRA_PARALLEL) = NO



Database Tuning

• Sample set of Database configuration parameters:

  • Default query optimization class (DFT_QUERYOPT) = 9
  • Degree of parallelism (DFT_DEGREE) = 1

  • Database heap (4KB) (DBHEAP) = 1200
  • Catalog cache size (4KB) (CATALOGCACHE_SZ) = 64
  • Log buffer size (4KB) (LOGBUFSZ) = 8
  • Utilities heap size (4KB) (UTIL_HEAP_SZ) = 5000
  • Buffer pool size (pages) (BUFFPAGE) = 128000
  • Max storage for lock list (4KB) (LOCKLIST) = 100

  • Number of asynch page cleaners (NUM_IOCLEANERS) = 1
  • Number of I/O servers (NUM_IOSERVERS) = 3
  • Sequential detect flag (SEQDETECT) = YES
  • Default prefetch size (pages) (DFT_PREFETCH_SZ) = 32



Examples of data exploration

• Testing temporal relationship (sensitivity analysis)
  • Can conditions from day N-1 be used to predict output of day N?
  • How far back can we go?

• Input table:

Region  Year  Day  Inputs  Outputs
1       78    1    I1      O1
1       78    2    I2      O2
1       78    3    I3      O3
1       78    4    I4      O4
2       78    1    I1      O1
2       78    2    I2      O2

• Generate output:
  • Region, Year, Day, Input(i), Output(i), Output(i-1)



“Flattening” the table

• E.g. SQL query:

SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
FROM msu.combined A, msu.combined B
WHERE A.region=B.region AND A.year=B.year
AND A.day=B.day-1

• Query Explain facility
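The self-join can be tried on a toy table with Python's built-in `sqlite3` module. This is a sketch with made-up rows. The table is named `combined` rather than `msu.combined` because no `msu` schema exists in the in-memory database, and the index name `c_ryd` simply mirrors the (region, year, day) index that appears in the indexed plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE combined (region INT, year INT, day INT, solrad INT, tlai REAL)"
)
# Hypothetical rows: (region, year, day, solrad, tlai).
conn.executemany(
    "INSERT INTO combined VALUES (?, ?, ?, ?, ?)",
    [
        (1, 78, 1, 100, 0.1),
        (1, 78, 2, 110, 0.2),
        (1, 78, 3, 120, 0.3),
        (2, 78, 1, 90, 0.5),
        (2, 78, 2, 95, 0.6),
    ],
)
# Index on the join columns, so the join can probe instead of sorting.
conn.execute("CREATE INDEX c_ryd ON combined (region, year, day)")

# Pair each day's row (A) with the following day's row (B) for the same
# region and year, putting both on one output row.
flattened = conn.execute(
    """
    SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
    FROM combined A, combined B
    WHERE A.region = B.region AND A.year = B.year AND A.day = B.day - 1
    ORDER BY A.region, A.day
    """
).fetchall()

for row in flattened:
    print(row)
conn.close()
```

The last day of each region has no successor, so it drops out of the result: five input rows produce three flattened rows here.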



“Flattening” the table

Access Table Name = MSU.COMBINED
| #Columns = 5
| Relation Scan
| | Prefetch: Eligible
| Insert Into Sorted Temp Table ID = t1
| | #Columns = 4
| | #Sort Key Columns = 1
| | | Key 1: YEAR (Ascending)
Access Temp Table ID = t1
| Relation Scan
| | Prefetch: Eligible
Merge Join
| Access Table Name = MSU.COMBINED
| | #Columns = 4
| | Relation Scan
| | | Prefetch: Eligible
| | Insert Into Sorted Temp Table ID = t2
| | | #Columns = 4
| | | #Sort Key Columns = 1
| | | | Key 1: YEAR (Ascending)
| Access Temp Table ID = t2
| | Relation Scan
| | | Prefetch: Eligible
| Residual Predicate(s)
| | #Predicates = 2
Return Data to Application
| #Columns = 7



“Flattening” the table, with indexing

Access Table Name = MSU.COMBINED
| #Columns = 5
| Relation Scan
| | Prefetch: Eligible
| Insert Into Sorted Temp Table ID = t1
| | #Columns = 4
| | #Sort Key Columns = 1
| | | Key 1: REGION (Ascending)
Access Temp Table ID = t1
| Relation Scan
| | Prefetch: Eligible
Nested Loop Join
| Access Table Name = MSU.COMBINED
| | #Columns = 4
| | Index Scan: Name = MSU.C_RYD
| | | Index Columns:
| | | | 1: REGION (Ascending)
| | | | 2: YEAR (Ascending)
| | | | 3: DAY (Ascending)
| | | Data Prefetch: Eligible 157
| | | Index Prefetch: Eligible 157
| | Return Data to Application
| | | #Columns = 7



Declustering the table

• Partition the table by Region and/or Year
  • Linearly scalable join operation

• Testing spatial relationships/sensitivity
  • Compare region R with a specified neighborhood of R
  • Compare region R with other “similar” regions – spatial clustering
  • Decluster table by year/day
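Partitioning both tables on the join key lets each partition be joined independently, which is why the join scales with the number of partitions. The sketch below illustrates only that structure, with hypothetical rows and a modulo partitioning function chosen here. Python threads do not provide real CPU parallelism, so this is not a performance demonstration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: INPUT (region, year, day, solrad),
# OUTPUT (region, year, day, tlai).
input_rows = [(1, 78, 1, 100), (2, 78, 1, 90), (3, 78, 1, 80), (4, 78, 1, 70)]
output_rows = [(1, 78, 1, 0.1), (2, 78, 1, 0.5), (3, 78, 1, 0.7), (4, 78, 1, 0.9)]


def partition(rows, n_parts):
    """Decluster rows by region so matching rows land in the same partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[0] % n_parts].append(row)
    return parts


def local_join(pair):
    """Hash join of one INPUT partition with the matching OUTPUT partition."""
    left, right = pair
    index = {row[:3]: row[3] for row in right}
    return [row + (index[row[:3]],) for row in left if row[:3] in index]


n_parts = 2
left_parts = partition(input_rows, n_parts)
right_parts = partition(output_rows, n_parts)
pairs = [(left_parts[i], right_parts[i]) for i in range(n_parts)]

# Each partition pair is joined without looking at any other partition.
with ThreadPoolExecutor(max_workers=n_parts) as pool:
    joined = [row for part in pool.map(local_join, pairs) for row in part]

print(sorted(joined))
```

Because both tables are partitioned on the same key, no rows need to move between partitions during the join. A spatial-neighborhood query does not have that property when the table is partitioned by region, which is one motivation for declustering by year/day instead.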



Built-in support for OLAP

• Example table
  • INPUT (Region, Week, Day_of_week, Solrad)
  • 2 regions, 1978, 250 days/year (500 rows)

• SQL support for rollup

SELECT region, week, day_of_week, sum(solrad)
FROM Input
GROUP BY ROLLUP (region, week, day_of_week)

• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –)
  • (region, –, –), (–, –, –)



Region Week Day_of_Week SUM(Solrad)

17003 1 1 1661.0

17003 1 2 2654.0

17003 1 3 2709.0

17003 1 4 2101.0

17003 1 5 1197.0

17003 1 6 1605.0

17003 1 7 1133.0

17003 1 - 13060.0

….

17003 36 1 6030.0

17003 36 2 6222.0

17003 36 3 6351.0

17003 36 4 6387.0

17003 36 5 6160.0

17003 36 - 31150.0

17003 - - 1206273.0

- - - 2398149.0



The “cube” operator

• SQL query

SELECT region, week, day_of_week, sum(solrad)
FROM Input
GROUP BY CUBE (region, week, day_of_week)

• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –), (region, –, –), (–, –, –)
  • (region, –, day_of_week)
  • (–, week, day_of_week)
  • (–, week, –)
  • (–, –, day_of_week)
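CUBE differs from ROLLUP in that it aggregates over every subset of the grouping columns, not only the prefixes. A minimal emulation over hypothetical rows, with `None` standing in for the "–" marker, might look like this:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (region, week, day_of_week, solrad).
rows = [
    (17003, 1, 1, 10.0),
    (17003, 1, 2, 20.0),
    (17003, 2, 1, 30.0),
    (17005, 1, 1, 40.0),
]


def cube(rows, n_keys):
    """Sum the value column for every subset of the key columns."""
    totals = defaultdict(float)
    for row in rows:
        keys, value = row[:n_keys], row[n_keys]
        for size in range(n_keys + 1):
            for kept in combinations(range(n_keys), size):
                group = tuple(
                    keys[i] if i in kept else None for i in range(n_keys)
                )
                totals[group] += value
    return dict(totals)


result = cube(rows, 3)
# Which columns are aggregated away in each group: one pattern per grouping.
patterns = {tuple(v is None for v in group) for group in result}
print(len(patterns))
```

Three grouping columns give 2^3 = 8 groupings, the four ROLLUP levels plus the four that ROLLUP skips, which is the list shown on the slide.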



Distributed data mining

• “Function shipping” vs. “data shipping”
  • Generalization of the “operator pushdown” notion
  • “DataCutter” operations in SRB
  • Source/wrapper-side processing in MIX

• Need to understand which operations can be distributed and how

• Web-based infrastructure for OLAP and DM
  • XML for Analysis



“Remote” operations in SRB

[Figure: Application (SRB client); SRB middleware (SRB servers, MCAT); DataCutter and other “remote” operations]



Wrapper-side processing in MIX

[Figure: Application; MIXm Mediator; Wrappers; data sources (including an XML data source)]



The role of XML

• Representing, exchanging metadata
  • image headers, instrumentation information, descriptive metadata...

• Expressing service descriptions
  • Web-based services

• Exchanging data among services
  • “Raw” data: sequence information, GIS information…
  • Results of analysis: rowsets, multidimensional cubes,...



XML for Analysis

[Figure: Client Web Service (UI; Client Functionality; Client Functions; Discover, Execute calls; SOAP; HTTP) exchanges Discover and Execute calls and Data with the Provider Web Service (HTTP; SOAP; Discover, Execute calls; XML for Analysis Provider Implementation; Server), which is backed by the Data Source]



Examples - Overview

• Intelligent Miner – Data Analysis and Mining
  • Interface, database connectivity, data creation
  • Statistical routines
  • Classification
    • Decision Tree
    • Neural Network
  • Clustering

• Netica - Probabilistic Modeling and Decision Support
  • Belief networks, probabilistic queries
  • Statistical decision theory, decision models, influence diagrams



























Probabilistic Modeling and Bayesian Belief Networks

Productivity

Precipitation Solar Radiation

Yield







Statistical Decision Theory*

• Normative model of rational decision making
  • Decision: Irrevocable allocation of resources
  • Beliefs: Probability theory
  • Preferences: Utility theory
  • Expected Utility = Probability * Utility
  • Value of Information = EU (A | I) – EU(A)

• Principle of Rationality: Maximize Expected Utility

• (*Rational agents are your friends.)
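A worked example may help with the two formulas. The sketch below reads expected utility as the probability-weighted sum of utilities over possible states, and computes the value of perfect information about the state. The actions, states, probabilities and utilities are all hypothetical.

```python
# Hypothetical beliefs and utilities for a two-action, two-state decision.
p_state = {"dry": 0.3, "wet": 0.7}
utility = {
    ("irrigate", "dry"): 80, ("irrigate", "wet"): 60,
    ("skip", "dry"): 20, ("skip", "wet"): 100,
}
actions = ["irrigate", "skip"]


def expected_utility(action, beliefs):
    """EU(action) = sum over states of P(state) * U(action, state)."""
    return sum(p * utility[(action, s)] for s, p in beliefs.items())


def best_action(beliefs):
    """Principle of rationality: pick the action with maximum EU."""
    return max(actions, key=lambda a: expected_utility(a, beliefs))


# EU(A): commit to one action under the prior beliefs.
eu_prior = expected_utility(best_action(p_state), p_state)

# EU(A | I): learn the state first, then take the best action for it.
eu_informed = sum(
    p * max(utility[(a, s)] for a in actions) for s, p in p_state.items()
)

value_of_information = eu_informed - eu_prior
print(best_action(p_state), eu_prior, eu_informed, value_of_information)
```

Here skipping has the higher expected utility (76 against 66), knowing the state in advance would raise it to 94, and so the information is worth up to 18 utility units.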







BK2 Induction Algorithm

• Data Mining via Probabilistic Model Induction
• Discover Network Structure and Parameters
• Greedy Algorithm – ML gradient search
• Encode background Knowledge – Preferences



Model Day 1



Model Day 120



Other Mining Applications

• Spatial Data Mining
• Time Series
• Sequence Mining
• Text Data Mining
• Multimedia Database Mining
• Web Mining
• Network Traffic Analysis



Acknowledgements

• Students
  • Peter Shin, Ankur Jain

• Science Collaborator
  • Stuart Gage, MSU – shared his data set and many insights about the data

• SDSC
  • Mike Vildibill, Deputy Dir, providing hardware resources for SKIDL
  • Josh Polterock / Dave Archbell – help with software installation, maintenance

• Funding support
  • NPACI ESS: support for Tony Fountain, Ankur Jain
  • NPACI DICE: support for Chaitan Baru
  • NSF REU: support for Peter Shin
