San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
NPACI AHM2001
Tutorial on
Data Mining for Scientific Applications
Chaitan Baru
Tony Fountain
San Diego Supercomputer Center
Tutorial Objectives
• Provide overview of the infrastructure – technologies and techniques – for:
  • data mining, database systems
• Provide some illustrative examples of how the infrastructure can be used in scientific applications
• Present plans for the SDSC Knowledge and Information Discovery Lab (SKIDL)
• Identify potential collaborations – for applications as well as infrastructure
– Our emphasis is on the infrastructure
Tutorial Outline
8:00 - 8:15 Data intensive computing in NPACI (Baru)
8:15 - 9:15 Introduction to data mining (Fountain)
9:15 - 10:15 DBMS support for analysis of large-scale data (Baru)
10:15 - 10:30 BREAK
10:30 - 12:00 Examples of data mining tools (Fountain)
Next steps...
NPACI DICE
• Focus on data, information, and knowledge management:
  • Persistent archives
  • Use of XML and archival storage systems (e.g. HPSS) for data storage
  • Metadata-based access to data sets (Extensible Metadata Catalog, eMCAT)
  • Distributed data handling (Storage Resource Broker, SRB)
  • Information mediation (Mediation of Information using XML, MIX)
  • Model-based mediation (NeuroMIX), use of Topic Maps
HPSS
• Capacity
  • Total: >400TB
  • Current usage: >240TB stored
• Load
  • Transfer rate: 1TB/day
• SRB provides a “container” mechanism for better usage and improved efficiency
The SDSC Storage Resource Broker
• Metadata-based access to data sets stored in distributed, heterogeneous storage resources
[Diagram labels: Application (SRB client); SRB Middleware (SRB Servers, MCAT); Distributed Storage Resources (DB2, Oracle, ObjectStore; HPSS, UniTree; UNIX, ftp); Solaris, Linux, NT, AIX, HP-UX, IRIX]
Current Usage of SRB
• Collections
  • Digital Sky: ~4TB, ~8 million files
  • Digital Embryo: ~700GB, millions of files
  • Digital library collections (ADL, UCB, Michigan): ~1 million files
  • HyperLTER – hyperspectral data
  • Particle Physics Data Grid
• Upcoming collections
  • SLAC...
Mediation of Information using XML (MIX)
[Diagram labels: Data Source, XML Data Source, Data Source; Wrapper; XML View(s); MIXm Mediator; Blended Browsing and Querying (BBQ) interface]
• Definition of mediated view in XML Matching And Structuring (XMAS) query language
• Lazy evaluation of XMAS queries using DOM-VXD
From data management infrastructure to knowledge discovery infrastructure
• The Affymetrix story
  • “Technology built for Wall Street helps bioinformatics companies as well…”
• The “scientist in the middle”
  • The infrastructure is a tool to help the scientist, not a replacement!
[Diagram labels: infrastructure, KDD]
The Infrastructure Supports:
• Exploratory data analysis of large data sets
  • efficient ad hoc statistical processing
• Parallel data access, subsetting, and analysis
• Data intensive approach to model building and verification
  • including fusion of different forms of data (e.g. database tables, instrument outputs, remote sensing data, maps, …)
– Employ, and build upon, existing (commercial, freeware) tools and software packages, as much as possible
The SDSC Knowledge and Information Discovery Lab (SKIDL)
• Initial hardware platform
  • 2-processor Sun, 512MB memory, 36 GB local disk
• Upgrade to:
  • 20-processor Sun, 6 GB memory, 400 GB local disk
  • Access to additional disk storage via storage area network (SAN)
• Possible further upgrade (via CalIT2)
  • Additional 4 GB memory, 1 TB SAN disk, Gigabit Ethernet capability
• Software
  • High-performance, parallel database systems and file systems
    • DB2
    • Oracle, GPFS
  • Suite of data mining tools
    • Intelligent Miner, MineSet, Bayesian network tools
    • S-Plus, Darwin, Clementine, SAS
  • Presentation, visualization: ESRI ArcIMS, ...
Data Mining
Tony Fountain
NPACI ESS
SDSC Knowledge & Information Discovery Lab
Overview (DM101)
• Part 1:
  • Definition
  • Motivations
  • Methods, Techniques, & Tools
• Database 605 – Chaitan Baru
• Part 2:
  • Examples & Demos
  • Data Mining to Decision Support
Outline (DM101)
Part 1 – What is data mining?
1. Direct
2. Contributions from other disciplines
3. Motivations & context
4. Example applications
5. Analytical methods:
  • Association Rules
  • Classification & Prediction
  • Clustering
  • OLAP
6. MSU data set
Definition
The search for interesting patterns,
in large databases,
that were collected for other applications,
using machine learning algorithms,
and high-performance computers,
for fun and profit!
Definition
The search for interesting patterns,
in large databases,
that were collected for other applications,
using machine learning algorithms,
and high-performance computers,
for science and society!
KDD Process
Knowledge Discovery and Data Mining
Collection
Processing/Cleansing/Correction/Formatting
Mining/Analysis/Modeling
Presentation/Visualization
Application/Decision Support
Management/Integration/Warehousing
Data Mining & Knowledge Discovery
KD, KDD, KDD(D)*
What’s in a name?
• Database
• Data Mining
• Discovery
• Derivation
• Decision Support
Contributions
Data Mining
Artificial Intelligence
High Performance Computing
Statistics
Database Systems
Operations Research
GIS
Visualization
The Case for Data Mining: Data Reality
• Controlled experimental data collection is an ideal
• Legacy archives and independent collection activities
• Deluge from new sources
  • Remote Sensing
  • Instrumentation & Wireless Communications
  • Simulation Models
• Growth of data collections vs. analysts
• Many types of data, many uses, many types of queries
• Advances in computational infrastructure provide new opportunities for access and integration
• Paradigm shift: hypothesis-driven data collection to data mining (KDD)
The Revolution in Ecology
• Computational Ecology and Eco-Informatics
• Instrumentation & Remote Sensing
  • Amphibian urls and hyperspectral data
  • Tropical glaciers in Ohio
• Computer Simulations
  • Coupled biogeochemistry, ocean, atmosphere…
• Ecology without boots!
Classic Applications - Commercial
• Fraud Detection – credit card
• Churning – long-distance carriers
• Targeted Marketing – customer profiles
• Stock Market – futures trading
• Market Basket Analysis
• Soon to be classic: FL 2000 election
Classic Applications - Science
• Volcanoes on Venus - Classification
• Burl, et al., NASA, Cal Tech.
• Astronomical clustering – Autoclass, Bayesian Clustering
• Cheeseman, Stutz, NASA
• Oil spills from remote sensing data – Decision Trees
• Kubat, et al., Ottawa
• Biodiversity analysis – Genetic algorithms, Bayesian Nets
• Stockwell, SDSC/UCSD
• YOUR NAME HERE!! (1800-SKIDLME)
Data Mining Tools (suites)
• SPSS - Clementine
• http://www.spss.com/clementine/
• Oracle - Darwin
• http://www.oracle.com/ip/analyze/warehouse/datamining/
• SGI - MineSet
• http://www.sgi.com/software/mineset/
• IBM - Intelligent Miner
• http://www-4.ibm.com/software/data/iminer/fordata/
• http://www.kdnuggets.com/software/index.html
Data Mining Analytical Techniques (patterns, hypotheses, models)
• Statistical Methods
  • Descriptive, Modeling, Data Reduction…
• Associations
  • Simple relations in categorical data
• Classification & Prediction
  • Model induction - Supervised learning
• Clustering
  • Concept discovery - Unsupervised learning
Association Rule Mining
• Associations
  • Simple rules in categorical data
• Sample applications
  • Market Basket Analysis: Buys(Milk) => Buys(Eggs)
  • Transaction Processing: Income(Hi) & Single(Y) => Owns(Computer)
• Search for Strong Rules
  • Support R(A => B) = P(A U B), where A U B is the union of the two itemsets, i.e. the probability that a transaction contains both A and B
  • Confidence R(A => B) = P(B | A) = P(AB) / P(A)
Association Rule Mining
70 Bird Antelope Lion
80 Hyena Bird Snake
70 Tiger Snake Antelope
70 Bird Lion Hyena
70 Snake Lion Bird
R1: [70 => (Bird & Lion)]
Support: P(70 & (Bird & Lion)) = 3/5 = 60%
Confidence: P((Bird & Lion) | 70) = P(Bird & Lion & 70) / P(70) = (3/5) / (4/5) = 75%
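The support and confidence of R1 can be checked mechanically. The sketch below is only an illustration: it treats each row of the slide's table as a set of items, and the helper names `support` and `confidence` are made up for this example rather than taken from any data mining tool.

```python
# Transactions from the slide; each row is treated as a set of items.
transactions = [
    {"70", "Bird", "Antelope", "Lion"},
    {"80", "Hyena", "Bird", "Snake"},
    {"70", "Tiger", "Snake", "Antelope"},
    {"70", "Bird", "Lion", "Hyena"},
    {"70", "Snake", "Lion", "Bird"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)


def confidence(antecedent, consequent, rows):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


A = {"70"}            # rule antecedent
B = {"Bird", "Lion"}  # rule consequent

rule_support = support(A | B, transactions)
rule_confidence = confidence(A, B, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

A rule is usually called "strong" when both numbers exceed user-chosen thresholds; the thresholds themselves are not given on the slide.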
Classification
• Classification and prediction
  • Create model for distinguishing concepts
  • Labeled training data
  • Metrics based on accuracy rates and cross-validation
• Numerous methods
  • Decision trees
  • Neural Nets
  • Bayesian Networks
  • Regression
• Many applications
  • Identifying credit risks
  • Predicting biological productivity
  • Medical diagnosis
  • Classifying toxic risks…
Classification – Decision Tree
Ecosystem Precipitation
Desert 2
Forest 120
Forest 104
Desert 5
Forest 116
Prairie 63
Classification – Decision Tree
• Split 1: Precipitation < 60
  • Yes: Desert 2, Desert 5
  • No: Forest 120, Forest 104, Forest 116, Prairie 63
• Split 2 (on the "No" branch): Precipitation < 100
  • Yes: Prairie 63
  • No: Forest 120, Forest 104, Forest 116
• Resulting rules:
IF (Precip < 60) THEN Desert
ELSE IF (Precip < 100) THEN Prairie
ELSE Forest
Pruned Decision Tree
• Single split: Precipitation < 60
  • Yes: Desert 2, Desert 5
  • No: Forest 120, Forest 104, Forest 116, Prairie 63
• Resulting rules:
IF (Precip < 60) THEN Desert
ELSE [P(Forest) = .75] & [P(Prairie) = .25]
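As a small sketch, the full and pruned trees can be written as plain functions over the six training rows. The function names are hypothetical, and the thresholds 60 and 100 are copied from the slides rather than learned by a tree-induction algorithm.

```python
from collections import Counter

# (ecosystem, precipitation) training rows from the slide.
data = [("Desert", 2), ("Forest", 120), ("Forest", 104),
        ("Desert", 5), ("Forest", 116), ("Prairie", 63)]


def full_tree(precip):
    """Two-split tree: every leaf holds a single class."""
    if precip < 60:
        return "Desert"
    if precip < 100:
        return "Prairie"
    return "Forest"


def pruned_tree(precip, rows=data):
    """One-split tree: the 'no' branch returns class probabilities
    estimated from the training rows that fall into that leaf."""
    if precip < 60:
        return {"Desert": 1.0}
    leaf = [label for label, p in rows if p >= 60]
    counts = Counter(leaf)
    return {label: n / len(leaf) for label, n in counts.items()}


# Training accuracy of the full tree, and the pruned leaf's distribution.
accuracy = sum(full_tree(p) == label for label, p in data) / len(data)
print(accuracy, pruned_tree(80))
```

Pruning trades training accuracy for a simpler model: the pruned leaf can no longer separate Prairie from Forest, so it reports the class mix instead.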
Clustering
• Cluster Analysis – Concept Discovery
  • Create models for discovered concepts
  • No known class labels
  • Metrics based on cluster similarity
• Numerous methods
  • K-means (partitioning)
  • Bayesian Networks
  • Hierarchical clustering
  • Neural Networks
• Example applications
  • Identifying common subpopulations
  • Creating taxonomies (biological, manufacturing, commerce)
  • Discovering failure patterns in manufactured parts
  • Locating environmental risk areas…
Clustering – K-Means
Precipitation Temperature
8 81
71 70
62 63
49 45
17 76
32 49
[Figure: Temperature (y-axis, 30–90) vs. Precipitation (x-axis, 0–80) plot, repeated over several build slides]
Clustering – K-Means
Cluster Temperature Precipitation Ecosystem
C1 70 – 85 0 – 25 Desert
C2 35 – 60 25 – 55 Prairie
C3 50 – 80 50 – 80 Forest
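A minimal K-means sketch over the six (precipitation, temperature) points from the earlier slide is given below. Seeding the centroids with the first three points is an assumption made here so the run is deterministic; the slides do not say how the clusters were initialized.

```python
# (precipitation, temperature) points from the slide.
points = [(8, 81), (71, 70), (62, 63), (49, 45), (17, 76), (32, 49)]


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    clusters = [
        [p for p, a in zip(points, assignment) if a == i]
        for i in range(len(centroids))
    ]
    return clusters, centroids


# Assumption: seed with the first three points (k = 3).
clusters, centroids = k_means(points, list(points[:3]))
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

With this seeding the three groups agree with the cluster ranges in the table above; other seeds may converge to different partitions, which is a known property of K-means.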
On-line Analytical Processing (OLAP)
• On-line Transaction Processing (OLTP) vs. OLAP
  • Analysis & decision support are more compute intensive
• Concept hierarchies - (representing forests & trees)
  • Space: site, county, state, country…
  • Time: day, week, month….
  • Taxonomic hierarchies …
  • Methods: rules, explicit specification, clustering
• Multidimensional data & efficient access/selection
• Operations: slice, dice, roll up, drill down, pivot
Concept Hierarchy for Precipitation
0-12 inches
  0-3 (low): 0-1, 2-3
  4-8 (med): 4-6, 7-8
  9-12 (high): 9-10, 11-12
OLAP Examples
• Slice
  • For (precip = “4-8 inches”)
• Dice
  • For (precip = “4-8 inches” AND week = “120”)
• Drill down (specification)
  • On time from months to weeks
• Roll up (generalization, summarization)
  • On Space from counties to states
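The slice, dice, and roll-up operations above can be sketched on a handful of rows. Everything in the fact table below (county names, weeks, values) is hypothetical illustration data, and `precip_band` is an assumed helper that maps a raw reading onto the middle level of the precipitation concept hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: (county, state, week, precip_inches, crop_yield)
facts = [
    ("Ingham", "MI", 120, 5, 3.1),
    ("Eaton", "MI", 120, 2, 2.4),
    ("Ingham", "MI", 121, 7, 3.6),
    ("Story", "IA", 120, 6, 4.0),
    ("Story", "IA", 121, 10, 4.4),
]


def precip_band(inches):
    """Map a raw reading to the low / med / high level of the hierarchy."""
    if inches <= 3:
        return "0-3 inches"
    if inches <= 8:
        return "4-8 inches"
    return "9-12 inches"


# Slice: fix a single dimension (precip = "4-8 inches").
slice_rows = [f for f in facts if precip_band(f[3]) == "4-8 inches"]

# Dice: restrict two dimensions (precip = "4-8 inches" AND week = 120).
dice_rows = [f for f in slice_rows if f[2] == 120]

# Roll up on space: summarize county-level rows to the state level.
yield_by_state = defaultdict(float)
for county, state, week, precip, crop_yield in facts:
    yield_by_state[state] += crop_yield

print(len(slice_rows), len(dice_rows), dict(yield_by_state))
```

Drill down is the reverse of the roll-up step: it needs the finer-grained rows to be kept, which is why OLAP systems store data at the lowest level of each hierarchy.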
MSU Data Set
• Agricultural productivity simulation
  • Integrates land use, climate, ecosystem data
  • Remote sensing, computer simulations, field observations
• Inputs – geographic & climatic parameters
  • Max and min temperatures
  • Solar radiation
  • Precipitation ….
• Outputs – ecosystem
  • Leaf area index
  • Crop yield
  • Soil Water …
Statistics of MSU Simulation Data
• 20 years, daily records
• 1053 regions
• 5 million rows
• Approx 300MB
• Stuart Gage, MSU Computational Ecology and Visualization Lab
• http://www.cevl.msu.edu/index.html
Example: DBMS support for OLAP
• SQL support for rollup
  SELECT region, week, day_of_week, sum(solrad)
  FROM msu.details_table
  GROUP BY ROLLUP (region, week, day_of_week)
  ORDER BY region, week, day_of_week
• Output is summation of solrad by
  • (region, week, day_of_week)
  • (region, week, –)
  • (region, –, –)
  • (–, –, –)
DBMS Support for Large Data Analysis
• Large database support
• Parallel processing
• OLAP functions
• New data types, object extensions, spatial data, XML…
• Distributed databases
Dealing with large databases
• In the beginning…
  • Database size ≤ max logical filesystem size (2GB) (UNIX)
• Tablespaces
  • A tablespace can have multiple tablespace containers
  • Size of tablespace container ≤ max filesystem size
[Diagram labels: Database with T1, T2, T3 ... on one /filesystem; Database with T1, T2, ... on /fs; Database with Tablespace1 (T1, T3) and Tablespace2 (T2) on several /fs]
Tablespaces...
• Different types of tablespace containers
  • DBMS managed (“raw”)
  • File system managed (“cooked”)
• Different types of tablespaces
  • Regular data and indexes (typical max size of 64GB)
  • Large objects (LOB’s) and temporary data (typical max is 2TB)
• Larger page sizes for containers (4K to 32K)
  • Max. TS size for regular data increases to 512GB
• What if a given table is > 512GB?
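The jump from 64GB to 512GB matches a fixed limit on the number of pages per tablespace. That reading is an inference from the two figures on the slide, not something the slide states; under that assumption the arithmetic is:

```python
KB = 2 ** 10
GB = 2 ** 30

# Assumption: the 64GB limit at 4K pages reflects a fixed page count.
max_pages = (64 * GB) // (4 * KB)  # 16,777,216 pages = 2**24


def max_tablespace_gb(page_size_kb):
    """Largest regular tablespace, in GB, for a given page size in KB."""
    return max_pages * page_size_kb * KB // GB


for page_size_kb in (4, 8, 16, 32):
    print(f"{page_size_kb:>2}K pages -> {max_tablespace_gb(page_size_kb)} GB")
```

Under this model a table larger than 512GB cannot live in a single regular tablespace, which motivates the partitioning and parallelism material that follows.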
Loading large databases
• The relevant industry benchmark is TPC-H (www.tpc.org)
  • Evolved from TPC-D Benchmark
  • First audited benchmark was performed in December 1995
  • 100GB database, 32-node IBM SP
• Current largest benchmark runs are for 1TB database
• Largest table in benchmark has
  • ~ 70% of data (700GB)
  • 6 billion rows
• Measures single user performance (“power metric”) and multi-user performance (“throughput metric”)
Large Database Benchmarks
• Results from IBM, April 2000
  • Loading 1TB database takes about 7.5 hours
  • Total disk = 9.7 times the raw size of database
  • Hardware configuration
    • 32 4-way IBM SP nodes, 4GB/node (128GB), 35x9GB disks/node
  • Total 5-year cost of system: $9.3M
  • Power: 12,812; QphH: 12,867; Price/perf: $725
• Results from HP, Feb. 13th, 2001
  • Loading database takes 5.25 hours
  • Total disk = 10.2 times raw size of database
  • Hardware configuration
    • 64 processor Superdome, 96GB memory, 3 disk arrays with 558 18.2GB drives
  • Total 5-year cost of system: $9.6M
  • Power: 13,730; QphH: 9,755; Price/perf: $985
Large Database Benchmarks
• IBM (cluster of SMP) vs HP (SMP) – based solely on analysis of published TPC-H numbers (each percentage is the difference relative to the IBM figure):
  • HP is 7.2% better in power (12,812 vs 13,730)
  • IBM is 24.2% better in throughput (12,867 vs 9,755)
  • IBM is 3.2% better in price ($9.3M vs $9.6M)
  • IBM is 36% better in price/performance ($725 vs $985)
• TPC-C Benchmark example – IBM
  • 32x4 processors, 4GB/node (128GB), 218 18GB disks/node
  • Total managed storage of ~125TB
  • 440,879 tpm-c
  • Total cost: $14.2M
• See www.tpc.org for all results
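The four percentages can be reproduced from the published figures if each difference is divided by the IBM value. That baseline is inferred from the numbers themselves; the dictionary keys below are made-up labels for this sketch.

```python
# Published TPC-H figures quoted on the slides.
ibm = {"power": 12812, "qphh": 12867, "cost_musd": 9.3, "price_perf": 725}
hp = {"power": 13730, "qphh": 9755, "cost_musd": 9.6, "price_perf": 985}


def pct_diff_vs_ibm(metric):
    """Absolute difference between the two systems, as a percentage of the IBM value."""
    return abs(hp[metric] - ibm[metric]) / ibm[metric] * 100


for metric in ("power", "qphh", "cost_musd", "price_perf"):
    print(f"{metric}: {pct_diff_vs_ibm(metric):.1f}%")
```

Dividing by the HP value instead would give different percentages, so the choice of baseline matters when quoting "X% better".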
Large Database Benchmarks
• High-end database sizes
  • “several customers with 100TB of managed disk” – IBM
  • “customer has requested 1PB (that’s petabyte) of on-line storage for bioinformatics application over next 5 years” – Sun
  • “TB’s are passé, think PB’s” – IBM Life Sciences rep
  • Legacy formats are files, but newer data will be in DBMS
• Dealing with very large data sizes
  • Interfacing to archival storage
  • Parallelism
Linking DBMS to archival storage
The DB2/HPSS Project
• Joint project with IBM TJ Watson Research Center
• DB2 provides link to Tivoli ADSM
• Oracle also supports interface to archival storage
  Create Tablespace HPSS_TSPACE
  Managed By Database
  Using FILE (HPSS <hpss-filename> <size> DISKBUF <path> <size>);
[Diagram labels: DB2; Database table; HPSS_TSPACE with C1, C2, C3, C4, C5; DB2 disk buffer; HPSS disk cache; HPSS]
Parallelism in Database Systems
• Example database
INPUT table:
  region    int       -- spatial region, county
  year      smallint  -- year (1972 - 1990)
  day       smallint  -- day of year (1-366)
  solrad    int       -- solar radiation
  tmax      float     -- max day temp. (-33, 44)
  tmin      float     -- min day temp. (-45, 29.5)
  pp        float     -- precipitation (mm)
  dd        float     -- degree days (heat)
OUTPUT table:
  region    int
  year      smallint
  day       smallint
  x_albers  int       -- x-coordinate
  y_albers  int       -- y-coordinate
  tdd10     float     -- total degree days
  add       float     -- total anthesis degree days
  tlai      float     -- total leaf area index
  seed      float     -- total seed biomass (gr/m2)
  yield     float     -- final yield (tons/ha)
  twater    float     -- total soil water evaporation + total transpiration
  ttsw      float     -- maximum water available
Generating query graphs
• Convert SQL queries to query execution plans consisting of low-level query operators
• Q1: Select all regions where max temp is greater than 40 degrees, over the entire period of the study:
  SELECT distinct(region) FROM Input WHERE tmax>40
• Q2: Select solar radiation and total leaf area index values for all days and regions in the year 1978:
  SELECT solrad, tlai FROM Input A, Output B
  WHERE A.region=B.region AND A.year=B.year AND A.day=B.day AND A.year=1978
• Q1 plan (bottom-up): Read INPUT table; Apply tmax>40; Remove duplicates, format output
• Q2 plan (bottom-up): Read INPUT table and Read OUTPUT table; Join (region, day, year); Format output
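Q1 and Q2 can be tried on a toy database. The sketch below uses Python's built-in sqlite3 module rather than DB2, keeps only the columns the two queries touch, and fills the tables with made-up rows; the explicit year filter in Q2 is added here to match the query's description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Input  (region INT, year SMALLINT, day SMALLINT,
                     solrad INT, tmax FLOAT);
CREATE TABLE Output (region INT, year SMALLINT, day SMALLINT, tlai FLOAT);
""")

# Hypothetical sample rows, not taken from the MSU data set.
conn.executemany("INSERT INTO Input VALUES (?, ?, ?, ?, ?)", [
    (1, 1978, 1, 150, 41.0),
    (1, 1978, 2, 160, 38.5),
    (2, 1978, 1, 140, 35.0),
    (2, 1979, 1, 155, 42.5),
    (1, 1979, 1, 170, 43.0),
])
conn.executemany("INSERT INTO Output VALUES (?, ?, ?, ?)", [
    (1, 1978, 1, 0.5),
    (1, 1978, 2, 0.6),
    (2, 1978, 1, 0.4),
    (2, 1979, 1, 0.7),
    (1, 1979, 1, 0.8),
])

# Q1: scan, filter on tmax, remove duplicates.
q1 = [row[0] for row in conn.execute(
    "SELECT DISTINCT region FROM Input WHERE tmax > 40 ORDER BY region")]

# Q2: join INPUT and OUTPUT on (region, year, day), restricted to 1978.
q2 = conn.execute("""
    SELECT A.solrad, B.tlai
    FROM Input A, Output B
    WHERE A.region = B.region AND A.year = B.year AND A.day = B.day
      AND A.year = 1978
    ORDER BY A.region, A.day
""").fetchall()

print(q1)
print(q2)
```

Each clause maps onto one operator of the query graph: the table reads, the `tmax` filter or the three-column join, and the final duplicate removal or output formatting.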
Levels of Query Parallelism
• Inter-query
  • Execute multiple queries (Q1 and Q2) at the same time
• Inter-operator (intra-query)
  • Concurrently execute multiple operators in the query
  • Pipeline through the operators, e.g. read and join
[Diagram: Q2 plan. Read INPUT table and Read OUTPUT table; Join (region, day, year); Format output]
Levels of Parallelism...
• Intra-operator
  • Data parallelism
  • Employ multiple processes for each operator
[Diagram labels: INPUT table, OUTPUT table; Read table, Read table; Join; Format output]
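One way to picture intra-operator (data) parallelism is a partitioned join: both tables are split on the join key and each pair of partitions is joined independently. The sketch below is only an illustration with made-up rows; it uses a thread pool for brevity, whereas a parallel DBMS would run the partitions as separate processes or on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (region, year, day, solrad) and (region, year, day, tlai).
input_rows = [(1, 1978, 1, 150), (2, 1978, 1, 140),
              (3, 1978, 1, 130), (1, 1978, 2, 160)]
output_rows = [(1, 1978, 1, 0.5), (2, 1978, 1, 0.4),
               (3, 1978, 1, 0.3), (1, 1978, 2, 0.6)]


def partition(rows, n):
    """Split rows into n partitions by region, so matching keys co-locate."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[0] % n].append(row)
    return parts


def join_partition(pair):
    """Hash join of one INPUT partition with the matching OUTPUT partition."""
    left, right = pair
    index = {(r, y, d): tlai for r, y, d, tlai in right}
    return [(r, y, d, solrad, index[(r, y, d)])
            for r, y, d, solrad in left if (r, y, d) in index]


def parallel_join(left, right, n=2):
    """Join each partition pair concurrently, then merge the partial results."""
    pairs = zip(partition(left, n), partition(right, n))
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial_results = pool.map(join_partition, pairs)
        return sorted(row for part in partial_results for row in part)


joined = parallel_join(input_rows, output_rows)
print(joined)
```

Because both tables are partitioned on the same key, no partition needs rows from another, so the work divides evenly as partitions are added.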
Parallel Architecture models and DBMS
• Shared-everything
  • memory, process space, disk subsystem are all common
• Shared disk
  • Separate memory/process space
  • Disk subsystem/filesystem is common
• Shared nothing
  • Separate memory, disks, OS…
  • Only communication “bus” is shared
Shared Everything
[Diagram labels: Disk, Processor, Memory]
• SMP: Symmetric Multi-Processors
• Provide well-balanced systems
• Shared workload, resilient to “unexpected” workload
• Dynamic allocation of processes to query operators (inter- as well as intra-query)
• Expensive and don’t scale to large configurations
Shared Disk
[Diagram labels: Disk, Processor, Memory]
• Some of the classic architectures map to this: VaxCluster, IBM mainframes (could make a comeback with SAN’s)
• Can share I/O workload, dynamic partitioning of data
• Only need to scale I/O subsystem, and not memory
Shared Nothing
[Diagram labels: Disk, Processor, Memory]
• Highly scalable
• Static partitioning of data
• Cannot share workload
• Cluster of SMP’s provides advantages of shared-nothing and SMP’s
Combining Nodegroups and Tablespaces
[Diagram: two shared-nothing (SN) system layouts. One has a single Nodegroup holding Tablespace1 (INPUT) and Tablespace2 (OUTPUT); the other has Nodegroup1 and Nodegroup2 with Tablespace1 (INPUT) and Tablespace2 (OUTPUT). Query plan: Read INPUT table and Read OUTPUT table; Join (region, day, year); Format output]
The DBMS/Application bottleneck
• Serial communication between DBMS and app.
Application
The DBMS/Application bottleneck
App App App App
• Parallel communication between DBMS and app.
App
DBMS / DM software connection
Data Mining Platform
Database Platform
Extract data subsets
Generate results
Presentation (e.g. GIS, 3D)
Store session results
Performance Tuning
• Sample set of Database Manager configuration parameters:
  • CPU speed (millisec/instruction) (CPUSPEED) = 9.700848e-07
  • Comm. bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+00
  • Max number of existing agents (MAXAGENTS) = 400
  • Initial number of agents in pool (NUM_INITAGENTS) = 0
  • Max number of coord. agents (MAX_COORDAGENTS)
  • Max no. of concurrent coord. agents (MAXCAGENTS)
  • Maximum query degree of parallelism (MAX_QUERYDEGREE) = ANY
  • Enable intra-partition parallelism (INTRA_PARALLEL) = NO
Database Tuning
• Sample set of Database configuration parameters:
  • Default query optimization class (DFT_QUERYOPT) = 9
  • Degree of parallelism (DFT_DEGREE) = 1
  • Database heap (4KB) (DBHEAP) = 1200
  • Catalog cache size (4KB) (CATALOGCACHE_SZ) = 64
  • Log buffer size (4KB) (LOGBUFSZ) = 8
  • Utilities heap size (4KB) (UTIL_HEAP_SZ) = 5000
  • Buffer pool size (pages) (BUFFPAGE) = 128000
  • Max storage for lock list (4KB) (LOCKLIST) = 100
  • Number of asynch page cleaners (NUM_IOCLEANERS) = 1
  • Number of I/O servers (NUM_IOSERVERS) = 3
  • Sequential detect flag (SEQDETECT) = YES
  • Default prefetch size (pages) (DFT_PREFETCH_SZ) = 32
Examples of data exploration
• Testing temporal relationship (sensitivity analysis)
  • Can conditions from day N-1 be used to predict output of day N?
  • How far back can we go?
• Input table:
  Region Year Day Inputs Outputs
  1 78 1 I1 O1
  1 78 2 I2 O2
  1 78 3 I3 O3
  1 78 4 I4 O4
  2 78 1 I1 O1
  2 78 2 I2 O2
• Generate output:
  • Region, Year, Day, Input(i), Output(i), Output(i-1)
“Flattening” the table
• E.g. SQL query:
  SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
  FROM msu.combined A, msu.combined B
  WHERE A.region=B.region AND A.year=B.year
  AND A.day=B.day-1
• Query Explain facility
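The self-join can be tried on a few rows with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the sample rows are invented, the table is called `combined` because SQLite has no `msu` schema, and the index name `c_ryd` simply echoes the `MSU.C_RYD` index that appears in the explain output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE combined
                (region INT, year INT, day INT, solrad INT, tlai FLOAT)""")

# Hypothetical sample rows: two regions, consecutive days of year 78.
conn.executemany("INSERT INTO combined VALUES (?, ?, ?, ?, ?)", [
    (1, 78, 1, 100, 0.1),
    (1, 78, 2, 110, 0.2),
    (1, 78, 3, 120, 0.3),
    (2, 78, 1, 90, 0.4),
    (2, 78, 2, 95, 0.5),
])

# Composite index on the join columns, as in the indexed plan.
conn.execute("CREATE INDEX c_ryd ON combined (region, year, day)")

# Pair every row (A) with the row for the following day (B).
flat = conn.execute("""
    SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
    FROM combined A, combined B
    WHERE A.region = B.region AND A.year = B.year AND A.day = B.day - 1
    ORDER BY A.region, A.year, A.day
""").fetchall()

for row in flat:
    print(row)
```

The last day of each region has no successor, so it drops out of the result; a larger lag (`B.day - 2`, and so on) answers the "how far back can we go?" question in the same way.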
“Flattening” the table
Access Table Name = MSU.COMBINED
|  #Columns = 5
|  Relation Scan
|  |  Prefetch: Eligible
|  Insert Into Sorted Temp Table ID = t1
|  |  #Columns = 4
|  |  #Sort Key Columns = 1
|  |  |  Key 1: YEAR (Ascending)
Access Temp Table ID = t1
|  Relation Scan
|  |  Prefetch: Eligible
Merge Join
|  Access Table Name = MSU.COMBINED
|  |  #Columns = 4
|  |  Relation Scan
|  |  |  Prefetch: Eligible
|  |  Insert Into Sorted Temp Table ID = t2
|  |  |  #Columns = 4
|  |  |  #Sort Key Columns = 1
|  |  |  |  Key 1: YEAR (Ascending)
|  Access Temp Table ID = t2
|  |  Relation Scan
|  |  |  Prefetch: Eligible
|  Residual Predicate(s)
|  |  #Predicates = 2
Return Data to Application
|  #Columns = 7
“Flattening” the table, with indexing
Access Table Name = MSU.COMBINED
|  #Columns = 5
|  Relation Scan
|  |  Prefetch: Eligible
|  Insert Into Sorted Temp Table ID = t1
|  |  #Columns = 4
|  |  #Sort Key Columns = 1
|  |  |  Key 1: REGION (Ascending)
Access Temp Table ID = t1
|  Relation Scan
|  |  Prefetch: Eligible
Nested Loop Join
|  Access Table Name = MSU.COMBINED
|  |  #Columns = 4
|  |  Index Scan: Name = MSU.C_RYD
|  |  |  Index Columns:
|  |  |  |  1: REGION (Ascending)
|  |  |  |  2: YEAR (Ascending)
|  |  |  |  3: DAY (Ascending)
|  |  |  Data Prefetch: Eligible 157
|  |  |  Index Prefetch: Eligible 157
|  |  Return Data to Application
|  |  |  #Columns = 7
Declustering the table
• Partition the table by Region and/or Year
  • Linearly scalable join operation
• Testing spatial relationships/sensitivity
  • Compare region R with a specified neighborhood of R
  • Compare region R with other “similar” regions – spatial clustering
  • Decluster table by year/day
San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
Built-in support for OLAP
• Example table
  • INPUT (Region, Week, Day_of_week, Solrad)
  • 2 regions, 1978, 250 days/year (500 rows)
• SQL support for rollup
  • SELECT region, week, day_of_week, sum(solrad)
    FROM Input
    GROUP BY ROLLUP (region, week, day_of_week)
    ORDER BY region, week, day_of_week
• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –)
  • (region, –, –), (–, –, –)
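To make the ROLLUP grouping levels concrete, here is a minimal Python sketch that emulates them in memory. It is an illustration only, not how the DBMS evaluates the query. The helper name `rollup` is hypothetical; the first three rows reuse Solrad values from the tutorial's output as if they were raw rows, and region 99999 is made up.

```python
from collections import defaultdict

def rollup(rows, dims, measure):
    """Emulate GROUP BY ROLLUP(dims): sum `measure` for every prefix of `dims`.

    A rolled-up column is represented by None (shown as "-" on the slides).
    """
    totals = defaultdict(float)
    for row in rows:
        # depth = number of leading dimensions kept in the grouping key
        for depth in range(len(dims), -1, -1):
            kept = tuple(row[d] for d in dims[:depth])
            key = kept + (None,) * (len(dims) - depth)
            totals[key] += row[measure]
    return dict(totals)

# Illustrative rows; region 99999 is a hypothetical second region.
rows = [
    {"region": 17003, "week": 1, "day_of_week": 1, "solrad": 1661.0},
    {"region": 17003, "week": 1, "day_of_week": 2, "solrad": 2654.0},
    {"region": 17003, "week": 1, "day_of_week": 3, "solrad": 2709.0},
    {"region": 99999, "week": 1, "day_of_week": 1, "solrad": 1000.0},
]

totals = rollup(rows, ["region", "week", "day_of_week"], "solrad")
for key, value in totals.items():
    print(key, value)
```

Each input row contributes to exactly four keys, one per grouping level listed above.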
Region  Week  Day_of_Week  SUM(Solrad)
17003   1     1               1661.0
17003   1     2               2654.0
17003   1     3               2709.0
17003   1     4               2101.0
17003   1     5               1197.0
17003   1     6               1605.0
17003   1     7               1133.0
17003   1     -              13060.0
...
17003   36    1               6030.0
17003   36    2               6222.0
17003   36    3               6351.0
17003   36    4               6387.0
17003   36    5               6160.0
17003   36    -              31150.0
17003   -     -            1206273.0
-       -     -            2398149.0
The “cube” operator
• SQL query
  • SELECT region, week, day_of_week, sum(solrad)
    FROM Input
    GROUP BY CUBE (region, week, day_of_week)
    ORDER BY region, week, day_of_week
• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –), (region, –, –), (–, –, –)
  • (region, –, day_of_week)
  • (–, week, day_of_week)
  • (–, week, –)
  • (–, –, day_of_week)
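CUBE differs from ROLLUP in that it aggregates over every subset of the grouping columns, giving 2^3 = 8 grouping patterns for three columns. The sketch below emulates this in Python under the same caveats as any in-memory illustration: the helper name `cube` is hypothetical, the first two rows reuse Solrad values from the tutorial's output as if they were raw rows, and region 99999 is made up.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE(dims): sum `measure` for every subset of `dims`.

    A column that is aggregated away is represented by None.
    """
    # All 2**len(dims) subsets of the grouping columns.
    subsets = [set(c) for k in range(len(dims) + 1)
               for c in combinations(dims, k)]
    totals = defaultdict(float)
    for row in rows:
        for keep in subsets:
            key = tuple(row[d] if d in keep else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)

# Illustrative rows; region 99999 is a hypothetical second region.
cube_rows = [
    {"region": 17003, "week": 1, "day_of_week": 1, "solrad": 1661.0},
    {"region": 17003, "week": 1, "day_of_week": 2, "solrad": 2654.0},
    {"region": 99999, "week": 1, "day_of_week": 1, "solrad": 1000.0},
]

cube_totals = cube(cube_rows, ["region", "week", "day_of_week"], "solrad")

# Which columns are aggregated away in each output row.
patterns = {tuple(v is None for v in key) for key in cube_totals}
print(len(patterns), "grouping patterns")
```

Groupings such as (–, week, day_of_week) appear here but not under ROLLUP, because ROLLUP only drops columns from the right.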
Distributed data mining
• “Function shipping” vs. “data shipping”
  • Generalization of the “operator pushdown” notion
  • “DataCutter” operations in SRB
  • Source/wrapper-side processing in MIX
• Need to understand which operations can be distributed and how
• Web-based infrastructure for OLAP and DM
  • XML for Analysis
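The difference between the two shipping strategies can be sketched with a toy simulation. Nothing here uses SRB, DataCutter, or MIX; the "sites" are plain Python lists, the function names are hypothetical, and the row values are illustrative. The sketch assumes the operation is a sum, which is distributive and therefore safe to evaluate per site.

```python
# Each hypothetical site holds (day, solrad) rows for one partition of the table.
partitions = {
    "site_a": [(1, 1661.0), (2, 2654.0), (3, 2709.0)],
    "site_b": [(1, 1000.0), (2, 2000.0)],
}

def data_shipping(parts):
    """Move every row to the client, then aggregate there."""
    shipped = [row for rows in parts.values() for row in rows]
    total = sum(solrad for _, solrad in shipped)
    return total, len(shipped)       # items that crossed the "network"

def function_shipping(parts):
    """Run the aggregate at each site and move only the partial results."""
    partials = [sum(solrad for _, solrad in rows) for rows in parts.values()]
    return sum(partials), len(partials)

total_ds, moved_ds = data_shipping(partitions)
total_fs, moved_fs = function_shipping(partitions)
print(total_ds, moved_ds)
print(total_fs, moved_fs)
```

Both strategies return the same total, but function shipping moves one value per site instead of one per row. An operation such as a median cannot be combined from per-site results this simply, which is one reason to ask which operations can be distributed and how.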
“Remote” operations in SRB
[Figure labels: Application (SRB client); SRB Middleware; SRB Servers; MCAT; DataCutter, other “remote” operations]
Wrapper-side processing in MIX
[Figure labels: Application; MIXm Mediator; three Wrappers, each in front of a source (Data Source, XML Data Source, Data Source)]
The role of XML
• Representing, exchanging metadata
  • image headers, instrumentation information, descriptive metadata...
• Expressing service descriptions
  • Web-based services
• Exchanging data among services
  • “Raw” data: sequence information, GIS information...
  • Results of analysis: rowsets, multidimensional cubes, ...
XML for Analysis
[Figure: on the client side, the UI and client functions issue Discover and Execute calls over SOAP/HTTP through the client web service. On the server side, the provider web service passes the Discover and Execute calls to the XML for Analysis provider implementation, which reads from the data source and returns data.]
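As a rough illustration of an Execute call carried in a SOAP body, the sketch below builds a request envelope with Python's standard library. The element names and namespace follow the XML for Analysis specification as commonly published; the statement text and data source string are placeholders, the function name is hypothetical, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"

def build_execute_request(statement, data_source):
    """Return a SOAP envelope (as text) wrapping an XML for Analysis Execute call."""
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("xmla", XMLA_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")
    # The command holds the provider-specific query statement.
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = statement
    # Properties tell the provider which source to use and how to shape results.
    props = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    plist = ET.SubElement(props, f"{{{XMLA_NS}}}PropertyList")
    ET.SubElement(plist, f"{{{XMLA_NS}}}DataSourceInfo").text = data_source
    ET.SubElement(plist, f"{{{XMLA_NS}}}Format").text = "Tabular"
    return ET.tostring(envelope, encoding="unicode")

# Placeholder statement and data source, for illustration only.
request_xml = build_execute_request("<provider-specific statement>",
                                    "<data source info>")
print(request_xml)
```

A client would send this text in an HTTP POST to the provider web service. A Discover call has the same envelope but asks for metadata instead of running a statement.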
Examples - Overview
• Intelligent Miner: Data Analysis and Mining
  • Interface, database connectivity, data creation
  • Statistical routines
  • Classification
    • Decision Tree
    • Neural Network
  • Clustering
• Netica: Probabilistic Modeling and Decision Support
  • Belief networks, probabilistic queries
  • Statistical decision theory, decision models, influence diagrams
Probabilistic Modeling and Bayesian Belief Networks
[Figure: belief network with nodes Productivity, Precipitation, Solar Radiation, Yield]
Statistical Decision Theory*
• Normative model of rational decision making
  • Decision: Irrevocable allocation of resources
  • Beliefs: Probability theory
  • Preferences: Utility theory
  • Expected Utility = sum over outcomes of Probability * Utility
  • Value of Information = EU(A | I) - EU(A)
• Principle of Rationality: Maximize Expected Utility
• (*Rational agents are your friends.)
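The expected-utility and value-of-information formulas can be worked through on a toy decision. All names and numbers below are hypothetical and chosen only to make the arithmetic easy to follow. The information I is assumed to be perfect, so EU(A | I) is the expected utility when the best action is chosen after the state is revealed.

```python
# Hypothetical beliefs over the state of the world and utilities per (action, state).
belief = {"wet": 0.5, "dry": 0.5}
utility = {
    "plant":  {"wet": 100.0, "dry": -40.0},
    "fallow": {"wet": 0.0,   "dry": 0.0},
}

def expected_utility(action, belief, utility):
    """EU(action) = sum over states of P(state) * U(action, state)."""
    return sum(p * utility[action][state] for state, p in belief.items())

def best_action(belief, utility):
    """Principle of rationality: pick the action with maximum expected utility."""
    return max(utility, key=lambda a: expected_utility(a, belief, utility))

def value_of_perfect_information(belief, utility):
    """EU(A | I) - EU(A), where I reveals the true state before acting."""
    eu_without = expected_utility(best_action(belief, utility), belief, utility)
    eu_with = sum(p * max(utility[a][state] for a in utility)
                  for state, p in belief.items())
    return eu_with - eu_without

choice = best_action(belief, utility)
eu_choice = expected_utility(choice, belief, utility)
voi = value_of_perfect_information(belief, utility)
print(choice, eu_choice, voi)
```

Without information the best action is "plant", with expected utility 0.5 * 100 + 0.5 * (-40) = 30. Knowing the state in advance raises the expected utility to 0.5 * 100 + 0.5 * 0 = 50, so the information is worth 20.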
BK2 Induction Algorithm
• Data Mining via Probabilistic Model Induction
• Discover Network Structure and Parameters
• Greedy Algorithm: ML gradient search
• Encode Background Knowledge: Preferences
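The slides do not spell out BK2 itself, so the following is a sketch of a related, well-known technique rather than the authors' algorithm: the K2-style greedy parent search of Cooper and Herskovits. It learns structure only (not parameters), takes a node ordering as the background knowledge, and scores candidates with the log K2 metric instead of a maximum-likelihood gradient. The variables A, B, C and the data are hypothetical.

```python
from collections import Counter
from math import lgamma

def k2_score(data, child, parents, arity=2):
    """Log Cooper-Herskovits (K2) score of `child` given a list of `parents`."""
    config = Counter(tuple(r[p] for p in parents) for r in data)
    family = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    # One term per observed parent configuration, one per (configuration, value).
    score = sum(lgamma(arity) - lgamma(n + arity) for n in config.values())
    score += sum(lgamma(n + 1) for n in family.values())
    return score

def greedy_parents(data, order, max_parents=2):
    """For each node, greedily add the predecessor that most improves the score."""
    structure = {}
    for i, child in enumerate(order):
        parents, best = [], k2_score(data, child, [])
        candidates = list(order[:i])      # only earlier nodes may be parents
        while candidates and len(parents) < max_parents:
            top_score, top = max(
                (k2_score(data, child, parents + [c]), c) for c in candidates)
            if top_score <= best:         # no candidate improves the score
                break
            parents.append(top)
            candidates.remove(top)
            best = top_score
        structure[child] = parents
    return structure

# Hypothetical binary data: B copies A, C is independent of both.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 10
structure = greedy_parents(data, ["A", "B", "C"])
print(structure)
```

On this data the search links B to A and leaves C without parents, because adding a parent to C splits the counts without making C any more predictable.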
Model Day 1
Model Day 120
Other Mining Applications
• Spatial Data Mining
• Time Series
• Sequence Mining
• Text Data Mining
• Multimedia Database Mining
• Web Mining
• Network Traffic Analysis
Acknowledgements
• Students
  • Peter Shin, Ankur Jain
• Science Collaborator
  • Stuart Gage, MSU: shared his data set and many insights about the data
• SDSC
  • Mike Vildibill, Deputy Dir: providing hardware resources for SKIDL
  • Josh Polterock / Dave Archbell: help with software installation, maintenance
• Funding support
  • NPACI ESS: support for Tony Fountain, Ankur Jain
  • NPACI DICE: support for Chaitan Baru
  • NSF REU: support for Peter Shin