Data Mining Techniques in Support of Science Data Stewardship
Eric A. Kihn (NOAA/NGDC), M. Zhizhin (RAS/CGDS)
Presentation outline
I. Background for the talk
II. What is science data stewardship?
III. What is data mining?
IV. Techniques for SDS
V. Conclusions
Motivation for this presentation?
• Present an innovative technology application to a new community
• Show different methods for accessing and characterizing data
• Present some interesting results in the area of employing intelligent systems to support environmental data archives
What is being presented
• A set of tools and techniques developed or utilized at the National Geophysical Data Center
• A system meant to mimic the expertise of a subject matter expert (SME)
• Some key concepts such as fuzzy logic, data mining, knowledge tools.
Nature June 10, 1999
• “It’s sink or swim as a tidal wave of data approaches. … Are scientists ready for the flood?”
• “Most researchers are accustomed to studying a relatively small data set for a long time, using statistical models to tease out patterns. At some fundamental level that paradigm has broken down.”
• NASA’s EOS exceeds 1 Tb/Day
• CERN exceeds 20 Tb/Day
• The internet as a distributed data source provides 100’s of petabytes.
Ph.D’s and Networked Data
• The number of eyes looking at data remains constant.
• The amount of data tends to follow Moore’s law.
• In order to turn data into knowledge new techniques are required.
Nature June 10, 1999
NOAA NATIONAL DATA CENTERS
NGDC Holdings - % Mbytes by Data Type
Data archived as of September 2002
[Pie chart 1]
• DMSP SATELLITE DATA 97%
• ALL OTHER 3%
[Pie chart 2]
• SATELLITE - GOES, NOAA TIROS 24%
• SIDE SCAN SONAR 16%
• IONOSPHERIC 12%
• GEOMAGNETISM 10%
• MARINE TRACKLINE + OTHER MARINE 10%
• ECOSYSTEMS 9%
• BATHYMETRY, TOPOGRAPHY, & RELIEF 8%
• SOLAR 7%
• HAZARDS 2%
• MARINE GEOLOGY 1%
• LAND GRAVITY < 1%
• LAND GEOCHEMISTRY < 1%
• COSMIC RAY < 1%
• AURORA < 1%
• LAND GEOTHERMAL < 1%
• SOLAR-TERRESTRIAL PUBLICATIONS < 1%
What is science data stewardship?
Why the emphasis on data mining now?
Answer: Layers of data archives
[Figure: layered archive diagram. Layers: Data Collection, Data Archive, Data Warehouse, Data Mart]
• Standard Metadata
• Access methods (e.g. XML)
• Enterprise organization of data
• Data quality control
• Local holdings
Levels of Information Analysis
[Figure: analysis levels labeled "OLAP & Reporting" and "Advanced Analysis", along an axis labeled "Better Information"]
• Simulation/Optimization
• Forecasting
• Segmentation
• Model Building
• Hypothesis Testing
• Statistics
• Conditional Climatology
• Visualization
• Climatology
• Percentages
• Counts & Sums
• Queries
[Figure: science data stewardship diagram. Labels: Collection and Storage; Processing, Calibration; Productization; Research Quality Data; Raw Data; Quality Control; Techniques; Analysis; Knowledge; Scientists; Mission Scientists; Users; Skilled Users; User Requirements; Society]
What is data mining?
Definition of Data Mining
Data mining (also known as Knowledge Discovery in Databases - KDD) has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1]. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans.
Application to Environmental Data
• Data quality control
• Human linguistic translation
• Event and trend detection
• Data classification
• Forecast
• Deviation detection
Categories of Knowledge Tools
• Reporting and OLAP
• Theory driven modeling:
  – Correlations
  – t-tests
  – ANOVA
  – Linear Regression
  – Logistic Regression
  – Discriminant Analysis
  – Forecasting Methods
• Data driven modeling:
  – Cluster Analysis
  – Factor Analysis
  – Decision Trees
  – Fuzzy Classifier
  – Neural Networks
  – Association rules
  – Rule induction
2-D Fuzzy C-Means Clustering
Why Fuzzy Logic?
Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth -- truth values between "completely true" and "completely false". It was introduced by Dr. Lotfi Zadeh of UC Berkeley in the 1960s as a means to model the uncertainty of natural language.
Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 and 1. Fuzzy logic lets computers function closer to the way our brains work. We aggregate data and form a number of partial truths which we aggregate further into higher truths which in turn, when certain thresholds are exceeded, cause certain further results such as motor reaction.
Fuzzy-Logic
• Jim is 5’2” (157 cm) tall. Is Jim tall?
  – Boolean Logic - “NO” (0)
  – Fuzzy-Logic - “Jim is .082 tall” (.082)
• Major Advantages:
  – Allows more realistic (natural) definition of sets
  – More graceful handling of boundaries/intersections
  – Provides more human-like searching
• Fuzzy-Logic does NOT impact the data. It is simply a classification technique for selecting the most relevant data, given a set of complex conditions.
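The contrast between the Boolean and the fuzzy answer can be sketched in a few lines. This is only an illustration: the linear ramp between 150 cm and 190 cm and the 180 cm crisp threshold are assumptions made for the example, not the membership function behind the 0.082 value quoted above.

```python
def tall_membership(height_cm, low=150.0, high=190.0):
    """Degree (0..1) to which a height counts as "tall".

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    linear in between. The breakpoints are hypothetical.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


def tall_boolean(height_cm, threshold=180.0):
    """Crisp answer for comparison: tall (1) or not tall (0)."""
    return 1 if height_cm >= threshold else 0


jim_cm = 157.0
print("Boolean:", tall_boolean(jim_cm))
print("Fuzzy:  ", tall_membership(jim_cm))
```

A different membership shape (for example a sigmoid) would give a different grade for the same height; the data value itself is never changed.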
Definition of a fuzzy set
A fuzzy set A in X is a set of ordered pairs
$A = \{(x, \mu_A(x)) \mid x \in X\}$
defined by the membership function $0 \le \mu_A(x) \le 1$.
A classical set A in X is a set of ordered pairs
$A = \{(x, I_A(x)) \mid x \in X\}$
defined by the indicator function $I_A(x) \in \{0, 1\}$.
Fuzzy logic
First operand: fuzzy set A
Second operand: fuzzy set B
Fuzzy NOT: $\mu_{\bar{A}} = 1 - \mu_A$
Fuzzy AND: $\mu_{A \cap B} = \min(\mu_A, \mu_B)$
Fuzzy OR: $\mu_{A \cup B} = \max(\mu_A, \mu_B)$
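The three fuzzy operators (complement, minimum, maximum) are small enough to write out directly. The sketch below applies them to a few sample membership grades; the sample values are arbitrary and chosen only to show that crisp inputs (0 or 1) reproduce ordinary Boolean logic.

```python
def fuzzy_not(mu_a):
    """Complement: 1 minus the membership grade."""
    return 1.0 - mu_a


def fuzzy_and(mu_a, mu_b):
    """Intersection: the smaller of the two grades."""
    return min(mu_a, mu_b)


def fuzzy_or(mu_a, mu_b):
    """Union: the larger of the two grades."""
    return max(mu_a, mu_b)


# Arbitrary sample grades; the first and last pairs are crisp (Boolean) cases.
for mu_a, mu_b in [(0.0, 1.0), (0.3, 0.8), (1.0, 1.0)]:
    print(mu_a, mu_b,
          "NOT A =", fuzzy_not(mu_a),
          "A AND B =", fuzzy_and(mu_a, mu_b),
          "A OR B =", fuzzy_or(mu_a, mu_b))
```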
[Figure: January Wind Speed Record. Axes: Date (1/1/97 to 1/31/97); Wind Speed (kts), 0 to 20]
[Figure: January Temperature Record. Axes: Date (1/1/97 to 1/31/97); Temperature (deg C), 0 to 30]
[Figure: January Relative Humidity Record. Axes: Date (1/1/97 to 1/31/97); Rel. Humidity (%), 0 to 100]
“High” Wind
“Average” Temperature
“About” 60% Humidity
List of Events
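One way such a search could produce a ranked list of events is to grade each record against the three fuzzy terms and combine the grades with fuzzy AND (minimum). The sketch below is illustrative only: the records are made-up values, and the membership shapes and breakpoints (`ramp_up`, `triangle`, 8 to 15 kts, a 10 degree and a 15% half-width) are assumptions, not the definitions used in the NGDC tools.

```python
def ramp_up(x, low, high):
    """0 at or below `low`, 1 at or above `high`, linear in between."""
    return min(1.0, max(0.0, (x - low) / (high - low)))


def triangle(x, center, half_width):
    """1 at `center`, falling linearly to 0 at center +/- half_width."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Hypothetical records: (date, wind speed in kts, temperature in deg C, humidity in %)
records = [
    ("1/05/97", 4.0, 12.0, 85.0),
    ("1/12/97", 14.0, 16.0, 62.0),
    ("1/19/97", 18.0, 27.0, 30.0),
    ("1/26/97", 11.0, 14.0, 55.0),
]

# "Average" is taken relative to the mean of the record itself (an assumption).
mean_temp = sum(r[2] for r in records) / len(records)


def event_score(wind, temp, humidity):
    """Fuzzy AND of the three conditions."""
    high_wind = ramp_up(wind, 8.0, 15.0)
    average_temp = triangle(temp, mean_temp, 10.0)
    about_60 = triangle(humidity, 60.0, 15.0)
    return min(high_wind, average_temp, about_60)


# Rank the records from best to worst match.
events = sorted(((event_score(w, t, h), date) for date, w, t, h in records),
                reverse=True)
for score, date in events:
    print(f"{date}: {score:.2f}")
```

Records that fail any one condition completely score 0, while near misses still appear in the list with a low grade, which is what makes the search feel more human-like than a hard threshold query.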
What is fuzzy clustering?
• In non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly one cluster.
• In fuzzy clustering, the data points can belong to more than one cluster, and associated with each of the points are membership grades which indicate the degree to which the data points belong to the different clusters.
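Membership grades are easiest to see with the fuzzy c-means algorithm named elsewhere in this talk. The sketch below uses the standard alternating updates (weighted centers, then inverse-distance memberships); the fuzzifier `m=2`, the iteration count, the random seed and the toy points are all assumptions chosen for illustration.

```python
import numpy as np


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (centers, u) where u[i, k] is the grade of point i in cluster k."""
    rng = np.random.default_rng(seed)
    # Random initial membership grades; each row sums to 1.
    u = rng.random((len(points), n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers: means of the points weighted by u**m.
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        # Distances from every point to every center (guard against zero).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Memberships: closer centers get higher grades; rows sum to 1.
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u


# Two small groups plus one point halfway between them (made-up data).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
                   [0.5, 0.5]])
centers, u = fuzzy_c_means(points, n_clusters=2)
print(np.round(u, 2))
```

Taking the largest grade in each row recovers a hard clustering, while the halfway point keeps a split membership instead of being forced into one group.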
Types of Fuzzy Cluster Algorithms
Classical Fuzzy Algorithms (cumulus-like clusters)
• The fuzzy c-means algorithm
• The Gustafson-Kessel algorithm
• The Gath-Geva algorithm
• Mountain and Subtractive
Linear and Ellipsoidal (lines)
• The fuzzy c-varieties algorithm
• The adaptive clustering algorithm
Shell (circles, ellipses, parabolas)
• Fuzzy c-shells algorithm
• Fuzzy c-spherical algorithm
• Adaptive fuzzy c-shells algorithm
Mountain fuzzy clustering algorithm
• Form a grid on the data space; grid intersections are candidates for cluster centers
• Construct a mountain function representing data density (each data point contributes to the height):
$$m(v) = \sum_{i=1}^{N} \exp\left(-\frac{\|v - x_i\|^2}{2\sigma^2}\right)$$
• Sequentially destruct the mountain function:
  – Make a dent where the highest values are (at the first cluster center c1)
  – The subtracted amount is inversely proportional to the distance between v and c1 and proportional to the height m(c1)
$$m_{new}(v) = m(v) - m(c_1)\exp\left(-\frac{\|v - c_1\|^2}{2\beta^2}\right)$$
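The mountain procedure can be sketched as follows: build the density on a grid, pick the highest grid node as a center, subtract a Gaussian dent scaled by that height, and repeat. This is a minimal sketch under stated assumptions: the data are assumed to lie in the unit square, and the grid size, `sigma`, `beta` and the sample points are hypothetical values chosen for the example.

```python
import numpy as np


def mountain_clustering(points, n_centers, grid_size=21, sigma=0.1, beta=0.1):
    """Return `n_centers` grid nodes chosen by mountain clustering."""
    # Candidate centers: nodes of a regular grid over the unit square.
    axis = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(axis, axis)
    grid = np.column_stack([gx.ravel(), gy.ravel()])

    # Mountain function: every data point adds a Gaussian bump at each node.
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    m = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)

    centers = []
    for _ in range(n_centers):
        k = int(np.argmax(m))          # highest remaining peak
        c, height = grid[k], m[k]
        centers.append(c)
        # Destruct the mountain around the chosen center.
        dent = np.exp(-((grid - c) ** 2).sum(axis=1) / (2.0 * beta ** 2))
        m = m - height * dent
    return np.array(centers)


# Two made-up groups of points, one near (0.2, 0.2) and one near (0.8, 0.7).
points = np.array([[0.20, 0.20], [0.22, 0.20], [0.18, 0.20],
                   [0.20, 0.22], [0.20, 0.18],
                   [0.80, 0.70], [0.82, 0.70], [0.78, 0.70], [0.80, 0.72]])
centers = mountain_clustering(points, n_centers=2)
print(centers)
```

The cost grows with the number of grid nodes times the number of data points, and the grid itself grows exponentially with the number of dimensions, which is why the method is simple but expensive.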
2D density mountains
[Figure, panels (a) to (d): Mountain function with (b) σ=0.05, (c) σ=0.1, (d) σ=0.2]
2D mountain clustering
[Figure, panels (a) to (d): Mountain destruction with β=1; (b) first cluster, (c) second, (d) third]
Mountain fuzzy clustering
• No need to set number of clusters a priori
• Simple, but computationally expensive
• May be used to generate fuzzy rules relating the variables (knowledge discovery)
• May be generalized to subtractive clustering
Yager, R. and D. Filev, "Generation of Fuzzy Rules by Mountain Clustering," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, pp. 209-219, 1994.
Subtractive clustering
The method assumes each data point is a potential cluster center and calculates a measure of the likelihood that each data point would define the cluster center, based on the density of surrounding data points:
• Selects the data point with the highest potential to be the first cluster center
• Removes all data points in the vicinity of the first cluster center, in order to determine the next data cluster and its center location
• Iterates on this process until all of the data is within radii of a cluster center
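The steps above can be sketched in a few lines of NumPy. The potential and reduction formulas follow the Gaussian form commonly given for this method, with the neighborhood radius setting the width; instead of literally removing points, the sketch reduces the potential of points near each chosen center. The parameter values (`radius`, `squash`, `stop_ratio`), the simplified stopping rule and the sample points are assumptions made for illustration.

```python
import numpy as np


def subtractive_clustering(points, radius=0.3, squash=1.5, stop_ratio=0.15):
    """Return cluster centers picked from the data points themselves."""
    alpha = 4.0 / radius ** 2                # width used to measure density
    beta = 4.0 / (squash * radius) ** 2      # wider width used for reduction

    # Potential of each point: density of the data around it.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    potential = np.exp(-alpha * d2).sum(axis=1)
    first_peak = potential.max()

    centers = []
    while True:
        k = int(np.argmax(potential))
        peak = potential[k]
        # Simplified stopping rule (assumption): stop once the best
        # remaining potential is small compared with the first one.
        if peak < stop_ratio * first_peak:
            break
        centers.append(points[k])
        # Reduce the potential of points near the new center.
        potential = potential - peak * np.exp(-beta * d2[k])
    return np.array(centers)


# Two made-up groups of points, one near (0.2, 0.2) and one near (0.8, 0.7).
points = np.array([[0.20, 0.20], [0.22, 0.20], [0.18, 0.20],
                   [0.20, 0.22], [0.20, 0.18],
                   [0.80, 0.70], [0.82, 0.70], [0.78, 0.70], [0.80, 0.72]])
centers = subtractive_clustering(points)
print(centers)
```

Because candidates are the observations rather than grid nodes, the work scales with the square of the number of points and does not depend on a grid resolution.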
Subtractive clustering advantages
• No grids in the parameter space: computationally efficient
• Fuzzy clusters centered at the observation points: real modes selection
• May be used to generate fuzzy rules relating the variables (knowledge discovery)
Chiu, S., "Fuzzy Model Identification Based on Cluster Estimation," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, Sept. 1994
Techniques for SDS
Data Quality Control
The Space Weather Reanalysis (SWR), a long-term reanalysis, requires careful quality control of a huge volume of data. A single instance of bad data can have ripple effects throughout the entire model run. Working with satellite and station data in particular can be tricky, with spikes, baseline shifts, and dropouts all prominent in the data stream. In a typical small-scale study a researcher could hand-screen the data, but here the volume requires “intelligent” computer techniques based on fuzzy logic, neural computing and other mathematical functions.
Sample station data used in the SWR effort.
Some Preliminary Results
Scenario: Boulder for Mid October
Parameters Studied: Temperature (surface), Relative Humidity
Impacts: Scenario represents likely impacts on an IR sensor instrument.
Data Source: NCEP Reanalysis (20 years)
Technique: Subtractive Clustering
Visual Standard Output
[Figure: Boulder 00 UT Rh vs. T. Axes: Temp, K (265 to 300); Rel Humid, % (0 to 100)]
[Figure: Boulder Rh vs. T 20 Year Composite. Axes: Temp, K; RelHumid, %; Day time, h]
[Figure: Boulder Derived Modes (subtractive clustering). Axes: Temp, K; RelHumid, %; Day time, h]
Conclusions
• Increasing data volumes demand new tools and methods
• Mathematical methods exist which provide analysis, classification and forecast methods for large data volumes
• Fuzzy based systems hold great promise as knowledge extraction tools.
Resources
Books
Fuzzy Cluster Analysis, Hoppner et al.
Neuro-Fuzzy and Soft Computing, Jang et al.
Fuzzy Logic, Yen
System Identification, Ljung
Web
The Environmental Scenario Generator http://esg.ngdc.noaa.gov
Data Mining and Knowledge Discovery http://www.digimine.com/usama/datamine/
An excellent introductory article is: Bezdek, James C, "Fuzzy Models --- What Are They, and Why?", IEEE Transactions on Fuzzy Systems, 1:1, pp. 1-6, 1993.