Data Mining Techniques in Support of Science Data Stewardship


Eric A. Kihn, M. Zhizhin
NOAA/NGDC, RAS/CGDS

Presentation outline

I. Background for the talk
II. What is science data stewardship?
III. What is data mining?
IV. Techniques for SDS
V. Conclusions

Motivation for this presentation?

• Present an innovative technology application to a new community

• Show different methods for accessing and characterizing data

• Present some interesting results in the area of employing intelligent systems to support environmental data archives

What is being presented

• A set of tools and techniques developed or utilized at the National Geophysical Data Center

• A system meant to mimic the expertise of a subject matter expert (SME)

• Some key concepts such as fuzzy logic, data mining, and knowledge tools

Nature June 10, 1999

• “It’s sink or swim as a tidal wave of data approaches. … Are scientists ready for the flood?”
• “Most researchers are accustomed to studying a relatively small data set for a long time, using statistical models to tease out patterns. At some fundamental level that paradigm has broken down.”
• NASA’s EOS exceeds 1 Tb/Day
• CERN exceeds 20 Tb/Day
• The internet as a distributed data source provides 100’s of petabytes.

Ph.D’s and Networked Data

• The number of eyes looking at data remains constant.

• The amount of data tends to follow Moore’s law.

• In order to turn data into knowledge new techniques are required.

Nature June 10, 1999

NOAA NATIONAL DATA CENTERS

NGDC Holdings - % Mbytes by Data Type
Data archived as of September 2002

First chart:
• DMSP satellite data: 97%
• All other: 3%

Second chart, by data type:
• Satellite - GOES, NOAA TIROS: 24%
• Side scan sonar: 16%
• Ionospheric: 12%
• Geomagnetism: 10%
• Marine trackline + other marine: 10%
• Ecosystems: 9%
• Bathymetry, topography, & relief: 8%
• Solar: 7%
• Hazards: 2%
• Marine geology: 1%
• Land gravity: < 1%
• Land geochemistry: < 1%
• Cosmic ray: < 1%
• Aurora: < 1%
• Land geothermal: < 1%
• Solar-terrestrial publications: < 1%

What is science data stewardship?

Why the emphasis on data mining now?

Answer: Layers of data archives

Data Collection
Data Archive
Data Warehouse
Data Mart

• Standard Metadata
• Access methods (e.g. XML)
• Enterprise organization of data
• Data quality control
• Local holdings

Levels of Information Analysis

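[Figure: analysis levels ranging between "OLAP & Reporting" and "Advanced Analysis", with an axis labeled "Better Information". Levels in the order listed on the slide:]

• Simulation/Optimization
• Forecasting
• Segmentation
• Model Building
• Hypothesis Testing
• Statistics
• Conditional Climatology
• Visualization
• Climatology
• Percentages
• Counts & Sums
• Queries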

[Figure: diagram with the stage labels Collection and Storage; Processing, Calibration; Productization; Research Quality Data. Other labels: Raw Data, Quality Control, Techniques, Analysis, Knowledge, Society, User Requirements, Users, Skilled Users, Mission Scientists, Scientists.]

What is data mining?

Definition of Data Mining

Data mining (also known as Knowledge Discovery in Databases - KDD) has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1]. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans.

Application to Environmental Data

• Data quality control
• Human linguistic translation
• Event and trend detection
• Data classification
• Forecast
• Deviation detection

Categories of Knowledge Tools

• Reporting and OLAP
• Theory driven modeling:
– Correlations
– t-tests
– ANOVA
– Linear Regression
– Logistic Regression
– Discriminant Analysis
– Forecasting Methods
• Data driven modeling:
– Cluster Analysis
– Factor Analysis
– Decision Trees
– Fuzzy Classifier
– Neural Networks
– Association rules
– Rule induction

2-D Fuzzy C-Means Clustering

Why Fuzzy Logic?

Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth: truth values between "completely true" and "completely false". It was introduced by Dr. Lotfi Zadeh of UC Berkeley in the 1960s as a means to model the uncertainty of natural language.

Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 and 1. Fuzzy logic lets computers function closer to the way our brains work. We aggregate data and form a number of partial truths which we aggregate further into higher truths which in turn, when certain thresholds are exceeded, cause certain further results such as motor reaction.

Fuzzy-Logic

• Jim is 5’2” (157 cm) tall. Is Jim tall?
– Boolean Logic - “NO” (0)
– Fuzzy-Logic - “Jim is .082 tall” (.082)
• Major Advantages:
– Allows more realistic (natural) definition of sets
– More graceful handling of boundaries/intersections
– Provides more human-like searching
• Fuzzy-Logic does NOT impact the data. It is simply a classification technique for selecting the most relevant data, given a set of complex conditions.
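As a rough illustration of the difference between a crisp and a graded answer, the sketch below scores a height with a piecewise-linear membership function. The function names and all breakpoints (150 cm, 190 cm, and the 180 cm crisp threshold) are assumptions made for this example; the slide does not say which membership function produced its .082 value, so the number printed here differs from it.

```python
def tall_membership(height_cm, lower=150.0, upper=190.0):
    """Degree (0..1) to which a height counts as 'tall'.

    lower/upper are hypothetical breakpoints: membership is 0 at or
    below `lower`, 1 at or above `upper`, and linear in between.
    """
    if height_cm <= lower:
        return 0.0
    if height_cm >= upper:
        return 1.0
    return (height_cm - lower) / (upper - lower)


def tall_boolean(height_cm, threshold=180.0):
    """Crisp alternative: 1 at or above a hypothetical threshold, else 0."""
    return 1 if height_cm >= threshold else 0


jim_cm = 157.0
print("Boolean:", tall_boolean(jim_cm))    # crisp yes/no answer
print("Fuzzy:  ", tall_membership(jim_cm))  # graded answer between 0 and 1
```

The data value (157 cm) is never altered; only the degree to which it satisfies the condition "tall" is computed.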

Definition of a fuzzy set

Fuzzy set A in X is a set of ordered pairs

$$A = \{(x, \mu_A(x)) \mid x \in X\}$$

defined by membership function $0 \le \mu_A(x) \le 1$

Classical set A in X is a set of ordered pairs

$$A = \{(x, I_A(x)) \mid x \in X\}$$

defined by indicator function $I_A(x) \in \{0, 1\}$

Fuzzy logic

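First operand: fuzzy set A

Second operand: fuzzy set B

Fuzzy NOT: $\mu_{\bar{A}}(x) = 1 - \mu_A(x)$

Fuzzy AND: $\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$

Fuzzy OR: $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$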
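The three operators can be written in a few lines. This is a minimal sketch using NumPy arrays of membership grades; the function names and the sample grades are illustrative assumptions, not taken from the presentation.

```python
import numpy as np


def fuzzy_not(mu_a):
    """Complement: 1 - mu_A(x)."""
    return 1.0 - np.asarray(mu_a, dtype=float)


def fuzzy_and(mu_a, mu_b):
    """Intersection: element-wise minimum of the two grades."""
    return np.minimum(mu_a, mu_b)


def fuzzy_or(mu_a, mu_b):
    """Union: element-wise maximum of the two grades."""
    return np.maximum(mu_a, mu_b)


# Hypothetical membership grades of four items in sets A and B.
mu_a = np.array([0.0, 0.25, 0.5, 1.0])
mu_b = np.array([1.0, 0.75, 0.5, 0.0])

print("NOT A  :", fuzzy_not(mu_a))
print("A AND B:", fuzzy_and(mu_a, mu_b))
print("A OR B :", fuzzy_or(mu_a, mu_b))
```

When every grade is exactly 0 or 1 these definitions reduce to the ordinary Boolean operators, which is the sense in which fuzzy logic extends conventional logic.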

[Figure: January 1997 records of wind speed (kts), temperature (deg C), and relative humidity (%) plotted against date, 1/1/97 to 1/31/97. The search conditions “High” wind, “Average” temperature, and “About” 60% humidity yield a list of events.]
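A search of this kind (“High” wind, “Average” temperature, “About” 60% humidity) might be sketched as follows. Everything specific here is an assumption for illustration: the shapes and breakpoints of the membership functions, the 0.5 score threshold, the use of fuzzy AND (minimum) to combine the conditions, and the synthetic five-sample record. It is not the presenters' implementation.

```python
import numpy as np


def ramp_up(x, lo, hi):
    """Membership rising linearly from 0 at `lo` to 1 at `hi`."""
    return np.clip((np.asarray(x, dtype=float) - lo) / (hi - lo), 0.0, 1.0)


def triangle(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at `peak`."""
    x = np.asarray(x, dtype=float)
    rising = (x - left) / (peak - left)
    falling = (right - x) / (right - peak)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)


def find_events(wind_kts, temp_c, rh_pct, threshold=0.5):
    """Return (sample index, score) pairs, best match first."""
    high_wind = ramp_up(wind_kts, 8.0, 15.0)       # hypothetical "High" wind
    avg_temp = triangle(temp_c, 5.0, 15.0, 25.0)   # hypothetical "Average" temperature
    about_60 = triangle(rh_pct, 40.0, 60.0, 80.0)  # hypothetical "About" 60% humidity
    # Fuzzy AND of the three conditions: the weakest condition sets the score.
    score = np.minimum(np.minimum(high_wind, avg_temp), about_60)
    order = np.argsort(-score, kind="stable")
    return [(int(i), float(score[i])) for i in order if score[i] >= threshold]


# Synthetic record, not real station data.
wind = np.array([3.0, 16.0, 12.0, 18.0, 5.0])
temp = np.array([15.0, 14.0, 16.0, 28.0, 15.0])
rh = np.array([60.0, 62.0, 55.0, 60.0, 90.0])

events = find_events(wind, temp, rh)
for index, score in events:
    print(f"sample {index}: score {score:.2f}")
```

The result is a ranked list rather than a yes/no filter, so near misses can still be reported with a lower score by lowering the threshold.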

What is fuzzy clustering?

• In non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly one cluster.

• In fuzzy clustering, the data points can belong to more than one cluster, and associated with each of the points are membership grades which indicate the degree to which the data points belong to the different clusters.
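To make the idea of membership grades concrete, here is a minimal sketch of the fuzzy c-means iteration in NumPy. The function name, the fuzzifier value m = 2, the random initialization, and the tiny symmetric data set are assumptions chosen for this example.

```python
import numpy as np


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, u) where u[i, k] is the grade of point i in cluster k."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Random initial memberships; each row is normalized to sum to 1.
    u = rng.random((len(points), n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are means of the points weighted by u**m.
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        # Point-to-center distances, floored to avoid division by zero.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        # Membership update: closer centers receive higher grades.
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


# Two tight groups plus one point halfway between them (synthetic data).
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
                 [0.5, 0.5]])
centers, memberships = fuzzy_c_means(data, n_clusters=2)
print(np.round(memberships, 3))
```

Points inside a group get a grade near 1 for one cluster, while the halfway point is shared roughly equally between the two, which a hard clustering could not express.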

Types of Fuzzy Cluster Algorithms

Classical Fuzzy Algorithms (cumulus-like clusters)
• The fuzzy c-means algorithm
• The Gustafson-Kessel algorithm
• The Gath-Geva algorithm
• Mountain and Subtractive

Linear and Ellipsoidal (lines)
• The fuzzy c-varieties algorithm
• The adaptive clustering algorithm

Shell (circles, ellipses, parabolas)
• Fuzzy c-shells algorithm
• Fuzzy c-spherical algorithm
• Adaptive fuzzy c-shells algorithm

Mountain fuzzy clustering algorithm

• Form a grid on the data space; intersections are candidates for cluster centers
• Construct mountain function representing data density (each data point contributes to the height):

$$m(v) = \sum_{i=1}^{N} \exp\left(-\frac{\|v - x_i\|^2}{2\sigma^2}\right)$$

• Sequentially destruct the mountain function:
– Make dent where highest values are
– Subtracted amount is proportional to the height m(c1) and decreases with the distance between v and c1:

$$m_{new}(v) = m(v) - m(c_1)\exp\left(-\frac{\|v - c_1\|^2}{2\beta^2}\right)$$
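The mountain construction and destruction steps above can be sketched as follows. This is an illustration under stated assumptions: the data are taken to be scaled to the unit hypercube, and the grid resolution, the σ and β values, the `min_ratio` stopping rule, and the synthetic points are all choices made for this example.

```python
import numpy as np


def mountain_clustering(points, grid_size=11, sigma=0.1, beta=0.1,
                        min_ratio=0.2, max_clusters=10):
    """Pick cluster centers from grid nodes by repeatedly flattening peaks."""
    points = np.asarray(points, dtype=float)
    dim = points.shape[1]
    # Grid over [0, 1]^dim; every node is a candidate center.
    axes = [np.linspace(0.0, 1.0, grid_size)] * dim
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, dim)
    # Mountain function: each data point adds a Gaussian bump at every node.
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    m = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)
    first_height = m.max()
    centers = []
    for _ in range(max_clusters):
        k = int(np.argmax(m))
        # Hypothetical stopping rule: quit once the peak is small
        # relative to the first peak.
        if m[k] < min_ratio * first_height:
            break
        center, height = grid[k].copy(), m[k]
        centers.append(center)
        # Destruction step: subtract a scaled Gaussian around the new center.
        dc2 = ((grid - center) ** 2).sum(axis=1)
        m = m - height * np.exp(-dc2 / (2.0 * beta ** 2))
    return np.array(centers)


# Synthetic data: two small groups in the unit square.
data = np.array([[0.20, 0.20], [0.22, 0.18], [0.18, 0.21],
                 [0.80, 0.70], [0.79, 0.72], [0.82, 0.69]])
found = mountain_clustering(data)
print(found)
```

The cost grows with the number of grid nodes, which rises exponentially with the number of dimensions; this is why the method is simple but computationally expensive.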

2D density mountains

[Figure, panels (a)-(d): mountain function with (b) σ=0.05, (c) σ=0.1, (d) σ=0.2]

2D mountain clustering

[Figure, panels (a)-(d): mountain destruction with β=1; (b) first cluster, (c) second, (d) third]

Mountain fuzzy clustering

• No need to set number of clusters a priori
• Simple, but computationally expensive
• May be used to generate fuzzy rules relating the variables (knowledge discovery)
• May be generalized to subtractive clustering

Yager, R. and D. Filev, "Generation of Fuzzy Rules by Mountain Clustering," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, pp. 209-219, 1994.

Subtractive clustering

The method assumes each data point is a potential cluster center and calculates a measure of the likelihood that each data point would define the cluster center, based on the density of surrounding data points:

• Selects the data point with the highest potential to be the first cluster center

• Removes all data points in the vicinity of the first cluster center, in order to determine the next data cluster and its center location

• Iterates on this process until all of the data is within the radius of a cluster center
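A minimal sketch of these steps follows. Instead of deleting nearby points outright, it lowers their potential around each chosen center, which is how the cluster estimation method is usually formulated (Chiu, 1994). The single-ratio stopping rule is a simplification, and the parameter values (`radius`, `squash`, `min_ratio`) and the synthetic data are assumptions for this example.

```python
import numpy as np


def subtractive_clustering(points, radius=0.3, squash=1.5, min_ratio=0.15):
    """Choose cluster centers among the data points themselves."""
    points = np.asarray(points, dtype=float)
    # Squared distances between every pair of data points.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2
    # Potential of each point: high where many neighbors are close.
    potential = np.exp(-alpha * d2).sum(axis=1)
    first_peak = potential.max()
    centers = []
    while True:
        k = int(np.argmax(potential))
        # Simplified stopping rule (assumption): stop when the best
        # remaining potential is small relative to the first peak.
        if potential[k] < min_ratio * first_peak:
            break
        centers.append(points[k].copy())
        # Reduce the potential of points near the new center.
        potential = potential - potential[k] * np.exp(-beta * d2[:, k])
    return np.array(centers)


# Synthetic data scaled to the unit square: two small groups.
data = np.array([[0.20, 0.20], [0.22, 0.18], [0.18, 0.21],
                 [0.80, 0.70], [0.79, 0.72], [0.82, 0.69]])
modes = subtractive_clustering(data)
print(modes)
```

Because candidates are the observations themselves, no grid is needed and every returned center is a value that was actually observed.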

Subtractive clustering advantages

• No grids in the parameter space: computationally efficient

• Fuzzy clusters centered at the observation points: real modes selection

• May be used to generate fuzzy rules relating the variables (knowledge discovery)

Chiu, S., "Fuzzy Model Identification Based on Cluster Estimation," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, Sept. 1994

Techniques for SDS

Data Quality Control

The Space Weather Reanalysis (SWR), a long-term reanalysis, requires careful quality control of a huge volume of data. A single instance of bad data can have ripple effects throughout the entire model run. Working with satellite and station data in particular can be tricky, with spikes, baseline shifts, and dropouts all prominent in the data stream. In a typical small-scale study a researcher could screen the data by hand, but here the volume requires the application of “intelligent” computer techniques, based on fuzzy logic, neural computing and other mathematical functions.

Sample station data used in the SWR effort.
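As one hedged illustration of fuzzy quality control, the sketch below assigns each sample a degree of membership in the set "spike" based on its deviation from a running median. It is not the SWR screening procedure, which the slides do not describe; the window length, the soft and hard thresholds, and the synthetic series are assumptions, and baseline shifts and dropouts are not handled.

```python
import numpy as np


def spike_membership(values, window=5, soft=3.0, hard=6.0):
    """Degree (0..1) to which each sample looks like a spike.

    `window` must be odd. `soft` and `hard` are hypothetical thresholds on
    the robust z-score: membership is 0 below `soft` and 1 above `hard`.
    """
    values = np.asarray(values, dtype=float)
    half = window // 2
    # Running median as a baseline that isolated spikes cannot drag along.
    padded = np.pad(values, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    baseline = np.median(windows, axis=1)
    residual = np.abs(values - baseline)
    # Robust scale from the median absolute residual, floored above zero.
    scale = max(float(np.median(residual)) * 1.4826, 1e-9)
    z = residual / scale
    return np.clip((z - soft) / (hard - soft), 0.0, 1.0)


# Synthetic station record with one obvious spike at index 4.
series = np.array([10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8, 10.1, 10.2, 9.9])
grades = spike_membership(series)
print(np.round(grades, 2))
```

A graded flag lets borderline samples be routed to a human reviewer while clear spikes are rejected automatically, and the original values are left untouched.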

Some Preliminary Results

Scenario: Boulder for Mid October

Parameters Studied: Temperature (surface), Relative Humidity

Impacts: Scenario represents likely impacts on an IR sensor instrument.

Data Source: NCEP Reanalysis (20 years)

Technique: Subtractive Clustering

Visual Standard Output

Boulder 00 UT Rh vs. T

[Figure: relative humidity (%) vs. temperature (K) at 00:00 UT]

Boulder Rh vs. T 20 Year Composite

[Figure: 3D plot with axes Temp (K), RelHumid (%), and Daytime (h)]

Boulder Derived Modes (subtractive clustering)

[Figure: 3D plot with axes Temp (K), RelHumid (%), and Daytime (h)]

Conclusions

• Increasing data volumes demand new tools and methods
• Mathematical methods exist that provide analysis, classification and forecasting for large data volumes
• Fuzzy-based systems hold great promise as knowledge extraction tools.

Resources

Books
• Fuzzy Cluster Analysis, Hoppner et al.
• Neuro-Fuzzy and Soft Computing, Jang et al.
• Fuzzy Logic, Yen
• System Identification, Ljung

Web
• The Environmental Scenario Generator: http://esg.ngdc.noaa.gov
• Data Mining and Knowledge Discovery: http://www.digimine.com/usama/datamine/

An excellent introductory article is: Bezdek, James C, "Fuzzy Models --- What Are They, and Why?", IEEE Transactions on Fuzzy Systems, 1:1, pp. 1-6, 1993.

