

Scientific Data Mining

Chandrika Kamath

October 7, 2008

Lawrence Livermore National Laboratory


Goal: solving the problem of data overload

• Use scientific data mining techniques to analyze data from various SciDAC applications

• Techniques borrowed from image and video processing, machine learning, statistics, pattern recognition, …

• Leveraging the Sapphire scientific data mining software, with functions added as required

• Contributors to the SciDAC part: Erick Cantú-Paz, Imola K. Fodor, Siddharth Manay, Nicole S. Love


Overview of Sapphire


Sapphire: scientific data mining (1998-2008)

• We analyze science data from experiments, observations, and simulations: massive *and* complex

• Sapphire has a three-fold focus

  • research in robust, accurate, scalable algorithms

  • modular, extensible software

  • analysis of data from practical problems

• Funded through DOE NNSA, LLNL LDRD, SDM SciDAC Center, GSEP SciDAC project

https://computation.llnl.gov/casc/sapphire


Scientific data mining - from a Terabyte to a Megabyte

Data flow: Raw Data → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge

Data Preprocessing

• Data fusion, sampling, multi-resolution analysis

• De-noising, object identification, feature extraction, normalization

• Dimension reduction

Pattern Recognition

• Classification, clustering, regression

Interpreting Results

• Visualization, validation

An iterative and interactive process


The Sapphire system architecture: flexible, portable, scalable

[Architecture diagram]

• Data processing: de-noise data, background subtraction, identify objects, extract features, sample data, fuse data, multi-resolution analysis

• RDB data store: data items, features

• Feature processing: normalization, dimension reduction

• Pattern recognition: decision trees, neural networks, SVMs, k-nearest neighbors, clustering, evolutionary algorithms, tracking, …

• File formats: FITS, BSQ, PNM, View, …

• Display patterns; user input and feedback

• Legend: Sapphire software; public domain software; Sapphire and domain software

• Components linked by Python

US Patents 6675164 (1/04), 6859804 (2/05), 6879729 (4/05), 6938049 (8/05), 7007035 (2/06), 7062504 (6/06)


The modular software is used to meet the needs of different applications

[Diagram: Sapphire software]

• Applications: remote sensing, astronomy, climate simulations, video surveillance, plasma physics, fragmentation of materials, fluid mix and turbulence, Sim/Expt comparison, …

• Sapphire libraries: scientific data processing, dimension reduction, pattern recognition

• Drivers, support functions

• Graphical interface, command-line interface

In this talk, I focus only on SciDAC applications


SciDAC achievements


Application 1: Separating signals in climate data

• We used independent component analysis to separate El Niño and volcano signals in climate simulations

• Showed that the technique can be used to enable better comparisons of simulations

Collaboration with Ben Santer (LLNL)
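
As a rough illustration of what independent component analysis does, the sketch below unmixes two synthetic signals. It is not the algorithm or the data used in the climate study: it assumes exactly two observed mixtures, whitens them in closed form, and then searches for the rotation that maximizes the summed absolute excess kurtosis (a common non-Gaussianity measure). All names, the mixing weights, and the test signals are made up for the example.

```python
import math

def whiten_2d(x1, x2):
    """Center two signals, then rotate and scale them to unit, uncorrelated variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    a = sum(v * v for v in c1) / n
    c = sum(v * v for v in c2) / n
    b = sum(u * v for u, v in zip(c1, c2)) / n
    # Closed-form eigen-decomposition of the 2x2 covariance [[a, b], [b, c]].
    phi = 0.5 * math.atan2(2 * b, a - c)
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = (a + c) / 2 + half_gap, (a + c) / 2 - half_gap
    cp, sp = math.cos(phi), math.sin(phi)
    z1 = [(cp * u + sp * v) / math.sqrt(lam1) for u, v in zip(c1, c2)]
    z2 = [(-sp * u + cp * v) / math.sqrt(lam2) for u, v in zip(c1, c2)]
    return z1, z2

def rotate(z1, z2, theta):
    """Rotate the pair of whitened signals by angle theta."""
    ct, st = math.cos(theta), math.sin(theta)
    y1 = [ct * u + st * v for u, v in zip(z1, z2)]
    y2 = [-st * u + ct * v for u, v in zip(z1, z2)]
    return y1, y2

def excess_kurtosis(y):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return sum(v ** 4 for v in y) / len(y) - 3.0

def ica_2d(x1, x2, steps=180):
    """Grid-search the rotation of the whitened data that is least Gaussian."""
    z1, z2 = whiten_2d(x1, x2)
    best = max(
        (math.pi / 2 * k / steps for k in range(steps)),
        key=lambda th: sum(abs(excess_kurtosis(y)) for y in rotate(z1, z2, th)),
    )
    return rotate(z1, z2, best)

def correlation(u, v):
    """Pearson correlation, used only to check the recovered components."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sv = math.sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)

# Synthetic stand-ins for two overlapping signals (not climate data).
n = 2000
s1 = [math.sin(0.05 * t) for t in range(n)]          # smooth oscillation
s2 = [((0.013 * t) % 1.0) - 0.5 for t in range(n)]   # sawtooth
x1 = [0.7 * p + 0.3 * q for p, q in zip(s1, s2)]     # observed mixture 1
x2 = [0.4 * p + 0.6 * q for p, q in zip(s1, s2)]     # observed mixture 2
y1, y2 = ica_2d(x1, x2)
```

The recovered components come back in arbitrary order and sign, so each one is compared against both sources with `correlation` and the absolute value is used.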


Application 2: Identifying key features for EHOs in DIII-D

• We used dimension reduction techniques from statistics and machine learning to identify key features associated with edge harmonic oscillations (EHOs) in the DIII-D tokamak

• H-mode is the preferred mode of operation, but it is associated with edge localized modes (ELMs), which can damage components of the tokamak

• A quiescent H-mode has been observed; it is associated with EHOs, so we need to understand EHOs better

• The key variables identified are being used to understand the cause of EHOs; the software has been licensed to GAT

Collaboration with Keith Burrell and Mike Walker (GAT)


The data is from sensors in DIII-D

• 700 experiments, each lasting 6 seconds

• Each 50 ms window of an experiment is assigned a low or high EHO-ness label

• Each window is described by 37 sensor measurements

• Data cleanup

  • discard windows with at least one missing sensor value

  • use median value of variable in window

  • discard windows with at least one variable in the top or bottom percentile of its range

  • resulted in 41,818 instances
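
The cleanup steps above can be sketched as follows. This is a minimal illustration, not the Sapphire code: the data layout (one dict per window, mapping a sensor name to its samples, with `None` marking a missing value) and the function name `clean_windows` are assumptions, and "top or bottom percentile of its range" is interpreted here as the lowest and highest 1% of the span between the observed minimum and maximum.

```python
from statistics import median

def clean_windows(windows, tail=0.01):
    """windows: list of dicts mapping sensor name -> list of samples (None = missing)."""
    # Step 1: discard windows with at least one missing sensor value.
    complete = [
        w for w in windows
        if all(len(s) > 0 and None not in s for s in w.values())
    ]
    # Step 2: represent each sensor in a window by the median of its samples.
    rows = [{name: median(s) for name, s in w.items()} for w in complete]
    if not rows:
        return []
    # Step 3: discard windows with any variable in the extreme tails of its range.
    bounds = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        lo, hi = min(values), max(values)
        margin = tail * (hi - lo)
        bounds[name] = (lo + margin, hi - margin)
    return [
        r for r in rows
        if all(bounds[n][0] <= r[n] <= bounds[n][1] for n in r)
    ]
```

Under this interpretation the windows holding the minimum and maximum of any variable are always dropped; a percentile computed on the distribution of values instead of the range would behave differently.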


Challenge: no preconceived notion of which sensor values are important

• Data cleanup: prevents outliers from influencing results

• Use different feature selection methods to gain confidence

  • PCA filter – use magnitude of coefficients

  • Distance filter – Kullback-Leibler distance between histograms

  • Stump filter

  • Chi-square filter

  • Boosting approach

• Introduce a “noise” feature
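
To make the distance filter and the "noise" feature concrete, here is a small sketch under stated assumptions. For each feature it builds one histogram per class (low or high EHO-ness), scores the feature by a symmetrized Kullback-Leibler divergence between the two histograms, and appends a random feature as a baseline. The bin count, the smoothing constant, the symmetrized form, and all function and feature names are choices made for this example, not details taken from the slides.

```python
import math
import random

def histogram(values, lo, hi, bins=10, eps=1e-6):
    """Normalized histogram on [lo, hi], smoothed so no bin is exactly zero."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0   # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]

def kl_distance(p, q):
    """Symmetrized Kullback-Leibler divergence between two discrete distributions."""
    return sum(a * math.log(a / b) + b * math.log(b / a) for a, b in zip(p, q))

def rank_features(rows, labels, bins=10, seed=0):
    """rows: list of dicts feature -> value; labels: 0 (low) or 1 (high)."""
    rng = random.Random(seed)
    # Add a pure-noise feature (hypothetical name "noise") as a baseline.
    rows = [dict(r, noise=rng.random()) for r in rows]
    scores = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        lo, hi = min(values), max(values)
        low = [v for v, y in zip(values, labels) if y == 0]
        high = [v for v, y in zip(values, labels) if y == 1]
        scores[name] = kl_distance(
            histogram(low, lo, hi, bins), histogram(high, lo, hi, bins)
        )
    # Largest distance first: the classes differ most on these features.
    return sorted(scores, key=scores.get, reverse=True)
```

A real feature that ranks at or below `noise` is a candidate for being uninformative, which is one way to read the purpose of the noise feature.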


We evaluated the features using a naïve Bayes classifier
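
One way such an evaluation could look is sketched below: train a naïve Bayes classifier on a chosen subset of features and report its error rate, so that different subsets can be compared. The slides do not say which variant was used; this sketch assumes a Gaussian model per feature, and the function names are hypothetical.

```python
import math

def fit_gaussian_nb(rows, labels, features):
    """Per class: prior plus (mean, variance) of each selected feature."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        stats = {}
        for f in features:
            vals = [r[f] for r in members]
            mean = sum(vals) / len(vals)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-9
            stats[f] = (mean, var)
        model[cls] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the largest log posterior for one instance."""
    def log_posterior(cls):
        prior, stats = model[cls]
        total = math.log(prior)
        for f, (mean, var) in stats.items():
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (row[f] - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

def error_rate(train, train_y, test, test_y, features):
    """Fraction of test instances misclassified when using only `features`."""
    model = fit_gaussian_nb(train, train_y, features)
    wrong = sum(predict(model, r) != y for r, y in zip(test, test_y))
    return wrong / len(test)
```

Calling `error_rate` with the top 1, 2, 3, … features from a ranking gives an error curve per selection method, which is one plausible way to compare the methods.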


We also considered the top ten features selected by the methods

PCA filter   Distance   Chi-square   Stump   Boosting
F33          F27        F25          F25     F27
F32          F25        F27          F7      F25
F6           F9         F7           F27     F9
F23          F26        F29          F26     F14
F17          F30        F9           F30     F8
F12          F29        F26          F9      F26
F20          F7         F6           F19     F6
F21          F8         F8           F29     F22
F10          F14        F30          F31     F29
F11          F6         F19          F8      F23


Several features are common across different methods

PCA filter   Distance   Chi-square   Stump   Boosting
F33          F27        F25          F25     F27
F32          F25        F27          F7      F25
F6           F9         F7           F27     F9
F23          F26        F29          F26     F14
F17          F30        F9           F30     F8
F12          F29        F26          F9      F26
F20          F7         F6           F19     F6
F21          F8         F8           F29     F22
F10          F14        F30          F31     F29
F11          F6         F19          F8      F23

Multiple methods provide confidence in results
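
The overlap can be checked directly by counting how many of the five top-ten lists each feature appears in. The lists below are copied from the table, read column by column; the variable names are only for illustration.

```python
from collections import Counter

# Top-ten features per selection method, read column-wise from the table.
top_ten = {
    "PCA filter": ["F33", "F32", "F6", "F23", "F17", "F12", "F20", "F21", "F10", "F11"],
    "Distance":   ["F27", "F25", "F9", "F26", "F30", "F29", "F7", "F8", "F14", "F6"],
    "Chi-square": ["F25", "F27", "F7", "F29", "F9", "F26", "F6", "F8", "F30", "F19"],
    "Stump":      ["F25", "F7", "F27", "F26", "F30", "F9", "F19", "F29", "F31", "F8"],
    "Boosting":   ["F27", "F25", "F9", "F14", "F8", "F26", "F6", "F22", "F29", "F23"],
}

# Number of methods that placed each feature in their top ten.
counts = Counter(f for features in top_ten.values() for f in features)

# Features chosen by at least four of the five methods, in numeric order.
common = sorted((f for f, n in counts.items() if n >= 4), key=lambda f: int(f[1:]))
print(common)  # ['F6', 'F8', 'F9', 'F25', 'F26', 'F27', 'F29']
```

No feature appears in all five lists; the PCA filter shares only F6 and F23 with the other methods, while the other four lists agree on most of their entries.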


Application 3: Classifying and characterizing orbits in Poincaré plots

• I am using techniques from scientific data mining to assign one of four labels to an orbit and extract characteristics of separatrix and island chain orbits.

Collaboration with J. Breslau, N. Pomphrey, D. Monticello (PPPL), S. Klasky (ORNL)


There are four classes of orbits – based on the location of the initial point

• Quasi-periodic

• Island chain

• Separatrix

• Stochastic


Challenge: There is a large variation in the orbits of any one class

quasiperiodic orbits


Variation in island-chain orbits


Variation in separatrix orbits

[Figure panels: 5000 points, 1000 points]


How do we extract representative features for an orbit?

• Variation in the data makes it difficult to identify good features and extract them in a robust way

• Issues with labels assigned to orbits

• Next steps: characterizing island chains and separatrix orbits

Identifying missing orbits


Application 4: Tracking blobs in fusion plasma

• We are using image and video processing techniques to identify and track blobs in experimental data from NSTX to validate and refine theories of edge turbulence

[Figure: frames t, t+1, t+2, shown as denoised original, after removal of background, and after detection of blobs]

Collaboration with S. Zweben, R. Maqueda, and D. Stotler (PPPL)
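
The background-removal and detection stages can be sketched as below. This is only an illustration under assumptions, not the method used in the project: the background is taken to be the per-pixel temporal median over the sequence, blobs are taken to be 4-connected regions above a fixed threshold, and frames are plain nested lists. De-noising and frame-to-frame tracking are not shown.

```python
from statistics import median

def remove_background(frames):
    """Subtract the per-pixel temporal median (an assumed background model)."""
    rows, cols = len(frames[0]), len(frames[0][0])
    bg = [[median(f[r][c] for f in frames) for c in range(cols)] for r in range(rows)]
    return [
        [[f[r][c] - bg[r][c] for c in range(cols)] for r in range(rows)]
        for f in frames
    ]

def detect_blobs(image, threshold):
    """Return (row, col) centroids of 4-connected regions brighter than threshold."""
    rows, cols = len(image), len(image[0])
    seen = set()
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or (r, c) in seen:
                continue
            # Flood fill from this pixel to collect one connected region.
            stack, pixels = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and image[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((cy, cx))
    return centroids
```

Linking the centroids of consecutive frames (for example by nearest distance) would be one simple way to turn detections into tracks; the threshold value is a free parameter that has to be chosen for the data.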


Goal: understand the turbulence which causes leakage of the plasma

• Requirements for fusion – high temperature and confined plasma

• Fine-scale turbulence at the edge causes leakage of plasma from the center to the edge

  • Loss of confinement

  • Heat loss of plasma

  • Erosion or vaporization of the containment wall


The Gas-Puff Imaging diagnostic is used to view the coherent structures

• Turbulence in the form of density filaments highly elongated in the direction of the magnetic field

• Inject a gas cloud in the torus, and capture the intersection of the cloud with the filament using a camera which views the filament along the magnetic field

[Figure label: GPI view, 16x32 cm]


Data from GPI in NSTX

• PSI-5 camera captures GPI images

• 300-frame sequences taken at 250,000 frames/sec

• 16-bit images with 64x64 pixels
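
The figures above imply how short and how small each sequence is. The arithmetic below assumes uncompressed storage with 2 bytes per 16-bit pixel and no file headers; those storage details are assumptions, not statements from the slides.

```python
frames = 300              # frames per sequence
fps = 250_000             # frames per second
width = height = 64       # pixels
bytes_per_pixel = 2       # 16-bit pixels, assumed stored uncompressed

# Time between frames and total time covered by one sequence, in microseconds.
frame_interval_us = 1_000_000 / fps
duration_us = frames * 1_000_000 // fps

# Raw size of one sequence in bytes.
sequence_bytes = frames * width * height * bytes_per_pixel

print(frame_interval_us, duration_us, sequence_bytes)
```

Each sequence therefore covers 1.2 ms of plasma evolution at 4 µs per frame, and occupies about 2.5 MB in raw form.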


Why is this difficult?

• coherent structures are poorly understood empirically and not understood theoretically

• no known ground-truth

• noisy images

• variation within a sequence


Example frames to segment (sequence 113734: frames 1-50)


We are investigating several image segmentation methods

• Immersion-Based: basic immersion, constrained watershed, watershed merging

• Region Growing: seeded region growing, seed competition

• Model-Based: 2-D Gaussian fit

• Challenges: how do we select the parameters in an algorithm, how do we handle the variability in the data especially for longer sequences, how do the choices of algorithms and parameters influence the “science”, …

Ongoing work: see AHM 2007 slides
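
As an illustration of one method on this list, here is a minimal sketch of seeded region growing. It is a generic textbook-style variant, not the implementation being investigated: a pixel joins the region if its intensity is within `tolerance` of the current region mean, using 4-connectivity. The function name, the acceptance rule, and the parameter are assumptions.

```python
def grow_region(image, seed, tolerance):
    """Grow a 4-connected region from `seed` (row, col).

    A neighboring pixel is added when its intensity differs from the
    current mean intensity of the region by at most `tolerance`.
    """
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]   # running sum of region intensities
    frontier = [seed]
    while frontier:
        y, x = frontier.pop()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if not (0 <= ny < rows and 0 <= nx < cols) or (ny, nx) in region:
                continue
            if abs(image[ny][nx] - total / len(region)) <= tolerance:
                region.add((ny, nx))
                total += image[ny][nx]
                frontier.append((ny, nx))
    return region
```

The result depends directly on `tolerance`: a small value keeps only the bright core of a blob, while a large value lets the region spread into the background, which is a simple instance of the parameter-selection challenge listed above.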


Vision for the future

• Meeting algorithm requirements of current applications

  • Robust extraction of feature vectors (orbit characterization)

  • Improved algorithms for image analysis (blob characterization)

  • Uncertainty quantification (how much can we trust the result?)

• Meeting the science goals

  • Classification and characterization of Poincaré plots

  • Tracking the blobs in NSTX

  • Extraction of coherent structures in fluid and particle data and their non-linear interactions (GSEP)

• Addressing requests from new applications – SNS, materials science, combustion, power grid, …

  • Deploy as requested
