VISUALIZATION TECHNIQUES UTILIZING THE SENSITIVITY...

Post on 15-Jul-2020


VISUALIZATION TECHNIQUES UTILIZING THE SENSITIVITY ANALYSIS OF MODELS

Ivo Kondapaneni, Pavel Kordík, Pavel Slavík
Department of Computer Science and Engineering, Faculty of Electrical Engineering,

Czech Technical University in Prague, Czech Republic

Presenting author: Pavel Kordík (kordikp@fel.cvut.cz)


2

Overview

• Motivation
• Data mining models
• Visualization based on sensitivity analysis
• Regression problems
• Classification problems
• Definition of interesting plots
• Genetic search for 2D and 3D plots

3

Motivation

• Data mining: extracting new, potentially useful information from data
• DM models are automatically generated
• Are models always credible?
• Are models comprehensible?
• How to extract information from models? Visualization

4

Data mining models

• Often black-box models generated from data
• E.g. neural networks
• What is inside?

[Figure: data mining black-box model with input variables on one side and output variable(s) on the other]

5

Inductive model

• Estimates output from inputs
• Generated automatically
• Evolved by niching GA
• Grows from minimal form
• Contains hybrid units
• Several training methods
• Ensemble of models

[Figure: layered network from input variables to output variable, showing the first layer, second layer (3 inputs max), third layer (4 inputs max), output layer, and an interlayer connection; units are labeled P, C, G, and L]

6

Example: Housing data

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

• CRIM: per capita crime rate by town
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centres
• MEDV: median value of owner-occupied homes in $1000's

7

Housing data – records

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

MEDV  CRIM     ZN  INDUS  NOX   RM     AGE   DIS     RAD  TAX  PTRATIO  B      LSTA
24    0.00632  18  2.31   53.8  6.575  65.2  4.09    1    296  15.3     396.9  4.98
21.6  0.02731  0   7.07   46.9  6.421  78.9  4.9671  2    242  17.8     396.9  9.14
…     …        …

8

Housing data – inductive model

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

Niching genetic algorithm evolves units in first layer

• sigmoid: MEDV=1/(1-exp(-5.724*CRIM+1.126)), Error: 0.13
• sigmoid: MEDV=1/(1-exp(-5.861*AGE+2.111)), Error: 0.21

9

Housing data – inductive model

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

Niching genetic algorithm evolves units in second layer

• First layer: sigmoid (Error: 0.13), sigmoid (Error: 0.21), sigmoid (Error: 0.26), linear (Error: 0.24)
• Second layer: polynomial (Error: 0.10)

MEDV=0.747*(1/(1-exp(-5.724*CRIM+1.126)))+0.582*(1/(1-exp(-5.861*AGE+2.111)))^2+0.016

10

Housing data – inductive model

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

[Figure: network units, row by row from the inputs: sigmoid, sigmoid, sigmoid, linear; polynomial; polynomial, linear; exponential]

Error: 0.08

Constructed model has very low validation error!

11

Housing data – inductive model

Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA

Output variable: MEDV

[Figure: network units, row by row from the inputs: S S S L; P; P L; E]

Error: 0.08

MEDV=(exp((0.038* 3.451*(1/(1-exp(-5.724*CRIM+ 1.126)))*(1/(1-exp(2.413*DIS-2.581)))*(1/(1-exp(2.413*DIS-2.581)))+0.429*(1/(1-exp(-5.861*AGE+ 2.111)))+0.024*(1/(1-exp(2.413*DIS-2.581)))+0.036+ 0.038*0.350*(1/(1-exp(-3.613*RAD-0.088)))+ 0.999*( 0.747*(1/(1-exp(-5.724*CRIM+ 1.126)))+0.582*(1/(1-exp(-5.861*AGE+ 2.111)))*(1/(1-exp(-5.861*AGE+ 2.111)))+0.016)-0.046*(1/(1-exp(-5.724*CRIM+ 1.126)))-0.079+ 0.002*INDUS-0.001*LSTA+ 0.150)*0.860)*13.072)-14.874

The math equation is no longer comprehensible: we have to treat it as a black-box model!

12

Visualization based on sensitivity analysis

[Figure: the GAME (ModGMDH) model with inputs x1, x2, x3. In each panel one input is moving from min to max while the other inputs are held constant, and the model output y is plotted against the moving input.]
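The sweep pictured on this slide (one input moving between its minimum and maximum, all others constant) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `sensitivity_sweep` and `toy_model` are hypothetical names, and `toy_model` is a made-up stand-in for a trained model.

```python
from typing import Callable, List, Sequence, Tuple


def sensitivity_sweep(
    model: Callable[[Sequence[float]], float],
    base_point: Sequence[float],
    moving_index: int,
    lo: float,
    hi: float,
    steps: int = 50,
) -> List[Tuple[float, float]]:
    """Move one input from lo to hi while all other inputs stay at base_point."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    curve = []
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        point = list(base_point)      # copy so the constant inputs are untouched
        point[moving_index] = x       # only this input moves
        curve.append((x, model(point)))
    return curve


# Hypothetical stand-in for a trained model (NOT the GAME model from the slides).
def toy_model(inputs: Sequence[float]) -> float:
    x1, x2, x3 = inputs
    return 2.0 * x1 + x2 * x2 - 0.5 * x3


# x2 moves from -1 to 1; x1 and x3 are held at 1.0 and 2.0.
curve = sensitivity_sweep(toy_model, [1.0, 0.0, 2.0], moving_index=1, lo=-1.0, hi=1.0, steps=5)
```

Plotting the second element of each pair against the first gives the sensitivity curve for the moving input.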

13

Sensitivity analysis of inductive model of MEDV

[Figure: sensitivity plots for House no. 189 and House no. 164]

What happens to the value of a house when the crime rate in the area decreases or increases?

Credible output?

14

Ensemble of inductive models

• Random initialization
• Developing on the same training set
• Training affects just well-defined areas of input space
• Each model: unique architecture, similar complexity, similar transfer functions
• Similar behavior for well-defined areas
• Different behavior for under-defined areas

[Figure: several GAME (ModGMDH) models, each with inputs x1, x2, x3, producing outputs yk-1, yk, yk+1; the responses are plotted together while x2 moves from min to max]

15

Credibility of models: Artificial data set

Credibility: the criterion is the dispersion of the models' responses.

Advantages:

• No need for the training data set,

• Modeling method success considered,

• Input importance considered.
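As a rough illustration of the criterion above, the dispersion of an ensemble's responses can be measured at any input point without touching the training data. The slide does not say which dispersion measure is used, so the population standard deviation below is an assumption, and `response_dispersion` and the three toy models are hypothetical.

```python
from statistics import pstdev
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def response_dispersion(models: Sequence[Model], point: Sequence[float]) -> float:
    """Population standard deviation of the ensemble responses at one input point.

    Assumption: standard deviation stands in for the unspecified dispersion measure.
    """
    responses = [model(point) for model in models]
    return pstdev(responses)


# Hypothetical ensemble: the models agree near x = 0 and drift apart further away.
ensemble = [
    lambda p: p[0],
    lambda p: p[0] + 0.5 * p[0] ** 2,
    lambda p: p[0] - 0.5 * p[0] ** 2,
]

near = response_dispersion(ensemble, [0.1])  # models nearly agree here
far = response_dispersion(ensemble, [2.0])   # models disagree strongly here
```

Under this reading, a small value suggests a region where the ensemble response is credible, and a large value flags a region where it is not.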

16

Example: Models of hot water consumption

17

Cold water consumption, increasing humidity

18

Models on Housing data

• Single model
• Ensemble of 10 models

[Figure: Before and After plots]

19

Classification problems

• Data: Setosa class, Virginica class, Versicolor class
• Blue: GAME model (Iris Setosa class)

[Figure: scatter plot of petal length against petal width; the model output is 1 on one side of the decision boundary and 0 on the other]

20

Credibility of classifiers

[Figure: for each class (Iris Setosa, Iris Virginica, Iris Versicolor): GAME model 1 * GAME model 2 * GAME model 3 = GAME models (1*2*3)]
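The "1*2*3" combination shown on this slide can be read as multiplying the outputs of several classifiers for the same class, so the product stays high only where all of them agree. The sketch below assumes outputs in the range 0 to 1; `combined_output` is a hypothetical name and the threshold classifiers are invented stand-ins, not trained GAME models.

```python
from typing import Callable, Sequence

# (petal_length, petal_width) -> class membership output, assumed to lie in [0, 1]
Classifier = Callable[[float, float], float]


def combined_output(
    classifiers: Sequence[Classifier], petal_length: float, petal_width: float
) -> float:
    """Multiply the outputs of all classifiers at one point of the input space."""
    product = 1.0
    for classify in classifiers:
        product *= classify(petal_length, petal_width)
    return product


# Hypothetical stand-ins for three models of the same class; thresholds are made up.
setosa_models = [
    lambda length, width: 1.0 if length < 2.5 else 0.0,
    lambda length, width: 1.0 if width < 0.8 else 0.0,
    lambda length, width: 1.0 if length + width < 3.0 else 0.0,
]

inside = combined_output(setosa_models, 1.4, 0.2)    # all three models output 1
disputed = combined_output(setosa_models, 2.4, 0.9)  # only the first model outputs 1
```

A point keeps a nonzero combined output only if no model rejects it, which is one way to filter out regions where individual models behave arbitrarily.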

21

Overlapping models X ensemble

22

Random behavior filtered out

[Figure: Before and After plots]

23

Problem – how to find information in n-dim space?

• Multidimensional space of input variables
• What are we looking for?
  – Interesting relationship of IO variables
  – Regions of high sensitivity
  – Credible models (compromise response)
• Can we automate the search?

24

When is a plot interesting for us?

[Figure: plot over the input xi with the displayed interval marked by xistart and xisize]

25

Definition of interesting plot

• Minimal volume of the envelope: p → min
• Maximal sensitivity of the output to the change of the xi input variable: ysize → max
• Maximal size of the area: xisize → max

26

Multiobjective optimization

• Interestingness:
• Unknown variables:
  – x1, x2, ..., xi-1, xi+1, …, xn, xistart, xisize
• We will use “Niching” genetic algorithm

Chromosome: x1 x2 ... xi-1 xi+1 … xn xistart xisize
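To make the chromosome layout concrete, the sketch below decodes one chromosome into the input vectors needed to draw a plot: the fixed values fill every input except xi, and xi itself is swept over the encoded interval. It assumes the interval runs from xistart to xistart + xisize, which the slide does not state explicitly; `decode_chromosome` is a hypothetical helper and the index `i` is zero-based.

```python
from typing import List, Sequence


def decode_chromosome(chromosome: Sequence[float], i: int, steps: int = 5) -> List[List[float]]:
    """Turn [x1 .. xi-1, xi+1 .. xn, xistart, xisize] into full input vectors.

    Assumption: the plotted interval is [xistart, xistart + xisize].
    """
    if len(chromosome) < 2:
        raise ValueError("chromosome must contain at least xistart and xisize")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    *fixed, xi_start, xi_size = chromosome
    if not 0 <= i <= len(fixed):
        raise ValueError("i must be a valid position among the inputs")
    points = []
    for k in range(steps):
        xi = xi_start + xi_size * k / (steps - 1)
        # Re-insert the swept input at position i among the fixed inputs.
        points.append(fixed[:i] + [xi] + fixed[i:])
    return points


# Three inputs; the middle one is swept over [0, 4] while the others stay at 10 and 30.
points = decode_chromosome([10.0, 30.0, 0.0, 4.0], i=1, steps=3)
```

Feeding each decoded vector to every model of the ensemble yields the curves from which a plot's interestingness can be judged.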

27

Niching GA on simple data

• Very simple problem
• Search space is 2D, can be visualized

Chromosome: xstart xsize

fitness = xsize * 1/p * ysize

[Figure: GAME ensemble responses with the training data, and the fitness plotted over xstart and xsize]
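The fitness formula above can be evaluated for a one-input ensemble roughly as follows. The slides do not define how p and ysize are computed, so this sketch makes explicit assumptions: p is the area between the highest and lowest ensemble response over the interval, ysize is the range of the mean response, and a small `eps` guards against division by zero. `plot_fitness` and the toy models are hypothetical.

```python
from typing import Callable, Sequence


def plot_fitness(
    models: Sequence[Callable[[float], float]],
    x_start: float,
    x_size: float,
    steps: int = 21,
    eps: float = 1e-9,
) -> float:
    """fitness = xsize * 1/p * ysize for the interval [x_start, x_start + x_size]."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    xs = [x_start + x_size * k / (steps - 1) for k in range(steps)]
    responses = [[model(x) for model in models] for x in xs]
    # Assumption: envelope volume p = trapezoidal area between max and min response.
    widths = [max(r) - min(r) for r in responses]
    dx = x_size / (steps - 1)
    p = sum((widths[k] + widths[k + 1]) / 2.0 * dx for k in range(steps - 1))
    # Assumption: ysize = range of the mean ensemble response over the interval.
    means = [sum(r) / len(r) for r in responses]
    y_size = max(means) - min(means)
    return x_size * y_size / max(p, eps)


# Hypothetical ensemble: the two models stay close on [0, 1] and diverge beyond 1.
toy_ensemble = [
    lambda x: x,
    lambda x: x + 0.1 + max(x - 1.0, 0.0) ** 2,
]

agreeing = plot_fitness(toy_ensemble, x_start=0.0, x_size=1.0)
diverging = plot_fitness(toy_ensemble, x_start=1.0, x_size=1.0)
```

With these toy models the interval where the ensemble agrees scores higher than the one where it diverges, which matches the intent of rewarding a narrow envelope.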

28

Niching GA also locates local optima

• Three subpopulations (niches) of individuals survived
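The slides do not say which niching method is used. As one common technique, fitness sharing divides each individual's fitness by a count of its close neighbours, so crowded peaks become less attractive and several subpopulations can survive. The sketch below illustrates that technique only; `shared_fitness` and `sigma_share` are names chosen here, not taken from the source.

```python
import math
from typing import Callable, List, Sequence, Tuple

Individual = Tuple[float, float]  # (xstart, xsize), as in the 2D example


def shared_fitness(
    population: Sequence[Individual],
    raw_fitness: Callable[[Individual], float],
    sigma_share: float,
) -> List[float]:
    """Fitness sharing: raw fitness divided by the niche count of each individual."""
    shared = []
    for a in population:
        niche_count = 0.0
        for b in population:
            distance = math.dist(a, b)
            if distance < sigma_share:
                # Closer neighbours contribute more; an individual counts itself as 1.
                niche_count += 1.0 - distance / sigma_share
        shared.append(raw_fitness(a) / niche_count)
    return shared


# Two identical individuals share one niche; the third sits alone in another.
population = [(0.0, 1.0), (0.0, 1.0), (5.0, 1.0)]
scores = shared_fitness(population, raw_fitness=lambda ind: 6.0, sigma_share=1.0)
```

Here the isolated individual keeps its full fitness while the two crowded ones split theirs, which is the pressure that lets a GA hold on to local optima as well as the global one.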

29

Automated retrieval of plots showing interesting behavior

A genetic algorithm with a special fitness function is used to adjust all other inputs (dimensions).

[Figure: Genetic Algorithm, best-so-far individual found (generations 0–17)]

30

Housing data – interesting plot retrieved

[Figure: Before (low fitness) and After (high fitness)]

31

Conclusion

• Credible regression

• Credible classification

• Automated retrieval

32

Future work I

• Automated knowledge extraction from data

[Figure: FAKE INTERFACE for automated data mining, with the components INPUT DATA, AUTOMATED DATA PREPROCESSING, GAME ENGINE, KNOWLEDGE EXTRACTION and INFORMATION VISUALIZATION, and KNOWLEDGE]

33

Future work II

• FAKE GAME framework

[Figure: FAKE GAME framework diagram with the following labels]
• FAKE INTERFACE: PROBLEM IDENTIFICATION, DATA COLLECTION, DATA INSPECTION, DATA WAREHOUSING, DATA INTEGRATION, DATA CLEANING, INPUT DATA
• AUTOMATED DATA PREPROCESSING and GAME ENGINE, producing several MODEL blocks
• Tasks: Classification, Prediction, Identification and Regression
• Outputs: Math equations, Feature ranking, Interesting behaviour, Credibility estimation, Classes boundaries, relationship of variables

34

Future work III

• Just released as open source project
  – Automated data preprocessing
  – Automated model building, validation
  – Optimization methods
  – Visualization

See and join us: http://www.sourceforge.net/projects/fakegame