VISUALIZATION TECHNIQUES UTILIZING THE SENSITIVITY ANALYSIS OF MODELS
Ivo Kondapaneni, Pavel Kordík, Pavel Slavík
Department of Computer Science and Engineering, Faculty of Electrical Engineering,
Czech Technical University in Prague, Czech Republic
Presenting author: Pavel Kordík ([email protected])
2
Overview
• Motivation
• Data mining models
• Visualization based on sensitivity analysis
• Regression problems
• Classification problems
• Definition of interesting plots
• Genetic search for 2D and 3D plots
3
Motivation
• Data mining – extracting new, potentially useful information from data
• DM models are automatically generated
• Are models always credible?
• Are models comprehensible?
• How to extract information from models? Visualization
4
Data mining models
• Often black-box models generated from data
• E.g. neural networks
• What is inside?
[Figure: data mining black-box model mapping input variables to output variable(s)]
5
Inductive model
• Estimates output from inputs
• Generated automatically
• Evolved by niching GA
• Grows from minimal form
• Contains hybrid units
• Several training methods
• Ensemble of models
[Figure: inductive model network. Input variables feed the first, second, and third layers and the output layer (output variable), joined by interlayer connections. Annotations: "3 inputs max", "4 inputs max". Unit labels: P, C, G, L.]
6
Example: Housing data
Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
Output variable: MEDV
• CRIM: per capita crime rate by town
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centres
• MEDV: median value of owner-occupied homes in $1000's
7
Housing data – records
Input variables (CRIM to LSTA) and output variable (MEDV):

CRIM     ZN  INDUS  NOX   RM     AGE   DIS     RAD  TAX  PTRATIO  B      LSTA  MEDV
0.00632  18  2.31   53.8  6.575  65.2  4.09    1    296  15.3     396.9  4.98  24
0.02731  0   7.07   46.9  6.421  78.9  4.9671  2    242  17.8     396.9  9.14  21.6
…
8
Housing data – inductive model
Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
Output variable: MEDV
Niching genetic algorithm evolves units in first layer
• sigmoid: MEDV=1/(1-exp(-5.724*CRIM+1.126)), Error: 0.13
• sigmoid: MEDV=1/(1-exp(-5.861*AGE+2.111)), Error: 0.21
9
Housing data – inductive model
Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
Output variable: MEDV
Niching genetic algorithm evolves units in second layer
• First layer: sigmoid (Error: 0.13), sigmoid (Error: 0.21), sigmoid (Error: 0.26), linear (Error: 0.24)
• Second layer: polynomial (Error: 0.10)
MEDV=0.747*(1/(1-exp(-5.724*CRIM+1.126)))+0.582*(1/(1-exp(-5.861*AGE+2.111)))^2+0.016
10
Housing data – inductive model
Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
Output variable: MEDV
[Figure: complete network built from the units sigmoid, sigmoid, sigmoid, linear, polynomial, polynomial, linear, exponential. Error: 0.08]
Constructed model has very low validation error!
11
Housing data – inductive model
Input variables: CRIM ZN INDUS NOX RM AGE DIS RAD TAX PTRATIO B LSTA
Output variable: MEDV
[Figure: the same network with abbreviated unit labels, S S S L, then P, then P L, then E. Error: 0.08]
MEDV=(exp((0.038* 3.451*(1/(1-exp(-5.724*CRIM+ 1.126)))*(1/(1-exp(2.413*DIS-2.581)))*(1/(1-exp(2.413*DIS-2.581)))+0.429*(1/(1-exp(-5.861*AGE+ 2.111)))+0.024*(1/(1-exp(2.413*DIS-2.581)))+0.036+ 0.038*0.350*(1/(1-exp(-3.613*RAD-0.088)))+ 0.999*( 0.747*(1/(1-exp(-5.724*CRIM+ 1.126)))+0.582*(1/(1-exp(-5.861*AGE+ 2.111)))*(1/(1-exp(-5.861*AGE+ 2.111)))+0.016)-0.046*(1/(1-exp(-5.724*CRIM+ 1.126)))-0.079+ 0.002*INDUS-0.001*LSTA+ 0.150)*0.860)*13.072)-14.874
The math equation is no longer comprehensible – we have to treat it as a black-box model!
12
Visualization based on sensitivity analysis
[Figure: sensitivity analysis of a GAME (ModGMDH) model with inputs x1, x2, x3. In each panel one input is moving from its min to its max while the other inputs are held constant, and the model output (yk, ym) is plotted against the moving input.]
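The procedure on this slide (move one input between its minimum and maximum, hold the other inputs constant, record the model output) can be sketched as follows. This is a minimal illustration only: `sensitivity_sweep` and `toy_model` are hypothetical names, and the toy model stands in for a trained GAME model.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def sensitivity_sweep(model: Model, base_point: Sequence[float], index: int,
                      lo: float, hi: float, steps: int = 50) -> List[Tuple[float, float]]:
    """Move input `index` from lo to hi; all other inputs stay at base_point."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    curve = []
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        point = list(base_point)   # copy, so the constant inputs are untouched
        point[index] = x           # only the moving input changes
        curve.append((x, model(point)))
    return curve


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model with three inputs."""
    return 2.0 * x[0] + x[1] * x[2]


# Sweep x1 over [0, 1] while x2 = 1.0 and x3 = 3.0 stay constant.
curve = sensitivity_sweep(toy_model, [0.5, 1.0, 3.0], index=0, lo=0.0, hi=1.0, steps=5)
for x, y in curve:
    print(x, y)
```

Plotting the `(x, y)` pairs of `curve` gives one sensitivity plot; repeating the sweep with `index=1` or `index=2` gives the plots for the other inputs.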
13
Sensitivity analysis of inductive model of MEDV
[Figure: two panels, House no. 189 and House no. 164]
What will happen to the value of a house when crime in the area decreases or increases?
Credible output?
14
Ensemble of inductive models
• Random initialization
• Developing on the same training set
• Training affects only well-defined areas of the input space
• Each model: unique architecture, similar complexity, similar transfer functions
• Similar behavior in well-defined areas
• Different behavior in under-defined areas
[Figure: ensemble of GAME (ModGMDH) models with inputs x1, x2, x3 and outputs yk-1, yk, yk+1; the responses are plotted together as input x2 moves from min to max.]
15
Credibility of models: Artificial data set
Credibility: the criterion is the dispersion of the models' responses.
Advantages:
• No need of the training data set,
• Modeling method success considered,
• Inputs importance considered.
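One possible reading of this criterion is sketched below: query every model of the ensemble at the same point and use the spread of the answers as a credibility signal. The population standard deviation is an assumption here (the slide says only "dispersion"), and the three toy models are hypothetical; they agree for inputs up to 1 and diverge beyond it, mimicking an under-defined area.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def ensemble_response(models: List[Model], point: Sequence[float]) -> Tuple[float, float]:
    """Return (mean response, dispersion) of the ensemble at one input point."""
    outputs = [m(point) for m in models]
    return mean(outputs), pstdev(outputs)


# Hypothetical ensemble: identical on x <= 1, diverging for x > 1.
def model_a(p: Sequence[float]) -> float:
    return p[0]


def model_b(p: Sequence[float]) -> float:
    return p[0] + max(0.0, p[0] - 1.0) ** 2


def model_c(p: Sequence[float]) -> float:
    return p[0] - max(0.0, p[0] - 1.0) ** 2


models = [model_a, model_b, model_c]
centre_inside, spread_inside = ensemble_response(models, [0.5])
centre_outside, spread_outside = ensemble_response(models, [3.0])
print(spread_inside, spread_outside)
```

A small spread suggests the response is credible at that point; a large spread suggests the models were not constrained by data there.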
16
Example: Models of hot water consumption
17
Cold water consumption, increasing humidity
18
Models on Housing data
• Single model
• Ensemble of 10 models
[Figure panels: Before, After]
19
Classification problems
• Data: Setosa class, Virginica class, Versicolor class
[Figure: scatter plot of the data; axes: petal length, petal width]
• Blue GAME model (Iris Setosa class)
[Figure: model response over the same axes; output = 1 on one side of the decision boundary, output = 0 on the other]
20
Credibility of classifiers
[Figure: grid of plots. Rows: Iris Setosa, Iris Virginica, Iris Versicolor. Columns: GAME model 1, GAME model 2, GAME model 3, GAME models (1*2*3). In each row: model 1 * model 2 * model 3 = combined response.]
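The label "GAME models (1*2*3)" suggests that the outputs of the three models for one class are multiplied point by point. A minimal sketch under that assumption follows; the three threshold classifiers and their cut-off values are hypothetical stand-ins for trained GAME models of one class.

```python
from math import prod

# Hypothetical classifiers for one class; each returns 1.0 inside its
# decision boundary and 0.0 outside. The thresholds are illustrative only.
def model_1(length: float, width: float) -> float:
    return 1.0 if length < 2.5 else 0.0


def model_2(length: float, width: float) -> float:
    return 1.0 if width < 0.8 else 0.0


def model_3(length: float, width: float) -> float:
    return 1.0 if length + width < 3.5 else 0.0


MODELS = [model_1, model_2, model_3]


def combined_response(length: float, width: float) -> float:
    """Product of the model outputs: 1.0 only where every model says 1.0."""
    return prod(m(length, width) for m in MODELS)


print(combined_response(1.4, 0.2))  # all three models agree on the class
print(combined_response(2.0, 1.2))  # model_2 disagrees, so the product drops to zero
```

With the product, a region keeps a high output only where all models agree, so areas where a single model behaves arbitrarily are suppressed.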
21
Overlapping models vs. ensemble
22
Random behavior filtered out
Before After
23
Problem – how to find information in n-dim space?
• Multidimensional space of input variables
• What are we looking for?
– Interesting relationship of IO variables
– Regions of high sensitivity
– Credible models (compromise response)
• Can we automate the search?
24
When is a plot interesting for us?
[Figure: plot over input xi with the interval marked by xistart and xisize]
25
Definition of interesting plot
• Minimal volume of the envelope – p → min
• Maximal sensitivity of the output to the change of xi input variable – ysize → max
• Maximal size of the area – xisize → max
26
Multiobjective optimization
• Interestingness:
• Unknown variables:
– x1, x2, ..., xi-1, xi+1, …, xn, xistart, xisize
• We will use “Niching” genetic algorithm
Chromosome: x1 x2 ... xi-1 xi+1 … xn xistart xisize
27
Niching GA on simple data
[Figure: GAME ensemble over the training data; fitness plotted over xstart and xsize]
Chromosome: xstart xsize
fitness = xsize * 1/p * ysize
Very simple problem
Search space is 2D, can be visualized
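The fitness formula can be made concrete as in the sketch below. How p and ysize are measured is an assumption here: p is taken as the area between the highest and lowest ensemble response over the interval (trapezoid rule), and ysize as the range of the mean response. The small `eps` guard against division by zero and the toy ensemble are also not from the slides.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def plot_fitness(models: List[Model], base_point: Sequence[float], index: int,
                 xstart: float, xsize: float, steps: int = 20,
                 eps: float = 1e-9) -> float:
    """fitness = xsize * 1/p * ysize for the interval [xstart, xstart + xsize]."""
    widths, means = [], []
    for k in range(steps):
        point = list(base_point)
        point[index] = xstart + xsize * k / (steps - 1)
        outputs = [m(point) for m in models]
        widths.append(max(outputs) - min(outputs))   # envelope height at this x
        means.append(sum(outputs) / len(outputs))    # compromise response
    dx = xsize / (steps - 1)
    # Assumed definition of p: envelope area by the trapezoid rule.
    p = sum((widths[k] + widths[k + 1]) / 2.0 * dx for k in range(steps - 1))
    # Assumed definition of ysize: range of the mean response.
    ysize = max(means) - min(means)
    return xsize * (1.0 / (p + eps)) * ysize


# Hypothetical ensemble: the models agree on [0, 1] and diverge beyond it.
models = [
    lambda p: p[0],
    lambda p: p[0] + max(0.0, p[0] - 1.0) ** 2,
    lambda p: p[0] - max(0.0, p[0] - 1.0) ** 2,
]
fit_agree = plot_fitness(models, [0.0], index=0, xstart=0.0, xsize=1.0)
fit_disagree = plot_fitness(models, [0.0], index=0, xstart=2.0, xsize=1.0)
print(fit_agree > fit_disagree)
```

Both intervals have the same width and the same output change, so the difference in fitness comes only from the envelope: the interval where the models agree scores far higher.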
28
Niching GA also locates local optima
• Three subpopulations (niches) of individuals survived
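The slides do not say which niching scheme is used. The sketch below swaps in deterministic crowding, one common niching technique, to show how several subpopulations can survive at once. The three-peak fitness landscape, the gene range [0, 10], and all parameter values are hypothetical; the two genes merely stand in for a chromosome such as (xstart, xsize).

```python
import math
import random

# Hypothetical landscape: one global and two local optima (not from the slides).
PEAKS = [((2.0, 2.0), 1.0), ((5.0, 8.0), 0.8), ((8.0, 3.0), 0.6)]


def fitness(ind):
    return sum(height * math.exp(-((ind[0] - cx) ** 2 + (ind[1] - cy) ** 2) / 2.0)
               for (cx, cy), height in PEAKS)


def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mutate(ind, rng, sigma=0.3):
    # Gaussian mutation, clamped to the gene range [0, 10].
    return [min(10.0, max(0.0, gene + rng.gauss(0.0, sigma))) for gene in ind]


def make_children(p1, p2, rng, crossover_rate=0.5):
    if rng.random() < crossover_rate:
        a = rng.random()  # arithmetic (blend) crossover
        c1 = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
        c2 = [(1 - a) * u + a * v for u, v in zip(p1, p2)]
    else:
        c1, c2 = list(p1), list(p2)
    return mutate(c1, rng), mutate(c2, rng)


def deterministic_crowding(pop_size=80, generations=200, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)] for _ in range(pop_size)]
    for _ in range(generations):
        order = list(range(pop_size))
        rng.shuffle(order)
        for i, j in zip(order[0::2], order[1::2]):
            c1, c2 = make_children(pop[i], pop[j], rng)
            # Pair each child with the more similar parent.
            if (distance(pop[i], c1) + distance(pop[j], c2)
                    > distance(pop[i], c2) + distance(pop[j], c1)):
                c1, c2 = c2, c1
            # A child replaces only its paired parent, and only if it is fitter.
            if fitness(c1) > fitness(pop[i]):
                pop[i] = c1
            if fitness(c2) > fitness(pop[j]):
                pop[j] = c2
    return pop


population = deterministic_crowding()
niches_found = [any(distance(ind, centre) < 1.0 for ind in population)
                for centre, _ in PEAKS]
print(niches_found)
```

Because a child competes only with the parent it resembles, individuals sitting on a lower peak are not displaced by fitter individuals from a different peak, which is what lets local optima persist alongside the global one.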
29
Automated retrieval of plots showing interesting behavior
Genetic Algorithm
A genetic algorithm with a special fitness function is used to adjust all other inputs (dimensions)
Best-so-far individual found (generation 0 – 17)
30
Housing data – interesting plot retrieved
[Figure panels: Before (low fitness), After (high fitness)]
31
Conclusion
• Credible regression
• Credible classification
• Automated retrieval
32
Future work I
• Automated knowledge extraction from data
[Figure: FAKE INTERFACE built around the GAME ENGINE. Blocks: INPUT DATA, AUTOMATED DATA PREPROCESSING, AUTOMATED DATA MINING, KNOWLEDGE EXTRACTION and INFORMATION VISUALIZATION, KNOWLEDGE.]
33
Future work II
• FAKE GAME framework
[Figure: FAKE GAME framework. Stages: PROBLEM IDENTIFICATION, DATA COLLECTION, DATA INSPECTION, DATA CLEANING, DATA INTEGRATION, DATA WAREHOUSING, INPUT DATA, AUTOMATED DATA PREPROCESSING, GAME ENGINE (several MODELs), FAKE INTERFACE. Outputs: math equations; feature ranking; interesting behaviour; credibility estimation; classes boundaries, relationship of variables; classification, prediction, identification and regression.]
34
Future work III
• Just released as open source project
– Automated data preprocessing
– Automated model building, validation
– Optimization methods
– Visualization
see and join us: http://www.sourceforge.net/projects/fakegame