
1

IV. MODELS FROM DATA

Data mining

2

Data mining

MODELLING METHODS – Data mining

data

3

Outline

A) THEORETICAL BACKGROUND

1. Knowledge discovery in data bases (KDD)

2. Data mining

• Data

• Patterns

• Data mining algorithms

B) PRACTICAL IMPLEMENTATIONS

3. Applications:

• Equations

• Decision trees

• Rules

2

4

Knowledge discovery in data bases (KDD)

What is KDD?

Frawley et al. (1991): “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”

How to find patterns in data?

Data mining (DM) – the central step in the KDD process, concerned with applying computational techniques to actually find patterns in the data (15-25% of the effort of the overall KDD process).

- step 1: preparing data for DM (data preprocessing)

- step 2: applying data mining algorithms (DM itself)

- step 3: evaluating the discovered patterns (results of DM)

5

Knowledge discovery in data bases (KDD)

When can patterns be treated as knowledge?

Frawley et al. (1991): “A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge.”

Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).

Condition 2: The patterns should potentially lead to some useful actions (according to user defined utility criteria).

6

Knowledge discovery in data bases (KDD)

What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, …)?

ES deal with complex, unpredictable natural systems (e.g., arable, forest and water ecosystems) in order to answer complex questions.

The amount of collected environmental data is increasing exponentially.

KDD was purposely designed to cope with such complex questions about complex systems, like:

- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, …)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seed bank dynamics, …)

3

7

Data mining (DM)

What is data mining?
Data mining is the process of automatically searching large volumes of data for patterns using algorithms.

Data mining – machine learning
Data mining is the application of machine learning techniques to data analysis problems.

The most relevant notions of data mining:

1. Data

2. Patterns

3. Data mining algorithms

8

Data mining (DM) - data

1. What is data?
According to Fayyad et al. (1996): “Data is a set of facts, e.g., cases in a database.”

Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes, features in ML)

which is then used as input to a data mining algorithm.

Objects (rows) × properties of objects (columns):

Distance (m)  Wind direction (°)  Wind speed (m/s)  Out-crossing rate (%)
10            123                 3                 8
12             88                 4                 7
14            121                 6                 3
18            147                 2                 4
20             93                 1                 5
22            115                 3                 1
…             …                   …                 …
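To make the flat-table idea concrete, here is a minimal sketch (assuming Python with pandas, which is not part of the lecture material) that builds the example table above, with one row per object and one column per property:

# Minimal sketch, assuming Python with pandas: the flat table above,
# rows = objects (examples), columns = properties (attributes).
import pandas as pd

data = pd.DataFrame({
    "distance_m":           [10, 12, 14, 18, 20, 22],
    "wind_direction_deg":   [123, 88, 121, 147, 93, 115],
    "wind_speed_ms":        [3, 4, 6, 2, 1, 3],
    "outcrossing_rate_pct": [8, 7, 3, 4, 5, 1],
})

print(data)             # the table that a data mining algorithm takes as input
print(data.describe())  # simple per-attribute summary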

9

Data mining (DM) - pattern

2. What is a pattern?

A pattern is defined as: ”A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset” (Frawley et al. 1991, Fayyad et al. 1996).

Classes of patterns considered in DM (depend on the data mining task at hand):

1. equations
2. decision trees
3. association, classification, and regression rules

4

10

Data mining (DM) - pattern

1. Equations
To predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables.

• Linear equations involving:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyper-planes in multidimensional spaces

• Nonlinear equations involving:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hyper-surfaces in multidimensional spaces

11

Data mining (DM) - pattern

2. Decision trees
To predict the value of one or several target (dependent) variables (the class) from the values of other independent variables (attributes) with a decision tree.

A decision tree is a hierarchical structure where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.

12

Data mining (DM) - pattern

A decision tree is called:

• A classification tree: the class value in a leaf is discrete (a finite set of nominal values), e.g., (yes, no), (spec. A, spec. B, …)

• A regression tree: the class value in a leaf is a constant (from an infinite set of values), e.g., 120, 220, 312, …

• A model tree: a leaf contains a linear model predicting the class value (a piece-wise linear function), e.g., out-crossing rate = 12.3 × distance - 0.123 × wind speed + 0.00123 × wind direction

5

13

Data mining (DM) - pattern

3. Rules
To perform association analysis between attributes, with the discovered associations expressed as rules.

Rules denote patterns of the form:
“IF conjunction of conditions THEN conclusion.”

• For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D)

• For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (an infinite set of values): e.g., 120, 220, 312, …

14

Data mining (DM) - algorithm

3. What is a data mining algorithm?
An algorithm in general:
- a procedure (a finite set of well-defined instructions) for accomplishing some task, which terminates in a defined end-state.

Data mining algorithm:- a computational process defined by a Turing machine (Gurevich et al. 2000) for finding patterns in data

15

Data mining (DM) - algorithm

What kinds of algorithms do we use for discovering patterns?

It depends on the goals:

1. Equations = Linear and multiple regressions

2. Decision trees = Top-down induction of decision trees

3. Rules = Rule induction

6

16

Data mining (DM) - algorithm

1. Linear and multiple regression

• Bivariate linear regression: the predicted variable (the class C in ML, which may be continuous or discrete) is expressed as a linear function of one attribute (A):

C = α + β × A

• Multiple regression: the predicted variable (the class C) is expressed as a linear function of a multi-dimensional attribute vector (Ai):

C = Σ i=1..n (βi × Ai)
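As an added illustration (not part of the original slides), the sketch below fits both forms with scikit-learn; scikit-learn and the synthetic data are assumptions made here purely for demonstration:

# Minimal sketch, assuming scikit-learn: bivariate regression C = alpha + beta*A
# and multiple regression C = sum_i beta_i * A_i.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
A = rng.uniform(10, 50, size=(100, 3))                               # attributes A1..A3
C = 5.0 + A @ np.array([0.4, -0.2, 0.1]) + rng.normal(0, 0.5, 100)   # target variable

bivariate = LinearRegression().fit(A[:, :1], C)   # C = alpha + beta * A1
multiple  = LinearRegression().fit(A, C)          # C = alpha + sum_i beta_i * A_i

print(bivariate.intercept_, bivariate.coef_)      # alpha, beta
print(multiple.intercept_, multiple.coef_)        # alpha, beta_1 .. beta_n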

17

Data mining (DM) - algorithm

2. Top-down induction of decision trees
A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).

Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.
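The following compact sketch (an illustration added here, not the actual C4.5/J48 implementation) shows the TDIDT recursion for nominal attributes, selecting the split attribute by information gain; real systems add gain ratio, numeric splits and pruning:

# Compact TDIDT sketch for nominal attributes (illustration only).
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def tdidt(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:       # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    def gain(a):                                      # information gain of attribute a
        rem = 0.0
        for v in {ex[a] for ex, _ in examples}:
            subset = [lab for ex, lab in examples if ex[a] == v]
            rem += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - rem
    best = max(attributes, key=gain)                  # attribute selected as (sub)tree root
    tree = {best: {}}
    for v in {ex[best] for ex, _ in examples}:        # split on each value of `best`
        subset = [(ex, lab) for ex, lab in examples if ex[best] == v]
        tree[best][v] = tdidt(subset, [a for a in attributes if a != best])
    return tree

weather = [({"outlook": "sunny", "windy": "no"}, "no"),
           ({"outlook": "sunny", "windy": "yes"}, "no"),
           ({"outlook": "overcast", "windy": "no"}, "yes"),
           ({"outlook": "rainy", "windy": "no"}, "yes"),
           ({"outlook": "rainy", "windy": "yes"}, "no")]
print(tdidt(weather, ["outlook", "windy"]))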

18

Data mining (DM) - algorithm

3. Rule induction

A rule that correctly classifies some examples is constructed first.

The positive examples covered by the rule are removed from the training set, and the process is repeated until no more examples remain.
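A hedged sketch of this "sequential covering" loop is given below; learn_one_rule and rule.covers are hypothetical helpers standing in for the greedy rule search that an actual inducer (e.g. CN2 or RIPPER) performs:

# Sketch of sequential covering: learn one rule, remove the positive examples
# it covers, repeat until no positive examples remain.
def sequential_covering(examples, target_class, learn_one_rule):
    rules, remaining = [], list(examples)
    while any(label == target_class for _, label in remaining):
        rule = learn_one_rule(remaining, target_class)   # hypothetical greedy search
        if rule is None:                                 # no further useful rule found
            break
        rules.append(rule)
        # remove the positive examples covered by the new rule from the training set
        remaining = [(x, y) for x, y in remaining
                     if not (rule.covers(x) and y == target_class)]
    return rules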

7

19

Data mining (DM) - Statistics

Data mining vs. statistics

Common to both approaches:

Reasoning FROM properties of a data sample TO properties of a population.

20

Data mining (DM) – Machine learning -Statistics

Statistics
Hypothesis testing, applicable when certain theoretical expectations about the data distribution, independence, random sampling, sample size, etc. are satisfied.

Main approach: best fitting all the available data.

Data mining
Automated construction of understandable patterns and structured models.

Main approach: structuring the data space; heuristic search for decision trees, rules, … covering (parts of) the data space.

21

DATA MINING – CASE STUDIES

8

22

Practical implementations

Each class of patterns described above is illustrated with examples of applications:

1. Equations:
• Algebraic equations
• Differential equations

2. Decision trees:
• Classification trees
• Regression trees
• Model trees

3. Predictive rules

23

Applications – Algebraic equations

Algebraic equations: CIPER

24

Applications – Algebraic equations

Dataset

Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values):
- Ledava River levels
- groundwater levels

Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h)
- precipitation (mm)
- ET (mm)
- number of days with white frost
- number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C and > 10 °C
- number of days with: minT > 0 °C, minT < -10 °C, minT < -4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C

Measured radial increments:
- 8 trees
- 69 years old

Management data (thinning; m³/y removed from the stand; Forestry Unit Lendava)

• Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA, etc.)
• Σ: 333 attributes; 35 years

Materials and methods

9

25

Applications – Algebraic equations

• 52 different combinations of attributes were tested.

Σ: 124 models

Experiment   RRSE      # eq. elements
jnj3_2m      0.7282    6
jnj3_3s      0.7599    6
jnj3_1s      0.7614    6
jnj3_4m      0.76455   3
jnj2_2       0.7685    5
jly_4xl      0.7686    6

26

Applications – Algebraic equations

Model jnj3_2m:

RadialGrowthIncrement =
  - 0.0511025526922 minL8-10
  - 0.0291795197998 maxL8-10
  - 0.017479975134 t-sun4-7
  + 0.0346935385853 t-sun8-10
  - 1.950606536e-05 t-sun8-10^2
  - 2.01014710248 d-wf-4-7
  + 9.35586778387e-05 minL4-7 t-sun4-7
  - 0.000179339939732 minL4-7 t-sun8-10
  + 6.45688563611e-05 minL8-10 t-sun8-10
  + 3.06551434164e-05 maxL8-10 t-sun4-7
  + 0.00282485442386 t-sun4-7 d-wf-4-7
  - 0.00141078675225 t-sun8-10 d-wf-4-7
  + 7.91071710872

Relative Root Squared Error (RRSE) = 0.728229824611

Correlation between average measured (r-aver8) and modelled increments (linear regression): R² = 0.8771

27

Applications – Algebraic equations

Model jnj3_2m

10

28

Applications – Algebraic equations

Algebraic equations: Lagramge

29

Data source:

• Federal Biological Research Centre (BBA), Braunschweig, Germany

(2000, 2001)

• Slovenian Agricultural Institute (KIS), Slovenia (2006)

Plants involved:

• BBA:
- transgenic maize (var. “Acrobat”, glufosinate-tolerant line) – donor
- non-transgenic maize field (var. “Anjou”) – receptor

• KIS:
- yellow-kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) – donor
- white-kernel variety of maize (variety Bc38W, simulating a non-GM variety) – receptor

Applications – Algebraic equations

30

[Field design 2000: transgenic donor field and non-transgenic recipient field (220 m × 100 m) with 2 m access paths; sampling points laid out on a coordinate grid (rows a–q, columns 1–6) at 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m from the donor; direction of drilling and north arrow indicated.]

Experiment design: 96 sampling points; 60 cobs – 2500 kernels; % of outcrossing of receptor (non-transgenic) maize from donor (GMO) maize.

Applications – Algebraic equations

11

31

Selected attributes:

• % of outcrossing

• Cardinal direction of the sampling point from the center of donor field

• Visual angle between the sampling point and the donor field

• Distance from the sampling point to the center of the donor field

• The shortest distance between the sampling point and the donor field

• % of appropriate wind direction (exposure time)

• Length of the wind ventilation route

• Wind velocity

Applications – Algebraic equations: outcrossing rate

32

Applications – Algebraic equations

33

Applications – Algebraic equations

12

34

Applications – Algebraic equations

35

Applications – Algebraic equations

36

Applications – Algebraic equations

13

37

Applications – Differential equations

Differential equations: Lagramge

38

Applications – Differential equations

39

Applications – Differential equations

14

40

Applications – Differential equations

[Diagram: phosphorus – water in-flow and out-flow, growth, respiration]

41

Applications – Differential equations

[Diagram: phytoplankton – growth, respiration, sedimentation, grazing]

42

Applications – Differential equations

[Diagram: zooplankton – feeds on phytoplankton; respiration, mortality]

15

43

Applications – Differential equations

44

Applications – Classification trees: habitat models

Classification trees: J48

45

Applications – Classification trees: habitat models

Observed locations of BBs (bear sightings)

[Map: point locations of bear sightings across the study area]

16

46

Applications – Classification trees: habitat models

The training dataset
• Positive examples:
- Locations of bear sightings (Hunting association; telemetry)
- Females only
- Using home-range (HR) areas instead of “raw” locations
- Narrower HR for optimal habitat, wider for maximal

• Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for

47

Applications – Classification trees: habitat models

Dataset: an excerpt of the flat data table, with each example described by a comma-separated list of attribute values, e.g.:

1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155. …

Target class: Present = 1, Absent = 0

48

Applications – Classification trees: habitat models

The model for optimal habitat

17

49

Applications – Classification trees: habitat models

The model for maximal habitat

50

Applications – Classification trees: habitat models

Map of optimal habitat (13% of Slovenian territory)

51

Applications – Classification trees: habitat models

Map of maximal habitat (39% of Slovenian territory)

18

52

Applications – Multi-target classification : outcrossing rate

Multi-target classification model (Clus):

Modelling pollen dispersal of genetically modified oilseed rape within the field

Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar. In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p. 152.

53

Experiment design:

Donors: MF transgenic oilseed rape “B004.oxy” (10 × 10 m)

Field for receptors: 90 × 90 m
- 3 × 3 m grid = 841 nodes
- 10 seeds of MS oilseed rape “FU58B004” planted per node
- field planted with MF oilseed rape “B004”

Measured: % MS outcrossing, % MF outcrossing

Applications – Multi-target classification : outcrossing rate

54

Selected attributes for modelling:

• Rate of outcrossing of MS and MF receptor plants [rate per 1000]

• Cardinal direction of the sampling point from the center of donor field [rad]

• Visual angle between the sampling point and the donor field [rad]

• Distance from the sampling point to the center of the donor field [m]

• The shortest distance between the sampling point and the donor field [m]

Applications – Multi-target classification : outcrossing rate

19

55

• Number of examples: 817

• Correlation coefficient: MF: 0.821, MS: 0.846

Applications – Multi-target classification : outcrossing rate

56

Applications – Multi-target regression model: soil resilience

57

The dataset: soil samples taken at 26 locations throughout Scotland.

The dataset: a flat table of data, 26 × 18 data entries.

Applications – Multi-target regression trees: soil resilience

20

58

The dataset:

• physical properties: soil texture – sand, silt, clay

• chemical properties: pH, C, N, SOM (soil organic matter)

• FAO soil classification: order and suborder

• physical resilience: resistance to compression (1/Cc), recovery from compression (Ce/Cc), overburden stress (eg), recovery from overburden stress after two-day cycles (eg2dc)

• biological resilience: heat, copper

Applications – Multi-target regression trees: soil resilience

59

Applications – Multi-target regression trees: soil resilience

Different scenarios and multi-target regression models have been constructed:

A model predicting the resistance and resilience of soils to copper perturbation.

60

Applications – Multi-target regression trees: soil resilience

The increasing importance of mapping soil functions to advise on land use and environmental management – the aim is to make a map of soil resilience for Scotland.

The models = filters for existing GIS datasets about physical and chemical properties of Scottish soils.

21

61

Applications – Multi-target regression trees: soil resilience

Macaulay Institute (Aberdeen): soils data – attributes and maps:

Approximately 13 000 soil profiles held in database

Descriptions of over 40 000 soil horizons

62

Application

63

Application

22

64

Application

USING RELATIONAL DECISION TREES TO MODEL FLEXIBLE CO-EXISTENCE MEASURES IN A MULTI-FIELD SETTING

Marko Debeljak1, Aneta Trajanov1, Daniela Stojanova1, Florence Leprince2, Sašo Džeroski1

1: Jožef Stefan Institute, Ljubljana, Slovenia
2: ARVALIS-Institut du végétal, Montardon, France

Marko Debeljak WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007

Introduction

Initial questions:

To what extent will GM maize grown on Greens genetically interfere with the maize on Yellows?

Will this interference be small enough to allow co-existence?

[Field map (scale 100 m, north arrow): DKC6041 YG, sown 26/04; Pr33A46, 1.4 ha, sown 21/04; Pr34N44 YG, 8 ha, sown 20/04; Pr33A46, 24 rows]

23

Marko Debeljak WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007

GIS

Relational data preprocessing

Marko Debeljak WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007

Relational data mining – results

Out-crossing rate: Threshold 0.9 %

24

Data

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

• 130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional, 1333 HT OSR observations)

• Each sample (observation) described with 65 attributes

• Original data collected by Centre for Ecology and Hydrology, Rothamsted Research and SCRI within Farm Scale Evaluation Program (2000, 2001, 2002)

Results scenario B: Multiple target regression tree

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

Target: Avg Crop Covers, Avg Weed Covers

Excluded attributes: /

Constraints: MinimalInstances = 64.0; MaxSize = 15

Predictive power: Corr. Coef.: 0.8513, 0.3746; RMSE: 16.504, 12.6038; RRMSE: 0.5248, 0.9301

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

syntactic constraint

Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)

25

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

Target: Avg Weed Covers (Time Series) Scenario 3.9

Constraints: Syntactic, MinInstances = 32

Predictive power: TSRMSExval: 4.98; TSRMSEtrain: 4.86; ICVtrain: 30.44

Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)

Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)

Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009

26

76

Applications – Rules

77

Applications – Rules

78

Applications – Rules

The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses 2004, 2005)

Only one field plan was used:

- maximising pollen and seed dispersal

27

79

Applications – Rules

Large-risk field pattern

80

Applications – Rules

Variables describing simulations

- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1, 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0

- TOTAL NUMBER OF VARIABLES: 1899

81

Applications – Rules

Run of experiment

• simulation started with an empty seed bank

• lasted 25 years,

• but only the last 4 years were kept in the files for data mining

• TOTAL NUMBER of simulations on the field pattern without the borders: 100 000

28

82

Applications – Rules

Non-aggregated data: CUBIST

Options:
- use 60% of data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Rule 1: [29499 cases, mean 0.0160539, range 0 to 0.9883608, est err 0.0160594]
  if SowingDateY0F14 > 258
  then PropGMinField14 = 0.0056903

Rule 2: [12726 cases, mean 0.0297134, range 7.62473e-07 to 0.9883608, est err 0.0388636]
  if SowingDateY0F14 > 258
     SowingDateY0F14 <= 277
  then PropGMinField14 = 0.0451188 - 0.0024 YearsSinceLastGMcrop_F14

83

Applications – Rules

Rule 4: [22830 cases, mean 0.0958018, range 0 to 0.9994726, est err 0.0884937]
  if SowingDateY0F14 <= 258
     YearsSinceLastGMcrop_F14 > 2
  then PropGMinField14 = 0.3722531 - 0.0013 SowingDateY0F14 - 0.00024 SowingDensityY0F14

84

Applications – Rules

Rule 10: [1911 cases, mean 0.5392408, est err 0.2179439]
  if TillageSoilBedPrepY0F14 in {0, 2}
     SowingDateY0F14 <= 258
     SowingDensityY0F14 <= 55
     YearsSinceLastGMcrop_F14 <= 2
  then PropGMinField14 = 2.8659117 - 0.3607 YearsSinceLastGMcrop_F14 - 0.0087 SowingDateY0F14 - 0.0032 SowingDensityY0F14 + 0.00073 SowingDateY-1F14 + 0.21 HarvestLossY-1F14 + 0.13 HarvestLossY-2F14 + 0.00017 SowingDensityY-2F14 - 0.09 HarvestLossY-1F8 - 0.07 EfficHerb2Y-3F27 - 7e-05 2cuttingY-3F23 + 7e-05 2cuttingY-2F12 - 0.00027 SowingDateY-1F35 + 0.08 HarvestLossY0F9 + 5e-05 1cuttingY-3F17 + 0.00012 SowingDensityY-2F18 - 6e-05 2cuttingY-2F24 - 6e-05 2cuttingY-2F15 + 6e-05 2cuttingY0F32 - 6e-05 2cuttingY-2F16 + 0.00022 SowingDateY0F11 - 4e-05 1cuttingY0F33 + 0.0001 SowingDensityY-1F7 - 0.05 EfficHerb2Y-3F16 + 0.06 HarvestLossY-1F22 - 5e-05 2cuttingY-2F25 + 0.04 EfficHerb2GMvolunY0F6 + 0.04 EfficHerb1GMvolunY-2F27 + 0.04 EfficHerb2GMvolunY0F5

29

85

Applications – Rules

Non-aggregated data: CUBIST
Options:
- use 60% of data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Target attribute `PropGMinField14'

Evaluation on training data (60000 cases):
  Average |error|           0.0633466
  Relative |error|          0.47
  Correlation coefficient   0.77

Evaluation on test data (40000 cases):
  Average |error|           0.0657057
  Relative |error|          0.49
  Correlation coefficient   0.75

86

Conclusions

What can data mining do for you?

Knowledge discovered by analyzing data with DM techniques can help:

• Understand the domain studied

• Make predictions/classifications

• Support decision processes in environmental management

87

Conclusions

What can data mining not do for you?

• The law of information conservation applies (garbage in, garbage out)

• The knowledge we are seeking to discover has to come from the combination of data and background knowledge

• If we have very little data of very low quality and no background knowledge, no form of data analysis will help

30

88

Conclusions

Side-effects?

• Discovering problems with the data during analysis

– missing values

– erroneous values

– inappropriately measured variables

• Identifying new opportunities

– new problems to be addressed

– recommendations on what data to collect and how

89

1. Data preprocessing

DATA MINING – Hands-on exercises

90

DATA MINING – data preprocessing

DATA FORMAT
• File extension .arff
• This is a plain text format; files should be edited with editors such as Notepad, TextPad or WordPad (that do not add extra formatting information)
• A file consists of:
  – Title: @relation NameOfDataset
  – List of attributes: @attribute AttName AttType
    • AttType can be ‘numeric’ or a nominal list of categorical values, e.g., ‘{red, green, blue}’
  – Data: @data (on a separate line), followed by the actual data in comma-separated value (.csv) format

31

91

DATA MINING – data preprocessing

DATA FORMAT

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
…
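Outside WEKA, such a file can also be inspected programmatically; the sketch below is an assumption-laden example using Python with scipy and pandas (neither is required for the exercises) and a local file named weather.arff containing the lines above:

# Minimal sketch, assuming scipy, pandas and a local file weather.arff.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.arff")    # numeric and nominal attributes
df = pd.DataFrame(data)
for col in df.select_dtypes(object):          # nominal values are read as bytes
    df[col] = df[col].str.decode("utf-8")     # decode them to plain strings

print(meta)       # attribute names and types from the @attribute lines
print(df.head())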

92

DATA MINING – data preprocessing

• Excel
• Attributes (variables) in columns and cases (examples) in rows

• Use decimal POINT and not decimal COMMA for numbers

• Save excel sheet as CSV file

93

DATA MINING – data preprocessing

• TextPad, Notepad

• Open CSV file

• Delete “ “ at the beginning of lines and save (just save, don’t change the format)

• Change all ; to ,

• Numbers must have decimal dot (.) and not decimal comma (,)

• Save file as CSV file (don't change format)
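The manual clean-up steps above can also be scripted; this is a minimal sketch assuming Python with pandas and a hypothetical export file data_export.csv (semicolon-separated, with decimal commas), which is rewritten as a comma-separated file with decimal points:

# Sketch, assuming pandas; data_export.csv is a hypothetical Excel export.
import pandas as pd

df = pd.read_csv("data_export.csv", sep=";", decimal=",")  # ';' fields, ',' decimals
df.to_csv("data_clean.csv", index=False)                   # ',' fields, '.' decimals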

32

94

2. Data mining

DATA MINING – Hands-on exercises

95

DATA MINING – data preprocessing

• WEKA

• Open the CSV file in WEKA

• Select algorithm and attributes

• Perform data mining

http://www.cs.waikato.ac.nz/ml/weka/index.html

96

How to select the “best” classification tree?

Performance of the classification tree:

A confusion matrix is a matrix showing actual and predicted classifications.

classification accuracy:

(correctly classified examples)/(all examples)
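As a small added illustration of these two notions (not from the original slides), the sketch below computes a confusion matrix and the classification accuracy from actual and predicted class labels in plain Python:

# Sketch: confusion matrix and classification accuracy for a two-class problem.
from collections import Counter

actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

confusion = Counter(zip(actual, predicted))    # (actual, predicted) -> count
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

for (a, p), n in sorted(confusion.items()):
    print("actual =", a, " predicted =", p, " count =", n)
print("classification accuracy:", accuracy)    # correctly classified / all examples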

[ROC curve: true positive rate vs. false positive rate]

33

97

How to select the “best” classification tree?

Classification trees: J48

Each leaf of a J48 tree is labelled with two numbers:
- the number of instances CORRECTLY classified into this leaf
- the number of instances INCORRECTLY classified into this leaf

The label can appear as:
- (13) – no incorrectly classified instances
- (3.5/0.5) – due to missing values (?), instances are split into fractions
- (0.0) – a split on a nominal attribute where one or more of the values do not occur in the subset of instances at the node in question

Interpretable size:

-pruned or unpruned

- minimal number of objects per leaf

98

How to select the “best” regression / model tree?

Performance of the regression / model tree:

99

The number of instances that REACH this leaf

The root mean squared error (RMSE) of the predictions from the leaf's linear model for the instances that reach the leaf, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). The percentages do not sum to 100%.

The smaller this value, the better.
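For reference, a hedged sketch of how such a percentage can be computed from the raw numbers (WEKA's model-tree output reports it directly; the values below are illustrative only):

# Sketch: leaf RMSE expressed as a percentage of the global standard deviation
# of the class attribute (illustrative values, assuming numpy).
import numpy as np

y_all       = np.array([3.1, 4.0, 2.2, 5.6, 4.8, 3.9, 6.1, 2.7])  # all training targets
y_leaf      = np.array([4.0, 4.8, 3.9])                           # targets reaching one leaf
y_leaf_pred = np.array([4.2, 4.5, 4.1])                           # leaf linear-model predictions

rmse_leaf  = np.sqrt(np.mean((y_leaf - y_leaf_pred) ** 2))
global_std = np.std(y_all)
print(100 * rmse_leaf / global_std, "% of the global standard deviation")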

How to select the “best” regression / model tree?

The interpretable size:

-pruned or unpruned

- minimal number of objects per leaf

34

100

Accuracy and error

Avoid overfitting the data by tree pruning.

Pruned trees are:
- less accurate (percentage of correct classifications) on training data
- more accurate when classifying unseen data

101

How to prune optimally?

Pre-pruning: stop growing the tree early, e.g., when a data split is not statistically significant or too few examples are in a split (minimum number of objects in a leaf).

Post-pruning: grow the full tree, then post-prune (confidence factor for classification trees).

102

Optimal accuracy

10-fold cross-validation is a standard classifier evaluation method used in machine learning:

- Break the data into 10 sets of size n/10.
- Train on 9 of the sets and test on the remaining 1.
- Repeat 10 times and take the mean accuracy.
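For comparison with WEKA's built-in cross-validation, here is a minimal scikit-learn sketch (scikit-learn and the iris dataset are assumptions used only for illustration):

# Sketch, assuming scikit-learn: 10-fold cross-validation of a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)  # 10 folds
print(scores.mean())   # mean accuracy over the 10 held-out folds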


Recommended