www.ciat.cgiar.org Eco-Efficient Agriculture to Reduce Poverty
Interpreting yield variation in commercial production of crops
DAPA
(Decision and Policy Analysis Program)
Farmers’ production experiences/ commercial
production of crops
Principles of operational
research
Modern information technology
What we do
Environmental characterization of the production system
Analysis of the Observations to optimize the system
kg/tree, Temperature, Age
Observations made by farmers according to their particular circumstances
Interpreting yield variation in commercial production of crops
Distribution of yield
The challenges ! Parametric, non-parametric?.... The reality!
Introduction
The challenges ! Parametric, non-parametric?... It depends on the distribution of the residuals
Introduction
PARAMETRIC
• Models rely on assumptions of:
• Normality
• Homogeneity of variance
• Independence
• Mostly based on linear relationships
NON-PARAMETRIC
• Models do not rely on these assumptions
• Linear/non-linear relationships
As Sharon quoted, “the wisdom of the internet”: “I have never come across a situation where a normal test is the right thing to do. When the sample size is small, even big departures from normality are not detected, and when your sample size is large, even the smallest deviation from normality will lead to a rejected null.” http://stackoverflow.com/questions/7781798/seeing-if-data-is-normally-distributed-in-r
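The sample-size effect described in the quote can be illustrated with a small sketch. It uses the Jarque-Bera statistic with its asymptotic chi-square (2 degrees of freedom) p-value, which is an assumption made here for illustration and not a test named in the slides. The mildly skewed gamma data, the seed, and the function names are all hypothetical.

```python
import math
import random


def jarque_bera_pvalue(sample):
    """Asymptotic Jarque-Bera p-value for H0: the data are normal."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exactly exp(-x / 2).
    return math.exp(-jb / 2.0)


def mildly_skewed_sample(n, seed):
    """Hypothetical data: gamma(shape=50) is close to normal but slightly skewed."""
    rng = random.Random(seed)
    return [rng.gammavariate(50.0, 1.0) for _ in range(n)]


# Same distribution, two sample sizes.
p_small = jarque_bera_pvalue(mildly_skewed_sample(20, seed=1))
p_large = jarque_bera_pvalue(mildly_skewed_sample(20000, seed=1))
print(f"n=20:    p = {p_small:.3f}")
print(f"n=20000: p = {p_large:.3g}")
```

With the large sample the slight skew is enough to reject normality, which is the behaviour the quote warns about. The asymptotic p-value is only approximate for small samples.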
The challenges ! Parametric, non-parametric?
Introduction
“The wisdom of” Nassim Nicholas Taleb, a “superhero of the mind”
(The Black Swan, Fooled by Randomness, Antifragile) - Nassim Nicholas Taleb
The statistical regress argument
“We need the data to tell us what the probability distribution is, and a probability distribution to tell us how much data we need”
The challenges ! Parametric, non-parametric?
Introduction
In terms of Big Data
• Approaching “N=All”
• The first is to collect and use a lot of data rather than settle for small amounts or samples, as researchers have done for well over a century
• We can learn from a large body of information things that we could not comprehend when we used only smaller amounts
• Sometimes informing is better than explaining: looking for patterns
Doctors in Canada save lives by knowing that something is likely to occur; this can be far more important than understanding exactly why. Big Data (Foreign Affairs magazine / McKinsey's High Tech)
What people think it is…
What it actually is… It was clear to Antoine de Saint-Exupéry (The Little Prince)
What people think it is…
What it actually is… Some of our findings !
The challenges ! Parametric, non-parametric? Not always normal distribution !
Introduction
Analytical approaches
Supervised models – Parametric and non-parametric
Independent variables / inputs / predictors: V1 … V60 and L1 … L5; dependent variable / output / response (known): Kg/plot

Obs       V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5  …  Kg/plot
Obs 1     0.1   18  3   312  0.3  …  89   0   1   0   1   0   …  2.39
Obs 2     0.2   15  4   526  0.1  …  52   1   0   0   0   1   …  30.35
Obs 3     0.6   14  1   489  0.2  …  64   0   1   1   1   1   …  42.25
Obs 4     0.05  19  2   523  0.5  …  13   0   0   0   0   1   …  52.50
Obs 5     0.4   13  3   214  0.6  …  57   1   1   1   1   1   …  70.52
Obs 6     0.8   12  4   265  0.4  …  24   1   1   0   1   0   …  82.25
Obs 7     0.2   15  1   236  0.8  …  26   0   0   1   0   0   …  89.28
Obs 8     0.1   17  3   541  0.1  …  35   0   1   1   1   0   …  125.0
Obs 9     0.6   16  2   845  0.3  …  51   0   0   1   1   0   …  142.8
Obs 10    0.1   18  1   126  0.1  …  43   1   1   0   0   1   …  150.0
…         …     …   …   …    …    …  …    …   …   …   …   …   …  …
Obs 3000  0.04  15  3   235  0.6  …  85   1   1   1   1   0   …  180
Unsupervised models

Obs       V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5
Obs 1     0.1   18  3   312  0.3  …  89   0   1   0   1   0
Obs 2     0.2   15  4   526  0.1  …  52   1   0   0   0   1
Obs 3     0.6   14  1   489  0.2  …  64   0   1   1   1   1
Obs 4     0.05  19  2   523  0.5  …  13   0   0   0   0   1
Obs 5     0.4   13  3   214  0.6  …  57   1   1   1   1   1
Obs 6     0.8   12  4   265  0.4  …  24   1   1   0   1   0
Obs 7     0.2   15  1   236  0.8  …  26   0   0   1   0   0
Obs 8     0.1   17  3   541  0.1  …  35   0   1   1   1   0
Obs 9     0.6   16  2   845  0.3  …  51   0   0   1   1   0
Obs 10    0.1   18  1   126  0.1  …  43   1   1   0   0   1
…         …     …   …   …    …    …  …    …   …   …   …   …
Obs 3000  0.04  15  3   235  0.6  …  85   1   1   1   1   0
Analytical approaches – Parametric and non parametrics
Self-organizing Maps (SOM)
Observations close to each other in the visualization space
[Figure: observations plotted in the visualization space; axes: Axis 1, Axis 2]
1st case study- Andean blackberry based on ANNs
[Figure: scatter plot of MLP-predicted yield versus real Andean blackberry yield, using only the validation dataset; x-axis: Real yield (kg/plant/week); y-axis: Predicted yield (kg/plant/week); R² = 0.892]
Supervised models - Non-linear regression. Coefficient of determination = 0.89
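For readers unfamiliar with the coefficient of determination, a minimal sketch of one common definition (1 minus the ratio of residual to total sum of squares) follows. The slides do not say which definition produced the reported 0.89, and the observed and predicted values below are hypothetical, not the blackberry data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation data (kg/plant/week), for illustration only.
observed = [0.2, 0.5, 0.9, 1.3, 1.6]
predicted = [0.3, 0.4, 1.0, 1.2, 1.5]

r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.3f}")
```

Computing the value on a held-out validation set, as the slide does, guards against an optimistic estimate from the training data.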
Histogram displaying yield data distribution of Andean blackberry (kg/plant/week)
[Figure: histogram; y-axis: Number of observations]
Sensitivity distribution of the model with respect to the inputs/predictors
[Figure: bar chart; y-axis: % Sensitivity (0 to 0.08); x-axis, in plotted order: EffDepth, TempAvg_1, Na_un_chical, Na_un_cusba, TempAvg_0, TempAvg_2, TempAvg_3, ExtDrain, PrecAcc_1, Trmm_3, Nar-Cal, Cal_riosu_zr, Srtm, Slope, PrecAcc_0, Trmm_2, Na_un_cusal, Trmm_0, PrecAcc_3, TempRang_0, TempRang_2, AB_Thorn_N, Na_un_lajac, PrecAcc_2, Trmm_1, IntDrain, TempRang_3, TempRang_1]
Jiménez, D., Cock, J., Satizábal, F., Barreto, M., Pérez-Uribe, A., Jarvis, A. and Van Damme, P., 2009. Computers and Electronics in Agriculture. 69 (2): 198–208
Sensitivity Matrix
Results - Andean blackberry
Effective soil depth
Temperature averages
Geographic location
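The slides do not state how the sensitivity of the model to each input was computed. One common perturbation approach, shown here purely as an assumption, nudges each input in turn, averages the absolute change in the model output, and normalizes the results to shares. The toy model and sample points are hypothetical stand-ins for the trained MLP and the farm data.

```python
def sensitivity_shares(model, samples, delta=1e-4):
    """Share of total output sensitivity attributed to each input."""
    n_inputs = len(samples[0])
    totals = [0.0] * n_inputs
    for x in samples:
        base = model(x)
        for i in range(n_inputs):
            bumped = list(x)
            bumped[i] += delta  # perturb one input, hold the others fixed
            totals[i] += abs(model(bumped) - base) / delta
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def toy_model(x):
    """Hypothetical model: input 0 matters most, input 2 not at all."""
    return 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


samples = [[0.1, 0.2, 0.3], [0.5, 0.1, 0.9], [0.7, 0.8, 0.2]]
shares = sensitivity_shares(toy_model, samples)
print([round(s, 3) for s in shares])
```

Ranking the shares gives the kind of ordering shown in the sensitivity chart, where a few predictors dominate.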
Results - Andean blackberry
(a) Kohonen map displaying the resultant 6 clusters and their labels according to yield values. (b) Component plane of Andean blackberry yield; the scale bar (right) indicates the range of productivity values in kg/plant/week. The upper side exhibits high values of yield, whereas the lower displays low values
Unsupervised model - Visualization – component planes - SOM
Andean blackberry yield Kohonen map – 6 clusters
(a) (b)
Results - Andean blackberry
Component plane of effective soil depth. The scale bar (right) indicates the range value in cm of soil depth: the upper side of the scale exhibits high values, whereas the lower displays low values
Effective soil depth
Unsupervised model - Visualization – component planes - SOM
Results - Andean blackberry
Component planes of the temperature averages. In all figures, the scale bar (right) indicates the range of temperature values in °C. The upper side exhibits high values, whereas the lower displays low values
Unsupervised model - Visualization – component planes - SOM
Results - Andean blackberry
Component planes of the specific geographic areas Nariño–La Union–Chical alto (left) and Nariño–La Union–Cusillo bajo (right). As these are categorical variables, the highest values indicate presence and the lowest absence
Visualization – component planes - SOM
Nariño - La Union – Chical Alto Nariño - La Union – Cusillo bajo
Drawbacks
• Crop management factors not included (only variety)
• Only non-parametric approaches (based on ANNs)
• Limited spatial variation (two locations, two departments)
Advantages
• Predictor-predictor and predictor-response dependencies through Kohonen’s Maps
• Combination of factors
• Non-linear approach
2nd case study- Lulo
Distribution of R² obtained with each model
Regression         R² (mean)   Confidence interval (95%)
Robust (linear)    0.65        0.63 - 0.66
MLP (non-linear)   0.69        0.67 - 0.70
Both models explained more than 60% of variability in Lulo production
Histogram displaying yield data distribution of lulo (g/plant/week)
[Figure: histogram; y-axis: Number of observations]
R² provided by each approach
[Figure: histograms of R² for MLP and robust regression; x-axis: R² (0.2877 to 0.8227); y-axis: Number of observations (0 to 26)]
Supervised modelling
Results - Lulo The Sensitivity Matrix
[Figure: bar chart; y-axis: % Sensitivity (0 to 0.18)]
Jiménez, D., Cock, J., Jarvis, A., Garcia, J., Satizábal, H.F., Van Damme, P., Pérez-Uribe, A., and Barreto, M., 2010. Interpretation of Commercial Production Information: A case study of lulo, an under-researched Andean fruit. Agricultural Systems. 104 (3): 258-270
Sensitivity distribution of the model with respect to the inputs/predictors
Effective soil depth
Temperature averages
Slope
(a) U-matrix displaying the distance among prototypes. The scale bar (right) indicates the values of distance. The upper side exhibits high distances, whilst the lower displays low distances; (b) Kohonen map displaying the 3 clusters obtained after using the K-means algorithm and the Davies–Bouldin index
The three most relevant variables were used to train a Kohonen map and identify clusters of Homogeneous Environmental Conditions (HECs)
Results - Lulo Unsupervised model - Clustering – component planes - SOM
U-Matrix Kohonen map – 3 clusters
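The clustering step (K-means, with the Davies–Bouldin index used to pick the number of clusters) can be sketched as follows. This is a simplified illustration under stated assumptions: it clusters raw 2-D points, whereas the study clustered the prototypes of a trained Kohonen map; the farthest-point initialization and the toy data are choices made here, not taken from the study.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def farthest_point_init(points, k):
    """Deterministic init (an assumption): spread the initial centers out."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=50):
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels, centers


def davies_bouldin(points, labels, centers):
    """Lower is better: compact clusters that lie far from each other."""
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(dist(p, centers[j]) for p in members) / len(members))
    worst = [max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k


# Hypothetical data: three tight, well-separated groups.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (0, 10), (1, 10), (0, 11), (1, 11)]

scores = {}
for k in range(2, 6):
    labels, centers = kmeans(points, k)
    scores[k] = davies_bouldin(points, labels, centers)
best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On this toy data the index is lowest at three clusters, mirroring how the index was used to settle on 3 HECs.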
Results - Lulo Clustering – component planes - SOM
A mixed model with the categorical variables of three HECs, location and farmer explained more than 80% of variation in lulo yield
Variance components of the mixed model estimations
Model including categorical variables of 3 HECs, location and farm
Parameter   Estimate (g/plant/week)   Standard error   % of total variance
HEC         1.85                      2.01             61.2%
Location    0.07                      0.20             2.5%
Site-Farm   0.57                      0.21             19.0%
Error       0.52                      0.04             17.3%
Total                                                  100.0%
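The "% of total variance" column is each variance component divided by the sum of all components. A short sketch using the rounded estimates from the table above shows the arithmetic; because the inputs are rounded, the shares match the reported percentages only to within a few tenths of a point.

```python
# Variance component estimates as reported on the slide (rounded values).
components = {"HEC": 1.85, "Location": 0.07, "Site-Farm": 0.57, "Error": 0.52}

total = sum(components.values())
shares = {name: 100.0 * value / total for name, value in components.items()}

# Everything except the residual error is attributed to the model terms.
explained = 100.0 - shares["Error"]

for name, share in shares.items():
    print(f"{name:10s} {share:5.1f}%")
print(f"Explained by HEC, location and farm: {explained:.1f}%")
```

The non-error share comes out above 80%, in line with the statement that the mixed model explained more than 80% of the variation in lulo yield.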
Variable ranges per HEC
HEC   Slope (degrees)   EffDepth (cm)   TempAvg_0 (°C)
1     5-14              21-40           15-16.5
2     8-15              32-69           15-18.9
3     13-24             40-67           15.8-19
HEC 3 yielded 41 g/plant/week more fruit than average
Results - Lulo
[Figure: bar chart; x-axis: HEC (1, 2, 3); y-axis: Lulo yield (g/plant/week), -30 to 50]
Effects of clusters of environmental conditions
Results - Lulo
Farm 7 and 9 in HEC 3. Farm 7 produced 68 g/plant/week less than average, whilst farm 9 produced 51 g/plant/week more than average
[Figure: bar chart; x-axis: farms grouped by HEC (HEC 1: farms 1, 2, 3, 4, 5, 8, 17; HEC 2: farms 5, 6, 8, 10, 11, 12, 13, 15, 16, 17, 19, 20; HEC 3: farms 7, 9, 14, 18, 19, 20, 21); y-axis: Lulo yield (g/plant/week), -80 to 60]
Effects of farms across clusters of environmental conditions
Drawbacks
• Crop management factors not included (only variety)
• Compared with the Andean blackberry study, even more limited spatial variation (locations within one department)
Advantages
• Iterative procedure (combination of parametric & non-parametric / linear & non-linear)
• Combination of factors
• This is the first formal study to document the yield gap between farmers under similar climatic conditions in Colombia; it provided the basis for the site-specific analytical approaches
• Successfully identified farms that have superior management practices for given environmental conditions
3rd case study- Plantain
PCA
FactoClass (climate clusters)
[Figure: Variables factor map (PCA); Dim 1 (44.64%) vs Dim 2 (27.62%); variables bio_1 to bio_19]
[Figure: factor map of observations; Dim 1 (43.43%) vs Dim 2 (29.83%); clusters 1 to 8]
3rd case study- Plantain
CATPCA (soil clusters)
3rd case study- Plantain
C4S5: climate cluster 4, soil cluster 5
3rd case study- Plantain
Generalized Linear Model (GLM)
The model - Dependencies between predictors and the response variable
$\log(Y) = \beta_0 + \beta_1 X + \varepsilon$
$\log(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
C4S5
$\log(\text{Yield}) = 1.22 + 0.0008\,(\text{planting density}) + \varepsilon$
C5S5
$\log(\text{Yield}) = 0.80 + 0.00101\,(\text{planting density}) + 0.324154\,(\text{MezcVar}) + \varepsilon$, where MezcVar indicates a mixture of varieties
Significance level: 5%
3rd case study- Plantain
Generalized Linear Model (GLM)
• Interpretation of the parameters
$\log(\text{Yield}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon$
$e^{\log(\text{Yield})} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (non-linear)
$\text{Yield} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (returning to the original unit, tons/ha)
$\text{Yield} = e^{\beta_0}\, e^{\beta_1 X_1}\, e^{\beta_2 X_2} \dots e^{\varepsilon}$ (dependencies between predictors and tons/ha)
With the model it is possible to calculate by what factor yield increases or decreases when a specific practice changes
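Because the model is linear on the log scale, changing a predictor by some amount multiplies yield by e raised to the coefficient times that change. A small sketch of this arithmetic follows, using the two plantain coefficients quoted in the slides (0.0008 for planting density in C4S5 and 0.324154 for mixed varieties in C5S5). The helper name is hypothetical, and these are point estimates only; the confidence bounds quoted in the slides come from the fitted models and are not reproduced here.

```python
import math


def percent_change(beta, delta_x):
    """Percent change in yield when a predictor changes by delta_x,
    holding the other predictors fixed, in a model for log(yield)."""
    return (math.exp(beta * delta_x) - 1.0) * 100.0


# C4S5: 100 more plants per hectare, coefficient 0.0008.
density_effect = percent_change(0.0008, 100)

# C5S5: mixed varieties present (indicator goes from 0 to 1), coefficient 0.324154.
mix_effect = percent_change(0.324154, 1)

print(f"+100 plants/ha: {density_effect:.1f}% change in yield")
print(f"Mixed varieties: {mix_effect:.1f}% change in yield")
```

The density point estimate (about 8%) sits inside the 3.2% to 14.2% range reported for C4S5, and the mixed-variety estimate is above the 10.46% lower bound reported for C5S5.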
3rd case study- Plantain
Generalized Linear Model (GLM)
• Interpretation of the parameters
C4S5 (planting density)
$\log(\text{Yield}) = 1.22 + 0.0008\,(\text{planting density}) + \varepsilon$
$\text{Yield} = e^{1.22}\, e^{0.0008\,(\text{planting density})}\, e^{\varepsilon}$
Planting density = 100: $e^{100 \times 0.0008}$
At a 90% confidence level, for every additional 100 trees/ha, annual yield in tons/ha can be expected to increase by 3.2% to 14.2%.
3rd case study- Plantain
Generalized Linear Model (GLM)
• Interpretation of the parameters
C5S5 (mixture of varieties)
$\log(\text{Yield}) = 0.80 + 0.00101\,(\text{planting density}) + 0.324154\,(\text{MezcVar}) + \varepsilon$
$\text{Yield} = e^{0.80}\, e^{0.00101\,(\text{planting density})}\, e^{0.324154\,(\text{MezcVar})}\, e^{\varepsilon}$
MezcVar = presence (1): $e^{1 \times 0.324154}$
At a 90% confidence level, planting mixed varieties can be expected to increase production by more than 10.46%.
4th case study- Avocado
Generalized Linear Model (GLM)
• Interpretation of the parameters
C4S5 (planting density)
$\text{Yield} = e^{-2.078}\, e^{0.0077\,(\text{planting density})}\, e^{0.2079\,(\text{planting pattern})}\, e^{\varepsilon}$
Planting density = 10: $e^{10 \times 0.0077}$
At a 90% confidence level, for every 10 trees/ha added to the planting density, annual yield in tons per hectare can be expected to increase by 2.3% to 13.2%.
4th case study- Avocado
Generalized Linear Model (GLM)
• Interpretation of the parameters
C2S4 (planting pattern)
$\text{Yield} = e^{3.6}\, e^{-0.006\,(\text{planting density})}\, e^{0.434\,(\text{variety})}\, e^{0.7946\,(\text{planting pattern})}\, e^{\varepsilon}$
Planting pattern = presence (1): $e^{1 \times 0.7946}$
At a 90% confidence level, a grower in this zone who plants in a triangular (tresbolillo) pattern instead of a square one can be expected to increase production by more than 30.21%.
Drawbacks
• Not enough crop management factors to apply a hierarchical approach such as mixed models
• Limited temporal variation
Advantages
• Iterative procedure (combination of parametric and semi-parametric)
• Crop management factors included (farmers can control them)
• Predictor-response dependencies through GLM
• Large spatial variation
• Soil information included
• Linear & non-linear approach
Thank you!
[Figure: factor map of observations labeled by ID; Factor 1: 3.8369 (48%) vs Factor 2: 2.518 (31.5%); variables: bio_4, bio_6, bio_7, bio_12, bio_13, bio_14, bio_15, cons_mths; clusters: cl1, cl2, cl3]
Parametric methods
• Ordinary Least Squares regression (OLS)
• Principal component analysis (PCA)
• Robust linear regressions
• Mixed Models
• Best Linear Unbiased Prediction (BLUP)
• FactoClass (factor analysis, Ward's method, K-means)
• Categorical Principal Components Analysis (CATPCA)
Semi or non-parametric methods
• Generalized linear model (GLM)
• Self Organizing Maps (SOM)
• Multilayer perceptron (MLP)
• Fuzzy logic
Analytical approaches – Data-driven
We adapt a range of methodologies to the analysis of real data … rather than adapting the data to some methodologies.