
www.ciat.cgiar.org Eco-Efficient Agriculture to Reduce Poverty

Interpreting yield variation in commercial production of crops

DAPA (Decision and Policy Analysis Program)

Farmers’ production experiences / commercial production of crops

Principles of operational research

Modern information technology

What we do

Environmental characterization of the production system

Analysis of the Observations to optimize the system

kg/tree, temperature, age

Observations made by farmers according to their particular circumstances


Distribution of yield

The challenges! Parametric or non-parametric? The reality!

Introduction

The challenges! Parametric or non-parametric? It depends on the distribution of the residuals

PARAMETRIC
• Models rely on assumptions of:
  • Normality
  • Homogeneity of variance
  • Independence
• Mostly based on linear relationships

NON-PARAMETRIC
• Models do not rely on these assumptions
• Linear / non-linear relationships

As Sharon quoted from “the wisdom of the internet”: “I have never come across a situation where a normal test is the right thing to do. When the sample size is small, even big departures from normality are not detected, and when your sample size is large, even the smallest deviation from normality will lead to a rejected null.” http://stackoverflow.com/questions/7781798/seeing-if-data-is-normally-distributed-in-r


The wisdom of Nassim Nicholas Taleb, a “superhero of the mind” (The Black Swan, Fooled by Randomness, Antifragile)

The statistical regress argument

“We need the data to tell us what the probability distribution is, and a probability distribution to tell us how much data we need”



In terms of Big Data

• Approaching “N=All”

• The first is to collect and use a lot of data rather than settle for small amounts or samples, as researchers have done for well over a century

• We can learn from a large body of information things that we could not comprehend when we used only smaller amounts

• Sometimes informing is better than explaining: looking for patterns

Doctors in Canada save lives by knowing that something is likely to occur; this can be far more important than understanding exactly why. Source: Big Data (Foreign Affairs magazine / McKinsey's High Tech)

What people think it is…

What it actually is… It was clear to Antoine de Saint-Exupéry (The Little Prince)

What people think it is…

What it actually is… Some of our findings!

The challenges! Parametric or non-parametric? The distribution is not always normal!

Analytical approaches

Supervised models – Parametric and non-parametric

Independent variables / inputs / predictors: V1 … V60, L1 … L5, …
Dependent variable / output / response (known): Kg/plot

          V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5  …  Kg/plot
Obs 1     0.1   18  3   312  0.3  …  89   0   1   0   1   0   …  2.39
Obs 2     0.2   15  4   526  0.1  …  52   1   0   0   0   1   …  30.35
Obs 3     0.6   14  1   489  0.2  …  64   0   1   1   1   1   …  42.25
Obs 4     0.05  19  2   523  0.5  …  13   0   0   0   0   1   …  52.50
Obs 5     0.4   13  3   214  0.6  …  57   1   1   1   1   1   …  70.52
Obs 6     0.8   12  4   265  0.4  …  24   1   1   0   1   0   …  82.25
Obs 7     0.2   15  1   236  0.8  …  26   0   0   1   0   0   …  89.28
Obs 8     0.1   17  3   541  0.1  …  35   0   1   1   1   0   …  125.0
Obs 9     0.6   16  2   845  0.3  …  51   0   0   1   1   0   …  142.8
Obs 10    0.1   18  1   126  0.1  …  43   1   1   0   0   1   …  150.0
…         …     …   …   …    …    …  …    …   …   …   …   …   …  …
Obs 3000  0.04  15  3   235  0.6  …  85   1   1   1   1   0   …  180
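The difference between the two data layouts can be sketched in a few lines of Python. This is only an illustration: the field names (`V1`, `V2`, `V3`, `kg_per_plot`) and both helper functions are assumptions made for the example, and only the first three predictors of the first four observations are used.

```python
# Hypothetical mini-dataset: first three predictors and the yield of Obs 1-4.
rows = [
    {"obs": "Obs 1", "V1": 0.1, "V2": 18, "V3": 3, "kg_per_plot": 2.39},
    {"obs": "Obs 2", "V1": 0.2, "V2": 15, "V3": 4, "kg_per_plot": 30.35},
    {"obs": "Obs 3", "V1": 0.6, "V2": 14, "V3": 1, "kg_per_plot": 42.25},
    {"obs": "Obs 4", "V1": 0.05, "V2": 19, "V3": 2, "kg_per_plot": 52.50},
]

PREDICTORS = ["V1", "V2", "V3"]   # independent variables / inputs
RESPONSE = "kg_per_plot"          # dependent variable / output


def split_supervised(table):
    """Supervised setting: predictors X together with the known response y."""
    X = [[row[name] for name in PREDICTORS] for row in table]
    y = [row[RESPONSE] for row in table]
    return X, y


def predictors_only(table):
    """Unsupervised setting: the same predictors, with no response column."""
    return [[row[name] for name in PREDICTORS] for row in table]


X, y = split_supervised(rows)
X_unsupervised = predictors_only(rows)
```

A supervised model would be fitted on `X` and `y` together, whereas an unsupervised model would only ever see `X_unsupervised`.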

Unsupervised models

          V1    V2  V3  V4   V5   …  V60  L1  L2  L3  L4  L5
Obs 1     0.1   18  3   312  0.3  …  89   0   1   0   1   0
Obs 2     0.2   15  4   526  0.1  …  52   1   0   0   0   1
Obs 3     0.6   14  1   489  0.2  …  64   0   1   1   1   1
Obs 4     0.05  19  2   523  0.5  …  13   0   0   0   0   1
Obs 5     0.4   13  3   214  0.6  …  57   1   1   1   1   1
Obs 6     0.8   12  4   265  0.4  …  24   1   1   0   1   0
Obs 7     0.2   15  1   236  0.8  …  26   0   0   1   0   0
Obs 8     0.1   17  3   541  0.1  …  35   0   1   1   1   0
Obs 9     0.6   16  2   845  0.3  …  51   0   0   1   1   0
Obs 10    0.1   18  1   126  0.1  …  43   1   1   0   0   1
…         …     …   …   …    …    …  …    …   …   …   …   …
Obs 3000  0.04  15  3   235  0.6  …  85   1   1   1   1   0

Analytical approaches – Parametric and non-parametric

Self-organizing Maps (SOM)

Observations close to each other in the visualization space

[Figure: observations plotted in the two-dimensional visualization space, Axis 1 versus Axis 2]
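The idea that similar observations end up close to each other on the map can be sketched with a very small one-dimensional SOM. This is a toy illustration under stated assumptions, not the configuration used in the case studies: the data points, the number of units, the learning-rate and radius schedules, and the function names are all invented for the example.

```python
import math


def best_matching_unit(prototypes, x):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))


def train_som_1d(data, n_units=4, epochs=50, lr0=0.5, radius0=1.5):
    """Train a 1-D chain of prototypes with a Gaussian neighbourhood."""
    dim = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dim)]
    hi = [max(row[d] for row in data) for d in range(dim)]
    # Deterministic start: prototypes spread evenly between the data extremes.
    protos = [
        [lo[d] + (hi[d] - lo[d]) * u / (n_units - 1) for d in range(dim)]
        for u in range(n_units)
    ]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks, floor 0.5
        for x in data:
            bmu = best_matching_unit(protos, x)
            for u in range(n_units):
                # Units near the winner on the chain move more than distant ones.
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    protos[u][d] += lr * h * (x[d] - protos[u][d])
    return protos


# Two hypothetical groups of observations, far apart in input space.
group_a = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6)]
group_b = [(10.0, 10.0), (9.5, 9.8), (9.7, 9.4)]
prototypes = train_som_1d(group_a + group_b)
units_a = [best_matching_unit(prototypes, x) for x in group_a]
units_b = [best_matching_unit(prototypes, x) for x in group_b]
```

After training, the observations of each group map to units at opposite ends of the chain, which is the property the component planes rely on.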

1st case study: Andean blackberry, based on ANNs

Supervised models - Non-linear regression. Coefficient of determination = 0.89

[Figure: Scatter plot displaying MLP-predicted yield versus real Andean blackberry yield (kg/plant/week), using only the validation dataset; R² = 0.892]
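For readers unfamiliar with the statistic, a minimal sketch of how a coefficient of determination can be computed from observed and predicted values follows. The five yield values are invented for illustration and are not taken from the study, and the slide does not state which exact definition of R² was used; the common "1 minus residual over total sum of squares" form is assumed here.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical yields in kg/plant/week (not data from the case study).
observed = [0.2, 0.5, 0.9, 1.3, 1.6]
predicted = [0.3, 0.4, 1.0, 1.2, 1.7]
r2 = r_squared(observed, predicted)
```

Computing the statistic on a validation dataset only, as on the slide, avoids rewarding a model for memorizing its training data.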

[Figure: Histogram displaying yield data distribution of Andean blackberry (kg/plant/week); y-axis: number of observations]

[Figure: Sensitivity distribution of the model with respect to the inputs/predictors; y-axis: % sensitivity, from 0 to 0.08. Inputs in plotted order: EffDepth, TempAvg_1, Na_un_chical, Na_un_cusba, TempAvg_0, TempAvg_2, TempAvg_3, ExtDrain, PrecAcc_1, Trmm_3, Nar-Cal, Cal_riosu_zr, Srtm, Slope, PrecAcc_0, Trmm_2, Na_un_cusal, Trmm_0, PrecAcc_3, TempRang_0, TempRang_2, AB_Thorn_N, Na_un_lajac, PrecAcc_2, Trmm_1, IntDrain, TempRang_3, TempRang_1. Input numbers in the same order: 12, 20, 3, 5, 17, 23, 26, 11, 22, 16, 2, 7, 8, 9, 19, 15, 4, 13, 28, 18, 24, 1, 6, 25, 14, 10, 27, 21]
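One generic way to obtain a sensitivity ranking of this kind is to perturb each input slightly and record how much the model output changes. The sketch below illustrates that idea only; it is not claimed to be the procedure behind the sensitivity matrix in the study. The stand-in model, its coefficients and the input values are invented, and only the variable names are borrowed from the figure.

```python
def sensitivity_shares(model, inputs, delta=1e-3):
    """Finite-difference sensitivity of the output to each input, normalized to sum to 1."""
    base = model(inputs)
    raw = {}
    for name, value in inputs.items():
        perturbed = dict(inputs)
        perturbed[name] = value + delta
        raw[name] = abs(model(perturbed) - base) / delta
    total = sum(raw.values())
    return {name: s / total for name, s in raw.items()}


def toy_model(x):
    """Hypothetical stand-in for a trained model; coefficients are invented."""
    return 0.08 * x["EffDepth"] + 0.05 * x["TempAvg_1"] + 0.01 * x["Slope"]


shares = sensitivity_shares(toy_model, {"EffDepth": 40.0, "TempAvg_1": 16.0, "Slope": 10.0})
ranking = sorted(shares, key=shares.get, reverse=True)
```

In practice the inputs would need to be on comparable scales (for example, standardized) before such shares can be compared meaningfully.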

Jiménez, D., Cock, J., Satizábal, F., Barreto, M., Pérez-Uribe, A., Jarvis, A. and Van Damme, P., 2009. Computers and Electronics in Agriculture. 69 (2): 198–208

Results - Andean blackberry: Sensitivity Matrix

Highlighted inputs:
• Effective soil depth
• Temperature averages
• Geographic location


Unsupervised model - Visualization - component planes - SOM

[Figure: (a) Kohonen map displaying the resulting 6 clusters and their labels according to yield values; (b) component plane of Andean blackberry yield. The scale bar (right) indicates the range of productivity in kg/plant/week: the upper end shows high yield values, the lower end low values.]

[Figure: Component plane of effective soil depth. The scale bar (right) indicates the range of soil depth in cm: the upper end shows high values, the lower end low values.]

[Figure: Component planes of the temperature averages. In all panels, the scale bar (right) indicates the temperature range in °C: the upper end shows high values, the lower end low values.]

[Figure: Component planes of the specific geographic areas Nariño - La Union - Chical Alto (left) and Nariño - La Union - Cusillo Bajo (right). As these are categorical variables, the highest values indicate presence and the lowest absence.]

Drawbacks
• Crop management factors not included (only variety)
• Only non-parametric approaches (based on ANNs)
• Limited spatial variation (two locations in two departments)

Advantages
• Predictor-predictor and predictor-response dependencies through Kohonen maps
• Combination of factors
• Non-linear approach

2nd case study: Lulo

Supervised modelling

Distribution of R² obtained with each model

Regression          R² (mean)   Confidence interval (95%)
Robust (linear)     0.65        0.63 - 0.66
MLP (non-linear)    0.69        0.67 - 0.70

Both models explained more than 60% of the variability in lulo production

[Figure: Histogram displaying yield data distribution of lulo (g/plant/week)]

[Figure: Histograms of the R² provided by each approach (MLP and robust regression); x-axis: R² from 0.2877 to 0.8227, y-axis: number of observations]

Results - Lulo: The Sensitivity Matrix

[Figure: Sensitivity distribution of the model with respect to the inputs/predictors; y-axis: % sensitivity, from 0 to 0.18]

Highlighted inputs:
• Temperature averages
• Slope

Jiménez, D., Cock, J., Jarvis, A., Garcia, J., Satizábal, H.F., Van Damme, P., Pérez-Uribe, A. and Barreto, M., 2010. Interpretation of Commercial Production Information: A case study of lulo, an under-researched Andean fruit. Agricultural Systems. 104 (3): 258-270

[Figure: (a) U-matrix displaying the distances between prototypes. The scale bar (right) indicates the distance values: the upper end shows high distances, the lower end low distances. (b) Kohonen map displaying the 3 clusters obtained using the K-means algorithm and the Davies–Bouldin index.]

The three most relevant variables were used to train a Kohonen map and identify clusters of Homogeneous Environmental Conditions (HECs)
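The Davies–Bouldin index is commonly used to compare candidate partitions, with lower values indicating more compact and better separated clusters. The sketch below shows the standard definition on invented two-dimensional points; the function names and the data are assumptions for illustration, and the way the index was combined with K-means in the study is not reproduced here.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, cents)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Hypothetical points: two tight groups, partitioned well and badly.
good = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(10, 10), (10, 11), (11, 10), (11, 11)]]
bad = [[(0, 0), (0, 1), (10, 10), (10, 11)], [(1, 0), (1, 1), (11, 10), (11, 11)]]
db_good = davies_bouldin(good)
db_bad = davies_bouldin(bad)
```

Running K-means for several values of k and keeping the partition with the lowest index is one common way to choose the number of clusters.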

Results - Lulo: Unsupervised model - Clustering - component planes - SOM

Panels: U-matrix; Kohonen map with 3 clusters

A mixed model with the categorical variables of three HECs, location and farmer explained more than 80% of variation in lulo yield

Model including categorical variables of 3 HECs, location and farm

Parameter    Estimate (g/plant/week)   Standard error   % of total variance
HEC          1.85                      2.01             61.2%
Location     0.07                      0.20             2.5%
Site-Farm    0.57                      0.21             19.0%
Error        0.52                      0.04             17.3%
Total                                                   100.0%

Variance components of the mixed model estimations
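The "% of total variance" column follows from dividing each variance component by the sum of all components. A minimal sketch using the rounded estimates from the table is shown below; because the inputs are rounded, the computed shares can differ from the reported percentages by a few tenths of a point.

```python
# Rounded variance component estimates as reported in the table.
components = {"HEC": 1.85, "Location": 0.07, "Site-Farm": 0.57, "Error": 0.52}

total = sum(components.values())
shares = {name: value / total for name, value in components.items()}

# Share of variation attributed to HEC, location and farm together.
explained = 1 - shares["Error"]

for name, share in shares.items():
    print(f"{name:10s} {share:6.1%}")
print(f"Explained  {explained:6.1%}")
```

The explained share, everything except the error component, is what supports the statement that the mixed model accounted for more than 80% of the variation in lulo yield.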

Variable ranges by HEC

HEC   Slope (degrees)   EffDepth (cm)   TempAvg_0 (°C)
1     5-14              21-40           15-16.5
2     8-15              32-69           15-18.9
3     13-24             40-67           15.8-19

HEC 3 yielded 41 g/plant/week more fruit than average

Results - Lulo

[Figure: Effects of clusters of environmental conditions (HEC 1, 2 and 3) on lulo yield (g/plant/week); y-axis from -30 to 50]

Results - Lulo

Farms 7 and 9 are both in HEC 3. Farm 7 produced 68 g/plant/week less than average, whilst farm 9 produced 51 g/plant/week more than average

[Figure: Effects of farms across clusters of environmental conditions on lulo yield (g/plant/week); x-axis: farm numbers grouped by HEC 1, 2 and 3, y-axis from -80 to 60]

Jiménez, D., Cock, J., Jarvis, A., Garcia, J., Satizábal, H.F., Van Damme, P., Pérez-Uribe, A. and Barreto, M., 2010. Interpretation of Commercial Production Information: A case study of lulo, an under-researched Andean fruit. Agricultural Systems. 104 (3): 258-270

Drawbacks
• Crop management factors not included (only variety)
• Even more limited spatial variation than in the Andean blackberry study (locations within one department)

Advantages
• Iterative procedure (combination of parametric and non-parametric, linear and non-linear)
• Combination of factors
• The first formal research study to show evidence of the yield gap between farmers under similar climatic conditions in Colombia; it provided the basis for the site-specific analytical approaches
• Successfully identified farms that have superior management practices for given environmental conditions

3rd case study: Plantain

FactoClass (climate clusters)

[Figure: Variables factor map (PCA) of the variables bio_1 to bio_19; Dim 1 (44.64%), Dim 2 (27.62%)]

[Figure: Observations on Dim 1 (43.43%) and Dim 2 (29.83%), grouped into clusters 1 to 8]

PCA

CATPCA (soil clusters)

C4S5: climate cluster 4, soil cluster 5

Generalized linear model (GLM)

The model: dependencies between predictors and the response variable

$\log(Y) = \beta_0 + \beta_1 X_1 + \varepsilon$

C4S5: $\log(\mathrm{Yield}) = 1.22 + 0.0008\,(\text{planting density}) + \varepsilon$

$\log(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

C5S5: $\log(\mathrm{Yield}) = 0.80 + 0.00101\,(\text{planting density}) + 0.324154\,(\text{MezcVar}) + \varepsilon$

MezcVar: variety mixture. Significance level: 5%

• Interpretation of the parameters

$\log(\mathrm{Yield}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon$

$e^{\log(\mathrm{Yield})} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (non-linear)

$\mathrm{Yield} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon}$ (back in the original unit, tons/ha)

$\mathrm{Yield} = e^{\beta_0}\, e^{\beta_1 X_1}\, e^{\beta_2 X_2} \cdots e^{\varepsilon}$ (dependencies between predictors and tons/ha)

With the model it is possible to calculate by what factor yield increases or decreases when a specific practice is changed

C4S5 (planting density)

$\log(\mathrm{Yield}) = 1.22 + 0.0008\,(\text{planting density}) + \varepsilon$

$\mathrm{Yield} = e^{1.22}\, e^{0.0008\,(\text{planting density})}\, e^{\varepsilon}$

Planting density = 100: $e^{100\,(0.0008)}$

With a confidence level of 90%, annual yield in tons/ha can be expected to increase by 3.2% to 14.2% for every additional 100 trees/ha.
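Because the model is linear on the log scale, a change of Δx in a predictor multiplies yield by e^(β·Δx). The sketch below turns that factor into a percentage for coefficients quoted on the slides; the function name is an assumption for illustration, and the results are point estimates, whereas the percentages quoted on the slides are 90% confidence bounds.

```python
import math


def percent_change(beta, delta_x=1.0):
    """Percentage change in yield when a predictor changes by delta_x in a log-linear model."""
    return (math.exp(beta * delta_x) - 1) * 100


# Plantain, C4S5: planting density coefficient 0.0008, change of 100 trees/ha.
plantain_density = percent_change(0.0008, 100)

# Avocado, C4S5: planting density coefficient 0.0077, change of 10 trees/ha.
avocado_density = percent_change(0.0077, 10)

# Plantain, C5S5: mixed varieties coded as presence (0 -> 1), coefficient 0.324154.
mixed_varieties = percent_change(0.324154)

print(f"Plantain, +100 trees/ha: {plantain_density:.1f}%")
print(f"Avocado, +10 trees/ha:   {avocado_density:.1f}%")
print(f"Mixed varieties present: {mixed_varieties:.1f}%")
```

The two density estimates fall inside the reported intervals (3.2% to 14.2% and 2.3% to 13.2%), and the mixed-variety estimate lies above the reported lower bound of 10.46%.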




3rd case study: Plantain

C5S5 (variety mixture)

$\log(\mathrm{Yield}) = 0.80 + 0.00101\,(\text{planting density}) + 0.324154\,(\text{MezcVar}) + \varepsilon$

$\mathrm{Yield} = e^{0.80}\, e^{0.00101\,(\text{planting density})}\, e^{0.324154\,(\text{MezcVar})}\, e^{\varepsilon}$

MezcVar (presence): $e^{0.324154\,(\text{presence})}$

With a confidence level of 90%, planting mixed varieties can be expected to increase production by more than 10.46%.

4th case study: Avocado

C4S5 (planting density)

$\mathrm{Yield} = e^{-2.078}\, e^{0.0077\,(\text{planting density})}\, e^{0.2079\,(\text{planting pattern})}\, e^{\varepsilon}$

Planting density = 10: $e^{10\,(0.0077)}$

With a confidence level of 90%, annual yield in tons per hectare can be expected to increase by 2.3% to 13.2% for every 10 trees/ha added to the planting density.

C2S4 (planting pattern)

$\mathrm{Yield} = e^{3.6}\, e^{-0.006\,(\text{planting density})}\, e^{0.434\,(\text{variety})}\, e^{0.7946\,(\text{planting pattern})}\, e^{\varepsilon}$

Planting pattern (presence): $e^{0.7946\,(\text{presence})}$

With a confidence level of 90%, a producer in this zone who plants in a triangular (tresbolillo) pattern instead of a square one can be expected to increase production by more than 30.21%.



Drawbacks
• Not enough crop management factors to apply a hierarchical approach such as mixed models
• Limited temporal variation

Advantages
• Iterative procedure (combination of parametric and semi-parametric)
• Crop management factors included (the farmer can control them)
• Predictor-response dependencies through GLM
• Large spatial variation
• Soil information included
• Linear and non-linear approach

Thank you!

[Figure: Factor map of labeled observations, Factor 1: 3.8369 (48%) versus Factor 2: 2.518 (31.5%), with the variables bio_4, bio_6, bio_7, bio_12, bio_13, bio_14, bio_15 and cons_mths and the clusters cl1, cl2 and cl3]

Analytical approaches – Data-driven

Parametric methods
• Ordinary least squares regression (OLS)
• Principal component analysis (PCA)
• Robust linear regression
• Mixed models
• Best linear unbiased prediction (BLUP)
• FactoClass (factor analysis, Ward's method, K-means)
• Categorical principal components analysis (CATPCA)

Semi- or non-parametric methods
• Generalized linear model (GLM)
• Self-organizing maps (SOM)
• Multilayer perceptron (MLP)
• Fuzzy logic

We adapt a range of methodologies to the analysis of real data, rather than adapting the data to particular methodologies.