+ All Categories
Home > Documents > NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Date post: 10-Jan-2016
Category:
Upload: dwain-allen
View: 212 times
Download: 0 times
Share this document with a friend
Popular Tags:
205
NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis
Transcript
Page 1: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

NUMERICAL ANALYSIS OF BIOLOGICAL AND

ENVIRONMENTAL DATA

Lecture 7.Direct Gradient

Analysis

Page 2: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Canonical correspondence analysis (CCA)

Introduction

Basic terms and ordination plots

Other topics in CCA

Robustness

Scaling and interpretation of CCA plots

Example

Redundancy analysis (RDA) (= constrained PCA)

Scaling and interpretation of RDA plots

Statistical testing of constrained ordination axes

Partial constrained ordinations

Partial ordinations

DIRECT GRADIENT ANALYSIS

Partitioning variance

Environmental (predictor) variables and their selection

Canonical correlation analysis

Distance-based redundancy analysis

Canonical analysis of principal co-ordinates

Principal response curves

Polynomial RDA and CCA

CCA/RDA as predictive tools

Non-linear canonical analysis of principal co-ordinates

Canonical Gaussian ordination

Constrained additive ordination

CANODRAW

Interpretation of ordination axes with external data

Canonical or constrained ordination techniques (= direct gradient analysis)

Page 3: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

BASIS OF CLASSICAL ORDINATION INTERPRETATION AND ENVIRONMENT

We tend to assume that biological assemblages are controlled by environment, so:

1. Two sites close to each other in an indirect ordination are assumed to have similar composition, and

2. if two sites have similar composition, they are assumed to have similar environment.

In addition:

3. Two sites far away from each other in ordination are assumed to have dissimilar composition, and thus

4. if two sites have different composition, they are assumed to have different environment.

J. Oksanen (2002)

Page 4: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Values of environmental variables and Ellenberg’s indicator values of species written alongside the ordered data table of the Dune Meadow Data, in which species and sites are arranged in order of their scores on the second DCA axis. A1: thickness of A1 horizon (cm), 9 meaning 9cm or more; moisture: moistness in five classes from 1 = dry to 5 = wet; use: 1 = hayfield, 2 = a mixture of pasture and hayfield, 3 = pasture; manure: amount applied in five classes from 0 = no manure to 5 = heavy use of manure. The meadows are classified by type of management: SF, standard farming; BF, biological farming; HF, hobby farming; NM, nature management; F, R, N refer to Ellenberg’s indicator values for moisture, acidity and nutrients, respectively .

Veg

eta

tion

al

data

En

vir

on

men

tal

data

DUNE-MEADOW DATA

Page 5: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

The amount of manure written on the DCA ordination. The trend in the amount across the diagram is shown by an arrow, obtained by a multiple regression of manure on the site scores of the DCA axes. Also shown are the mean scores for the four types of management, which indicate, for example, that the nature reserves (NM) tend to lie at the top of the diagram. Ez=b0 + b1x1 + b2x2

Angle ()with axis 1 = arctan(b2 / b1)

DCA axis 2

DCA axis 1

Page 6: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Site scores of the second DCA axis plotted against the amount of manure.

Page 7: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Correlation coefficients (100 r) of the environmental variables for the four first DCA axes for the Dune Meadow Data

Variable Axes      

1 2 3 4

1 A1 58 24 7 9

2 moisture 76 57 7 -7

3 use 35 -21 -3 -5

4 manure 6 -68 -7 -64

5 SF 22 -29 5 -60

6 BF -28 -24 39 22

7 HF -22 -26 -55 -14

8 NM 21 73 17 56

Eigenvalue 0.54 0.29 0.08 0.05

Page 8: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Multiple regression of the first CA axis on four environmental variables of the dune meadow data, which shows that moisture contributes significantly to the explanation of the first axis, whereas the other variables do not.

Term Parameter Estimate s.e. tconstant c0 –2.32 0.50 –4.62

A1 c1 0.14 0.08 1.71

moisture c2 0.38 0.09 4.08

use c3 0.31 0.22 1.37

manure c4 –0.00 0.12 –0.01

ANOVA tabled.f. s.s. m.s. F

Regression 4 17.0 4.25 10,6Residual 15 6.2 0.41Total 19 23.2 1.22 R2 = 0.75 R2

adj = 0.66

Ey1 = b0 + b1x1 + b2x2 + ...bnxn

CA axis 1 environmental variables

x = environmental variables

Page 9: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

TWO-STEP APPROACH OF INDIRECT GRADIENT ANALYSIS

Standard approach to about 1985: started by D.W. Goodall in 1954

Limitations: (1) environmental variables studied may turn out to be poorly related to the first few ordination axes.

(2) may only be related to 'residual' minor directions of variation in species data.

(3) remaining variation can be substantial, especially in large data sets with many zero values.

(4) a strong relation of the environmental variables with, say, axis 5 or 6 can easily be overlooked and unnoticed.

Limitations overcome by canonical or constrained ordination techniques = multivariate direct gradient analysis.

Page 10: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Ordination and regression in one technique – Cajo ter Braak 1986

Search for a weighted sum of environmental variables that fits the species best, i.e. that gives the maximum regression sum of squares Ordination diagram  1) patterns of variation in the species data  2) main relationships between species and each environmental variable

Redundancy analysis constrained or canonical PCA

Canonical correspondence analysis (CCA) constrained CA

(Detrended CCA) constrained DCA

Axes constrained to be linear combinations of environmental variables.

In effect PCA or CA with one extra step:

Do a multiple regression of site scores on the environmental variables and take as new site scores the fitted values of this regression.

Multivariate regression of Y on X.

CANONICAL ORDINATION TECHNIQUES

Page 11: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRIMARY DATA IN GRADIENT ANALYSIS

Indir

ect

G

A

Dir

ect

GA

Abundances or +/-

variables

Response variables

Values

Classes

Predictor or explanatory variables

Speci

es

Env. vars

PLUS

Page 12: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Artificial example of unimodal response curves of five species (A-E) with respect to standard-ised environmental variables showing different degrees of separation of the species curves

a: Moistureb: Linear combination of moisture and phosphate, chosen a prioric: Best linear combination of environmental variables, chosen by CCA. Sites are shown as dots, at y = 1 if Species D is present and at y = 0 if Species D is absent

moisture

linear combination of moisture and phosphate

CCA linear combination

Page 13: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Combinations of environmental variables 

e.g. 3 x moisture + 2 x phosphate 

e.g. all possible linear combinations 

 zj = environmental variable at site j

 c = weights

 xj = resulting ‘compound’ environmental variable

 CCA selects linear combination of environmental variables that maximises dispersion of species scores, i.e. chooses the best weights (ci) of the environmental variables.

..... jjjoj zczczccx 332211

Page 14: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Algorithms for (A) Correspondence Analysis, (B) Detrended Correspondence Analysis, and (C) Canonical Correspondence Analysis, diagrammed as flowcharts. LC scores are the linear combination site scores, and WA scores are the weighted averaging scores.

ALTERNATING REGRESSION ALGORITHMS

- CA - DCA - CCA

Page 15: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1) Start with arbitrary, but unequal, site scores xi.

 2) Calculate species scores by weighted averaging of site

scores. 

 3) Calculate new site scores by weighted averaging of species

scores. 

 

[So far, two-way weighted average algorithm of correspondence analysis].

 

n

iik

n

iiikk yxyu

11

/

m

kik

m

kkiki yuyx

11

/

CANONICAL CORRESPONDENCE ANALYSIS

Algorithm

REF

REF

REF

REF

Page 16: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

4) Obtain regression coefficients of site scores on the environmental variables by weighted multiple regression.

 

where b and x* are column vectorsZ is environmental data n x (q +1)R is n x n matrix with site totals in diagonal

 5) Calculate new site scores 

or 6) Centre and standardise site scores so that: 

and 7) Stop on convergence, i.e. when site scores are sufficiently close to

site scores of previous iteration. If not, go to 2.

xRZRZZb 111

bzbx 0

n

iii xy

1

0 11

2

n

iii xy

zbx

REF

REF

REF

REF

Page 17: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANONICAL OR CONSTRAINED CORRESPONDENCE ANALYSIS (CCA)

Ordinary correspondence analysis gives:

1. Site scores which may be regarded as reflecting the underlying gradients.

2. Species scores which may be regarded as the location of species optima in the space spanned by site scores.

Canonical or constrained correspondence analysis gives in addition:

3. Environmental scores which define the gradient space.

These optimise the interpretability of the results.J. Oksanen (2002)

Page 18: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

BASIC TERMS

Eigenvalue = Maximised dispersion of species scores along axis. In CCA usually smaller than in CA. If not, constraints are not useful.

Canonical coefficients = ‘Best’ weights or parameters of final regression.

Multiple correlation of regression = Species–environment correlation. Correlation between site scores that are linear combinations of the environmental variables and site scores that are WA of species scores. Multiple correlation from the regression. Can be high even with poor models. Use with care!

Species scores = WA optima of site scores, approximations to Gaussian optima along individual environmental gradients.

Site scores = Linear combinations of environmental variables (‘fitted values’ of regression) (1).

Can also be calculated as weighted averages of species scores that are themselves WA of site scores (2).

(1) LC scores are predicted or fitted values of multiple regression with constraining predictor variables 'constraints'.

(2) WA scores are weighted averages of species scores.

Generally always use (1) unless all predictor variables are 1/0 variables.

Page 19: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

SUMMARY OF DUNE MEADOW DATA

Dune Meadow Data. Unordered table that contains 20 relevées (columns) and 30 species (rows). The right-hand column gives the abbreviation of the species names listed in the left-hand column; these abbreviations will be used throughout the book in other tables and figures. The species scores are according to the scale of van der Maarel (1979b).

Page 20: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1 2.8 1 SF 2 42 3.5 1 BF 2 23 4.3 2 SF 2 44 4.2 2 SF 2 45 6.3 1 HF 1 26 4.3 1 HF 2 27 2.8 1 HF 3 38 4.5 5 HF 3 39 3.7 4 HF 1 110 3.3 2 BF 1 111 3.5 1 BF 3 112 5.8 4 SF 22*13 6.0 5 SF 2 314 9.3 5 NM 3 015 11.5 5 NM 2 016 5.7 5 SF 3 317 4.0 2 NM 1 018 4.6* 1 NM 1 019 3.7 5 NM 1 020 3.5 5 NM 1 0

Sample number

A1 horizon

Moisture class

Management type

UseManure

class

Environmental data of 20 relevées

from the dune meadows

Use categories: 1 = hay 2 = intermediate 3 = grazing

 * = mean value of variable

Page 21: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

DCA ordination diagram of the Dune Meadow Data

DCAaxis 2

DCAaxis 1

DCA

Page 22: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Correlations of environmental variables with DCA axes 1 and 2

Axis One1 = 0.54

Axis Two2 = 0.29

DCA

Page 23: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA of the Dune Meadow Data. a: Ordination diagram with environmental variables represented by arrows. the c scale applies to environmental variables, the u scale to species and sites. the types of management are also shown by closed squares at the centroids of the meadows of the corresponding types of management.

11 22 R axis 1 R axis 2R axis 1 R axis 2

DCADCA 0.54 0.54 0.40 0.40 0.87 0.87 0.83 0.83CCACCA 0.46 0.46 0.29 0.29 0.96 0.96 0.89 0.89

CCA

Page 24: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Canonical correspondence analysis: canonical coefficients (100 x c) and intra-set correlations (100 x r) of environmental variables with the first two axes of CCA for the Dune Meadow Data. The environmental variables were standardised first to make the canonical coefficients of different environmental variables comparable. The class SF of the nominal variable 'type of management' was used as a reference class in the analysis.

A1 9 -37 57 -17Moisture 71 -29 93 -14Use 25 5 21 -41Manure -7 -27 -30 -79SF - - 16 -70BF -9 16 -37 15HF 18 19 -36 -12NM 20 92 56 76

Variable Coefficients Correlations

Axis 1 Axis 2 Axis 1 Axis 2

CANONICAL CORRESPONDENCE ANALYSIS

Page 25: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA of the Dune Meadow Data. a: Ordination diagram with environmental variables

represented by arrows. the c scale applies to environmental variables,

the u scale to species and sites. the types of management are also

shown by closed squares at the centroids of the meadows of the

corresponding types of management.

b: Inferred ranking of the species along the variable amount of manure, based on the biplot interpretation of Part a of this figure.

aa

bb

Page 26: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

BIPLOT PREDICTION OF ENVIRONMENTAL VARIABLES

Modified from J. Oksanen (2002)

• Project a site point onto environmental arrow: predict its environmental value

• Exact with two constraints only

• Projections are exact only in the full multi-dimensional space. Often curved when projected onto a plane

Page 27: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CLASS VARIABLES

• Class 1/0 variables usually represented as 'dummy' variables: make m - 1 indicator variables out of m levels (moisture classes 1, 2, 4, 5). Ignore class 3

• Scoring: 1 if site belongs to the class, 0 otherwise

• One dummy less than levels, because one is redundant. If it does not belong to any of the m - 1 classes, it must belong to the remaining one.

• Ordered factors such as the four moisture classes can be expressed as polynomial constraints. A four-level ordered factor can be expressed in three 'dummy' variables - linear, quadratic, and cubic effects. Plot as biplot arrows. Helps to find when one can replace a multilevel factor by a single continuous variable (e.g. Moisture Linear)

J. Oksanen (2002)

REF

REF

REF

Page 28: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

• You may have in a same figure• WA scores of species• WA or LC scores of sites• Biplot arrows or class centroids of environmental variables

 • In full space, the length of an environmental vector is 1: When projected

onto ordination space• Length tells the strength of the variable• Direction shows the gradient• For every arrow, there is an equal arrow to the opposite direction,

decreasing direction of the gradient• Project sample points onto a biplot arrow to get the expected value 

• Class variables coded as dummy variables• Plotted as class centroids• Class centroids are weighted averages 

• LC score shows the class centroid, WA scores show the dispersion of the centroid

• With class variables only: Multiple Correspondence Analysis or Analysis of Concentration

CCA: JOINT PLOTS AND TRIPLOTS

Page 29: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Summary Axes

Axes 1 2 3 4 Total inertiaEigenvalues .461 .298 .160 .134 2.115Species-environment .958 .902 .855 .889 correlationsCumulative percentage variance

of species data 21.8 35.9 43.5 49.8 of species-environment 37.8 62.3 75.4 86.3 relation

Sum of all unconstrained eigenvalues = inertia 2.115Sum of all canonical eigenvalues = species-environment 1.220 relation

'Fitted' species data

10021

seigenvalue canonical

Rules of thumb:>0.30 strong gradient>0.40 good niche separation of species

CANOCO

Page 30: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Oksanen & Minchin (1997) J. Vegetation Science 8, 447–454

Tolerance Number of

iterations DECORANA 0.0001 10

CANOCO version 2 0.00005 15 version 3.1

CANOCO version 3.12a 0.000005 999(‘STRICT’ CRITERIA)

CANOCO version 4 0.000005 999

CONVERGENCE CRITERIA IN EIGENVALUE EXTRACTION

Page 31: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1) Environmental variables  continuous – biplot arrows

classes – centroid (weighted average) of sites belonging to that class

2) CA approximates ML solution of Gaussian model

CCA approximates ML solution of Gaussian model if CA axis is close to the linear com-bination of environmental variables. [Johnson & Altman (1999) Environmetrics 10, 39-52]

In CCA species compositional data are explained through a Gaussian unimodal response model in which the explanatory variable is a linear combination of environmental variables.

3) CCA – very robust, major assumption is that response model is UNIMODAL.

(Tolerances, maxima, and location of optima can be violated - see Johnson & Altman 1999)

4) Constraints become less and less strict the more environmental variables there are. If q, number of environmental variables ≥ number of samples -1, no real constraints and CCA = CA.

5) Arch effect may crop up. Detrending (by polynomials) DCCA. Useful for estimating gradient lengths (use segments).

6) Arch effect can often be removed by dropping superfluous environmental variables, especially those highly correlated with the arched axis.

OTHER CCA TOPICS

Page 32: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

REPRESENTATION OF CLASS VARIABLES (1/0) IN CCA

1. Make class centroids as distinct as possible

2. Make clouds about centroids as compact as possible

• Success

• LC scores are the class centroids: the expected locations, WA scores are the dispersion of the centroid

• If high , WA scores are close to LC scores

• With several class variables, or together with continuous variables, the simple structure can become blurred J. Oksanen (2002)

Page 33: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Canonical correspondence analysis

Unimodal curves for the expected abundance response (y) of four species against an environmental gradient or variable (x). The optima, estimated by weighted averages, (u) [k=1,2,3], of three species are indicated. The curve for the species on the left is truncated and therefore appears monotonic instead of unimodal; its optimum is outside the sampled interval but, its weighted average is inside. The curves drawn are symmetric, but this is no strict requirement for CCA.

Page 34: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

7) t-values of canonical coefficients or forward selection option in CANOCO to find minimal set of significant variables that explain data about as well as full set.

8) Can be sensitive to deviant sites, but only if there are outliers in terms of both species composition and environment. CCA usually much more robust than CA.

9) Can regard CCA as a display of the main patterns in weighted averages of each species with respect to the environmental variables.

  Intermediate between CA and separate WA regressions for each species.  Separate WA regressions point in q-dimensional space of environmental

variables. NICHE.  CCA attempts to provide a low-dimensional representation of this niche.

10) ‘Dummy’ variables (e.g. group membership or classes) as environmental variables. Shows maximum separation between pre-defined groups.

11) ‘Passive’ species or samples or environmental variables. Some environmental variables active, others passive

  e.g. group membership – activeenvironmental variables – passive

12) CANOCO ordination diagnostics  fit of species and samples

pointwise goodness of fit can be expressed either as residual distance from the ordination axis or plane, or as proportion of projection from the total chi-squared distance species tolerances, sample heterogeneity

Page 35: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Passive ‘fossil’ samples added into CCA of modern data

Page 36: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Canonical correspondence analysis (CCA) time-tracks of selected cores from the Round Loch of Glenhead; (a) K5, (b) K2, (c) K16, (d) k86, (e) K6, (f) environmental variables. Cores are presented in order of decreasing sediment accumulation rate.

Page 37: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

13) Indicator species

14) Behaves well with simulated data. 

M W Palmer (1993) Ecology 74, 2215–2230

Copes with skewed species distributions‘noise’ in species abundance dataunequal sampling designshighly intercorrelated environmental variablessituations when not all environmental factors are known

Page 38: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Site scores along the first two axes in CCA and DCA ordinations, with varying levels of quantitative noise in species abundance. Quantitative noise was not simulated. The top set represents CCA LC scores and environmental arrows, the middle represents CCA WA scores, and the bottom represents DCA scores. Sites with equal positions along the environmental gradient 2 are connected with lines to facilitate comparisons.

Palmer, M.W. (1993) Ecology 74, 2215–2230

Page 39: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Site scores along the first two axes in CCA and DCA ordinations, with varying levels of quantitative noise in species abundance. Quantitative noise was not simulated. The top set represents CCA LC scores and environmental arrows, the middle represents CCA WA scores, and the bottom represents DCA scores. Sites with equal positions along the environmental gradient 2 are connected with lines to facilitate comparisons.

Palmer, M.W. (1993) Ecology 74, 2215–2230

..continued

Page 40: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Like all numerical techniques, CCA makes certain assumptions, most particularly that the abundance of a species is a unimodal function of position along environmental gradient. Does not have to be symmetric unimodal function.

Simulated data Palmer 1993 – CCA performs well even with highly skewed species distributions.

‘Noise’ in ecological data – errors in data collection, chance variation, site-specific factors, etc. Noise is also regarded as ‘unexplained’ or ‘residual’ variance. Regardless of cause, noise does not affect seriously CCA.

‘Noise’ in environmental data is another matter. In regression, assumed that predictor variables are measured without error. CCA is a form of regression, so noise in environmental variables can affect CCA.

Highly correlated environmental variables, e.g. soil pH and Ca. Species distributions along Ca gradient may be identical to distributions along pH gradient, even if one is ecologically unimportant. Species and object arrangement in CCA plot not upset by strong inter-correlations. CCA (like all other regression techniques) cannot tell us which is the ‘real’ important variable.

Both may be statistically significant – small amount of variation in Ca at a fixed level of pH may cause differences in species composition.

Arch – very rarely occurs in CCA. Detrended CCA generally should not be used except in special cases.

ROBUSTNESS OF CANONICAL CORRESPONDENCE ANALYSIS

Page 41: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

McCune (1997) Ecology 78, 2617–2623

Simulated artificial data 10 x 10 grid. 40 species following Gaussian response model.

2 environmental variables X and Y co-ordinatesTENXTEN

2 environmental variables with added noise NOISMOD

(random number mean = 0, variance 17%)

added to each cell10 random environmental variables NOIS1O 2 environmental variables with added noise from

NOISMOD+10 random environmental variables from NOIS

10

NOISBOTH99 random environmental variables NOISFULL

NOISFULL – ‘Species-environment’ correlation increases as number of random variables increases for axis 1 and 2.

Is in fact the correlation between the linear combination and WA site scores.

Poor criterion for evaluating success.Not interpreted as measure of strength of relationship.Monte Carlo permutation tests - NO STATISTICAL SIGNIFICANCE!

INFLUENCE OF NOISY ENVIRONMENTAL DATA ON CANONICAL CORRESPONDENCE

ANALYSIS

(1(1))

(2(2))

(3(3))

(4(4))

(5(5))

Page 42: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Dependence of the 'species-environment correlation,' the correlation between the LC and WA site scores, on a second matrix composed of from 1 to 99 random environmental variables. This correlation coefficient is inversely related to the degree of statistical constraint exerted by the environmental variables.

TEN x TEN

NOISMOD

NOISIO

NOISFULL

Page 43: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Monte Carlo tests 1 2 r1 r2

TENXTEN * 0.77 0.79 0.98 0.9NOISMOD * 0.65 0.65 0.90 0.8NOISE 10 ns 0.20 0.12 0.49 0.4NOISBOTH * 0.65 0.65 0.91 0.8NOISFULL ns 0.12 0.08 1.0 1.0(99 env vars)

Linear combination site best fit of species abundances to scores the environmental dataWA site scores best represent the assemblage structure

LC scores WA scoresSensitive to noise + –True direct gradient analysis + – (multivariate regression)Aim to describe biological + – variation in relation to environmentAssemblage structure – +Which to use depends on one's aims and the nature of the data.

‘Species-environmental correlation’ better called ‘LC-WA’ correlation. Better measure of the strength of the relationship is the proportion of the variance in the species data that is explained by the environmental data. Evaluation should always be by a Monte Carlo permutation test.

Page 44: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

LC OR WA SCORES?

MIKE PALMER

"Use LC scores, because they give the best fit with the environment and WA scores are a step from CCA towards CA."

BRUCE MCCUNE

"LC scores are excellent, if you have no error in constraining variables. Even with small error, LC scores can become poor, but WA scores can be good even in noisy data."

LC scores are the default in CANODRAW.

Be aware of both - plot both to be sure.

J. Oksanen (2002)

Page 45: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

DATA ORDERINGS

Page 46: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA DIAGRAMTEN SETS OF DISTANCES TO REPRESENT, EMPHASIS ON 5, 8, AND 1

(FITTED ABUNDANCES OF SPECIES AND SITES)

Page 47: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Data-tables in an ecological study on species environmental relations. Primary data are the sub-table 1 of abundance values of species and the sub-tables 4 and 7 of values and class labels of quantitative and qualitative environmental variables (env. var), respectively. The primary data are input for canonical correspondence analysis (CCA). The other sub-tables contain derived (secondary) data, as the arrows indicate, named after the (dis)similarity coefficient they contain. The coefficients shown in the figure are optimal when species-environmental relations are unimodal. The CA ordination diagram represents these sub-tables, with emphasis on sub-tables 5 (weighted averages of species with respect to quantitative environmental variables), 8 (totals of species in classes of qualitative environmental variables) and 1 (with fitted, as opposed to observed, abundance values of species). The sub-tables 6, 9, and 10 contain correlations among quantitative environmental variables, means of the quantitative environmental variables in each of the classes of the qualitative variables and chi-square distances among the classes, respectively. (Chis-sq = Chi-square; Aver = Averages; Rel = Relative)

Page 48: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

DEFAULT CCA PLOT

J. Oksanen (2002)

• Like CA biplot, but now a triplot: vectors for linear constraints.

• Classes as weighted averages or centroids.

• Most use LC scores: these are the fitted values.

• Popular to scale species relative to eigenvalues, but keep sites unscaled. Species-conditional plot.

• Sites do not display their real configuration, but their projections onto environmental vectors are the estimated values.

Page 49: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Hill scaling Default scaling

–1 2

Emphasis on SITES SPECIES

1 Species x sites Rel abundances Fitted abundances (rel)

2 Species x species – Chi-squared distances

3 Sites x sites Turnover distances –

Quant env vars

4 Sites x env vars3 – Values of env vars

5 Species x env varsWeighted averages Weighted averages

6 Env vars x env vars Effects2 Correlations2

Qualit env vars

7 Sites x env classes4 Membership1

Membership1

8 Species x env classes Rel total abund Rel total abund

9 Env vars x env classes – Mean values of env vars

10 Env classes x env Turnover distances – classes

fitted by least fitted by least squaressquares

11 by centroid by centroid principleprinciple

22 change in site change in site scores if env scores if env variable variable changes changes are are one standard one standard deviationdeviation

33 inter-set inter-set correlationscorrelations

44 group centroids group centroids

SCALING IN CCAREF

REF

REF

REF

Page 50: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Sub-tables (row numbers) that can be displayed by two differently scaled ordination diagrams in canonical correspondence analysis (CCA). Display is by the biplot rule unless noted otherwise. Hill's scaling (column 2) was the default in CANOCO 2.1, whereas the species-conditional biplot scaling (column 3) is the default in CANOCO 3.1 and 4. The weighted sum of squares of sites scores of an axis is equal to /(1-) with its eigenvalue and equal to 1 in scaling -1 and scaling 2, respectively. The weighted sum of squares of species scores of an axis is equal to 1/(1-) and equal to in scaling -1 and scaling 2, respectively. If the scale unit is the same of both species and sites scores, then sites are weighted averages of species scores in scaling -1 and species are weighted averages of site scores in scaling 2. Table in italics are fitted by weighted least-squares (rel. = relative; env. = environmental; cl. = classes; - = interpretation unknown).

NoteNote that that symmetric scalingsymmetric scaling (=3) has many optimal properties (=3) has many optimal properties (Gabriel, 2002; ter Braak, personal communication)(Gabriel, 2002; ter Braak, personal communication)

REF

REF

REF

REF

Page 51: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Scaling-1: focus on sites Hill's scaling

Interpreta-tion

2: focus on species biplot scaling of CCA

Interpreta-tion

1. species x sitesa Rel. Abundancesb,c

CENTROID Fitted rel. abund.b

BIPLOT rule or CENTROID rule

2. species x species

- UNKNOWN -square distancesd

DISTANCE rule

3. sites x sites Turnover distancesc,e

DISTANCE f DISTANCE rule

Quantitative env. vars

4. sites x env. varsg - UNKNOWN Values of env.vars

BIPLOT rule

5. species x env. vars

Weighted averages

BIPLOT Weighted averages

BIPLOT rule

6. env.vars x env. vars

Effectsh ? BIPLOT Correlations BIPLOT rule

Qualitative env. vars

7. sites x env. classesi

Membershipk CENTROID Membershipk CENTROID rule

8. species x env. cls. Rel. total abund.c,b

CENTROID Rel. total abund.b

CENTROID rule

9. env.vars x env. classes

- UNKNOWN Mean values of env. vars

BIPLOT rule

10. env. classes x env. classes.

Turnover distancesc,e

DISTANCE f DISTANCE rule

Page 52: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

a Site scores are linear combinations of the environmental variables. The adjective "fitted" must be deleted if site scores are proportional to the weighted average of species scores.

b The centroid principle can be applied also if sites and species scores are plotted in the same units, i in scaling -1, species that occur in a site lie around it, whereas in scaling 2, the species' distribution is centred at the species point.

c The biplot rule cannot be applied

d In the definition of this coefficient, abundance must be replaced by fitted abundance values, because CCA is correspondence analysis of fitted abundance values

e No explicit formula known

f Chi-square distances, provided the eigenvalues of the axes are of the same magnitude

g Environmental scores are (intra-set) correlations in scaling 2; more precisely, the coordinate of an arrow head on an axis (i.e. the score) is the weighted product-moment coefficient of the environmental variable with the axis, the weights being the abundance totals of the sites (yi+). The scores in scaling -1 are {(1-)}½ times

those in scaling 2.

h Effect is defined as the change in site scores if the environmental variable changes one standard deviation in value (while neglecting the other variables).

i Environmental points are centroids of site points

k Via centroid principle, not via biplot

REF

REF

REF

REF

Page 53: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Centroid principle

Distance principle

Biplot principle (of relative abundances)

Small eigenvalues, short (< 4SD) gradients – Biplot principle

Large eigenvalues (> 0.40), long (> 4SD) gradients – Centroid and distance principles and some biplot principles

Note that the centroid and distance principle may approximate biplot principle if gradients are short and eigenvalues small.

Differences are least important if 12

INTERPRETATION OF CCA PLOTS

Page 54: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Example data: quantitative and qualitative environmental variables (a) and qualitative covariables (b) recorded at 40 sites along two tributaries from the Hierden stream (sd: standard deviation, min: minimum, max: maximum). Aquatic macro-fauna data

CCA EXAMPLE

Ordinal

4 classes

3 classes

7 binary class variables

Remove effect ofseasonal variation

}

Page 55: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Ranking environmental variables in importance by their marginal (left) and conditional (right) effects of the macrofauna in the example data-set, as obtained by forward selection. (1 = fit = eigenvalue with variable j only; a = additional fit = increase in eigenvalue; cum (a) = cumulative total of eigenvalues a; P = significance level of the effect, as obtained with a Monte Carlo permutation test under the null model with 199 random permutations; - additional variables tested; veg. = vegetation). Seasonal variation is partialled out by taking the month class variables as covariables.Marginal effects (forward: step 1)

Conditional effects (forward: continued)

j Variable 1P j Variable a

P Cum (a)

1 Shrubs (1/0) 0.25 (0.01) 1 Shrubs (1/0) 0.25 (0.01) 0.25

2 Source distance

0.22 (0.01) 2 Source distance

0.19 (0.01) 0.44

3 EC 0.20 (0.01) 3 EC 0.19 (0.01) 0.63

4 Discharge 0.17 (0.01) 4 Discharge 0.14 (0.03) 0.75

5 Total veg cover

0.16 (0.01)

6 Shading 0.15 (0.01) - Cover emergent

0.11 (0.10) -

7 Soil grain size 0.14 (0.02) - Cover bank veg

0.11 (0. 12)

-

8 Stream width 0.14 (0.05) - Soil grain size 0.10 (0.13) -

9 High weedy veg

0.14 (0.08)

10

Cover bank veg

0.13 (0.11)

- U vs L stream 0.22 (0.01) - U vs L stream 0.09 (0.01) -

EXTRA FIT

Each variable is the only env. var.

Change in eigenvalue if particular variable selected

MARGINAL EFFECTS i.e. ignoring all other variables

CONDITIONAL EFFECTS given other selected variables

Page 56: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Species-conditional triplot based on a canonical correspondence analysis of the example macro-invertebrate data displaying 13% of the inertia (=weighted variance) in the abundances and 69% of the variance in the weighted averages and class totals of species with respect to the environmental variables. The eigenvalue of axis 1 (horizontally) and axis 2 (vertically) are 0.35 and 0.17 respectively; the eigenvalue of the axis 3 (not displayed) is 0.13. Sites are labelled with stream code (U, L) and are ranked by distance from the source (rank number within the stream). Species (triangles) are weighted averages of site scores (circles). Quantitative environmental variables are indicated by arrows. The class variable shrub is indicated by the square points labelled Shrub and No shrub. The scale marks along the axes apply to the quantitative environmental variables; the species scores, site scores and class scores were

multiplied by 0.4 to fit in the coordinate system. Only selected species are displayed which have N2>4 and a small N2-adjusted root mean square tolerance for the first two axes. The species names are abbreviated to the part in italics as follows Ceratopogonidae, Dendrocoelum lacteum, Dryops luridus, Erpobdella testacea, Glossiphonia complanata, Haliplus lineatocollis, Helodidae, Micropsectra atrofasciata, Micropsectra fusca, Micropterna sequax, Prodiamesa olivacea, Stictochironomus sp.

Page 57: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

• Unconstrained CA gives

• Species ordination which is derived from site ordination

• Site ordination which is derived from species ordination

• Fitted vectors for environmental variables (indirect gradient analysis)

  • Constrained CA (Canonical CA) gives a direct gradient analysis

• Species ordination which is derived from site ordination

• Site scores which are linear combinations of environmental variables (LC scores)

• Site ordination which is derived from species ordination (WA scores) so that species-environment correlation is maximised with the LC scores

• Vectors of environmental variables that define the linear combination scores for sites

CANONICAL CORRESPONDENCE ANALYSIS (CCA) - A SUMMARY

Page 58: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA CCA DirectlyDirectly

CACAIndirectlyIndirectly

Gradient length Gradient length estimationestimation

Outline of ordination techniques present-ed here. DCA (detrended correspondence analysis) was applied for the determina-tion of the length of the gradient (LG). LG is important for choosing between ordination based on a linear or on an unimodal response model. Correspond-ence analysis (CA) is not considered any further because in “microcosm experi-ment discussed here LG < or = 1.5 SD units. LG < 3 SD units are considered to be typical in experimental ecotoxicology. In cases where LG < 3, ordination based on linear response models is considered to be most appropriate. PCA (principal component analysis) visualizes variation in species data in relation to best fitting theoretical variables. Environmental variables explaining this visualised variation are deduced afterwards, hence, indirectly. RDA ( redundancy analysis) visualises variation in species data directly in relation to quantified environ-

mental variables. Before analysis, covariables may be introduced in RDA to compensate for systematic differences in experimental units. After RDA, a permutation test can be used to examine the significance of effects.

Page 59: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

REDUNDANCY ANALYSIS – CONSTRAINED PCA

Short (< 2SD) compositional gradientsLinear or monotonic responses Reduced-rank regressionPCA of y with respect to xTwo-block mode C PLSPCA of instrumental variables Rao (1964)

PCA - best hypothetical latent variable is the one that gives the smallest total residual sum of squares

RDA - selects linear combination of environmental variables that gives smallest total residual sum of squares

ter Braak (1994) Ecoscience 1, 127–140 Canonical community ordination Part I: Basic theory and linear methods

Page 60: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

RDA ordination diagram of the Dune Meadow Data with environmental variables represen-ted as arrows. The scale of the diagram is: 1 unit in the plot corresponds to 1 unit for the sites, to 0.067 units for the species and to 0.4 units for the environmental variables.

Page 61: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Redundancy analysis: canonical coefficients (100 x c) and intra-set correlations (100 x r) of environmental variables with the first two axes of RDA for the Dune Meadow Data. The environmental variables were standardized first to make the canonical coefficients of different environmental variables comparable. The class SF of the nominal variable “type of management” was used as reference class in the analysis.

Variable Coefficients CorrelationsAxis1 Axis2 Axis1 Axis2

A1 -1 -5 54 -6Moisture 15 9 92 12Use 5 -6 15 29Manure -8 16 -26 86SF - - 25 76BF -10 0 -48 -11HF -10 -2 -40 13NM -4 -13 51 -79

Page 62: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Axis 1 Axis 2

PCA % 29 21

RDA % 26 17

PCA Correlation 0.90 0.82

RDA Correlation 0.95 0.89

PCA and RDA comparisons

Important to do the check that the environmental variables relate to the major gradients in composition detected by the PCA.

Page 63: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Cosine of angle correlation

Long arrows of species and environmental variables most important

Euclidean distance biplotCovariance (correlation) biplotRDA covariance or correlation matrix of speciesRDA – constrained form of multiple regressionUses 2 (q + m) + m parameters (q env variables, m species)Multiple regression m (q + 1)e.g. 40 species 10 envir variablesRDA 140 parametersMR 440 parametersRDA is thus reduced rank regression (RR)

speciesunconstrained

Goodness of fit 1 + 2 sum of

eigenvalues constrained

fitted species

BIPLOT INTERPRETATION

Page 64: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Primary and secondary data tables in a typical community ecological study of species-environment relations. Indirect methods of ordination use the tables under (a). Direct methods also use the tables under (b). The primary data are the table of abundance values and the tables of values and class labels of quantitative and qualitative environmental variables (env. var), respectively. The secondary tables are named after the (dis)similarity coefficients they contain. The appropriate coefficients must be chosen by the ecologist. The coefficients shown in the figure are optimal when species-environment relations are linear.

Page 65: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Tables that can be displayed by two differently scaled biplots in principal components analysis (a) and redundancy analysis (b).

The sum of squares of site scores of an axis is equal to its eigenvalue in scaling 1, and equal to 1 in scaling 2. The sum of squares of species scores of an axis is equal to 1 in scaling 1 and equal to its eigenvalue in scaling 2. Tables in bold are fitted by

(weighted) least-squares.Biplot scaling 1: focus on sites 2: focus on species

distance biplot correlation biplot(a) principal components analysisspecies x sites abundances abundancessites Euclidean distances -species - correlationsa

(b) redundancy analysisspecies x sitesb fitted abundances fitted abundancessitesb Euclidean distancesc -species - correlationsa,c

Quantitative env. vars.:species x env. vars. d correlations correlationssites x env. vars. d - values of env. varsenv. vars. d effectse correlationsQualitative env. vars:species x env. classesf means meanssites x env. classesf g g

env. classesf Euclidean distances -env. vars. x env. classes - means

Page 66: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

a Automatic if abundance is standardised by species. If abundance is only centred by species, a post-hoc rescaling of the site scores is needed so as to account for the differences in variance amongst species. b Site scores are a linear combination of the environment variables instead of being a weighted sum of species abundances. c In the definition of this coefficient, abundance must be replaced by the fitted abundance. d Environmental scores are intraset correlations in scaling 2 and s

½ times those in scaling 1 with s the eigenvalue of axis . In CANOCO, the scores are termed biplot scores for environmental variables. e Effect of the environmental variable on the ordination scores, while neglecting the other environmental variables; length of arrow is the effect size, i.e. the variance explained by the variable. f Environmental classes are centroids of site points belonging to the class. g membership via centroid principle, not via the biplot rules.

REF

REF

REF

REF

Page 67: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

The scale marks along the axes apply to the species and quantitative environmental variables; the site scores and class scores were multiplied by 0.46 to fit in the coordinate system. The abbreviations are given in Jongman et al. (1987).The rule for interpreting a biplot (projection on an imaginary axis) is illustrated for the species Pla lan and sites 11 and 12.

Correlation biplot based on a redundancy analysis of the Dune Meadow Data displaying 43% of the variance in the abundances and 71% of the variances in the fitted abun-dances. Quantitative environ-ment variables are indicated by arrows. The qualitative variable Management type is indicated by the square points labelled SF, BF, HF, and NM. The displayed species are selected on the basis that more than 30% of their variance is accounted for by the diagram. Eigenvalues of the first three axes are 0.26, 0.17,and 0.07; the sum of all canonical eigenvalues is 0.61.

Page 68: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PROPOSED NEW SCALING FOR CCA AND RDA

Gabriel, K.R. (2002) Biometrika 89, 423-436

Symmetric scaling (3) of biplots preserves the optimal fit to the species data table and preserves the (proportional) fit of at least 95% of the inter-species correlations/distances and inter-sample distances. It is a very good compromise.

Only recommended (ter Braak, pers. comm.) to deviate from symmetric scaling if the focus of study is strongly on either species (scaling 2) or on samples (scaling 1).

Data table unaffected by scaling:

Species x sites Species data (PCA)Fitted species data (RDA)Relative species data (CA)Fitted relative species data (CCA)

Species x environmental variables Correlations of species (RDA)Optima (WA) of species (CCA)

Species x environmental classes Mean abundances of species (RDA)Relative abundances of species across

classes (CCA)

Page 69: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Data tables with 95% preservation of proportional fit:

Species x species Correlations (PCA, RDA)

Chi-square distances (CA, CCA)

Sites x sites Euclidean distances (PCA, RDA)

Chi-square distances (CA, CCA)

Env. classes x env. classes Euclidean distances (RDA)

Chi-square distances (CCA)

Env. variables x env. variables Correlations (RDA, CCA)

Sites x env. variables Values (RDA, CCA)

Sites x env. classes Means (RDA, CCA)

Env. variables x env. classes Mean values of env. variables (RDA, CCA)

Page 70: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

ALTERNATIVES TO ENVIRONMENTAL VECTORS IN CCA AND RDA

• Fitted vectors natural in constrained ordination, since these have linear constraints.

• Distant sites are different, but may be different in various ways: environmental variables may have a non-linear relation to ordination.

Contours Bubble plots GAM

J. Oksanen (2002)

Page 71: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Statistical significance of species-environmental relationships. Monte Carlo permutation tests. 

Randomly permute the environmental data, relate to species data ‘random data set’. Calculate eigenvalue and sum of all canonical eigenvalues (trace). Repeat many times (99).

 If species react to the environmental variables, observed test statistic (1 or trace) for observed data should be larger than most (e.g. 95%) of test statistics calculated from random data. If observed value is in top 5% highest values, conclude species are significantly related to the environmental variables.

STATISTICAL TESTING OF CONSTRAINED ORDINATION RESULTS

Page 72: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

STATISTICAL SIGNIFICANCE OF CONSTRAINING VARIABLES

• CCA or RDA maximise correlation with constraining variables and eigenvalues.

• Permutation tests can be used to assess statistical significance:

- Permute rows of environmental data.

- Repeat CCA or RDA with permuted data many times.

- If observed higher than (most) permutations, it is regarded as statistically significant.

J. Oksanen (2002)

Page 73: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

e.g. pollution effects

seasonal effects COVARIABLES

 

Eliminate (partial out) effect of covariables. Relate residual variation to pollution variables.

Replace environmental variables by their residuals obtained by regressing each pollution variable on the covariables.

Analysis is conditioned on specified variables or covariables. These conditioning variables may typically be 'random' or background variables, and their effect is removed from the CCA or RDA based on the 'fixed' or interesting variables.

PARTIAL CONSTRAINED ORDINATIONS

(Partial CCA, RDA, etc)

Page 74: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Natural variation due to sampling season and due to gradient from fresh to brackish water partialled out by partial CCA.

Variation due to pollution could now be assumed.

Ordination diagram of a partial canonical correspond-ence analysis of diatom species (A) in dykes with as explanatory variables 24 variables-of-interest (arrows) and 2 covariables (chloride concentration and season). The diagram is symmetrically scaled [23] and shows selected species and standardized variables and, instead of individual dykes, centroids (•) of dyke clusters. The variables-of-interest shown are: BOD = biological oxygen demand, Ca = calcium, Fe = ferrous compounds, N = Kjeldahl-nitrogen, O2 = oxygen, P = ortho-phosphate, Si= silicium-compunds, WIDTH = dyke width, and soil types (CLAY, PEAT). All variables except BOD, WIDTH, CLAY and PEAT were transformed to logarithms because of their skew distribution. The diatoms shown are: Ach hun = Achnanthes hungarica, Ach min = A. minutissima, Aph cas= Amphora castellata Giffen, Aph lyb = A. lybica, Aph ven = A. veneta, Coc pla = Cocconeis placentulata, Eun lun = Eunotia lunaris, Eun pec = E. pectinalis, Gei oli = Gomphoneis olivaceum, Gom par = Gomphonema parvulum, Mel jur = Melosira jürgensii, Nav acc = Navicula accomoda, Nav cus = N. cuspidata, Nav dis = N. diserta, Nav exi = N. exilis, Nav gre = N. gregaria, Nav per = N. permitis, Nav sem = N. seminulum, Nav sub= N. subminuscula,Nit amp = Nitzschia amphibia, Nit bre = N. bremensis v. brunsvigensis, Nit dis = N. dissipata, Nit pal = N. palea, Rho cur = Rhoicosphenia curvata.(Adapted from H. Smit, in prep)

PARTIAL CCA

Page 75: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

There can be many causes of variation in ecological or other data. Not all are of major interest. In partial ordination, can ‘factor out’ influence from causes not of primary interest. Directly analogous to partial correlation or partial regression. Can have partial ordination (indirect gradient analysis) and partial constrained ordination (direct gradient analysis). Variables to be factored out are ‘COVARIABLES’ or ‘CONCOMITANT VARIABLES’. Examples are: 1) Differences between observers.

2) Time of observation.

3) Between-plot variation when interest is temporal trends within repeatedly sampled plots.

4) Uninteresting gradients, e.g. elevation when interest is on grazing effects.

5) Temporal or spatial dependence, e.g. stratigraphical depth, transect position, x and y co-ordinates. Help remove autocorrelation and make objects more independent.

6) Collecting habitat – outflow, shore, lake centre.

7) Everything – partial out effects of all factors to see residual variation in data. Given ecological knowledge of sites and/or species, can try to interpret residual variation. May indicate environmental variables not measured, may be largely random, etc.

PARTIAL ORDINATION ANALYSIS(Partial PCA, CA, DCA)

Page 76: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

e.g. partial out the effects of some covariables prior to indirect gradient analysis 

within-plot change PRIMARY INTERESTbetween-plot differences NOT OF INTEREST

 Partial plot identity, ordination of residual variation, i.e. within-plot change. e.g. Swaine & Greig-Smith (1980) J Ecol 68, 33–41 Bakker et al. (1990) J Plankton Research 12, 947–972  

PARTIAL ORDINATIONS

Page 77: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Background variables or 'covariables'

Partial CCAPartial RDA

Vegetation

Environmental variables or 'constraints'

Vegetation (residual)

CCARDA

CADCAPCA

Vegetation (residual)

"Nuisance" variables or other background factors can be removed before studying interesting factors. Partial CCA or partial RDA.

Permutation tests are for environmental variables only.

Residual variation can be analysed at any level. Can partition the variance.

Final residual shows what you cannot explain with available environmental variables.

Interpretation of final residual based on other correlates and/or ecological knowledge.

COVARIABLES IN CCA AND RDA

Page 78: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PARTITIONING VARIATION

ANOVA total SS = regression SS + residual SS

Two-way ANOVA between group (factor 1)

+ between treatments (factor 2)

+ interactions

+ error component

Borcard et al. (1992) Ecology 73, 1045–1055

Variance or variation decomposition into 4 components

Important to consider groups of environmental variables relevant at same level of ecological relevance (e.g. micro-scale, species-level, assemblage-level, etc.).

Variation = variance in RDA

Variation = inertia in CCA = chi-square statistic of data divided by the data’s total = sum of all eigenvalues of CA

Page 79: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Total inertia = total variance 1.164Sum canonical eigenvalues = 0.663 57%Explained variance 57%Unexplained variance = T – E 43% 

What of explained variance component? 

Soil variables (pH, Ca, LOI)Land-use variables (e.g. grazing, mowing)Not independentDo CCA/RDA using

1) Soil variables only canonical eigenvalues0.5212) Land-use variables only canonical eigenvalues0.5033) Partial analysis Soil Land-use covariables 0.1604) Partial analysis Land-use Soil covariables 0.142a) Soil variation independent of land-use (3) 0.160 13.7%b) Land-use structured (covarying) soil variation (1–3) 0.36131%c) Land-use independent of soil (4) 0.142 12.2%

Total explained variance 56.9%d) Unexplained 43.1%

unexplaineduniqueuniquecovariance

a b c d

CANOCO

Page 80: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Qinghong & Bråkenheim, (1995) Water, Air and Soil Pollution 85, 1587–1592 Three sets of predictors – Climate (C), Geography (G) and Deposition of Pollutants (D) Series of RDA and partial RDA

Predictors Covariables Sum of canonical G+C+D - 0.811D G+C 0.027G+C - 0.784G+C D 0.132D - 0.679Joint effectDG+C=0.784-0.132=0.679-0.027=0.652C D+G 0.106G+D - 0.706G+D C 0.074C - 0.737Joint effectCD+G=0.737-0.106=0.706-0.074=0.631

0.811

0.811

0.812

0.811

VARIATION PARTITIONING OR DECOMPOSITION WITH 3 OR MORE SETS OF

PREDICTOR (EXPLANATORY) VARIABLES

Page 81: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Predictors Covariables Sum of canonical G D+G 0.034D+C - 0.777D+C G 0.228G - 0.538Joint effectGD+C=0.777-0.228=0.538-0.034=0.549

0.811

0.811

Canonical eigenvaluesAll predictors 0.811Pure deposition 0.027 PDPure climate 0.106 PCPure geography 0.034 PGJoint G + C 0.132Joint G + D 0.074Joint D + C 0.228Unexplained variance 1 – 0.811 = 0.189

PD

DGCD

CG

CDG

PGPC

D

GC

Covariance terms

CDDGCGCDG

Page 82: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CD + DG + CDG = 0.652

CD + CG + CDG = 0.631

DG + CG + CDG = 0.549

PD + PC + CD = 0.027 + 0.106 + CD = 0.777 – 0.549 = 0.228

PD PC (D+C) (DG + CG + CDG)

PD + PG + DG = 0.027 + 0.034 + DG = 0.706 – 0.631 = 0.074

PD PG (G+D) (CD + CG + CDG)

PC + PG + CG = 0.106 + 0.034 + CG = 0.784 – 0.652 = 0.132

PC PG (G+C) (CD + DG + CDG)

CD = 0.095 DG = 0.013 CG = –0.008

CDG = 0.652 – 0.013 – 0.095 = 0.544

= 0.631 – (–0.008) – 0.095 = 0.544

= 0.054 – (–0.008) – 0.013 = 0.544

Page 83: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Total explained variance 0.811 consists of: Common climate + deposition 0.095 Unique climate PC 0.106Common deposition + geography 0.013 Unique geography PG 0.034Common climate + geography 0.008 Unique deposition PD 0.027Common climate + geography + deposition 0.544 Unexplained variance 0.189

See also Qinghong Liu (1997) – Environmetrics 8, 75–85Anderson & Gribble (1998) – Australian J. Ecology 23, 158-167 Total variation: 1)  random variation 2)  unique variation from a specific predictor variable or set of predictor variables 3)  common variation contributed by all predictor variables considered together and in all possible combinations

Usually only interpretable with 2 or 3 'subsets' of predictors.

In CCA and RDA, the constraints are linear. If levels of the environmental variables are not uncorrelated (orthogonal), may find negative 'components of variation'.

Page 84: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

'NEGATIVE' VARIANCES

In variance partitioning, the groups of predictor variables used should be non-linearly independent for unbiased partitioning or decomposition.

If the groups of variables have polynomial dependencies, some of the variance components may be negative. Negative variances are, in theory, impossible.

High-order dependencies commonly arise with high numbers of variables and high number of groups of variables.

Beware of inter-relationships between predictor variables and between groups of predictors. Problem common to all regression-based techniques, including (partial) CCA or RDA.

Careful model selection (minimal adequate model) is essential for many purposes, including variance partitioning.

REF

REF

REF

REF

Page 85: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

UNBIASED ESTIMATES OF VARIANCE COMPONENTS

Peres-Neto et al. 2006 Ecology 87: 2614-2615

Legendre 2007 Journal of Plant Ecology (in press)

[a] [b] [c]

Variation explained by X

Variation explained

by Z

Unexplained (residual) variation = [d]

Variation in response variable y or Y

=

Rectangle = 100% of the variation in y or Y (response variable(s))

Fraction b is the intersection (not interaction) or covariance of the amounts of variation explained by linear model of X and Z

Page 86: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Partial linear regression

y ~X|Z – partial linear regression of response variable y on predictor variables X (m variables) whilst controlling for the linear effect of Z containing q covariables

Page 87: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

([d] (n – 1 – m – q))

Partial linear regression y = x + covariables z

Partial R2y~X|Z = SS (fitted values of y~X|Z)

(SS (fitted values) + SS (residuals))

F statistic used to test significance of partial R2 takes into account number of covariables q; in ordinary multiple regression q = 0

R2y~X|Z = [a] ([a] +

[d])

F = (R2partial m)

((1 – R2partial) (n – 1 – m –

q))

F = ([a] m)

m = number of predictor variables, n = number of objects

Also compute y~Z|X as well as y~X|Z

Page 88: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Unadjusted and adjusted coefficients of determination R2

R2 = Regression SS = 1 – Residual SSTotal SS Total SS

(A)

R2adj = 1 – Residual mean square = 1 – (1 – R2) Total

df Total mean square Residual df(B)

R2adj takes into account numbers of degrees of

freedom associated with numerator and denominator in equation A

In ordinary multiple regression total df = (n – 1) and residuals df = (n – m - 1) where n = number of observations and m = number of explanatory variables

Page 89: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Partial canonical analysis

In RDA, canonical R2 is ratio of the sum of each response variable’s regression (or fitted values) SS to the sum of all response variables’ total SS.

Significance of F tested by permutation with (m x p) and p (n – m – 1) degrees of freedom where p = number of response variables

Can calculate R2adjusted (equation B) to produce unbiased

estimates of the real contributions of variables in X to the explanation of the response matrix Y.

Page 90: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

XYX Rmn

nR |

2adj)|Y(

2 11

11

)1(

11

11 |

2XYR

nm

where m = number of predictor variables in matrix X

Page 91: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Variation partitioning of Y with respect to X and W, two sets of explanatory variables

Canonical analysis

R2 R2adj * and fractions Can test

significance

Y~X R2 of Y~X [a+b]=Ra2 of Y~X Yes

Y~W R2 of Y~W [b+c]=Ra2 of Y~W Yes

Y~(X,W)

R2 of Y~(X,W)

[a+b+c]=Ra2 Yes

[a]=[a+b]-[b] Yes

[b]=[a+b]+[b+c]-[a+b-c]

No

[c]=[b+c]-[b] Yes

Residuals = [d]=1-[a+b+c]

No

* Calculated as 1-(1-R2)(Total df/Residual df) = 1 – Residual mean square/Total mean square

varpart in R vegan

Page 92: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Significance of fractions a and c tested by ‘permutation of the residuals’ option in CANOCO

Fraction [a]

Fraction [c]

where m = number of explanatory variables in predictor set X q = number of explanatory variables in predictor set W

Significance of fractions [b] (covariance) and [d] (residuals) cannot be assessed statistically

)1(][

][qmnd

maF

))1(]([

][qmnd

mcF

Page 93: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Oribatid mite data: 35 species, 70 samples, 5 habitat variables, and spatial descriptors

Fractions Unadjusted

Adjusted

Difference

Habitat [a] 0.136 * 0.091 * ns

Shared habitat and space [b]

0.391 0.345 -

Spatial [c] 0.232 * 0.101 * ns

Residual [d] 0.241 0.462 -

varpart in R vegan

Page 94: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Representation of components

Fraction [a] dot maps of ‘bubble’ lots

Fraction [b] interpolation maps

Fraction [c] interpolation maps (e.g. kriging)

Fraction [d] maps either dot or interpolated

Can be expanded to 3 or 4 tables of explanatory variables. Gets very complex!

Applicable to CCA but algebra even more complicated.

Page 95: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

ENVIRONMENTAL CONSTRAINTS AND CURVATURE IN ORDINATIONS

• Curvature often cured because axes are forced to be linear combination of environmental variables (constraints).

• High number of constraints = no constraint.

• Absolute limit: number of constraints = min (M, N) - 1, but release from the constraints can begin much earlier.

• Reduce environmental variables so that only the important remain: heuristic value better than statistical criteria.

• Reduces multicollinearity as well.

J. Oksanen (2002)

Page 96: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Classification of gradient analysis techniques by type of problem, response model and method of

estimation Method of estimation Type of problem

Linear Least Squares

Maximum Likelihood

Unimodal Weighted Averaging

Regression Multiple regression

Gaussian regression

Weighted averaging of site scores

Calibration

Linear calibration 'inverse regression'

Gaussian calibration

Weighted averaging of species scores (WA)

OrdinationPrincipal components analysis (PCA)

Gaussian ordination

Correspondence analysis (CA); detrended correspondence analysis (DCA)

Constrained ordination1

Redundancy analysis (RDA)4

Gaussian canonical ordination

Canonical correspondence analysis (CCA); detrended CCA

Partial ordination2

Partial component analysis

Partial Gaussian ordination

Partial correspondence analysis; partial DCA

Partial constrained ordination3

Partial redundancy analysis

Partial Gaussian canonical ordination

Partial canonical correspondence analysis; partial detrended CCA

1 = constrained multivariate regression2 = ordination after regression on covariables3 = constrained ordination after regression on covariables = constrained partial multivariate regression4 = 'reduced rank regression' = “PCA of y with respect to x”

Page 97: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1) Choice can greatly influence the results. Fewer the environmental variables, the more constrained the ordination is.

2) Possible to have one only – can evaluate its explanatory power.

3) Can always remove superfluous variables if they are confusing or difficult to interpret. Can often remove large number without any marked effect. Remember post-hoc removal of variables is not valid in a hypothesis-testing analysis.

4) Linear combinations – environmental variables cannot be linear combinations of other variables. If a variable is a linear combination of other variables, singular matrix results, leads to analogous process of dividing by zero.

  Examples:  -    total cations, Ca, Mg, Na, K, etc. Delete total cations

- % clay, % silt, % sand-    dummy variables (granite or limestone or basalt)

5) Transformation of environmental data – how do we scale environmental variables in such a way that vegetation ‘perceives’ the environment? Need educated guesses.

  Log transformation usually sensible – 1 unit difference in N or P is probably more important at low concentrations than at high concentrations.

As statistical significance in CANOCO is assessed by randomisation tests, no need to transform data to fulfil statistical assumptions.

  Transformations useful to dampen influence of outliers.

  Environmental data automatically standardised in RDA and CCA.

ENVIRONMENTAL VARIABLES IN CONSTRAINED ORDINATIONS

Page 98: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

6) Dummy variables – factors such as bedrock type, land-use history, management, etc, usually described by categorical or class variables. 1 if belongs to class, 0 if it does not. For every categorical variable with K categories, only need K – 1 dummy variables e.g.

Granite Limestone Basalt GabbroPlot 1 1 0 0 0

2 0 1 0 03 1 0 0 04 0 1 0 05 0 0 1 06 0 0 0 1

7) Circular data – some variables are circular (e.g. aspect) and large values are very close to small values. Aspect – transform to trigonometric functions.

  northness = cosine (aspect)eastness = sine (aspect)

  Northness will be near 1 if aspect is generally northward and –1 if southward. Close to 0 if west or east.Alternatively for aspect

southness = 180 - |aspect - 180| (S = 180, N =0)westness = |180 - |aspect - 270|| (W = 180, E = 0)

  Day of year – usually not a problem unless dealing with sampling over whole year. Can create ‘winterness’ and ‘springness’ variables as for aspect.

Page 99: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

8) Vegetation-derived variables – maximum height, total biomass, total cover, light penetration, % open ground can all be ‘environmental’ variables. Such variables SHOULD NOT BE USED in hypothesis testing, as danger of circular reasoning.

9) Interaction terms – e.g. elevation * precipitation. Easy to implement, difficult to interpret. If elevation and precipitation interact to influence species composition, easy to make term but the ecological meaning of where in environmental space the stands or species are is unclear. Huge number of possibilities N variables ½ N (N – 1) possible interactions. 5 variables 10 interactions.

  AVOID quadratic terms [e.g. pH * pH (pH2) (cf. multiple regression and polynomial terms)]. Can create an ARCH effect or warpage of ordination space.

  Try to avoid interaction terms except in clearly defined hypothesis-testing studies where the null hypothesis is that ‘variables c and d do not interact together to influence the species composition’.

  For interaction to be significant, eigenvalue 1 of the analysis with product term should be considerably greater than 1 when there is no product term and the t-value associated with the product term should be greater than 2 in absolute value.

  Avoid product variables to avoid ‘data dredging’.

Page 100: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1) The fewer the environmental variables, stronger the constraints are.

2) With q (number of samples – 1) environmental variables, the analysis is unconstrained.

3) Small numbers of environmental variables may remove any arch effect.

4) Want to try to find MINIMAL ADEQUATE SET of environmental variables that explain the species data about as well as the FULL SET.

5) Automatic selection (e.g. forward selection) can be dangerous:

a) Several sets can be almost equally good. Automatic selection finds one but may not be the best.

  b) Selection order may change the result and important variables may not be selected.

  c) Small changes in the data can change the selected variables. Difficult to draw reliable conclusions about relative importance of variables. Omission of a variable does not mean it is not ecologically important.

6) If you are lucky, there may only be one minimal adequate model but do not assume that there is only one such model.

7) How do we go about finding a minimal adequate model or set of environmental variables?

SELECTING ENVIRONMENTAL VARIABLES IN CON-STRAINED ORDINATION ANALYSIS

(e.g. CCA, RDA)

Page 101: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1) Start with all explanatory variables in the analysis, FULL MODEL. Consider sum of canonical eigenvalues (amount of explained variance), eigenvalues and species-environmental correlations.

2) Try to simplify full model by deleting variables but not reducing the model performance. May be impossible to remove variables without some loss of information.

Deletion criteria:

a) Deletion on external criteria – variables not relevant.

  b) Deletion on correlation structure – variables may be highly correlated (e.g. pH, Ca, Mg, CEC). Any one could be used as a proxy for the others. Best to choose the one that is likely to be the most direct cause of vegetation response.

  Can do a PCA of environmental variables to explore correlation structure of variables.

  c) Interpretability – variables with short arrows.

  d) Non-significant – delete any that are non-significant (p > 0.05) in analysis with one environmental variable only in CCA or RDA.

  e) Ecological importance

  f) Stepwise analysis – forward selection, add one variable at a time until no other variables ‘significantly’ explain residual variation in species data.

3) Final selection must be based on ecological and statistical criteria. The purpose of numerical data analysis is 'INSIGHT', not complex statistics!

Page 102: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

WHAT IS DONE?

 1) CCA (or RDA) is performed on each variable separately and marginal effects are listed in order.

 2) Select the variable with largest marginal effect (= eigenvalue) and test its statistical significance by unrestricted Monte Carlo permutation tests and 999 permutations. Accept if p < 0.05.

 3) This variable is now used as a covariable and the variables are now listed in order of their conditional effects (i.e. variance explained when allowing for effects of variable one selected). Evaluate its statistical significance and apply Bonferroni-type correction for simultaneous multiple tests,

namely 1 = /t where t =number of tests.

For = 0.05 t = 1, 1 = 0.05

t = 2, 1 = 0.025

t = 3. 1 = 0.0166

t = 4, 1 = 0.0125

With 999 permutations (i.e. p of 0.001 can be evaluated), becomes very slow. Required if you are to properly evaluate the Monte Carlo permutation probabilities.

These tests do not control for overall Type I error. In practical terms this means that too many variables will be judged ‘significant’.

Alternatively, can stop when the increase in fit when including a variable is less than 1.0% (EXTRA FIT).

Page 103: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

MINIMAL ADEQUATE MODEL IN CCA

J. Oksanen (2002)

13 environmental variables 3 environmental variables

Page 104: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

OTHER PROBLEMS

 1) Selection of categorical variables coded as dummy variables. Suppose there are 5 bedrock types but only ‘granite’ is selected by forward selection. Should you select the other variables as well? If you consider the different bedrock types to be independent, the answer is NO. If you consider there to be one categorical variable (bedrock) with five states, the answer is YES.

 2) Last two remaining variables within a category will always have identical fit because they contain identical information (if it is not z then it must be y). Does not matter which you choose. Select the commoner category.

 3) No guarantee that forward selection (or any other stepwise procedure) will result in ‘best’ set of variables. Only way is perform constrained ordinations for every conceivable combination of variables. Currently impossible with current technology.

 4) Accept that minimal adequate model is one possible solution only.

 5) For exploratory, descriptive studies, do not be reluctant to use a priori ecological knowledge.

Page 105: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1 qncj / varianceresidual VIFVar

211

jRVIF

Variance of estimated regression (= canonical) coefficients (cj) are proportional to their VIF.

number of predictors

number of samples

VIF is related to the (partial) multiple correlation coefficient Rj between variable j and the other environmental variables.

If VIF > 20, that variable is almost perfectly correlated with other variables and has no unique contribution to the regression equation. Regression (= canonical) coefficient unstable, not worth considering.

Useful for finding minimal set of variables.

Not unique, e.g. pH and Ca (and other variables).

  VIF VIF

pH 123.8 6.4

Ca 2.6 –

Mg 8.1 29.3

VARIANCE INFLATION FACTORS (VIF)

Page 106: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

'AIC' FOR MODEL SELECTION IN CCA AND RDA

Jari Oksanen (2004) VEGAN 1.7-6 R deviance.ccadeviance.rda

Find statistics in CCA and RDA that resemble deviance and assess an AIC-like statistic as in regression model building.

Deviance of CCA = chi-square of the residual data matrix after fitting the constraints.

Deviance of RDA = average residual variance per species.

Can be used to help select between possible models in CCA or RDA.

AIC - index of fit that takes account of the parsimony of the model by penalising for the number of parameters.

smaller the values, better the fit.

here equals the residual deviance + 2x number of regression (canonical) coefficients fitted.

Page 107: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

STAGES IN 'AIC' MODEL SELECTION IN CCA AND RDA

1. Define a null model into which variables are sequentially added in order of their statistical importance. Null model is unconstrained PCA or CA.

2. Now do stepping by either a forward selection of a backward elimination of the predictor variables. Need to define an upper and lower scope for the stepping to occur within.

Forward selection – lower scope = null model (no predictors)

- upper scope = full model (all predictors included)

Backward elimination – lower scope = full model

- upper scope = null model

Page 108: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

3. At each step, the effect of adding or deleting a variable is evaluated in terms of the AIC criterion. Low AIC values are to be preferred.

4. If a lower AIC can be achieved by adding or deleting a variable at a stage, then this predictor variable is added/deleted.

5. Useful to use both backward elimination and forward selection at each step. Start with full model, eliminate first variable, then the next, try to add either variable back into the model, and so on.

6. After the final model is derived (lowest AIC), can test this model to see if the effects of the constraining predictor variables are statistically significant. Use Monte Carlo permutation test under the reduced model.

Page 109: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

'AIC' MODEL SELECTION

Godínez-Domínguez & Freire 2003 Marine Ecology Progress Series 253, 17-24

1. Definition of set of a priori models

2. Statistical fit of models to data (e.g. CCA)

3. Selection of 'best' model – Akaike Information Criterion (AIC)

AIC =

where k = number of free parameters in the model

= model maximum likelihood

Rather than a statistical test of one null hypothesis, AIC provides a methodology for selecting an a priori set of alternative hypotheses.

kL 2ˆlog2

L

Page 110: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

From estimated residual sum of squares (RSS) in CCA

where h = number of predictor variables in model,

where log = loge

n = sample size, and

= RSS/n

h

ii

1

)traceseigenvalue allof sum(

)ˆlog(21)ˆ(log 2 nL

2

Page 111: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

To avoid bias in AIC due to links between sample size and number of parameters, corrected AIC is

1)1(2

AICAICc KnKK

As in GLM, interested in differences in AIC between models

i = AICci – min AICc

Page 112: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Data – 5 cruises (DEM-1 – DEM-5)

- 8 models

Godinez-Dominguez & Freire, 2003

Page 113: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA permutation tests 40 models

27 p < 0.05 (global test) 20 p < 0.05 (first CCA axis)

AICc approach to find most parsimonious model

Page 114: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Can determine not only the 'best' model but rank the different underlying hypotheses according to AIC criteria of parsimony. Spatial models 2 and 5, namely depth stratification but no difference between sheltered and exposed stations, are most appropriate for these data.

Page 115: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Standard linear technique for relating two sets of variables.

Similar to RDA – assumes linear response model.

Selects canonical coefficients for species and environmental variables to MAXIMISE species – environmental correlation canonical correlation

1

RDA species scores are simply weighted sums of site scores kikybx1

CANCOR species scores are b parameters estimated by multiple regression of site scores on species variables

number of species << number of sites

In fact number of species + number of environmental variables must be smaller than number of sites

i.e. m must be < n – q

CANCOR biplot

Differs from RDA also in error component.

van der Meer (1991) J. Expl. Mar. Biol. Ecol. 148, 105–120

CANONICAL CORRELATION ANALYSIS – CANCOR

REF

REF

REF

Page 116: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

RDA uncorrelated independent errors with equal variances (least-squares technique).

CANCOR correlated normal errors (maximum likelihood technique). Is in realty, a GLM.

residual correlations between errors are additional parameters in CANCOR. Many species. Cannot estimate them reliably from data from few sites.

Generalised variance minimised in CANCOR = product of eigenvalues of matrix ‘volume’ of hyperellipsoid.

Total variance minimised in RDA = sum of diagonal elements = sum of eigenvalues.

Linear transformation Non-linear transformation Linear transformation

of species data of species data of environmental data

CA, PCA affects results no effect

CCA, RDAaffects results no effect

CANCOR no effect no effect

REF

REF

REF

REF

Page 117: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

McKecknie et al. (1975) Genetics 81, 571–594

Genetic variability in 21 colonies of butterfly Euphydryas editha in relation to 11 environmental variables in California and Oregon.

Gene frequencies of phosphoglucose-isomerase (Pgi) determined by electrophoresis.

What are relations between Pgi frequencies and environment?

Canonical correlation analysis7701 . 5602 . 1703 . Not

significantCan corr 0.88 0.75 0.41 Small sample size

Leaving aside significance, first CC contrast between percentage of 0.4 mobility genes and other genes and precipitation and low max temperatures.

 i.e. 0.4 mobility gene lacking in colonies from areas of high precipitation and low max temperatures.

 Frequency of 1.0 mobility genes high in colonies at high altitudes and high precipitation and low temperatures.

COMPARISON WITH CANONICAL CORRELATION ANALYSIS

Page 118: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Environmental variables and phosphoglucose-isomerase Pgi gene frequencies for colonies of the butterfly Euphdryas editha in California and Oregon.

Data source: McKechnie et al. (1975). Environmental variables have been rounded to integers for simplicity. The original data were for 21 colonies. For the present example, five colonies with small samples for the estimation of gene frequencies have been excluded so as to make all of the estimates about equally reliable.

Colonies of Euphydryas editha in California and Oregon

Page 119: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Canonical correspondence analysis

1 = 0.16 2 = 0.07 Pgi

Pgi 0.4 0.6 0.8 opposite to latitude, i.e. southerly

1 2 3

Pgi 1.0 high max temp, high min temp

4

Pgi 1.16 high pptn

5

Pgm 1.4 ?

6

Page 120: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Ordination diagram based on canonical correspondence analysis of allele frequencies in the butterfly Euphdryas editha in the 21 colonies () with respect to 11 environmental variables (data from McKechnie et al. 1975). The alleles () from the three loci Pgm,Pgi and Hk, are numbered in order of increasing mobility class. Infrequent alleles are displayed by a dot. The environmental variables (arrows) are: altitude; latitude; annual precipitation (Annuprec), annual maximum and minimum temperature (Max-temp, Min-temp); highest and lowest temperature in posi-diapause (H-Posdia,L-Posdia), the adult phase (H-Adul, L-Adul) and the pre-diapause (H-predia, L-predia). The abbreviations of colony names follow Manly (1985). ( = 0 0.16, = 0.07, scaling = ½).

Page 121: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

• Canonical Correlation Analysis (CANCOR)

• Continuous environmental variables and vegetation

• Can be computed only if number of sites > number of species + number of env. vars +1

• Redundancy analysis (RDA)

• As CANCOR but assumes that error variance constant for all plant species

• Technically possible to estimate in vegetation data, unlike CANCOR

• Canonical Variates Analysis (CVA) or Discriminant Analysis – see lecture 9

• Predict class membership using continuous variables

• For instance, pre-determined vegetation type using vegetation data

• LC score shows the centroids, Weighted sum scores show the dispersion and overlap

CONSTRAINED LINEAR ORDINATION (PCA FRAMEWORK)

Page 122: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

DISTPCOA Pierre Legendre & Marti Anderson (1999) Ecol. Monogr. 69, 1-24. RDA but with any distance coefficient RDA - Euclidean distance Absolute abundances Quantity dominated CCA - chi-square metric Relative abundances Shape/composition

dominated

Does it matter?Total biomass or cover and species compositionVarying e.g. ridge snow bed gradient

Other dissimilaritiesBray & Curtis non-Euclidean semi-metricJaccard +/- non-Euclidean semi-metricGower mixed data non-Euclidean semi-metric

Basic ideaReduce sample x sample DC matrix (any DC) to principal co-ordinates (principal co-ordinates analysis, classical scaling, metric scaling – Torgerson, Gower) but with correction for negative eigenvalues to preserve distances.

PCoA – embeds the Euclidean part of DC matrix, rest are negative eigenvalues for which no real axes exist. These correspond to variation in distance matrix, which cannot be represented in Euclidean space. If only use positive eigenvalues, RDA gives a biased estimate of the fraction of variance in original data.

DISTANCE-BASED REDUNDANCY ANALYSIS

Page 123: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Correction for negative eigenvalues

 where c1 is equal to absolute value of largest negative eigenvalue of matrix used in PCoA 1

 D

1

 

Use all principal co-ordinate sample scores (n - 1 or m, whichever is less) as RESPONSE (species) data in RDA. Use dummy variables for experimental design as predictors in X in RDA.

Now under framework of RDA and battery of permutation tests, can analyse structured experiments but WHOLE ASSEMBLAGE (cf. MANOVA but where m >N).

Now can test null hypothesis (as in MANOVA) that assemblages from different treatments are no more different than would be expected due to random chance at a given level of probability. BUT unlike non-parametric tests (ANOSIM, Mantel tests), can test for interactions between factors in multivariate data but using any DC (not only Euclidean as in ANOVA / MANOVA). Using permutation tests means we do not have to worry about multivariate normality or homogeneity of covariance matrices within groups, or abundance of zero values as in ecological data.

DISTPCoA www.umontreal.ca

jiforcdd ijij 501

22 .)('

2

21

ijij da

..aaaa jiijij

Page 124: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Raw dataRaw data(replicates x species)(replicates x species)

Distance matrixDistance matrix(Bray-Curtis, etc)(Bray-Curtis, etc)

Principal coordinate analysis(PCoA)

Correction for negative eigenvalues

Matrix YMatrix Y(replicates x (replicates x

principal coordinates)principal coordinates)

Matrix XMatrix X(dummy variables (dummy variables

for the factor)for the factor)

Test of one factorin a single-factor model

Redundancy analysis (RDA)F# statistic

Partial redundancy analysis (partial RDA) F# statistic

Matrix YMatrix Y(replicates x (replicates x

principal coordinates)principal coordinates)

Matrix XMatrix X(dummy (dummy variables variables

for the for the interaction)interaction)

Matrix XMatrix XCC

(dummy (dummy variables variables

for the main for the main effects)effects)

Test of F# by permutation under the full model

Test of F# by permutation

Test of interaction term in multifactorial model

Page 125: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Correspondence between the various components of the univariate F-statistics and the multivariate RDA statistics in the one-factor case.

Univariate ANOVA Multivariate RDA statistic

Total sum of squares sum of all eigenvalues of Y

Treatment sum of squares = SSTr trace = sum of all canonical eigenvalues of Y on X

Treatment degrees of freedom = dfTr q

Residual sum of squares = SSRes rss = sum of all eigenvalues – trace

Residual degrees of freedom = dfRes nT – q – 1

Treatment mean square = SStr/dfTr = MSTr trace/q

Residual mean square SSRes/dfRes = MSRes rss/(nT – q - 1)

Res

Tr

MSMSF

)1/(rss/trace#

qnq

FT

Tr = treatment, q = number of linearly independent dummy variables, nT = number of replicates

REF

REF

REF

REF

Page 126: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

  PERMUTATION TESTS in RDA (also CCA) in CANOCO. What is shuffled?

 

Y = Z B + X C + E 

B & C - unknown but fixed regression coefficients

responsescovariables

random error

predictors

(Note Z = covariables and X = predictors here)

Placing non-metric distances into Euclidean space first, then use ANOVA/MANOVA within RDA with permutation tests-.

Builds on ANOVA as form of multiple regression with orthogonal dummy variables as predictors.

MATCH between ANOVA statistics and RDA statistics.

Page 127: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

To test H0 C = 0 (i.e. the effect of X on Y)

 1. Permute rows of Y

2. Permute rows of X (env. data) CANOCO 2

3. Permute residuals Er of regression Y on Z (covariables)

REDUCED MODEL OR NULL MODEL CANOCO 3 & 4

4.  Permute residuals Ef of regression Y on Z and X (covariables and predictors)

FULL MODEL CANOCO 3 & 4 1 & 2 DESIGN-BASED PERMUTATIONS

1 Wrong type I error, low power

2 OK but no basis for testing interaction effects

3 & 4 MODEL-BASED PERMUTATIONS

3. Permute residuals of Y on Z (covariables) Default in CANOCO 3 & 4

Reduced model – maintains type I error in small data sets. Without covariables, gives exact Monte Carlo significance level. Retains structure in X and Z.

4. Permute residuals of Y on X and Z. Full model. Gives lower type II error, but only slightly so.

 (If no covariables Y = XC + E, does not matter if samples in Y or X are permuted. CANOCO permutes X)

Page 128: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

In DISTPCOA, do RDA with Y as principal co-ordinates scores, X defines dummy variables to code for interaction terms, and Z defines dummy variables for main effects (covariables) if interested in interactions. Can determine components of variation attributable to individual factors and interaction terms as in a linear model for multivariate data BUT using any DC that integrates both quantities and composition.

Can test the significance of individual terms and interaction terms for any complex multi-factorial experimental design. Cannot be applied to unbalanced data. If unbalanced because of missing or lost values, use missing data replacements (if other replicates).

Page 129: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Distance-based RDA offers special advantages to ecological researchers not shared by any other single multivariate method. These are:

1 The researcher has the flexibility to choose an appropriate dissimilarity measure, including those with semi-metric qualities, such as the Bray-Curtis measure

2 PCoA puts the information on dissimilarities among the replicates into a Euclidean framework which can then be assessed using linear models

3 A correction for negative eigenvalues in the PCoA, if needed, can be done such that probabilities obtained by a permutation test using the RDA F# -statistic are unaffected (correction method 1)

4 By using the multiple regression approach to analysis of variance, with dummy variables coding for the experimental design, RDA can be used to determine the components of variation attributable to individual factors and interaction terms in a linear model for multivariate data

5 Multivariate test statistics for any term on a linear model can be calculated, with regard to analogous univariate expected mean squares

6 Statistical tests of multivariate hypothesis using RDA statistics are based on permutations, meaning that there is no assumption of multinormality of response variables in the analysis. Also, there are no restrictions to the number of variables that can be included in RDA

7 Permutations of residuals using the method of ter Braak (1992) allows the permutation test to be structured precisely to the hypothesis and the full linear model of the design under consideration

8 The significance of multivariate interaction terms can be tested

Shares characteristics with:MAN- RDA ANO- MAN-OVA SIM TEL

* *

* *

* *

* * *

*

* *

Page 130: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

DISTANCE-BASED MULTIVARIATE ANALYSIS FOR A LINEAR MODEL

McArdle, B.H. & Anderson, M.J. (2001) Ecology 82; 290-297

DISTLM, DISTLMforward - www.stat.auckland.ac.nz/~mja

DISTLM - multivariate multiple regression of any symmetric distance matrix with or without forward selection of individual predictors or sets of predictors and associated permutation tests.

Y response variables (n x m)

X predictor variables (n x q) (1/0 or continuous variables)

Performs a non-parametric test of the multivariate null hypothesis of no relationship between Y and X on the basis of any distance measure of choice, using permutations of the observations. X may contain the codes of an ANOVA model (design matrix) or it may contain one or more predictor variables (e.g., environmental variable) of interest.

Like Legendre and Anderson's (1999) distance-based redundancy analysis but with no correction for negative eigenvalues. Shown theoretically that partitioning the variability in X according to a design matrix or model can be achieved directly from the distance matrix itself, even if the distance measure is semi-metric (e.g., Bray-Curtis distance).

Page 131: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANONICAL ANALYSIS OF PRINCIPAL CO-ORDINATES

Anderson, M.J. & Willis, T.J. (2003) Ecology 84, 511-525

CAP – www.stat.auckland.ac.nz/~mja

CAP - canonical analysis of principal co-ordinates based on any symmetric distance matrix including permutation tests.

Y response variables (n x m)

X predictor variables (n x q) (1/0 or continuous variables)

Performs canonical analysis of effects of X on Y on the basis of any distance measure of choice and uses permutations of the observations to assess statistical significance.

If X contains 1/0 coding of an ANOVA model (design matrix), result is a generalised discriminant analysis. If X contains one or more quantitative predictor variables, result is a generalised canonical correlation analysis.

Page 132: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Output

1. Eigenvalues and eigenvectors from the PCOORD. Can use latter to plot an indirect ordination of data.

2. Canonical correlations and squared canonical correlations.

3. Canonical axis scores.

4. Correlations of original variables (Y) with canonical axes.

5. Diagnostics to help determine appropriate value for Qt, number of eigenvectors. Select the lowest misclassification error (in the case of groups) or the minimum residual sum of squares (in the case of quantitative variables in X). Also Qt must not exceed m or n and is chosen so that the proportion of variance explained by the first Qt axes is more that 60% and less than 100% of the total variation in the original dissimilarity matrix.

6. In the case of groups, table of results for 'leave-one-out' classification of individual observations to groups.

7. Results of permutation test to test statistical significance of Qt axis model (trace and first eigenvalue).

8. Scores to construct constrained ordination diagram to compare with unconstrained ordination diagram.Very good at highlighting and testing for group differences (e.g. sampling times) as CAP finds axes that maximise separation between groups.

With quantitative predictors, CAP finds axes that maximise correlation with predictor variables.

'AIC' for model selection deviance-capscale Jari Oksanen VEGAN 1.7-6 R

Page 133: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Extensions by Jari Oksanen (capscale in Vegan R)

1. Axes are weighted by their corresponding eigenvalues so that the ordination distances are best approximations of the original dissimilarities.

2. Uses all axes with positive eigenvalues. Guarantees that the results are the best approximation of the original dissimilarities.

3. Adds species scores as weighted sums of the (residual) species data.

4. Negative eigenvalues are harmless and can be ignored. Often most sensible to use dissimilarity coefficients that do not have negative eigenvalues. Square-root transformation of the species data prior to calculating dissimilarities can drastically reduce the number of negative eigenvalues.Note that CAP with Euclidean distance is identical to RDA in

sample, species, and biplot scores (except for possible reversal of sign).

Page 134: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Possible uses of canonical analysis of principal co-ordinates

1. As in CCA or RDA with biological and environmental data.

2. Fit models to data with rare or unusual samples or species that may upset CCA.

3. Analyse many environmental variables in relation to external (e.g. geographical, geological, topographical) constraints with Monte Carlo permutation tests. In other words, do a multivariate linear regression but not have to worry about the data meeting the assumptions of least-squares estimation and models.

Examples:

Willis & Anderson 2003 Marine Ecology Progress Series 257: 209-221 (cryptic reef fish assemblages)

Edgar et al. 2003 Journal of Biogeography31: 1107-1124 (shallow reef fish and invertebrate assemblages)

Page 135: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

SUMMARY OF CONSTRAINED ORDINATION METHODS

Methods of constrained ordination relating response variables, Y (species abundance variables) with predictor variables, X (such as quantitative environmental variables or qualitative variables that identify factors or groups as in ANOVA).Name of methods (acronyms, synonyms)

Distance measure preserved

Relationship of ordination axes with original variables

Takes into account correlation structure

Redundancy Analysis (RDA) Euclidean distance

Linear with X, linear with fitted values, Y = X(X'X)-1 X'Y

... among variables in X, but not among variables in Y

Canonical Correspondence Analysis (CCA)

Chi-square distance

Linear with X, approx unimodal with Y, linear with fitted values, Y*

... among variables in X, but not among variables in Y

Canonical Correlation Analysis (CCorA, COR)

Mahalanobis distance

Linear with X, linear with Y ... among variables in X, and among variables in Y

Canonical Discriminant Analysis (CDA; Canonical Variate Analysis CVA; Discriminant Function Analysis, DFA)

Mahalanobis distance

Linear with X, linear with Y ... among variables in X, and among variables in Y

Canonical Analysis of Principal Coordinates (CAP; Generalized Discriminant Analysis)

Any chosen distance or dissimilarity

Linear with X, linear with Qt; unknown with Y (depends on distance measure)

... among variables in X, and among principal coordinates Qt

^

^

Page 136: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CRITERION FOR DRAWING ORDINATION AXES

• Finds axis of maximum correlation between Y and some linear combination of variables in X (i.e., multivariate regression of Y on X, followed by PCA on fitted values, Y).

• Same as RDA, but Y are transformed to Y* and weights (square roots of row sums) are used in multiple regression.

• Finds linear combination of variables in Y and X that are maximally correlated with one another.

• Finds axis that maximises differences among group locations. Same as CCorA when X contains group identifiers. Equivalent analysis is regression of X on Y, provided X contains orthogonal contrast vectors.

• Finds linear combination of axes in Qt and in X that are maximally correlated, or (if X contains group identifiers) finds axis in PCO space that maximises differences among group locations.

^

RDA

CCA

CCorA

CVA

CAP

Page 137: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRINCIPAL RESPONSE CURVES (PRC)

van der Brink, P. & ter Braak, C.J.F. (1999) Environmental Toxicology & Chemistry 18, 138-148

van der Brink, P. & ter Braak, C.J.F. (1998) Aquatic Ecology 32, 163-178

PRC is a means of analysing repeated measurement designs and of testing and displaying optimal treatment effects that change across time.

Based on RDA (= reduced rank regression) that is adjusted for changes across time in the control treatment. Allows focus on time-dependent treatment effects. Plot resulting principal component against time in PRC diagram.

Developed in ecotoxicology; also used in repeated measures in experimental ecology and in descriptive ecology where spatial replication is substituted for temporal replication.

Highlights differences in measurement end-points betweeen treatments and the reference control.

Page 138: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRC MODEL

Yd(i)tk = Yotk + bk cdt + d(i)tk where

Yd(i)tk = abundance counts of taxon k at time t in replicate i of treatment d

Yotk = mean abundance of taxon k in controls (o) at time t

cdt = principal response of treatment d at time t (PRC)

bk = weight of species k with respect to cdt

d(i)tk = error term with mean of zero and variance 2k

Modelling the abundance of particular species as a sum of three terms, mean abundance in control, a treatment effect, and an error term.

Data input - species data (often log transformed) for different treatments at different times

- predictor variables of dummy variables (1/0) to indicate all combinations of treatment and sampling time ('indicator variables')

- covariables of dummy variables to indicate sampling time

Do partial RDA with responses, predictors, and covariables and delete the predictor variables that represent the control. This ensures that the treatment effects are expressed as deviations from the control.

Page 139: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRC MODEL (continued)

Simple example - three treatments: C = control, L = low dosage (not rep-licated), H = high dosage sampled at four times (W0, W1, W2, W3), six species.

Page 140: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRC MODEL (continued)

* ** * * * * * deleted in the RDA

Page 141: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRC PLOTS

One curve for each treatment expressed as deviation from the control. Species weights (bk) allow species interpretation. Higher the weight, more the actual species response is likely to follow the PRC pattern, because the response pattern = bk cdt. Taxa with high negative weight are inferred to show opposite pattern. Taxa with near zero weight show no response.

Significance of PRC can be tested by Monte Carlo permutation of the whole time series within each treatment.

Can use the second RDA axis to generate a second PRC diagram to rank 2 model.

Page 142: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRC PLOTS (continued)

See maximum deviation from control after 4 weeks, maximum effect larger for 150 g/l treatment than for 50 g/l treatment.

Chlamydomonas has high negative weight and this had highest abundances in high doses after treatment began.

Principal response curves resulting from the analysis of the example data set, indicating the effects of the herbicide linuron on the phyto-plankton community. Of all variance, 47% could be attributed to sampling date, and is on the horizontal axis. Of all variance, 30% could be attributed to treatment. Of the variance explained by treatment, 23% is displayed on the vertical axis. The lines represent the course of the treatment levels in time. The species weight (bk) can be interpreted as the affinity of the taxon with the principal response curves.

Page 143: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRINCIPAL RESPONSE CURVES AND ANALYSIS OF MONITORING DATA

PRC usually used with experimental data. Can be used with (bio)monitoring data.

Samples at several dates at several sites of a river, some upstream of a sewage treatment plant (STP) (300 m, 100 m), in the STP outlet, and some downstream (100 m, 1 km). 795 samples, 5 sites, 1994-2002.

PRC using sampling month as covariable, product of sampling month and site as explanatory variables. Used STP outlet as the reference site.

Van der Brink, P. et al. (2003) Austr. J. Ecotoxicology 9: 141-156

Page 144: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Principal Response Curves indicating the effects of the outlet of a sewage treatment plant on some monthly averages of physico-chemical characteristics of a river. Of all variance, 24% could be attributed to between-month variation; this is displayed on the horizontal axis. 57% of all variance could be allocated to between-site differences, the remaining 19% to within-month variation. Of the between-site variation, 58% is displayed on the vertical axis. The lines represent the course of the sites in time with respect to the outlet. The weight of the physico-chemical variable (bk) can be interpreted as the affinity of the variables with the Principal Response Curves (cdt).

Page 145: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

See biggest differences for the two upstream sites, with lower NOx, total N, conductivity, salinity, total P, and temperature and higher values of turbidity and faecal coliforms. STP outlet leads to increases in N, P, temperature, etc. Downstream values decrease but are not as low as upstream sites. STP successfully reduces faecal coliforms as their values are higher in the upstream sites due to pollution.

Page 146: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

PRINCIPAL RESPONSE CURVES – A SUMMARY

Filters out mean abundance patterns across time in the control. Focuses on deviation between treatment and control. PRC displays major patterns in those deviations and provides good summary of response curves of individual taxa.

PRC helps to highlight 'signal' from 'noise' in ecological data in replicated experimental studies.

Simplified RDA - simplified by representing the time trajectory for the controls as a horizontal line and taking the control as the reference to which other treatments are compared.

PRC gives simple representation of how treatment effects develop over time at the assemblage level.

Page 147: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

ECOLOGICAL APPLICATIONS OF PRINCIPAL RESPONSE CURVES

Vandvik, V. 2004. J. Ecology 92: 86-96 – revegetation of gaps in replicated successional series 0, 10, and 40 years after abandonment.

Vandvik, V et al. 2005. J. Applied Ecology 42: 139-149 – post-fire succession in heathlands.

Heegaard, E. & Vandvik, V. 2004. Oecologia 139: 459-466 – space-for-time substitution to analyse compositional differences in local ridge-snowbed gradients in 100 m altitudinal bands from 1140 to 1550 m.

Page 148: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

POLYNOMIAL RDA AND POLYNOMIAL CCA

POLYRDACCA

www.fas.umontreal.ca/biol/legendre

Makarenkov & Legendre (2002) Ecology 83, 1146-1161

Instead of using multiple linear regression to model the relationship between each response variable y in Y and the explanatory variables X, use polynomial regression

Y = b1x1 + b2x12 + b3x2 + b4x2

2 ... ... bpxp2

Aim is to develop a multivariate reduced-rank regression model that produces a noticeable increase in the amount of variation in Y accounted for by the model compared to standard linear RDA and CCA.

REF REF

REF

Page 149: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

RDA - can be viewed as a series of multiple linear regressions followed by an eigenvalue decomposition of the table of fitted values

CCA - approximate unimodal responses of species to environmental gradients through a chi-square transformation of the species

data. Assumes linear relationships between Y and X.

Resulting fitted values are no longer a linear combination of the explanatory variables but a polynomial combination of variables,

e.g. b1x1 + b2x12 + b3x2 + b4x2

2 + b5x1x2 etc.

Tries to use polynomial relationships to model non-linear relationships only when these cannot be modelled well by simple linear models.

Need to reduce the number of predictor variables so as to avoid too many terms and 'overfitting'. Overfitting occurs if number of regression terms > n-1 where n = number of observations.

Polynomial algorithm proceeds by successively reducing the matrix of explanatory variables while increasing the R2 for the response variable by introducing polynomial terms.

REF

REF

REF

REF

Page 150: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Display of results

1. Plot individual terms by biplot arrows. Can easily get too many arrows. May only plot those with a correlation greater than some pre-specified value.

2. Represent each explanatory variable by a single arrow corresponding to the multiple correlation of x and x2 with the ordination axes.

Tests of significance

1. Test statistical significance of linear model by permutation tests as in CANOCO.

2. Test statistical significance of polynomial model by related permutation test.

3. Test if the difference in variance accounted for between the polynomial and the linear models is significant and hence if

the simpler linear model is the minimal adequate model.

REF

REF

REF

REF

Page 151: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Linear RDA

Polynomial RDA

Significance of difference in explained variance p = 0.002 (999 permutations). Use the polynomial RDA model.

Notes: Matrix Y: hunting spider species 1-12. Matrix X: water content, reflection of soil surface. Either set of biplot scores can be used to represent the environmental variables in biplots. Users of the program may also request the matrix of regression coefficients B of the multiple linear regressions of Y on X (if classical linear RDA or CCA is computed) or the polynomial coefficients for each response variable y of Y (if polynomial RDA or CCA is used). The program may also carry out permutation tests of the significance of the linear and polynomial models, as well as the significance of the difference in variance accounted for between the two models.

P = 0.001 (999 permutations)

REF

REF

REF

REF

Page 152: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Full arrows = biplot scores of environmental variables in polynomial: Dashed arrows = biplot scores based on multiple correlations of (water, water2) and (reflection, relection2) with the axes.

35.4% 53.7%

Linear RDA Polynomial RDA

REF

REF

REF

REF

Page 153: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA/RDA AS PREDICTIVE TOOLS

Prediction is important challenge in environmental science.

• Given environmental shift, how will species respond?

• Given environmental data only (e.g. satellite image data), what biotic assemblages could be expected?

Conventional CCA/RDA – description and modelling

Ym and Xm Modelm where subscript m = modern

'Palaeo' CCA/RDA – modelling and reconstruction

Ym and Xm Modelm

Yo and Modelm Xo where subscript o = fossil or 'palaeo' data(Lecture 8 on Environmental Reconstruction)

Predictive CCA/RDA – modelling and prediction

Ym and Xm Modelm

Xf and Modelm Yf where subscript f = future (predicted) data

Page 154: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Gottfried et al. 1998. Arctic and Alpine Research 30: 207-211

Schrankogel (3497 m) Tyrol, eastern central Alps.

1000 1x1 m plots between 2830 and 3100 m, ecotonal transition between alpine zone (vegetation cover >50%) and nival zone (vegetation cover <10%).

Vegetation data - species +/- and relative abundance of 19 species

Environmental data – Digital Elevation Model (DEM) with pixel size of plots in GIS ARC/INFO

- gives 17 topographic descriptors at 10 resolutions plus altitude.

Page 155: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Gottfried et al. 1998

Page 156: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CCA with forward selection to give 37 predictor variables

Calculated CCA sample scores for 650,000 cells of DEM area as weighted linear combination of environmental variables times the canonical coefficients.

For each predicted environment of each cell, estimated which of the 1000 plots it is closest to CCA space.

Gottfried et al. 1999

Page 157: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Extrapolate vegetation data from those plots to the 650,000 cells to predict species distributions, vegetation types, species richness, etc.

To evaluate predictions, did 10-fold cross-validation, namely model with 90% of the plots, predict with left-out 10%, and repeat 10 times. Compare predictions with actual observed data. Also calculated Cohen's kappa statistic between observed and predicted distributions (0 = uncorrelated, 1 = perfect match).

Page 158: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Model performance

Axis 1 Axis 2 Species variance

CA 0.41 0.21 34.3%

CCA 0.28 0.11 21.8%

Species fell into five groups of kappa and other performance values

Total inertia 1.79 Constrained inertia 0.68 38% explained by topography

Kappa > 0.5 e.g. Carex curvula, Veronica alpina

Kappa > 0.4 e.g. Androsace alpina, Oreochloa disticha

Kappa > 0.3 e.g. Saxifraga oppositifolia, Primula glutinosa

Kappa > 0.25

e.g. Ranunculus glacialis, Cerastium uniflorum

Kappa < 0.1 e.g Poa laxa

Page 159: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Predicted distributions

Gottfried et al. 1998

Page 160: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Predicted vegetation types

Gottfried et al. 1998

Page 161: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Best predictors are topographic measures of roughness and curvature rather than simple elevation, slope, or exposure.

Modelled richness patterns decline with altitude but with a maximum richness at the alpine-nival ecotone.

What might happen under future climate warming of 1ºC or 2ºC?

Gottfried et al. 1999 Diversity and Distributions 5: 241-251.

Calculated altitudinal lapse rate using 33 temperature loggers.

Assume that the altitudinal limits are determined by temperature. Knowing the temperature lapse rate, predict species distributions for +1ºC and +2ºC temperature increases.

Page 162: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Done by increasing the values of the altitude variable in the environmental data for the sample plots corresponding to the lapse rate. Repeated the CCA/GIS interpolation/mapping.

Assumes that species growing at lower altitudes and hence warmer situations will occur in a future warmer climate in the same topographical conditions

Gottfried et al. 1999

Predicted distribution patterns at +0º, +1º and +2º

Page 163: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Gottfried et al. 1999

Predicted distribution patterns at +0º, +1º and +2º

Page 164: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Predictions - 19 species modelled, about 2 will become extinct

will be a reduction in genetic diversity as some 'topographical forms' will be lost

Dirnböck et al. 2003 Applied Vegetation Science 6: 85-96.

Same approach but with topographic descriptors and infra-red spectral data, and 3 nearest neighbours to predict rather than 1.

Predictive mapping of 17 vegetation types between 1600 m and 2277 m on Hochschwab, eastern Alps.

Page 165: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Dirnböck et al. 2003

Page 166: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

69.4% accuracy, Cohen kappa of 0.64

Topography - good predictors of different alpine grasslands

Infra-red spectra - good predictors of different pioneer vegetation types

Unexplained variation - land-use history, soil variation especially nutrients like N, P, and K

Page 167: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

 Builds on:

 1) Weighted averaging of indicator species and extends WA to the simultaneous analysis of many species and many environmental variables.

 2) Reciprocal averaging (= correspondence analysis) by adding the statistical methodology of regression. General framework of estimation and statistical testing of the effects of explanatory variables on biological communities.

CANONICAL CORRESPONDENCE ANALYSIS

Page 168: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Major Uses:

1. Identify environmental gradients in ecological data-sets.

2. In palaeoecology, used as a preliminary to determine what variables influence present-day community compositions well enough to warrant palaeoenvironmental reconstruction.

3. Add 'fossil' samples into modern 'environmental' space.

4. Study seasonal and spatial and temporal variation in communities and how this variation can be explained by environmental variation. Variance can be decomposed into seasonal, temporal, spatial, environmental and random components.

5. Niche analysis – niche-space partitioning where species probability or abundance is unimodal function of environment.

6. Impact studies.

7. Predictive studies.

8. Experimental data analysis.

Page 169: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Powerful alternative to multivariate analysis of variance MANOVA.

 

e.g. analysis of BACI (before-after-control-impact) studies with and without replication of the impacted site

 e.g. repeated measurement designs

 e.g. experimental plot (= block) designs

 e.g. split-plot designs

 

See ter Braak C.J.F. & Verdonschot P.F.M (1995) Aquatic Sciences 57, 255–289

  Canonical correspondence analysis and related multivariate methods in aquatic ecology

Page 170: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

NON-LINEAR CANONICAL ANALYSIS OF PRINCIPAL CO-ORDINATES

Millar, Anderson & Zunan 2005 Ecology 86: 2245-2251

All the methods discussed so far (RDA, CCA, Canonical correlation analysis, distance-based RDA, CAP, non-linear RDA or CCA) are intrinsically linear with respect to the environmental variables and are fitting a linear model to the distances of dissimilarities implicit in the particular method (e.g. Euclidean distances in RDA, chi-squared distance in CCA, any distance measure in distance-based RDA or CAP).

Page 171: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

In some ecological situations, the relationship between community structure and environmental variables may be intrinsically NON-LINEAR.

e.g. community composition at different distances from a pollution source. At what point along the gradient does the impact become negligible? Environmental variable (distance from pollution source) has non-linear, exponential effect.

e.g. community change through time. After disturbance, succession occurs, with invasion and replacement of species with time. At what point in time does community composition stabilise?

Page 172: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Non-linear canonical analysis of principal co-ordinates (NCAP) of Y in relation to X

1. Principal co-ordinates analysis of Y to obtain Q orthogonal principal co-ordinates using any distance measure.

2. Non-linear canonical correlation analysis of Q in relation to X – to maximise correlation of Q with respect to a non-linear transformation of X

Use non-linear link function (e.g. exponential) just as we do when we extend a traditional univariate linear model to a GLM through the use of a link function. Non-linear optimisation procedures as in GLM.

3. Determine the smallest number of principal co-ordinates using BIC in the NCAP context

where p = number of parameters, c = constant, R2 = coefficient of determination

cnnpR log1logBIC 2

Page 173: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

4. Test null hypothesis of no association between Y and X or Q and X by randomisation tests of rows of X.

5. Test if the non-linear gradient fitted is preferred to a linear gradient. Can view the non-linear model as the 'full' model and the linear model as the 'reduced' model. Use a pseudo-F statistic to test the null hypothesis of a linear gradient and derive its significance by permutation of rows of X.

6. Determine the approximate 95% confidence intervals of the regression coefficients by bootstrapping.

R

Page 174: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Unlike non-linear canonical correlation analysis or non-linear polynomial RDA or CCA, no limit to the number of variables in Y as the analysis is based on a distance measure of choice.

No explicit assumptions about distributions in the biological data (e.g. counts, presence/absence) as randomisation tests and bootstrapping are used to test formally the statistical significance of the observed non-linear gradient.

Virtually any non-linear form of gradient can be used along with any reasonable distance measure.

Of considerable potential in the analysis of multivariate non-linear ecological systems.

Page 175: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANONICAL GAUSSIAN ORDINATIONSomething New but a bit Heavy!

CCA can be estimated in two ways

(1) weighted averaging estimation – heuristic, rapid, computationally simple, approximation to

(2) maximum likelihood (ML) estimation – 'ideal' but theoretically rigorous and computationally difficultCANOCO implements the weighted averaging

approach. ML canonical Gaussian ordination (CGO) not computationally possible until very recently.

Page 176: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Yee (2004) Ecological Monographs 74: 685-701

Quadratic reduced-rank vector generalised linear models (QRR-VGLMs) as a way of implementing CGO.

Basis is RR-VGLMs where by reduced-rank regression multivariate regression (many responses and many predictors) is implemented in the framework of GLM with specified link functions, error distributions, etc.

QRR-VGLMs are simply RR-VGLMs but with the addition of a quadratic form to each linear predictor, so that bell-shaped responses can be modelled as functions of environmental variables or gradients.

VGLM package in R

Page 177: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Hunting spider data – 12 species, 28 sites

Page 178: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.
Page 179: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

QRR-VGLM – assume equal or unequal species tolerances

Page 180: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

QRR-VGLM results very similar to CCA results

Page 181: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Advantages of canonical Gaussian ordination (CGO)

1. CGO has clear statistical assumptions

2. CGO provides estimates of maxima, optima, and tolerances of species

Page 182: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

3. CGO avoids complications, heuristics, and statistical simplifications and inefficiencies implicit in weighted averaging and detrending.

4. CGO is built on firm statistical theory.

5. CGO is more flexible than CCA.

CCA makes four assumptions (equal tolerances, equal maxima, homogeneously distributed data, site scores homogeneously distributed over the gradients – assumptions 1-3 imply a species-packing model). These assumptions cannot simultaneously hold with real ecological data.

CGO does not make any of these assumptions though the equal-tolerance assumption makes the ordination plot more easily interpretable.

Page 183: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

6. Residual diagnostic plots are available. (In theory possible in CCA.)

Page 184: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Disadvantages of CGO are:

1. Statistical assumption of symmetric unimodal Gaussian responses may be unrealistic in practice – skewed or multimodal responses.

2. Sensitive to outliers, sparse data, multicollinearity, high leverage points, as in other GLMs.

3. The global log-likelihood function that is maximised may have several maxima, and convergence to a local solution can never be ruled out.

4. Good initial starting values can be difficult to obtain, especially as the number of species and environmental variables increases.

5. Difficult to establish statistically the rank R of the data.

6. Numerically very intensive. CPU power continues to increase exponentially, so the size of data that can be handled will also increase.

Page 185: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

QRR-VGLM advantages:

1. Basic framework is very broad, so error distributions such as binomial, negative binomial, and zero-inflated Poisson can be handled, along with compositional data. As Yee (2006) say "CGO operates on various types of species data such as presence/absence data and Poisson counts; CCA treats all types of data the same, which, to purists, is a bit like using a baseball bat to play golf, squash, and tennis."

2. Can accommodate linear combinations of non-linear functions of explanatory variables. Can include factors and functions such as orthogonal polynomials and B-splines. Polynomial CCA is ad hoc and too limited.

3. Any rank R can be fitted in theory.

4. Approximate statistical significance of regression parameters can be tested.

5. Statistical hypothesis testing in terms of deviance is possible.

6. Overdispersion and underdispersion in the data can be detected. Not possible in CCA (Overdispersion - variance in data exceeds nominal variance under an assumed model: underdispersion is opposite).

Page 186: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

QRR-VGLM disadvantages

1. Requires large amounts of memory as large model matrices are constructed to perform the fitting.

2. Specialist software needed – VGAM package in R.

3. Convergence problems may occur, especially with 'dirty' real data.

4. Much skill and statistical understanding are needed to use QRR-VGLMs.

5. 'Exact' standard errors for all estimated parameters currently not possible.

6. Currently computationally possible for rank 2 with 500 sites x 5 species x 5 predictors or 60 sites x 10 species x 10 predictors. Computational needs increase very rapidly with number of species and with the rank.

Page 187: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Important message, not mentioned by Yee (2004, 2006) is that the CCA solution is very similar to the CGO QRR-VGLM (and the CAO RR-VGAM) model results.

Reassuring for all CCA users!

Remarkable just how robust CCA is and how good weighted average estimation is.

Page 188: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Next step:

to move from GLM framework of specified response models to GAM framework with smooth data-driven functions

i.e. RR-VGAMs.

Yee (2006) Ecology 97: 203-213

Page 189: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CONSTRAINED ADDITIVE ORDINATION (CAO)

CAO, unlike CA, DCA, CCA, CGO, etc, does NOT assume any underlying species response model (e.g. unimodal, symmetric, etc).

Computes optimal gradients and flexible GAM-like response curves. Can see response curves as they really are in relation to the major gradients.

With one gradient, CAO is a generalisation of constrained quadratic (or Gaussian) ordination (CQO or CGO).

Replaces symmetric bell-shaped responses in CQO (=CGO) with completely flexible smooth curves estimated using smoothers.

CAO models are GAMs fitted to a very small number of latent variables.

Data-driven rather than model-driven, so CAO allows the data "to speak for themselves". No assumptions.

Can now think of ordination methods based on modern regression procedures as unconstrained or constrained (=canonical) and linear, quadratic, or smooth depending on assumed underlying responses.

VGAM package in R

Page 190: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Assumptions of CCA, CQO (=CGO), and CAO

CCA CQO CAO

Equal tolerances + +/- -Equal maxima + +/- -Species optima uniformly distributed

+ - -

Site scores uniformly distributed

+ - -

Symmetric bell-shaped responses

+ + -

Page 191: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Ordination (O) procedures based on modern regression procedures (Yee 2006)

Shape Unconstrained (U) Constrained (C)

Linear (L) Unconstrained linear ordination e.g. unconstrained vector generalised linear model (U-VGLM)

Constrained LO e.g. reduced rank regression (=RDA), generalised reduced rank regression, RR-VGLM

Quadratic (Q)

Unconstrained quadratic ordination (UQO) e.g. quadratic unconstrained VGLM (QU-VGLM), Gaussian ordination

Constrained quadratic ordination (CQO=CGO) e.g. QRR-VGLM, Gaussian logit ordination

Smooth (GAM)

Unconstrained additive ordination (UAO) e.g. U-VGAM, prinicipal curves

Constrained additive ordination (CAO) e.g. RR-VGAM, constrained principal curves

VGLM = vector generalised linear model; VGAM = vector generalised additive model; RR = reduced rank

Page 192: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Yee (2006) proposes that ordination, prediction (estimating y from x), and calibrating (estimating x from y) can all be performed from the same model.

Can, in theory, have

Ordination Prediction Calibration

CLO CLP CLC

CQO (= CGO) CQP CQC

CAO CAP CAC

Page 193: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Reduced-rank vector-based generalised additive models (RR-VGLM)

Approximates each surface in R-dimensional latent-variable space by an additive model.

Interpretable and avoids the 'curse of dimensionality' when R becomes large. Splines or LOESS smoothers are unusable if R is large and are uninterpretable if R > 2.

Page 194: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Technical problems arise in how much flexibility to allow in the GAM smoother, the effective non-linear degrees of freedom.

Smooth curve with zero non-linear degrees of freedom is linear, so CAO = CLO.

Values between 1.5 and 2.5 non-linear degrees of freedom usually necessary.

Page 195: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Hunting spider data: 12 species 28 sites 6 environmental variables

Each species has three non-linear degrees of freedom in rank-1 CAO model

Clear unimodal response curves

Similar to De'ath's (1999) principal curves for the same data

Yee (2006)

Page 196: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CAO model. Very similar to CQO (and CCA) results

CQO (= CGO) preferred to CAO as the final model as it is simpler and parameters such as species tolerances are well defined (as in GLM compared to GAM).

CAO best used as an exploratory tool and to help CQO work better.

Yee (2006)

Page 197: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

RR-VGLMs more powerful than ordinary GAMs because they search for optimal latent variables before applying the additive model.

GAMs are a special case of RR-VGAMs because GAMs smooth one environmental variable at a time whereas each RR-VGAM smooth is on a composite of the environmental variables after searching for the optimal combination or weighting.

At present, software only handles rank-1 RR-VGAMs.

Cannot handle interactions in X.

Are "fragile with dirty data" (Yee 2006).

Page 198: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

 CANOCO + CANODRAW 

[CANCOR, CAP]

canonicalcorrelation

analysis

constrained principalco-ordinates analysis

DISTPCOA

distance-based redundancy analysis via principal co-ordinates analysis

SOFTWARE FOR CONSTRAINED ORDINATIONS

POLYRDACCA

polynomial RDA and CCA

R(VEGAN) (CCA, RDA)

R (Non-linear CAP)

R(VGLM) (CGO)

R(VGAM) CAO)

Page 199: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.
Page 200: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANODRAW 4

Pie symbols plot

Page 201: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANODRAW 4

Page 202: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

CANODRAW 4

Isolines in RDA ordination diagram

Biplot with environmental variables

& sites

Attribute plot

T-value biplotSample diagram with principal response curves

Response curves fitted using GAM

Page 203: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.
Page 204: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

1987

20032002

1987

Page 205: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 7. Direct Gradient Analysis.

Mark Hill

Cajo ter Braak

Petr ŠmilauerMarti Anderson

Thomas Yee


Recommended