
RH-12-2008

Thesis for the degree of Master of Science in Environment and Natural

Resources

Robustness of three hierarchical agglomerative clustering

techniques for ecological data

Warsha Singh

Faculty of Natural Sciences

Department of Mathematics

October 2008

A thesis submitted in partial fulfillment of the requirements for the degree of Master

of Science in Environment and Natural Resources at the University of Iceland.

Robustness of three hierarchical agglomerative clustering techniques for ecological data

Warsha Singh

Science Institute Report: RH-12-2008

© Warsha Singh 2008

Committee in charge:

Dr. Gunnar Stefánsson (Department of Mathematics, University of Iceland)

Dr. Einar Hjörleifsson (Marine Research Institute of Iceland)

Moderator:

Dr. Erla Björk Ornolfsdóttir (Marine Research Center Breiðafjörður)

Contents

Abstract
Acknowledgement
1 Introduction
1.1 Purpose of the study
2 Statistical Theory
2.1 Hierarchical agglomerative clustering
2.1.1 Average linkage (UPGMA)
2.1.2 Complete linkage
2.1.3 Ward's linkage
2.2 Non-Metric Multidimensional Scaling (NMDS)
3 Methodology
3.1 Icelandic Groundfish Survey
3.2 Data
3.3 Hierarchical cluster analysis - Species Assemblages
3.3.1 Analysis I: Correlation distance
3.3.2 Analysis II: Bray-Curtis distance
3.3.3 Data Analyses
3.4 Comparison of the hierarchical clustering techniques
3.5 Comparison of hierarchical clustering with non-metric multidimensional scaling
3.6 Fish Assemblages in relation to environmental variables
3.7 Habitat analysis
3.8 Heatmap
4 Results
4.1 Comparison of the three hierarchical clustering techniques
4.2 Sample size effect
4.3 Data Aggregation (smoothing) effect
4.4 Comparison of hierarchical clustering with non-metric multidimensional scaling
4.5 Fish Assemblages in relation to environmental variables
4.6 Habitat Classification
4.7 Heatmap
5 Discussion
5.1 Fish Assemblages and species-environment relationships
6 Main considerations and recommendations
A Appendix

List of Figures

3.1 Icelandic groundfish survey area within the 500 meter contour line, outlining the statistical rectangles and the locations of the stations

3.2 Distribution of the data (a) before and (b) after transforming to fourth root and scaling to zero mean and variance 1, for four abundant species in the survey, as labelled. The histogram shows the number of fish per tow collections.

3.3 Distribution of the data (a) before and (b) after transforming to fourth root and standardising by range, for four abundant species in the survey, as labelled. The histogram shows the number of fish per tow collections.

4.1 Dendrogram of species assemblage for the Icelandic Groundfish (IGF) survey from 1998-2007 using (a) Average linkage and (b) Complete linkage, with correlation dissimilarity measure. Data consists of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising all tow collections. The rectangles highlight the clusters with AU > 0.9. The AU values used for interpretation are indicated in blue and the cluster number (edge) is marked in green.

4.2 Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consists of species abundance in numbers fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the clusters with AU > 0.9.


4.3 Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage, with correlation dissimilarity measure. Data consists of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the identified species assemblages for comparison.

4.4 Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consists of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the identified species assemblages for comparison.

Complete linkage with Bray-Curtis dissimilarity measure. Data consists of species abundance in numbers, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

4.6 Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consists of species abundance in numbers, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

sists of mean species abundance in numbers by stations, fourth root transformed and standardised by range. The rectangles highlight the identified species assemblages for comparison.

Curtis dissimilarity measure. Data consists of mean species abundance in numbers by stations, fourth root transformed and standardised by range. The rectangles highlight the identified species assemblages for comparison.

4.9 Dendrogram of species assemblage using Average linkage with corre-
numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.


4.10 Dendrogram of species assemblage using Average linkage with corre-
1, comprising 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

4.11 Dendrogram of species assemblage using Complete linkage with correlation dissimilarity measure. Data consists of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1, comprising (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The

4.12 Dendrogram of species assemblage using Complete linkage with correlation dissimilarity measure. Data consists of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1, comprising 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

1, comprising (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

1, comprising 10% random subsample of the total tow collections.

4.15 Dendrogram of species assemblage using Average linkage with Bray-
numbers, fourth root transformed and standardised by range, comprising (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.


4.16 Dendrogram of species assemblage using Average linkage with Bray-
prising of 10% random subsample of the total tow collections. The

4.17 Dendrogram of species assemblage using Complete linkage with Bray-
ised by range, comprising of (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

4.18 Dendrogram of species assemblage using Complete linkage with Bray-
ised by range, comprising of a 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

prising of (a) 50% random subsample and (b) 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

prising of 10% random subsample of the total tow collections. The

Complete linkage with correlation dissimilarity measure. Data consists of mean species abundance in numbers by statistical subrectangles, fourth root transformed and scaled to 0 mean and variance 1.


4.22 Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consists of mean species abundance in numbers by statistical subrectangles, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the clusters with AU > 0.9.

sists of mean species abundance in numbers by statistical subrectangles, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

dance in numbers by statistical subrectangles, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

4.25 Multidimensional scaling using Bray-Curtis distance measure for (a) the full data set (comprising all tow collections) (b) data aggregated by statistical sub-rectangle. Species abundance in numbers was fourth root transformed and standardised by range.

4.26 Geographical distribution of the 40 species analysed for this study, labelled accordingly. The bubble plot shows the mean abundance of species by statistical subrectangles averaged across years. The size of the circles is proportional to the square root of the mean abundance.

4.27 Weighted average depths and standard deviations for the 40 species analysed. A-D refers to the identified fish assemblages from Ward's hierarchical clustering based on correlation distance.

4.28 (a) Box and whisker plot for the mean depths of species in the identified fish assemblages from Ward's hierarchical clustering based on correlation distance (b) Tukey test results showing the significant difference between the identified fish assemblages (c) Box and whisker plot for the mean depths of species in the identified fish assemblages from Ward's hierarchical clustering based on Bray-Curtis distance (d) Tukey test results showing the significant difference between the identified fish assemblages from (c)


4.29 Weighted average depths and standard deviations for the 40 species analysed. A*-C* refers to the identified fish assemblages from Ward's hierarchical clustering based on Bray-Curtis distance.

4.30 Definition of areas in Icelandic waters using Ward's hierarchical clustering. The data consist of species abundance in numbers transformed to fourth root. Clustering was based on (a) correlation distance with data scaled to 0 mean and variance 1 (b) Bray-Curtis distance with data standardised by range.

4.31 Species composition of defined clusters from the habitat classification using Correlation distance measure and Ward's linkage. The species codes are outlined in Table 4 in the Appendix.

4.32 Species composition of defined clusters from the habitat classification using Bray-Curtis distance measure and Ward's linkage. The species codes are outlined in Table 4 in the Appendix.

4.33 A heatmap showing the species-area association for the Icelandic Groundfish (IGF) survey from 1998-2007 using Average linkage hierarchical clustering with correlation dissimilarity measure. The x-axis shows the dendrogram of areas (statistical rectangles) and y-axis shows the dendrogram of species assemblage. Data consists of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1. The colours range from blue (low ratios) to red (high ratios) indicating the strength of associations.

A.1 Definition of areas in Icelandic waters using (a) Average (b) Complete hierarchical clustering with correlation distance. Data consists of species abundance in numbers, transformed to fourth root and scaled to 0 mean and variance 1.

A.2 Definition of areas in Icelandic waters using (a) Average (b) Complete hierarchical clustering with Bray-Curtis distance. Data consists of species abundance in numbers, transformed to fourth root and standardised by range.

List of Tables

2.1 Parameter Values for the clustering algorithms used in this study

4.1 Cophenetic Correlation Coefficient for Analysis I (Correlation distance) and II (Bray-Curtis distance)

4.2 Agglomerative Coefficient for Analysis I (Correlation distance) and II (Bray-Curtis distance)

A.1 The common and Latin names of the forty most common species analysed for this study with the codes used for analysis.

Abstract

Although cluster validity has been a subject of interest and importance in the field of molecular genetics for some decades, substantive guidelines are not readily available for choosing an appropriate clustering algorithm for ecological data. This study tested the robustness of three common hierarchical agglomerative clustering methods, Average, Complete and Ward's linkage, for the identification of species assemblages. Icelandic groundfish survey data for the period 1998-2007 were used, taking the forty most abundant species into consideration. The objective criteria used for cluster validity, or efficiency, were the Cophenetic Correlation Coefficient (CPCC) and the Agglomerative Coefficient (AC). To test the reliability of the clusters, a bootstrap resampling technique was used to generate probabilities for the clusters. Furthermore, to examine the stability and consistency of the linkage methods, their performance across different sample sizes and levels of data smoothing was tested. Two modes of data analysis, each based on a different combination of data standardisation and distance measure, (1) Correlation distance on data scaled to zero mean and unit variance and (2) Bray-Curtis distance on data standardised by range, showed that Ward's clustering technique was the most robust and suitable for this data set. It generated consistent, well-defined clusters with high probabilities and gave high values of CPCC and AC. The assemblages were also ecologically meaningful when related to two environmental parameters, depth and geographical distribution. A verification of the hierarchical clusters with Non-metric Multidimensional Scaling also gave similar species groupings. Complete linkage was unstable, generating inconsistent results across different sample sizes and levels of data smoothing. Average linkage maximised CPCC but was sensitive to the way the data were standardised. The CPCC criterion of cluster validity was not seen as a very reliable or adequate measure in this study.

Subsequently, the main species assemblages in the Icelandic waters covered by the survey were defined. Biological interpretations of the fish assemblages showed that the spatial structure of the environmental gradients around Iceland played a role in characterising the fish assemblages. A definition of areas around Iceland led to a separation along the north-south gradient, according to the bathymetric and hydrographic conditions, which further showed some differentiation along depth. Furthermore, the use of a visualisation technique, the heatmap, was introduced for exploring community patterns.

Acknowledgment

I would like to acknowledge the Marine Research Institute of Iceland for making the data on the Icelandic groundfish survey available for this study. My sincere gratitude goes to Dr. Gunnar Stefánsson of the Department of Mathematics, University of Iceland and Dr. Einar Hjörleifsson of the Marine Research Institute of Iceland for their technical guidance and continued valuable support throughout this study and for their constructive comments in strengthening it.

I am immensely and forever grateful to the co-ordinating team of the United Nations University Fisheries Training Programme, Dr. Tumi Tómasson, Mr. Þor Ásgeirsson and Ms. Sigridur Kr. Ingvarsdóttir, for giving me this opportunity to be a part of this Masters Programme in Environment and Natural Resources at the University of Iceland. I would also like to acknowledge the continued support and encouragement from Dr. Brynhildur Davidsdóttir (Co-ordinator for the Masters Programme).

I thank Mr. Sigurdur Þor Jónsson of the Marine Research Institute of Iceland for providing his technical assistance with the statistical software R.

1 Introduction

The shift toward ecosystem-based fisheries management has resulted in numerous studies, carried out world-wide, to determine fish assemblages. This new approach entails starting fisheries management at the ecosystem level rather than at the single-species level (Pikitch et al., 2004). An initial step toward understanding the ecosystem- or multispecies-based approaches is to understand the mechanisms of the biological communities in space and time, including their correlation with the environment (Sousa et al., 2005; Jaureguizar et al., 2006). Hence, the identification of fish assemblages and their relation to environmental variables may be seen as one probable measure of potential interactions between the species (Francis et al., 2002). The term fish assemblage refers to a group of species that coexist at a geographical scale because of similar habitat preferences or biological interactions (Jaureguizar et al., 2003; Mahon et al., 1998). Because these assemblages potentially characterise geographical areas or environmental gradients, they are considered an appropriate indicator of habitat complexity (Noss, 1990). The patterns of species assemblages are commonly defined using multivariate analysis, inferring species-environment relationships. Nonetheless, not much attention is given to the reliability of the methodology that is applied, which is the main topic of the present study.

Hierarchical cluster analysis is widely applied in assemblage studies. This method is based on identifying objects with similar characteristics and grouping them together such that objects within a group are more similar to each other than to objects in different groups. Cluster analysis can be used to identify species assemblages, and different sites and times having similar community structures (Clarke and Warwick, 2001). The output is a tree-like structure called a dendrogram, with the x-axis showing the objects and the y-axis indicating the level of similarity or dissimilarity of the groupings. Similarity between the clusters diminishes moving from lower to upper levels. Hierarchical clustering is sub-divided into agglomerative and divisive methods. Agglomerative methods are most commonly used (Clarke and Warwick, 2001). In the basic description given by Quinn and Keough (2002), the procedure starts with calculating a matrix of dissimilarities between the objects or variables; the two objects which are most similar are then clustered together to form a new object that replaces the merged pair. The dissimilarities between the new set of objects are re-calculated and again the most similar objects are merged. The process continues until all the objects are linked in a single cluster. Dissimilarity indices (also called distances) measure how different the objects are (how far apart the objects are in multidimensional space) and are calculated for every possible pair of objects. This is the basis for the formation of a cluster. For continuous variables, dissimilarity measures include Euclidean (squared normal distance), Manhattan, Canberra, Minkowski, Bray-Curtis, Kulczynski and Chi-square (Quinn and Keough, 2002). A variety of agglomerative clustering methods exist, depending on which technique or linkage method is used to fuse the objects during the clustering process. Some of the common ones include Single linkage, Complete linkage, Average linkage and Ward's hierarchical clustering method. The divisive method is the opposite of the agglomerative, starting with a single cluster which contains all the objects and splitting it up into smaller groups (Clarke and Warwick, 2001); two-way indicator species analysis (TWINSPAN) is a common method in this class (Quinn and Keough, 2002).
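The agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustrative, naive implementation and not the code used in this study (the analyses were carried out in R); the function name `agglomerate` and the toy dissimilarity matrix are assumptions made for the example, and only Average (UPGMA) and Complete linkage are covered.

```python
from itertools import combinations


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix.

    dist    -- list of lists; dist[i][j] is the dissimilarity of objects i and j
    linkage -- "average" (UPGMA) or "complete"
    Returns the merge history as (members_a, members_b, height) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # every object starts alone
    merges = []

    def between(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "complete":
            return max(pairs)           # farthest pair of members
        return sum(pairs) / len(pairs)  # mean over all cross pairs (UPGMA)

    while len(clusters) > 1:
        # Find the two clusters that are currently most similar.
        a, b = min(combinations(clusters, 2), key=lambda p: between(*p))
        merges.append((a, b, between(a, b)))
        # Replace the merged pair with a single new cluster.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


# Hypothetical dissimilarities between four objects (not data from this study).
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
history = agglomerate(toy, "average")
for a, b, height in history:
    print(a, "+", b, "merged at height", round(height, 3))
```

The merge heights are what a dendrogram plots on its y-axis. Switching `linkage` to `"complete"` changes only how the dissimilarity between two clusters is measured, which is exactly the choice that distinguishes the linkage methods compared in this study.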

For the most part, hierarchical clustering techniques lack a completely stable output and an objective measure for evaluating the outcomes obtained (Cao et al., 2002a; Nemec and Brinkhurst, 1988), introducing subjectivity into the interpretation of the classifications (Mahon et al., 1998). Generally, prior to clustering the grouping properties of the data set are unknown and the number of expected clusters cannot be assigned beforehand. In other words, it is an unsupervised process ("unsupervised learning") and it is generally difficult to judge whether the resulting classification patterns and the number of groups are acceptable. Additionally, the functioning of the clustering algorithms is susceptible to the properties of the data and the assumptions made for the definition of the groups (Halkidi et al., 2002b; Kovács et al., 2005). Another drawback is that once a cluster is formed it cannot be broken down later in the process, and an inaccurate cluster formed early in the process will therefore influence the classification that follows (Quinn and Keough, 2002). Consequently, evaluation and validation of clustering techniques are an essential part of cluster analysis (Legendre, 1998). Comparing outcomes from a few techniques can also ensure consistency and plausibility of the results, as different clustering algorithms could lead to different results for the same data set (Jakoniene and Lambrix, 2007). Naturally, if there really is a strong association in the data, different methods should produce similar results (Quinn and Keough, 2002).

The results of hierarchical classification depend on the choice of the clustering technique (linkage method) and the initial dissimilarity index used to calculate the pairwise dissimilarity between objects, so these choices should be made with care. The purpose of the analysis, the nature of the data and the standardisations of the data all play a role in determining the optimum clustering technique (Quinn and Keough, 2002), taking note that the choice of linkage method is more critical than the choice of the dissimilarity measure (Vakharia and Wemmerlöv, 1995). For ecological studies, the group mean (or Average) linkage technique, also known as the unweighted pair-group method using arithmetic averages (UPGMA), based on Bray-Curtis dissimilarity has been a prominent technique for some decades, as noted by Clarke and Ainsworth (1993), and also falls within the recommendation of Quinn and Keough (2002). When data are in the form of species abundance the problem of "double zeros" normally exists, that is, a species can be absent from two sites. "If a species is absent from two sites, then these two sites are either both above or both below the optimal niche value for that species, or one above and one below that value" (Legendre, 1998). Thus clear indications about the ecological preferences of the species cannot be reached in these circumstances and ecological conclusions should not be drawn. Therefore, dissimilarity coefficients that do not treat sampling units as similar merely because the same species are absent from both are recommended. Coefficients of this type are called asymmetric, as they treat zeros in a different way and skip double zeros altogether when computing dissimilarities (Legendre, 1998). Bray-Curtis is one such asymmetric quantitative coefficient, together with others such as Kulczynski and Canberra, and the exclusion of double zeros from the comparison makes it preferable for ecological studies (Legendre, 1998). On the other hand, Euclidean and Chi-square are also good measures of dissimilarity if the data do not have zeros (Quinn and Keough, 2002). Other reasons why the Bray-Curtis coefficient is preferred are that the inclusion of a third sample does not affect the similarity between two initial samples, and that its value is unchanged by the inclusion or exclusion of a species which is jointly absent from two samples (Clarke and Warwick, 2001). If the data are first normalised then the use of correlation as a dissimilarity measure may be appropriate for species associations (Legendre, 1998). Correlation distance is used more in the analysis of species than of sites since it incorporates a type of row standardisation (Clarke and Warwick, 2001). This, however, does not remove the problem of double zeros, but the problem can be minimised by eliminating rare species from the analysis (Legendre, 1998).
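The different treatment of double zeros by the two distance measures can be illustrated with a small numerical sketch. The abundance vectors below are made up for the example, and the correlation distance is taken to be 1 minus the Pearson correlation, which is one common convention and an assumption here rather than necessarily the exact definition used later in this study.

```python
from math import sqrt


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def correlation_distance(x, y):
    """1 - Pearson correlation (assumed convention for this sketch)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1 - sxy / sqrt(sxx * syy)


# Hypothetical abundance profiles with four entries each.
s1 = [5, 0, 3, 1]
s2 = [2, 4, 0, 1]
# The same profiles with two extra entries that are zero in both (double zeros).
s1_dz = s1 + [0, 0]
s2_dz = s2 + [0, 0]

print("Bray-Curtis:", bray_curtis(s1, s2), bray_curtis(s1_dz, s2_dz))
print("Correlation:", correlation_distance(s1, s2), correlation_distance(s1_dz, s2_dz))
```

In this example the Bray-Curtis value is identical with and without the joint absences, whereas the correlation distance drops once the double zeros are appended, so the two profiles appear more alike purely because of shared absences. This is the behaviour that motivates removing rare species before using correlation distance.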

The ecological literature has a vast number of studies on fish assemblages, ranging across various types of fisheries and habitats. Analyses of demersal fish assemblages in the Northern region include: the Galician continental shelf and upper slope, north-west Spain (Fariña et al., 1997); the eastern Norwegian Sea (Bergstad et al., 1999); the north-east Newfoundland/Labrador shelf (Gomes and Richard, 1995); the Flemish Cap (González-Troncoso et al., 2006); the Faroe Banks (Magnussen, 2002); the east coast of North America (Mahon et al., 1998); the west and east Greenland continental shelf and slope (Rätz, 1999); and the Portuguese continental margin (Sousa et al., 2005). Most of these studies relate the spatial and temporal patterns of species assemblages to possible environmental variables that could explain these structures. To ensure consistency of the results, outputs from at least two multivariate analytical techniques are generally compared in most studies. For example, Mahon et al. (1998), Medina et al. (2007) and González-Troncoso et al. (2006) compare PCA to hierarchical clustering, Francis et al. (2002) and Sousa et al. (2005) compare CA and hierarchical clustering, and Lee and Sampson (2000) look at DCA and hierarchical clustering. Some studies, such as Brazner and Beals (1997) and Massuti and Moranta (2003), among others, complement results obtained from clustering with MDS. All these studies report consistent results from the different techniques used. However, cluster validation or comparison of techniques was not the underlying objective of these studies. With some exceptions, no justification is provided for the choice of techniques used. Generally such studies are more focused on the biological aspects of analysis and interpretation than on the reliability of the techniques used. It is therefore not clear whether consistency is a general feature or only present between the two methods chosen in each of those analyses.

Numerous studies have focused on testing the efficiency and stability of various hierarchical clustering techniques and, in turn, on trying to determine the best linkage method for the data set being evaluated. Some of these include Datta and Datta (2003); Gauch Jr and Whittaker (1981); Hennig (2007); Loganantharaj et al. (2006); Milligan and Cooper (1987); Scheibler and Schneider (1985) and references therein. Quinn and Keough (2002) and Cao et al. (1997a) also give further citations. The majority of studies on cluster validation are based on non-ecological data. Scheibler and Schneider (1985), for example, used Monte Carlo tests to examine the accuracy of a wide range of hierarchical and non-hierarchical clustering techniques, showing that Ward's linkage was the most robust among the hierarchical classification techniques examined. Some recent studies, such as Hennig (2007), also use simulations to test the stability of clustering techniques, based on external validation criteria such as Jaccard's coefficient.

Cluster validation studies are fairly limited in the field of ecology. One study, conducted by Cao et al. (1997a), compared the performance of three hierarchical linkage methods, UPGMA, Complete and Ward's linkage, and TWINSPAN on river benthic community data. Contrary to the general recommendation, they found that Ward's clustering technique produced the best result. Nonetheless, the choice of dissimilarity measure also plays a role. Ward's linkage needs Euclidean distance (Vakharia and Wemmerlöv, 1995) and this distance measure is known to strongly overweight abundant species, even after data transformation (Cao et al., 1997a). In their study, Cao et al. (1997a) broaden the use of Ward's linkage and apply it to a new dissimilarity measure, namely the CY dissimilarity measure, proposed by Cao et al. (1997b). Ward's technique has generally been applied and recommended for non-ecological studies. Since ecological patterns in multivariate data are normally not known a priori, assessing the patterns recovered by a clustering method is difficult. This shortcoming has been addressed by some studies through the use of simulated data. One such study, by Gauch Jr and Whittaker (1981), compared hierarchical classification for simulated community data and field data. They showed that UPGMA did not perform very well in separating the predetermined plant communities in comparison to TWINSPAN and Complete linkage. In contrast, Belbin and McDonald (1993) found that flexible-UPGMA performed better than TWINSPAN when tested on simulated community data.

Even though some studies suggest that sampling effort could have a significant effect on multivariate analyses, this has seldom been investigated (Cao et al., 2002a,b). Cao et al. (2002a) investigated the effect of sampling effort on the similarity/dissimilarity measures, as opposed to the clustering technique, with the justification that these are fundamental to cluster analysis. Their study illustrated that increasing sampling effort significantly improved the site separation in techniques such as cluster analysis and ordination, since more samples improve the estimate of the similarity between objects, resulting in a clearer separation between groups. Additionally, decreasing sampling effort or insufficient sampling can have an effect on the observed community structure, as fewer species are caught and recorded in smaller samples (Riecken, 1999). As such, testing the effect of sample size on the observed ecological communities appears worthwhile.

The two important aspects of cluster validation involve testing the efficiency and the stability of a method. Validation techniques used for testing efficiency, or the goodness-of-fit of the clustering, can be broadly classified into external, internal and relative criteria. A concise account of these is given by Halkidi et al. (2002b,a). In short, an external criterion for cluster validation involves comparing the clustering to a predefined structure. Statistical indices such as the Rand statistic, the Jaccard coefficient, Hubert's statistic and the Fowlkes and Mallows index are used for this criterion. A relative criterion is based on certain assumptions and parameters and involves comparing the obtained classification to other clustering schemes. Some of the statistics used in this criterion are the Dunn family of indices, the modified Hubert's statistic and the Davies-Bouldin index, among others (Halkidi et al., 2002a). An internal criterion, on the other hand, relies on the inherent features of the data to evaluate the clustering structure (Halkidi et al., 2002b), such as the initial dissimilarity patterns between the objects. This is particularly useful if no prior information about the definitions in the data is available. One such criterion is the Cophenetic Correlation Coefficient (CPCC), also referred to as a matrix correlation or the standardized Mantel statistic, proposed by Sokal and Rohlf (1962). The hierarchical clustering procedure produces a total dissimilarity matrix known as the cophenetic matrix. The correlation between this cophenetic matrix and the original dissimilarity matrix on which the clustering was carried out is the CPCC (Lessig, 1972). A high correlation shows that the clustering technique did not distort much of the information contained in the original dissimilarity matrix. This criterion of cluster validation has been applied in several studies for evaluating clustering efficiency (Farris, 1969). Such evaluations also include Gauch Jr and Whittaker (1981); Li (1990); Rodrigues and Diniz-Filho (1998). However, some studies such as Farris (1969); Rohlf and Fisher (1968); Phipps (1971) have questioned the reliability of this index of cluster validity. Another criterion is the agglomerative coefficient (AC), proposed by Kaufman and Rousseeuw (1990). This criterion is based on the clustering structure itself, as found by the clustering algorithm, and is normally used to assess the strength and quality of the clustering (Rodrigues and Diniz-Filho, 1998; Hasan and Masumoto, 1999; Lesage et al., 1999).

The second aspect of cluster validation, stability, normally refers to whether the clusters remain constant irrespective of changes in the initial data set, such as taking subsamples or adding noise to the data (Hennig, 2007). Perhaps one of the disadvantages of hierarchical cluster analysis is the difficulty of verifying that the clusters are not just a result of random effects. This has been, in some ways, overcome by the bootstrap technique. Bootstrapping is used to assess the uncertainty in hierarchical clustering by determining the probabilities of the obtained clusters. The stability and consistency of a cluster can therefore also be tested using the bootstrap (Hennig and Mathematik-SPST, 2005; Efron et al., 1996). Bootstrapping has also been applied in a variety of ways to assess the reliability of clusters (Efron et al., 1996; Handl et al., 2005; McKenna, 2003; Kerr and Churchill, 2001; Shimodaira, 2002; Suzuki and Shimodaira, 2004). The majority of such work has been done in the field of molecular genetics, some in a rather elaborate manner, and studies such as Bolshakova et al. (2005) have developed specific software for cluster validation of DNA microarray data. This software can be used to validate a range of clustering techniques and incorporates various validation indices. However, bootstrapping of cluster analysis is seen less often among the numerous ecological studies conducted.

Given the drawbacks of cluster analysis which have been outlined earlier, ordination techniques such as MDS are sometimes preferred. An ordination is like a map of the objects in more than two dimensions, where the placement of the objects represents their similarity. Multidimensional scaling (MDS), also referred to as non-metric multidimensional scaling (NMDS), is, like clustering, based on similarities or dissimilarities between the objects. The procedure "scales objects based on a reduced set of new variables derived from the original variables" (Quinn and Keough, 2002). Other ordination techniques include Principal Co-ordinates Analysis, Correspondence Analysis (CA), Canonical Correspondence Analysis (CCA), Detrended Correspondence Analysis (DCA) and Principal Components Analysis (PCA), which is the longest-established ordination method (Clarke and Warwick, 2001). Ordination gives a more informative display when samples do not portray a strong grouping (Clarke and Warwick, 2001). Clarke and Warwick (2001) suggest that cluster analysis be used in conjunction with ordination, even if the samples are strongly grouped. Gauch Jr and Whittaker (1981) argue that in community ecology, data are relatively continuous with samples relatively evenly spaced, and the data are not naturally clustered. Consequently, clustering may impose clusters which are not intrinsic to the data. Therefore they suggest that non-hierarchical and ordination techniques have advantages over hierarchical techniques in such cases. NMDS is normally recommended as one of the best ordination techniques (Quinn and Keough, 2002; Clarke and Warwick, 2001; Clarke and Ainsworth, 1993) due to its flexibility. It can be applied in conjunction with a wide range of dissimilarity measures and does not rely on any particular response model between species and underlying ecological gradients (Legendre, 1998).

Examining complex patterns in community structures can be considerably demanding. When the data are extensive and structures are abstract, a clear visualisation of patterns in a graphical format can be particularly beneficial for understanding and interpretation. The heatmap is one such visualisation technique that is a useful data exploratory tool and has been applied widely in the field of genetics for studying patterns in complex DNA microarray data (Pryke et al., 2006; Hastie et al., 2001; Zhang et al., 2003; Eisen et al., 1998; Quackenbush, 2007). This conceptualisation deals with assigning colours to each data point that "quantitatively and qualitatively reflects the original experimental observations" (Eisen et al., 1998), which is much more interpretable and informative than reading numbers. Visualisation can also be used as a measure of quality of the solutions (Pryke et al., 2006). Although these visualisation techniques have been applied extensively in taxonomical studies, ecologists have refrained from using them for exploring community structures. Here an attempt is made to give an informative representation of the species-area relationship through a heatmap.


1.1 Purpose of the study

Hierarchical agglomerative cluster analyses have been widely applied in the field of ecology. However, the robustness of the techniques used is seldom examined. The primary emphasis of this study was to address the methodological and statistical aspects of clustering procedures. Secondarily, the biological aspects of the estimation of fish assemblages were also addressed.

The robustness of three hierarchical agglomerative clustering techniques, namely Average linkage or Unweighted Pair-Group Method using Arithmetic Averages (UPGMA), Complete linkage and Ward's linkage, was examined for identification of fish assemblages. These are the most commonly used linkage methods in ecology. This study was based on Icelandic groundfish survey data for the period 1998 to 2007.

The objective criteria used for assessing the cluster validity or efficiency were the Cophenetic Correlation Coefficient (CPCC) and the Agglomerative Coefficient (AC). In order to test the reliability of the clustering methods, the probability values for the clusters were determined through bootstrap resampling. As a measure of the stability and consistency of the methods, their performances were examined across different sample sizes and different levels of data smoothing (data aggregation).

As a secondary aim, it was explored whether different data standardisation methods and dissimilarity measures played a significant role in determining multivariate patterns, in this context the species assemblages. Thus the above analyses were carried out using two modes of data analysis. These differed in the combination of (1) data transformation and standardisation and (2) the dissimilarity measure used to obtain the matrix of dissimilarities before the clustering. For each mode of data analysis, relative comparisons were made between the three linkage methods in order to determine which hierarchical agglomerative clustering technique was conditionally most robust, and thus potentially most suitable for the data being studied. Furthermore, NMDS was used as an external subjective criterion to compare and verify the fish assemblages obtained from hierarchical cluster analysis.

Furthermore, after the identification of the most robust linkage method, it was important to examine if the species assemblages obtained from that method were ecologically meaningful. Thus the identified assemblages were examined in relation to two environmental variables, depth and geographic distribution. These two variables were hypothesised to be influential in determining the species associations.

A classification of the fishing areas was also carried out to determine similar habitats. This was carried out in line with the two modes of data analysis and the three linkage methods in order to examine the consistency of the outcomes. A visualisation technique, the "heatmap", was then used to give a more informative display of the patterns in the community structures by giving a pairwise display of the two classifications of species and areas (statistical rectangles).

2 Statistical Theory

2.1 Hierarchical agglomerative clustering

All hierarchical agglomerative clustering procedures begin with an initial dissimilarity matrix between the objects. At the start of the agglomerative process each object is considered as a separate class or cluster. For a set of N initial objects, the first clustering will result in N-1 clusters, the next N-2 and so on until only one cluster contains all the objects, with objects which are most similar fusing together at each step. How the distance between the new cluster and the remaining objects is computed is determined by the clustering algorithm being used (Gordon, 1999). A general equation proposed by Lance and Williams (1967) and outlined in Scheibler and Schneider (1985) describes how the various hierarchical algorithms compute this distance:

$$d_{hk} = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \lambda \left| d_{hi} - d_{hj} \right| \tag{2.1}$$

where:

$d_{ij}$ denotes the Euclidean distance between the entities $i$ and $j$ which have been combined to form a new cluster $k$

$d_{hk}$ denotes the Euclidean distance between a remaining entity $h$ and the new cluster $k$

$\alpha_i$, $\alpha_j$, $\beta$ and $\lambda$ are parameters that depend on the clustering method being used and are outlined in Table 2.1 below for the three methods considered here.

Cluster Method   α_i                        α_j                        β                     λ
Average          n_i / n_k                  n_j / n_k                  0                     0
Complete         0.5                        0.5                        0                     0.5
Ward's           (n_h + n_i) / (n_h + n_k)  (n_h + n_j) / (n_h + n_k)  -n_h / (n_h + n_k)    0

Table 2.1: Parameter values for the clustering algorithms used in this study

where:

$n_i$ is the number of entities in cluster $i$ of the preceding partition

$n_j$ is the number of entities in cluster $j$ of the preceding partition

$n_k$ is the number of entities in the new cluster $k$ ($n_k = n_i + n_j$)

$n_h$ is the number of entities in the remaining cluster $h$ for which the distance to cluster $k$ has to be recomputed.
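The update rule in equation (2.1) can be sketched in a few lines of code. The following is an illustrative Python sketch, not the R routines used in this study; the function name `lance_williams` and its argument names are hypothetical. It assumes the parameter values of Table 2.1 and, for Ward's method, that the recurrence is applied to squared Euclidean distances, which is the usual convention.

```python
def lance_williams(d_hi, d_hj, d_ij, n_h, n_i, n_j, method):
    """Distance from a remaining cluster h to the new cluster k = i + j.

    Hypothetical helper illustrating equation (2.1); n_h, n_i and n_j are
    the numbers of entities in clusters h, i and j.
    """
    n_k = n_i + n_j
    if method == "average":
        a_i, a_j, beta, lam = n_i / n_k, n_j / n_k, 0.0, 0.0
    elif method == "complete":
        a_i, a_j, beta, lam = 0.5, 0.5, 0.0, 0.5
    elif method == "ward":
        # Assumption: inputs are squared Euclidean distances for Ward's method.
        a_i = (n_h + n_i) / (n_h + n_k)
        a_j = (n_h + n_j) / (n_h + n_k)
        beta, lam = -n_h / (n_h + n_k), 0.0
    else:
        raise ValueError("unknown method: " + method)
    return a_i * d_hi + a_j * d_hj + beta * d_ij + lam * abs(d_hi - d_hj)


# Complete linkage reduces to the larger of the two distances.
complete_example = lance_williams(2.0, 5.0, 3.0, 1, 1, 1, "complete")
# Average linkage weights each distance by the size of its cluster.
average_example = lance_williams(2.0, 5.0, 3.0, 1, 1, 3, "average")
```

With these parameters the Complete linkage update returns the maximum of the two input distances, which matches the description of the method as depending on the most distant pair of objects.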

The output from the analyses is represented as hierarchical trees, or dendrograms. A general description of the three methods evaluated in this study is given below.

2.1.1 Average linkage (UPGMA)

In this method, after the two objects with the least dissimilarity fuse together, the arithmetic average of the dissimilarities between this new cluster and each of the remaining objects is calculated. This leads to a reduction in the size of the original dissimilarity matrix. The procedure then continues with the dissimilarity matrix being correspondingly reduced. When the average between an object and a cluster is calculated, the method gives equal weights to the members of the cluster, and is thus called unweighted. Thus, in the progressive reduction of the dissimilarity matrix, only relationships between groups are considered, which are given equal weighting, and this leads to a loss of information about the relationships between pairs of objects (Legendre, 1998).

2.1.2 Complete linkage

The fusion of the clusters depends on the most distant pair of objects as opposed to the closest. An object can join a cluster only when it is linked to all objects present in the cluster. Two clusters can only fuse when all members of the first cluster are related to all objects of the second cluster, hence it becomes more difficult for objects to join a cluster. This, however, creates clusters with clear discontinuities (Legendre, 1998).

2.1.3 Ward's linkage

This method is also referred to as Ward's minimum variance method. The procedure minimizes the sum of squares to form clusters, thus it is also referred to as the incremental sum of squares method. The procedure initially considers each object as a cluster on its own, so the distance of the object to its cluster centroid is 0. The centroid of a cluster is the average of the coordinates of the objects in the cluster. As the clusters form, the centroids move away from the actual object coordinates and the sum of squared distances between the objects and the centroids increases. The distance of the object to its cluster centroid is calculated using the Euclidean distance formula. At each clustering step, the pair of clusters identified for fusion is the one that minimizes the sum of squared distances over all objects. The dendrogram is normally represented in squared distances.

2.2 Non-Metric Multidimensional Scaling (NMDS)

The process begins with an ordination (scaling) of the objects in full-dimensional space and then represents them in few dimensions while the distance relationships between objects are retained as much as possible. The main objective of NMDS is to plot dissimilar objects far apart in the ordination space and similar objects close to one another. An initial distance matrix is calculated using an appropriate distance measure for the data. A configuration of the objects is constructed in a specified number of dimensions, which goes through an iterative algorithm to calculate a matrix of fitted distances in the ordination space, mostly using Euclidean distance. The solution depends on the initial positions of the objects, so the choice of the original dissimilarity measure is important. The fitted distances are then compared to the original distances through regression, and the corresponding scatter plot is known as the Shepard diagram. The goodness-of-fit of the regression is evaluated by the use of the sum of squares from the regression analysis. These are known as the stress values and the fit is considered good if the stress value is less than 0.01 (Legendre, 1998).

$$\text{Stress} = \sqrt{\frac{\sum_{h,i} (d_{hi} - \hat{d}_{hi})^{2}}{\sum_{h,i} d_{hi}^{2}}}$$

where:

$d_{hi}$ are the fitted distance values

$\hat{d}_{hi}$ are the values forecasted by the regression between the fitted distances $d_{hi}$ and the original distances
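As a worked illustration of the stress formula, the short Python sketch below evaluates it for two lists of distances. The function name `stress` and the example values are hypothetical; the sketch assumes that the fitted distances and the values forecasted by the regression are already available, and it does not perform the regression or the iterative NMDS fitting itself.

```python
import math


def stress(fitted, forecast):
    """Square root of the residual sum of squares, scaled by the sum of
    squared fitted distances. Hypothetical helper, not the isoMDS code."""
    numerator = sum((d - d_hat) ** 2 for d, d_hat in zip(fitted, forecast))
    denominator = sum(d ** 2 for d in fitted)
    return math.sqrt(numerator / denominator)


# Illustrative values only: one pair of objects is forecast one unit too low.
example_stress = stress([3.0, 4.0], [3.0, 3.0])
```

A configuration whose fitted distances agree exactly with the forecasted values gives a stress of zero, and the value grows as the two sets of distances diverge.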

3 Methodology

3.1 Icelandic Groundfish Survey

The Icelandic groundfish survey was instigated in 1985 and has been conducted in March every year since by the Marine Research Institute. The survey area, which consists of the Icelandic continental shelf inside the 500 meter depth contour, is divided into statistical rectangles. Each statistical rectangle represents one half degree latitude and one degree longitude, on which the stratification scheme is based. Statistical rectangles are further divided into 4 subrectangles. The stratification system in the survey design, used to define the locations of tows (stations), was based on the density of cod found in the area. These density patterns, estimated by statistical rectangles, were calculated from catch data from commercial and research vessels prior to the survey design. For analysis, the survey area is divided into a northern and southern area and ten strata based on biological and hydrographic considerations. The allocation of stations to strata is directly proportional to the area of the stratum and its estimated cod density (Pálsson et al., 1989). Figure 3.1 shows the survey area, the statistical rectangles and the approximate locations of the stations.

The sampling scheme can be classified as semi-random stratified (Pálsson et al., 1989), as half the stations were randomly chosen by the research team of the institute whereas the other half were chosen by fishermen who had knowledge and experience of fishing and the fishing grounds. The design, however, is systematic since the same stations are covered every year (Pálsson et al., 1989). Five commercial vessels are leased every year to carry out the survey within the restricted time frame of 2-3 weeks. Emphasis is placed on standardizing the fishing methods as far as possible. The towing speed is fixed at 3.8 knots over the bottom and the towing distance is 4.0 nautical miles.

Figure 3.1: Icelandic groundfish survey area within the 500 meter contour line, outlining the statistical rectangles and the locations of the stations

3.2 Data

The survey targets all major commercial demersal fish species within the survey area. The criterion used for identifying the species to be included in the current analysis was the frequency of occurrence of the species in the overall number of samples. Species which appeared in more than 5% of the total number of samples were analysed. This comprised 40 species. Rare species were excluded as they could confuse patterns in multivariate analysis if left in the similarity matrix, since they typically have only single sporadic occurrences at variable sites, without apparent structure (Clarke and Warwick, 2001).

Data for the period 1998-2007 were analyzed. The raw data used for analysis consisted of abundance in numbers by species, year, station, statistical square, sub-square, depth, latitude and longitude of the stations. The original matrix of abundance had species arranged in columns and each row corresponded to a single

The data were appropriately standardized (for each method) and transformed before analysis. For data on species abundance, standardizing reduces the strong weighting and influence of a few highly abundant species. It is important to make all species have similar importance so that uncommon species also contribute to the dissimilarities. Standardization also reduces the effect of different total abundance in different sampling units, which is important when comparing sites.

3.3 Hierarchical cluster analysis - Species Assemblages

The data analyses consisted of two main parts (Analysis I and II), based on different data standardisations and dissimilarity measures, which are described below.

3.3.1 Analysis I: Correlation distance

For this distance measure, the data were first transformed to fourth root and then scaled to mean 0 and variance 1 before carrying out the analysis. The distributions of the data before and after transformation are outlined in Figure 3.2 for four abundant species. The dissimilarity measure used was 1 - Correlation. This coefficient best measures linear relationships between standardized (zero mean and unit variance) variables (Quinn and Keough, 2002). Since the data were centered (zero mean), the Uncentered Pearson's Correlation Coefficient was used, subsequently modified to dissimilarity by subtracting from 1:

$$\frac{\sum_{i=1}^{n} x_{ij} x_{ik}}{\sqrt{\sum_{i=1}^{n} x_{ij}^{2} \sum_{i=1}^{n} x_{ik}^{2}}}$$

where $x_{ij}$ and $x_{ik}$ represent the abundance of the $j$th and $k$th species at site $i$.
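The transformation and the correlation distance of Analysis I can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the study; the function names and the toy abundances are hypothetical. It assumes that scaling to unit variance uses the sample variance with an n - 1 denominator, as R's `scale` function does; the text does not state which denominator was used.

```python
import math


def fourth_root_scaled(abundance):
    """Fourth-root transform, then scale to mean 0 and variance 1."""
    t = [v ** 0.25 for v in abundance]
    n = len(t)
    mean = sum(t) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in t) / (n - 1))
    return [(v - mean) / sd for v in t]


def correlation_dissimilarity(xj, xk):
    """1 minus the uncentered Pearson correlation between two species."""
    numerator = sum(a * b for a, b in zip(xj, xk))
    denominator = math.sqrt(sum(a * a for a in xj) * sum(b * b for b in xk))
    return 1.0 - numerator / denominator


# Toy abundances of two species at four sites (illustrative values only).
species_j = fourth_root_scaled([0, 1, 16, 81])
species_k = fourth_root_scaled([81, 16, 1, 0])
opposite = correlation_dissimilarity(species_j, species_k)
identical = correlation_dissimilarity(species_j, species_j)
```

Because the scaled values are negative for below-average abundances, this dissimilarity ranges from 0 for perfectly correlated species to 2 for perfectly negatively correlated species.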

3.3.2 Analysis II: Bray-Curtis distance

The second distance measure tested was the Bray-Curtis. The data were transformed to fourth root and standardized by range, which is one suitable standardisation for this distance measure (Quinn and Keough, 2002). The Bray-Curtis measure of dissimilarity could not be applied to the earlier data standardisation as it does not accept negative values (Quinn and Keough, 2002), which are generated when the data are scaled. Figure 3.3 outlines the distribution of the data before and after transformation for four abundant species. The Bray-Curtis coefficient compares two species in terms of their minimum abundance at each site:

$$\frac{\sum_{i=1}^{p} 2\min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} (x_{ij} + x_{ik})} \tag{3.1}$$

where $x_{ij}$ and $x_{ik}$ represent the abundance of the $j$th and $k$th species at site $i$.

The dissimilarity coefficient is calculated by subtracting the similarity, expressed as a percentage, from 100.
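Equation (3.1) can be illustrated with a small worked example. The Python sketch below is illustrative only and is not the `vegdist` implementation; the function names and abundances are hypothetical, and the inputs are assumed to be non-negative values that have already been transformed and standardized.

```python
def bray_curtis_similarity(xj, xk):
    """Shared abundance of two species across sites, as in equation (3.1)."""
    shared = sum(2 * min(a, b) for a, b in zip(xj, xk))
    total = sum(a + b for a, b in zip(xj, xk))
    return shared / total


def bray_curtis_dissimilarity(xj, xk):
    """Dissimilarity on a 0 to 100 scale (100 minus percentage similarity)."""
    return 100.0 - 100.0 * bray_curtis_similarity(xj, xk)


# Toy abundances of two species at three sites (illustrative values only).
similarity = bray_curtis_similarity([1, 0, 3], [1, 2, 1])
dissimilarity = bray_curtis_dissimilarity([1, 0, 3], [1, 2, 1])
```

In this example the minimum abundances sum to 2 and the total abundance to 8, giving a similarity of 0.5 and a dissimilarity of 50.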

3.3.3 Data Analyses

The statistical software R was used to carry out all the analyses.

For each mode of analysis (Analysis I: Correlation distance and II: Bray-Curtis distance) the three hierarchical clustering methods, Average, Complete and Ward's, were applied. For each method three levels of data aggregation were tested: (i) raw data including all stations and years, (ii) data aggregated by station by taking an average across years and (iii) data aggregated by subrectangles by taking an average across years and stations.

The effect of sample size was tested by taking subsamples of the data. A total of 5352 tows were available initially. Subsamples of 50%, 25% and 10% of the original tow collection were taken. These subsamples were generated randomly while maintaining the design and relative station density of the survey. Clustering was done on each subsample for the two modes of analysis.

The cluster analysis was carried out using the pvclust routine in package pvclust, which assesses the uncertainty in the clustering through the bootstrap resampling technique. A thousand bootstrap replications were run for each cluster. Two types of probability values are computed in parallel by the routine, i.e. the approximately unbiased (AU) p-value and the bootstrap probability (BP) value. The AU p-value is generated through multiscale bootstrap resampling and has asymptotic superiority in bias over the BP value (Suzuki and Shimodaira, 2006). The BP value of a cluster, which is calculated by ordinary bootstrap resampling, is the frequency with which it appears in the bootstrap replicates. A detailed account of these computations is given by Shimodaira (2008).
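The idea behind the BP value can be sketched with a small example. The Python code below is a simplified illustration and not the pvclust algorithm: it assumes Euclidean distance and Average linkage, resamples sites with replacement, and reports how often a given group of species reappears as a cluster. All function names and the toy data are hypothetical, and the multiscale bootstrap needed for the AU p-value is not shown.

```python
import math
import random
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_distance(profiles, g1, g2):
    """Mean pairwise distance between the members of two groups."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in g1 for b in g2)
    return total / (len(g1) * len(g2))


def clusters_formed(profiles):
    """Average-linkage clustering; returns every cluster created by a merge."""
    groups = [frozenset([name]) for name in profiles]
    formed = set()
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2),
                     key=lambda pair: average_distance(profiles, *pair))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
        formed.add(g1 | g2)
    return formed


def bootstrap_probability(profiles, cluster, n_boot=1000, seed=1):
    """Share of bootstrap replicates (sites resampled) containing the cluster."""
    rng = random.Random(seed)
    n_sites = len(next(iter(profiles.values())))
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {sp: [vals[i] for i in idx] for sp, vals in profiles.items()}
        if frozenset(cluster) in clusters_formed(resampled):
            hits += 1
    return hits / n_boot


# Toy abundances of four species at six sites (illustrative values only).
toy = {
    "A": [10, 9, 10, 9, 10, 9],
    "B": [9, 10, 9, 10, 9, 10],
    "C": [1, 0, 1, 0, 1, 0],
    "D": [0, 1, 0, 1, 0, 1],
}
bp_ab = bootstrap_probability(toy, {"A", "B"}, n_boot=50)
bp_ac = bootstrap_probability(toy, {"A", "C"}, n_boot=50)
```

In the toy data species A and B are close at every site, so the cluster {A, B} appears in every replicate, whereas {A, C} never forms. Real survey data would give intermediate values.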

In R the Bray-Curtis measure of dissimilarity is implemented using the routine

vegdist in package vegan.

Figure 3.2: Distribution of the data (a) before and (b) after transforming to fourth root and scaling to mean 0 and variance 1, for four abundant species in the survey, as labelled. The histograms show the number of fish per tow collection (vertical axis: Frequency).

Figure 3.3: Distribution of the data (a) before and (b) after transforming to fourth root and standardising by range, for four abundant species in the survey, as labelled. The histograms show the number of fish per tow collection (vertical axis: Frequency).

3.4 Comparison of the hierarchical clustering techniques

One objective criterion used for comparison was the Cophenetic Correlation Coefficient (CPCC). The CPCC is a simple correlation coefficient between the original dissimilarity matrix and the cophenetic matrix, which is the total dissimilarity matrix produced after clustering, i.e. the distance at which two objects become members of the same cluster. This correlation therefore measures how well the clustering was able to maintain the original dissimilarity in the data. Pearson's correlation coefficient was used here. In order to test the effect of different sample sizes, the correlation was calculated between the cophenetic matrix for the various reduced sample sizes and the original dissimilarity matrix for all samples.
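The CPCC computation can be illustrated on a small dissimilarity matrix. The Python sketch below is illustrative only and is not the R code used in the study; it assumes Average linkage implemented naively, and the function names and the matrix `D` are hypothetical.

```python
import math
from itertools import combinations


def average_linkage_cophenetic(dist):
    """Cophenetic distances {(a, b): merge height} from naive UPGMA."""
    clusters = {i: [i] for i in range(len(dist))}
    coph = {}
    while len(clusters) > 1:
        best = None
        for p, q in combinations(sorted(clusters), 2):
            pairs = [(a, b) for a in clusters[p] for b in clusters[q]]
            height = sum(dist[a][b] for a, b in pairs) / len(pairs)
            if best is None or height < best[0]:
                best = (height, p, q)
        height, p, q = best
        # Objects joined at this step become cophenetically `height` apart.
        for a in clusters[p]:
            for b in clusters[q]:
                coph[(min(a, b), max(a, b))] = height
        clusters[p] = clusters[p] + clusters.pop(q)
    return coph


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cpcc(dist):
    """Correlation between original and cophenetic dissimilarities."""
    coph = average_linkage_cophenetic(dist)
    keys = sorted(coph)
    return pearson([dist[a][b] for a, b in keys], [coph[k] for k in keys])


# Toy dissimilarity matrix for four objects (illustrative values only).
D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
coph_D = average_linkage_cophenetic(D)
cpcc_D = cpcc(D)
```

In this example objects 0 and 1 merge at height 1, objects 2 and 3 at height 2, and the two groups join at 4.5, so the four between-group dissimilarities (4, 5, 3 and 6) are all replaced by 4.5 in the cophenetic matrix. The CPCC quantifies how much of the original pattern survives this replacement.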

Another objective criterion used was the agglomerative coefficient (AC), which measures the clustering structure found by a technique. "For each observation i, its dissimilarity to the first cluster it is merged with is divided by the dissimilarity of the merger in the final step of the algorithm, denoted by m(i). The AC is the average of all 1 - m(i)" (Maechler et al., 2005). The value ranges from 0 to 1, and the higher the AC the better. The AC was, however, not used to compare results for different sample sizes as the coefficient tends to increase with the number of observations. In R, the AC is computed using the routine agnes in package cluster.
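The definition of the AC quoted above can be written out directly. The Python sketch below is illustrative and is not the `agnes` implementation; the function name and the merge heights are hypothetical, and the heights at which each observation first merges are assumed to have been read from an existing dendrogram.

```python
def agglomerative_coefficient(first_merge_heights, final_height):
    """Average of 1 - m(i), where m(i) is the height of the first merge of
    observation i divided by the height of the final merge."""
    m = [h / final_height for h in first_merge_heights]
    return sum(1.0 - value for value in m) / len(m)


# Toy dendrogram: two observations first merge at height 1, two at height 2,
# and the final merger takes place at height 4.5 (illustrative values only).
ac = agglomerative_coefficient([1.0, 1.0, 2.0, 2.0], 4.5)
```

Here the average of m(i) is one third, so the AC is two thirds. Observations that merge early relative to the final merger push the coefficient towards 1.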

The definition of the clusters and their probability values were noted and compared. The significance threshold for the clusters was set at 0.9 for the AU p-value. The dendrograms were also visually compared for the presence of similar clusters across the different data smoothing levels and sample sizes.

Independent comparisons were made for Analysis I (Correlation distance) and Analysis II (Bray-Curtis distance) to examine which clustering method performed relatively better, for the two modes of analysis. The most robust method was then identified.

3.5 Comparison of hierarchical clustering with non-metric multidimensional scaling

A non-statistical approach was used to validate the results from the hierarchical agglomerative clustering. This was done by comparing it with non-metric multidimensional scaling (NMDS). Kruskal's non-metric multidimensional scaling routine isoMDS in package MASS was used. The procedure does not accept negative values for initial dissimilarities, hence it could not be applied to data scaled to mean 0 and variance 1. Thus the comparison could only be made with Analysis II: Bray-Curtis dissimilarity measure on fourth root transformed data scaled by range. NMDS plots the objects on an ordination diagram to look for groupings. These identified groups were then compared with the clusters formed by the hierarchical clustering. The stress values were used to examine the goodness-of-fit.

3.6 Fish Assemblages in relation to environmental variables

After the comparisons of the clustering techniques and the identification of the most robust linkage method, some biological interpretations were made on the identified species assemblages, for both Analysis I and II. It was tested whether the identified fish community structures could be related to two environmental variables, depth and geographic location of species.

For each species, weighted average depths $d$ and standard deviations $s_d$ were calculated by:

$$d = \frac{\sum n_s d_s}{\sum n_s}$$

$$s_d = \sqrt{\frac{\sum n_s (d_s - d)^{2}}{\sum n_s}}$$

where $n_s$ represents the abundance in numbers of the species at station $s$ and $d_s$ represents the depth at station $s$.
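The abundance-weighted depth statistics can be illustrated with a short worked example. The Python sketch below is illustrative only; the function name and the values are hypothetical, and it assumes that the weighted variance is divided by the total abundance, which is one common convention.

```python
import math


def weighted_depth(abundance, depth):
    """Abundance-weighted mean depth and standard deviation for one species."""
    total = sum(abundance)
    mean = sum(n * d for n, d in zip(abundance, depth)) / total
    # Assumption: the weighted sum of squares is divided by total abundance.
    variance = sum(n * (d - mean) ** 2 for n, d in zip(abundance, depth)) / total
    return mean, math.sqrt(variance)


# Toy example: 1 fish caught at 100 m and 3 fish at 200 m.
mean_depth, sd_depth = weighted_depth([1, 3], [100.0, 200.0])
```

The mean depth of 175 m lies closer to the station where most of the fish were caught, which is the purpose of weighting by abundance.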

A one-way Analysis of Variance (ANOVA) was carried out to examine any significant variability in mean depths among the identified fish assemblages. A Tukey multiple comparison test was then undertaken to determine between which treatment levels (assemblages) the actual differences lay.

Furthermore, the geographic distribution of each species was mapped. This was done by generating a bubble plot which shows the mean abundance of each species, averaged across all years, by statistical sub-rectangles. The sizes of the circles are proportional to the square root of the mean abundance. Any relationship between this and the identified assemblages was then examined in a non-statistical manner.

3.7 Habitat analysis

This part of the analysis entailed carrying out a classification of the areas within the Icelandic continental shelf. The areas were defined as the statistical subrectangles. An average of the species abundance in numbers was calculated for each subrectangle, generating a species-subrectangle matrix. This was essentially a transpose of the species-site matrix used for species assemblages. Clustering was then carried out on these data to determine the hierarchical classification of the areas. Classification was carried out using the three hierarchical linkage methods, for the two distance measures described above (Analysis I: Correlation distance and II: Bray-Curtis). The classifications obtained were mapped for clarity. For each identified cluster of areas, its species composition was also determined.

In a previous analysis described in Stefánsson and Pálsson (1997) it was inferred, based on the bathymetric and hydrographic structure of the Icelandic continental shelf, that some definition between the north and south areas and some depth divisions should be observed. The efficiency of the techniques was assessed against this hypothesis.

3.8 Heatmap

A heatmap was generated using the heatplot routine in package made4. This plots hierarchical dendrograms of objects and variables, in this context sites and species respectively, in a two-way rearrangement. The data were transformed to fourth root and scaled to mean 0 and variance 1 for this analysis. Here the default settings were used, which are clustering based on correlation dissimilarity and Average linkage (Culhane et al., 2005). This generated an image with a spectrum of colours indicating the strength of associations between the species and their corresponding areas of occurrence.

4 Results

4.1 Comparison of the three hierarchical clustering techniques

The results from the objective criteria for assessing the clustering techniques, CPCC

and AC, are outlined in Tables 4.1 and 4.2 respectively. Overall it was seen that

Average linkage gave the highest CPCC (0.82), followed by Complete (0.79) then

Ward's (0.76), although Complete linkage performed poorly with the full data set

(0.67). The AC was the highest for Ward's linkage (0.82) followed by Complete

(0.62) then Average (0.49).

The hierarchical clustering yielded by Average and Complete linkage, Figures 4.1a and 4.1b respectively, produced clusters at high dissimilarity levels. Ward's linkage, however, gave well-defined clusters forming at lower levels of dissimilarity. When the entire data set was used, this technique classified the species into 2 distinct significant groups (AU > 0.9; edges 37 and 38 in Figure 4.2). Edge refers to the cluster number, which is marked in green in the figures. A few significant groups of species were produced by the Average and Complete linkage. Overall, the probability of clustering was lower for the Complete linkage in comparison to the other two methods. The AU p-values, which are illustrated in blue in the figures, were used for comparison.

Clustering on the full data set provided inconsistent species assemblages across the three hierarchical clustering techniques. However, with some data smoothing, i.e. averaging the species abundance by stations and across years, the results were more consistent and comparable among the three clustering methods. Essentially four main species assemblages could be identified and these are portrayed in Figures 4.3a, 4.3b and 4.4 for Average, Complete and Ward's linkage respectively. Species such as Atlantic wolffish, moustache sculpin, lumpfish, long rough dab and snake blenny were inconsistent in clustering among the three linkage methods.

For this analytical method also, it was seen that Average linkage gave the highest

CPCC (0.87), followed by Complete (0.74) then Ward's (0.61) (Table 4.1). The AC

was the highest for Ward's (0.75) linkage followed by Complete (0.62) then Average

(0.44) (Table 4.2).

When the clustering was carried out on the full data set, the Average (Figure
4.5a) and Complete linkage (Figure 4.5b) produced clusters at high dissimilarity
levels. The Complete linkage in particular did not give a clear definition of clusters.
Ward's linkage gave well-defined clusters (Figure 4.6). The results among the three
linkage techniques were not consistent. With smoother data, the clustering structure
improved for Average and Complete linkage and the results across the three clustering
techniques were relatively more consistent. Similar groups of species could
be identified. The results from Average and Complete linkage were similar (Figures
4.7a, 4.7b) except that Average linkage produced some outlying observations. However,
the clustering structure between the constituent groups of species was different for
Ward's linkage (Figure 4.8).

Data                           Average        Complete       Ward's
                               I      II      I      II      I      II
Full data set                  0.82   0.87    0.67   0.74    0.75   0.61
Aggregated by stations         0.82   0.84    0.79   0.74    0.76   0.64
Aggregated by subrectangles    0.81   0.83    0.79   0.79    0.75   0.66
50% Subsample                  0.80   0.83    0.74   0.79    0.75   0.69
25% Subsample                  0.80   0.83    0.75   0.68    0.75   0.65
10% Subsample                  0.78   0.82    0.70   0.61    0.66   0.63

Table 4.1: Cophenetic Correlation Coefficient for Analysis I (Correlation distance) and II (Bray-Curtis distance)

Data                           Average        Complete       Ward's
                               I      II      I      II      I      II
Full data set                  0.49   0.44    0.62   0.62    0.82   0.75
Aggregated by stations         0.66   0.55    0.75   0.65    0.90   0.83
Aggregated by subrectangles    0.70   0.61    0.77   0.63    0.91   0.85

Table 4.2: Agglomerative Coefficient for Analysis I (Correlation distance) and II (Bray-Curtis distance)
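For orientation, the agglomerative coefficient can be computed from the merge heights of a dendrogram. The sketch below assumes the usual definition (for each object, the height at which it is first merged divided by the height of the final merge, with AC being the mean of one minus that ratio). The function name and the heights are hypothetical and are not taken from the survey data.

```python
def agglomerative_coefficient(first_merge_heights, final_height):
    """Mean of 1 - m(i), where m(i) = first merge height / final merge height."""
    ratios = [h / final_height for h in first_merge_heights]
    return sum(1.0 - m for m in ratios) / len(ratios)

# Hypothetical dendrogram of four species: two pairs join at heights 0.2 and
# 0.3, and the two pairs join each other at height 0.8.
first_heights = [0.2, 0.2, 0.3, 0.3]
ac = agglomerative_coefficient(first_heights, final_height=0.8)
```

Values near 1 arise when objects merge at heights that are small relative to the final merge, which corresponds to a clearly defined clustering structure.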

[Dendrogram images for Figures 4.1 to 4.8; each panel lists the species along one axis and Dissimilarity along the other.]

Figure 4.1: Dendrogram of species assemblage for the Icelandic Groundfish (IGF) survey 1998-2007 using (a) Average linkage and (b) Complete linkage, with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising all tow collections. The rectangles highlight the clusters with AU > 0.9. The AU values used for interpretation are indicated in blue and the cluster number (edge) is marked in green.

Figure 4.2: Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the clusters with AU > 0.9.

Figure 4.3: Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage, with correlation dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the identified species assemblages for comparison.

Figure 4.4: Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the identified species assemblages for comparison.

Figure 4.5: Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

Figure 4.6: Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

Figure 4.7: Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and standardised by range. The rectangles highlight the identified species assemblages for comparison.

Figure 4.8: Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and standardised by range. The rectangles highlight the identified species assemblages for comparison.

4.2 Sample size effect

For this part, Average linkage performed well down to a subsample of 25% with some
minor changes in the clustering structure of the species. Complete linkage, on the
other hand, gave unstable results, whereas Ward's linkage performed well down to the
10% subsample. Two main observations can be made in all three cases. The probability
values decreased with smaller sample size, leading to many clusters being insignificant,
and the CPCC for all linkage techniques generally decreased with decreasing
sample size (Table 4.1). Some more detailed observations for the three clustering
methods are outlined below.
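The subsampling step can be pictured as drawing a fixed fraction of the tows at random before repeating the clustering. The sketch below is a minimal Python illustration. Sampling without replacement, the seed, the function name and the tow identifiers are all assumptions made for the example.

```python
import random

def random_subsample(tows, fraction, seed=0):
    """Draw a random subsample of the tows without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    k = max(1, round(len(tows) * fraction))
    return rng.sample(tows, k)

# Hypothetical tow identifiers standing in for the survey tows.
tows = list(range(1, 201))
sizes = {f: len(random_subsample(tows, f)) for f in (0.5, 0.25, 0.1)}
```

Each subsample would then be passed through the same transformation, dissimilarity and linkage steps as the full data set, so that the resulting dendrograms and CPCC values can be compared.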

Average linkage

The 50% subsample gave very similar assemblage groupings to the total sample
size. Three clusters were identified at a dissimilarity of 1 (edge 34, 36 & 37 in Figures
4.1a and 4.9a). The 25% subsample gave similar results except that the species group
containing blue whiting, blueling and greater argentine clustered with a different
group of species (Figure 4.9b). At a subsample of 10%, the clusters containing cod
and greenland halibut (edge 36; Figure 4.10) were similar; however, the clustering for
the rest of the species changed. The probability values decreased with decreasing
sample size. The CPCC decreased from 0.82 for the largest sample to 0.78 for the
smallest sample (Table 4.1).

Complete linkage

Data aggregated by stations were used to compare the sample sizes in this case
as they gave relatively more consistent results. Additionally, the results obtained from
these were similar to the results obtained from the other two clustering techniques;
therefore this was considered more reliable for comparison. Reducing the sample
size had an effect on the assemblages obtained from this method. Even though the
results from the 25% subsample were similar (Figures 4.3b & 4.11b), the 50% subsample
gave some inconsistent results; for example, the cluster containing cod (edge 34; Figure
4.11a) had a different clustering structure. At 10% subsample the clustering was
significantly different (Figure 4.12). The probability values decreased with decreasing
sample size. The CPCC decreased from 0.79 for the largest sample to 0.70 for
the smallest sample (Table 4.1).

Ward's linkage

For the 50% and 25% subsamples the results were similar with lumpfish, skate
and dogfish being exceptions (Figures 4.2, 4.13a, & 4.13b). At 10% subsample
snake blenny was an exception to the general clustering structure (Figure 4.14).
The probability values of the clusters decreased significantly with fewer samples.
The CPCC values were consistent down to 25% subsample at 0.75 but decreased to
0.66 with a further reduction in the sample size (Table 4.1).

For this distance measure, Average and Ward's linkage performed relatively better
than Complete linkage.

Average linkage performed consistently at 50% subsample, although some species were
unstable in clusters (Figures 4.5a and 4.15a). Some inconsistencies were observed at
25% subsample; the overall structure was nevertheless similar (Figure 4.15b), but it
changed considerably at 10% sample size (Figure 4.16).

Complete linkage performed consistently at 50% subsample, although some species were
unstable in clusters (Figures 4.7b and 4.17a). The assemblages were considerably
different at 25% and 10% subsample (Figures 4.17b & 4.18).

Ward's linkage performed relatively well at 50% and 25% subsample, with some
exceptions (Figures 4.6, 4.19a and 4.19b). At 10% subsample the assemblages were
considerably different (Figure 4.20).

Here again, the CPCC values decreased gradually with decreasing sample size for
all techniques (Table 4.1) and the probability values of the clusters also decreased.

[Dendrogram images for Figures 4.9 to 4.20; each panel lists the species along one axis and Dissimilarity along the other.]

Figure 4.9: Dendrogram of species assemblage using Average linkage with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising (a) a 50% random subsample and (b) a 25% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

Figure 4.10: Dendrogram of species assemblage using Average linkage with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising a 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

Figure 4.11: Dendrogram of species assemblage using Complete linkage with correlation dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1, comprising (a) a 50% random subsample and (b) a 25% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.12: Dendrogram of species assemblage using Complete linkage with correlation dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and scaled to 0 mean and variance 1, comprising a 10% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.13: Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising (a) a 50% random subsample and (b) a 25% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.14: Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and scaled to 0 mean and variance 1, comprising a 10% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.15: Dendrogram of species assemblage using Average linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range, comprising (a) a 50% random subsample and (b) a 25% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.16: Dendrogram of species assemblage using Average linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range, comprising a 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

Figure 4.17: Dendrogram of species assemblage using Complete linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and standardised by range, comprising (a) a 50% random subsample and (b) a 25% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.18: Dendrogram of species assemblage using Complete linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by stations, fourth root transformed and standardised by range, comprising a 10% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.19: Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range, comprising (a) a 50% random subsample and (b) a 25% random subsample. The rectangles highlight the clusters with AU > 0.9.

Figure 4.20: Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consist of species abundance in numbers, fourth root transformed and standardised by range, comprising a 10% random subsample of the total tow collections. The rectangles highlight the clusters with AU > 0.9.

4.3 Data Aggregation (smoothing) effect

The level at which the data were aggregated had an effect particularly on Complete
linkage. With the full data set, the clusters were not very well-defined; the
definition improved with data smoothing, which also increased the probability slightly
(Figures 4.1b, 4.3b & 4.21b). The CPCC was considerably higher
for aggregated data than for the full data set. The CPCC for Average and Ward's
linkage did not show any considerable difference with data smoothing (Table 4.1).
The overall assemblage patterns for these two linkage methods were comparable
across different data aggregations, with some species being exceptions that moved
between the clusters. These are illustrated in Figures 4.1a, 4.3a & 4.21a for Average
linkage and Figures 4.2, 4.4 & 4.22 for Ward's linkage.

The dissimilarity levels at which the clusters formed were lower when the data
were aggregated by stations, for Average and Complete linkage. Further data aggregation
by subrectangles did not result in any significant changes in the clustering
levels. The AC values increased considerably when the data were aggregated by stations
for all three linkage methods. However, no considerable changes were observed
when data were further aggregated by subrectangles (Table 4.2).
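The aggregation (smoothing) step amounts to replacing the individual tows by a mean abundance per station, taken across all survey years. The sketch below is a minimal Python illustration for a single species. The record layout, station labels and abundances are hypothetical.

```python
from collections import defaultdict

def mean_abundance_by_station(records):
    """Average abundance over all tows (and hence years) at each station.

    records: iterable of (station, year, abundance) tuples for one species.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for station, _year, abundance in records:
        totals[station] += abundance
        counts[station] += 1
    return {station: totals[station] / counts[station] for station in totals}

# Hypothetical tows: two stations sampled in three survey years.
tows = [
    ("S1", 1998, 10), ("S1", 1999, 14), ("S1", 2000, 12),
    ("S2", 1998, 0), ("S2", 1999, 3), ("S2", 2000, 0),
]
station_means = mean_abundance_by_station(tows)
```

Aggregating by statistical subrectangle would follow the same pattern with the subrectangle as the grouping key instead of the station.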

The probability of clustering generally decreased with data smoothing. Ward's

linkage performed well across all three data aggregation levels with the highest

probability of clustering with the full data set, indicating the greatest consistency

in generated clusters across bootstraps.

The structure of the assemblages was sensitive to data aggregation for all three linkage
techniques, in particular for Complete linkage. The probability of the clusters
increased with increased data smoothing for all three linkage techniques. These are
illustrated in Figures 4.5a, 4.7a and 4.23a for Average linkage; Figures 4.5b, 4.7b
and 4.23b for the Complete linkage and Figures 4.6, 4.8 and 4.24 for the Ward's
linkage.

The CPCC increased for Complete and Ward's linkage but decreased slightly for
Average linkage (Table 4.1). The AC values increased with data smoothing for all
linkage techniques (Table 4.2) together with the probability values for the clusters.

[Dendrogram images for Figures 4.21 to 4.24; each panel lists the species along one axis and Dissimilarity along the other.]

Figure 4.21: Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage with correlation dissimilarity measure. Data consist of mean species abundance in numbers by statistical subrectangles, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the clusters with AU > 0.9.

Figure 4.22: Dendrogram of species assemblage using Ward's linkage with correlation dissimilarity measure. Data consist of mean species abundance in numbers by statistical subrectangles, fourth root transformed and scaled to 0 mean and variance 1. The rectangles highlight the clusters with AU > 0.9.

Figure 4.23: Dendrogram of species assemblage using (a) Average linkage and (b) Complete linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by statistical subrectangles, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

Figure 4.24: Dendrogram of species assemblage using Ward's linkage with Bray-Curtis dissimilarity measure. Data consist of mean species abundance in numbers by statistical subrectangles, fourth root transformed and standardised by range. The rectangles highlight the clusters with AU > 0.9.

Summary

For both Analysis I: Correlation distance and Analysis II: Bray-Curtis distance,
the following holds for the linkage methods. Average linkage always gave the highest
CPCC followed by Complete then Ward's linkage. Complete linkage was the most
sensitive method, giving indefinite patterns with the full data set and with deviations
in sample size. Average and Ward's linkage were to some extent sensitive to data
aggregation but less so than Complete linkage, and were more stable when sample size
was altered. The CPCC and the probability of clusters decreased with decreasing
sample size for all three linkage techniques.

Ward's linkage always gave the highest AC followed by Complete then Average
linkage. The AC always increased with data aggregation.

The Bray-Curtis distance measure worked better with aggregated data, yielding
higher p-values for the clusters. The Correlation distance measure worked better
with the full data set for Ward's linkage. This was based on the reliability of the
clusters in terms of their probability values.
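For reference, the Bray-Curtis dissimilarity between two species can be sketched as the sum of absolute abundance differences across sites divided by the sum of all abundances. The Python below is only an illustration with hypothetical data. The range standardisation shown, (x - min) / (max - min), is one common convention and is assumed here rather than taken from the analysis.

```python
def range_standardise(values):
    """Rescale values to the interval [0, 1] (assumed convention)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

# Hypothetical abundances of two species at three sites.
sp1 = [1, 0, 3]
sp2 = [0, 2, 3]
d12 = bray_curtis(sp1, sp2)          # (1 + 2 + 0) / (1 + 2 + 6)
scaled = range_standardise([2, 4, 6])
```

The measure is 0 for identical abundance profiles and 1 when the two species never occur at the same site.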

4.4 Comparison of hierarchical clustering with non-metric multidimensional scaling

The NMDS ordination of species with the Bray-Curtis dissimilarity measure resulted
in a high stress of 8.87% in three dimensions for the full data set. The ordination
was repeated for the two data aggregation levels and produced a stress of 7.85%
and 7.49% respectively. Ordination of the full data set did not produce any distinct
groupings (Figure 4.25a). With smoother data (aggregated by statistical subrectangles)
four species groups could be identified (Figure 4.25b), which were similar to
the outcome from the hierarchical clustering, particularly to Ward's linkage (Figure
4.24), on the same level of data aggregation.
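To make the stress values concrete, the sketch below computes Kruskal's stress (type 1) from a set of ordination distances and the corresponding fitted disparities, and expresses it as a percentage. It assumes this particular normalisation. The NMDS routine used for the analysis may define stress slightly differently, and the numbers here are hypothetical.

```python
import math

def kruskal_stress(distances, disparities):
    """Kruskal's stress-1 between ordination distances and disparities."""
    num = sum((d - dh) ** 2 for d, dh in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

# Hypothetical ordination distances and monotone-regression disparities.
stress = kruskal_stress([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
stress_percent = 100 * stress
```

A stress of zero means the rank order of the dissimilarities is reproduced exactly in the ordination space.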

[Ordination plots for Figure 4.25.]

Figure 4.25: Multidimensional scaling using Bray-Curtis distance measure for (a) the full data set (comprising all tow collections) and (b) data aggregated by statistical sub-rectangle. Species abundance in numbers was fourth root transformed and standardised by range.

4.5 Fish Assemblages in relation to environmental variables

The classifications from Ward's linkage, carried out on the full data set, were related
to the two environmental variables, depth and geographic location, to examine possible
ecological rationale for the assemblages obtained. Two discrete clusters were
obtained having high probabilities (AU > 0.9) (edge 37 & 38, Figure 4.2). These
clusters were further divided into two. Essentially, four species assemblages were
obtained. The first assemblage (A) comprised halibut, plaice, dab, monkfish,
lemon sole, witch, fourbeaded rockling, whiting and haddock, clustering at AU = 0.86.
The second assemblage (B) consisted of tusk, saithe, redfish, ling, norway haddock,
megrim, norway pout, blueling, blue whiting, greater argentine, skate, dogfish and
deepwater redfish at AU = 0.90. The third assemblage (C) was Atlantic wolffish,
moustache sculpin, thorny skate, vahl's eelpout, cod, spotted wolffish, lumpfish,
long rough dab and snake blenny at AU = 0.98. The fourth assemblage (D) comprised
greenland halibut, esmark's eelpout, polar sculpin, Atlantic sculpin, lycodes sp.,
arctic rockling, Atlantic poacher, longfin snailfish and polar cod at AU = 0.88. The Latin
names for the fish species are outlined in Table A.1 in the Appendix.

These assemblages could be related to environmental parameters such as depth
and geographic distribution of the species. The species that clustered together also
had similar geographical distributions. Assemblages A and B were characterised as
species found in the southern region (Figure 4.26). In relation to the mean depths of
the species, the first assemblage was defined as shallow to intermediate with depths
ranging from 50 m to 200 m. The second assemblage was defined as intermediate to
deep with a mean depth range of 180 m to 340 m. Assemblages C and D characterised
the northern region (Figure 4.26), where assemblage C was categorised as shallow
to intermediate with a 150 m to 250 m depth range and assemblage D was defined
as deep, ranging from 280 m to >400 m. This relationship between depth and the
identified assemblages is demonstrated in Figure 4.27 where the weighted depths
and standard deviations for each species are outlined and each species is assigned
to the respective cluster.
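A weighted depth of the kind plotted for each species can be sketched as a mean of tow depths weighted by the abundance caught in each tow, with a matching weighted standard deviation. The weighting by abundance is an assumption of this sketch, and the function name, depths and catches are hypothetical.

```python
import math

def weighted_depth_stats(depths, abundances):
    """Abundance-weighted mean depth and weighted standard deviation."""
    total = sum(abundances)
    mean = sum(d * w for d, w in zip(depths, abundances)) / total
    var = sum(w * (d - mean) ** 2 for d, w in zip(depths, abundances)) / total
    return mean, math.sqrt(var)

# Hypothetical tows: depth in metres and numbers caught of one species.
mean_depth, sd_depth = weighted_depth_stats([100, 200, 300], [1, 2, 1])
```

With this weighting, tows in which a species was not caught contribute nothing to its mean depth.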

The box and whisker plot in Figure 4.28a shows the data on which a one-way
ANOVA was performed to investigate statistical differences in the mean depths of
the species comprising the assemblages. The ANOVA showed that the mean depths
at which the assemblages occurred were significantly different (F = 41.282, df = 3, P
< 0.05). The Tukey multiple comparisons test showed that assemblages A, B and
D were significantly different from each other but assemblages B and C were not
significantly different (Figure 4.28b).
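The F statistic of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below shows the calculation in plain Python for hypothetical groups of species mean depths. It does not reproduce the thesis values, and the function name is illustrative.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical mean depths (in units of 100 m) for three assemblages.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_b, df_w = one_way_anova_f(groups)
```

With four assemblages the between-group degrees of freedom are 3, which matches the df reported above. A multiple comparisons test such as Tukey's is then needed to see which pairs of assemblages differ.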

Average linkage gave similar assemblages when applied to data aggregated by stations,
although some species from assemblage C became a part of assemblage D and
skate and dogfish moved to cluster A (Figure 4.3a). Complete linkage gave similar
clusters with long rough dab, snake blenny, skate and dogfish being exceptions. The
probabilities of the clusters were slightly lower than for Average linkage (Figure 4.3b).
Three assemblages were identified by Ward's linkage on data aggregated by statistical
subrectangles. The first assemblage (A*) comprised halibut, plaice, dab,
monkfish, lemon sole, fourbeaded rockling, whiting, witch, tusk, saithe, redfish, ling,
norway haddock, megrim, norway pout, blueling, blue whiting, greater argentine,
skate and dogfish with an AU = 0.94. The second assemblage (B*) comprised cod,
spotted wolffish, vahl's eelpout, moustache sculpin, long rough dab, thorny skate,
lumpfish, haddock and atlantic wolffish with a probability of 0.94, and the third
assemblage (C*) consisted of deepwater redfish, polar cod, greenland halibut, lycodes
sp., arctic rockling, longfin snailfish, atlantic poacher, atlantic sculpin, polar sculpin
and esmark's eelpout with an AU = 0.88 (Figure 4.24).

The relationship between depth and the identified assemblages is demonstrated in Figure 4.29, where the weighted depths and standard deviations for each species are outlined and each species is assigned to its respective cluster. The box and whisker plot in Figure 4.28c shows the data on which a one-way ANOVA was performed to test for statistical differences in the mean depths of the species comprising the assemblages. The ANOVA showed that the mean depths at which the assemblages occurred were significantly different (F = 26.398, df = 2, P < 0.05). The Tukey multiple comparisons test showed that the difference lay between assemblages A and C. Assemblages A and B were not significantly different (Figure 4.28d). The species separated broadly into north and south divisions in relation to geographic location (Figure 4.26).

Average linkage gave two significant clusters when applied to data aggregated by statistical subrectangles. One of the assemblages was similar to assemblage C* defined above, with a probability of 0.96. The rest of the species grouped together with a probability of 0.99, with two outliers (Figure 4.23a). Complete linkage gave two distinct groups, with a probability of > 0.80, according to the north and south divisions, except that species such as whiting, witch and fourbearded rockling grouped with the cod cluster instead (Figure 4.23b).

Figure 4.26: Geographical distribution of the 40 species analysed for this study, labelled accordingly. The bubbles show the mean abundance of species by statistical subrectangles averaged across years. The [...] of circles are proportional to the square [...] of the mean abundance.

Figure 4.27: Weighted average depths and standard deviations for the 40 species analysed. A-D refers to the identified fish assemblages from Ward's hierarchical clustering based on correlation distance. Horizontal axis: Depth (m). Species labels with assigned assemblage, in plotted order: dogfish A; plaice A; lemon sole A; halibut A; whiting A; haddock A; atlantic wolffish C; lumpfish C; monkfish A; snake blenny C; witch A; saithe B; long rough dab C; fourbearded rockling A; skate A; moustache sculpin C; norway pout B; thorny skate C; norway haddock B; redfish B; tusk B; ling B; vahl's eelpout C; megrim B; spotted wolffish C; polar cod D; polar sculpin D; blue whiting B; blueling B; atlantic poacher D; arctic rockling D; greater argentine D; longfin snailfish D; atlantic sculpin D; deepwater redfish D; greenland halibut D; esmark's eelpout D; lycodes sp. D.

Figure 4.28: (a) Box and whisker plot for the mean depths of species in the identified fish assemblages from Ward's hierarchical clustering based on correlation distance. (b) Tukey test results showing the significant differences between the identified fish assemblages. (c) Box and whisker plot for the mean depths of species in the identified fish assemblages from Ward's hierarchical clustering based on Bray-Curtis distance. (d) Tukey test results showing the significant differences between the identified fish assemblages from (c). Panels (b) and (d) plot the differences in mean levels of assemblage at the 95% family-wise confidence level.

Figure 4.29: Weighted average depths and standard deviations for the 40 species analysed. A*-C* refers to the identified fish assemblages from Ward's hierarchical clustering based on Bray-Curtis distance. Horizontal axis: Depth (m). Species labels with assigned assemblage, in plotted order: dab A*; dogfish A*; plaice A*; lemon sole A*; halibut A*; whiting A*; haddock B*; atlantic wolffish B*; lumpfish B*; monkfish A*; snake blenny A*; witch A*; saithe A*; long rough dab B*; fourbearded rockling A*; skate A*; moustache sculpin B*; norway pout A*; thorny skate B*; cod B*; norway haddock A*; redfish A*; tusk A*; ling A*; vahl's eelpout B*; megrim A*; spotted wolffish B*; polar cod C*; polar sculpin C*; blue whiting A*; blueling A*; atlantic poacher C*; arctic rockling C*; greater argentine A*; longfin snailfish C*; atlantic sculpin C*; deepwater redfish C*; greenland halibut C*; esmark's eelpout C*; lycodes sp. C*.


4.6 Habitat Classification

The Average and Ward's linkage yielded similar results. When the dendrogram of subrectangles was split into 5 clusters, a separation along the north-west and south-east gradient was obtained, with clusters 1, 4 and 5 in the north and clusters 2 and 3 in the south. The north and south areas further separated along the depth gradient (Figure 4.30a). The output from Ward's linkage is presented here; the output from Average linkage is given in the Appendix (Figure A.1a). The outcome from Complete linkage was different and is outlined in the Appendix (Figure A.1b).

The species composition of the five clusters is delineated in Figure 4.31. Cluster 1 mainly comprised greenland halibut, blue whiting, atlantic poacher, deepwater redfish, esmark's eelpout, longfin snailfish, polar cod, atlantic sculpin, vahl's eelpout, polar sculpin, arctic rockling, lycodes sp. and some atlantic cod, thorny skate and spotted wolffish. Cluster 5 mainly comprised atlantic wolffish, lumpfish, moustache sculpin, vahl's eelpout, cod, haddock, spotted wolffish, tusk, long rough dab, snake blenny and some of the species in cluster 1. Cluster 4 consisted of haddock, whiting, thorny skate, plaice, witch, long rough dab, lumpfish, fourbearded rockling, vahl's eelpout, snake blenny and some cod. Cluster 3 consisted of haddock, atlantic wolffish, monkfish, dogfish, halibut, plaice, lemon sole, dab, lumpfish and moustache sculpin. Cluster 2 contained haddock, saithe, whiting, redfish, ling, blueling, tusk, monkfish, skate, dogfish, greater argentine, halibut, lemon sole, witch, megrim, norway pout, blue whiting, fourbearded rockling and norway haddock. The species codes shown in Figure 4.31 are outlined in Table A.1 in the Appendix.

The Bray-Curtis distance with Ward's linkage showed a definition along the north and south areas, with some separation along the depth gradient within these areas (Figure 4.30b). Complete linkage gave similar patterns (Appendix, Figure A.2b). Average linkage, however, gave a definition along the north and south areas only; definition according to depth was not apparent (Appendix, Figure A.2a). The species compositions for the different clusters are delineated in Figure 4.32.

4.7 Heatmap

A heatmap of the association between species and areas is shown in Figure 4.33.

The map shows a pair-wise display of two dendrograms which were generated us-

ing the Average linkage technique based on Analysis I: Correlation distance. The

species assemblage dendrogram is on the y-axis and the assemblage of areas den-

drogram is on the x-axis. The spectrum of colours ranging from blue (low ratios)

to red (high ratios) gave three main patches of high ratio colours indicating the

species-environment patterns. Thus it can be seen that the species relationships

were reflected by the spatial relationships.


Figure 4.30: Definition of areas in Icelandic waters using Ward's hierarchical clustering. Data consist of species abundance in numbers transformed to fourth root. Clustering was based on (a) correlation distance with data scaled to 0 mean and variance 1 (b) Bray-Curtis distance with data standardised by range.

Figure 4.31: Species composition of defined clusters from the habitat classification using the Correlation distance measure and Ward's linkage. The species codes are outlined in Table 4 in the Appendix.


Figure 4.32: Species composition of defined clusters from the habitat classification using the Bray-Curtis distance measure and Ward's linkage. The species codes are outlined in Table 4 in the Appendix.

Figure 4.33: A heatmap showing the species-area association for the Icelandic Groundfish (IGF) survey 1998-2007 using Average linkage hierarchical clustering with the correlation distance measure. The x-axis shows the dendrogram of areas (statistical rectangles) and the y-axis shows the dendrogram of species assemblages. Data consist of species abundance in numbers, fourth-root transformed and scaled to 0 mean and variance 1. The colours range from blue (low ratios) to red (high ratios), indicating the strength of associations.

5 Discussion

A clustering algorithm will always generate a clustering structure, even if no real structure is intrinsic to the data (Loganantharaj et al., 2006), and different clustering algorithms are likely to generate different results from the same data set. The problem becomes more complex when the choice of dissimilarity measure and the properties of the data themselves are taken into consideration, since these in turn influence the effectiveness of the algorithm (Loganantharaj et al., 2006). The issue becomes more difficult as the number of variables increases. Cluster validity has therefore been a subject of interest and importance in the field of molecular genetics for some decades. However, substantive guidelines are not available on the choice of appropriate algorithms and distance metrics for ecological data. In the field of ecology, the Average linkage technique is generally recommended in conjunction with the Bray-Curtis distance measure (Clarke and Warwick, 2001; Quinn and Keough, 2002).

A number of assessment criteria were used in this study to test the robustness of the three hierarchical agglomerative clustering techniques that are commonly applied in ecological studies: Average, Complete and Ward's linkage. According to the internal criterion of cluster validity and efficiency, the CPCC, Average linkage performed most efficiently for both modes of data analysis (Correlation and Bray-Curtis distance) and yielded the highest values for the coefficient. In theory, this would indicate that this linkage generated a classification which was most similar to the original dissimilarity patterns in the species and site matrix, since the CPCC is a basic correlation between the two matrices of dissimilarities, that is, prior to and subsequent to the clustering. Thus, by the definition of the CPCC criterion, the overall performance ranking of the clustering methods was Average followed by Complete then Ward's, although the CPCC values for Complete and Ward's linkage were not considerably lower. However, Complete linkage did not perform efficiently when applied to the full set of data.
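The CPCC described above can be sketched as a Pearson correlation between the original dissimilarities and the cophenetic dissimilarities read off a dendrogram. The four-object dissimilarities and the merge list below are hypothetical (the merge heights are those that average linkage would give for this toy input), and the helper names `cophenetic`, `pearson` and `cpcc` are illustrative assumptions, not the routines used in the study.

```python
from math import sqrt


def cophenetic(merges):
    """Map each pair (i, j), i < j, to the height at which i and j first join.

    `merges` lists (cluster_a, cluster_b, height) in the order of merging.
    """
    coph = {}
    for cluster_a, cluster_b, height in merges:
        for i in cluster_a:
            for j in cluster_b:
                coph[(min(i, j), max(i, j))] = height
    return coph


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cpcc(dissimilarities, merges):
    """Correlate original and cophenetic dissimilarities pair by pair."""
    coph = cophenetic(merges)
    pairs = sorted(dissimilarities)
    return pearson([dissimilarities[p] for p in pairs], [coph[p] for p in pairs])


# Hypothetical dissimilarities between four objects (0..3).
d = {(0, 1): 1.0, (0, 2): 5.0, (0, 3): 6.0, (1, 2): 7.0, (1, 3): 6.0, (2, 3): 2.0}
# Toy dendrogram: {0},{1} join at 1; {2},{3} at 2; the two pairs at the mean of 5, 6, 7, 6.
merges = [({0}, {1}, 1.0), ({2}, {3}, 2.0), ({0, 1}, {2, 3}, 6.0)]
print(round(cpcc(d, merges), 3))
```

A value close to 1 means the dendrogram heights preserve the ordering and spacing of the original dissimilarities; it says nothing by itself about whether the resulting groups are ecologically meaningful.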

On the other hand, based on the AC criterion, which measures the quality of the clustering, Ward's linkage outperformed Average and Complete linkage for both modes of data analysis. The AC values for this method were always higher than those for the other two techniques. Thus this clustering technique gave the highest-quality clustering of the data set. This can be seen from the dendrograms, which had well-defined clusters forming at lower dissimilarity levels. Also, no outliers were produced by this technique. Ward's linkage is designed to give compact clusters that minimise the loss of information based on the sum of squares criterion (Ward, 1963). Thus, from a different perspective, this technique could impose clusters or patterns on a data set which are not truly there (Gauch Jr and Whittaker, 1981). Average and Complete linkage, on the other hand, gave clusters at high dissimilarity levels, and some species were always defined as outliers.
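The sum of squares criterion behind Ward's linkage can be illustrated with a short sketch: the cost of merging two clusters is the increase in the within-cluster sum of squared distances to the centroid. The example uses hypothetical points in Euclidean space, as in Ward's original formulation, rather than the Correlation or Bray-Curtis dissimilarities used in this study, and the function names are illustrative assumptions.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c))) for p in points)


def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    return within_ss(cluster_a + cluster_b) - within_ss(cluster_a) - within_ss(cluster_b)


# Hypothetical clusters in two dimensions.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 0.0)]
cost = ward_cost(a, b)

# Equivalent closed form: n_a * n_b / (n_a + n_b) * squared centroid distance.
ca, cb = centroid(a), centroid(b)
closed_form = len(a) * len(b) / (len(a) + len(b)) * sum((x - y) ** 2 for x, y in zip(ca, cb))
print(cost, closed_form)
```

At each step Ward's method merges the pair of clusters with the smallest such cost, which is why it tends to produce compact clusters of similar size.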

An assessment of the uncertainty of the clusters through bootstrapping showed that Ward's linkage performed the best. When applied to the full data set for the Correlation distance, it gave two distinct clusters with high probabilities. Thus some confidence could be placed in the clusters that were obtained. Average and Complete linkage, on the other hand, gave clusters with lower probabilities, resulting in many small significant clusters. Thus the species were not grouping together with a high likelihood. For Ward's linkage, similar patterns were observed with aggregated data, except that the likelihoods (p-values) of the clusters were lower. This technique was also robust across the decreasing sample sizes and performed well down to a subsample of 10%, with some anomalies. The clusters obtained were similar although their probabilities were much lower. For Average linkage, aggregating the data formed clusters at lower dissimilarity levels. However, this reduced the probability of the clusters. The linkage method worked well down to a sample size of 25%, giving similar species clusters.

The Complete linkage was observed to be the most unstable method. Irrespective of the data analysis method used, it was sensitive to the different levels of data aggregation (smoothing) and the extent of the data used for clustering (sample size). When the full set of data was used, this technique did not perform well and gave an unclear definition of clusters. With aggregated data the classification was more defined, with a high probability of clustering. Similarly, as the samples were reduced the patterns observed from the clustering were not coherent. This algorithm only allows an object to merge with a cluster if it is similar to all objects already present in the cluster (Legendre, 1998). Thus, as a cluster is formed, it recedes in space from other clusters as its dissimilarity with the other groups increases (Cao et al., 1997a). Hence a large amount of information in the data set could potentially affect the algorithm in defining close groups, which leads to an unstable outcome.
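The receding behaviour described above follows from the complete linkage rule, under which the dissimilarity between two clusters is that of their most dissimilar pair of members. A minimal sketch, assuming hypothetical one-dimensional positions and an absolute-difference dissimilarity (the names `complete_linkage` and `gap` are illustrative):

```python
def complete_linkage(cluster_a, cluster_b, dissimilarity):
    """Dissimilarity of the most dissimilar pair across the two clusters."""
    return max(dissimilarity(x, y) for x in cluster_a for y in cluster_b)


def gap(x, y):
    """Hypothetical dissimilarity: absolute difference of two positions."""
    return abs(x - y)


candidate = [3.0]              # object waiting to be merged
small = [0.0, 1.0]             # cluster before it grows
grown = [0.0, 1.0, -2.0]       # same cluster after absorbing one more object

before = complete_linkage(small, candidate, gap)
after = complete_linkage(grown, candidate, gap)
print(before, after)
```

Adding a member can never reduce the maximum, so a growing cluster can only stay put or move away from the remaining objects, which is one way to see why dense data sets may delay or scatter merges under this rule.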

Generally, it seemed that a reduction in sample size reduced the information

about the species-site similarities in the data which resulted in lower bootstrap

probability values for the clusters. Even though the assemblages obtained were

similar for Ward's and Average linkage, for Correlation distance, the accuracy and

reliability of the clusters decreased with fewer samples.

Ward's linkage yielded similar species assemblages even with the Bray-Curtis distance measure. However, this was when highly smoothed data (aggregated by subrectangles) were used. There, distinct clusters with similar species composition were obtained with high probabilities. Complete linkage also gave comparable results, with some exceptions, when the outcomes for data aggregated by stations for Correlation distance and data aggregated by statistical subrectangles for Bray-Curtis distance were compared. Average linkage, on the other hand, appeared sensitive to the type of data standardisation and the distance measure that were used. This method resulted in considerably different species assemblages for the two modes of data analysis. The Average clustering algorithm takes an average dissimilarity between two groups. All agglomerative methods possess a monotonicity property, that is, the dissimilarity between the merged clusters increases monotonically with the level of the merger. The Average technique appears sensitive to the numerical scale on which the dissimilarities are measured, since applying a monotonic transformation to the dissimilarities before averaging can change the outcome (Hastie et al., 2001). Average linkage in combination with data standardised by range and Bray-Curtis distance did not perform well in identifying species assemblages for this data set. Clarke and Warwick (2001) recommend a row standardisation on untransformed data for Average linkage with the Bray-Curtis distance measure.
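The scale sensitivity of Average linkage noted above can be shown with a small hypothetical case: an object chooses between two clusters, first on the raw dissimilarities and then after a monotonic (square-root) transformation. All values and names below are illustrative assumptions.

```python
from math import sqrt


def average_link(values):
    """Average linkage: mean dissimilarity to the members of a cluster."""
    return sum(values) / len(values)


def complete_link(values):
    """Complete linkage: largest dissimilarity to the members of a cluster."""
    return max(values)


def nearest(link, dissimilarities, transform=lambda v: v):
    """Name of the cluster with the smallest linkage value after `transform`."""
    return min(
        dissimilarities,
        key=lambda name: link([transform(v) for v in dissimilarities[name]]),
    )


# Hypothetical dissimilarities from one object to the members of clusters B and C.
to_clusters = {"B": [1.0, 9.0], "C": [4.5, 4.5]}

print(nearest(average_link, to_clusters), nearest(average_link, to_clusters, sqrt))
print(nearest(complete_link, to_clusters), nearest(complete_link, to_clusters, sqrt))
```

On the raw scale the averages are 5 for B and 4.5 for C, so the object joins C; after the square root they are 2 and about 2.12, so it joins B. The maximum is unaffected by any increasing transformation, so the Complete linkage choice stays the same.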

The general observation was that Ward's linkage, when applied with Correlation distance, performed better with the full data set. This was assessed in terms of the accuracy and reliability of the clusters. In contrast, with Bray-Curtis distance, the method performed better with highly smoothed data (aggregated by statistical subrectangles). This could be related to the properties of the Bray-Curtis distance measure which compares two species according to their minimum abundance at each

Stability in cluster analysis is to a great extent dependent on the data set itself. Essentially, if strong patterns are not present in the data then the clustering algorithm might not give clear definitions, and different methods may give considerable deviations in the patterns obtained (Hennig, 2007). The NMDS ordination technique, which is considered more reliable in finding groups, was used as an independent technique to verify results from the hierarchical cluster analyses. The NMDS ordination showed roughly three groupings, which were similar to the clusters obtained from Ward's linkage with Bray-Curtis distance, and resulted in stress values of approximately 0.075 in three dimensions for data aggregated by statistical subrectangles. Normally, stress values of < 0.1 indicate a good ordination with no real likelihood of misleading interpretation (Clarke and Ainsworth, 1993).
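For orientation, the stress value quoted above can be sketched with Kruskal's stress formula 1, which compares the inter-point distances in the ordination with the disparities fitted by monotone regression on the original dissimilarities. The sketch assumes both sets of values are already available, uses hypothetical numbers, and the function name `stress1` is illustrative; software packages differ in normalisation, and some report the value as a percentage.

```python
from math import sqrt


def stress1(config_distances, disparities):
    """Kruskal's stress-1 for matched lists of distances and disparities."""
    residual = sum((d - dh) ** 2 for d, dh in zip(config_distances, disparities))
    scale = sum(d ** 2 for d in config_distances)
    return sqrt(residual / scale)


# Hypothetical ordination distances and their fitted disparities.
distances = [3.0, 4.0]
fitted = [3.0, 4.5]
print(round(stress1(distances, fitted), 3))
```

A stress of zero means the ordination distances reproduce the rank order of the dissimilarities exactly; larger values indicate increasing distortion.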

The Average technique has some desirable properties, such as the maximisation of the cophenetic correlation, which makes it highly preferable in ecological studies (Gauch Jr and Whittaker, 1981). As Cao et al. (1997a) point out, it has seldom been assessed whether the classification acquired from Average linkage is ecologically meaningful, even though the technique is highly recommended. Cao et al. (1997a) based their study on river samples with some predeterminations on site separation from cluster analysis. They found that Ward's and Complete linkage were better at separating the sites than Average linkage, with Ward's linkage performing better. Similar observations are made in this study. Gauch Jr and Whittaker (1981) also showed that Average and Complete linkage were not as adequate in recognising pre-determined plant communities as Ward's linkage and other non-hierarchical clustering methods.

Since its formulation by Sokal and Rohlf (1962), the CPCC criterion of cluster validity has been widely applied (Farris, 1969). However, this criterion has been questioned, and studies such as Farris (1969), Rohlf and Fisher (1968) and Phipps (1971) have deemed it inadequate. In this study the CPCC criterion was not adequate in identifying the optimal clustering method either. As described earlier, it is a correlation between the initial dissimilarities and the final cophenetic dissimilarities obtained by the clustering algorithm. The cophenetic dissimilarity is a restrictive measure since it contains tied values, i.e. out of the N(N − 1)/2 pairs of dissimilarities only N − 1 values can be distinct (Hastie et al., 2001). Additionally, hierarchical classifications of objects obey the ultrametric inequality for the distance h_ij (from the classification): "every triple of objects (i, j, k) possesses the property that the two largest values in the set h_ij, h_ik, h_jk are equal" (Gordon, 1999). Comparing dissimilarities and ultrametric distances by measuring the strength of their linear relationship seems ambiguous, even more so when the ultrametric distances contain many tied values (Gordon, 1999). Besides, the significance of the CPCC cannot be tested since the cophenetic matrix is dependent on the original dissimilarity matrix (Legendre, 1998). Thus the nature of this cluster validity index is limiting.
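The two restrictions on cophenetic dissimilarities mentioned above, the ultrametric property and the limited number of distinct values, can be checked directly. The sketch below uses hypothetical dissimilarities for four objects and the cophenetic values of a toy dendrogram; the function name `is_ultrametric` is an illustrative assumption.

```python
from itertools import combinations


def is_ultrametric(h, n, tol=1e-12):
    """True if, for every triple, the two largest of h_ij, h_ik, h_jk are equal.

    `h` maps each pair (i, j) with i < j to a dissimilarity; objects are 0..n-1.
    """
    for i, j, k in combinations(range(n), 3):
        values = sorted([h[(i, j)], h[(i, k)], h[(j, k)]])
        if abs(values[2] - values[1]) > tol:
            return False
    return True


n_objects = 4
# Hypothetical original dissimilarities.
original = {(0, 1): 1.0, (0, 2): 5.0, (0, 3): 6.0, (1, 2): 7.0, (1, 3): 6.0, (2, 3): 2.0}
# Cophenetic values from a toy dendrogram: 0-1 join at 1, 2-3 at 2, the rest at 6.
cophenetic_h = {(0, 1): 1.0, (0, 2): 6.0, (0, 3): 6.0, (1, 2): 6.0, (1, 3): 6.0, (2, 3): 2.0}

print(is_ultrametric(original, n_objects), is_ultrametric(cophenetic_h, n_objects))
print(len(set(cophenetic_h.values())), "distinct values among", len(cophenetic_h), "pairs")
```

Here the six cophenetic values take only three distinct levels (N − 1 for N = 4), so a linear correlation with the six original dissimilarities is computed against a heavily tied variable.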

Ward's linkage was identified as the most robust method after assessing it against the above criteria. This linkage method has been shown to be a robust method in some non-ecological studies (Scheibler and Schneider, 1985; Milligan and Cooper, 1987). Its use in ecology has been restricted since it is normally used in conjunction with Euclidean distance, which appears unsuitable for species abundance data, as noted earlier. This study showed that this linkage method performed well with Correlation and Bray-Curtis distance metrics.

It was considered important that the validity and efficiency of the linkage techniques were not based entirely on numeric indices. Clusters can be stable and yet give meaningless results; therefore it is important to complement the results with some visual inspection and subject-based validation (Hennig, 2007). In the present study, no pre-determinations could be made about the fish assemblages. However, as community structures may change along environmental gradients, it was inferred that the assemblages should be distributed according to some key environmental variables; in this case geographic distribution and depth were considered influential parameters. Thus the obtained assemblages were related to these variables to observe any meaningful ecological patterns.

5.1 Fish Assemblages and species-environment relationships

The boreal fisheries are dominated by a few key species which strongly interact. Generally, the highly dynamic environment, attributed to the oceanographic conditions, influences the fish stocks (Livingston and Tjelmeland, 2000). The assemblage patterns of the forty most abundant species in the Icelandic groundfish survey area were studied here. The focus was more on demersal species; therefore pelagic species such as capelin and herring were not considered in the study.

Bathymetric studies show that Iceland is situated on two ridges: the mid-Atlantic ridge, running from the Reykjanes ridge in the south-west to the Jan Mayen ridge in the north-east, and the Faroes-Greenland ridge, going from south-east to north-west (Stefánsson and Pálsson, 1997). The bathymetry largely influences the hydrography. Several water masses are present on the Icelandic shelf. The Irminger current, which is part of the North Atlantic current, brings warm saline Atlantic water to the south coast of Iceland. To the north, the East Icelandic current is cold and fresh as it carries Arctic water, sea ice and icebergs from the East Greenland Current. These largely affect the atmosphere and oceanography around Iceland, with warm conditions in the south and the west, cold conditions in the east, and variable conditions in the north (Valdimarsson and Malmberg, 1999). Different water masses have distinct thermal properties and oxygen concentrations, and temperature and salinity are highly variable as a result. This leads to a natural separation in the habitat preferences of fish species. Thus it was inferred that the species occurring in the north and south areas should cluster separately.

The species assemblage obtained by Ward's linkage on Correlation distance gave a separation along the geographic location (north and south) and the depth gradient within each region, as per the inference. Some confidence could be placed in the clusters obtained, as the bootstrap generated high probability values for these clusters and indicated that the assemblages were not entirely a result of random effects. High probability values essentially indicate the accuracy of a cluster, where "accuracy means the certainty of the existence of a cluster" (Suzuki and Shimodaira, 2004).

Essentially, four "species assemblage areas" (Jaureguizar et al., 2006) were defined on the basis of the geographic distribution of the species. Species found in the north clustered together (assemblages C and D). These formed two constituent groups, one containing the deepwater species such as Greenland halibut, which prefer colder environmental conditions, and one containing species which are more dispersed within the area, such as cod, and the shallow range species such as lumpfish. Species found in the south, which prefer warmer conditions, clustered into two assemblages, A and B. Assemblage A contained the shallow water species and assemblage B was the group of species which were present in the intermediate to deep region. Similar observations have been made by studies on demersal fish assemblages in the region (Bergstad et al., 1999; Colvocoresses and Musick, 1984; Fariña et al., 1997; Gabriel, 1992; Rätz, 1999), where depth and geographic distribution were significant variables in explaining the fish assemblages. A similar observation was made for the analysis based on Bray-Curtis distance: the assemblages from Ward's linkage could be related to the environmental gradients. Bottom temperature and salinity are two other potentially important variables that could explain the variability in the fish assemblages. However, this has not been addressed in the present study since its primary focus was on the methodological aspects of identifying fish assemblages.

Species assemblages are groups of species that tend to co-occur in space and time because they have similar habitat preferences or because they interact biologically. Nonetheless, association of species or co-occurrence does not necessarily imply that the species are interacting (Legendre, 1998). This study showed assemblage patterns in the data, and it was seen that the environmental gradients, depth and geographic properties, played a role in the structuring of the fish assemblages. Thus the fish assemblages reflected the habitat heterogeneity.

The deeper water species such as Greenland halibut, Atlantic poacher, longfin snailfish and others that form part of this species group (assemblage D) have distinct geographical locations, and it was observed that this cluster of species was always obtained irrespective of the data analysis and clustering methods used. In contrast, most of the other species occur over a wide area, and this could have confused the multivariate patterns, leading to discrepancies in the classifications acquired with the different approaches used for analysis.

The definition of areas around Iceland (habitat classification) also led to a

separation along the north-south gradient, which further showed some differentiation

along depth. The definitions obtained were comparable to the previous study on

the definition of oceanic areas around Iceland in Stefánsson and Pálsson (1997),

which was in relation to identifying appropriate areas for Bormicon, a Boreal migration

and consumption model for multispecies modeling. Similar observations were

made in this study, where the areas were approximately split according to the

Bormicon area definitions (Stefánsson and Pálsson, 1997). The previous study, which was based

on hierarchical cluster analysis of some key species including cod, haddock, saithe,

redfish, catfish, Greenland halibut, plaice, herring, capelin and shrimp, showed some

consistency between the clusters of areas and the Bormicon strata. It was seen that this

independent study, which took many species and different hierarchical

clustering methods into consideration, complemented the definitions of the Bormicon strata.

This study experimented with the use of heatmaps in the field of ecology for pattern

recognition. The visual display showed some patterns in community structure.

Essentially three species-environment associations could be observed through the high

ratio (red) patches. These identified the species characteristic of the northern area

and their corresponding habitats (statistical squares). The species in the southern

areas are divided into two groups according to depth. This gives a visual

representation showing that specific species groups characterise specific geographical locations.

It should be noted that the heatmap here was generated using the default settings,

namely Average linkage hierarchical clustering with a correlation distance measure.

However, the heatmap routine in R can be used to define specific clustering

techniques and distance measures for calculating the dendrograms.
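
As a rough illustration of how the ordering of heatmap rows follows from the chosen distance measure and linkage, the Python sketch below orders four species profiles using correlation distance and Average linkage. It is not the R routine used in this study: the function names and the abundance values are hypothetical, and the sketch assumes that no profile is constant (a constant profile has no defined correlation).

```python
from math import sqrt


def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two abundance profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)


def average_linkage_order(rows):
    """Return row indices in dendrogram (leaf) order using Average linkage."""
    n = len(rows)
    d = [[correlation_distance(rows[i], rows[j]) for j in range(n)]
         for i in range(n)]
    clusters = [[i] for i in range(n)]  # each cluster keeps its leaves in order

    def between(a, b):
        # Average linkage: mean of all pairwise distances between two clusters.
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        p, q = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: between(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical abundances of four species in four statistical squares
# (two northern squares followed by two southern squares).
rows = [
    [9, 8, 1, 0],  # species 0: mostly northern
    [0, 1, 7, 9],  # species 1: mostly southern
    [8, 9, 0, 1],  # species 2: mostly northern
    [1, 0, 9, 8],  # species 3: mostly southern
]
order = average_linkage_order(rows)
print(order)
```

In this toy example the two northern species end up adjacent, as do the two southern species, which is what produces the contiguous patches seen in a clustered heatmap. Replacing `between` with a different linkage rule, or `correlation_distance` with another measure, would change the ordering in the same way that changing the heatmap settings would.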

6 Main considerations and recommendations

Ward's linkage was the most robust hierarchical clustering method according to

this study and is recommended for any further studies based on the Icelandic

groundfish survey data. It generated consistent, well-defined clusters with high probabilities

and gave high values of CPCC and AC. The assemblages were also ecologically

meaningful when related to two environmental parameters, depth and geographical

distribution. It also performed well for the classification of habitats, giving a

definition in line with the inference based on the bathymetric and hydrographic conditions

of the Icelandic continental shelf. Complete linkage worked well with aggregated

data, but was generally an unstable method. The Average technique appeared to be

sensitive to the type of data standardisation and distance measure used. The

Bray-Curtis distance metric in conjunction with Average linkage on data standardised by

range was not a suitable method of analysis for this data set. The fishing areas were

also not well-defined by this mode of data analysis.
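
For readers unfamiliar with the combination referred to above, the following Python sketch shows one common way of computing a Bray-Curtis dissimilarity between two species after a fourth-root transformation and standardisation by range. The function names and abundance values are hypothetical, and the exact preprocessing order used in this study may differ.

```python
def fourth_root(values):
    """Fourth-root transform, damping the influence of very abundant species."""
    return [v ** 0.25 for v in values]


def standardise_by_range(values):
    """Rescale one species' values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant profile: there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0


# Hypothetical abundances (numbers) of two species in three squares.
species_a = standardise_by_range(fourth_root([16.0, 1.0, 0.0]))
species_b = standardise_by_range(fourth_root([0.0, 1.0, 16.0]))
print(bray_curtis(species_a, species_b))
```

Because standardisation by range forces every species onto the same 0 to 1 scale, rare and abundant species contribute equally to the dissimilarity, which is one reason the result can be sensitive to the linkage method applied afterwards.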

The choice of the distance measure, data standardisation and clustering

algorithm is important and should be given more attention. As has been noted in prior

studies, the internal criterion for cluster validity, CPCC, was not adequate for this

study either.
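
The CPCC is the Pearson correlation between the original pairwise dissimilarities and the cophenetic distances read from the dendrogram (the height at which two objects are first joined). A minimal Python sketch of that calculation is given below. The four-object dissimilarity matrix and the merge list are hypothetical, with the merge heights worked out by hand for Average linkage, and the function names are illustrative only.

```python
from itertools import combinations
from math import sqrt


def cophenetic_distances(n, merges):
    """Cophenetic distance of a pair = height at which the pair is first joined.

    merges is a list of (members_a, members_b, height) in merge order.
    """
    coph = {}
    for a, b, height in merges:
        for i in a:
            for j in b:
                coph[(min(i, j), max(i, j))] = height
    return [coph[pair] for pair in combinations(range(n), 2)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cpcc(dist, merges):
    """Correlate original dissimilarities with cophenetic distances."""
    n = len(dist)
    original = [dist[i][j] for i, j in combinations(range(n), 2)]
    return pearson(original, cophenetic_distances(n, merges))


# Hypothetical dissimilarities between four objects.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 2.0],
    [5.0, 5.0, 2.0, 0.0],
]
# Average linkage by hand: {0,1} at 1.0, {2,3} at 2.0,
# then the two pairs at mean(4, 5, 4, 5) = 4.5.
merges = [([0], [1], 1.0), ([2], [3], 2.0), ([0, 1], [2, 3], 4.5)]
print(round(cpcc(dist, merges), 3))
```

A value close to 1 only indicates that the dendrogram distorts the input dissimilarities very little. It says nothing about whether the clusters are stable or ecologically meaningful, which is consistent with the observation that the CPCC alone was not an adequate criterion.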

Biological interpretations of fish assemblages showed that the spatial structure

of the environmental gradients around Iceland played a role in characterising the

fish assemblages. Further studies of this nature could relate the fish assemblages

with other environmental variables such as temperature and salinity which could be

significant parameters in explaining the variation in fish assemblages. Examining

some spatial and temporal patterns in species assemblages could also be of interest.

The use of visualisation techniques such as heatmaps is recommended in the field

of ecology for displaying community patterns (species-habitat associations).

Generating a heatmap based on Ward's linkage would be recommended for any further

studies of this nature.

Some limitations of the study need to be taken into consideration and some

appropriate recommendations are provided. More attention needs to be paid to

the initial sample selection criterion for analysis. Some pelagic and semi-pelagic

species such as blue ling and greater argentine were not excluded from the data

before analysis. This needs to be taken into consideration for any further studies

of this nature, if the emphasis needs to be on demersal species. In future this type

of analysis could also incorporate some details on the structural composition of the

major species by splitting the abundance values into juvenile (immature) and adult

(mature) prior to analysis.

The Icelandic groundfish survey covers the fishing grounds down to 500 m depth

as it was primarily designed for cod. As such, the variability of deep water species

such as Greenland halibut is relatively high in the survey. The autumn survey, on

the other hand, covers stations in deeper waters even though it has fewer stations.

However, this study indicated that a reduction in the sample size did not lead to any

major changes in the species assemblage patterns. Whether the high variability of

some deep water species in the spring survey, which are included in this assemblage

study, has an effect on the species associations could be examined by using the

data from the autumn survey.

Fisheries management is largely moving toward community analysis and

identifying potential management strategies to target fish assemblages rather than single

species. These findings on species assemblages in relation to the particular

environmental conditions and the habitat definitions could be used for multi-species or

ecosystem based management purposes. Further research on temporal and spatial

variability and persistence of these assemblages would be recommended. Whether

these assemblages have functional relationships cannot be determined from this

analysis. Some trophic studies in relation to habitat association within the defined

assemblages could be used to determine some functional associations between the

species. The definition of the specific geographical units having distinct species

assemblages relating to the bathymetry and hydrographic conditions, such as shown

here, could also be utilised for conservation purposes. For example, if there were

intentions of setting up marine protected areas, these species-environment

relationships could be useful.

A Appendix

Common Name                       Latin Name                                Code

Cod                               Gadus morhua                              cod
Haddock                           Melanogrammus aeglefinus                  had
Saithe                            Pollachius virens                         sai
Whiting                           Merlangius merlangus                      whi
Redfish                           Sebastes marinus                          red
Ling                              Molva molva                               lin
Blueling (European ling)          Molva dipterygia                          blu
Tusk                              Brosme brosme                             tus
Atlantic wolffish                 Anarhichas lupus                          atw
Thorny skate (starry ray)         Raja (Amblyraja) radiata                  tho
Spotted wolffish (leopardfish)    Anarhichas minor                          spo
Monkfish                          Lophius piscatorius                       mon
Skate                             Raja (Dipturus) batis                     ska
Dogfish                           Squalus acanthias                         dog
Greater argentine                 Argentina silus                           gra
Halibut                           Hippoglossus hippoglossus                 hal
Greenland halibut                 Reinhardtius hippoglossoides              gre
Plaice                            Pleuronectes platessa                     pla
Lemon sole                        Microstomus kitt                          lem
Witch                             Glyptocephalus cynoglossus                wit
Megrim                            Lepidorhombus whiffiagonis                meg
Dab                               Limanda limanda                           dab
Long rough dab                    Hippoglossoides platessoides limandoides  lrd
Norway pout                       Trisopterus esmarki                       nor
Blue whiting                      Micromesistius poutassou                  blw
Lumpfish (lumpsucker)             Cyclopterus lumpus                        lum
Moustache sculpin                 Triglops murrayi                          mou
Atlantic poacher                  Leptagonus decagonus                      atp
Fourbearded rockling              Rhinonemus cimbrius                       fou
Norway haddock                    Sebastes viviparus                        noh
Deepwater redfish                 Sebastes mentella                         der
Esmark's eelpout                  Lycodes esmarki                           esm
Longfin snailfish (sea tadpole)   Careproctus reinhardti                    los
Polar cod                         Boreogadus saida                          pol
Atlantic hookear sculpin          Artediellus atlanticus                    ats
Vahl's eelpout (checker eelpout)  Lycodes vahli                             vah
Polar sculpin                     Cottunculus microps                       pos
Arctic rockling                   Onogadus argentatus                       art
Snake blenny                      Lumpenus lampretaeformis                  sna
Lycodes sp.                       Lycodes eudipleurostictus                 lyc

Table A.1: The common and Latin names of the forty most common species analysed for this study with the codes used for analysis.

Figure A.1: Definition of areas in Icelandic waters using (a) Average (b) Complete hierarchical clustering with correlation distance. Data consists of species abundance in numbers, transformed to fourth and scaled to 0 mean and variance

Figure A.2: Definition of areas in Icelandic waters using (a) Average (b) Complete hierarchical clustering with Bray-Curtis distance. Data consists numbers, transformed to fourth and standardised by range.

Bibliography

L. Belbin and C. McDonald. Comparing Three Classification Strategies for Use in Ecology. Journal of Vegetation Science, 4(3):341–348, 1993.

OA Bergstad, O. Bjelland, and JDM Gordon. Fish communities on the slope of the eastern Norwegian Sea. Sarsia, 84:67–78, 1999.

N. Bolshakova, F. Azuaje, and P. Cunningham. An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics, 21(4):451–455,

J.C. Brazner and E.W. Beals. Patterns in fish assemblages from coastal wetland and beach habitats in Green Bay, Lake Michigan: A multivariate analysis of abiotic and biotic forcing factors. Canadian Journal of Fisheries and Aquatic Sciences, 54(8):1743–1761, 1997.

Y. Cao, A.W. Bark, and W.P. Williams. A comparison of clustering methods for river benthic community analysis. Hydrobiologia, 347(1):24–40, 1997a.

Y. Cao, W.P. Williams, and A.W. Bark. Similarity measure bias in river benthic Aufwuchs community analysis. Water Environment Research, 69(1):95–106, 1997b.

Y. Cao, DP Larsen, RM Hughes, PL Angermeier, and TM Patton. Sampling effort affects multivariate comparisons of stream assemblages. Journal of the North American Benthological Society, 21(4):701–714, 2002a.

Y. Cao, D.D. Williams, and D.P. Larsen. Comparison of Ecological Communities: The Problem of Sample Representativeness. Ecological Monographs, 72(1):41–56, 2002b.

KR Clarke and M. Ainsworth. A method of linking multivariate community structure to environmental variables. Marine Ecology Progress Series, 92(3):205–219,

KR Clarke and R.M. Warwick. Change in Marine Communities: An Approach to Statistical Analysis and Interpretation; Second Edition. PRIMER-E Ltd, 2001.

JA Colvocoresses and JA Musick. Species associations and community composition of Middle Atlantic Bight continental shelf demersal fishes. Fishery Bulletin, 82(2):295–313, 1984.

A.C. Culhane, J. Thioulouse, G. Perriere, and D.G. Higgins. MADE4: an R package for multivariate analysis of gene expression data, 2005.

S. Datta and S. Datta. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics, 19(4):459–466, 2003.

B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, 93(23):13429, 1996.

M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863, 1998.

AC Fariña, J. Freire, and E. González-Gurriarán. Demersal Fish Assemblages in the Galician Continental Shelf and Upper Slope (NW Spain): Spatial Structure and Long-term Changes. Estuarine, Coastal and Shelf Science, 44(4):435–454, 1997.

J.S. Farris. On the Cophenetic Correlation Coefficient. Systematic Zoology, 18(3):279–285, 1969.

M.P. Francis, R.J. Hurst, B.H. McArdle, N.W. Bagley, and O.F. Anderson. New Zealand Demersal Fish Assemblages. Environmental Biology of Fishes, 65(2):215–234, 2002.

W.L. Gabriel. Persistence of demersal fish assemblages between Cape Hatteras and Nova Scotia, Northwest Atlantic. Journal of Northwest Atlantic Fisheries Science, 14:29–46, 1992.

H.G. Gauch Jr and R.H. Whittaker. Hierarchical Classification of Community Data. The Journal of Ecology, 69(2):537–557, 1981.

M.C. Gomes and L. Richard. Spatial and temporal changes in the groundfish assemblages on the northeast Newfoundland/Labrador Shelf, northwest. Fisheries Oceanography, 4(2):85–101, 1995.

D. González-Troncoso, X. Paz, and X. Cardoso. Persistence and Variation in the Distribution of Bottom-trawl Fish Assemblages over the Flemish Cap. Journal of Northwest Atlantic Fisheries Science, 37:103–117, 2006.

A.D. Gordon. Classification, second edition. Chapman & Hall, 1999.

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part II. ACM SIGMOD Record, 31(3):19–27, 2002a.

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. Association for Computing Machinery Special Interest Group in Management of Data (ACM SIGMOD) Record, 31(2):40–45, 2002b.

J. Handl, J. Knowles, and D.B. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212, 2005.

M. Hasan and Y. Masumoto. Document clustering: before and after the singular value decomposition. Sapporo, Japan, Information Processing Society of Japan (IPSJ-TR: 99-NL-134), pages 47–55, 1999.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

C. Hennig. Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52(1):258–271, 2007.

C. Hennig and F. Mathematik-SPST. A Method for Visual Cluster Validation. Classification-the Ubiquitous Challenge: Proceedings of the 28th Annual Conference of the Gesellschaft Für Klassifikation EV, University of Dortmund, March 9-11, 2004, 2005.

V. Jakoniene and P. Lambrix. A Tool for Evaluating Strategies for Grouping of Biological Data. Journal of Integrative Bioinformatics, 4(3):83, 2007.

AJ Jaureguizar, R. Menni, C. Bremec, H. Mianzan, and C. Lasta. Fish assemblage and environmental patterns in the Río de la Plata estuary. Estuarine, Coastal and Shelf Science, 56(5-6):921–933, 2003.

A.J. Jaureguizar, R. Menni, C. Lasta, and R. Guerrero. Fish assemblages of the northern Argentine coastal system: spatial patterns and their temporal variations. Fisheries Oceanography, 15(4):326–344, 2006.

L. Kaufman and P.J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York: Wiley, 1990.

M.K. Kerr and G.A. Churchill. Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences, page 161273698, 2001.

F. Kovács, C. Legány, and A. Babos. Cluster Validity Measurement Techniques. Proceedings of 6th International Symposium of Hungarian Researchers on Computational Intelligence, Budapest, Hungary, 2005.

GN Lance and WT Williams. Mixed-Data Classificatory Programs I - Agglomerative Systems. Australian Computer Journal, 1(1):15–20, 1967.

Y.W. Lee and D.B. Sampson. Spatial and temporal stability of commercial groundfish assemblages off Oregon and Washington as inferred from Oregon trawl logbooks. Canadian Journal of Fisheries and Aquatic Sciences, 57(12):2443–2454,

P. Legendre. Numerical Ecology. Elsevier Science, 1998.

V. Lesage, M.O. Hammill, and K.M. Kovacs. Functional classification of harbor seal (Phoca vitulina) dives using depth profiles, swimming velocity, and an index of foraging success. Canadian Journal of Zoology, 77:74–87, 1999.

V.P. Lessig. Comparing Cluster Analyses with Cophenetic Correlation. Journal of Marketing Research, 9(1):82–84, 1972.

X. Li. Parallel algorithms for hierarchical clustering and cluster validity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(11):1088–1092,

P.A. Livingston and S. Tjelmeland. Fisheries in boreal ecosystems. ICES Journal of Marine Science, 57(3):619, 2000.

R. Loganantharaj, S. Cheepala, and J. Clifford. Metric for Measuring the Effectiveness of Clustering of DNA Microarray Expression. Bioinformatics, 7(Suppl 2),

Martin Maechler, Peter Rousseeuw, Anja Struyf, and Mia Hubert. Cluster analysis basics and extensions. Rousseeuw et al provided the S original which has been ported to R by Kurt Hornik and has since been enhanced by Martin Maechler: speed improvements, silhouette() functionality, bug fixes, etc. See the 'Changelog' file (in the package source), 2005.

E. Magnussen. Demersal fish assemblages of Faroe Bank: species composition, distribution, biomass spectrum and diversity. Marine Ecology Progress Series, 238:211–225, 2002.

R. Mahon, S.K. Brown, K.C.T. Zwanenburg, D.B. Atkinson, K.R. Buja, L. Claflin, G.D. Howell, M.E. Monaco, R.N. O'Boyle, and M. Sinclair. Assemblages and biogeography of demersal fishes of the east coast of North America. Canadian Journal of Fisheries and Aquatic Sciences, 55(7):1704–1738, 1998.

E. Massuti and J. Moranta. Demersal assemblages and depth distribution of elasmobranchs from the continental shelf and slope off the Balearic Islands (western Mediterranean). ICES Journal of Marine Science, 60(4):753, 2003.

JE McKenna. An enhanced cluster analysis program with bootstrap significance testing for ecological community analysis. Environmental Modelling and Software, 18(3):205–220, 2003.

A. Medina, J.C. Brêthes, J.M. Sévigny, and B. Zakardjian. How geographic distance and depth drive ecological variability and isolation of demersal fish communities in an archipelago system (Cape Verde, Eastern Atlantic Ocean). Marine Ecology, 28(3):404–417, 2007.

G.W. Milligan and M.C. Cooper. Methodology Review: Clustering Methods. Applied Psychological Measurement, 11(4):329, 1987.

AFL Nemec and RO Brinkhurst. Using the bootstrap to assess statistical significance in the cluster analysis of species abundance data. Canadian Journal of Fisheries and Aquatic Sciences, 45(6):965–970, 1988.

R.F. Noss. Indicators for Monitoring Biodiversity: A Hierarchical Approach. Conservation Biology, 4(4):355–364, 1990.

O.K. Pálsson, E. Jónsson, SA Schopka, G. Stefánsson, and BÆ Steinarsson. Icelandic groundfish survey data used to improve precision in stock assessments. Journal of Northwest Atlantic Fishery Science, 9:53–72, 1989.

JB Phipps. Dendrogram topology. Systematic Zoology, 20:306–308, 1971.

EK Pikitch, C. Santora, EA Babcock, A. Bakun, R. Bonfil, DO Conover, P. Dayton, P. Doukakis, D. Fluharty, B. Heneman, et al. ECOLOGY: Ecosystem-Based Fishery Management. Science, 305(5682):346–347, 2004.

A. Pryke, S. Mostaghim, and A. Nazemi. Heatmap Visualization of Population Based Multi Objective Algorithms. School of computer science research reports - University of Birmingham CSR, 14, 2006.

J. Quackenbush. Extracting biology from high-dimensional biological data. Journal of Experimental Biology, 210(9):1507, 2007.

G.P. Quinn and M.J. Keough. Experimental Design and Data Analysis for Biologists. Cambridge University Press, 2002.

H.J. Rätz. Structures and changes of the demersal fish assemblage off Greenland, 1982–96. NAFO Scientific Council Studies, 32(1):15, 1999.

U. Riecken. Effects of Short-Term Sampling on Ecological Characterization and Evaluation of Epigeic Spider Communities and Their Habitats for Site Assessment Studies. Journal of Arachnology, 27(1):189–195, 1999.

F.M. Rodrigues and J.A.F. Diniz-Filho. Hierarchical structure of genetic distances: Effects of matrix size, spatial distribution and correlation structure among gene frequencies. Genetics and Molecular Biology, 21:233–240, 1998.

F.J. Rohlf and DL Fisher. Test for hierarchical structure in random data sets. Systematic Zoology, 17:407–412, 1968.

D. Scheibler and W. Schneider. Monte Carlo Tests of the Accuracy of Cluster Analysis Algorithms: A Comparison of Hierarchical and Nonhierarchical Methods. Multivariate Behavioral Research, 20:283–304, 1985.

H. Shimodaira. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology, 51(3):492–508, 2002.

H. Shimodaira. Testing regions with nonsmooth boundaries via multiscale bootstrap. Journal of Statistical Planning and Inference, 138(5):1227–1241, 2008.

R.R. Sokal and F.J. Rohlf. The comparison of dendrograms by objective methods. Taxon, 11(1):30–40, 1962.

P. Sousa, M. Azevedo, and M.C. Gomes. Demersal assemblages off Portugal: Mapping, seasonal, and temporal patterns. Fisheries Research, 75(1-3):120–137, 2005.

G. Stefánsson and OK Pálsson. BORMICON: A Boreal Migration and Consumption Model. Marine Research Institute Report. 58. 223 p., 1997.

R. Suzuki and H. Shimodaira. An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters? Proceedings by the Fifteenth International Conference on Genome Informatics (GIW 2004), p. P, 34, 2004.

R. Suzuki and H. Shimodaira. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22(12):1540–1542, 2006.

A.J. Vakharia and U. Wemmerlöv. A comparative investigation of hierarchical clustering techniques and dissimilarity measures applied to the cell formation problem. Journal of Operations Management, 13(2):117–138, 1995.

H. Valdimarsson and S. Malmberg. Near-surface circulation in Icelandic waters derived from satellite tracked drifters. Rit Fiskideildar, 16:23–39, 1999.

J.H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

L. Zhang, A. Zhang, and M. Ramanathan. Fourier harmonic approach for visualizing temporal patterns of gene expression data. Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE, pages 137–147, 2003.
