USING E-INFRASTRUCTURES FOR BIODIVERSITY CONSERVATION - Module 4

Post on 15-Aug-2015


Using e-Infrastructures for Biodiversity Conservation

Gianpaolo Coro ISTI-CNR, Pisa, Italy

Module 4 - Outline

1. Data processing requirements by communities of practice

2. The D4Science Statistical Manager

3. Ecological modelling

D4Science

D4Science is both a Data and a Computational e-Infrastructure

• Used by several Projects: i-Marine, EUBrazil OpenBio, ENVRI;

• Implements the notion of e-Infrastructure as-a-Service: it offers on demand access to data management services and computational facilities;

• Hosts several VREs for Fisheries Managers, Biologists, Statisticians…and Students.

D4Science - Resources

Large set of connected biodiversity and taxonomic datasets

A network to distribute and access geospatial data

Distributed storage system to store datasets and documents

A social network to share opinions and useful news

Algorithms for Biology-related experiments

Data Processing

1. Data processing requirements by communities of practice

Some interests by communities of practice in Computational Statistics:

1. Repetition and validation of experiments

2. Exploitation of algorithms in several contexts

3. Hide the complexity of the calculations

4. Facilitate the management and the publication of the algorithms

Issues

…practically speaking, they search for:

1. Modular and pluggable solutions

2. Access by means of standard protocols

3. Hiding the complexity of parallel processing

4. Hiding the complexity of software management and provisioning

5. Active contribution with new algorithms and use cases

Issues

2. The D4Science Statistical Manager

The Statistical Manager is a set of web services that aim to:

• Help scientists in computational statistics experiments

• Supply precooked state-of-the-art algorithms as-a-Service

• Perform calculations with Map-Reduce in a way that is transparent to the users

• Share input, results, parameters and comments with colleagues by means of Virtual Research Environments in the D4Science e-Infrastructure
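As a rough illustration of the Map-Reduce pattern mentioned above, the sketch below splits a list of species into chunks, processes each chunk independently, and merges the partial results. It is not the Statistical Manager's code: the function names and the toy per-species computation (the length of the name) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def process_chunk(species_chunk):
    # Map step: hypothetical per-species computation (here, the name length).
    return {name: len(name) for name in species_chunk}


def merge(left, right):
    # Reduce step: merge two partial result dictionaries.
    return {**left, **right}


def map_reduce(species, chunk_size=2, workers=4):
    # Split the input, run the map step on each chunk, then reduce.
    chunks = [species[i:i + chunk_size] for i in range(0, len(species), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return reduce(merge, partials, {})


result = map_reduce(["Gadus morhua", "Latimeria chalumnae", "Architeuthis dux"])
```

The user only calls `map_reduce`; the splitting and merging stay hidden, which is the point of the "seamless" requirement.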

Statistical Manager – Users’ View

[Figure labels: Statistical Manager; D4Science Computational Facilities; Sharing; Setup and execution]

Open Platform Approach

[Figure labels: External Computing Facility; OGC WPS Interface]

People can contribute with:

• R scripts
• Java programs
• Linux programs
• OGC-WPS services

The Statistical Manager allows developers to:

• Develop distributed computations in an easy way (Statistical Manager Framework)

• Parallelize R scripts, possibly without changing the code

• Automatically produce a User Interface to perform experiments

• Reuse models and best practices developed by the community

• Connect external computational facilities via the OGC WPS standard
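To make the WPS connection more concrete, here is a minimal sketch of how a client could build an OGC WPS 1.0.0 Execute request in key-value form. The endpoint URL, the process identifier and the input names are placeholders chosen for illustration; they are not actual D4Science values.

```python
from urllib.parse import urlencode


def wps_execute_url(endpoint, identifier, inputs):
    # WPS 1.0.0 key-value encoding: inputs are "name=value" pairs joined by ";".
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": identifier,
        "datainputs": data_inputs,
    })
    return f"{endpoint}?{query}"


# Hypothetical endpoint and process identifier.
url = wps_execute_url(
    "https://example.org/wps",
    "org.example.DBSCAN",
    {"epsilon": 10, "min_points": 2},
)
```

Any HTTP client can then send a GET request to `url`; the server replies with an XML document describing the execution.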

Statistical Manager – Developers’ View

Architecture

Internal Work

The Context: Resources and Sharing

Statistical Manager - Interface

Experiment Execution

Computations Check

Summary of the Input, Output and Parameters of the experiment

Data Space - Sharing and Import

100 Hosted Algorithms

Numbers

Users

[Chart legend: FishBase (US, CA, TW); Geomar; Naturhistoriska riksmuseet; Agrocampus; Anonymous Individuals; INRA; King Abdullah University of Science and Technology; ISTI]

                                2013               2014
Avg users per month             200                20100
Number of algorithms            50                 100
Organizations providing
algorithms                      2 (CNR, Geomar)    7 (CNR, Geomar, FIN, FAO, T2, IRD, Agrocampus)
Publications                    8                  13
Sum Impact Factor               2.66               12.17

Publications around the Statistical Manager

2012

1. L. Candela, G. Coro, P. Pagano. "Supporting Tabular Data Characterization in a Large Scale Data Infrastructure by Lexical Matching Techniques". In M. Agosti et al. (Eds.): IRCDL 2012, Communications in Computer and Information Science, Volume 354, pp. 21–32. Springer, Heidelberg (2012).

2013

2. R. Froese, J. Thorson, R. B. Reyes Jr. A Bayesian approach for estimating length-weight relationships in fishes. Journal of Applied Ichthyology, Volume 30, Issue 1, pages 78–85, 2013.

3. G. Coro, P. Pagano, A. Ellenbroek. "Combining Simulated Expert Knowledge with Neural Networks to Produce Ecological Niche Models for Latimeria chalumnae". Ecological Modelling, DOI 10.1016/j.ecolmodel.2013.08.005, Ed. Elsevier.

4. G. Coro, L. Fortunati, P. Pagano. Deriving Fishing Monthly Effort and Caught Species from Vessel Trajectories. Oceans 2013, Proceedings of MTS/IEEE.

5. P. Pagano, G. Coro, D. Castelli, L. Candela, F. Sinibaldi, A. Manzi. Cloud Computing for Ecological Modeling in the D4Science Infrastructure. Proceedings of EGI Community Forum 2013.

6. D. Castelli, P. Pagano, G. Coro, F. Sinibaldi. "Modellazione della Nicchia Ecologica di Specie Marine (Marine Species Ecological Niche Modelling)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies), pp. 140, Ed. CNR (Roma, Italy).

7. D. Castelli, P. Pagano, G. Coro. "Variazioni Climatiche ed Effetto sulle Specie Marine (Climate Changes and Effect on Marine Species)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies), pp. 139, Ed. CNR (Roma, Italy).

8. D. Castelli, P. Pagano, G. Coro. "Elaborazione di Dati Trasmessi da Pescherecci (Processing of fishing vessel transmitted information)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies), pp. 133, Ed. CNR (Roma, Italy).

9. G. Coro, P. Pagano, A. Ellenbroek. Automatic Procedures to Assist in Manual Review of Marine Species Distribution Maps. To be published in M. Tomassini et al. (Eds.): International Conference on Adaptive and Natural Computing Algorithms (ICANNGA'13), Springer, Heidelberg (2013).

10. Candela L., Castelli D., Coro G., Pagano P., Sinibaldi F. Species distribution modeling in the cloud. In: Concurrency and Computation-Practice & Experience, Geoffrey C. Fox, David W. Walker (eds.). Wiley.

11. Appeltans W., Pissierssens P., Coro G., Italiano A., Pagano P., Ellenbroek A., Webb T. Trendylyzer: a long-term trend analysis on biogeographic data. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 203 - 205. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.

12. Coro G., Gioia A., Pagano P., Candela L. A service for statistical analysis of marine data in a distributed e-infrastructure. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 68 - 70. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.

13. Castelli D., Pagano P., Candela L., Coro G. The iMarine data bonanza: improving data discovery and management through a hybrid data infrastructure. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 105 - 107. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.

14. Coro G. A Lightweight Guide on Gibbs Sampling and JAGS. Technical report, 2013.

15. Vanden Berghe E., Bailly N., Aldemita C., Fiorellato F., Coro G., Ellenbroek A., Pagano P. BiOnym - a flexible workflow approach to taxon name matching. In: TDWG 2013 - Taxonomic Database Working Group 2013 (Firenze, 28-31 October 2013).

16. Coro G., Pagano P., Candela L. Providing Statistical Algorithms as-a-Service. In: TDWG 2013 - Taxonomic Database Working Group 2013 (Firenze, 28-31 October 2013).

2014

17. Candela L., Castelli D., Coro G., De Faveri F., Italiano A., Lelii L., Mangiacrapa F., Marioli V., Pagano P. Integrating Species Occurrence Databases to Facilitate Data Analysis. Approved for the Ecological Informatics Journal, Elsevier 2014.

18. Froese R., Coro G., Kleisner K., Demirel N. Revisiting Safe Biological Limits in Fisheries. Submitted to the Fish and Fisheries Journal, Wiley 2014.

19. Coro G., Candela L., Pagano P., Italiano A., Liccardo L. Parallelising the Execution of Native Data Mining Algorithms for Computational Biology. Submitted to Concurrency and Computation-Practice & Experience, Wiley 2014.

20. Coro G., Pagano P., Ellenbroek A. Comparing Heterogeneous Distribution Maps for Marine Species. Submitted to GIScience & Remote Sensing, Taylor & Francis 2014.

2015

21. G. Coro, C. Magliozzi, A. Ellenbroek, P. Pagano. Improving data quality to build a robust distribution model for Architeuthis dux. Ecological Modelling, Volume 305, 10 June 2015, Pages 29-39, ISSN 0304-3800.

22. G. Coro, C. Magliozzi, E. Vanden Berghe, N. Bailly, A. Ellenbroek, P. Pagano. Estimating absence locations of marine species from data of scientific surveys.

23. R. Froese, N. Demirel, G. Coro, K. Kleisner, H. Winker. Estimating Fisheries Reference Points from Catch and Resilience.

24. E. Vanden Berghe, N. Bailly, G. Coro, F. Fiorellato, C. Aldemita, A. Ellenbroek, P. Pagano. Retrieving taxa names from large biodiversity data collections using a flexible matching workflow.

25. G. Coro, C. Magliozzi, A. Ellenbroek, K. Kaschner, P. Pagano. Automatic classification of climate change effects on marine species distributions in 2050 using the AquaMaps model.

26. E. Trumpy, G. Coro, A. Manzella, P. Pagano, D. Castelli, P. Calcagno, A. Nador, T. Bragasson, S. Grellet. Building a European Geothermal Information Network using a

3. Ecological modelling

Niche Modelling

Scope:

• characterize the environmental conditions that are suitable for the species to subsist;
• identify where suitable environment is distributed in geographical space;
• estimate the actual and potential geographic distributions of a species.

Actual distribution: areas that are truly occupied by the species

Fundamental niche: the full range of abiotic conditions within which the species is viable

Potential distribution: areas with abiotic conditions that fall within the fundamental niche

Niche Modelling and Absence and Presence Points

Approaches:

Mechanistic models: incorporate physiological limits in a species tolerance to environmental conditions;

Correlative models: automatically estimate the environmental conditions that are suitable for a species by relying on examples.

Presence points: occurrence records, i.e. places where the species has been observed in its habitat

Absence points: locations where the environment is considered unsuitable for the species. In many cases, absence points must be simulated (pseudo-absence points), because reliable data are rare.

Examples: Potential Distributions of the Coelacanth

[Figure panels: Presence-only: MaxEnt; Presence-only: GARP; Expert (semi-Mechanistic): AquaMaps; Presence/Absence: Artificial Neural Networks]

Comparison between several approaches estimating the potential distribution of the Coelacanth.

The best approach depends on the quality of the data. Thus, cleaning operations are very important!

C-squares (concise spatial query and representation system):

• A system of geocodes that provides a basis for simple spatial indexing of geographic features

• Devised by Tony Rees of CSIRO Marine and Atmospheric Research

• A compact encoding of Latitude and Longitude and Resolution

Example:

C-square code: 3414:227:3
Resolution: 0.5°
N,S,W,E limits: -42.5,-43.0,147.0,147.5

A useful converter: http://www.marine.csiro.au/marq/csq_builder.init
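The encoding can be sketched in a few lines. The function below builds a 0.5° C-square code from a latitude and longitude, following the published C-squares notation (global quadrant digit 1 = NE, 3 = SE, 5 = SW, 7 = NW, then 10°, 1° and 0.5° subdivisions). The function name is hypothetical and the handling of points exactly on the poles or the antimeridian is a simplification.

```python
def csquare_half_degree(lat, lon):
    # Global quadrant: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    quadrant = {(True, True): 1, (False, True): 3,
                (False, False): 5, (True, False): 7}[(lat >= 0, lon >= 0)]
    # Work with absolute values; clamp the extreme edges (simplified handling).
    alat = min(abs(lat), 89.999999)
    alon = min(abs(lon), 179.999999)
    lat_tens, lat_unit = divmod(int(alat), 10)
    lon_tens, lon_unit = divmod(int(alon), 10)
    # 10-degree square: quadrant, tens of latitude, tens of longitude (2 digits).
    ten_degree = f"{quadrant}{lat_tens}{lon_tens:02d}"
    # 1-degree square: intermediate quadrant (1-4), then the unit digits.
    intermediate = 1 + 2 * (lat_unit >= 5) + (lon_unit >= 5)
    one_degree = f"{intermediate}{lat_unit}{lon_unit}"
    # 0.5-degree square: quadrant (1-4) inside the 1-degree square.
    half_degree = 1 + 2 * ((alat - int(alat)) >= 0.5) + ((alon - int(alon)) >= 0.5)
    return f"{ten_degree}:{one_degree}:{half_degree}"


# Centre of the example cell above (limits -42.5, -43.0, 147.0, 147.5).
code = csquare_half_degree(-42.75, 147.25)
```

For the centre of the example cell, the function reproduces the code `3414:227:3` shown on the slide.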

C-square codes

Contains information on:
a) cell codes
b) statistical cell properties (center, limits, and area);
c) membership in relevant areas (FAO areas, EEZs or LMEs);
d) physical attributes (depth, salinity or temperature);
e) biological properties (e.g. primary production).

Data gathered from:
• Sea Around Us Project
• CSIRO
• Kansas Geological Survey

Compiled by: Kristin Kaschner & Jonathan Ready

HCAF (Half-degree Cells Authority File)

Contains information used for describing the environmental tolerance and preference of a species:

• distribution using FAO areas and bounding box
• range of values per environmental parameter (min., preferred min., preferred max., max.)

HSPEN (Half-degree Species Environmental Envelope)

Online experiment: the i-Marine Filtering Facilities

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

A Niche model relying on expert knowledge

Contains the assignment of a species to a half-degree cell and the corresponding probability of occurrence of the species in a given cell;

The assignment probability is the product of the probabilities computed for each of the environmental parameters (SST, salinity, primary production, sea ice concentration, distance to land).

HSPEC (Half-degree Species Assignment)

AquaMaps

Gadus morhua

A presence-only species model that relies on expert knowledge about the species habitat

• AquaMaps Suitable: estimates the Potential Distribution
• AquaMaps Native: estimates the Actual Distribution

• Maps have 0.5 degrees resolution;
• Expert knowledge is used in modelling the habitat parameters;
• AquaMaps adopts mechanistic assumptions combined with an automatic estimation of parameter values.

• “good cells” - within bounding box or known FAO areas
• a minimum of 10 “good cells” is needed for extracting parameters

Bounding box or FAO area limits serve as independent verification of the validity of occurrence records.

AquaMaps – Good Cells

Taken from: http://www.aquamaps.org/main/presentations/Part%20II%20-%20AquaMaps%20behind%20the%20scene.pdf

Global grid of 259,200 half degree cells

Good cells are used to derive the range of environmental parameters within the species’ native range.

AquaMaps – Extracting Environmental Parameters

Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf

• Depth ranges: typically from literature; depth estimate based on habitat description

• Min = 25th percentile - 1.5 * interquartile or absolute minimum in extracted data (whichever is greater)

• Max = 75th percentile + 1.5 * interquartile or absolute maximum in extracted data (whichever is greater)

• PrefMin = 10th percentile of observed variation in an environmental parameter

• PrefMax = 90th percentile of observed variation in an environmental parameter

• Surface values for species with min depth ≤ 200m

• Bottom values for species with min depth > 200m
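The rules above can be written down as a small function. The sketch below applies them literally as worded on the slide to a list of values extracted from the good cells for one environmental parameter. The function names are hypothetical, and the percentile definition (linear interpolation between ranks) is an assumption, since the slide does not say which one AquaMaps uses.

```python
def percentile(sorted_values, p):
    # Linear interpolation between closest ranks (an assumed definition).
    position = (len(sorted_values) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def envelope(values):
    data = sorted(values)
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    return {
        # "whichever is greater", as stated on the slide for both limits.
        "min": max(q1 - 1.5 * iqr, data[0]),
        "pref_min": percentile(data, 10),
        "pref_max": percentile(data, 90),
        "max": max(q3 + 1.5 * iqr, data[-1]),
    }


# Toy data: parameter values 0, 1, ..., 100 observed in the good cells.
env = envelope(range(0, 101))
```

With the toy data, the preferred range is 10 to 90, while the outer limits come from comparing the interquartile fences with the observed extremes.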

The environmental envelopes describe tolerances of a species with respect to each environmental parameter.

AquaMaps – Environmental Envelopes

Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf

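[Figure: environmental envelope. Horizontal axis: Predictor, with Min, Preferred min, Preferred max and Max marked; vertical axis: Relative probability of occurrence, with maximum PMax]

$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$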

Probabilities of species occurrence are generated by matching the species environmental envelope against local environmental conditions to determine relative suitability of a given area.
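The matching of an envelope against local conditions can be sketched as follows, assuming the usual trapezoidal reading of an envelope: probability 0 outside [Min, Max], 1 inside [Preferred min, Preferred max], and a linear ramp in between. The envelope values, parameter names and function names below are made up for illustration and do not describe any real species.

```python
from math import prod


def trapezoid(value, env_min, pref_min, pref_max, env_max):
    # Relative suitability of one environmental value for one parameter.
    if value < env_min or value > env_max:
        return 0.0
    if value < pref_min:
        return (value - env_min) / (pref_min - env_min)
    if value > pref_max:
        return (env_max - value) / (env_max - pref_max)
    return 1.0


def cell_probability(cell, envelopes):
    # Overall probability: product of the per-parameter probabilities.
    return prod(trapezoid(cell[name], *envelopes[name]) for name in envelopes)


# Hypothetical envelopes: (min, preferred min, preferred max, max).
envelopes = {"sst": (0.0, 5.0, 15.0, 20.0), "salinity": (30.0, 33.0, 36.0, 38.0)}
cell = {"sst": 2.5, "salinity": 34.0}
p = cell_probability(cell, envelopes)
```

Because the per-parameter values are multiplied, a single unsuitable parameter (probability 0) is enough to exclude a cell.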

Probability of Occurrence

AquaMaps – Environmental Envelopes

Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf

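The probability is calculated for each 0.5° cell in the oceans. A color is associated with each probability value.

AquaMaps – Probability

$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$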

Online experiment: AquaMaps

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

What if Expert Knowledge was missing?

Artificial Neural Network

Presence/Absence Points examples

Probability (1/ 0)

• Learns from positive (presence) and negative (absence) examples (training mode);
• Adapts the network weights to produce the correct outputs on the examples;
• Produces probability values for new input (test mode).
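The training and test modes can be illustrated with the smallest possible network, a single sigmoid neuron trained by gradient descent. This is only a sketch of the idea, not the network used for the maps in this module; the single standardised feature and the example values are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, rate=0.5):
    # Training mode: adapt weights so outputs approach the labels (1 or 0).
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, label in examples:
            output = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = label - output
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias


def predict(model, features):
    # Test mode: produce a probability for a new input.
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Invented examples: one standardised environmental feature per point.
examples = [((1.0,), 1), ((2.0,), 1), ((-1.0,), 0), ((-2.0,), 0)]
model = train(examples)
```

A real niche model would use several environmental features per point and hidden layers, but the two modes work the same way.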

Artificial Neural Networks Maps

Examples and Exercises: AquaMaps - Neural Networks

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

Climate change analysis

• HCAF Scenarios can be simulated by means of interpolation.

• Interpolation produces half-degree values between a start and an end date

• Once new HCAFs are available we can produce an HSPEC for each HCAF
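A minimal sketch of the interpolation step, assuming a linear interpolation between two HCAF snapshots (the slides do not state which interpolation is used). The cell code, parameter name, years and values are illustrative only.

```python
def interpolate_hcaf(start, end, start_year, end_year, year):
    # Fraction of the way from the start snapshot to the end snapshot.
    t = (year - start_year) / (end_year - start_year)
    result = {}
    for cell, params in start.items():
        # Interpolate every environmental parameter of the cell.
        result[cell] = {
            name: value + t * (end[cell][name] - value)
            for name, value in params.items()
        }
    return result


# Illustrative snapshots: one cell, one parameter (sea surface temperature).
hcaf_start = {"3414:227:3": {"sst": 12.0}}
hcaf_end = {"3414:227:3": {"sst": 14.0}}
hcaf_mid = interpolate_hcaf(hcaf_start, hcaf_end, 2010, 2050, 2030)
```

Each interpolated HCAF can then be fed to the niche model to produce the HSPEC for that year.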

Simulation of HCAF Scenarios

Climate Changes Effects on Species

Estimated impact of climate changes over 20 years on 11549 species.

Bioclimate HSpec

Overall occupancy in time

Online experiment: BioClimate Analysis

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

Grouping the occurrence points and the environmental features of different species

• Group points by spatial distance or density
• Detect outliers

Occurrence Points Clustering

DBScan acts on the points density

Parameters:
• Epsilon = 10
• Min Points = 2

Outliers

Density Clustering
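The density-based approach can be sketched with a compact, unoptimised DBSCAN in plain Python. It is meant only to show how Epsilon and Min Points drive the result and how outliers appear as noise; it is not the implementation hosted on the platform, the distance is plain Euclidean, and the toy coordinates are invented.

```python
import math

NOISE = -1


def dbscan(points, epsilon, min_points):
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points (including i itself) within epsilon of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Toy occurrence points: two dense groups and one isolated record.
points = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50), (200, 200)]
labels = dbscan(points, epsilon=10, min_points=2)
```

Points labelled `-1` are the outliers; distance-based methods such as KMeans assign every point to a cluster, so they report none.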

XMeans
K = [20,30], Min Points = 2, MaxIter = 1000
No Outliers Detected!

KMeans
K = 24, Min Points = 2, MaxIter = 1000, MaxOptSteps = 1000
No Outliers Detected!

Distance Clustering

Online experiment: Clustering

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

Discovering similarities among habitats

Similarity between habitats

Habitat Representativeness Score:
• Measures the degree to which sampled habitats are representative for a certain area of study;
• Has been used for assessing the minimum number of surveys on a study area that are needed to cover a good heterogeneity of species habitat variables.

Can be used to:
• Measure the similarity between the environmental features of two areas;
• Assess the quality of models and environmental features.

[Figure: Habitat Representativeness Score, HRS = 10.6. Panel labels: A+P HRS 10.58; P HRS 10.61]

Habitat Representativeness Score

[Figure labels: Absence; Presence]

The HRS is too high -> all the maps can be unreliable and need expert validation

HRS is in [0;2] for each feature

The overall HRS is the sum of the HRSs of the environmental features

Habitat Representativeness Score for each Feature

Presence, Absence (HRS 10.58)

Feature                          HRS
mean depth in t.c.               1.90
max depth in t.c.                0.87
min depth in t.c.                0.04
mean annual s surface temp       1.19
mean annual s bottom temp        1.59
mean salinity in t.c.            1.23
mean bottom salinity in t.c.     0.44
mean primary production          0.61
annual ice concentration         0.71
distance from land               0.46
ocean area in t.c.               1.54

The most representative feature is the minimum depth in a cell of 0.5 degrees

Presence only (HRS 10.61)

Feature                          HRS
mean depth in t.c.               1.92
max depth in t.c.                0.86
min depth in t.c.                0.04
mean annual s surface temp       1.13
mean annual s bottom temp        1.56
mean salinity in t.c.            1.29
mean bottom salinity in t.c.     0.34
mean primary production          0.64
annual ice concentration         0.78
distance from land               0.49
ocean area in t.c.               1.55

Even in this case the most representative feature is the minimum depth in a cell of 0.5 degrees
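As a quick check of how the overall score is built, the snippet below sums the per-feature scores listed for the "Presence, Absence" case and picks the feature with the lowest score. The variable names are arbitrary; the numbers are those reported on the slide.

```python
# Per-feature HRS values for the "Presence, Absence" case.
feature_hrs = {
    "mean depth": 1.90,
    "max depth": 0.87,
    "min depth": 0.04,
    "mean annual surface temp": 1.19,
    "mean annual bottom temp": 1.59,
    "mean salinity": 1.23,
    "mean bottom salinity": 0.44,
    "mean primary production": 0.61,
    "annual ice concentration": 0.71,
    "distance from land": 0.46,
    "ocean area": 1.54,
}

# Overall HRS: sum of the per-feature scores, each within [0, 2].
overall_hrs = round(sum(feature_hrs.values()), 2)

# The most representative feature is the one with the lowest score.
most_representative = min(feature_hrs, key=feature_hrs.get)
```

The sum reproduces the overall value of 10.58, and the lowest per-feature score belongs to the minimum depth, in line with the remark on the slide.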

Online experiment: Habitat Representativeness Score

https://i-marine.d4science.org/group/biodiversitylab/processing-tools

Retrieving taxonomic information for a set of species

BiOnym

A workflow approach to taxon name matching.

Accounts for:
• Variations in the spelling and interpretation of taxonomic names
• Combination of data from different sources
• Harmonization and reconciliation of Taxa names

[Workflow diagram labels: Raw Input String, e.g. Gadus morua Lineus 1758; Preprocessing and Parsing; Taxon Matcher 1, Taxon Matcher 2, ..., Taxon Matcher n; PostProcessing; Correct Transcriptions, e.g. Gadus morhua (Linnaeus, 1758); Reference Sources: ASFIS, FISHBASE, WoRMS, Other in DwC-A]
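A single matching step can be imitated with generic fuzzy string matching from the Python standard library. This is a stand-in technique, not one of the BiOnym matchers: the reference list is a tiny invented sample, the parsing (keep the first two tokens as genus and species) is deliberately crude, and the cutoff value is arbitrary.

```python
import difflib

# Tiny invented sample standing in for a reference source.
REFERENCE_NAMES = ["Gadus morhua", "Gadus macrocephalus", "Latimeria chalumnae"]


def match_taxon(raw_input, reference=REFERENCE_NAMES, cutoff=0.8):
    # Crude parsing: keep genus and species, drop author and year.
    name = " ".join(raw_input.split()[:2])
    # Generic similarity matching; returns the best candidate above the cutoff.
    matches = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


best = match_taxon("Gadus morua Lineus 1758")
```

In a workflow, several such matchers with different tolerances would be chained, each one handling the names left unmatched by the previous one.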

[Figure labels: GSAy, GSAY, GSrAy, GSrAY, GSA; Complete match; Parentheses issue; Gender agreement issues; Gender agreement and parentheses issues; Year issues]

Step     Rate
GSAy     950
GSAY     940
GSrAy    930
GSrAY    920
GSA      910
GSrA     900
GSY      890
GSrY     880
SAy      870
SAY      860
SrAy     850
SrAY     840
GAy      830
GAY      820
…

Matcher Example - GSAy

[Figure labels: GSY, GS, SrAy, GAY, Rest; Author issues, misspelling or wrong; Homonyms; Other combinations; Taxamatch; Visual check]

Step     Rate
GSY      950
GSAY     940
GSrAy    930
GSrAY    920
GSA      910
GSrA     900
GSY      890
GSrY     880
SAy      870
SAY      860
SrAy     850
SrAY     840
GAy      830
GAY      820
…

Matcher Example - GSAy

BiOnym - Output

Online experiment: BiOnym

https://i-marine.d4science.org/group/biodiversitylab/processing-tools