INCOFISH WP3 - Campinas, April 2006 WEB Tools and Data Cleaning Alexandre Marino marino@cria.org.br...

Post on 16-Jan-2016


INCOFISH WP3 - Campinas, April 2006

WEB Tools and Data Cleaning

Alexandre Marino

marino@cria.org.br

Centro de Referência em Informação Ambiental, CrIA

WEB Tools and Data Cleaning

These tools were developed within the scope of the speciesLink project, so in some cases they depend completely on the architecture, the local database, and the libraries developed by CRIA.

Data Cleaning started as an idea without a very clear direction and became a very specific system.

The speciesLink project was funded by FAPESP (São Paulo state agency) from October 2001 to October 2005.

Different data sources, software and systems

[Diagram: five collections (Col 1 to Col 5), each on a different platform: Win2000/Brahms, Linux/MySQL, Win98/Access, Win98/biota, FreeBSD/PostgreSQL. Question marks separate them from a single search program and interface.]

Protocol and Content Schema

• DiGIR protocol (Distributed Generic Information Retrieval)

Potential to be globally accepted

• DiGIR software (Java Portal & PHP Provider)

Collaborative development

• DarwinCore v.2

Covers the basic content elements (taxonomic identification, location and date of collecting event)

System's Architecture

[Diagram components: speciesLink site (presentation layer); DiGIR Portal (Java); Perl. Fast and stable connectivity: Collection A, with a collection management system, SQL, data and a PHP provider. Slow or unstable connectivity: Collections B and C, each with a collection management system, SQL, a data repository and a SOAP client, connected to a mirror server (SOAP server, Postgres, PHP provider).]

speciesLink network (March 2006)

~40 connected collections

~940,000 on-line records

[Figure label: JBRJ]

WEB Tools

• geoLoc

• spOutlier

• infoXY

• conversor

• speciesMapper

• data cleaning

About geoLoc

to assist biological collections in geo-referencing their data

the database includes approximately 110 thousand names of Brazilian localities, obtained from:

• Brazilian Institute of Geography and Statistics (IBGE)

• GEOnet Names Server (GNS)

• speciesLink/Fapesp

the algorithm, based on concepts from the Egaz program (Shattuck 1997), calculates a coordinate from a distance and direction relative to a known locality

Tools

[Screenshot: geoLoc query form with example input: distance 26, direction Northwest (NW), locality Campinas, state São Paulo]
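
The distance-and-direction step can be sketched as follows. This is a minimal illustration on a spherical Earth, not the Egaz or geoLoc algorithm itself, which the slides do not detail. The function name `offset_coordinate`, the kilometre unit, and the approximate coordinates used for Campinas are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

# Compass directions mapped to bearings in degrees clockwise from north.
BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}


def offset_coordinate(lat, lon, distance_km, direction):
    """Return the (lat, lon) reached by travelling distance_km from (lat, lon).

    Hypothetical helper: uses the standard great-circle destination formula.
    """
    bearing = math.radians(BEARINGS[direction])
    lat1, lon1 = math.radians(lat), math.radians(lon)
    d = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    lat2 = math.asin(math.sin(lat1) * math.cos(d)
                     + math.cos(lat1) * math.sin(d) * math.cos(bearing))
    lon2 = lon1 + math.atan2(math.sin(bearing) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


# "26 NW of Campinas": the reference coordinates are approximate and assumed.
campinas = (-22.91, -47.06)
lat, lon = offset_coordinate(campinas[0], campinas[1], 26, "NW")
print(f"{lat:.4f}, {lon:.4f}")
```

A real gazetteer lookup would first resolve the locality name to its reference coordinates; here they are hard-coded to keep the sketch self-contained.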

Tools

About spOutlier

to assist biological collections in identifying possible suspect points in existing records

uses techniques modified from Chapman 1999 to detect outliers in latitude, longitude and altitude

allows users to indicate their data set as either terrestrial or marine

useful to biologists around the world who wish to identify possible errors in their data

1, -63.25, -4.916666667, 795
2, -67.05, -10.96666667, 805
3, -68.0125, -12.66666667, 809
4, -68.75, -13.60111111, 815
5, -68.9102, -13.83333, 810
6, -72.3666, -14.36611111, 790
7, -78.3166, -14.38916667, 801
8, -72.137, -11.8647, 700
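
As a rough illustration of outlier flagging on the sample above, the sketch below applies a modified z-score based on the median absolute deviation. This is a swapped-in, generic technique, not the modified Chapman (1999) method that spOutlier uses. The function name `flag_outliers` is hypothetical, the reading of the columns as id, longitude, latitude, altitude is an assumption based on the value ranges, and 3.5 is simply a commonly used cutoff.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; a simplification
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Sample records from the slide; column meaning is assumed:
# (id, longitude, latitude, altitude)
records = [
    (1, -63.25, -4.916666667, 795),
    (2, -67.05, -10.96666667, 805),
    (3, -68.0125, -12.66666667, 809),
    (4, -68.75, -13.60111111, 815),
    (5, -68.9102, -13.83333, 810),
    (6, -72.3666, -14.36611111, 790),
    (7, -78.3166, -14.38916667, 801),
    (8, -72.137, -11.8647, 700),
]

altitudes = [r[3] for r in records]
suspect_ids = [records[i][0] for i in flag_outliers(altitudes)]
print(suspect_ids)  # [8]
```

Record 8, with an altitude of 700 against a median of 803, is the only one flagged. The same function could be run on the longitude and latitude columns separately.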

marine

1, -63.25, -4.91667
2, 34.3239, 67.9836
aus, 150.0417, -34.9081
3, -68.0125, -12.6667
4, -22.0400, 63.9514
id_teste, -45, -22
6, -75.3667, -14.3661
7, 71.37, -19.37
eua, -80.8011, 26.0506
9, -120.7642, 58.7217
10, 26.0089, -29.5197
11, -95.3781, 16.7639

Input/Output:
- degrees, min, sec
- decimal degrees
- UTM

DATUM:
- WGS84 (World)
- SAD69 (Brazil)
- Córrego Alegre (SP)

decimal degrees        degrees, min, sec
-3.5800 , 52.0633      03d34'47"W , 52d3'47"N
34.3239 , 67.9836      34d19'23"E , 67d59'0"N
-45 , -22              44d59'58"W , 21d59'58"S
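
The conversion between decimal degrees and degrees, minutes, seconds can be sketched as below. This is a minimal illustration only: the function names `to_dms` and `to_decimal` are hypothetical, and the sketch performs no datum transformation (WGS84, SAD69, Córrego Alegre) and no UTM conversion, so its results need not match the tool's sample output to the second.

```python
def to_dms(value, axis):
    """Format decimal degrees as degrees, minutes, seconds plus hemisphere.

    axis is "lat" or "lon"; seconds are rounded to whole numbers.
    """
    if axis == "lat":
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}d{}'{}\"{}".format(degrees, minutes, seconds, hemisphere)


def to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds and a hemisphere letter to decimal degrees."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)


print(to_dms(-45, "lon"), ",", to_dms(-22, "lat"))
print(to_decimal(44, 59, 58, "W"))
```

Working in whole seconds before splitting into degrees and minutes avoids floating-point artefacts such as a value of 59.999 seconds.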

Plot georeferenced points on a map.

Available layers:
- World
- South and Central America
- Brazil
- São Paulo State

-95.6 -39.5166
-70.2833 -4.2
-70.033333 -4.35
-69.914889 0.274694
-69.7333 -4.2333
-69.6661 -3.908333
...

Trachurus trachurus

Pteroscion pele

Gaidropsarus biscayensis

Using

[Diagram components: spOutlier and geoLoc; SOAP web service (job1, job2); data in PostgreSQL; maps in PostGIS]

Tools

About Data Cleaning

Aims to help curators identify possible errors and standardize data

Records are not modified

The system just presents "suspect" records
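
The read-only idea can be sketched as follows: checks return references to suspect records and never alter the input. The two checks shown (out-of-range coordinates and names that differ only in case or spacing), the function names, and the sample records are all hypothetical; the slides do not list the actual rules used by the system.

```python
from collections import defaultdict


def suspect_coordinates(records):
    """Return ids of records whose coordinates are missing or out of range."""
    suspects = []
    for rec in records:
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if (lat is None or lon is None
                or not -90 <= lat <= 90 or not -180 <= lon <= 180):
            suspects.append(rec["id"])
    return suspects


def suspect_name_variants(records):
    """Return groups of spellings that differ only in case or spacing."""
    groups = defaultdict(set)
    for rec in records:
        key = " ".join(rec["name"].lower().split())
        groups[key].add(rec["name"])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Made-up sample records for illustration only.
sample = [
    {"id": 1, "name": "Trachurus trachurus", "latitude": -22.0, "longitude": -45.0},
    {"id": 2, "name": "Trachurus Trachurus", "latitude": -21.9, "longitude": -45.1},
    {"id": 3, "name": "Pteroscion pele", "latitude": -122.0, "longitude": -45.0},
]

print(suspect_coordinates(sample))
print(suspect_name_variants(sample))
```

Both functions only read the records, which mirrors the principle that the decision to correct a record stays with the curator.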

How Data Cleaning Works

[Diagram components: national collections (Col 1, Col 2, Col 3, ... Col n) and international collections (Col 1, Col 2, ...); speciesLink Portal (Java); Detect Suspect Records (Perl); local database (PostgreSQL) with the tables of suspect records dc_tax and dc_geo; chart.pm (Perl); Web.]

Demonstration on-line

Thank you!

marino@cria.org.br

Obrigado!