
Environmental Modelling & Software 21 (2006) 1579–1586, www.elsevier.com/locate/envsoft

Database architectures: Current trends and their relationships to environmental data management

Jaroslav Pokorny

Charles University, Faculty of Mathematics and Physics, Department of Software Engineering, Malostranske nam. 25, 118 00 Praha, Czech Republic

Received 11 November 2005

Available online 27 June 2006

Abstract

Ever increasing environmental information demands from customers, authorities, and governmental organizations, as well as new business control functions, are implemented and integrated into environmental information management systems (EIMSs). These systems are often based on traditional file techniques or, more recently, on commercial database management systems (DBMSs). With the production of huge data sets and their processing in real-time applications, the needs for environmental data management have grown significantly. Numerous examples from the practice of EIMSs show that the architecture of a DBMS should be open to permanent evolution. Current trends in database development and associated research meet these challenges. New information and communication technologies and techniques influence today's DBMSs. They include, among other things, sensor networks, stream processing, processing of uncertain and imprecise data, knowledge discovery and intelligent data analysis, as well as wireless broadcast and mobile computing. Both research and practice indicate that the traditional universal DBMS architecture hardly satisfies these trends and that new solutions are needed. Rather, separate specialized engines connected into networks are beneficial. The paper discusses recent advances in database technologies and attempts to highlight them with respect to the requirements of EIMSs. © 2006 Elsevier Ltd. All rights reserved.

Keywords: Environmental management system; Database management system; Sensor; Sensor network; Stream processing; Uncertain and imprecise data; Knowledge discovery and intelligent data analysis; Wireless broadcast; Mobile computing

1 Introduction

Without doubt, the world of data is changing, particularly the nature and sources of information. All these changes have a significant influence on database needs and, consequently, on the questions of where the database field is and where it should be going. Abiteboul et al. (2005) emphasize in their report two main driving forces today: the Internet and particular sciences, such as the physical sciences, biological sciences, medicine, and engineering. These sciences produce large and complex data sets that require more advanced database support than current products provide.

Another trend, existing since the 1960s, concerns industries that have faced ever increasing environmental demands from customers, authorities, and governmental organizations. Recently, reflecting these demands, new business control functions have been integrated into environmental management systems¹ (EMS). For their computerized part we can use the term environmental information system (EIS) if we address public environmental information systems, or environmental management information system (EMIS) if we deal with industrial environmental information systems. As data or information processing is primarily what we focus on, we will use the term EIS throughout the paper.

An important observation is that, similarly to the sciences mentioned, EISs also process huge data sets, often continually and with the triggering of various control actions. Consequently, the needs for environmental data management have grown significantly.

¹ By LCA (2005), an EMS is a part of the overall management system that includes organisational structure, planning activities, responsibilities, practices, procedures, processes, and resources for developing, implementing, achieving, reviewing, and maintaining the environmental policy.

Fax: +420 221914323.

E-mail address: pokorny@ksi.ms.mff.cuni.cz

1364-8152/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.

doi:10.1016/j.envsoft.2006.05.004

Considering environmental data sets combined with, e.g., business data, e-mails, documentation, etc., adequate information integration mechanisms are needed. Since their beginning, databases have had an integrative role in the world of data. Reuter (2005) argues that the technological evolution of database technology makes database systems the ideal candidate for integrating all types of objects that need persistence, as well as for supporting all the different types of execution that are characteristic of the various application classes.

The most important part of each management system deals with data through querying. When users want to search and use environmental information, the following problems occur (Tomasic and Simon, 1997):

(1) Data do not exist or are insufficient; sometimes this may require synthesis or reproduction of data.

(2) Data are not referenced by data suppliers and are therefore hard to locate, or data are referenced under specific classification criteria that are domain-specific.

(3) Data are hard to access: they are either private, or of too high a cost, or require costly pre-processing (e.g. data must be re-entered manually from paper documentation) or format translation.

(4) Accessed data sets are hard to use because they are inconsistent or non-compatible; for example, long time series are accessible, but standard data collection techniques have not been applied, thereby making adjacent time series incompatible.

(5) The quality of retrieved data is hard to assess; it is often hard to compare data produced using different scientific models because of a lack of documentation about the underlying computational processes.

The database community focuses on information storage, organization, management, and access in software architectures called database management systems (DBMSs). It has always been driven by new applications, technology trends, new synergies with related fields, and innovation within the field itself. The problems (1)–(5) are a natural part of today's database research and development. A natural idea is that EISs based on advanced database technologies could help to deal with these issues.

Several technological aspects influence DBMS development. Scientific data often arrive in streams. The sensor networks producing the data consist of very large numbers of low-cost devices, each of which is a data source measuring some quantity, e.g. the object's location or the ambient temperature. Processing such data is usually completely different from processing the data stored in enterprise databases. Data arrive in high-speed streams, and queries over those streams need to be processed in an online fashion to enable real-time responses. Moreover, in comparison to enterprise data processing, these data are uncertain or imprecise. Other aspects of such data processing include the unclear formulation of queries based on common techniques, as they are used, for example, in classical databases. Often we are not able to formulate a query, e.g. in SQL, and yet we believe that something interesting is hidden in our data. In such situations, a lack of semantics is apparent. To describe data semantics, metadata and its formal description are necessary.

The data in the collections considered create an ideal platform for using knowledge discovery methods and/or intelligent data analysis. Online analytical processing (OLAP), data warehouses (DW), and data mining (DM) techniques can also help in this context.

The purpose of the paper is to present the main challenges influencing today's database development with respect to the processing of environmental data. First, in Section 2, we discuss properties of new data sources. In Section 3 we review the concepts of the classical centralized DBMS architecture as it has existed since the early 1980s. According to Harder and Reuter (1983), the architecture models a five-level abstraction hierarchy. Its implementation has five technological layers that allow some problems and their solutions to be separated in a relatively independent way. The main part of the paper, Section 4, presents five new technologies influencing database architectures. They include sensor data and sensor networks, stream processing, approaching uncertain and imprecise data, knowledge discovery methods and intelligent data analysis, and wireless broadcast and mobile computing. In Section 5 we argue that new DBMS architectures are needed, describe briefly some of their proposals, and give several examples of their occurrence in practice. In the conclusions we summarize the basic ideas given in the paper and add a number of other issues that can influence the processing of environmental data.

2 New data sources

Usual enterprise data stored in databases are structured and can be described by a so-called (database) schema. Such a schema is almost fixed, or it is changed only rarely. This is not the case for collections of scientific and environmental data. According to Reuter (2005), the key properties of these data collections (irrespective of the many differences) are the following:

- The raw data is written once and never changes again. As a matter of fact, some scientific organizations require for all projects they support that any data that influences the published results of the project be kept available for an extended period of time, typically around 15 years.

- Raw data comes in as streams with high throughput (hundreds of MB/s), depending on the sensor devices. The streams have to be recorded as they come in, because in most cases there is no way of repeating the measurement.

- For the majority of applications, the raw data is not interesting. What the users need are aggregates, derived values, or, in the case of text fields, some kind of abstract of "what the text says".

- In many cases the schema has hundreds or thousands of attribute types, whereas each instance only has tens of attribute values.

- The schema of the structured part of the database is not fixed in many cases. As the discipline progresses, new phenomena are discovered, new types of measurements are made, units and dimensions are changed, and once in a while whole new concepts are introduced and/or older concepts are redefined. All those schema changes have to be accommodated dynamically.
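The last two properties in the list above (wide but sparse schemas, and schemas that change over time) can be illustrated with a minimal sketch. It is not taken from Reuter (2005): the class `SparseStore`, its methods, and the attribute names are hypothetical, and an in-memory dictionary stands in for real storage.

```python
from typing import Any, Dict, Set


class SparseStore:
    """Toy store: each instance keeps only the attributes it actually has."""

    def __init__(self) -> None:
        self.schema: Set[str] = set()              # all attribute types seen so far
        self.rows: Dict[str, Dict[str, Any]] = {}  # instance id -> its few values

    def put(self, key: str, **values: Any) -> None:
        # New attribute types are accommodated dynamically, with no schema-change step.
        self.schema.update(values)
        self.rows.setdefault(key, {}).update(values)

    def get(self, key: str, attribute: str) -> Any:
        # Attributes an instance lacks read as None rather than raising.
        return self.rows.get(key, {}).get(attribute)


store = SparseStore()
store.put("lake-1", water_temp_c=11.5, ph=7.2)
store.put("whale-7", lat=45.1, lon=-124.3)   # disjoint attribute set
store.put("lake-1", turbidity_ntu=3.0)       # the schema grows later
```

The design choice shown here, storing attribute-value pairs per instance instead of fixed columns, is only one of several ways to handle such data; it trades typed, fixed-width rows for flexibility.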

We will consider mainly data representing environmental objects and their relationships. Both objects and their relationships are characterized by attributes. Spatial environmental objects (such as lakes, bridges, buildings, clouds, whales, trees, and cars) have, e.g., a shape and other attributes that can change over time, e.g. the water temperature in a lake, the position of a whale, etc. That is why time and space are important components of an environmental system. Recently, environmental data transmission has been supported by wireless network technology.

It is not surprising that in many cases only traditional file-oriented solutions are available for the collections of objects considered. For example, the CORIE (Columbia River Estuary) system, based on three forms of data (scientific data, catalogue data, and task data), produces in its simulations 5 GB of forecast data each day (Bright and Maier, 2005). Nevertheless, its Metadata Repository is schema-less: no file formats, database access libraries, or XML schemas need be agreed upon. In connection with the Internet, Web services, and EISs, such solutions seem to be unsustainable.

3 Layered architecture of DBMS

Everybody using, e.g., a relational database is aware of the fact that the tables of data occurring at the top of a database system are virtual in some sense. More specifically, they provide a logical data structure suitable for user-oriented processing of data in the database. In the early 1980s, Harder and Reuter (1983) proposed a mapping model consisting of five layers. Table 1, adopted from Harder (2005), shows these five layers in detail. We can observe the objects to be dealt with at each abstraction level and the particular functions implementing mappings between two consecutive layers. For example, the non-procedural access at level L5 provides tables and statements for their manipulation, usually formulated in the SQL language. Among the objects at level L3 we can find data structures supporting indexing, e.g. the well-known B-trees for character strings and numbers, or R-trees for spatial data. In general, going upwards, the objects and associated operations become more complex, and some additional integrity constraints can occur.

Table 1

Description of the five-layer DBMS mapping hierarchy

Level of abstraction                        Objects                                 Auxiliary mapping data
L5  Non-procedural access                   Tables, views, rows                     Logical schema description
L4  Record-oriented, navigational access    Records, sets, hierarchies, networks    Logical and physical schema description
L3  Record and access path management       Physical records, access paths          Free space tables, DB-key translation tables
L2  Propagation control                     Segments, pages                         Buffers, page tables
L1  File management                         Files, blocks                           Directories
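As a rough illustration of the two levels singled out in the text (non-procedural access at L5 and access paths at L3), the following sketch uses Python's built-in `sqlite3` module. SQLite is chosen here only because it is readily available; it is not a system discussed in the paper, and the table, column, and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (station TEXT, taken_at TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [
        ("lake-1", "2005-11-01", 9.5),
        ("lake-1", "2005-11-02", 9.1),
        ("lake-2", "2005-11-01", 7.8),
    ],
)
# An access path (level L3): SQLite implements ordinary indexes as B-trees.
conn.execute("CREATE INDEX idx_station ON measurements (station)")

# Non-procedural access (level L5): the query states what is wanted, not how to find it.
query = "SELECT AVG(temp_c) FROM measurements WHERE station = ?"
avg_temp = conn.execute(query, ("lake-1",)).fetchone()[0]

# The query plan reveals which access path the lower layers chose for the statement.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("lake-1",)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
```

The exact wording of the plan text differs between SQLite versions, so it is best inspected rather than parsed; the point is only that the same SQL statement is mapped onto an index-based access path without the user naming it.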

The concept of a multi-layered architecture considers its ideal implementation as a set of abstract machines, where a machine of layer k + 1 is implemented via a machine of layer k. Although the number five in the architecture considered is a good compromise, in practice performance problems occur. Simplifying the complexity of the layers, on the one hand, increases the run-time overhead, on the other hand. Consequently, various ways to optimize DBMS performance have been developed, and the number of layers is reduced for some system functions.

The development of layer L5 during the last 10 years resulted in the specification of the so-called object-relational (OR) data model. Part of it is standardized in the SQL:1999 standard (ISO, 1999) and, more recently, in SQL:2003 (ISO, 2003). In the OR model, tables can have structured components in their rows, and columns can even be of a user-defined type. Spatial data, time series, and texts belong to this category. This "extensible" approach resulted in the so-called universal DBMSs in the late 1990s. The core of these engines has been extended by loosely coupled additional modules (components) for each new data type. The vendors of leading DBMSs call these components extenders, datablades, and cartridges, respectively. Spatial and text components, for example, are among the most successful.

The possibility of user-defined types has introduced a lot of serious problems into the implementation of the DBMS architecture. For some such types, e.g. video, image, text, and audio, there are standardized sets of predicates and functions for manipulating their instances, but how to integrate these types into a common framework in the DBMS architecture remains an open problem. The implementation of new access paths, such as new types of indexes, usually results in modifications of the DBMS kernel, e.g. the SQL compiler, the query optimizer, etc. Such changes make it very expensive, time consuming, and error prone to implement and test new access methods and user-defined types, e.g. for contiguous data flow from streaming data sources.

Each vendor uses a different approach to open the host system architecture to a certain degree. Oracle cartridges are restricted to secondary index integration. In the IBM DB2 extender there is a framework for indexing data of new data types that is restricted to B-trees. This means that the evaluation of only some types of queries can be improved with such indexing. In other words, new functionality is supported, but only for a limited class of user requirements.

It seems that the contribution of such software lies mainly in the cases of requirements that can be decomposed into relatively independent parts, evaluated separately in the DBMS core and in the module implementing a particular data type. In other words, these frameworks are either too complex or not flexible enough to cope with the wide range of requirements in domain-specific access methods. A truly seamless integration can hardly be achieved with most of these attempts. Consequently, current implementations of the layered DBMS architecture are not sufficient and fail in the case of universal DBMSs.

The other issue with traditional solutions is that they are available mainly for static applications. Because of the inherent space and time components of environmental data, an environmental system can be implemented on top of a spatio-temporal DBMS. Unfortunately, such database software is still under development today.

4 New technologies influencing database architectures

For data representing environmental objects, a number of new technologies are relevant. Some of them, e.g. sensor data and sensor networks, stream processing, approaching uncertain and imprecise data, knowledge discovery methods and intelligent data analysis, and wireless broadcast and mobile computing, are the same as those influencing database development in general. We discuss them briefly in the following subsections.

4.1 Sensor data and sensor networks

A sensor network is designed to transmit the data from an array of sensors to a data repository on a server. Sensor networks are based on inexpensive micro-sensor technology, which will enable most environmental objects to report their temperature, pressure, state, or location, e.g. via a global positioning system, in real time. These small battery-powered devices are placed in areas of interest, e.g. in the soil, across rain forests, or even in a glacial area, to track global warming and climate change. Each sensor node collects environmental data primarily about its immediate surroundings. These data can support applications whose main purpose is to monitor the objects' attributes. Various environmental data can be collected and analyzed to forecast an upcoming phenomenon and send prompt warnings.

The sensors are generally self-powered wireless devices with limited processing speed, storage capacity, and communication bandwidth. Such a device draws far more power when communicating than when computing. Thus, when querying the information in the network as a whole, it is often preferable to distribute as much of the computation as possible to the individual nodes. Some system architectures for environmental monitoring include a base station equipped with a (relational) DBMS, which communicates with wireless sensor networks. The base station ensures access and control for remote users. With a server behind it, a multi-layered architecture can look similar to that in Fig. 1. In fact, the network becomes a new kind of database machine whose optimal use requires operations to be pushed as close to the data as possible. In a more complicated case, sensors and/or users of the sensor networks can even be mobile.
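The idea of pushing computation to the nodes can be sketched as follows. This is an illustrative toy, not a protocol from the paper: the `SensorNode` class, its fields, and the message counts are assumptions chosen to show why sending one partial aggregate per node costs fewer transmissions than sending every raw sample.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SensorNode:
    node_id: str
    readings: List[float]  # locally buffered temperature samples

    def raw_messages(self) -> List[Tuple[str, float]]:
        # Naive strategy: transmit every sample to the base station.
        return [(self.node_id, r) for r in self.readings]

    def partial_aggregate(self) -> Tuple[float, int]:
        # In-network strategy: compute locally, transmit one (sum, count) pair.
        return (sum(self.readings), len(self.readings))


def network_average(nodes: List[SensorNode]) -> float:
    """Base station merges the partial aggregates into a global average."""
    total, count = 0.0, 0
    for node in nodes:
        s, c = node.partial_aggregate()
        total += s
        count += c
    return total / count


nodes = [
    SensorNode("n1", [10.0, 12.0, 14.0]),
    SensorNode("n2", [20.0, 22.0]),
]
messages_naive = sum(len(n.raw_messages()) for n in nodes)  # one message per sample
messages_pushed = len(nodes)                                 # one message per node
avg = network_average(nodes)
```

Averages decompose cleanly into (sum, count) pairs; aggregates such as medians do not, which is one reason query processing strategies have to be rethought for such networks.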

Sensor networks provide important data sources and create new data management requirements. For example, they do not necessarily use a simple one-way data stream over a communications network. Elements of the network architecture will make decisions about what data to pass on, such as local area summaries and filtering, in order to minimize power use while maximizing information content.

Sensor information processing raises many of the most interesting database issues in a new environment, with a new set of constraints and opportunities. Huge data sets of environmental data generated by sensors will be distributed throughout the world and can come and go dynamically. For example, the Earth Observing System (EOS) of NASA is a collection of satellites producing data regarding the atmosphere, oceans, and land: about 1/3 of a petabyte of information per year. Since terabytes of data from individual nodes will soon be the norm, new requirements on computational and data management infrastructures appear. For example, DataDirect's enterprise-class S2A 6000 Silicon Storage Appliance directly supports up to 512 workstations and 180 terabytes of storage.

From one perspective, sensor networks are similar to distributed databases, but with inherent real-time properties. One important difference is that the evaluation rate of data produced in a sensor network is much higher than typically considered in distributed DBMSs. This breaks the traditional information integration paradigm, since there is no practical way to extract and load data into a common database on each such occurrence. Also, strategies of query optimization and query processing must be redefined.

There are many examples of such networks in practice. For example, Mainwaring et al. (2002) mention experiments with environmental monitoring in the context of two wildlife habitats, Great Duck Island and James Reserve. Based on the requirements of the researchers studying these habitats, they propose a sensor network architecture for this class of applications. On a much larger scale, the development of Environmental Observations and Forecasting System (EOFS) combines real-time in situ monitoring with distribution networks that carry data to centralized processing sites. One example of this is the CORIE project cited above. The FLOODNET project (Envisense, 2004) plans to provide flood warnings in the UK.

Fig. 1. A multi-layered sensor network architecture (layers, top to bottom: sensor network server, base stations, sensor nodes).


In general, there is a need to design flexible, lightweight database abstractions that are optimized for data movement as opposed to data storage (Stonebraker and Cetintemel, 2005).

4.2 Stream processing

As mentioned in Section 4.1, sensors can produce continuous, possibly infinite streams of data. EISs based exclusively on the traditional store-and-query model cannot handle the volume and velocity of streaming data, whose values might exist only for a moment. There is a growing number of monitoring applications, e.g. for an environment, where DBMSs are used to create a near-real-time image of some critical parts of the environment. Compared with the streams of usual scientific data, which typically run at a constant speed, monitoring applications must be able to accommodate significant fluctuations in the data.

Traditional DBMSs are unsuited to deal with such streams for various reasons (Amato et al., 2004):

- sensor nodes produce and deliver data continuously, without receiving requests for that data;

- queries over collected data can be less frequent than data insertions;

- produced data often has to be processed in real time, because it can represent events that need a rapid answer;

- queries run continuously because data streams never terminate, so that they can see system conditions change during their execution;

- because of storage constraints, an entire stream cannot be stored on disk;

- because data streams are possibly infinite, only non-blocking operators can be used; and

- if the data to be processed is not available, then operators must process data only when nodes make it available.

In consequence, stream processing is not a data management task; it is a data-filtering task. New architectures, so-called data stream processing systems (DSPS), have emerged; see e.g. Carney et al. (2002). A rather restricted solution, the stream-processing engine (SPE), is an example of a new database architecture that enables the execution of queries, computations, and actions on streaming data in real time. Such an SPE should accept SQL-like queries and stream-oriented continuous queries and execute them over live event streams, outputting results in real time. In SPEs most of the data processing is done in main memory; read or write operations to storage are optional and can be handled asynchronously in many cases.
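A continuous query built from a non-blocking operator can be sketched with Python generators. This is a simplified illustration under stated assumptions, not the design of any particular SPE: the function names, the count-based window, and the threshold are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def sliding_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Non-blocking operator: emits one result per arriving value."""
    buf: Deque[float] = deque()
    running_sum = 0.0
    for value in stream:
        buf.append(value)
        running_sum += value
        if len(buf) > window:
            running_sum -= buf.popleft()  # only the window stays in memory
        yield running_sum / len(buf)


def alert_on(
    stream: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[int, float]]:
    """Continuous query: report positions where the windowed average exceeds a threshold."""
    for i, avg in enumerate(sliding_average(stream, window)):
        if avg > threshold:
            yield (i, avg)


readings = [1.0, 2.0, 3.0, 10.0, 11.0, 2.0]
alerts = list(alert_on(readings, window=2, threshold=6.0))
```

Because both functions are generators, they would work unchanged on an unbounded source: nothing waits for the end of the stream, and memory use is bounded by the window size. A blocking operator such as a full sort could not be written this way.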

For example, in a recent pilot program, StreamBase, developed by Stonebraker (StreamBase Systems Inc., 2005), should be able to analyze 140,000 messages/s, while a leading relational DBMS could handle only 900 messages/s.

4.3 Approaching uncertain and imprecise data

In addition to the data management issues of environmental data in data streams, many other problems arise. There are different sources that cause information to be uncertain: incompleteness, inconsistency, vagueness, imprecision, and error. Worboys (1998) associates these notions typically with spatial data. Incompleteness is related to totally or partly missing data; the prototypical situation of this kind is when a data set is obtained from digitizing paper maps and pieces of lines are missing. Inconsistency arises when several versions of the same object exist, due either to different time snapshots, or data sets from different sources, or different abstraction levels. Vagueness is an intrinsic property of many natural geographic features that do not have crisp or well-defined boundaries. Imprecision is due to a finite representation of spatial entities; the basic example of this kind is the regular tessellation used in raster data, where the element of the tessellation is the smallest unit that represents space. Scientific measurements have standard errors. Error is everything that is introduced by limited means of taking measurements. For example, location data for moving objects involve uncertainty in the current position.

Individual sensors are not reliable, and wireless communication is also unreliable. Thus, various approaches are used to provide a more accurate estimation of the environment. In multisensor data fusion, approaches like fuzzy sets or Dempster–Shafer evidential theory are sometimes used (Ramamritham et al., 2004). Sequences and images require approximate processing based on similarities, metrics, etc. Another source of techniques based on similarities is the information retrieval area. Considering environmental data equipped with metadata expressed as text strings, we can also take these methods into account. An excellent survey of similarity measures that are applicable to environmental data is presented in Nunez et al. (2004).
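To make the mention of Dempster–Shafer theory more concrete, here is a minimal sketch of Dempster's rule of combination for fusing two unreliable sources. The two "sensors", the hypotheses `flood` and `dry`, and the mass values are invented for illustration and do not come from the cited work.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule of combination for two mass functions."""
    combined: Mass = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc  # mass that falls on contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so that the remaining masses sum to one.
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


FLOOD, DRY = frozenset({"flood"}), frozenset({"dry"})
EITHER = FLOOD | DRY  # mass assigned to "cannot tell"

sensor_a: Mass = {FLOOD: 0.6, EITHER: 0.4}
sensor_b: Mass = {FLOOD: 0.7, DRY: 0.1, EITHER: 0.2}
fused = combine(sensor_a, sensor_b)
```

In this example both sources lean towards `flood`, so the fused belief in `flood` is higher than either source's own; the mass left on `EITHER` expresses the ignorance that remains after fusion.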

Traditional DBMSs were applied to business data processing, which typically focused on numbers and character strings. In those application areas, data elements are precise quantities like address, quantity on hand, balance, status, and delivery date. As a result, current DBMSs have no facilities for either approximate data or imprecise queries.

Huge data sets and their imperfect nature produce a number of direct consequences for computing in general. They include (Cohen, 2005):

- the notion of practical complexity must be revised, in the sense that any above-linear algorithm might be too time consuming; one may even avoid algorithms having large coefficients of linearity;

- the processed results should reflect existing data imperfections;

- the ability to perform pre-processing and use incremental algorithms will become essential approaches in reducing computing times; and

- approximate solutions may be the only resort for solving large, complex problems.
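As one example of the incremental, linear-time style of computation named in the list above, the following sketch maintains a running mean and variance in a single pass using Welford's online algorithm. The choice of this algorithm and the sample readings are ours, not Cohen's.

```python
class RunningStats:
    """Welford's online algorithm: one pass, constant memory."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
```

Each new value costs a constant amount of work and no earlier value has to be kept, so the same object can summarize a data set far larger than main memory.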

4.4 Knowledge discovery and intelligent data analysis

Environmental data often need to be analyzed in order to obtain information necessary for environmental management decisions. Environmental Decision Support Systems (EDSS) are often mentioned in this context. EIS and EDSS are major building blocks in environmental management and environmental science today. They are usually said to have certain characteristics which distinguish them from standard information systems, e.g. information complexity in time and space, or incompleteness or fuzziness of data items (Denzer, 2005). The authors of the GESCONDA project mention in Gibert et al. (2005) the high quantity of information and knowledge patterns that are implicit in large databases coming from environmental domains; the project is specially oriented to environmental databases.

DM methods are particularly suitable for this purpose. Historically, DM has focused on efficient ways to discover models of existing data sets. These models must expose some useful aspects of the data while obscuring details not useful for the intended application. In comparison to the simple forms of regularities/dependencies treated by statistical methods, DM methods can find more complex hypotheses that include both numerical and logical conditions. Algorithms have been developed by many research communities to perform such operations as classification, clustering, association-rule discovery, and summarization. These techniques are now part of mainstream products from the major DBMS vendors, and most of them are applicable in EISs.

OLAP or DW techniques are often sufficient. For example, temperature and pressure trends are required in an environment. Derivation of such information typically requires past temperatures and pressures to be stored in a database and processed along the time dimension. Multidimensional data structures are often used in the context of such applications. Data in OLAP and DW systems are processed by columns rather than by rows. Data processing also uses special indexing techniques, like bitmap indexes and various trees. Although many suitable data structures have been developed during the last 20 years, only a few of them, e.g. UB-trees and M-trees, are integrated into commercial DBMSs. Rather, specialized engines absorb them. An architecture with two engines united by a common parser occurs in practice. The classical transaction database and the DW database are stored separately and viewed as one database.
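The bitmap indexes mentioned above can be sketched in a few lines. This toy version uses arbitrary-precision Python integers as bitmaps and invented column data; production systems use compressed bitmap encodings, which are not shown here.

```python
from typing import Dict, List


def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Map each distinct value to an integer whose bit i is set if row i has that value."""
    index: Dict[str, int] = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap: int) -> List[int]:
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two columns of a tiny fact table, stored column-wise.
station = ["A", "B", "A", "C", "B", "A"]
season = ["winter", "winter", "summer", "summer", "summer", "winter"]

station_idx = build_bitmap_index(station)
season_idx = build_bitmap_index(season)

# WHERE station = 'A' AND season = 'winter' becomes a single bitwise AND.
hits = rows_of(station_idx["A"] & season_idx["winter"])
# WHERE station = 'B' OR station = 'C' becomes a bitwise OR.
b_or_c = rows_of(station_idx["B"] | station_idx["C"])
```

Bitmap indexes suit columns with few distinct values, which is typical of the dimension attributes in a data warehouse, because conjunctions and disjunctions of predicates reduce to cheap bitwise operations.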

Recent interest in combining DM technology with DBMSs requires new approaches to storing the data sets to be mined and to optimizing DM processing. New research directions include:

(1) multi-dimensional OLAP for discovering unusual patterns in stream data;

(2) mining clusters and outliers in stream data for discovering unusual patterns; and

(3) single-pass classification methods for stream DM.

4.5 Wireless broadcast and mobile computing

Data broadcast is an attractive alternative to on-demand access because it can broadcast data simultaneously to a large number of clients at a fixed cost. It is suitable for location-based services, which exhibit strong temporal and spatial locality, in that clients within a neighbourhood and a certain time period tend to seek the same kind of information (Zheng and Lee, 2005).

The data to be broadcast also include sensor data. Sensors deployed in the environment can broadcast their data periodically or when interesting events happen. Unlike in traditional computing, client devices cannot make requests to sensors for the data. Instead, client devices just listen to the broadcast channels passively. Thus, the sensors have the initiative in communication. Sensors may broadcast data periodically if they are measuring a continuous phenomenon producing environmental data, or may broadcast data only when a particular event occurs, e.g. if they are detecting whether an RFID tag² has just come into range.

Higher-level sensors in a sensor network can pre-process low-level sensor data and then broadcast this derived information to client devices. Such processing can require modified database techniques to be successful.

Since environmental data often need to be disseminated to the user in a timely manner, anytime and anywhere, a mobile environment is of increasing importance in this context. In periodic broadcast in particular, data are broadcast periodically on a wireless channel. A mobile client listens to the broadcast channel and downloads the desired data from the channel according to a query issued by the user or a stored profile of interest on the client. Of course, these networks should also be able to respond to aperiodic queries.
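The periodic broadcast model can be sketched as a server that repeats a fixed program and a client that only listens and filters. The function names, the item keys, and the notion of a "slot" are simplifying assumptions; real broadcast schemes also index the channel so that clients can sleep between relevant items.

```python
from itertools import cycle, islice
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Item = Tuple[str, float]


def periodic_broadcast(program: List[Item]) -> Iterator[Item]:
    """Server side: repeat the same broadcast program indefinitely."""
    return cycle(program)


def listen(channel: Iterable[Item], profile: Set[str], slots: int) -> Dict[str, float]:
    """Client side: passively read `slots` items and keep those matching the profile."""
    received: Dict[str, float] = {}
    for key, value in islice(channel, slots):
        if key in profile:
            received[key] = value
    return received


program = [("temp", 11.5), ("ozone", 0.04), ("river_level", 2.3)]
channel = periodic_broadcast(program)
got = listen(channel, profile={"river_level", "temp"}, slots=3)
```

The client never sends a request; the cost to the server is the same for one listener or for thousands, which is the fixed-cost property that makes broadcast attractive.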

Besides this, mobile devices introduce yet another category of application (Seltzer, 2005): caching relevant portions of a larger data set on a smaller, low-functionality device. One can think of a mobile device as a cache of a global data set. This model has attractive properties, in particular the ability to augment the local data set with entries as they are used or needed. Mobile telephony infrastructure requires similar caching capabilities to maintain communication channels to the devices. The access pattern observed in these caches is also read-mostly, and the data itself is completely transitory: it can be lost and regenerated if necessary.
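The view of a mobile device as a cache of a global data set can be sketched with a small least-recently-used cache. The eviction policy is our assumption for illustration (the text does not prescribe one), and `DeviceCache`, `fetch`, and the station keys are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class DeviceCache:
    """Small LRU cache standing in for a mobile device's local data set."""

    def __init__(self, capacity: int, fetch: Callable[[Hashable], Any]) -> None:
        self.capacity = capacity
        self.fetch = fetch  # hypothetical call to the global data set
        self.entries: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.fetch(key)  # a lost entry can always be regenerated
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


global_data = {f"station-{i}": 10.0 + i for i in range(100)}
cache = DeviceCache(capacity=2, fetch=global_data.__getitem__)
for key in ["station-1", "station-2", "station-1", "station-3", "station-2"]:
    cache.get(key)
```

Because the cached data is transitory and read-mostly, the cache needs no logging or recovery: evicting or losing an entry only costs a later re-fetch.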

We observe that location becomes a very important property of data and introduces a new dimension to data access methods. Traditional data access methods are not suitable for such computing, and new research redefines some well-known techniques, e.g. spatial queries in the mobile environment, with a particular emphasis on broadcast data.

5 Towards new database architectures

Database technology seems to be fundamental for the deployment of the technologies presented in Section 4 in the context of EISs. Attempts by database specialists to influence the development of EISs have existed for some time. For example, the Sequoia 2000 project (Stonebraker, 1994) describes a collaboration between computer scientists and environmental researchers to design a next-generation information system for managing data for global change research. The primary challenges with a database approach are flexibility without complexity, and ease of use. Moreover, a database approach brings the opportunity to link all data together on a user level, and it makes all analysis of the data easier, e.g. via database technologies like DM.

² RFID (radio-frequency identification) is the latest technology based on radio waves that is useful for precisely identifying objects.

A common view on the issues mentioned in Section 4 concerns the DBMS architecture. In fact, today's DBMSs provide a universal architecture applicable to many types of tasks; in the words of Stonebraker and Cetintemel (2005), "one size fits all". In new DBMS architectures, separate engines, rather "made to measure", are assumed, according to the requirements of various applications. Besides rather traditional applications (OLAP, data warehouses and text retrieval), other candidates for a separate engine are:

- stream processing,
- sensor networks,
- scientific databases,
- native XML databases.

We have tried to highlight some characteristics of the first three technologies with respect to their association with environmental data management.

Considering native XML databases, solutions with separate engines are popular today. Harder (2005) presents the XTC architecture (XML Transaction Controller), which proves that a native XML DBMS can be implemented along the lines of the five-layer architecture (see Table 1). A hybrid engine is also possible. To integrate relational and XML data, IBM is developing a new hybrid DB2 DBMS that works on a truly native XML store sitting side by side with DB2's relational data repository. On top of both data stores (relational and XML) sits one hybrid database engine. A similar solution is used by most vendors combining a data warehouse DBMS and a usual online transaction processing DBMS, which are united by a common parser. Such an architecture can inspire the implementation of other data types too.

Another approach evolves the original idea of DBMS extensibility. Acker et al. (2005) developed an Access Manager specification, a new programming interface to several layers of a DBMS kernel. This enables the programmer to add new data structures to the DBMS with a minimum of effort.

There is also a third approach to achieving flexibility in processing data in a database way: to produce a storage engine that is more configurable, so that it can be tuned to the requirements of individual applications (Seltzer, 2005). There are fundamentally two properties that a solution must possess to address the wide range of application needs emerging today: modularity and configurability. A modular DBMS engine must allow the developer to use or exclude major subsystems depending on whether the application needs them. The DBMS must also be configurable to its operating environment: the specific hardware, operating system, and application using it.
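To make the two properties concrete, the following sketch shows a toy key-value engine whose subsystems can be included or excluded through a configuration object. Everything here is an assumption made for illustration: the names EngineConfig and StorageEngine, and the choice of an operation log and a value index as the optional subsystems, do not describe any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical switches for optional subsystems."""
    use_log: bool = True     # keep an operation log (e.g. for recovery)
    use_index: bool = True   # keep a value -> keys index for lookups


class StorageEngine:
    """Toy key-value engine assembled from the configured subsystems."""

    def __init__(self, config: EngineConfig) -> None:
        self.rows: Dict[str, str] = {}
        # Excluded subsystems are simply not instantiated.
        self.log: Optional[List[Tuple[str, str]]] = [] if config.use_log else None
        self.index: Optional[Dict[str, Set[str]]] = {} if config.use_index else None

    def put(self, key: str, value: str) -> None:
        if self.log is not None:
            self.log.append((key, value))
        if self.index is not None:
            old = self.rows.get(key)
            if old is not None:
                self.index[old].discard(key)  # drop the stale index entry
            self.index.setdefault(value, set()).add(key)
        self.rows[key] = value

    def find(self, value: str) -> Set[str]:
        """Keys holding `value`: via the index if present, else by a scan."""
        if self.index is not None:
            return set(self.index.get(value, set()))
        return {k for k, v in self.rows.items() if v == value}
```

A sensor node with little memory might use EngineConfig(use_log=False, use_index=False), while a server would keep both subsystems; the query interface stays the same in both cases.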

6 Conclusions

Environmental data management, analysis, and communication are essential components of environmental characterization and decision making. DBMSs, the Internet, and associated Web technologies have become an integrating force for these components.

According to Selinger (2005), data research challenges for the next decade include, among other things, the following tasks:

- re-examine DBMS architecture and invent ways to scale more and better without sacrificing user-visible availability or performance,
- learn what managing content is all about and what is needed, and create new models,
- treat metadata as a first-class research topic.

We have focused mainly on the first issue. The others can also improve the accessibility and availability of environmental data. Approaching uncertain and imprecise data, as well as knowledge discovery and intelligent data analysis, requires new models and semantic annotations of the data. Some specific approaches already exist. For example, to increase environmental data quality, new information processing is emerging that preserves and retrieves the origins and processing history (that is, the lineage) of objects and processes (Bose and Frew, 2005). To ensure that the greatest use is made of environmental data, data producers should include data lineage (and authenticity information) in the metadata. On a database level, this requires more sophisticated techniques for metadata processing.
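As a rough illustration of lineage kept as metadata, each data product can record the process that created it and the products it was derived from. The structure below is our own minimal sketch (the names DataProduct, derive and lineage are hypothetical) and is not the model surveyed by Bose and Frew (2005).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataProduct:
    """A data set together with the metadata needed to trace its origin."""
    name: str
    process: str = "raw"                     # process that produced it
    inputs: Tuple["DataProduct", ...] = ()   # products it was derived from


def derive(name: str, process: str, *inputs: DataProduct) -> DataProduct:
    """Create a derived product that remembers its inputs."""
    return DataProduct(name, process, inputs)


def lineage(product: DataProduct) -> List[str]:
    """Processing history from the product back to the raw data."""
    steps = [f"{product.name} <- {product.process}"]
    for parent in product.inputs:
        steps.extend(lineage(parent))
    return steps
```

For a chain such as raw gauge readings, then outlier removal, then daily aggregation, lineage() returns one line per processing step, starting with the final product and ending with the raw data.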

Everything indicates that the development of new database technologies has and will have consequences that will affect the EISs of the future.

Acknowledgement

This research was supported in part by the National programme of research (Information society project 1ET100300419).

References

Abiteboul, S., Agrawal, R., Bernstein, P.A., Carey, M.J., Ceri, S., Croft, W.B., DeWitt, D.J., Franklin, M.J., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L.M., Halevy, A.Y., Hellerstein, J., Ioannidis, Y.E., Kersten, M.L., Pazzani, M.J., Lesk, M., Maier, D., Naughton, J.F., Schek, H.-J., Sellis, T.K., Silberschatz, A., Stonebraker, M., Snodgrass, R.T., Ullman, J.D., Weikum, G., Widom, J., Zdonik, S.B., May 2005. The Lowell Database Research self-assessment. Communications of the ACM 48 (5), 111-118.

Acker, R., Pieringer, R., Bayer, R., 2005. Towards truly extensible database systems. In: Proceedings of DEXA 2005 Conference, LNCS 3588. Springer-Verlag, pp. 596-605.

Amato, G., Caruso, A., Chessa, S., Masi, V., Urpi, A., 2004. State of the art and future directions in wireless sensor network's data management. 2004-TR-16, ISTI.

Bose, R., Frew, J., 2005. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys 37 (1), 1-28.

Bright, L., Maier, D., 2005. Deriving and managing data products in an environmental observation and forecasting system. In: Proceedings of Conference on Innovative Data Systems Research (CIDR), January 2005, pp. 162-173.

Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., Zdonik, S., 2002. Monitoring streams - a new class of data management applications. In: Proceedings of the 28th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers, pp. 215-226.

Cohen, J., 2005. Updating computer science education. Communications of the ACM 48 (6), 29-31.

Denzer, R., 2005. Generic integration of environmental decision support systems - state-of-the-art. Environmental Modelling & Software 20 (10), 1217-1223.

Envisense, 2004. FloodNet: pervasive computing in the environment. Available at: <http://envisense.org/floodnet/floodnet.htm>.

Gibert, K., Sanchez-Marre, M., Rodríguez-Roda, I., 2005. GESCONDA: an intelligent data analysis system for knowledge discovery and management in environmental databases. Environmental Modelling & Software 21 (1), 115-120.

Harder, T., Reuter, A., 1983. Concepts for implementing a centralized database management system. In: Proceedings of International Computing Symposium on Application Systems Development, March 1983. B.G. Teubner-Verlag, Nurnberg, pp. 28-104.

Harder, T., 2005. DBMS architecture - still an open problem. In: Proceedings of BTW, Karlsruhe, March 2005, pp. 2-28.

ISO, 1999. Information technology - database languages - SQL - Part 1: framework (SQL/framework). ISO/IEC 9075-1:1999.

ISO, 2003. Information technology - database languages - SQL - Part 2: foundation (SQL/foundation). ISO/IEC 9075-2:2003.

LCA Glossary. Available at: <http://www.lineadecreditoambiental.org/html/glossary.html>.

Mainwaring, A., Polastre, J., Szewczyk, R., Culler, D., 2002. Wireless sensor networks for habitat monitoring. Intel Research, Berkeley, IRB-TR-02-006.

Nunez, H., Sanchez-Marre, M., Cortes, U., Comas, J., Martínez, M., Rodríguez-Roda, I., Poch, M., 2004. A comparative study on the use of similarity measures in case-based reasoning to improve the classification of environmental system situations. Environmental Modelling & Software 19 (9), 809-819.

Ramamritham, K., Son, S.H., Dipippo, L.C., 2004. Real-time databases and data services. Real-Time Systems 28, 179-215.

Reuter, A., 2005. Databases: the integrative force in cyberspace. In: Data Management in a Connected World, LNCS 3551. Springer Verlag, pp. 3-16.

Selinger, P., 2005. Five data challenges for the next decade. In: Keynote of the Conference ICDE, held in April 2005, Tokyo, Japan.

Seltzer, M.I., 2005. Beyond relational databases. Databases 3 (3), 50-58.

Stonebraker, M., 1994. Sequoia 2000 - a reflection on the first three years. Sequoia technical report S2K-94-58, Berkeley, CA. Available at: <http://epoch.cs.berkeley.edu:8000/sequoia/techreports/s2k-93-23>.

Stonebraker, M., Cetintemel, U., 2005. "One Size Fits All": an idea whose time has come and gone. In: Proceedings of the Conference ICDE, April 2005, Tokyo, Japan, pp. 2-11.

StreamBase Systems, Inc., 2005. StreamBase 2.0. Available at: <http://www.streambase.com/index.html>.

Tomasic, A., Simon, E., 1997. Improving access to environmental data using context information. ACM SIGMOD Record 26 (1), 11-15.

Worboys, M.F., 1998. Imprecision in finite resolution spatial data. GeoInformatica 2, 257-279.

Zheng, B., Lee, D.L., May 2005. Information dissemination via wireless broadcast. Communications of the ACM 48 (5), 105-110.


and with triggering various control actions. Consequently, the needs for environmental data management have grown significantly.

Considering environmental data sets combined with, e.g., business data, emails, documentation, etc., adequate information integration mechanisms are needed. Since their beginning, databases have had an integrative role in the world of data. Reuter (2005) argues that the technological evolution of database technology makes database systems the ideal candidate for integrating all types of objects that need persistence, as well as for supporting all the different types of execution that are characteristic of the various application classes.

The most important part of each management system deals with data through querying. When users want to search and use environmental information, the following problems occur (Tomasic and Simon, 1997):

(1) Data do not exist or are insufficient; sometimes this may require synthesis or reproduction of data.

(2) Data are not referenced by data suppliers and are therefore hard to locate, or data are referenced under specific classification criteria that are domain-specific.

(3) Data are hard to access: they are either private, or too costly, or require costly pre-processing (e.g. data must be re-entered manually from paper documentation) or format translation.

(4) Accessed data sets are hard to use because they are inconsistent or incompatible; for example, long time series are accessible, but standard data collection techniques have not been applied, thereby making adjacent time series incompatible.

(5) The quality of retrieved data is hard to assess: it is often hard to compare data produced using different scientific models because of a lack of documentation about the underlying computational processes.

The database community focuses on information storage, organization, management, and access in software architectures called database management systems (DBMSs). It is always driven by new applications, technology trends, new synergies with related fields, and innovation within the field itself. Problems (1)-(5) are a natural part of today's database research and development. A natural idea is that EISs based on advanced database technologies could help to deal with these issues.

Several technological aspects influence DBMS development. Scientific data often arrive in streams. The sensor networks producing the data consist of very large numbers of low-cost devices, each of which is a data source measuring some quantity, e.g. an object's location or the ambient temperature. Processing such data is usually completely different from processing the data stored in enterprise databases. Data arrive in high-speed streams, and queries over those streams need to be processed in an online fashion to enable real-time responses. Moreover, in comparison to enterprise data, these data are uncertain or imprecise. Other aspects of such data processing include the unclear formulation of queries based on common techniques as they are used, for example, in classical databases. Often we are not able to formulate a query, e.g. in SQL, even though we believe that something interesting is hidden in our data. In such situations, a lack of semantics is apparent. To describe data semantics, metadata and its formal description are necessary.
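To illustrate what processing a query "in an online fashion" means, the sketch below maintains a moving average over the most recent readings of a stream. It keeps only a fixed-size window in memory and emits a result for every arriving value. The function name sliding_mean and the count-based window are our assumptions; stream engines typically offer richer, time-based windows.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_mean(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` readings after each arrival."""
    buf: Deque[float] = deque(maxlen=window)
    total = 0.0
    for x in stream:
        if len(buf) == window:
            total -= buf[0]   # oldest reading is about to leave the window
        buf.append(x)         # deque drops the oldest item automatically
        total += x
        yield total / len(buf)
```

The stream is read once and never stored as a whole, so the memory needed depends on the window size and not on the length of the stream.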

The data collections considered create an ideal platform for using knowledge discovery methods and/or intelligent data analysis. Online analytical processing (OLAP), data warehouses (DW), and data mining (DM) techniques can also help in this context.

The purpose of the paper is to present the main challenges influencing today's database development with respect to the processing of environmental data. First, in Section 2, we discuss properties of new data sources. In Section 3 we review the concepts of the classical centralized DBMS architecture as it has existed since the early 1980s. According to Harder and Reuter (1983), the architecture models a five-level abstraction hierarchy. Its implementation has five technological layers that allow some problems and their solutions to be separated in a relatively independent way. The main part of the paper, Section 4, presents five new technologies influencing database architectures. They include sensor data and sensor networks; stream processing; approaching uncertain and imprecise data; knowledge discovery methods and intelligent data analysis; and wireless broadcast and mobile computing. In Section 5 we argue that new DBMS architectures are needed, briefly describing some proposals and giving several examples of their occurrence in practice. In the conclusions we summarize the basic ideas given in the paper and add a number of other issues that can influence the processing of environmental data.

2 New data sources

Usual enterprise data stored in databases are structured and can be described by a so-called (database) schema. Such a schema is almost fixed or changes only rarely. This is not the case for collections of scientific and environmental data. According to Reuter (2005), the key properties of these data collections (irrespective of their many differences) are the following:

- The raw data is written once and never changes again. As a matter of fact, some scientific organizations require, for all projects they support, that any data that influences the published results of the project be kept available for an extended period of time, typically around 15 years.
- Raw data comes in as streams with high throughput (hundreds of MB/s), depending on the sensor devices. The streams have to be recorded as they come in, because in most cases there is no way of repeating the measurement.
- For the majority of applications, the raw data is not interesting. What the users need are aggregates, derived values, or, in the case of text fields, some kind of abstract of "what the text says".
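These properties suggest a simple pattern: record raw readings in an append-only store and serve users from aggregates derived on demand. The following sketch assumes integer timestamps in seconds and a mean per time bucket as the aggregate; the class RawLog and its methods are hypothetical names used only for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple


class RawLog:
    """Append-only store: readings are recorded as they arrive, never updated."""

    def __init__(self) -> None:
        self._records: List[Tuple[int, float]] = []

    def append(self, timestamp: int, value: float) -> None:
        # No update or delete operation is offered, mirroring write-once data.
        self._records.append((timestamp, value))

    def aggregate(self, bucket: int) -> Dict[int, float]:
        """Mean value per time bucket of `bucket` seconds, keyed by bucket start."""
        groups: Dict[int, List[float]] = {}
        for ts, value in self._records:
            groups.setdefault(ts // bucket * bucket, []).append(value)
        return {start: mean(values) for start, values in groups.items()}
```

The raw records stay untouched for later re-analysis, while the aggregate, which is what most users actually ask for, can be recomputed at any bucket size.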








Table 1

L5  Non-procedural access                  Tables, views, rows                    Logical schema description
L4  Record-oriented navigational access    Records, sets, hierarchies, networks   schema description
    path management                        Physical records, access paths         Free space tables, DB-key translation tables
























[Figure: Sensor Nodes]























































