Geo-spatial data mining in the analysis of a demographic database

FOCUS

M. Yasmina Santos · L. Alfredo Amaral

Published online: 26 November 2004 © Springer-Verlag 2004

Abstract Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. The approaches usually followed in the analysis of geo-spatial data with the aim of knowledge discovery are essentially characterised by the development of new algorithms, which treat the position and extension of objects mainly through the manipulation of their co-ordinates. In this paper a new approach to this process is presented, where geographic identifiers give the positional aspects of geographic data. These identifiers are manipulated using qualitative reasoning principles, which allow for the inference of new spatial relations required for the data mining step of the knowledge discovery process. The analysis of a demographic database, with the proposed principles, enabled the discovery of patterns that are hidden in the explored geo-spatial and demographic data.

Keywords Data mining · Qualitative spatial reasoning · Geo-spatial data

1 Introduction

Knowledge discovery in databases is a process that aims at the discovery of relationships within data sets. Data Mining is the central step of this process. It corresponds to the application of algorithms for identifying patterns within data. Other steps are related to incorporating prior domain knowledge and interpretation of results.

The analysis of geo-referenced databases constitutes a special case that demands a particular approach within the knowledge discovery process. Geo-referenced data sets include references to geographical objects, locations or administrative sub-divisions of a region. The geographical location and extension of these objects define implicit relationships of spatial neighbourhood. The Data Mining algorithms have to take this spatial neighbourhood into account when looking for associations among data. They must evaluate if the geographic component has any influence on the patterns that can be identified or if it is responsible for a pattern. Most of the geographical attributes normally found in organizational databases (e.g., addresses) correspond to a type of spatial information, namely qualitative, which can be described using indirect positioning systems. In systems of spatial referencing using geographic identifiers, a position is referenced with respect to a real world location defined by a real world object. This object represents a location that is identified by a geographic identifier. These geographic identifiers are very common in organizational databases, and they allow the integration of the spatial component associated with them in the process of knowledge discovery.

This paper presents an approach to the analysis of geo-referenced data with the aim of knowledge discovery, based on qualitative spatial reasoning strategies, which enable the integration of the spatial component in the knowledge discovery process. This approach, implemented in the PADRAO system, allowed the analysis of geo-referenced databases and the identification of relationships existing between the geo-spatial and non-spatial data.

The following sections include: (i) an overview of the process of knowledge discovery and the tasks usually carried out in the analysis of geo-referenced databases; (ii) a description of qualitative spatial reasoning, presenting its principles and the direction, distance and topological spatial relations, for which an integrated spatial reasoning system was constructed and made available in the Spatial Knowledge Base of the PADRAO system; (iii) a presentation of the PADRAO system describing its architecture and its implementation, achieved through the integration of several technologies; (iv) the analysis of a demographic database based on the several steps of the knowledge discovery process considered by the PADRAO system; and (v) a conclusion with some comments about the proposed approach and its main advantages.

Soft Comput (2005) 9: 374–384
DOI 10.1007/s00500-004-0417-0

M. Y. Santos · L. A. Amaral (✉)
Information Systems Department, University of Minho, Campus de Azurem, 4800–058 Guimaraes, Portugal
e-mail: {maribel, amaral}@dsi.uminho.pt
Tel.: +351-253-510-319
Fax: +351-253-510-300

2 Knowledge discovery in databases

Knowledge Discovery in Databases (KDD) is a complex process concerning the discovery of relationships and other descriptions from data. Data Mining refers to the application of algorithms used to extract patterns from data without the additional steps of the KDD process. The steps of the KDD process are data selection, data treatment, data pre-processing, data mining and interpretation of results [7, 8].

Different tasks can be performed in the knowledge discovery process and several techniques can be applied for the execution of a specific task. Among the available tasks are classification, clustering, association, estimation and summarization. KDD applications integrate a variety of Data Mining algorithms. The performance of each technique (algorithm) depends upon the task to be carried out, the quality of the available data and the objective of the discovery. The most popular Data Mining algorithms include neural networks, decision trees, association rules and genetic algorithms [12].

The main recognized advances in the area of KDD [7, 8] are related to the exploration of relational databases. However, in most organizational databases there exists one dimension of data, the geographic (associated with addresses or post-codes), the semantics of which is not used by traditional KDD systems. "Spatial Data Mining (SDM) refers to the extraction of knowledge, spatial relations, or other interesting patterns not explicitly stored in spatial databases" [12]. Several tasks can be performed in SDM, among them: spatial characterization, spatial classification, spatial association and spatial trends analysis [6, 12].

A spatial characterization corresponds to a description of the spatial and non-spatial properties of a selected set of objects. This task is achieved by analyzing not only the properties of the target objects, but also the properties of their neighbours. In a characterization, the relative frequency of incidence of a property in the selected objects, and their neighbours, is different from the relative frequency of the same property verified in the remainder of the database [6]. For example, the incidence of a particular disease can be higher in a set of regions closest to, or holding, a specific industrial complex, showing that a possible cause-effect relationship exists between the disease and the industry pollution.

Spatial classification aims to classify spatial objects based on the spatial and non-spatial features of these objects in a database. The result of the classification, a set of rules that divides the data into several classes, can be used to get a better understanding of the relationships among the objects in the database and to predict characteristics of new objects [12, 13]. For example, regions can be classified into rich or poor according to the average family income or any other relevant attribute present in the database.

Spatial association permits the identification of spatially-related association rules from a set of data. An association rule shows the frequently occurring patterns of a set of data items in a database. A spatial association rule is a rule of the form "$X \rightarrow Y\ (s\%, c\%)$", where $X$ and $Y$ are sets of spatial and non-spatial predicates. In an association rule, $s$ represents the support of the rule, the probability that $X$ and $Y$ exist together in the data items analyzed, while $c$ indicates the confidence of the rule, i.e. the probability that $Y$ is true under the condition of $X$. For example, the spatial association rule "$is\_a(x, House) \wedge close\_to(x, Beach) \rightarrow is\_expensive(x)$" states that houses which are close to the beach are expensive [12, 13].
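The support and confidence measures can be illustrated with a minimal Python sketch. The function name `support_confidence`, the record layout and the toy data are assumptions made only for this illustration; they are not taken from the PADRAO system or from the cited works.

```python
from typing import Callable, Iterable, Mapping, Tuple

Record = Mapping[str, object]
Predicate = Callable[[Record], bool]


def support_confidence(
    records: Iterable[Record], antecedent: Predicate, consequent: Predicate
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    records = list(records)
    n_total = len(records)
    n_x = sum(1 for r in records if antecedent(r))
    n_xy = sum(1 for r in records if antecedent(r) and consequent(r))
    # support: fraction of all records where X and Y hold together
    support = n_xy / n_total if n_total else 0.0
    # confidence: fraction of the records satisfying X that also satisfy Y
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Toy records, invented for illustration only.
houses = [
    {"type": "House", "close_to_beach": True, "expensive": True},
    {"type": "House", "close_to_beach": True, "expensive": True},
    {"type": "House", "close_to_beach": True, "expensive": False},
    {"type": "House", "close_to_beach": False, "expensive": False},
]


def is_house_close_to_beach(r: Record) -> bool:
    return r["type"] == "House" and bool(r["close_to_beach"])


def is_expensive(r: Record) -> bool:
    return bool(r["expensive"])


s, c = support_confidence(houses, is_house_close_to_beach, is_expensive)
print(f"support={s:.2f}, confidence={c:.2f}")
```

With these toy records the rule holds in two of the four records, and in two of the three records that satisfy the antecedent.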

A spatial trend [6] describes a regular change of one or more non-spatial attributes when moving away from a particular spatial object. Spatial trend analysis allows for the detection of changes and trends along a spatial dimension. Examples of spatial trends are the changes in the economic situation of a population when moving away from the centre of a city or the trend of change of the climate with the increasing distance from the ocean [12].

After the presentation of some of the most popular tasks associated with the analysis of spatial data with the aim of knowledge discovery, this paper describes a new approach to this process. This approach integrates qualitative principles in the spatial reasoning system used in the knowledge discovery process, allowing the use of traditional KDD systems (and their generic Data Mining algorithms) in SDM.

3 Qualitative spatial reasoning

Human beings use qualitative identifiers extensively to simplify reality and to perform spatial reasoning more efficiently. Spatial reasoning is the process by which information about objects in space and their relationships is gathered through measurement, observation or inference and used to arrive at valid conclusions regarding the relationships of the objects [19]. Qualitative spatial reasoning [1] is based on the manipulation of qualitative spatial relations, for which composition¹ tables facilitate reasoning, thereby allowing the inference of new spatial knowledge.

¹ Geographic Information Systems allow for the storage of geographic information and enable users to request information about geographic phenomena. If the requested spatial relation is not explicitly stored in databases, it must be inferred from the information available. The inference process requires searching relations that can form an inference path between the two objects for which the relation is requested [15]. The composition operation combines two contiguous paths in order to infer a third spatial relation. A composition table integrates a set of inference rules used to identify the result of a specific composition operation.

Spatial relations have been classified into several types [10], including direction relations [11], distance relations [14] and topological relations [5]. Qualitative spatial relations are specified by using a small set of symbols, like North, close, etc., and are manipulated through a set of inference rules.

The inference of new spatial relations can be achieved using the defined qualitative rules, which are compiled into a composition table. These rules allow for the manipulation of the qualitative identifiers adopted. For example, knowing the facts (A North, very far from B) and (B Northeast, very close to C), it is possible, by consulting the composition table for integrated direction and distance spatial reasoning [15], to infer the relationship that exists between A and C, that is (A North, very far from C).
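A composition table can be pictured as a lookup from a pair of known relations to the inferred one. The sketch below is only an illustration: the names `COMPOSITION_TABLE` and `compose` are hypothetical, and the table holds just the single entry quoted in the example above, not the full table of [15].

```python
from typing import Dict, Optional, Tuple

# A qualitative relation is a (direction, distance) pair.
Relation = Tuple[str, str]

# Only the entry stated in the text is included; a real table would be complete.
COMPOSITION_TABLE: Dict[Tuple[Relation, Relation], Relation] = {
    (("North", "very far"), ("Northeast", "very close")): ("North", "very far"),
}


def compose(r_ab: Relation, r_bc: Relation) -> Optional[Relation]:
    """Infer the relation A-C from A-B and B-C, or None if no rule is stored."""
    return COMPOSITION_TABLE.get((r_ab, r_bc))


result = compose(("North", "very far"), ("Northeast", "very close"))
print(result)
```

Returning `None` for a missing entry is a design choice of this sketch, so that an incomplete table is never mistaken for a valid inference.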

3.1 Direction spatial relations

Direction relations describe where objects are placed relative to each other. Three elements are needed to establish an orientation: two objects and a fixed point of reference (usually the North Pole) [10, 11]. Cardinal directions can be expressed using numerical values specifying degrees (0°, 45°, ...) or using qualitative values or symbols, such as North or South, which have an associated acceptance region. The regions of acceptance for qualitative directions can be obtained by projections (also known as half-planes) or by cone-shaped regions (Fig. 1).

A characteristic of the cone-shaped system is that the region of acceptance increases with distance, which makes it suitable for the definition of direction relations between extended objects² [19]. It also allows for the definition of finer resolutions, thus permitting the use of eight or sixteen different qualitative directions. This model uses triangular acceptance areas (Fig. 2) that are drawn from the centroid of the reference object towards the primary object (in the spatial relation A North B, B represents the reference object, while A constitutes the primary object).
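As an illustration of the cone-shaped system with eight directions, the following Python sketch assigns a qualitative direction from two centroids. It assumes planar coordinates with the y axis pointing North and eight equal cones of 45° centred on the compass directions; the function name `cone_direction` and the handling of points that fall exactly on a boundary are choices of this sketch, not details taken from the cited works.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

# Eight qualitative directions, listed clockwise starting at North.
DIRECTIONS = [
    "North", "Northeast", "East", "Southeast",
    "South", "Southwest", "West", "Northwest",
]


def cone_direction(primary: Point, reference: Point) -> str:
    """Direction of `primary` as seen from `reference` (e.g. 'A North B')."""
    dx = primary[0] - reference[0]
    dy = primary[1] - reference[1]
    if dx == 0 and dy == 0:
        raise ValueError("direction is undefined for coincident points")
    # Bearing in degrees, measured clockwise from North.
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    # Shift by half a cone so that North covers [-22.5, 22.5).
    sector = int(((bearing + 22.5) % 360.0) // 45.0)
    return DIRECTIONS[sector]


print(cone_direction((0.0, 10.0), (0.0, 0.0)))
print(cone_direction((-10.0, 0.0), (0.0, 0.0)))
```

Because each cone widens with distance from the reference centroid, the same 45° sector covers a larger area for far objects, which is the property the text highlights.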

3.2 Distance spatial relations

Distances are quantitative values determined through measurements or calculated from known co-ordinates of two objects in some reference system. The most frequently used definition of distance is based on Euclidean geometry and Cartesian coordinates. In a two-dimensional Cartesian system, it corresponds to the length of the shortest possible path (a straight line) between two objects, which is also known as the Euclidean distance [15]. Usually a metric quantity is mapped onto some qualitative indicator such as very close or far for human common-sense reasoning [14].

Qualitative distances must correspond to a range of quantitative values specified by an interval and they should be ordered so that comparisons are possible. The qualitative distances very close (vc), close (c), far (f) and very far (vf) intuitively describe distances from the nearest to the furthest. An order relationship exists among these relations, where a lower order (vc) relates to shorter quantitative distances and a higher order (vf) relates to longer quantitative distances [15]. The length of each successive qualitative distance, in terms of quantitative values, should be greater than or equal to the length of the previous one (Fig. 3).

3.3 Topological spatial relations

Topological relations are those relationships that are invariant under continuous transformations of space such as rotation or scaling. There are eight topological relations that can exist between two planar regions without holes: disjoint, contains, inside, equal, meet, covers, covered by and overlap (Fig. 4). These relations can be defined considering intersections between the two regions, their boundaries and their complements [5].

Fig. 1 Direction relations: projection and cone-shaped systems

Fig. 2 Triangular model

Fig. 3 Qualitative distances

² Extended objects are not point-like; they are objects whose dimension is relevant [10]. In this work, extended objects are geometrically represented by a polygon, indicating that their position and extension in space are relevant.

In some exceptional cases, the geographic space cannot be characterized, in topological terms, with reference to the eight topological primitives presented above. One of these cases is related to application domains, like that described in this paper, in which the geographic regions addressed are administrative subdivisions. Administrative subdivisions, represented in this work by full planar graphs³, can only be related through the topological primitives disjoint, meet and contains (and the corresponding inverse inside), since they cannot have any kind of overlapping. The topological primitives used in this paper are disjoint and meet, since the implemented qualitative inference process only considers regions at the same geographic hierarchical level.

3.4 Integrated spatial reasoning

Reasoning about qualitative directions necessarily involves integrated spatial reasoning about qualitative distances and directions. Particularly in objects with extension, the size and shape of objects, and the distance between them, influence the directions. One of the ways to determine the direction and distance⁴ between regions is calculating them for the centroids of the regions. The extension of the geographic entities is somehow implicit in the topological primitive used to characterise their relations.

3.4.1 Integration of direction and distance

An example of integrated spatial reasoning about qualitative distances and directions is as follows. The facts A is very far from B and B is very far from C do not facilitate the inference of the relationship that exists between A and C: A can be very close or close to C, or A may be far or very far from C, depending on the orientation between B and C.

For the integration of qualitative distances and directions the adoption of a set of identifiers is required, which allows for the identification of the considered directions and distances and their respective intervals of validity. Hong [15] analyzed some possible combinations for the number of identifiers and the geometric patterns that should characterize the distance intervals. The localization system suggested by Hong is based on eight symbols for direction relations (North, Northeast, East, Southeast, South, Southwest, West, Northwest) and four symbols for the identification of the distance relations (very close, close, far and very far).

The definition of the validity interval for each distance identifier must obey some rules [15]. In these systems there should exist a constant ratio ($ratio = length(dist_i) / length(dist_{i-1})$) between the lengths of two neighboring intervals. The presented simulated intervals allow for the definition of new distance intervals by magnification of the original intervals. For example, the set of values for ratio 4⁵ can be increased by a factor of 10, supplying the values $dist_0$ $(0, 10]$, $dist_1$ $(10, 50]$, $dist_2$ $(50, 210]$ and $dist_3$ $(210, 850]$. Since the same scale magnifies all intervals and quantitative distance relations, the qualitative compositions will remain the same, regardless of the scaled value.

It is important to know that the number of distance symbols used and the ratio between the quantitative values addressed by each interval play an important role in the robustness of the final system, i.e. in the validity of the composition table for the inference of new spatial relations [15].
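The constant-ratio rule can be checked with a short Python sketch that rebuilds the intervals quoted above (first interval of length 10, ratio 4) and maps a quantitative distance onto a qualitative symbol. The function names and the pairing of the first interval with very close through the last with very far are assumptions of this sketch, based on the ordering from nearest to furthest described earlier.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # half-open on the left: (lower, upper]

LABELS = ["very close", "close", "far", "very far"]


def distance_intervals(first_length: float, ratio: float, count: int) -> List[Interval]:
    """Build `count` contiguous intervals whose lengths grow by a constant ratio."""
    intervals: List[Interval] = []
    lower, length = 0.0, float(first_length)
    for _ in range(count):
        upper = lower + length
        intervals.append((lower, upper))
        lower, length = upper, length * ratio
    return intervals


def qualitative_distance(
    d: float, intervals: Sequence[Interval], labels: Sequence[str] = LABELS
) -> Optional[str]:
    """Return the symbol whose interval (lower, upper] contains d, else None."""
    for (lower, upper), label in zip(intervals, labels):
        if lower < d <= upper:
            return label
    return None


intervals = distance_intervals(first_length=10, ratio=4, count=4)
print(intervals)
print(qualitative_distance(120, intervals))
```

Multiplying `first_length` by any factor rescales every interval by the same factor, which is why the qualitative compositions are unaffected by the magnification.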

3.4.2 Integration of direction and topology

The relative position of two objects in the bi-dimensional space can be achieved through the dimension and orientation of the objects. Looking at each of these characteristics separately implies two classes of spatial relations: topological, which ignores orientations in space; and direction, which ignores the extension of the objects.

The integration of these two kinds of spatial relations enables the definition of a system for qualitative spatial reasoning that describes the relative position existing between the objects and how their limits (frontiers) are related.

Sharma [19] integrated direction and topological spatial relations using the principles of qualitative temporal reasoning defined by Allen [2]. The approach undertaken by Sharma was possible through the adaptation of the temporal principles to the spatial domain. Four composition tables were constructed [19], allowing for the inference of new spatial relations. Looking at these tables and knowing the facts (A Northeast, disjoint B) and (B Northeast, meet C), it is possible to infer that (A Northeast, disjoint C).

Fig. 4 Topological relations

³ The topology of a full planar graph refers to a planar graph that integrates regions completely covering the plane without any gap or overlap. Regions are topologically represented by faces, which are defined without holes [3].

⁴ Defining distances between regions is a complex task, since the size of each object plays an important role in determining the possible distances. Sharma [23] enumerates some possible ways to define distances between regions: i) taking the distance between the centroids of the two regions; ii) determining the shortest distance between the two regions; or iii) determining the furthest distance between the two regions.

⁵ Other validity intervals, for different ratios, can be found in [15].

3.4.3 Integration of direction, distance and topology

With the integration of direction and distance spatial relations a set of inference rules was obtained [15]. These rules present a unique pair (direction, distance) as outcome, with the exception of the result of the composition of pairs with opposite directions and equal qualitative distances. In the integration of direction and topological spatial relations some improvements can be achieved, since several inference rules present as the result a set of outcomes.

Looking at the work developed by Hong and Sharma it was realized that the integration of the three types of spatial relations, direction, distance and topology, would lead to more accurate composition tables.

Since Hong adopted a cone-shaped system in the definition of the direction relations, and Sharma used a projection-based system for the same task, the integration of the three types of spatial relations was preceded by the adaptation⁶ of the principles used by Sharma and the construction of new composition tables for the integration of direction and topology [17, 18].

After the identification of the composition tables that integrate direction and topology under the principles of the cone-shaped system, it was possible to integrate these tables with the composition table proposed by Hong [15], with respect to direction and distance. This step was preceded by a detailed analysis of the application domain in which the system will be used: a composition of regions that represent administrative subdivisions covering all the territory considered, without any gap or overlap [17]. Concerning the distance spatial relation, it was defined that the qualitative distance very close is restricted to adjacent regions. When the qualitative distance is close the regions may be, or may not be, adjacent. The far and very far qualitative distances can only exist between regions that are disjoint from each other.
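The constraints between distance and topology stated above can be written down as a small consistency check. In this sketch, adjacency is read as the topological primitive meet, which is an interpretation made here; the names `ALLOWED_TOPOLOGY` and `is_consistent` are hypothetical.

```python
from typing import Dict, Set

# For each qualitative distance, the topological primitives it may co-occur with.
ALLOWED_TOPOLOGY: Dict[str, Set[str]] = {
    "very close": {"meet"},          # restricted to adjacent regions
    "close": {"meet", "disjoint"},   # regions may or may not be adjacent
    "far": {"disjoint"},             # only between disjoint regions
    "very far": {"disjoint"},        # only between disjoint regions
}


def is_consistent(distance: str, topology: str) -> bool:
    """Check whether a (distance, topology) pair respects the stated constraints."""
    return topology in ALLOWED_TOPOLOGY[distance]


print(is_consistent("very close", "meet"))
print(is_consistent("far", "meet"))
```

A check of this kind could be used to discard candidate outcomes that combine, for instance, far with meet while the tables are being merged.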

The basic assumption for the integration process was that the outcome direction in the integration of direction and distance is the same outcome direction in the integration of direction and topology, or it belongs to the set of possible directions inferred by the last one. The direction that guides the integration process is the direction suggested by the composition table of direction and distance (it is more accurate since it considers the distance existing between the objects).

The integration process was undertaken enabling the construction of a composition table that allows integrated spatial reasoning with direction, distance and topological spatial relations [17, 18]. The final composition table is represented with the graphical symbols expressed in Fig. 5 (where each circle represents a specific qualitative distance, and the point the direction between the objects). Due to its great size, Fig. 6 shows an extract of this table (the global table can be found in [17]). For example, the composition of (A North, close, disjoint B) with (B Northeast, very close, meet C) has as result (A North, close, disjoint C) (this example is marked in Fig. 6 with two traced arrows).

In the evaluation of the composition table constructed it was realized that the size of the regions influenced (sometimes negatively) the results achieved. Qualitative reasoning with administrative subdivisions is a difficult task, which is influenced not only by the irregular limits of the regions but also by their size. As can be noted in Fig. 7, if the dimension of A is smaller than the dimension of B, and the dimension of B is smaller than the dimension of C, then the inference result must be A Northeast C. But if the dimension of A is greater than the dimension of B and the dimension of B is smaller than the dimension of C, then the inference result must be A North C. A detailed analysis of these situations was undertaken⁷, allowing the identification of several rules that integrate the dimension of the regions in the qualitative reasoning process of the PADRAO system. Through this process, the reasoning process was improved, and more accurate inferences were obtained.

Fig. 5 Graphical representation of direction, distance and topological spatial relations

Fig. 6 An extract of the composition table for direction, distance and topological relations

⁶ Since the system will be used with administrative subdivisions, the orientation between the several regions is calculated according to the position of the respective centroids.

The performance of the qualitative reasoning system was evaluated [17]. The approach followed in this performance test was to compare the spatial relations obtained through the qualitative inference process with the spatial relations obtained by quantitative methods. For the three Districts of Portugal analysed, the results achieved were, in the worst scenario, exact⁸ for 75% of the inferences obtained in Districts with larger differences between the dimensions of their regions (two of the analyzed Districts). For the Braga District, a District that integrates regions with homogeneous dimensions, the inferences obtained were 88% exact for direction and 81% exact for distance. For topology, the inferences were in all cases 100% exact. The approximate inferences occurred in regions that have parts of their territory in more than one acceptance area for the direction relation. For these cases, the centroid of the region is sometimes positioned in one acceptance area, although the region has parts of its territory in other acceptance areas. Another situation observed concerns centroids positioned on the line that divides the acceptance areas, which makes the identification of the direction between the regions, and consequently the qualitative reasoning process, even more difficult.
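The comparison described above amounts to counting how often the inferred relation equals the one computed quantitatively. A minimal Python sketch of that measure follows; the function name `exact_rate`, the region names and the toy relations are invented for illustration and do not reproduce the data evaluated in [17].

```python
from typing import Dict, Hashable, Tuple

Pair = Tuple[str, str]


def exact_rate(inferred: Dict[Pair, Hashable], reference: Dict[Pair, Hashable]) -> float:
    """Fraction of region pairs whose inferred relation equals the reference one."""
    pairs = inferred.keys() & reference.keys()
    if not pairs:
        raise ValueError("no region pairs in common")
    exact = sum(1 for p in pairs if inferred[p] == reference[p])
    return exact / len(pairs)


# Toy direction relations for five hypothetical region pairs.
qualitative = {
    ("R1", "R2"): "North", ("R1", "R3"): "Northeast", ("R2", "R3"): "East",
    ("R2", "R4"): "South", ("R3", "R4"): "Southwest",
}
quantitative = {
    ("R1", "R2"): "North", ("R1", "R3"): "North", ("R2", "R3"): "East",
    ("R2", "R4"): "South", ("R3", "R4"): "Southwest",
}

rate = exact_rate(qualitative, quantitative)
print(f"{rate:.0%} exact")
```

Inferences that differ from the reference would be the ones counted as approximate in the terminology of the text.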

After the evaluation of the qualitative reasoning system implemented and the analysis of the inferences obtained, which provided a good approximation to reality, the system is used next in the knowledge discovery process.

4 Analysis of a demographic database

PADRAO [17, 18] is a system for knowledge discovery in geo-referenced databases based on qualitative spatial reasoning. This section presents its architecture, gives some technical details about its implementation and uses the system in the analysis of a demographic database.

4.1 The PADRAO system

The architecture of PADRAO (Fig. 8) aggregates three main components: Knowledge and Data Repository, Data Analysis and Results Visualization. The Knowledge and Data Repository component stores the data and knowledge needed in the knowledge discovery process. This process is implemented in the Data Analysis component, which allows for the discovery of patterns or other relationships implicit in the analyzed geo-spatial and non-spatial data. The discovered patterns can be visualized in a map using the Results Visualization component. These components are described below.

The Knowledge and Data Repository component integrates three central databases:

1. A Geographic Database (GDB) constructed under the principles established by the European Committee for Normalization in the CEN TC 287 pre-standard for Geographic Information. Following the pre-standard recommendations it was possible to implement a GDB in which the positional aspects of geographic data are provided by a geographic identifiers system [4]. This system characterizes the administrative subdivisions of Portugal at the municipality and district level. It also includes a geographic gazetteer containing the several geographic identifiers used and the concept hierarchies existing between them. The geographic identifiers system was integrated with a spatial schema [3] allowing for the definition of the direction, distance and topological spatial relations that exist between adjacent regions at the Municipality level.

2. A Spatial Knowledge Base (SKB) that stores the qualitative rules needed in the inference of new spatial relations. The knowledge available in this database aggregates the constructed composition table (integrating direction, distance and topological spatial relations), the set of identifiers used, and the several rules that incorporate the dimension of the regions in the reasoning process. This knowledge base is used in conjunction with the GDB in the inference of unknown spatial relations.

3. A non-Geographic Database (nGDB) that is integrated with the GDB and analyzed in the Data Analysis component. This procedure enables the discovery of implicit relationships that exist between the geo-spatial and non-spatial data analyzed.

Fig. 7 Influence of the dimension in the inference result

⁷ Another frequent issue in spatial analysis is related to aggregation techniques, which are susceptible to the Modifiable Areal Unit Problem (MAUP). The MAUP is a potential source of error that can affect spatial studies based on aggregate data sources [9].

⁸ In this work, an inference is considered exact if the result achieved with the corresponding qualitative rule is the same as if the data was translated to quantitative information and manipulated through analytical functions. Otherwise, it is considered approximate.

The Data Analysis component is characterized by six main steps: the five steps presented above for the knowledge discovery process plus the Geo-Spatial Information Processing step. This step verifies if the geo-spatial information needed is available in the GDB. In many situations the spatial relations are implicit due to the properties of the spatial schema implemented. In those cases, and to ensure that all geo-spatial knowledge is available for the data mining algorithms, the implicit relations are transformed into explicit relations through the inference rules stored in the SKB.
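The idea of turning implicit relations into explicit ones can be sketched as the repeated application of a composition rule until no new relation appears. The Python code below is an illustrative assumption about how such a step might look, not the PADRAO implementation: `make_explicit` and `RULES` are hypothetical names, and the rule set holds only the single composition quoted earlier in the text (North, close, disjoint composed with Northeast, very close, meet).

```python
from typing import Callable, Dict, Optional, Tuple

Relation = Tuple[str, str, str]  # (direction, distance, topology)
Pair = Tuple[str, str]

# Only the composition stated in the text; a real knowledge base would be complete.
RULES: Dict[Tuple[Relation, Relation], Relation] = {
    (("North", "close", "disjoint"), ("Northeast", "very close", "meet")):
        ("North", "close", "disjoint"),
}


def compose(r_ab: Relation, r_bc: Relation) -> Optional[Relation]:
    return RULES.get((r_ab, r_bc))


def make_explicit(
    known: Dict[Pair, Relation],
    compose_fn: Callable[[Relation, Relation], Optional[Relation]],
) -> Dict[Pair, Relation]:
    """Add every relation A-C that can be inferred from stored A-B and B-C."""
    relations = dict(known)
    changed = True
    while changed:  # repeat until no new relation is inferred
        changed = False
        for (a, b), r_ab in list(relations.items()):
            for (b2, c), r_bc in list(relations.items()):
                if b != b2 or a == c or (a, c) in relations:
                    continue
                r_ac = compose_fn(r_ab, r_bc)
                if r_ac is not None:
                    relations[(a, c)] = r_ac
                    changed = True
    return relations


stored = {
    ("A", "B"): ("North", "close", "disjoint"),
    ("B", "C"): ("Northeast", "very close", "meet"),
}
explicit = make_explicit(stored, compose)
print(explicit[("A", "C")])
```

The rules that account for the dimension of the regions, which the SKB also stores, are not modelled in this sketch.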

The Results Visualization component is responsible for the management of the discovered patterns and their visualization in a map (if required by the user and when the geometry⁹ of the analyzed region is available). For that purpose, PADRAO uses a Geographic Information System (GIS), which integrates the discovered patterns with the geometry of the region. This component aggregates two main databases:

1. The Patterns Database (PDB) that stores all relevant discoveries. In this database each discovery is catalogued and associated with the set of rules that represents the discoveries made in a given data mining task.

2. A Cartographic Database (CDB) containing the cartography of the region. It aggregates a set of points, lines and polygons with the geometry of the geographical objects.

PADRAO was implemented using the relational database system Microsoft Access, the knowledge discovery tool Clementine [20], and Geomedia Professional [16], the GIS used for the graphical representation of results.

The databases that integrate the Knowledge and Data Repository and the Results Visualization components were implemented in Access. The data stored in them are available to the Data Analysis component or from it, through ODBC (Open Database Connectivity) connections.

Clementine is a data mining toolkit based on visual programming¹⁰, which includes machine learning technologies like rule induction, neural networks, association rules discovery and clustering. The knowledge discovery process is defined in Clementine through the construction of a stream in which each operation on data is represented by a node.

The Data Analysis component of PADRAO is based on the construction of several streams that implement the knowledge discovery process. The several models obtained in the data mining phase represent knowledge about the analyzed data and can be saved or reused in other streams. In PADRAO, these models can be exported through an ODBC connection to the PDB. The integration of the PDB with the CDB allows the visualization of the rules explicit in the models in a map. The visualization is achieved through the VisualPadrão application, a module implemented in Visual Basic. VisualPadrão manipulates the library of objects available in Geomedia. This application was integrated in the Clementine workspace using a specification file, i.e. a mechanism provided by the Clementine system that allows for the integration of new capabilities in its environment. This approach provides an integrated workspace in which all tasks associated with the knowledge discovery process can be executed.

Fig. 8 The architecture of PADRAO

⁹ The geometry is not required in the knowledge discovery process, since the manipulation of the geographic information is undertaken by a qualitative approach (as described in previous sections).

¹⁰ Visual programming involves placing and manipulating icons representing processing nodes.

4.2 The knowledge discovery process

The nGDB of the Knowledge and Data Repository component used in this application of PADRAO is a Demographic¹¹ Database (DDB) that stores the parish registers dated between 1690 and 1990 collected for the Aveiro district (a district of Portugal). This database integrates information such as individual number, name, birth date, birthplace, death date, death place, occupation, number of children, number of marriages, etc. The several attributes related to locations (places) allow for the integration of the DDB with the GDB, providing the geo-spatial data needed in the knowledge discovery process.

Fig. 9 presents some records of the main table of the database, the Individual table. It shows missing data fields that must be treated, and attributes with continuous values that must be transformed into discrete values. At the geographical level, the demographic data will be generalised at the Municipality level.

Table 1 systematises the concept hierarchies used and the classes defined for the transformation of attributes with continuous values into attributes with discrete values.

The analysis of the DDB requires the definition of a data mining objective. For the available data, and according to the interests of the researchers in the demographic area, the objective was to characterise the age at death and the number of children attributes in the district under analysis. The realisation of this objective begins with the selection, treatment and pre-processing of the relevant data. After the data selection step, all missing data fields were marked as unknown ('?') and the continuous attributes were transformed into discrete values according to the classes previously defined. Fig. 10 presents the stream that executes these three steps. The first node (DB AVR : Individual) makes the data available through an ODBC connection. The filter node then selects the relevant attributes and the filler node replaces all missing data fields with a '?'. The Children class node assigns the corresponding class to the number of Children attribute; the Age node calculates the age of the individuals, based on the attributes Birth date and Death date. The result is used by the Age class node, which assigns to each value the respective class. The figure also presents part of the CLEM¹² code used in two of the nodes, the Children class and Age class nodes, for the identification of the corresponding classes.
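The pre-processing steps described above can be illustrated outside Clementine with a minimal Python sketch. It is not the CLEM code of Fig. 10. The function and field names are hypothetical, the open-ended top classes are written here as 46+ and 7+, the century is taken from the death date, and values outside the ranges of Table 1 are treated as unknown; all of these are assumptions made for illustration.

```python
from datetime import date

MISSING = "?"  # marker for unknown values, as used in the text


def century(year):
    """Generalise a year to its century number (e.g. 1690 -> 17)."""
    return MISSING if year is None else year // 100 + 1


def age_at_death(birth, death):
    """Whole years between two dates; None if either date is missing."""
    if birth is None or death is None:
        return None
    years = death.year - birth.year
    if (death.month, death.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the year of death
    return years


def age_class(age):
    """Map an age in years to the Age classes of Table 1."""
    if age is None or age < 0 or age > 110:
        return MISSING  # assumption: out-of-range values count as unknown
    if age <= 12:
        return "0-12"
    if age <= 25:
        return "13-25"
    if age <= 45:
        return "26-45"
    return "46+"


def children_class(n):
    """Map a number of children to the classes of Table 1."""
    if n is None or n < 0 or n > 16:
        return MISSING  # assumption: out-of-range values count as unknown
    if n == 0:
        return "0"
    if n <= 3:
        return "1-3"
    if n <= 6:
        return "4-6"
    return "7+"


def preprocess(record):
    """Fill missing fields with '?' and discretise the continuous attributes."""
    birth = record.get("birth_date")
    death = record.get("death_date")
    return {
        "century": century(death.year if death else None),
        "age_class": age_class(age_at_death(birth, death)),
        "children_class": children_class(record.get("children")),
    }
```

Each function plays the role of one node of the stream: `age_at_death` corresponds to the Age node, `age_class` and `children_class` to the two class nodes, and the `MISSING` marker to the filler node.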

The next step is concerned with the geo-spatial information processing. As the GDB only stores spatial relations for adjacent regions, and as it is necessary to verify the geographical distribution of the age at death and number of children attributes, all the other relations that exist between non-adjacent regions and are needed in the data mining step must be inferred. In Clementine, a rule induction¹³ algorithm is able to learn the inference rules available in the composition table, stored in the SKB, which allows for integrated qualitative spatial reasoning. This process enables the inference of new spatial relations.

Fig. 9 An extract of the Individual table

11 An overview of the relevance of spatial analysis to demographic research can be found in Weeks [21]. The author points out a general framework for the application of spatial analysis to demographic research, discusses the kinds of data that are required for spatial demographic analysis and summarizes some of the work undertaken in the analysis of demographic data in Egypt.

12 CLEM, Clementine Language for Expression Manipulation, is a language for manipulating the data that flows along the Clementine streams.

The models created, nodes infDir, infDis and infTop, can now be used in the inference process. With these models, and as shown in Fig. 11 (the models have the shape of a diamond), it is possible to infer the unknown spatial relationships existing among the Municipalities of the Aveiro District. The spatial relations for adjacent regions stored in the GDB are gathered through the source node (GDB:geoAveiro) of the stream and combined (node Inflection) in order to obtain new associations between regions. The spatial relations existing among these new associations are identified by the models infDir, infDis and infTop. After the inferential process, the knowledge obtained is recorded in the GDB (output node GDB:geoAveiro). In the stream of Fig. 11, the super nodes SuperNodeDir1 and SuperNodeDir3 are responsible for the integration of the dimension of the regions in the reasoning process. In this process, it is checked whether the several inferences obtained for a particular region agree, independently of the regions composed. Several paths can be followed in order to infer a specific spatial relation. For example, knowing the facts A North B, A East D, B East C and D North C, the direction relation existing between A and C may be obtained by composing A North B with B East C or by combining A East D with D North C. If several compositions can be performed and the results obtained from each one do not match, then the super node VerInferences excludes those results from the set of accepted ones.
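The consistency check over alternative composition paths can be sketched in Python as follows. This is an illustration only, not the Clementine stream of Fig. 11: the `COMPOSITION` dictionary is a hypothetical, direction-only fragment of a composition table (the table used in PADRAO also integrates distance and topology), and the function name `infer_direction` is an assumption.

```python
# Hypothetical fragment of a direction composition table:
# (relation of A to B, relation of B to C) -> relation of A to C.
COMPOSITION = {
    ("N", "E"): "NE",
    ("E", "N"): "NE",
    ("N", "N"): "N",
    ("E", "E"): "E",
}


def infer_direction(facts, a, c):
    """Infer the direction of a relative to c through one intermediate region.

    facts maps a pair (x, y) to the known direction of x relative to y.
    Returns the inferred relation, or None when no path exists or when
    the available paths give different answers.
    """
    results = set()
    for (x, b), r1 in facts.items():
        if x != a:
            continue
        r2 = facts.get((b, c))
        if r2 is None:
            continue  # no stored relation between the intermediate region and c
        composed = COMPOSITION.get((r1, r2))
        if composed is not None:
            results.add(composed)
    # Accept the inference only if every available path agrees.
    return results.pop() if len(results) == 1 else None


# The facts of the example in the text.
facts = {("A", "B"): "N", ("A", "D"): "E", ("B", "C"): "E", ("D", "C"): "N"}
print(infer_direction(facts, "A", "C"))  # NE: both paths agree
```

With these facts both paths (through B and through D) yield the same relation, so it is accepted. If one of the stored facts were changed so that the two paths produced different relations, the function would return `None`, mirroring the exclusion of conflicting results.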

The knowledge discovery process proceeds with the construction of the geographical model of the region. This model describes the location of each municipality in the Aveiro district. It will be used by the data mining algorithm in order to integrate the geographic component in the analysis. Fig. 12 shows the two streams constructed for the data mining step. The stream on the left side of the figure selects, from the GDB:geoAveiro table, the geographic information available for the Aveiro district. The selected records are analysed by the C5.0¹⁴ algorithm, constructing the district geographical model (geo_AVR). The obtained model was afterwards used in the other stream, allowing the geographical characterisation of the number of children and the age at death attributes.
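To give an intuition of how a rule induction algorithm decides whether the geographic component matters for a target attribute, the sketch below computes the information gain of candidate attributes. This is not C5.0 and not the stream of Fig. 12; it only illustrates the entropy-based attribute selection commonly used by decision tree learners. The records are hypothetical toy data, not taken from the demographic database, and the attribute names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of target obtained by splitting on attribute."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy records: direction of the municipality relative to a
# reference region, century, and the class of the age at death.
records = [
    {"direction": "NE", "century": 19, "age_class": "0-12"},
    {"direction": "NE", "century": 18, "age_class": "0-12"},
    {"direction": "S", "century": 19, "age_class": "46+"},
    {"direction": "S", "century": 18, "age_class": "46+"},
]

for attribute in ("direction", "century"):
    print(attribute, information_gain(records, attribute, "age_class"))
```

In this toy data the direction attribute separates the classes completely while the century attribute carries no information, so a tree learner would place the geographic attribute in its rules and leave the other out.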

Table 1 Classes and hierarchies for data reduction

Attribute             Hierarchies/classes for discrete values
Place                 Place → Parish → Municipality → District
Date (Century)        {1600..1699} → 17, {1700..1799} → 18, {1800..1899} → 19, {1900..1999} → 20
Date (Month)          {January, February, March, April, May, June, July, August, September, October, November, December} → {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Age                   {0..12} → 0–12, {13..25} → 13–25, {26..45} → 26–45, {46..110} → 46+
Number of Children    {0} → 0, {1..3} → 1–3, {4..6} → 4–6, {7..16} → 7+

Fig. 10 Selection, treatment and data pre-processing

13 A rule induction algorithm creates a decision tree aggregating a set of rules for classifying the data into different outcomes. This technique only includes in its rules the factors that really matter in the decision-making process.

14 The C5.0 algorithm is a rule induction algorithm that generates decision trees or rule sets, predicting the value of an output field.

Analysing the results obtained for the age at death attribute (decision tree on the left side of Fig. 12), a geographic characterisation of the age at death exists only for the XIX century (Century 19 in the figure). In the generated model, all municipalities located to the Northeast and Southwest of Aveiro present a lower age at death, 0–12, indicating that the regions at these locations had a high rate of child mortality. The model generated for the number of children attribute (decision tree on the right-hand side of the figure) points out that regions with a higher birth rate are located in the Northeast and East of the analysed district.

The last step is the interpretation of the discovered patterns, verifying the relevance of each one for the application domain. The patterns can be stored in the PDB, allowing their visualisation on a map. The user also has the option to run the VisualPadrão tool and visualise the selected model. Fig. 13 shows the geographic characterisation of the age at death attribute, which enables the visualisation of regions where the relative incidence of death in the '0–12' age class is higher than elsewhere in the district. These regions are located in the Northeast and Southwest of the district. For the geographic characterisation of the number of children attribute, although the corresponding map is not presented, the rules expressed in Fig. 12 point out that a higher incidence of births is located in the Northeast and East of the district.

The geographic characterisations obtained must be analysed by demographers in order to catalogue the several findings and formulate hypotheses that help to explain the models obtained with the data mining techniques. For this task demographers must bear in mind events that occurred in the analysed centuries, like plagues or other relevant diseases, which could explain, for example, the higher incidence of child mortality verified in some regions.

Fig. 11 Geo-spatial information processing

Fig. 12 Data mining step

Fig. 13 Geographic characterization of the age at death attribute

5 Conclusions

Organisational databases usually store geographic identifiers, like addresses or postcodes, whose spatial component is not usually incorporated in the process of knowledge discovery. This paper presented an approach for knowledge discovery in geo-referenced databases, based on qualitative spatial reasoning principles, where the location of geographic data was provided by qualitative identifiers. Direction, distance and topological spatial relations were defined for a set of Municipalities of Portugal. This knowledge and the composition table constructed for integrated spatial reasoning about direction, distance and topological relations allowed for the inference of new spatial relations analyzed in the data mining step of the knowledge discovery process.

The integration of a demographic database (covering 1690 to 1990 for the Aveiro district) with a geographic database (with the administrative subdivisions of Portugal) made possible the discovery of general characterisations that exploit the relationships that exist between the geo-spatial and non-spatial data analysed. The results obtained with the PADRAO system point out that traditional KDD systems, which were developed for the analysis of relational databases and do not have semantic knowledge linked to spatial data, can be used in the process of knowledge discovery in geo-referenced databases, since some of this semantic knowledge and the principles of qualitative spatial reasoning are available as domain knowledge. Clementine, a KDD system, was used in the assimilation of the geographic domain knowledge such as composition tables, in the inference of new spatial relations, and in the discovery of spatial patterns.

The main advantages of the proposed approach include the use of already existing data mining algorithms developed for the analysis of non-spatial data; an avoidance of the geometric characterisation of spatial objects for the knowledge discovery process; and the ability of data mining algorithms to deal with geo-spatial and non-spatial data simultaneously, thus imposing no limits and constraints on the results achieved.

Acknowledgements The authors thank NEPS (Nucleo de Estudos da Populacao e Sociedade) of the University of Minho for making the demographic data available.

References

1. Abdelmoty AI, El-Geresy BA (1995) A general method for spatial reasoning in spatial databases. Proceedings of the Fourth International Conference on Information and Knowledge Management, Baltimore

2. Allen JF (1983) Maintaining knowledge about temporal intervals. Communications of the ACM 26(11): 832–843

3. CEN/TC-287 (1996) Geographic Information: Data Description, Spatial Schema, European Committee for Standardisation, Report prENV 12160

4. CEN/TC-287 (1998) Geographic Information: Referencing, Geographic Identifiers, European Committee for Standardisation, Report prENV 12661

5. Egenhofer MJ (1994) Deriving the composition of binary topological relations. J of Visual Languages and Computing 5(2): 133–149

6. Ester M, Frommelt A et al. (1998) Algorithms for characterization and trend detection in spatial databases. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press

7. Fayyad U, Uthurusamy R (1996) Data mining and knowledge discovery in databases. Communications of the ACM 39(11): 24–26

8. Fayyad UM, Piatetsky-Shapiro G et al. (eds.) (1996) Advances in knowledge discovery and data mining. Massachusetts, The MIT Press

9. Fotheringham AS, Wong D (1991) The modifiable areal unit problem in multivariate statistical analysis. Environ and Planning A 23(7): 1025–1044

10. Frank AU (1996) Qualitative spatial reasoning: cardinal directions as an example. International J of Geographical Information Systems 10(3): 269–290

11. Freksa C (1992) Using orientation information for qualitative spatial reasoning. Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, Lecture Notes in Computer Science 639. Frank AU, Campari I and Formentini U (eds.), Berlin, Springer-Verlag

12. Han J, Kamber M (2001) Data mining: Concepts and techniques, Morgan Kaufmann Publishers

13. Han J, Tung A et al. (2001) SPARC: Spatial association rule-based classification. Data Mining for scientific and engineering applications. Grossman R, Kamath C, Kegelmeyer P, Kumar V and Namburu R (eds.), Kluwer Academic Publishers: 461–485

14. Hernandez D, Clementini E et al. (1995) Qualitative distances. Spatial Information Theory - A theoretical basis for GIS, Proceedings of the International Conference COSIT’95, Lecture Notes in Computer Science 988, Austria, Springer-Verlag

15. Hong J-H (1994) Qualitative distance and direction reasoning in geographic space. Maine, PhD Thesis, University of Maine

16. Intergraph (1999) Geomedia Professional v3, Reference Manual, Intergraph Corporation

17. Santos MY (2001) PADRAO: Um sistema de descoberta de conhecimento em Bases de Dados Geo-referenciadas, PhD Thesis (in Portuguese), Universidade do Minho

18. Santos MY, Amaral LA (in press) Mining Geo-referenced Databases: a way to improve decision-making. GIS in Business. Pick J (Ed.), Idea Group Publishing

19. Sharma J (1996) Integrated spatial reasoning in geographic information systems: Combining Topology and Direction. PhD Thesis, University of Maine

20. SPSS (1999) Clementine, User Guide, Version 5.2, SPSS Inc

21. Weeks JR (2004) The role of spatial analysis in demographic research. Spatially integrated social science. Goodchild MF and Janelle DG (eds.). Oxford University Press
