[IEEE 2013 International Conference on Advanced Computer Science and Information Systems (ICACSIS) -...

Abstract: This paper describes a geospatial knowledge discovery model for data sets of historical maps with relative geographic references. Knowledge about spatiotemporal dynamics is represented by the transition rules of a cellular automata model. The set of transition rules is obtained by applying three data mining techniques to a large number of data grids. First, multiple linear regression analysis is applied to each successive pair of the N data grids to obtain (N-1) rules. Second, clustering analysis reduces these rules to a small number of rules that represent all of them, and each selected rule is associated with the first data grid of its pair. Finally, classification analysis selects the rule used to determine the next value of the given data. The rule is selected based on the distance between the given data and the data grid associated with each selected rule. The model was evaluated on ordinal data from fire danger ratings and on nominal data from land use and land cover status. Model accuracy is measured and visualized by comparing the actual data with the simulated data. The accuracy ranges from 80% to 95% in the first case and from 90.5% to 95.2% in the second. In the first case, segmentation of the model improves performance significantly, especially for the von Neumann scheme.

I. INTRODUCTION

The rapid development of geospatial information technology has increased users' need for knowledge about change and movement in relation to place and time, because 80% of the entities contained in databases have a geographic reference [8]. Visualization systems based on knowledge of the temporal dynamics of geographical objects are increasingly important and interesting to study.

The changes are presented as spatiotemporal trends of geospatial objects, which requires spatiotemporal modeling to represent the relationship between each cell and its surrounding cells, whose values change over time. However, the development of analytical techniques for automatic data processing has not kept pace with the growth of data volumes. The availability of large amounts of data invites users to dig for the knowledge needed to understand phenomena, draw assessments, and make decisions. This creates a gap between the available data and the analytical information that can be utilized. Several scientific approaches have been introduced, such as spatial statistics, fractal-based methods, and artificial neural networks [14].

The reliability of Cellular Automata (CA) models in representing spatial and temporal dynamics has been demonstrated [2] [6] [10], especially in studies that explore complex dynamics with simple computational formulas. In the past, efforts to build spatiotemporal knowledge from data and experts were limited, because they require high modeling accuracy, and the model must be supplied with adequate data and expert knowledge. In practice, however, it is difficult to access expert knowledge and incorporate it into the CA transition rules, because such expertise is relative or contextual.

Likewise, the reliability of data mining techniques for extracting knowledge from large volumes of data has attracted interest in the study of spatiotemporal data, especially geographically referenced data [1] [4] [9]

Geospatial Data Extrapolation Using Data Mining Techniques and Cellular Automata

Ahmad Zuhdi1, Aniati Murni2, and Heru Suhartanto3

1 Informatics Engineering Department, Faculty of Industrial Technology, Trisakti University; 2, 3 Faculty of Computer Science, University of Indonesia

Email: 1 [email protected], 2 [email protected], 3 [email protected]

ICACSIS 2013 ISBN: 978-979-1421-19-5

413/13/$13.00 ©2013 IEEE

[11] [13]. The transition rules of the CA are extracted from the existing data set. Each rule can be applied to predict the future value of a certain object. In this study, the transition rules are used to extrapolate the value of the variable, based on the proximity of the given data to a few patterns that represent the dominant conditions.

II. GEOSPATIAL DATA EXTRAPOLATION

Geospatial data can be extrapolated by adopting one of the five models of spatiotemporal dynamic interaction [15], in which the value of a cell at time $(t + \Delta t)$ depends on the values of its neighbor cells at time $t$, $g_{ij}^{t+\Delta t} = F(g_{i \pm p,\, j \pm q}^{t})$, as shown in Figure 1.

The function F in the model is represented by the CA transition rules, which map the value of a cell and its neighborhood at time t to the value of the cell at time (t + Δt).

The geospatial data in this study, also called a data grid, are obtained from the image of a thematic map in raster format, which stores data in cells or pixels. Since the model is quantitative, qualitative data of both ordinal and nominal types must be converted into quantitative data. The conversion applies a random number generator with a chosen probability density function to each category in the data source.
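The conversion step can be illustrated with a minimal sketch, assuming that each category is assigned a numeric interval and that a value is drawn uniformly from it. The category names, the interval bounds, and the function names below are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical intervals: one numeric range per ordinal category.
INTERVALS = {
    "low": (0.0, 1.0),
    "moderate": (1.0, 2.0),
    "high": (2.0, 3.0),
    "extreme": (3.0, 4.0),
}

def quantify_cell(category, rng, intervals=INTERVALS):
    """Draw one continuous value uniformly from the category's interval."""
    low, high = intervals[category]
    return rng.uniform(low, high)

def quantify_grid(grid, seed=0, intervals=INTERVALS):
    """Convert a grid of category labels into a grid of floats."""
    rng = random.Random(seed)  # seeded so the conversion is repeatable
    return [[quantify_cell(c, rng, intervals) for c in row] for row in grid]

category_grid = [["low", "high"], ["extreme", "moderate"]]
value_grid = quantify_grid(category_grid)
```

A discrete uniform or normal generator could be swapped in by replacing the single `rng.uniform` call.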

The input data are divided into two parts: the first, larger part is used to build the knowledge model of transition rules, and the rest is the testing data set used to validate the model.

The model is managed through four major phases: data preparation, model development, model application, and model testing. The data preparation stage pre-processes the images, reads the raw data, and converts it into a quantitative data grid. In model development, the data are processed by regression analysis and clustering analysis to find the selected rules and the data grids associated with them. Model application classifies a new data grid at time t to evaluate its value at (t + Δt), repeated successively for several steps. Finally, to evaluate model accuracy, we compare the values of the simulated data grids with the actual data grids of the testing data set.

III. METHODS

The model is constructed by applying three data mining techniques: multivariate regression analysis, clustering analysis, and classification analysis. Regression analysis is applied to each pair of successive grids among the n data grids to obtain a collection of transition rules. The i-th transition rule is represented by a vector whose entries are the regression coefficients obtained by regressing the (i+1)-th grid on the i-th grid. The i-th grid, as the predictor, is mapped onto a particular neighborhood scheme (as shown in Figure 2), and the (i+1)-th grid is the outcome variable; the regression applies (1).

$$C_{i,j}(t+1) = a\,C_{i,j}(t) + b\,C_{i-1,j-1}(t) + c\,C_{i-1,j}(t) + d\,C_{i-1,j+1}(t) + e\,C_{i,j-1}(t) + f\,C_{i,j+1}(t) + g\,C_{i+1,j-1}(t) + h\,C_{i+1,j}(t) + k\,C_{i+1,j+1}(t) \qquad (1)$$

An alternative scheme that can be applied is the von Neumann neighborhood with R=2, with the regression model given in (2).

Fig. 1: Spatiotemporal interaction models: (I) independent model, (II) dependent model, (III) historical model, (IV) multivariate model, (V) geographic model [15]

Fig. 2 General formulation of the transition rule with the Moore neighborhood scheme (radius R=1)


Fig. 3 The von Neumann neighborhood scheme (radius R=2)

$$C_{i,j}(t+1) = a\,C_{i-2,j}(t) + b\,C_{i-1,j}(t) + c\,C_{i,j-2}(t) + d\,C_{i,j-1}(t) + e\,C_{i,j}(t) + f\,C_{i,j+1}(t) + g\,C_{i,j+2}(t) + h\,C_{i+1,j}(t) + k\,C_{i+2,j}(t) \qquad (2)$$
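Fitting one transition rule from a pair of successive grids can be sketched as an ordinary least-squares problem without an intercept, solved here through the normal equations. This is an illustrative sketch only: the function names are hypothetical, border cells are assumed to be excluded from the regression and copied unchanged when a rule is applied, and the synthetic grids stand in for real data.

```python
import random

# Moore neighborhood (R=1): offsets (di, dj) in the coefficient order a..k of (1).
MOORE = [(0, 0), (-1, -1), (-1, 0), (-1, 1), (0, -1),
         (0, 1), (1, -1), (1, 0), (1, 1)]

def apply_rule(grid, coeffs, offsets=MOORE):
    """One CA step: weighted sum of neighbors for interior cells.
    Assumption: border cells are copied unchanged."""
    r = max(max(abs(di), abs(dj)) for di, dj in offsets)
    out = [row[:] for row in grid]
    for i in range(r, len(grid) - r):
        for j in range(r, len(grid[0]) - r):
            out[i][j] = sum(w * grid[i + di][j + dj]
                            for w, (di, dj) in zip(coeffs, offsets))
    return out

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_rule(grid_t, grid_next, offsets=MOORE):
    """Least-squares coefficients from the normal equations (X'X) w = X'y."""
    n = len(offsets)
    r = max(max(abs(di), abs(dj)) for di, dj in offsets)
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for i in range(r, len(grid_t) - r):
        for j in range(r, len(grid_t[0]) - r):
            x = [grid_t[i + di][j + dj] for di, dj in offsets]
            for p in range(n):
                xty[p] += x[p] * grid_next[i][j]
                for q in range(n):
                    xtx[p][q] += x[p] * x[q]
    return solve(xtx, xty)

# Synthetic check: generate the next grid from known coefficients, then refit.
rng = random.Random(1)
grid_t = [[rng.uniform(0.0, 4.0) for _ in range(10)] for _ in range(10)]
true_coeffs = [0.6, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
grid_next = apply_rule(grid_t, true_coeffs)
fitted = fit_rule(grid_t, grid_next)
```

Another neighborhood scheme can be used by passing a different list of offsets through the `offsets` parameter.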

This process generates (n-1) transition rules, each represented by its regression coefficients. Clustering analysis is then applied to reduce the (n-1) transition rules to m patterns or clusters (m << n), which represent the dominant transition rules. Each pattern is associated with the transition rule TR that is closest to the centroid of its cluster, as shown in Figure 4. The pattern is also associated with the first data grid of the pair from which that transition rule was built. The association process thus produces m selected rules, each associated with one of m data grids in the model-building set.

Fig. 4: (a) Clustering of transition rules (b) Cluster centroid as pattern representation
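The reduction from (n-1) rules to m representative rules can be sketched as follows. The paper does not name the clustering algorithm, so k-means with a deterministic farthest-point initialization is used here purely as an assumption, and all function names and the toy coefficient vectors are hypothetical.

```python
import math

def dist(u, v):
    """Euclidean distance between two coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def init_centroids(vectors, m):
    """Farthest-point initialization (deterministic)."""
    centroids = [list(vectors[0])]
    while len(centroids) < m:
        far = max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(vectors, m, max_iter=100):
    """Plain k-means; stops when the cluster labels no longer change."""
    centroids = init_centroids(vectors, m)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(m), key=lambda c: dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(m):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def select_representatives(rules, m):
    """Per cluster, return the index of the rule nearest to the centroid.
    Index i also identifies the first grid of the pair that produced rule i."""
    centroids, labels = kmeans(rules, m)
    reps = []
    for c in range(m):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps.append(min(members, key=lambda i: dist(rules[i], centroids[c])))
    return sorted(reps)

# Toy rule vectors (two coefficients each) forming two obvious groups.
rules = [[0.9, 0.0], [1.0, 0.1], [1.1, 0.2],
         [0.0, 0.9], [0.1, 1.0], [0.2, 1.1]]
representatives = select_representatives(rules, 2)
```

Returning indices rather than vectors keeps the link between each selected rule and the first data grid of its pair.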

The model is applied by running it on the first selected grid of the testing data set for k successive observations. The number k depends on the observation period. We take k=7, since the observation period is one week.

A transition rule is selected and applied to the first data grid GP1 if it solves the following classification problem.

Let G1, G2, ..., Gp be the p model-building data grids associated with the transition rules TR1, TR2, ..., TRp that represent the dominant patterns. The transition rule applied to the data grid GP1 is TRk if

$$D(G_k, GP_1) = \min_{1 \le i \le p} D(G_i, GP_1) \qquad (3)$$

where D(Gi, Gj) is the Euclidean distance between the data grids Gi and Gj.

Applying TRk to GP1 generates the simulated data grid GS1. In the second iteration, GS1 is classified (generating GS2) by applying the classification rule in equation (3). The simulation continues in the same way for the third and subsequent iterations, up to the (k-1)-th.
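The selection and iteration steps can be sketched together as below. This is a sketch under stated assumptions: the Moore scheme is used, border cells are copied unchanged, and the toy grids and the two rules (a halving rule and an identity rule) are hypothetical values chosen only to make the behavior easy to follow.

```python
import math

# Moore neighborhood (R=1): offsets in the coefficient order a..k of (1).
MOORE = [(0, 0), (-1, -1), (-1, 0), (-1, 1), (0, -1),
         (0, 1), (1, -1), (1, 0), (1, 1)]

def grid_distance(g1, g2):
    """Euclidean distance D(Gi, Gj) over all cells of two equal-size grids."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2)))

def select_rule(grid, model_grids, rules):
    """Pick the rule whose associated model grid is nearest to `grid`."""
    k = min(range(len(model_grids)),
            key=lambda i: grid_distance(model_grids[i], grid))
    return rules[k]

def apply_rule(grid, coeffs, offsets=MOORE):
    """One CA step; assumption: border cells are copied unchanged."""
    r = max(max(abs(di), abs(dj)) for di, dj in offsets)
    out = [row[:] for row in grid]
    for i in range(r, len(grid) - r):
        for j in range(r, len(grid[0]) - r):
            out[i][j] = sum(w * grid[i + di][j + dj]
                            for w, (di, dj) in zip(coeffs, offsets))
    return out

def simulate(first_grid, model_grids, rules, steps):
    """Return [GS1, GS2, ...]; each step re-selects a rule for the current grid."""
    results, current = [], first_grid
    for _ in range(steps):
        current = apply_rule(current, select_rule(current, model_grids, rules))
        results.append(current)
    return results

# Toy setup: G1 is paired with a halving rule, G2 with an identity rule.
halve = [0.5] + [0.0] * 8
identity = [1.0] + [0.0] * 8
G1 = [[1.0] * 3 for _ in range(3)]
G2 = [[0.0] * 3 for _ in range(3)]
GP1 = [[0.9] * 3 for _ in range(3)]
simulated = simulate(GP1, [G1, G2], [halve, identity], steps=2)
```

In this toy run the test grid stays closer to G1 at both steps, so the halving rule is applied twice to the single interior cell.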

The model is validated by measuring its accuracy, calculated from the simulation error, which is defined as the root mean square difference between the simulated data grid and the corresponding test data. The model is calibrated by comparing the simulated results with two benchmarks (baselines): the first holds the first testing data grid constant, and the second runs the testing data grid through a comparative autoregressive Cellular Automata model (CA-AR) [5].
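The error measure and the constant-grid baseline can be sketched as follows. The function names and the small example grids are hypothetical, and the CA-AR baseline is not reproduced here.

```python
import math

def rmse(simulated, actual):
    """Root mean square difference over all cells of two equal-size grids."""
    sq = [(s - a) ** 2
          for rs, ra in zip(simulated, actual) for s, a in zip(rs, ra)]
    return math.sqrt(sum(sq) / len(sq))

def compare_with_constant_baseline(first_test_grid, simulated, actual):
    """Return (model_error, baseline_error).
    The baseline assumes no change: the first test grid is held constant."""
    return rmse(simulated, actual), rmse(first_test_grid, actual)

actual = [[1.0, 2.0], [3.0, 4.0]]
simulated = [[1.0, 2.0], [3.0, 2.0]]
first_test_grid = [[1.0, 1.0], [1.0, 1.0]]
model_error, baseline_error = compare_with_constant_baseline(
    first_test_grid, simulated, actual)
```

A model is only informative when its error is lower than that of the constant baseline, which is the comparison the calibration step makes.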

IV. CASE STUDY AND ITS RESULT

The research was conducted on two cases: a massive ordinal data set and a limited nominal data set, which has only five data grids for model building. The first case deals with fire hazard ratings for monitoring land and forest fires, based on the drought code index, i.e. potential drought and smoke. The National Space Agency publishes daily information about potential fire hazard through the Fire Danger Rating System (FDRS), which presents information for the provinces of Indonesia, except the Moluccas and Papua. The map presents fire hazard ranking indicators based on the index parameter [7].

Fig. 5 Fire Danger Rating map (a) and its index values (b). [8]

Observational data are taken from three years: 2006, 2008, and 2009. Data from 2006 and 2008 are used for model building, while the 2009 data are used as test data.

This study evaluates three random number generation methods: discrete uniform, continuous uniform, and normal distribution. The best performance is obtained with the continuous uniform distribution. An example of a simulation that applies the model is shown in Figure 6.

Fig. 6 Visualization of the simulated data, the actual data, and the computational error in the simulation of Fire Danger Rating.

The error is computed as the difference between the pixel values of the simulated and actual data and is then color-coded (0 = white, 1 = blue, 2 = green, and 3 = red). This approach enhances knowledge about the error, because the error is indicated spatially.
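The color coding of the error can be sketched as below, assuming that cell values are integer class indices and that the error is the absolute class difference capped at 3. Both assumptions, the function names, and the exact-match summary are illustrative and are not stated in the paper.

```python
# Color code from the text: 0 = white, 1 = blue, 2 = green, 3 = red.
ERROR_COLORS = {0: "white", 1: "blue", 2: "green", 3: "red"}

def error_map(simulated, actual):
    """Per-cell color from the absolute class difference (assumed, capped at 3)."""
    return [[ERROR_COLORS[min(abs(s - a), 3)] for s, a in zip(rs, ra)]
            for rs, ra in zip(simulated, actual)]

def exact_match_rate(simulated, actual):
    """Share of cells with zero error (an assumed summary, not the paper's metric)."""
    cells = [(s, a) for rs, ra in zip(simulated, actual) for s, a in zip(rs, ra)]
    return sum(1 for s, a in cells if s == a) / len(cells)

simulated = [[0, 1, 3], [2, 2, 0]]
actual = [[0, 2, 0], [2, 0, 0]]
colors = error_map(simulated, actual)
rate = exact_match_rate(simulated, actual)
```

Keeping the error as a grid rather than a single number is what lets it be inspected spatially.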

The model is able to predict the fire danger rating over a weekly (7-day) monitoring period, with accuracy ranging from 80% to 95%.

For the second case, only five observation maps of land cover status in the western part of Central Java are available. The observation years were 1990, 1995, 2000, 2005, and 2007; the last was used for the validation test.

Fig. 7: Input data of the second case, land use and land cover map of the western region of Central Java in 1990 [3]

Since the available data are very limited, only the data from 1990, 1995, and 2000 were used to extrapolate to 2005.

Fig. 8 Visualization of the simulated data, the actual data, and the computational error in the simulation of land cover data

For the second case, the model is able to predict the condition of land cover over the next 5 years, with accuracy ranging from 90.5% to 95.2%.

In general, the results of observation and analysis can be presented as a comparison between the two cases: the first case describes conditions with abundant data and complete spatiotemporal aspects, while the second case illustrates the opposite. The comparison results are presented in Table 1.

TABLE 1 GENERAL EXPERIMENT RESULTS

| | Case 1 | Case 2 |
|---|---|---|
| Characteristic | A large amount of data is available: 274 data grids from daily observation in the dry seasons of 2006 and 2008. All spatiotemporal properties are present. | Only 5 data grids are available, but they have higher temporal resolution. |
| Advantages | Provides flexibility to explore the model in various ways, such as clustering pattern analysis, segmentation, filtering, and increasing the number of patterns. | The model allows cells to absorb information optimally from the surrounding cells and gives convincing accuracy. The model performs temporal interpolation for higher resolution. |
| Disadvantages | It is difficult to formulate an appropriate scenario for simulating the model so that it can capitalize on the test data to produce convincing model accuracy. | The data are very limited, so it is very hard to develop knowledge exploration through simulation. |

Generally, for this type of ordinal data, the Moore scheme and the von Neumann scheme have the same performance.

The models were calibrated by comparing the accuracy of the models built from the ordinal data with a non-dynamic model, which assumes no change during the simulation time. The model can be applied with global knowledge, which includes all transition rules from the whole observation time domain, or with local knowledge, which is based only on the relevant time segment of the observations.

Applying the model with global knowledge gives worse results than the non-dynamic model, although this can be improved by increasing the number of patterns. However, a significant performance improvement is obtained through segmentation (the application of local knowledge), especially for the von Neumann scheme. Even so, applying the model to the extreme high segment still performs worse than the non-dynamic model.

V. CONCLUDING REMARKS

The model is capable of processing two types of geospatial data from a single thematic map, i.e. qualitative ordinal and nominal data. Transition rules in cellular models of geography are a machine-based system of geographical spatiotemporal dynamics. The rules are constructed by applying multivariate linear regression analysis, clustering analysis, and classification analysis. Clustering analysis refers to the extraction of the vectors that represent the transition rules, while classification analysis refers to the input grid data or test data.

Model parameters can be classified into four groups: parameters common to all cases or data sources, parameters specific to qualitative data, parameters specific to ordinal data, and parameters specific to quantitative data. The common parameters are the radius of the neighborhood scheme, the coefficient of determination of the regression (R2), and the error or accuracy of the model. The parameters specific to qualitative data are the random number generation method applied, the interval of each category, and the number of dominant patterns. The parameter specific to ordinal qualitative data is segmentation, and the parameter specific to quantitative data is the method for estimating the values of cells that are not yet known.

There are several alternatives for development and refinement, both in modeling and simulation and in the application domain or content. Research on the modeling and simulation aspects could address (1) the choice of a random number generation method that is appropriate for the observed data, (2) the cell structure and neighborhood scheme of the CA model, (3) the regression analysis model, (4) clustering analysis of the collection of patterns with alternative techniques, such as hierarchical, GA-based, or density-based techniques, and (5) simulation techniques that explore more complete and accurate variations of the simulation scenario, so that the user can perform a variety of experiments to study the behavior and characteristics of the system. Possible application domains include urban planning, traffic monitoring and control, disaster preparedness, regional planning and development, economic and business development, national security, and natural resources management.

REFERENCES

[1] Al-Ahmadi, K., Heppenstall, A.J., Hogg, J., See, L. 2009. A Fuzzy Cellular Automata Urban Growth Model (FCAUGM) for the City of Riyadh, Saudi Arabia. Part 2: Scenario Analysis. Applied Spatial Analysis and Policy. v2(2): 85-105. doi:10.1007/s12061-008-9019-z


[2] Boyer, L., Theyssier, G., "On Local Symmetries and Universality in Cellular Automata", Symposium on Theoretical Aspects of Computer Science 2009, pp. 195-206, http://arxiv.org/pdf/0902.1253.pdf, last accessed September 2013

[3] Creating Baseline for Adaptation and Mitigation of Climate Change in Indonesia (CCBASE), "Land Use/Cover Change Snapshots by District in Banyumas Region (1990-2007)", http://ccbase.wordpress.com/2008/11/20/land-use-and-land-cover-changes-in-banyumas-region-1990-2007/, last accessed September 2013

[4] Esmaeili, Mahdi; Gabor, Fazekas, "Finding Sequential Patterns from Large Sequence Data", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 1, No. 1, January 2010, ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 43 http://arxiv.org/ftp/arxiv/papers/1002/1002.1150.pdf, last accessed September 2013

[5] Dario Floreano and Claudio Mattiussi, Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies, pp 107-123, MIT Press, 2008

[6] Dai Fuqiang, “Mining Dynamic Transition Rules of Cellular Automata in Urban Population Simulation”, 2-nd International Conference on Computer Modeling and Simulation, Hainan China, 22 – 24 Jan. 2010, pp 471-474

[7] de Groot, W.J., Field, R.D., Brady, M.A., Roswintiarti, O., Mohamad, M., "Development of the Indonesian and Malaysian Fire Danger Rating Systems", Mitigation and Adaptation Strategies for Global Change (2006) 12:165-180

[8] LAPAN, Data Potensi Kekeringan dan Asap (Drought Code) tanggal 20 Juli 2009, http://lapan.go.id//simba/dc200709.jpg last accessed September 2013

[9] Li, Deren, Wang, Shuliang, "Concepts, Principles And Applications Of Spatial Data Mining And Knowledge Discovery", ISSTM 2005, August 27-29, 2005, Beijing, China, http://www.isprs.org/proceedings/XXXVI/2-W25/source/CONCEPTS_PRINCIPLES_AND_APPLICATIONS_OF_SPATIAL_DATA_MINING_AND_KNOWLEDGE_DISCOVERY.pdf, last accessed September 2013

[10] Ying Long, Zhenjiang Shen, Liqun Du, Qizhi Mao, Zhanping Gao, "BUDEM: an urban growth simulation model using CA for Beijing metropolitan area", 16th International Conference on Geo Informatics and the Joint Conference on GIS and Built Environment, 28-29 June 2008, Guangzhou, China, http://dspace.lib.kanazawa-.ac.jp/dspace/bitstream/2297/17397/1/TE-PR-SHEN-Z-7143.pdf, last accessed September 2013

[11] Maimon, Oded, Rokach, Lior, “Soft Computing for Knowledge Discovery and Data Mining”, Springer Science+Business Media, LLC, 2008 pp 209 – 230

[12] Matlab, "Language of Technical Computing", The Mathworks Lab, Inc. 2005, http://www.ics.uci.edu/~smyth/courses/Getting_Started_with_MATLAB.pdf, last accessed September 2013

[13] Miller, Harvey J. “Geographic Data Mining and Knowledge Discovery”, International Journal of Geographical Information Science, Volume 23, Issue 5 May 2009 , pages 683 – 684

[14] Soltani, Ali, "Towards Modeling Urban Growth with Using Cellular Automata (CA) and GIS", Geomatics Conference 83, 2005

[15] Tobler W., "A computer movie simulating urban growth in the Detroit region". Economic Geography, 46(2): 234-240
