Development of a new framework
for distributed processing of
Big Geospatial Data
Angéla Olasz, Binh Nguyen Thai
Institute of Geodesy, Cartography and Remote Sensing (FÖMI)
Directorate of Geoinformation
1. Introduction
2. IQmulus project short introduction
3. Defining Geospatial Big Data
4. Comparison of existing solutions (aspects of
requirements)
5. IQLib intro & objectives
6. IQLib modules and their status
7. Related papers & Future work
Content of this talk
A new framework for distributed processing of Big Geospatial Data
• FOSS4G Bonn, 24.-26. August 2016
Our goal is to find a solution for processing big geospatial data in a
distributed ecosystem. It should provide an environment for running
algorithms, services and processing modules without restrictions on
the implementation programming language, on the data partitioning
strategy, or on how data are distributed among computational
nodes, so that existing GIS processing scripts can be run.
As a first step we focus on raster data representation:
(i) decomposition and
(ii) distributed processing.
Before building this prototype system, we have:
1. analyzed data decomposition patterns,
2. defined the common GIS user requirements on processing
environments for Big Geospatial Data,
3. tried to identify Geospatial Big Data with the help of the 4 "V"s,
4. compared existing solutions on selected aspects.
Introduction
IQmulus
A High-volume Fusion and Analysis Platform for Geospatial Point Clouds, Coverages and Volumetric Data Sets
„IQmulus will leverage the information
hidden in large heterogeneous
geospatial data sets and make them a
practical choice to support reliable
decision making.”
4-year FP7 EU Research Project
November 2012 – November 2016
12 European partners, 7 countries
www.IQmulus.eu
We want a better understanding of the main attributes of
geospatial big data, but it is hard to draw the line where data start to
“exceed the capability of spatial computing technology”.
The amount of data that can be processed is use-case specific.
There are some good examples (Evans et al., 2014) where the
authors tried to identify the differences between Geospatial Data and
Geospatial Big Data.
Here we have tried to compare Big Data, Geospatial Big Data and
Geospatial Data in a short review.
The digital representations of continuous space fall into 3 main
groups: vector, raster and 3D representation. These have
been compared to “non-geospatial” text-based data formats.
Defining Geospatial Big Data
Aspects of requirements and
comparison of existing solutions
We have collected the most popular frameworks supporting
distributed computing on GIS data and investigated the
capabilities of each framework in the following aspects:
• Input/output data types supported by or suitable for the
framework
• GIS processing (supported executable languages)
• Data management flexibility: supervision of the data distribution,
especially for raster data
• Scalability potential
• Supported OS/platform dependencies
• GIS case studies, projects …
Aspects of requirements and
benchmarking of existing solutions
Most current processing frameworks follow the same
methodology as Hadoop and use the same data storage concept
as HDFS. From a processing point of view, one of the biggest
disadvantages was the data partitioning mechanism performed by the
HDFS file system and the distributed processing programming model.
In most cases we would like to have full control over our data
partitioning and distribution mechanism.
Existing GIS algorithms (Python, Matlab, R, etc.) cannot be executed
unmodified or with only small modifications.
We decided to develop our own distributed processing
framework.
Initiative for a new framework
IQLib - Objectives
Source: https://github.com/posseidon/IQLib/
IQLib specification
IQLib is a framework whose main goal is to allow an actor (human or
machine code) to manage huge data sets describing geographical
survey areas. It can be used to overcome the scalability
limitations of processing algorithms.
IQLib’s core functionalities are:
1. Data decomposition (tiling) of a survey area, in which
data points are either associated with polygons (regular or
irregular), grouped according to temporal attributes,
grouped into equally sized chunks, or a mixture of the above.
2. Data distribution and distributed processing among
nodes.
3. Stitching of the output data files into a single large file.
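The tiling and stitching steps listed above can be sketched for the simplest case, a regular grid over a small in-memory raster. This is a minimal illustration only, not IQLib code: the function names (tile_windows, cut, stitch), the list-of-lists raster and the "+1" processing step are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Raster = List[List[int]]             # row-major grid of cell values (assumed layout)
Window = Tuple[int, int, int, int]   # (row_off, col_off, n_rows, n_cols)


def tile_windows(n_rows: int, n_cols: int, tile_rows: int, tile_cols: int) -> List[Window]:
    """Cover an n_rows x n_cols raster with regular tiles; edge tiles may be smaller."""
    windows = []
    for r in range(0, n_rows, tile_rows):
        for c in range(0, n_cols, tile_cols):
            windows.append((r, c, min(tile_rows, n_rows - r), min(tile_cols, n_cols - c)))
    return windows


def cut(raster: Raster, window: Window) -> Raster:
    """Copy the cells covered by one window out of the raster."""
    r, c, h, w = window
    return [row[c:c + w] for row in raster[r:r + h]]


def stitch(tiles: Dict[Window, Raster], n_rows: int, n_cols: int) -> Raster:
    """Write every tile back at its original offset to rebuild one raster."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for (r, c, h, w), tile in tiles.items():
        for i in range(h):
            out[r + i][c:c + w] = tile[i]
    return out


# Hypothetical 4 x 5 raster, tiled into 2 x 3 chunks.
raster = [[row * 10 + col for col in range(5)] for row in range(4)]
windows = tile_windows(4, 5, tile_rows=2, tile_cols=3)
tiles = {w: cut(raster, w) for w in windows}

# Stand-in for a per-tile processing service: add 1 to every cell.
processed = {w: [[v + 1 for v in row] for row in t] for w, t in tiles.items()}
result = stitch(processed, 4, 5)
```

Keeping the window next to each tile is what makes stitching possible without looking at the data itself; irregular polygons or temporal grouping would need a different key than a rectangular window.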
High level concept of IQLib processing
framework
Modules
IQLib has 4 major modules; each module is responsible for a step in
GIS data processing.
Data Catalogue module: stores metadata corresponding to survey
areas, together with all the available, known and useful information
for processing.
Tiling & stitching module: Tiling algorithms usually process raw data,
creating data chunks. Stitching usually runs after the processing services
have successfully done their job. Metadata of tiled (and stitched)
datasets are registered in the Data Catalogue module, so we
always know the parents of tiled data.
Data distribution module (new): supervises how data are distributed
across processing nodes.
Distributed processing module: runs processing services on the
distributed dataset.
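To illustrate what "always knowing the parents of tiled data" could look like, here is a small in-memory sketch of parent tracking. It is not the IQLib data model: the class names (DatasetRecord, InMemoryCatalogue), the identifiers and the metadata keys are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetRecord:
    dataset_id: str
    parent_id: Optional[str] = None          # None marks an original survey dataset
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryCatalogue:
    """Toy stand-in for a catalogue: stores records and answers lineage queries."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def register(self, dataset_id: str, parent_id: Optional[str] = None,
                 **metadata: str) -> DatasetRecord:
        # Refuse tiles whose parent is unknown, so lineage can never dangle.
        if parent_id is not None and parent_id not in self._records:
            raise KeyError(f"unknown parent: {parent_id}")
        record = DatasetRecord(dataset_id, parent_id, dict(metadata))
        self._records[dataset_id] = record
        return record

    def lineage(self, dataset_id: str) -> List[str]:
        """Return the chain of ancestors, nearest parent first."""
        chain = []
        current = self._records[dataset_id].parent_id
        while current is not None:
            chain.append(current)
            current = self._records[current].parent_id
        return chain

    def children(self, parent_id: str) -> List[str]:
        return sorted(r.dataset_id for r in self._records.values()
                      if r.parent_id == parent_id)


catalogue = InMemoryCatalogue()
catalogue.register("survey-001", format="raster")
catalogue.register("survey-001/tile-0", parent_id="survey-001", window="0,0,2,3")
catalogue.register("survey-001/tile-1", parent_id="survey-001", window="0,3,2,2")
catalogue.register("survey-001/tile-0/result", parent_id="survey-001/tile-0")

ancestors = catalogue.lineage("survey-001/tile-0/result")
```

With records like these, a stitching step can ask for all children of a survey dataset and learn from their metadata where each tile belongs.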
Modules - status
Data Catalogue module
We have defined and implemented the
data model and the data/metadata access
procedures. After the final approval
phase it will go open source. The Data
Catalogue is a stand-alone service
providing a REST interface for users.
Tiling & stitching module *
Pre-defined tiling and stitching methods
tailored for processing algorithms.
* in the planning phase
Modules - status
Data distribution module (new)**: We would like to have full control
over data distribution across processing nodes. Currently we
support the SFTP protocol only. Data partitioning and distribution
algorithms could be extended by third-party developers.
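One way to picture user-controlled, extensible distribution is to treat each distribution algorithm as a plain function that maps tiles to nodes. The sketch below is an assumption-laden illustration, not the IQLib interface: the Strategy signature, the two example strategies and the node names are hypothetical, and the actual transfer step (for example over SFTP) is deliberately left out.

```python
from typing import Callable, Dict, List, Tuple

Tile = Tuple[str, int]                 # (tile name, size in bytes)
Plan = Dict[str, List[str]]            # node name -> tile names assigned to it
Strategy = Callable[[List[Tile], List[str]], Plan]


def round_robin(tiles: List[Tile], nodes: List[str]) -> Plan:
    """Assign tiles to nodes in turn, ignoring tile size."""
    plan: Plan = {node: [] for node in nodes}
    for i, (name, _size) in enumerate(tiles):
        plan[nodes[i % len(nodes)]].append(name)
    return plan


def size_balanced(tiles: List[Tile], nodes: List[str]) -> Plan:
    """Greedy balancing: largest tiles first, each to the least loaded node."""
    plan: Plan = {node: [] for node in nodes}
    load = {node: 0 for node in nodes}
    for name, size in sorted(tiles, key=lambda t: t[1], reverse=True):
        target = min(nodes, key=lambda n: load[n])
        plan[target].append(name)
        load[target] += size
    return plan


def make_plan(tiles: List[Tile], nodes: List[str], strategy: Strategy) -> Plan:
    # A third-party strategy only has to match the Strategy signature.
    return strategy(tiles, nodes)


tiles = [("t0", 400), ("t1", 100), ("t2", 100), ("t3", 200)]
nodes = ["node-a", "node-b"]

rr_plan = make_plan(tiles, nodes, round_robin)
balanced_plan = make_plan(tiles, nodes, size_balanced)
```

Here round_robin puts t0 and t2 on node-a, while size_balanced leaves the large t0 alone on node-a and sends the three smaller tiles to node-b, so both nodes hold 400 bytes.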
Distributed processing module**: Uses existing processing
algorithms/scripts without any modification or with very little adjustment,
and can send processing services, with all their dependencies, to
processing nodes on demand.
**under development
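Running an existing script unmodified can be approximated by treating it as a black-box command that takes an input tile path and an output path. The sketch below assumes that calling convention; run_service, the file names and the tiny "doubling" script standing in for an existing GIS script are all hypothetical, and shipping the script with its dependencies to a remote node is not covered.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_service(command: List[str], tile_path: Path, out_path: Path) -> int:
    """Run an external script on one tile; the script itself is never edited."""
    completed = subprocess.run(
        command + [str(tile_path), str(out_path)],
        capture_output=True,
        text=True,
    )
    return completed.returncode


# Stand-in for an existing processing script (could equally be R or Matlab,
# as long as it accepts <input> <output> on the command line).
EXISTING_SCRIPT = """
import sys
src, dst = sys.argv[1], sys.argv[2]
with open(src) as f:
    values = [int(v) for v in f.read().split()]
with open(dst, "w") as f:
    f.write(" ".join(str(v * 2) for v in values))
"""

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    script = work / "double.py"
    script.write_text(EXISTING_SCRIPT)

    tile = work / "tile_0.txt"
    tile.write_text("1 2 3")
    out = work / "tile_0.out.txt"

    exit_code = run_service([sys.executable, str(script)], tile, out)
    output_text = out.read_text()
```

Because the wrapper only builds a command line, the same call works for any interpreter or binary available on the processing node.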
Conclusion - dev status
Almost ready:
Data Catalogue module
Under development:
Distributed processing module
Data distribution module (new)
Theoretical planning phase:
Tiling & Stitching module
Currently the IQLib specification is available on GitHub at
https://github.com/posseidon/iqlib
IQLib is going to have a dedicated IQmulus GitHub soon!
Related papers
• A. Olasz, B. Nguyen Thai, D. Kristóf (2016). A new initiative for tiling,
stitching and processing geospatial big data in distributed
computing environments. ISPRS Annals of the Photogrammetry,
Remote Sensing and Spatial Information Sciences III-4, pp. 111-118.
• B. Nguyen Thai, A. Olasz (2015). Raster data partitioning for supporting
distributed GIS processing. ISPRS Archives of Photogrammetry
and Remote Sensing XL-3/W3, pp. 543-551.
• A. Olasz, D. Kristóf, M. Belényesi, K. Bakos, Z. Kovács, B. Balázs, Sz. Szabó (2015).
IQPC 2015: Water detection and classification on multi-source remote sensing
and terrain data. ISPRS Annals of the Photogrammetry, Remote
Sensing and Spatial Information Sciences XL-3/W3, pp. 583-588.
• A. Olasz, B. Nguyen Thai (2014). Decision support on distributed computing
environment (IQmulus). Proceedings of the 3rd Open Source Geospatial Research &
Education Symposium (OGRS), pp. 107-114.
Future work
• Finishing the implementation of all modules
• Testing IQLib in the following aspects:
1. running existing algorithms on the framework
(Python, R, etc.),
2. experimental execution on big geospatial data
(raster, vector, point cloud),
3. benchmarking (processing time).
Institute of Geodesy, Cartography and Remote Sensing (FÖMI)
Directorate of Geoinformation
5. Bosnyák sqr. BUDAPEST, HUNGARY 1149 www.fomi.hu, www.iqmulus.eu
Thank you for your attention!
Angéla Olasz [email protected]
Binh Nguyen Thai [email protected]
Acknowledgment:
www.iqmulus.eu
www.linkedin.com/groups/IQmulus-FP7-project-7470531