Date post: | 15-Apr-2017 |
Category: | Technology |
Upload: | christoph-koerner |
Large Scale Geo Processing on Hadoop
Christoph Koerner
Slides available
About me
● Data Scientist at T-Mobile Austria
● Visual Computing at Vienna University of Technology
● Author of Data Visualizations with D3 and AngularJS
● Author of Learning Responsive Data Visualization
● Organizer of Vienna Kaggle Meetup
● LinkedIn: linkedin.com/in/christophkoerner
● Twitter: @ChrisiKrnr
● Google+: +ChrisiHififm
● Github: github.com/chaosmail
Overview
1. Introduction
2. Pre-processing
3. Large scale geo processing on Hadoop
4. Some use-cases
What is Geo Processing?
● Operations to manipulate spatial data
● Operations include geographic feature overlay, feature selection and analysis, topology processing, raster processing, and data conversion
Source: Wikipedia
Spatial Data
● Contains data for a spatial reference
● Mostly 0D, 1D, or 2D geometries such as points, lines, and polygons
● Usually in latitude and longitude (or x and y) coordinates
3D Coordinates
● Earth is not a perfect sphere!
● Can be approximated by a biaxial ellipsoid
● 3D coordinates need a reference ellipsoid
● Most widely used is the World Geodetic System (WGS84) used by GPS
● Minimal positioning error on the surface
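To make the role of the reference ellipsoid concrete, here is a minimal sketch that converts a WGS84 latitude, longitude, and height into Earth-centered Cartesian coordinates. The constants are the published WGS84 defining parameters; the function name `geodetic_to_ecef` is chosen here for illustration and is not part of the talk.

```python
import math

# WGS84 defining constants: semi-major axis (meters) and inverse flattening.
WGS84_A = 6378137.0
WGS84_INV_F = 298.257223563


def geodetic_to_ecef(lat_deg, lon_deg, height_m=0.0):
    """Convert WGS84 latitude/longitude/height to Earth-centered XYZ in meters."""
    f = 1.0 / WGS84_INV_F
    e2 = f * (2.0 - f)  # first eccentricity squared
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude.
    n = WGS84_A / math.sqrt(1.0 - e2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - e2) + height_m) * math.sin(lat)
    return x, y, z


# A point on the equator sits one semi-major axis from the center,
# a pole sits one (shorter) semi-minor axis away.
equator = geodetic_to_ecef(0.0, 0.0)
pole = geodetic_to_ecef(90.0, 0.0)
```

The difference between the two distances (roughly 21 km) is what a perfect-sphere model would ignore.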
2D Projections
● The earth cannot be displayed on a 2D map without distortion
● Mapping to the surface of other 3D Volumes
○ Cylindrical
○ Conical
○ Azimuthal
2D Projections
● Every mapping has its tradeoff
○ Length Preserving (Equidistant)
○ Area Preserving (Equal Area)
○ Angle Preserving (Conformal)
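The tradeoff can be illustrated with two cylindrical projections on a spherical Earth: Mercator (conformal) and the cylindrical equal-area projection. This is a simplified sketch that assumes a sphere of mean radius rather than an ellipsoid, and all function names are chosen here for illustration.

```python
import math

R = 6371008.8  # mean Earth radius in meters (spherical approximation)


def mercator(lat_deg, lon_deg):
    """Conformal: preserves angles, inflates areas toward the poles."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return R * lon, R * math.log(math.tan(math.pi / 4 + lat / 2))


def cylindrical_equal_area(lat_deg, lon_deg):
    """Equal area: preserves areas, squashes shapes toward the poles."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return R * lon, R * math.sin(lat)


def mercator_area_scale(lat_deg):
    """Factor by which Mercator inflates small areas at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


# At 60 degrees latitude, Mercator shows small regions about four times too large.
scale_at_60 = mercator_area_scale(60.0)
```

Neither projection is wrong; each keeps one property and gives up the others, which is why the choice depends on whether angles, areas, or distances matter for the analysis.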
2D Projections
● Commonly used in Austria: MGI Austria Lambert (conformal)
● Commonly used in the US: Albers USA projection (equal area)
Geo Processing on Hadoop
● Acquire spatial dataset
● Pre-process the dataset
● Load the dataset into HDFS
● Perform topological processing and analysis using Hive
● Visualize the results
Data Sources
● https://www.data.gv.at - Free Austrian data: demographics, health, tourism, public transport, etc.
● GIP (Graphenintegrations-Plattform) - Austrian traffic graph: public transport, streets, etc.
● GADM (Database of Global Administrative Areas) - Country shapes
● Many more...
Vector Formats
● Shapefile
● GeoJSON
● WKT (Well-Known Text)
● Points
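To show how closely these formats are related, the following sketch converts a few GeoJSON geometry types to WKT using only the standard library. It handles 2D Point, LineString, and Polygon geometries only, and the function name `geojson_to_wkt` is chosen here for illustration.

```python
import json


def _coords(points):
    """Format a list of [x, y] pairs as a parenthesized WKT coordinate list."""
    return "(" + ", ".join(f"{p[0]} {p[1]}" for p in points) + ")"


def geojson_to_wkt(geometry):
    """Convert a 2D GeoJSON geometry dict to a WKT string."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "LineString":
        return "LINESTRING " + _coords(coords)
    if kind == "Polygon":
        # A polygon is a list of rings: the outer ring first, then any holes.
        return "POLYGON (" + ", ".join(_coords(ring) for ring in coords) + ")"
    raise ValueError(f"unsupported geometry type: {kind}")


vienna = json.loads('{"type": "Point", "coordinates": [16.37, 48.21]}')
wkt = geojson_to_wkt(vienna)
```

In practice a library such as GDAL or Shapely would do this conversion; the sketch only shows that both formats carry the same coordinate lists.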
GDAL - Geospatial Data Abstraction Library
● Translator library for raster and vector geospatial data formats
● Converts spatial data between file formats, reference systems and projections
● SQL query syntax
● Command-line tool (MIT license)
Source: gdal.org
GDAL - Transform Shapefiles to CSV
ogr2ogr -f CSV output.csv input.shp \
  -lco GEOMETRY=AS_WKT \
  -lco SEPARATOR=SEMICOLON \
  -oo ENCODING=UTF-8
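Once exported, the semicolon-separated file can be read with any CSV parser. The sketch below assumes the geometry lands in a column named `WKT` and uses a made-up two-row sample in place of a real `output.csv`; both the sample and the `name` attribute column are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the exported output.csv.
SAMPLE = (
    'WKT;name\n'
    '"POINT (16.37 48.21)";Vienna\n'
    '"POINT (13.04 47.81)";Salzburg\n'
)


def read_wkt_rows(handle, wkt_column="WKT"):
    """Return (wkt, attributes) pairs from a semicolon-separated CSV file."""
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for row in reader:
        attributes = {k: v for k, v in row.items() if k != wkt_column}
        rows.append((row[wkt_column], attributes))
    return rows


rows = read_wkt_rows(io.StringIO(SAMPLE))
```

A semicolon separator is convenient here because WKT itself contains commas, and the same plain-text layout is easy to load into HDFS afterwards.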
GDAL - Use spatial queries
ogr2ogr -sql "SELECT A.* FROM shape1 A, shape2 B WHERE ST_Intersects(A.geo, B.geo)" \
  -dialect SQLITE \
  data input_dir \
  -nln output.shp
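A pairwise intersects join like the one above compares every feature of one layer with every feature of the other. As a rough illustration of why such joins are usually preceded by a cheap filter, the sketch below compares bounding boxes only. It does not describe how GDAL evaluates the query; overlapping boxes are a necessary but not sufficient condition for intersecting geometries, and all names are chosen here for illustration.

```python
def envelope(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def envelopes_intersect(a, b):
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(shapes_a, shapes_b):
    """Return name pairs whose bounding boxes overlap.

    shapes_a and shapes_b map a feature name to its list of (x, y) vertices.
    """
    boxes_b = {name: envelope(pts) for name, pts in shapes_b.items()}
    pairs = []
    for name_a, pts_a in shapes_a.items():
        box_a = envelope(pts_a)
        for name_b, box_b in boxes_b.items():
            if envelopes_intersect(box_a, box_b):
                pairs.append((name_a, name_b))
    return pairs
```

Only the surviving candidate pairs would then need the exact (and more expensive) geometry test.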
More complex pre-processing
● Fiona for loading Shapefiles
● Shapely for geo processing
● Complex pre-processing, extraction, transformations, area and length computations, etc.
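The area and length computations mentioned above are what Shapely exposes for its geometry objects. To show what such a computation involves without depending on either library, here is a standard-library sketch using the shoelace formula; it is a stand-in for illustration, not Shapely's implementation, and it assumes planar (already projected) coordinates.

```python
import math


def _edges(ring):
    """Yield consecutive vertex pairs, closing the ring back to its start."""
    return zip(ring, ring[1:] + ring[:1])


def polygon_area(ring):
    """Unsigned planar area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in _edges(ring):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_length(ring):
    """Perimeter of the polygon in the same units as its coordinates."""
    return sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in _edges(ring))


unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
area = polygon_area(unit_square)      # 1.0
length = polygon_length(unit_square)  # 4.0
```

Because the formulas are planar, the results are only meaningful in real-world units if the coordinates were projected first.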
ESRI Tools on Hadoop
● Open Source Tools from ESRI
● Provided under the Apache License 2.0
● Geo processing tools for Hadoop + ArcGIS
● Active development and feedback (on Github)
ESRI Tools on Hadoop
● Esri Geometry API for Java - Java library for geo processing
● Spatial Framework for Hadoop - Hive SerDe and UDFs based on the geometry API
● Geoprocessing Tools for Hadoop - Tools for data exchange between ArcGIS and HDFS
● GIS Tools for Hadoop - Sample application and demos
Spatial Framework for Hadoop (Hive UDFs)
● Geometry Construction - Create geometries from WKT, from binary, or manually
● Relationship Tests - Contains, intersects, overlaps, touches, etc.
● Operations - Boundary, envelope, union, convex hull, etc.
● Accessors - Length, area, centroid, distance, etc.
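As an illustration of what a relationship test such as "contains" has to decide, here is a small ray-casting sketch in Python. It is not the Esri implementation: it handles a single ring without holes, and points exactly on the boundary are not treated consistently. The function name `polygon_contains` is chosen here for illustration.

```python
def polygon_contains(ring, point):
    """Ray casting: count how often a ray to the right of the point crosses the ring."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
inside = polygon_contains(square, (1.0, 1.0))
outside = polygon_contains(square, (3.0, 1.0))
```

An odd number of crossings means the point is inside; an even number means it is outside.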
Spatial Framework for Hadoop
ADD JAR libs/esri-geometry-api.jar;
ADD JAR libs/spatial-sdk-hadoop.jar;
CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point';
CREATE TEMPORARY FUNCTION ST_LineString AS 'com.esri.hadoop.hive.ST_LineString';
CREATE TEMPORARY FUNCTION ST_Polygon AS 'com.esri.hadoop.hive.ST_Polygon';
Spatial Framework for Hadoop
SELECT ST_Area(
  ST_Polygon(0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
) FROM src LIMIT 1;
SELECT ST_AsText(
  ST_Centroid(ST_GeomFromText(geo_as_wkt))
) FROM src LIMIT 1;
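For readers without a Hive cluster at hand, the centroid query above can be mimicked locally. The sketch below parses a single-ring WKT polygon and applies the standard polygon centroid formula; it is a simplified stand-in, not the Esri UDF, and it fails for degenerate polygons with zero area. Both function names are chosen here for illustration.

```python
import re


def parse_wkt_polygon(wkt):
    """Parse the ring of a simple 'POLYGON ((x y, ...))' string into (x, y) tuples."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("only single-ring POLYGON WKT is supported")
    return [tuple(float(v) for v in pair.split()) for pair in match.group(1).split(",")]


def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon."""
    area2 = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        cross = x1 * y2 - x2 * y1
        area2 += cross  # accumulates twice the signed area
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


ring = parse_wkt_polygon("POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))")
centroid = polygon_centroid(ring)  # (1.0, 1.0)
```

The repeated closing vertex that WKT requires contributes a zero-length edge, so it does not change the result.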
Problems
● No persistent spatial indices
● No projections, which affects length and area computations!
● Binary output by default
● Doesn’t work with vectorization
● No visualization
● Not feature complete (but most things work)
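One way to work around the missing projections for distances is to compute great-circle lengths directly from latitude and longitude. The haversine sketch below assumes a spherical Earth with mean radius, so it is an approximation and not a replacement for a proper projection; the names are chosen here for illustration.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def linestring_length_m(points):
    """Length in meters of a path given as (lat, lon) tuples."""
    return sum(
        haversine_m(a[0], a[1], b[0], b[1]) for a, b in zip(points, points[1:])
    )


# One degree of longitude along the equator is roughly 111 km.
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
```

The same logic could be packaged as a custom UDF, though that is outside what the talk describes.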
Use Case: Geo Processing @ T-Mobile Austria
● Network traffic analysis and optimization
● Signal performance along railway tracks
● Better analysis of network coverage
● Many more..
Use Case: Trips Analysis @ Uber
● What do trips look like?
● How can we reduce wait time and make more trips?
● Are there new products we should introduce?
Source: slideshare.net
Use Case: Traffic Jam Prediction based on GPS/FCD
● Estimate average speed of cars on road
● Compare to the max speed on each street
● Use public traffic jam data as ground truth
● Train a model to predict traffic jams
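The first two steps can be sketched as a feature computation: estimate the average speed from a GPS track and relate it to the speed limit. Everything here is an assumption for illustration, including the track layout of `(timestamp_s, lat, lon)` tuples, the function names, and the use of a flat-Earth distance approximation that is only reasonable for short segments. The resulting ratio would be an input to a trained model, not the model itself.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)


def segment_length_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation, adequate for short GPS segments."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(dx, dy)


def average_speed_kmh(track):
    """Average speed of a track given as time-sorted (timestamp_s, lat, lon) tuples."""
    elapsed = track[-1][0] - track[0][0]
    if elapsed <= 0:
        raise ValueError("track must span a positive amount of time")
    distance = sum(
        segment_length_m(a[1], a[2], b[1], b[2]) for a, b in zip(track, track[1:])
    )
    return distance / elapsed * 3.6


def speed_ratio(track, max_speed_kmh):
    """Observed average speed relative to the speed limit of the street."""
    return average_speed_kmh(track) / max_speed_kmh


# Hypothetical track: about 111 m along the equator in 10 seconds.
track = [(0, 0.0, 0.0), (10, 0.0, 0.001)]
ratio = speed_ratio(track, max_speed_kmh=50.0)
```

A ratio well below 1 on many tracks of the same street would suggest congestion, which is where the ground-truth traffic jam data comes in for training and validation.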