Date post: | 11-Aug-2015 |
Category: | Data & Analytics |
Upload: | daniel-marcous |
Geo Data Analytics
@dmarcous
● DBA (@IDF)
● Big Data Professional (@IDF)
● Data Wizard - Magic with Data (@Google - Waze)
● Pure professional
● Best practices
● Tools
● Tips & Tricks
● Free Advice!
Agenda
● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures
Why Does Geo Data Matter?
● C/C++, GEOS: http://trac.osgeo.org/geos
● C#, NTS: http://code.google.com/p/nettopologysuite/
● Java, JTS:
○ http://tsusiatsoftware.net/jts/main.html
○ http://www.vividsolutions.com/jts/JTSHome.htm
● Python, shapely: https://github.com/Toblerity/Shapely
● Ruby, ffi-geos: https://github.com/dark-panda/ffi-geos
● Javascript, JSTS: http://github.com/bjornharrtell/jsts
Geometry Object Model
Geospatial Operations
● WKT / WKB - Well-Known Text / Well-Known Binary, text and binary markup for geometries
○ POLYGON((34.807841777801514 32.164333053441936,34.81168270111084 32.164859820966136,34.81337785720825 32.1613540349589,34.80865716934204 32.16046394346568,34.807841777801514 32.164333053441936))
○ http://arthur-e.github.io/Wicket/sandbox-gmaps3.html
● GeoJSON
○ { "type": "FeatureCollection", "features": [{ "type": "Feature", "properties": { "Name": "Verint", "Guest": "dmarcous", "Accomodations": "Beer; Pizza" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 34.807841777801514, 32.164333053441936 ], [ 34.81168270111084, 32.164859820966136 ], [ 34.81337785720825, 32.1613540349589 ], [ 34.80865716934204, 32.16046394346568 ], [ 34.807841777801514, 32.164333053441936 ]]]}}]}
○ http://geojson.io/#map=17/32.16267/34.81061
● Shape Files - ESRI vector format
● GML - The Geography Markup Language (GML) is an XML grammar for expressing geographical features.
● Raster - Grid of georeferenced cells (pixels), e.g. imagery
Formats
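To make the relationship between the WKT and GeoJSON examples concrete, here is a minimal sketch that turns a simple WKT POLYGON into a GeoJSON Polygon geometry. It uses only the Python standard library; the function name wkt_polygon_to_geojson is hypothetical, and the parser assumes plain 2D rings with no Z/M values or EMPTY geometries. For real work a library such as shapely is the safer choice.

```python
import json
import re


def wkt_polygon_to_geojson(wkt: str) -> dict:
    """Convert a simple 2D WKT POLYGON into a GeoJSON Polygon geometry."""
    match = re.fullmatch(
        r"\s*POLYGON\s*\(\s*(.*)\s*\)\s*", wkt, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("not a WKT POLYGON")
    rings = []
    # Each ring is a parenthesised, comma-separated list of "x y" pairs.
    for ring_text in re.findall(r"\(([^()]*)\)", match.group(1)):
        ring = [[float(v) for v in pair.split()] for pair in ring_text.split(",")]
        rings.append(ring)
    return {"type": "Polygon", "coordinates": rings}


if __name__ == "__main__":
    wkt = "POLYGON((34.8078 32.1643,34.8117 32.1649,34.8134 32.1614,34.8078 32.1643))"
    print(json.dumps(wkt_polygon_to_geojson(wkt)))
```

Both formats keep coordinates in x y (longitude latitude) order, so no axis swap is needed in the conversion.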
Databases
● RDBMS
○ Postgres (PostGIS)
○ MS-SQL / DB2 / Oracle
● NoSQL
○ MongoDB
○ IBM Cloudant
○ Lucene spatial module (elastic / solr)
● Pure Geospatial Database
○ CartoDB (OS / Hosted)
○ GeoMesa (Accumulo)
■ GeoTrellis - Scala framework for processing raster data
GIS Systems
List of most popular ones - http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software
QGIS, TileMill, GRASS
Problem?
● Non-scalar data types
○ Aggregating
○ Sharding
○ Unordered
● Speed & Accuracy
○ The physical world is non-Euclidean
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
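A small illustration of why non-Euclidean matters: the same one-degree step in longitude covers very different ground distances depending on latitude. The sketch below compares a naive planar distance in degrees with the haversine great-circle distance. It assumes a spherical Earth with a mean radius of 6371.0088 km, which is itself an approximation; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_degrees(lat1, lon1, lat2, lon2):
    """Planar 'distance' in degrees; treats lat/lon as flat x/y."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


if __name__ == "__main__":
    for lat in (0.0, 60.0):
        print(lat, naive_degrees(lat, 0.0, lat, 1.0), haversine_km(lat, 0.0, lat, 1.0))
```

The planar measure reports the same value at both latitudes, while the great-circle distance at 60° is roughly half of the distance at the equator.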
Solution
Data Structures
● R-Tree (PostGIS, actually R+Tree)
● Quad Tree (DB2)
● Hyperdimensional Hashing
● Space Filling Curves
○ Z Order Curve (MS-SQL)
○ Hilbert Curve
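As a rough sketch of how a space-filling curve maps two dimensions onto one sortable key, the code below builds a Z-order (Morton) key by interleaving the bits of quantized longitude and latitude. The function names and the 16-bits-per-axis default are assumptions made for illustration; they do not describe how any particular database implements its index.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map value in [lo, hi] onto an integer in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def z_order_key(lat: float, lon: float, bits: int = 16) -> int:
    """Single integer key for a (lat, lon) point along a Z-order curve."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)


if __name__ == "__main__":
    print(z_order_key(32.1643, 34.8078))
    print(z_order_key(32.1614, 34.8134))
```

Nearby points often, but not always, get nearby keys; the jumps in the Z pattern are one reason a Hilbert curve is sometimes preferred.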
The Curse of Dimensionality
Dimension Reduction
● GeoHash - The mainstream way
○ Linear (non tangent), up to 5x difference in cell area
○ Same prefix - close areas (sort of…)
○ http://geohash.org/
○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc
● S2 - The Google way
○ Quadratic, same level cell ~ similar area
○ Faces of a projected cube - divided by Quad-Trees to levels - referenced to position on face by a Hilbert Curve
○ https://code.google.com/p/s2-geometry-library/
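To show where the "same prefix - close areas" property of GeoHash comes from, here is a minimal encoder sketch. It follows the standard public algorithm (alternating longitude/latitude bisection, 5 bits per base-32 character); the function name geohash_encode is hypothetical and this is not taken from any specific library. S2 is not sketched here because its cube projection and Hilbert ordering are better taken from the library itself.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a (lat, lon) point in degrees as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 6))  # u4pruy
```

Each extra character only refines the previous cell, so a shorter hash is always a prefix of a longer hash of the same point. Two points on opposite sides of a cell boundary can still be close together while sharing almost no prefix, hence the "sort of".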
● MongoDB Geospatial Indexing
● elastic / solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)
Near Real Time Answers
● ESRI - Hive UDFs - https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
● Pigeon - Pig UDFs - https://github.com/aseldawy/pigeon
● Spark
○ SpatialSpark
○ GeoTrellis
Big Processing - It’s a UDF World
Graph Representation
● Use Cases
○ Routing
○ Supply chains
○ User networks
● Tools
○ GraphX (Spark!) / Giraph (MR)
○ Dato SGraph (formerly known as GraphLab)
○ Gephi (on small parts, for exploration)
● Algorithms
○ Shortest Path - Dijkstra / A*
○ Communities - Triangle Counting
○ Importance - Centrality / PageRank
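For the routing use case, a shortest-path search over a weighted graph can be sketched in a few lines. This is a plain Dijkstra implementation with a binary heap from the Python standard library, not GraphX or Giraph code; the toy road network and its node names are made up for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps node -> {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


if __name__ == "__main__":
    # Hypothetical road segments with travel times in minutes.
    roads = {
        "A": {"B": 4.0, "C": 1.0},
        "C": {"B": 2.0},
        "B": {"D": 5.0},
    }
    print(dijkstra(roads, "A"))
```

A* follows the same loop but orders the heap by distance so far plus a heuristic, for example the great-circle distance to the destination.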
Tips & Tricks
Approximation
Timezones
● tz_world
○ http://efele.net/maps/tz/world/
○ What do we do with shapefiles?
● APIs
○ Geonames
○ http://www.earthtools.org/
○ Google Timezone API
● UDFs?
○ Hive - from_utc_timestamp(timestamp, string timezone)
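// Word Count
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

// Modified Word Count: count points per S2 cell instead of words
val points = spark.textFile("hdfs://...")
val cellCounts = points.map(line => line.split(","))
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1))
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")

// Take that from a library! Sketch assuming the S2 Java library (com.google.common.geometry)
import com.google.common.geometry.{S2CellId, S2LatLng}
// Cell ids are 64-bit, so return Long rather than Int
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()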
Good Old Word Count
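The same map / reduce-by-key pattern can be tried locally without a cluster. The sketch below counts points per cell in plain Python, swapping the S2 cell for a fixed-size lat/lon grid so that it needs no external library. The "id,longitude,latitude" input layout and the 0.01 degree cell size are assumptions made for the example.

```python
import math
from collections import Counter


def grid_cell(lat: float, lon: float, cell_deg: float = 0.01):
    """Snap a point to a (row, col) cell of a fixed-size lat/lon grid."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def count_points_per_cell(lines, cell_deg: float = 0.01) -> Counter:
    """Count CSV lines of the assumed form 'id,longitude,latitude' per grid cell."""
    counts = Counter()
    for line in lines:
        fields = line.split(",")
        lon, lat = float(fields[1]), float(fields[2])
        counts[grid_cell(lat, lon, cell_deg)] += 1  # the "reduceByKey(_ + _)" step
    return counts


if __name__ == "__main__":
    sample = ["a,34.8078,32.1643", "b,34.8079,32.1644", "c,34.8134,32.1614"]
    print(count_points_per_cell(sample))
```

Unlike S2 cells, cells of a plain degree grid shrink in ground area towards the poles, so this is only reasonable over a small region.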
Advanced - Precision is of the Essence
● Density Based Clustering
○ DBSCAN
■ Minimum cluster size (> Noise)
■ Epsilon (Spatial Radius)
○ R - MASS - kde2d
■ RGoogleMaps for the map
■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
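To show how the two DBSCAN parameters above interact, here is a compact, unoptimized sketch of the algorithm. It uses planar Euclidean distance via math.dist (Python 3.8+), so it assumes projected coordinates; for raw latitude/longitude a great-circle distance would be the better choice. Neighbour search is a brute-force scan, where a real implementation (for example fpc in R) would use a spatial index.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    eps is the neighbourhood radius; min_pts is the minimum number of points
    (including the point itself) needed to form a dense region.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

A larger epsilon merges clusters, while a larger minimum size turns sparse groups into noise.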
rJava
● Wrap geospatial functions of your choice
● Call them from R
● Use apply on an entire Dataframe!
● Use as features!
● Visualize??? (in 5 minutes)
R Packs for Geospatial Analysis
● geonames
○ Timezone
○ Weather
○ Nearby places
● RGoogleMaps
○ download+paint Maps
○ getGeoCode
● sp / maps / maptools
○ OGC object abstractions
○ Manipulate / display geo data
● rgdal - spTransform
○ Convert formats / coordinate systems
● geosphere - distances / circles / centroids
● fpc - DBSCAN
● Coverage -
○ http://cran.r-project.org/web/views/Spatial.html
Engineered Geo features
● LOCAL
○ time
○ is_early / is_late
○ day of week
○ is_workday / is_weekend
○ is_day_light (sunrise / sunset, tz_world)
● Weather
○ Temperature
○ is_rain / fog / hail / snow
● Squared (s2cell / geohash) statistics
○ Probability of users in square to predict X
● Address - is_residence / is_business
● News - GDELT
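The LOCAL features above all depend on converting a UTC timestamp to the local time of the point. A minimal sketch, assuming the UTC offset has already been looked up (for example from tz_world or a timezone API) and is passed in as hours: the thresholds for is_early and is_late and the default weekend days are arbitrary assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_time_features(utc_ts: datetime, utc_offset_hours: float, weekend_days=(5, 6)) -> dict:
    """Derive simple local-time features from an aware UTC timestamp.

    weekend_days uses datetime.weekday() numbering (Monday == 0); the default
    Saturday/Sunday weekend is an assumption and differs by country.
    """
    local = utc_ts.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    weekday = local.weekday()
    return {
        "hour": local.hour,
        "day_of_week": weekday,
        "is_weekend": weekday in weekend_days,
        "is_workday": weekday not in weekend_days,
        "is_early": local.hour < 6,   # assumed threshold
        "is_late": local.hour >= 22,  # assumed threshold
    }


if __name__ == "__main__":
    ts = datetime(2015, 8, 11, 21, 30, tzinfo=timezone.utc)
    print(local_time_features(ts, utc_offset_hours=3))
```

A fixed offset ignores daylight saving changes; with a timezone name available, a full timezone database lookup is the more robust route.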
WOW!
Data Art
Google Sheets
Frontend = Javascript?
● Google Maps API
○ https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap
● Leaflet
R for Visualisation
● ggplot2 + geospatial packs
○ http://uce.uniovi.es/mundor/howtoplotashapemap.html
○ http://stackoverflow.com/questions/9558040/ggplot-map-with-l
○ http://spatial.ly/2012/02/great-maps-ggplot2/
● RGoogleMaps
○ http://rforwork.info/tag/rgooglemaps/
R For Interactive
● Shiny
○ Leaflet
■ http://rstudio.github.io/leaflet/
■ http://shiny.rstudio.com/gallery/superzip-example.html
■ http://shiny.rstudio.com/gallery/bus-dashboard.html
○ Globe
■ https://github.com/trestletech/shinyGlobe
R Animation
● http://rmaps.github.io/blog/posts/animated-choropleths/
@aaronkoblin
Keep an Eye Out!
https://locationtech.org/list-of-projects