Geo data analytics

Post on 11-Aug-2015

Geo Data Analytics

@dmarcous

● DBA (@IDF)

● Big Data Professional (@IDF)

● Data Wizard - Magic with Data (@Google - Waze)

● Pure professional
● Best practices
● Tools
● Tips & Tricks
● Free Advice!

Agenda

● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures

Why Does Geo Data Matter?

● C/C++, GEOS: http://trac.osgeo.org/geos

● C#, NTS: http://code.google.com/p/nettopologysuite/

● Java, JTS:

○ http://tsusiatsoftware.net/jts/main.html

○ http://www.vividsolutions.com/jts/JTSHome.htm

● Python, shapely: https://github.com/Toblerity/Shapely

● Ruby, ffi-geos: https://github.com/dark-panda/ffi-geos

● Javascript, JSTS: http://github.com/bjornharrtell/jsts

Geometry Object Model

Geospatial Operations

● WKT / WKB (Well-Known Text / Well-Known Binary) - Geospatial Markup Language
○ POLYGON((34.807841777801514 32.164333053441936,34.81168270111084 32.164859820966136,34.81337785720825 32.1613540349589,34.80865716934204 32.16046394346568,34.807841777801514 32.164333053441936))
○ http://arthur-e.github.io/Wicket/sandbox-gmaps3.html

● GeoJSON
○ { "type": "FeatureCollection", "features": [{ "type": "Feature", "properties": { "Name": "Verint", "Guest": "dmarcous", "Accomodations": "Beer; Pizza" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 34.807841777801514, 32.164333053441936 ], [ 34.81168270111084, 32.164859820966136 ], [ 34.81337785720825, 32.1613540349589 ], [ 34.80865716934204, 32.16046394346568 ], [ 34.807841777801514, 32.164333053441936 ]]]}}]}
○ http://geojson.io/#map=17/32.16267/34.81061

● Shape Files - ESRI vector format

● GML - The Geography Markup Language (GML) is an XML grammar for expressing geographical features.

● Raster - Image-like grid of cells (pixels) positioned by coordinates

Formats
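The WKT and GeoJSON samples above describe the same polygon, so converting between the two is mostly reshaping. Below is a minimal sketch using only the Python standard library. The function name `wkt_polygon_to_geojson` is made up for illustration, and it only handles a single-ring POLYGON. In practice a library such as shapely would do this for you.

```python
import json
import re


def wkt_polygon_to_geojson(wkt: str) -> dict:
    """Convert a single-ring WKT POLYGON into a GeoJSON Polygon geometry.

    Illustrative helper (not from any library): no holes, no multi-polygons.
    """
    match = re.fullmatch(r"\s*POLYGON\s*\(\((.*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None or "(" in match.group(1):
        raise ValueError("only single-ring POLYGON WKT is handled in this sketch")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()  # WKT writes "x y", i.e. longitude then latitude
        ring.append([float(x), float(y)])
    if ring[0] != ring[-1]:
        raise ValueError("polygon ring must be closed (first point == last point)")
    return {"type": "Polygon", "coordinates": [ring]}


wkt = (
    "POLYGON((34.807841777801514 32.164333053441936,"
    "34.81168270111084 32.164859820966136,"
    "34.81337785720825 32.1613540349589,"
    "34.80865716934204 32.16046394346568,"
    "34.807841777801514 32.164333053441936))"
)
geometry = wkt_polygon_to_geojson(wkt)
print(json.dumps(geometry))
```

Both formats put longitude first, which is the opposite of the "lat, lon" order many APIs expect.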

Databases

● RDBMS
○ Postgres (PostGIS)
○ MS-SQL / DB2 / Oracle

● NoSQL
○ MongoDB
○ IBM Cloudant
○ Lucene spatial module (elastic / solr)

● Pure Geospatial Database
○ CartoDB (OS / Hosted)
○ GeoMesa (Accumulo)
■ GeoTrellis - Scala framework for processing raster data

GIS Systems

List of most popular ones - http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software

QGIS, TileMill, GRASS

Problem?

● Non scalar data types
○ Aggregating
○ Sharding
○ Unordered

● Speed & Accuracy
○ The Physical World is non-Euclidean

http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
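To make the "non-Euclidean" point concrete, here is a small sketch comparing one degree of longitude at the equator and at 60° north, using the haversine great-circle formula. It assumes a spherical Earth with a mean radius of 6371 km, which is itself an approximation. The function name is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the real Earth is not a perfect sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The same one-degree step in longitude covers very different ground distances.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)
print(f"1 degree of longitude at the equator: {at_equator:.1f} km")
print(f"1 degree of longitude at 60N:         {at_60_north:.1f} km")
```

Treating raw degrees as planar x/y therefore distorts distances more the further you get from the equator.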

Solution

Data Structures

● R-Tree (PostGIS, actually R+Tree)
● Quad Tree (DB2)
● Hyperdimensional Hashing
● Space Filling Curves
○ Z Order Curve (MS-SQL)
○ Hilbert Curve
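As a rough illustration of a space filling curve, the sketch below computes a Z order (Morton) key by interleaving the bits of two integer grid coordinates. The function names and the 16-bit default are assumptions for the example, not how any particular database implements it.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z order (Morton) key: x bits land on even positions, y bits on odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def deinterleave_bits(code: int, bits: int = 16) -> tuple:
    """Inverse of interleave_bits: recover (x, y) from a Morton key."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


# Sorting a 2x2 grid by Morton key walks the cells in a "Z" shape.
cells = [(x, y) for y in range(2) for x in range(2)]
z_order = sorted(cells, key=lambda cell: interleave_bits(*cell))
print(z_order)
```

The result is a single sortable number per cell, which is what makes range scans and sharding on a 2D position possible in a one-dimensional key space.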

The Curse of Dimensionality

Dimension Reduction

● GeoHash - The mainstream way
○ Linear (non tangent), up to x5 difference in cell area
○ Same Prefix - Close areas (sort of…)
○ http://geohash.org/
○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc

● S2 - The Google way
○ Quadratic, same level cell ~ similar area
○ Faces of a projected cube - divided by Quad-Trees to levels - Referenced to position on face by a Hilbert Curve
○ https://code.google.com/p/s2-geometry-library/
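The GeoHash idea can be sketched in a few lines: repeatedly halve the longitude and latitude ranges, emit one bit per halving, and pack every five bits into a base-32 character. This is a plain-Python sketch of the standard encoding, with the function name chosen for illustration. S2 is considerably more involved and is better taken from a library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash_encode(latitude: float, longitude: float, precision: int = 8) -> str:
    """Encode a (lat, lon) pair in degrees into a GeoHash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_longitude:
            mid = (lon_lo + lon_hi) / 2
            if longitude >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if latitude >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=5)
fine = geohash_encode(57.64911, 10.40744, precision=8)
print(coarse, fine)
```

A shorter hash is always a prefix of the longer one for the same point, which is the "same prefix, close areas" property. The "sort of" caveat is that two neighbouring points on opposite sides of a cell boundary can have completely different prefixes.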

● MongoDB Geospatial Indexing
● elastic / solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)

Near Real Time Answers

● ESRI - Hive UDFs - https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation

● Pigeon - Pig UDFs - https://github.com/aseldawy/pigeon

● Spark
○ SpatialSpark
○ GeoTrellis

Big Processing - It’s a UDF World

Graph Representation

● Use Cases
○ Routing
○ Supply Chains
○ User Networks

● Tools
○ GraphX (Spark!) / Giraph (MR)
○ Dato SGraph (formerly known as GraphLab)
○ Gephi (On small parts for exploration)

● Algorithms
○ Shortest Path - Dijkstra / A*
○ Communities - Triangle Counting
○ Importance - Centrality / PageRank
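For the routing use case, a shortest path search over a small road graph might look like the sketch below. It is a textbook Dijkstra on an adjacency list using the standard library heap. The graph, node names and weights are invented for the example.

```python
import heapq


def dijkstra(graph: dict, source: str) -> dict:
    """Shortest distance from `source` to every reachable node (non-negative weights)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Made-up road segments: node -> list of (neighbor, travel cost)
roads = {
    "A": [("B", 4.0), ("C", 1.0)],
    "B": [("D", 3.0)],
    "C": [("B", 2.0), ("D", 7.0)],
    "D": [],
}
distances = dijkstra(roads, "A")
print(distances)
```

On graphs too large for one machine, the same idea is what the distributed tools above (GraphX, Giraph) run in a message-passing form.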

Tips & Tricks

Approximation

Timezones

● tz_world
○ http://efele.net/maps/tz/world/
○ What do we do with shapefiles?

● APIs
○ Geonames
○ http://www.earthtools.org/
○ Google Timezone API

● UDFs?
○ Hive - from_utc_timestamp(timestamp, string timezone)
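One possible answer to "what do we do with shapefiles?" is a point-in-polygon lookup: load the zone polygons once, then test each coordinate against them. The sketch below uses the classic ray casting test. The two square "zones" are toy shapes invented for the example, not real tz_world boundaries, and reading the actual shapefile is left out.

```python
def point_in_polygon(lon: float, lat: float, ring: list) -> bool:
    """Ray casting: count how many polygon edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside


# Toy zones (hypothetical names and shapes), each a ring of (lon, lat) vertices.
TOY_ZONES = {
    "Zone/West": [(20.0, 20.0), (30.0, 20.0), (30.0, 40.0), (20.0, 40.0)],
    "Zone/East": [(30.0, 20.0), (40.0, 20.0), (40.0, 40.0), (30.0, 40.0)],
}


def lookup_zone(lon: float, lat: float, zones: dict = TOY_ZONES):
    """Return the name of the first zone containing the point, or None."""
    for name, ring in zones.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None


print(lookup_zone(34.81, 32.16))
print(lookup_zone(25.0, 30.0))
print(lookup_zone(0.0, 0.0))
```

A linear scan over every polygon does not scale. With real data you would first narrow the candidates using a spatial index such as the R-Tree or cell-based keys mentioned earlier.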

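// Word Count
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

// Modified Word Count: count points per S2 cell instead of occurrences per word
val points = spark.textFile("hdfs://...")
val cellCounts = points.map(line => line.split(","))
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1)) // fields are strings
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")

// Take that from a library! Assumed here: the S2 Java library (com.google.common.geometry)
import com.google.common.geometry.{S2CellId, S2LatLng}

// S2 cell ids are 64-bit, so the key is a Long rather than an Int
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()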

Good Old Word Count

Advanced - Precision is of the Essence

● Density Based Clustering
○ DBSCAN
■ Minimum cluster size (> Noise)
■ Epsilon (Spatial Radius)
○ R - MASS - kde2d
■ RGoogleMaps for the map
■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
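To show how the two DBSCAN parameters interact, here is a compact sketch of the algorithm in plain Python. It assumes the points are already in a planar (projected) coordinate system so that Euclidean distance is meaningful, and it uses a brute-force neighbour search that only suits small inputs. For real work, use a library implementation such as the one in R's fpc package.

```python
import math

NOISE = -1


def dbscan(points: list, eps: float, min_pts: int) -> list:
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i: int) -> list:
        # Brute force: every point within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Raising `min_pts` turns small groups into noise, while raising `eps` merges nearby groups. With latitude and longitude input you would swap in a great-circle distance instead of `math.dist`.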

rJava

● Wrap geospatial functions of your choice
● Call them from R
● Use apply on an entire Dataframe!
● Use as features!
● Visualize??? (in 5 minutes)

R Packs for Geospatial Analysis

● geonames
○ Timezone
○ Weather
○ Nearby places

● RGoogleMaps
○ download + paint Maps
○ getGeoCode

● sp / maps / maptools
○ OGC object abstractions
○ Manipulate / display geo data

● rgdal - spTransform
○ Convert formats / coordinate systems

● geosphere - distances / circles / centroids
● fpc - DBSCAN
● Coverage - http://cran.r-project.org/web/views/Spatial.html

Engineered Geo features

● LOCAL
○ time
○ is_early / is_late
○ day of week
○ is_workday / is_weekend
○ is_day_light (sunrise / sunset, tz_world)

● Weather
○ Temperature
○ is_rain / is_fog / is_hail / is_snow

● Squared (s2cell / geohash) statistics
○ Probability of users in square to predict X

● Address - is_residence / is_business
● News - GDELT
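The LOCAL features depend on converting an event's UTC timestamp to local time first. The sketch below does that with a fixed UTC offset to stay self-contained. In practice the offset would come from a timezone lookup (for example via tz_world). The thresholds for "early" and "late" and the Saturday/Sunday weekend are assumptions for the example and vary by country and use case.

```python
from datetime import datetime, timedelta, timezone


def local_time_features(utc_time: datetime, utc_offset_hours: float) -> dict:
    """Derive simple local-time features from a timezone-aware UTC timestamp."""
    local = utc_time.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return {
        "hour": local.hour,
        "day_of_week": local.weekday(),      # Monday == 0 ... Sunday == 6
        "is_weekend": local.weekday() >= 5,  # assumed Saturday/Sunday weekend
        "is_early": local.hour < 6,          # assumed threshold
        "is_late": local.hour >= 22,         # assumed threshold
    }


# The same instant looks very different depending on where the user is.
event = datetime(2015, 8, 14, 23, 30, tzinfo=timezone.utc)
features_utc = local_time_features(event, utc_offset_hours=0)
features_local = local_time_features(event, utc_offset_hours=3)
print(features_utc)
print(features_local)
```

Here a late Friday evening in UTC becomes an early Saturday morning three hours east, which flips `is_weekend`, `is_early` and `is_late` at once.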

WOW!

Data Art

Google Sheets

Frontend = Javascript?

● Google Maps API
○ https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap

● Leaflet

R for Visualisation

● ggplot2 + geospatial packs
○ http://uce.uniovi.es/mundor/howtoplotashapemap.html
○ http://stackoverflow.com/questions/9558040/ggplot-map-with-l
○ http://spatial.ly/2012/02/great-maps-ggplot2/

● RGoogleMaps
○ http://rforwork.info/tag/rgooglemaps/

R For Interactive

● Shiny
○ Leaflet
■ http://rstudio.github.io/leaflet/
■ http://shiny.rstudio.com/gallery/superzip-example.html
■ http://shiny.rstudio.com/gallery/bus-dashboard.html
○ Globe
■ https://github.com/trestletech/shinyGlobe

R Animation

● http://rmaps.github.io/blog/posts/animated-choropleths/

@aaronkoblin

Keep an Eye Out!

https://locationtech.org/list-of-projects

Contact

● Daniel Marcous
● dmarcous@gmail.com