Big Data and Analytics with ArcGIS
Canserina Kurnia
Technical Manager – Esri Global Asia Pacific
Agenda
• What is Big Data?
• What is Hadoop?
• How does Spatial integrate with Big Data and Hadoop?
• How do I get started?
Story Time…
U.S. Demographic Data
FOR EACH LOCATION
FOR EACH DEMOGRAPHIC
⬇50 MILE HEATMAP
Traditional Means…
14 Days
850 GB Raster Files
A Better Way?
What is Big Data?
7 BILLION
50% LIVE IN CITIES!
~70% BY 2050!
http://www.who.int/gho/urban_health/situation_trends/urban_population_growth_text/en/
Academics
The classic three V's: Volume, Velocity, Variety
…and the extended list: Veracity, Validity, Visualization, Vulnerability, Value

But then I’ve seen…
• Volume → data at rest
• Velocity → data in motion
• Variety → many types
• Veracity → data in doubt
• Validity → data that is correct
• Visualization → data in patterns
• Vulnerability → data at risk
• Value → data that is meaningful
“When the traditional means are failing you” – Anonymous
What are the new means?
What’s in a name?
http://blog.pivotal.io/pivotal/products/demystifying-hadoop-in-5-pictures
What Is Hadoop?
• Library / Framework
• Very, Very Large Structured and Unstructured Datasets
• Multi-Node Distributed Processing
• Resilient to Commodity Hardware Failure
Hadoop Basic Stack (top to bottom)
• MapReduce / Hive / HBase
• Yet Another Resource Negotiator (YARN)
• Hadoop Distributed File System (HDFS)
• Commodity Servers
Other Hadoop Projects
• Avro - Serialization / RPC System
• HBase - Distributed Columnar Database
• Hive - Ad Hoc “SQL” Interface
• Pig - Data-Flow Parallel Execution (Pig Latin)
• ZooKeeper - Coordination Service
• More…
HDFS
• Distributed File System
• Lots and Lots of Commodity Drives
• Fault Tolerant
• Loves Big Files
• “POSIX”-Like Interface
HDFS
[Diagram: an HDFS Client talks to the NameNode, which tracks blocks across multiple DataNodes]
HDFS Resilience!
[Diagram: a DataNode holding part of the BigData Program's input fails (☓); the work continues against a replica on another DataNode]
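The resilience idea above can be shown with a toy model in Python. This is not HDFS's actual (rack-aware) placement policy; the node names and `place_blocks` function are invented for the sketch, which only demonstrates why keeping each block on several DataNodes tolerates a node failure:

```python
import random

REPLICATION = 3  # HDFS's default replication factor

def place_blocks(blocks, datanodes):
    # Toy NameNode: copy each block onto REPLICATION distinct DataNodes.
    return {b: random.sample(datanodes, REPLICATION) for b in blocks}

def readable_after_failure(placement, failed):
    # A block survives as long as at least one replica is on a healthy node.
    return all(any(dn != failed for dn in dns) for dns in placement.values())

nodes = ["dn1", "dn2", "dn3", "dn4"]
pl = place_blocks(["blk_1", "blk_2"], nodes)
print(readable_after_failure(pl, "dn1"))  # True: 3 replicas tolerate 1 failure
```

With three replicas per block, any single-node failure leaves at least two live copies, so reads (and re-run tasks) can proceed.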
MapReduce
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
What Is MapReduce?
• Parallel, Fault-Tolerant Framework
• Splits Large Input
• Invokes a User-Defined “Map” Function
• Shuffles and Sorts
• Invokes a User-Defined “Reduce” Function
MapReduce & HDFS
[Diagram: a Client submits a .jar to the JobTracker (alongside the NameNode), which schedules work on TaskTrackers co-located with the DataNodes]
Thinking In MR
• Map: (K1, V1) → list(K2, V2) (filter & transform)
• Shuffle/Sort: list(K2, V2) → (K2, list(V2))
• Reduce: (K2, list(V2)) → list(K3, V3) (group & aggregate)
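The (K, V) flow can be simulated in memory with a few lines of Python. This is a word-count sketch of the three phases, not Hadoop itself:

```python
from collections import defaultdict

def map_fn(k1, v1):
    # Filter & transform: emit (K2, V2) pairs.
    for word in v1.split():
        yield word, 1

def reduce_fn(k2, values):
    # Group & aggregate: emit (K3, V3).
    yield k2, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    # Map phase over (K1, V1) input records.
    intermediate = [kv for k1, v1 in records for kv in map_fn(k1, v1)]
    # Shuffle/Sort: (K2, V2) pairs -> (K2, list(V2)).
    groups = defaultdict(list)
    for k2, v2 in sorted(intermediate):
        groups[k2].append(v2)
    # Reduce phase.
    return [kv for k2, vs in groups.items() for kv in reduce_fn(k2, vs)]

result = mapreduce([(0, "a b a"), (1, "b a")], map_fn, reduce_fn)
print(result)  # [('a', 3), ('b', 2)]
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network; the shape of the computation is the same.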
Geo MapReduce
DensityMap input (ID,X,Y per line):
ID1,X1,Y1
ID2,X2,Y2
ID3,X3,Y3
ID4,X4,Y4
…
DensityMap
function map(lineno, text)
{
  tokens = text.split(',')
  cell = toCell(tokens[1], tokens[2])
  emit(cell, 1)
}

function toCell(x, y)
{
  // some math !!
  return cell
}

function reduce(cell, iterator)
{
  sum = 0
  for (one : iterator)
    sum += one
  emit(cell, sum)
}
http://thunderheadxpler.blogspot.com/2013/03/bigdata-kernel-density-analysis-on.html
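Here is a runnable Python version of that pseudocode. The slide leaves the `toCell` math out, so the grid-snapping below is an assumed implementation (fixed-size cells), not the one from the blog post:

```python
from collections import Counter

CELL_SIZE = 1.0  # assumed grid resolution, in the data's coordinate units

def to_cell(x, y):
    # One possible "some math": snap a coordinate to its containing grid cell.
    return int(float(x) // CELL_SIZE), int(float(y) // CELL_SIZE)

def density(lines):
    counts = Counter()
    for text in lines:
        tokens = text.split(",")                    # ID,X,Y
        counts[to_cell(tokens[1], tokens[2])] += 1  # map: emit(cell, 1)
    return dict(counts)                             # reduce: sum per cell

print(density(["ID1,0.2,0.7", "ID2,0.9,0.1", "ID3,1.5,0.3"]))
# {(0, 0): 2, (1, 0): 1}
```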
Writing MapReduce Is Hard…
Think of Data as Water in Pipes
Cascading
A pipeline compiles down to MapReduce jobs:
[Diagram: X,Y Collection → ToCell → GroupBy → Count → Cell, Count]
Workflow Pipeline
[Diagram: a workflow of Source, Filter, and Sink nodes]
Cascading Pipe
// Pipe tap x,y input fields into spatial function
Pipe pipe = new Each("start", new Fields("X", "Y"), new SpatialDensity());
// Group by emitted 'cell' value
pipe = new GroupBy(pipe, new Fields("cell"));
// Count each group into a field named 'POPULATION'
pipe = new Every(pipe, Fields.GROUP, new Count(new Fields("POPULATION")));
http://thunderheadxpler.blogspot.com/2014/01/cascading-workflow-for-spatial-binning.html
How About…
No Programming???
Apache Hive
“SQL” ⬇ MapReduce Job
HQL
drop table if exists logs;
create external table if not exists logs (
  ip string,
  method string,
  uri string,
  status string,
  bytes int,
  time_taken int,
  referrer string,
  user_agent string
)
partitioned by (year int, month int, day int, hour int)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location 'hdfs://hadoop:8020/logs/';
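For intuition, here is what that external-table definition does to a single tab-delimited log line, sketched in Python (Hive performs this parsing for you at query time; the sample line is made up):

```python
# Column order mirrors the HQL table definition above.
COLUMNS = ["ip", "method", "uri", "status", "bytes", "time_taken",
           "referrer", "user_agent"]
INT_COLUMNS = {"bytes", "time_taken"}  # declared as int in the schema

def parse_log_line(line):
    # Split on '\t' (fields terminated by '\t'), pair with column names,
    # and cast the int columns; everything else stays a string.
    values = line.rstrip("\n").split("\t")
    return {col: int(v) if col in INT_COLUMNS else v
            for col, v in zip(COLUMNS, values)}

row = parse_log_line("10.0.0.1\tGET\t/index.html\t200\t512\t8\t-\tMozilla/5.0")
print(row["status"], row["bytes"])  # 200 512
```

Because the table is *external*, Hive only records this schema and the HDFS location; the raw text files stay where they are.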
Other Ad Hoc Engines
• Cloudera Impala
• Facebook Presto
• Spark SQL
• Bypass MR Generation / Direct HDFS Access
What About Spatial?
GIS Tools For Hadoop
• Computational Geometry Library
• Hive Spatial UDF Functions
• GeoProcessing Extensions to ArcMap
Geometry Library
• Points / Lines / Polygons
• I/O (GeoJSON, WKT, WKB, Shape)
• Spatial Relations (inside, touches, intersects, …)
• Spatial Operations (buffer, cut, convex hull, …)
• In-Memory Spatial Index
API Usage in BigData
• Map-only jobs - GeoEnrichment
  - Given a set of locations
  - Given a demographic area
  - Augment each location with demographic attributes
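A map-only job has no shuffle or reduce: each input record is enriched independently. The toy Python sketch below assumes rectangular demographic areas and invented attribute values purely for illustration (real GeoEnrichment uses actual polygon containment via the geometry library):

```python
# Hypothetical demographic "areas" as axis-aligned boxes: (xmin, ymin, xmax, ymax),
# each with made-up attributes for the sketch.
AREAS = {
    "tract_A": ((0.0, 0.0, 5.0, 5.0), {"median_income": 52000}),
    "tract_B": ((5.0, 0.0, 10.0, 5.0), {"median_income": 61000}),
}

def enrich(location):
    # Map-only: each (id, x, y) record is augmented on its own.
    loc_id, x, y = location
    for name, (box, attrs) in AREAS.items():
        xmin, ymin, xmax, ymax = box
        if xmin <= x < xmax and ymin <= y < ymax:
            return {"id": loc_id, "area": name, **attrs}
    return {"id": loc_id, "area": None}

print(enrich(("store1", 2.5, 1.0)))
# {'id': 'store1', 'area': 'tract_A', 'median_income': 52000}
```

Because no data needs to be grouped across records, map-only jobs skip the expensive shuffle phase entirely, which is what makes this pattern fast at scale.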
BigData Binning
Hive Spatial UDF
• Uses the Geometry API
• Constructors
  - ST_Point / ST_GeomFromGeoJSON
• Relations
  - ST_Contains
• Operations
  - ST_Buffer
• Accessors
  - ST_Distance, ST_Area
Hive Spatial UDF
SELECT counties.name, count(*) total
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape,
                  ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY total DESC;
GP Extensions
[Diagram: a geoprocessing Workflow connects ArcMap to HDFS and Hive/MapReduce]
PROCESSING EVOLUTION
• Transaction - Batch
• Operational - Dashboard
• Analytics - Exploration
• Intelligent - Realtime / Predictive
[Axis: from Fixed Schema toward Variable Schema]
Big Data Partners
And More….
Blog Post: http://thunderheadxpler.blogspot.com
Thank you