Platforms for simulation, visualisation and data analysis
Joris Borgdorff

Understanding large-scale human behaviour and, more generally, large complex systems and datasets. Most important: in this age, you should not treat data, simulations and human understanding separately. Software platforms that combine them are a step in this direction. This is a technical talk!
SIM-CITY: Understanding and responding to problems of urbanisation through computation
Shortage of:
- 97% of the required fire stations
- 80% of the fire-fighting vehicles
- 96% of the fire fighters
Fire: high risk, poor infrastructure
- Static data: road network; fire stations; department census data; hazard map
- Dynamic data: origin-destination matrix
- Real-time data: traffic density, fire engine locations
- Control: fire station placement, road police interventions
- Output: traffic behaviour, response times, optimal fire station placement, optimal road interventions
A software platform with this combination facilitates an iterative scientific process, from the early phases of planning and setting up an experiment to predicting behaviour.
Scenario run: response times in a low-traffic situation. More scenarios and fire station placements need to be run for a better overview. Jump to the micro-simulation.
Scenario exploration

[Architecture diagram: the user triggers parameter exploration and parameter optimization of models (emergency support, epidemics), which run on computing infrastructure (cluster, cloud, HPC). Input comes from data sources (public sources, GIS, sensors, experiments; stored as files, streams and databases). Output (metrics, statistics, likely scenarios) is shown to and analysed by the user, who can then intervene.]
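The parameter exploration step in the diagram can be sketched as a simple grid sweep. A minimal illustration in Python; the model function, parameter names and metric here are hypothetical stand-ins, not the SIM-CITY models:

```python
from itertools import product

def run_model(stations, traffic):
    """Hypothetical stand-in for a fire-response simulation:
    returns a fake response time that improves with more stations."""
    return 30.0 / stations + 5.0 * traffic

# Sweep a small parameter grid and collect one metric per scenario.
grid = {"stations": [1, 2, 4], "traffic": [0.2, 0.8]}
results = [
    {"stations": s, "traffic": t, "response_time": run_model(s, t)}
    for s, t in product(grid["stations"], grid["traffic"])
]

# Pick the best scenario by its metric.
best = min(results, key=lambda r: r["response_time"])
```

In the platform, each grid point becomes a task submitted to the cluster instead of a local function call.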
Assisted decision support
Used in SIM-CITY; to be repeated in Dynaslum (with Depraj), the Kumbh Mela project, and the Indo-Dutch project.
Platform architecture

[Architecture diagram in three layers. Services: a Python scenario exploration web service with a REST API, a geographic/statistics/simulation web site, a Jupyter Notebook, and Geoserver. Back-end: Xenon executes simulations on computing resources (cluster) and schedules data processing on raw data (Apache Spark). Data: provenance (CouchDB), geographic and aggregated data (PostGIS), files (WebDAV). The site modifies data and parameters, updates the parameter study, executes simulations and shows their output.]
Prototype in SIM-CITY:
- First the upper part: the web interface
- Who has used Jupyter Notebooks? -> Binder
- Geoserver: makes geographic data understandable for machines and web interfaces
- Custom web services: essential to provide new functionality, not to serve web pages
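A custom web service in this sense can be very small. As a sketch (not the SIM-CITY code), a pure-WSGI endpoint that serves JSON using only the Python standard library; the route and payload are illustrative:

```python
import json

def application(environ, start_response):
    """Minimal web service: one JSON endpoint, no HTML pages served.
    The /status route and its payload are made up for illustration."""
    if environ.get("PATH_INFO") == "/status":
        body = json.dumps({"service": "demo", "ok": True}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# To serve it locally, any WSGI server works, e.g.:
#   from wsgiref.simple_server import make_server
#   make_server("", 8080, application).serve_forever()
```

Because it speaks plain HTTP and JSON, the web interface can call it directly, like the other services in the diagram.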
Web interfaces
All demonstrations are on https://github.com/NLeSC/collab-demos. Crossfilter: make dynamic selections.
Source: computerweekly.com
Docker
Who knows Docker?
- Very lightweight
- Combine different TCP/IP services with docker-compose
- Not yet available everywhere
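Combining services with docker-compose can look like the sketch below; the service names and images are generic examples, not the actual SIM-CITY setup:

```yaml
# Two cooperating services on one virtual network; the web
# service reaches the database by its service name "db".
version: "2"
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: example
```

One `docker-compose up` then starts the whole stack, which is what makes it attractive for multi-service platforms like this one.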
[Platform architecture diagram repeated.]
- Who ever lost track of what simulations they ran? Provenance: keep track of tasks and their configuration, use it as a cache; HTTP support
- File service: again, needs HTTP support; WebDAV does this out of the box
- For large amounts of raw data that you want to analyse multiple times, you want some server-side processing: use Apache Spark (more later)
- Store aggregates of the raw data in a separate database
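The "provenance as cache" idea can be sketched as keying each task by a hash of its configuration and skipping configurations already recorded. A minimal illustration, with a plain dict standing in for the provenance database (CouchDB in the diagram) and a made-up simulate function:

```python
import hashlib
import json

store = {}  # stand-in for the provenance database (CouchDB in the slide)

def run_cached(config, simulate):
    """Run a simulation only if this exact configuration is new;
    otherwise return the recorded result."""
    key = hashlib.sha1(json.dumps(config, sort_keys=True).encode()).hexdigest()
    if key not in store:
        store[key] = {"config": config, "result": simulate(config)}
    return store[key]["result"]

calls = []
def simulate(config):
    calls.append(config)       # record that an actual run happened
    return config["x"] * 2     # dummy "simulation"

first = run_cached({"x": 21}, simulate)
second = run_cached({"x": 21}, simulate)  # served from the cache, no rerun
```

Hashing the sorted JSON of the configuration makes the cache key independent of key ordering.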
Apache Spark
Source: arstechnica.com
Who has heard of MapReduce or Hadoop? And of Apache Spark?
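The MapReduce idea behind these systems can be shown in plain Python: map every record to key-value pairs, group by key, then reduce per key. A word-count sketch using only the standard library (not an actual Hadoop or Spark API):

```python
from collections import defaultdict
from functools import reduce

records = ["to be or not to be", "to do is to be"]

# Map phase: emit (word, 1) for every token in every record.
pairs = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: sum the counts for each key.
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
```

In a real cluster, the map and reduce phases run in parallel over partitions of the data; the shuffle moves pairs to the right machine.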
Apache Spark example

Resilient Distributed Dataset

    val documents: RDD[Document] = myReadFunc()
    val dictionary: RDD[DictionaryItem] = documents
      .flatMap(_.tokens.distinct)
      .map((_, 1L))
      .reduceByKey(_ + _)
      .filter(_._2 >= lowerThreshold)
      .zipWithIndex()
      .map { case ((token, count), index) => DictionaryItem(index, token, count) }
      .cache()
    dictionary.saveAsTextFile("dictionary.txt")
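What the RDD pipeline computes can be checked against a plain-Python equivalent. The documents and threshold below are illustrative stand-ins for the Document and DictionaryItem types in the slide:

```python
from collections import Counter

# Stand-ins for the slide's Document inputs.
documents = [
    {"tokens": ["fire", "truck", "fire"]},
    {"tokens": ["fire", "road"]},
]
lower_threshold = 2

# flatMap(_.tokens.distinct): the unique tokens of each document.
token_stream = [t for d in documents for t in set(d["tokens"])]
# map((_, 1L)) + reduceByKey(_ + _): count in how many documents
# each token occurs.
counts = Counter(token_stream)
# filter + zipWithIndex + map(DictionaryItem(...)): keep frequent
# tokens and give each one a dictionary index.
dictionary = [
    (index, token, count)
    for index, (token, count) in enumerate(
        (t, c) for t, c in sorted(counts.items()) if c >= lower_threshold
    )
]
```

Unlike this local version, Spark evaluates the chain lazily over partitions, and `.cache()` keeps the result in cluster memory for reuse.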
[Platform architecture diagram repeated.]
- Simulations: interface with clusters
- Get the data to the right place
Testing: Travis + Docker. Extensions: pyxenon, noodles, sim-city-client.
Code quality
• Git
• Travis or Jenkins
• Code quality
• Docker images
• Software and data carpentry
Bringing it together:
- Git (GitHub)
- Continuous integration (Travis)
- Code quality
- Docker
- A list of software quality measures: https://github.com/NLeSC/estep-checklist/blob/master/checklist.md
Kumbh Mela
80 million people
Kumbh Mela project
• Current focus: data gathering
1. Distribute 3,200 very cheap bracelets with WiFi
2. Camera feeds
3. GPS trackers
4. Questionnaires
• A month's worth of data: tens of terabytes
We plan to use Spark with Jupyter Notebooks and custom services for data analysis, and later for simulations. Large-scale projects and datasets like these benefit from a platform approach.
Combine data, simulations and human understanding. Useful in large-scale contexts in two ways: a large project or a large community. A partner like SURFsara or the eScience Center can help. Direct access by the user is needed in academia.