Platforms for simulation, visualisation and data analysis
Joris Borgdorff

Understanding large-scale human behaviour and, more generally, large complex systems and datasets. Most important: in this age, you should not treat data, simulations and human understanding separately. Software platforms that combine them are a step in this direction. This is a technical talk!
SIM-CITY: Understanding and responding to problems of urbanisation through computation
Shortage of:
- 97% of the required fire stations
- 80% of the fire-fighting vehicles
- 96% of the fire fighters
Fire: high risk, poor infrastructure
- Static data: road network; fire stations; department census data; hazard map
- Dynamic data: origin-destination matrix
- Real-time data: traffic density, fire engine locations
- Control: fire station placement, road police interventions
- Output: traffic behaviour, response times, optimal fire station placement, optimal road interventions
A software platform with this combination facilitates an iterative scientific process, from the early phases of planning and setting up an experiment to predicting behaviour.
Scenario run: response times in a low-traffic situation. More scenarios and fire station placements need to be run for a better overview. Jump to the micro-simulation.
Scenario exploration

[Architecture diagram: the user triggers parameter exploration and parameter optimization of models (emergency support, epidemics), which run on computing infrastructure (cluster, cloud, HPC). Input comes from data sources (public sources, GIS, sensors, experiments; stored as files, streams and databases). Output (metrics, statistics, likely scenarios) is shown to and analysed by the user, who can then intervene.]
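The parameter exploration step in the diagram can be sketched as a simple grid sweep. A minimal illustration in Python; the model function, parameter names and metric here are hypothetical stand-ins, not the SIM-CITY models:

```python
from itertools import product

def run_model(stations, traffic):
    """Hypothetical stand-in for a fire-response simulation:
    returns a fake response time that improves with more stations."""
    return 30.0 / stations + 5.0 * traffic

# Sweep a small parameter grid and collect one metric per scenario.
grid = {"stations": [1, 2, 4], "traffic": [0.2, 0.8]}
results = [
    {"stations": s, "traffic": t, "response_time": run_model(s, t)}
    for s, t in product(grid["stations"], grid["traffic"])
]

# Pick the best scenario by its metric.
best = min(results, key=lambda r: r["response_time"])
```

In the platform, each grid point becomes a task submitted to the cluster instead of a local function call.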
Assisted decision support
Used in SIM-CITY; to be repeated in Dynaslum (with Depraj), the Kumbh Mela project, and the Indo-Dutch project.
Platform architecture

[Architecture diagram in three layers. Services: a Python scenario exploration web service with a REST API, a geographic/statistics/simulation web site, a Jupyter Notebook, and Geoserver. Back-end: Xenon executes simulations on computing resources (cluster) and schedules data processing on raw data (Apache Spark). Data: provenance (CouchDB), geographic and aggregated data (PostGIS), files (WebDAV). The site modifies data and parameters, updates the parameter study, executes simulations and shows their output.]
Prototype in SIM-CITY:
- First the upper part: the web interface
- Who has used Jupyter Notebooks? -> Binder
- Geoserver: makes geographic data understandable for machines and web interfaces
- Custom web services: essential to provide new functionality, not to serve web pages
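A custom web service in this sense can be very small. As a sketch (not the SIM-CITY code), a pure-WSGI endpoint that serves JSON using only the Python standard library; the route and payload are illustrative:

```python
import json

def application(environ, start_response):
    """Minimal web service: one JSON endpoint, no HTML pages served.
    The /status route and its payload are made up for illustration."""
    if environ.get("PATH_INFO") == "/status":
        body = json.dumps({"service": "demo", "ok": True}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# To serve it locally, any WSGI server works, e.g.:
#   from wsgiref.simple_server import make_server
#   make_server("", 8080, application).serve_forever()
```

Because it speaks plain HTTP and JSON, the web interface can call it directly, like the other services in the diagram.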
Web interfaces
All demonstrations are on https://github.com/NLeSC/collab-demos. Crossfilter: make dynamic selections.
Source: computerweekly.com
Docker
Who knows Docker?
- Very lightweight
- Combine different TCP/IP services with docker-compose
- Not yet available everywhere
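Combining services with docker-compose can look like the sketch below; the service names and images are generic examples, not the actual SIM-CITY setup:

```yaml
# Two cooperating services on one virtual network; the web
# service reaches the database by its service name "db".
version: "2"
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: example
```

One `docker-compose up` then starts the whole stack, which is what makes it attractive for multi-service platforms like this one.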
[Platform architecture diagram repeated.]
- Who ever lost track of what simulations they ran? Provenance: keep track of tasks and their configuration, use it as a cache; HTTP support
- File service: again, needs HTTP support; WebDAV does this out of the box
- For large amounts of raw data that you want to analyse multiple times, you want some server-side processing: use Apache Spark (more later)
- Store aggregates of the raw data in a separate database
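The "provenance as cache" idea can be sketched as keying each task by a hash of its configuration and skipping configurations already recorded. A minimal illustration, with a plain dict standing in for the provenance database (CouchDB in the diagram) and a made-up simulate function:

```python
import hashlib
import json

store = {}  # stand-in for the provenance database (CouchDB in the slide)

def run_cached(config, simulate):
    """Run a simulation only if this exact configuration is new;
    otherwise return the recorded result."""
    key = hashlib.sha1(json.dumps(config, sort_keys=True).encode()).hexdigest()
    if key not in store:
        store[key] = {"config": config, "result": simulate(config)}
    return store[key]["result"]

calls = []
def simulate(config):
    calls.append(config)       # record that an actual run happened
    return config["x"] * 2     # dummy "simulation"

first = run_cached({"x": 21}, simulate)
second = run_cached({"x": 21}, simulate)  # served from the cache, no rerun
```

Hashing the sorted JSON of the configuration makes the cache key independent of key ordering.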
Apache Spark
Source: arstechnica.com
Who has heard of MapReduce or Hadoop? And of Apache Spark?
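The MapReduce idea behind these systems can be shown in plain Python: map every record to key-value pairs, group by key, then reduce per key. A word-count sketch using only the standard library (not an actual Hadoop or Spark API):

```python
from collections import defaultdict
from functools import reduce

records = ["to be or not to be", "to do is to be"]

# Map phase: emit (word, 1) for every token in every record.
pairs = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: sum the counts for each key.
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
```

In a real cluster, the map and reduce phases run in parallel over partitions of the data; the shuffle moves pairs to the right machine.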
Apache Spark example

Resilient Distributed Dataset

    val documents: RDD[Document] = myReadFunc()
    val dictionary: RDD[DictionaryItem] = documents
      .flatMap(_.tokens.distinct)
      .map((_, 1L))
      .reduceByKey(_ + _)
      .filter(_._2 >= lowerThreshold)
      .zipWithIndex()
      .map { case ((token, count), index) => DictionaryItem(index, token, count) }
      .cache()
    dictionary.saveAsTextFile("dictionary.txt")
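What the RDD pipeline computes can be checked against a plain-Python equivalent. The documents and threshold below are illustrative stand-ins for the Document and DictionaryItem types in the slide:

```python
from collections import Counter

# Stand-ins for the slide's Document inputs.
documents = [
    {"tokens": ["fire", "truck", "fire"]},
    {"tokens": ["fire", "road"]},
]
lower_threshold = 2

# flatMap(_.tokens.distinct): the unique tokens of each document.
token_stream = [t for d in documents for t in set(d["tokens"])]
# map((_, 1L)) + reduceByKey(_ + _): count in how many documents
# each token occurs.
counts = Counter(token_stream)
# filter + zipWithIndex + map(DictionaryItem(...)): keep frequent
# tokens and give each one a dictionary index.
dictionary = [
    (index, token, count)
    for index, (token, count) in enumerate(
        (t, c) for t, c in sorted(counts.items()) if c >= lower_threshold
    )
]
```

Unlike this local version, Spark evaluates the chain lazily over partitions, and `.cache()` keeps the result in cluster memory for reuse.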
[Platform architecture diagram repeated.]
- Simulations: interface with clusters
- Get the data to the right place
Testing: Travis + Docker. Extensions: pyxenon, noodles, sim-city-client.
Code quality
• Git
• Travis or Jenkins
• Code quality
• Docker images
• Software and data carpentry
Bringing it together:
- Git (GitHub)
- Continuous integration (Travis)
- Code quality
- Docker
- A list of software quality measures: https://github.com/NLeSC/estep-checklist/blob/master/checklist.md
Kumbh Mela
80 million people
Kumbh Mela project
• Current focus: data gathering
1. Distribute 3,200 very cheap bracelets with WiFi
2. Camera feeds
3. GPS trackers
4. Questionnaires
• A month's worth of data: tens of terabytes
We plan to use Spark with Jupyter Notebooks and custom services for data analysis, and later for simulations. Large-scale projects and datasets like these benefit from a platform approach.
Combine data, simulations and human understanding. Useful in large-scale contexts in two ways: a large project or a large community. A partner like SURFsara or the eScience Center can help. Direct access by the user is needed in academia.