Tools of the Data Smithe’s Trade
Joe Smithe, Tim Hunter, Tad Slawecki, Steve Ruberg
Before we begin, a thank you:
Drew Gronewold, Tim Hunter, Steve Ruberg, Ron Muzzi, more…
Special thanks to the IJC for the invite
Recommendations to the IJC
● The end point: storage, access, analysis, presentation
○ Products of sensor technology infrastructure
○ Data from sensors to users, decision makers, etc.
● Some old tech is fine
● Some new tech is begging to be adopted
● Do what is socially sustainable and secure
○ Account for the retiring generations and the up-and-coming working ones
○ Adopt technologies with support from many people
■ Fair chance of hackers, greater chance of good programmers who can fix things fast
Labyrinths of data, hard to get around...
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://s382.photobucket.com/user/Gandalf-lotr/media/Gandalfsfirework.jpg.html
http://corecanvas.s3.amazonaws.com/theonering-0188db0e/gallery/original/pippinmerry011128a.jpg
http://iihtofficialblog.blogspot.com/2014/07/5-vs-of-hadoop-big-data.html
Overview of Infrastructure Technology
DISCLAIMER: I HAVE NOT WORKED WITH ALL OF THESE TECHNOLOGIES. THIS IS MERELY A
CATALOG OF TOOLS TO DISCUSS.
Overview
1. Target Platforms
2. Data Storage Formats
3. Data Management
4. Model Coupling/Combining
5. Probabilistic Modelling
6. Distributed processing
7. Modelling Services, Processing, Presentation
8. Visualization and interaction
Target Platforms
● Desktop
○ MS Windows
■ .NET
■ OneCore
○ Apple
■ OS X and Xcode
○ *nix
■ Various (Linux)
● Mobile or Tablet
○ Win Phone
○ Apple
■ iOS
○ Android
● Web
○ Microsoft
■ ASP.NET
○ Linux
■ LAMP
○ Other
■ WordPress
■ Drupal
■ Many more
Storage formats
● Plain text
○ “Future proof”
○ Growth can prove challenging
○ Examples: XML, WaterML, [other]ML, CSV
● Binary
○ Computers eat this stuff up, but humans don’t. Good to have transformers to create downloadable and ingestible copies
○ Examples: GRIB, NetCDF
BluePenguino - Photobucket
http://culturepopped.blogspot.com/2014/12/the-legends-of-pac-man.html
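The "transformer" idea for binary formats can be sketched in a few lines. This is only an illustration using Python's standard library: the record layout (station id, timestamp, temperature), the station number, and the function names are all assumptions made up for the example, and a real GRIB or NetCDF file would need a dedicated reader instead of `struct`.

```python
import csv
import io
import struct

# Hypothetical fixed-width binary record (an assumption for this sketch):
# station id (int32), Unix timestamp (int32), water temperature in deg C (float32).
RECORD = struct.Struct("<iif")


def pack_records(rows):
    """Encode (station, timestamp, temp_c) tuples as compact binary."""
    return b"".join(RECORD.pack(*row) for row in rows)


def binary_to_csv(blob):
    """Transform the binary blob into a human-readable CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["station", "timestamp", "temp_c"])
    for station, timestamp, temp_c in RECORD.iter_unpack(blob):
        writer.writerow([station, timestamp, f"{temp_c:.2f}"])
    return out.getvalue()


# Made-up sample readings.
rows = [(45007, 1451606400, 4.25), (45007, 1451610000, 4.5)]
blob = pack_records(rows)
csv_text = binary_to_csv(blob)
print(len(blob), "bytes of binary")
print(csv_text)
```

The binary copy stays small for machines, while the CSV copy is what people download and open.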
Data management
● Data provenance (origin): copies aren’t great, and version control systems offer limited help. Authoritative sources, and citations to them, cut down on noise and stray copies.
● Structured directories, even on the web
● Relational Database Management Systems (RDBMSs)
○ PostgreSQL (recommended), MySQL, SQLite
■ http://ask.metafilter.com/92162/MySQL-vs-PostgreSQL
○ Big Data - NoSQL, SciDB
○ Geospatial - PostGIS, SpatiaLite, MySQL Spatial
■ CUAHSI HydroServer, THREDDS, MapServer, GeoServer, and deegree implement the above
■ Web services (accessibility)
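As a rough illustration of keeping provenance next to the data in an RDBMS, the sketch below uses SQLite through Python's built-in `sqlite3` module. The table names, column names, and sample values are all hypothetical; the point is only that every observation row points back to one authoritative source and its citation.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    citation TEXT NOT NULL
);
CREATE TABLE observation (
    id            INTEGER PRIMARY KEY,
    source_id     INTEGER NOT NULL REFERENCES source(id),
    station       TEXT NOT NULL,
    observed_at   TEXT NOT NULL,
    water_level_m REAL NOT NULL
);
""")

# One authoritative source (made-up values) and a few observations citing it.
conn.execute(
    "INSERT INTO source (id, name, citation) VALUES (?, ?, ?)",
    (1, "Example gauge network", "Hypothetical citation"),
)
conn.executemany(
    "INSERT INTO observation (source_id, station, observed_at, water_level_m) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2016-01-01", 176.50),
        (1, "A", "2016-01-02", 176.54),
        (1, "B", "2016-01-01", 183.20),
    ],
)

# Mean level per station, with the citation carried along in the result.
rows = conn.execute("""
SELECT o.station, AVG(o.water_level_m), s.citation
FROM observation AS o
JOIN source AS s ON s.id = o.source_id
GROUP BY o.station, s.citation
ORDER BY o.station
""").fetchall()

for station, mean_level, citation in rows:
    print(station, round(mean_level, 2), citation)
```

The same schema would carry over to PostgreSQL or MySQL with minor changes to the column types.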
Data management - new tech to adopt
● GRAPH DATABASES
○ Fund them
○ Power++
■ Utilizes the power of graphs to explore relationships between data points
■ Understand, investigate many to many, one to many, many to one relationships with ease
○ http://cyanohub.earth.lsa.umich.edu/
○ For more: http://neo4j.com/developer/graph-db-vs-rdbms/ and http://mashable.com/2012/09/26/graph-databases/
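To give a feel for relationship-first exploration, here is a toy sketch in plain Python. It is not a graph database and does not use Neo4j or any other product's API; the node names, the relationship labels, and the `reachable` helper are invented for illustration. It simply stores relationships as edges and walks them a limited number of hops.

```python
from collections import defaultdict, deque

# Hypothetical relationships: buoys measure lakes, a dataset derives from buoys.
edges = [
    ("buoy_1", "MEASURES", "lake_erie"),
    ("buoy_2", "MEASURES", "lake_erie"),
    ("buoy_2", "MEASURES", "lake_huron"),
    ("dataset_x", "DERIVED_FROM", "buoy_1"),
    ("dataset_x", "DERIVED_FROM", "buoy_2"),
]

# Adjacency sets; edges are treated as undirected for exploration.
graph = defaultdict(set)
for start, _label, end in edges:
    graph[start].add(end)
    graph[end].add(start)


def reachable(graph, origin, max_hops):
    """Breadth-first walk: map each node within max_hops of origin to its hop count."""
    seen = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {node: hops for node, hops in seen.items() if node != origin}


# Which buoys and lakes does dataset_x ultimately touch?
related = reachable(graph, "dataset_x", max_hops=2)
print(sorted(related.items()))
```

In a relational schema the same question would need join tables for each many-to-many relationship; here every relationship is just another edge.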
Model coupling or combining
● Java-based Object Modelling System
● OpenMI (Open Modelling Interface, C# and Java)
○ GUIs - OpenMI Configuration Editor, Pipistrelle
A lot of specialized models focus on limited domains, and via coupling, we can attain a modelling domain that spans current problems...
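The coupling idea can be sketched with two toy models that exchange values at every time step. This does not use the OpenMI or Object Modelling System interfaces; the class names, the runoff coefficient, and the units are assumptions chosen only to show one model's output feeding another model's input.

```python
class RunoffModel:
    """Toy land-surface model: a fixed fraction of precipitation becomes runoff."""

    def __init__(self, runoff_coefficient):
        self.runoff_coefficient = runoff_coefficient

    def step(self, precipitation_mm):
        return self.runoff_coefficient * precipitation_mm


class LakeModel:
    """Toy lake model: the level changes by inflow minus evaporation."""

    def __init__(self, level_mm):
        self.level_mm = level_mm

    def step(self, inflow_mm, evaporation_mm):
        self.level_mm += inflow_mm - evaporation_mm
        return self.level_mm


def run_coupled(precipitation_series, evaporation_mm=1.0):
    """Advance both models together, passing runoff to the lake each step."""
    runoff = RunoffModel(runoff_coefficient=0.5)
    lake = LakeModel(level_mm=1000.0)
    levels = []
    for precipitation_mm in precipitation_series:
        inflow_mm = runoff.step(precipitation_mm)  # output of model 1 ...
        levels.append(lake.step(inflow_mm, evaporation_mm))  # ... is input to model 2
    return levels


levels = run_coupled([0.0, 10.0, 4.0])
print(levels)
```

Neither model knows about the other; only the small driver loop does, which is roughly the role a coupling framework takes over.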
Probabilistic Modelling
● Bayesian hierarchical modelling is becoming a very popular approach in many problems where estimates are many but conclusions are few or divergent
○ JAGS
○ Stan
● Cha, Y. and C.A. Stow. 2014. A Bayesian network incorporating observation error to predict phosphorus and chlorophyll a in Saginaw Bay. Environmental Modelling & Software, 57: 90-100.
● Gronewold, A.D., J. Bruxer, D. Durnford, J. Smith, A. Clites, F. Seglenieks, T. Hunter, S. Qian, V. Fortin (Accepted, 2016). Hydrological drivers of record-setting water level rise on Earth’s largest lake system. Water Resources Research.
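A much simpler relative of the hierarchical models that JAGS and Stan fit is sketched below: several independent estimates of the same quantity are combined by weighting each with its precision. It assumes Gaussian errors with known standard deviations and a flat prior, which real applications would rarely accept as-is; the function name and the numbers are made up and are not taken from the cited papers.

```python
import math


def combine_estimates(estimates):
    """Posterior mean and sd for a shared value, given (mean, sd) estimates.

    Assumes independent Gaussian errors and a flat prior, so the posterior
    precision is the sum of the individual precisions (1 / sd**2).
    """
    precisions = [1.0 / sd ** 2 for _, sd in estimates]
    total_precision = sum(precisions)
    posterior_mean = sum(
        mean * weight for (mean, _), weight in zip(estimates, precisions)
    ) / total_precision
    posterior_sd = math.sqrt(1.0 / total_precision)
    return posterior_mean, posterior_sd


# Two hypothetical estimates of the same quantity: a loose one and a tight one.
mean, sd = combine_estimates([(10.0, 2.0), (12.0, 1.0)])
print(round(mean, 3), round(sd, 3))
```

The combined estimate sits closer to the more precise input and is tighter than either one alone, which is the basic behaviour hierarchical models build on.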
Distributed processing
● High Performance Computers (HPCs, formerly supercomputers)
● MapReduce (key/value pairs as input)
○ programming model, similar to the Message Passing Interface (MPI)
○ scalable
○ reputable fault tolerance (robust)
■ Apache Hadoop (an implementation)
■ R and Hadoop Integrated Programming Environment (RHIPE)
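The MapReduce programming model itself fits in a few lines of single-machine Python. The sketch below is not Hadoop or RHIPE code; the input format (`station,temperature` text lines) and the function names are assumptions. It shows the three stages: map each record to a key/value pair, group values by key, then reduce each group.

```python
from collections import defaultdict


def map_record(line):
    """Map step: turn one 'station,temperature' line into a (key, value) pair."""
    station, temperature = line.split(",")
    return station, float(temperature)


def shuffle(pairs):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(values):
    """Reduce step: collapse one key's values to their mean."""
    return sum(values) / len(values)


# Made-up input records.
lines = ["A,4.0", "B,6.0", "A,6.0", "B,8.0"]
pairs = [map_record(line) for line in lines]
means = {key: reduce_group(values) for key, values in shuffle(pairs).items()}
print(means)
```

In a real cluster the map and reduce calls run on many machines and the framework handles the shuffle and any failed workers; the logic per record stays this small.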
Modelling Services, Processing, Presentation
● MATLAB, R, Python (Anaconda distribution), assisted with shell scripting
○ http://www.talyarkoni.org/blog/2013/11/18/the-homogenization-of-scientific-computing-or-why-python-is-steadily-eating-other-languages-lunch/
● Julia
● Web Development
○ PHP, JavaScript (and packages, more later)
○ Frameworks under Java, Python, Ruby on Rails
○ .NET Frameworks (Microsoft)
○ Backbone.js, Django
○ Content Management Systems (CMSs) such as Drupal, CKAN
Fireworks (Visualization)
● Often cast as the data themselves...
● JavaScript Packages: jqPlot, Flot, Processing (language), Raphaël, D3 (successor to Protovis), Google Charts, and Dygraphs
● Apache Flex
● Mapping: OpenLayers, Google Earth/Maps
● Interfaces: CUAHSI HydroShare, QGIS (like ArcGIS), uDig
● Desktop plotting packages:
○ R: ggplot2, ggvis, rgl, and default packages
○ Python: Matplotlib, Plotly, PyChart...
■ https://wiki.python.org/moin/NumericAndScientific/Plotting
jpTheSmithe.com
Relevant parchments:
All from Environmental Modelling & Software:
● Web technologies for environmental big data (Open Access), Vitolo et al. (2015)
● Web based visualization of large climate data sets, J.R. Alder and S.W. Hostetler (2015)
● A review of open source software solutions for developing water resources web applications, Swain et al. (2015)
And we’ll probably do this again in 5-10 years next year!