Post on 16-Aug-2015
CLIO INFRA
Digital infrastructure, Collaboration and Data Visualization Platform
Vyacheslav Tykhonov, Software Developer, Architect
vty@iisg.nl May 2015
Collabs functionality
Collabs functionality will be based on DataVerse, a data management and collaboration platform:
- draft and published datasets for individual users
- teams can collaboratively curate and analyze research datasets
- a dataset version control system tracks changes in datasets
- other researchers can download their own copy of the data if the dataset is published as Open Data
DataVerse is a flexible metadata store (repository) connected with a dataset store (storage)
Value for future Researchers
The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):
● access to a specific case study, citing and finding data
● access to the universe of data from the DataVerse network, which can organize and display them for browsing and searching
● data filtering: researchers with proper authorization can obtain the subset of data provided by the data collector
● data analysis to run descriptive statistics and graphics, visualization, plotting on historical maps
● Data APIs to export data for further analysis by popular statistical packages (STATA, SPSS, R, iPython Notebook) and advanced data mining tools that will be developed in the future (always up-to-date solution)
DataVerse 4.0 strengths
● “all-in-one” searching through Dataverses, datasets and files with a basic search bar
● navigation improved with advanced facets to browse Dataverses, datasets and files
● metadata fields found in results are ordered by the relevancy of the result
● users can access all public data from the homepage (cataloging system)
● users with authority can work together (download/edit) on the same datasets and exchange API tokens to visualize datasets with limited access
● advanced support for different types of datasets: Excel and CSV files
Missing features in DataVerse 4.0
● the interactive data exploration tool is too basic and not sufficient for data exploration
● a feature to filter data on a specific year and topic is not implemented
● data quality check is missing
● a new tool that will provide histograms, cross tabulations and enhanced descriptive statistics will be developed in the future, but it's not clear when or what it will look like (and we can contribute!)
● no visualization of datasets on historical maps (although there is some work on Worldmaps integration)
Collaboration possibilities
Descriptive metadata:
- dataset file
- documentation
- code
Sharing and collaboration capabilities:
- every researcher (DataVerse user) requests a unique API token
- researchers exchange API tokens and grant permissions to work on the same datasets as a team
Dataset Workflow in DataVerse
Upload data -> Draft Dataset (v.1) -> Published Dataset (v.1.1) -> Published Dataset (v.2)
From the presentation of Mercè Crosas, Director of Data Science at the Institute for Quantitative Social Science (IQSS) at Harvard University
DataVerse Integration
Published Dataset -> API token -> Data Processing Engine
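The hand-over from a published dataset to the data processing engine can be sketched as below. This is a minimal sketch, not the Clio Infra implementation: it assumes the Data Access API path (`/api/access/datafile/{id}`) and the `X-Dataverse-key` header described in the Dataverse 4.x API guide. The host name, file id and token are hypothetical placeholders. The request is only prepared, not sent.

```python
from urllib.request import Request


def build_datafile_request(base_url, file_id, api_token):
    """Prepare (but do not send) a request for one data file.

    Assumption: the Dataverse installation exposes the Data Access API
    and accepts the API token in the X-Dataverse-key header.
    """
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    return Request(url, headers={"X-Dataverse-key": api_token})


# Hypothetical host, file id and token, for illustration only.
req = build_datafile_request("https://dataverse.example.org", 42, "xxxxxxxx-token")
print(req.full_url)
```

The data processing engine would pass such a request to `urllib.request.urlopen` and parse the returned file; keeping the token in a header keeps it out of URLs and logs.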
Dataset Sample
World strikes dataset in CSV format stored as ILO_Combine_1927-2008_20141111_indicators_07_clean2.tab file in DataVerse
"","Disputes","days","Workers Involved","Index","Log Index","location","year""1",NA,NA,NA,NA,NA,"Algeria","1927""2",NA,NA,NA,NA,NA,"Algeria","1928"..."43",42,25142,5349,0.079556548692188,-2.53128720587461,"Algeria","1969""44",57,25771,6363,0.0811251873931524,-2.51176179402955,"Algeria","1970""45",70,52161,12276,0.11951502046322,-2.12431322125655,"Algeria","1971""46",100,40588,10706,0.149763559930392,-1.89869749563266,"Algeria","1972"…"266",142,2047601,333929,0.976359111730421,-0.0239248178964878,"Argentina","1946""267",64,3467193,541377,1.57839775815381,0.456410255396957,"Argentina","1947""268",103,3158947,278179,1.18885399297973,0.172989812001577,"Argentina","1948""269",36,510352,29164,0.218578879961968,-1.5206083229028,"Argentina","1949"
Dataset published in DataVerse (Demo)
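A data processing engine first has to read a file like the world strikes sample. The sketch below parses three rows copied from that sample with Python's standard `csv` module and treats the `NA` marker as a missing value. The function names and the output field names are illustrative assumptions, not part of DataVerse.

```python
import csv
import io

# Three rows copied from the world strikes sample.
SAMPLE = '''"","Disputes","days","Workers Involved","Index","Log Index","location","year"
"1",NA,NA,NA,NA,NA,"Algeria","1927"
"43",42,25142,5349,0.079556548692188,-2.53128720587461,"Algeria","1969"
"266",142,2047601,333929,0.976359111730421,-0.0239248178964878,"Argentina","1946"
'''


def to_number(value):
    """Return None for the NA marker, otherwise a float."""
    return None if value == "NA" else float(value)


def load_rows(text):
    """Turn CSV text into a list of dicts with typed values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "location": record["location"],
            "year": int(record["year"]),
            "disputes": to_number(record["Disputes"]),
            "days": to_number(record["days"]),
            "workers": to_number(record["Workers Involved"]),
        })
    return rows


rows = load_rows(SAMPLE)
missing = [r for r in rows if r["disputes"] is None]
print(len(rows), "rows,", len(missing), "without strike data")
```

Keeping missing values as `None` instead of 0 lets later steps show them as 'no data' instead of plotting a false zero.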
Geoservice and Geocoder
Will be delivered by Webmapper and integrated with the Clio Infra infrastructure
Geoservice requirements:
- should provide actual historical polygons for all countries and regions, available as geojson/topojson for online visualization
- QGIS toolbox should be supported to upload new maps and update old maps as shape files
Geocoder should standardize all geographical locations in different datasets:
- USSR and Soviet Union should be recognized as one country with the same PID
- Germany before and after 1990
- Indonesia before and after 1999
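The geocoder behaviour described above can be illustrated with a small lookup. This is a sketch only: the PID strings are invented placeholders, and the decision to start the "after" period in the boundary year itself (1990, 1999) is an assumption, since the slides do not say which side the boundary year belongs to.

```python
# Hypothetical alias table: different spellings map to one PID.
ALIASES = {
    "ussr": "pid:soviet-union",
    "soviet union": "pid:soviet-union",
}

# Entities whose borders changed: (name, first_year, last_year, pid).
# None means "open ended". All PIDs are invented for illustration.
PERIODS = [
    ("germany", None, 1989, "pid:germany-before-1990"),
    ("germany", 1990, None, "pid:germany-from-1990"),
    ("indonesia", None, 1998, "pid:indonesia-before-1999"),
    ("indonesia", 1999, None, "pid:indonesia-from-1999"),
]


def geocode(name, year):
    """Return the PID for a location name in a given year, or None."""
    key = " ".join(name.lower().split())  # normalize case and spacing
    if key in ALIASES:
        return ALIASES[key]
    for entity, start, end, pid in PERIODS:
        in_range = (start is None or year >= start) and (end is None or year <= end)
        if entity == key and in_range:
            return pid
    return None


print(geocode("USSR", 1950), geocode("Soviet  Union", 1970))
print(geocode("Germany", 1985), geocode("Germany", 1995))
```

The year is part of the lookup because the same name can refer to different polygons in different periods.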
Dataset Ecosystem
Data-processing activity:
- ingestion, curation, integration
- discovery of datasets and data quality check
- analyzing and cleaning potentially unstructured and noisy data
- data analytics and statistics
- data visualization
- Data APIs for advanced research
- linking data to other datasets (data munging)
- development of data-processing applications
Dataset Quality Check
Benford’s law to test the quality of data
Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the Web as Linked Data to verify the consistency of statistical datasets and recognize wrongly filled data values (for example, characters in values where numbers are expected)
Chart visualization to get an overview of missing data values
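A Benford's law check compares the share of each leading digit d in a variable with the expected share log10(1 + 1/d). The sketch below returns the mean absolute deviation between observed and expected shares. The choice of this deviation measure is an assumption; the slides do not name a test statistic or a threshold, and the check is only meaningful for values that span several orders of magnitude.

```python
import math
from collections import Counter


def leading_digit(value):
    """First significant digit of a non-zero number."""
    return int(f"{abs(value):.15e}"[0])  # e.g. 25142 -> '2.5142...e+04' -> 2


def benford_expected(digit):
    """Expected share of a leading digit under Benford's law."""
    return math.log10(1 + 1 / digit)


def benford_deviation(values):
    """Mean absolute gap between observed and expected first-digit shares."""
    digits = [leading_digit(v) for v in values if v]  # skip None and 0
    if not digits:
        raise ValueError("no non-zero values to check")
    counts = Counter(digits)
    total = len(digits)
    gaps = (abs(counts.get(d, 0) / total - benford_expected(d)) for d in range(1, 10))
    return sum(gaps) / 9


# Powers of two are a classic Benford-like sequence; constant data is not.
powers = [2 ** k for k in range(1, 101)]
constant = [5] * 100
print(benford_deviation(powers) < benford_deviation(constant))
```

A large deviation does not prove that data is wrong; it only flags a variable for closer inspection.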
Data Exploration Workflow
- Draft Datasets are visible only to their owners
- with API tokens researchers can get access to an interactive dashboard to get some insights about the data
- the dashboard can provide access to all variables from the dataset and visualize them on charts, graphs, historical maps and treemaps
After a dataset is prepared to go public, it can be published in Collabs:
- guest users can download a copy of the dataset
- team members with permissions and an authorized API token can contribute to the dataset
Combining and aggregating datasets in data exploration tools
● tools can be developed to recognize the same variable under different names in different datasets (Year and Jaar)
● geocoding services can standardize different regions (Netherlands → NL, Amsterdam → AMS)
● all possible relationship paths can be ranked from the “best” (100%) to the “worst” (0%), for example, value “05” for variable “Month” can be recognized as “May”
● standardized datasets can be used as “reference” data for other datasets from other research groups and depot services (CHIA, Harvard DataVerse Network, MIT, DANS)
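The variable matching and ranking ideas above can be sketched as follows. The synonym table and the scores (1.0 for an exact match, 0.5 for a three-letter prefix match) are illustrative assumptions; a real tool would need curated or learned mappings.

```python
# Hypothetical synonym table mapping column names to one canonical variable.
VARIABLE_SYNONYMS = {"year": "year", "jaar": "year", "month": "month", "maand": "month"}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def canonical_variable(name):
    """Map a column name such as 'Jaar' to its canonical variable, or None."""
    return VARIABLE_SYNONYMS.get(name.strip().lower())


def rank_month(value):
    """Rank interpretations of a raw 'Month' value from best (1.0) to worst."""
    text = value.strip().lower()
    ranked = []
    if text.isdigit() and 1 <= int(text) <= 12:
        ranked.append((MONTHS[int(text) - 1], 1.0))  # "05" -> May
    else:
        for name in MONTHS:
            if name.lower() == text:
                ranked.append((name, 1.0))
            elif len(text) >= 3 and name.lower().startswith(text[:3]):
                ranked.append((name, 0.5))  # abbreviation such as "Sept"
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


print(canonical_variable("Jaar"), rank_month("05"), rank_month("Sept"))
```

Returning a ranked list instead of a single answer lets the researcher confirm or reject ambiguous matches.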
Infrastructure will be suitable for different datasets
Quantitative datasets store quantity data values (numbers):
- data can be measured (length, speed, height, age)
- example: current clio-infra datasets
Qualitative datasets store quality observations (descriptions):
- data can be observed (colors, textures, professions, groups)
- example: HISCO, world strikes dataset
Data Visualization is different for different kinds of datasets:
- quantity can be plotted on charts, graphs, historical maps
- quality (hierarchy) can be visualized on treemaps and maps
Our solution for data exploration and analysis to extend DataVerse
● data quality check during upload/update of a dataset
● automatic ingestion and recognition of years and locations in datasets
● integrated with the geocoder and geoservice developed by Webmapper
● interactive dashboard to do visual exploration of variables from a dataset (graph / histogram / scatterplot)
● data processing engine to plot data on historical maps (if the dataset has geospatial data)
● correlation for continuous variables
● regression analysis for estimating the relationships among variables
● building treemaps for qualitative data analysis (hierarchical data coming soon)
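Correlation and regression for two continuous variables can be sketched in plain Python. The functions below compute Pearson's correlation coefficient and an ordinary least squares line; the slides do not say which correlation or regression method the dashboard uses, so both choices are assumptions. The input values are the Algeria 1969-1972 rows from the world strikes sample.

```python
import math


def _moments(xs, ys):
    """Return means, covariance sum and variance sums for two series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, cov, var_x, var_y


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    _, _, cov, var_x, var_y = _moments(xs, ys)
    return cov / math.sqrt(var_x * var_y)


def linear_fit(xs, ys):
    """Least squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y, cov, var_x, _ = _moments(xs, ys)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# "Disputes" and "Workers Involved" for Algeria, 1969-1972.
disputes = [42, 57, 70, 100]
workers = [5349, 6363, 12276, 10706]
print(pearson(disputes, workers), linear_fit(disputes, workers))
```

Four observations are far too few for a meaningful estimate; the values only show how the functions are called.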
The data processing engine
● should be able to split values from any dataset into categories specified by the user (8 by default)
● the algorithm used to assign data values to categories should be selectable (percentile by default)
● define the maximum possible number of categories for a specific dataset if the number specified by the user cannot be reached (for example, if there are only 2-3 categories of data values)
● data ranges should be defined so that data can be visualized on a chart or map at the right scale
● colors should be specified by the user (ColorBrewer palettes, see http://colorbrewer2.org)
● a legend should be generated and attached to all visualizations
● values with missing data should be shown as 'no data' regions on the map
● all data values should be delivered by the Data API to make the engine platform independent and able to communicate with other systems
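The categorization requirements above can be sketched as a percentile classifier. It uses nearest-rank percentiles, which is an assumption (the slides only say "percentile"), reduces the number of categories when the data has fewer distinct values than requested, and reports missing values as 'no data'. All function names are illustrative.

```python
import math
from bisect import bisect_left


def percentile_breaks(values, categories=8):
    """Upper boundaries of up to `categories` percentile classes."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return []
    # Never create more classes than there are distinct values.
    classes = min(categories, len(set(data)))
    breaks = set()
    for i in range(1, classes + 1):
        rank = math.ceil(i * len(data) / classes)  # nearest-rank percentile
        breaks.add(data[rank - 1])
    return sorted(breaks)


def classify(value, breaks):
    """Return the class index of a value, or 'no data' if it is missing."""
    if value is None:
        return "no data"
    return min(bisect_left(breaks, value), len(breaks) - 1)


def legend(breaks):
    """Human-readable labels for each class plus the 'no data' class."""
    labels, lower = [], None
    for upper in breaks:
        labels.append(f"<= {upper}" if lower is None else f"> {lower} and <= {upper}")
        lower = upper
    return labels + ["no data"]


values = [5349, 6363, 12276, 10706, None]  # "Workers Involved", one missing
breaks = percentile_breaks(values)
print(breaks, [classify(v, breaks) for v in values], legend(breaks))
```

A class index can then be mapped to one color of a user-selected palette, so that chart, map and legend share the same scale.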
API Service (Data APIs)
Data API is the most important functionality for a well-equipped digital infrastructure:
● easy way to analyze data in popular statistical packages (STATA, SPSS, Excel)
● use common data science programming languages like Python and R to perform more advanced research using external Data Science libraries
● analyze data with toolboxes like Wolfram|Alpha and other Discovery Platforms (added value for the future)
● suitable for other researchers and developers to use advanced techniques and data mining tools that aren’t developed yet
Online Demo
We had 3 days to build a demo of the MVP (Minimum Viable Product) and show the possibilities for integrating DataVerse with the data exploration tool (24 hours of Agile development):
Collabs web page with published dataset (Collection stuff):
http://dv.sandbox.socialhistoryservices.org/dataset.xhtml?persistentId=hdl:10622/644Z77
Interactive dashboard developed from scratch to explore dataset variables (based on the NLGIS data processing engine) and Geacron world maps (Research):
http://clearance.sandbox.socialhistoryservices.org/collabs/dashboard?dataset=hdl:10622/644Z77:23:24&action=map
Can be extended with various data analysis tools and any visualizations
Data Exploration dashboard based on API token (Demo)
Thank you!
Any suggestions?