Clio infra Collabs data analysis tools

CLIO INFRA

Digital infrastructure, Collaboration and Data Visualization Platform

Vyacheslav Tykhonov, Software Developer, Architect

vty@iisg.nl May 2015

Collabs functionality

Collabs functionality will be based on DataVerse, a data management and collaboration platform:
- draft and published datasets for individual users
- teams can collaboratively curate and analyze research datasets
- the dataset version control system tracks changes in datasets
- other researchers can download their own copy of the data if the dataset is published as Open Data

DataVerse is a flexible metadata store (repository) connected to a dataset store (storage)

Value for future Researchers

The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):

● access to a specific case study, citing and finding data
● access to the universe of data from the DataVerse network, which can organize and display it for browsing and searching
● data filtering: researchers with proper authorization can obtain a subset of the data provided by the data collector
● data analysis to run descriptive statistics and graphics, visualization, and plotting on historical maps
● Data APIs to export data for further analysis by popular statistical packages (STATA, SPSS, R, iPython Notebook) and by advanced data mining tools that will be developed in the future (always up-to-date solution)

DataVerse 4.0 strengths

● “all-in-one” searching through Dataverses, datasets and files with a basic search bar
● improved navigation with advanced facets to browse Dataverses, datasets and files
● metadata fields found in results are ordered by the relevance of the result
● users can access all public data from the homepage (cataloging system)
● authorized users can work together (download/edit) on the same datasets and exchange API tokens to visualize datasets with limited access
● advanced support for different types of datasets: Excel and CSV files

Missing features in DataVerse 4.0

● the interactive data exploration tool is too basic for data exploration
● filtering data by a specific year and topic is not implemented
● a data quality check is missing
● a new tool providing histograms, cross tabulations and enhanced descriptive statistics will be developed in the future, but it is not clear when or what it will look like (and we can contribute!)
● no visualization of datasets on historical maps (although there is some work on Worldmaps integration)

Collaboration possibilities

Descriptive metadata:
- dataset file
- documentation
- code

Sharing and collaboration capabilities:
- every researcher who is a DataVerse user requests a unique API token
- researchers exchange API tokens and grant permissions to work on the same datasets as a team

Dataset Workflow in DataVerse

Upload data -> Draft Dataset (v.1) -> Published Dataset (v.1.1) -> Published Dataset (v.2)

From the presentation of Mercè Crosas, Director of Data Science at the Institute for Quantitative Social Science (IQSS) at Harvard University

DataVerse Integration

Published Dataset -> API token -> Data Processing Engine
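
The integration chain above can be sketched as a single authenticated request. This is a minimal sketch, assuming the DataVerse native API conventions (a `/api/datasets/:persistentId/` endpoint and an `X-Dataverse-key` header); both should be checked against the installed DataVerse version. The function name and the token value are placeholders, not part of the original demo.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_dataset_request(base_url, persistent_id, api_token):
    """Build (but do not send) a request for the metadata of one dataset."""
    query = urlencode({"persistentId": persistent_id})
    # Assumed endpoint layout of the DataVerse native API
    url = f"{base_url}/api/datasets/:persistentId/?{query}"
    # Assumed header name used to pass the researcher's API token
    return Request(url, headers={"X-Dataverse-key": api_token})


req = build_dataset_request(
    "http://dv.sandbox.socialhistoryservices.org",
    "hdl:10622/644Z77",
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder, not a real token
)
```

The data processing engine would then pass `req` to `urllib.request.urlopen` and parse the JSON response; the request is only built here so that the sketch runs without network access.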

Dataset Sample

World strikes dataset in CSV format stored as ILO_Combine_1927-2008_20141111_indicators_07_clean2.tab file in DataVerse

"","Disputes","days","Workers Involved","Index","Log Index","location","year"
"1",NA,NA,NA,NA,NA,"Algeria","1927"
"2",NA,NA,NA,NA,NA,"Algeria","1928"
...
"43",42,25142,5349,0.079556548692188,-2.53128720587461,"Algeria","1969"
"44",57,25771,6363,0.0811251873931524,-2.51176179402955,"Algeria","1970"
"45",70,52161,12276,0.11951502046322,-2.12431322125655,"Algeria","1971"
"46",100,40588,10706,0.149763559930392,-1.89869749563266,"Algeria","1972"
…
"266",142,2047601,333929,0.976359111730421,-0.0239248178964878,"Argentina","1946"
"267",64,3467193,541377,1.57839775815381,0.456410255396957,"Argentina","1947"
"268",103,3158947,278179,1.18885399297973,0.172989812001577,"Argentina","1948"
"269",36,510352,29164,0.218578879961968,-1.5206083229028,"Argentina","1949"

Dataset published in DataVerse (Demo)

Geoservice and Geocoder

Will be delivered by Webmapper and integrated with the Clio Infra infrastructure

Geoservice requirements:
- should provide actual historical polygons for all countries and regions
- polygons should be available as GeoJSON/TopoJSON for online visualization
- the QGIS toolbox should be supported to upload new maps and update old maps as shapefiles

Geocoder should standardize all geographical locations in different datasets:
- USSR and Soviet Union should be recognized as one country with the same PID
- Germany before and after 1990
- Indonesia before and after 1999
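
A minimal sketch of how such a geocoder lookup could behave, assuming a table of name variants and validity periods. All PIDs, year boundaries and names in the table are hypothetical placeholders for illustration; the real identifiers would come from the Webmapper geocoder.

```python
# Hypothetical synonym table: spelling variants -> canonical name
SYNONYMS = {"ussr": "soviet union"}

# Hypothetical entities: (canonical name, first year, last year, PID)
# None means the period is open-ended on that side.
ENTITIES = [
    ("soviet union", 1922, 1991, "pid:su"),
    ("germany", None, 1989, "pid:de-before-1990"),
    ("germany", 1990, None, "pid:de-from-1990"),
]


def geocode(name, year):
    """Return the PID for a location name in a given year, or None."""
    key = " ".join(name.lower().split())  # normalize case and spacing
    key = SYNONYMS.get(key, key)
    for canonical, start, end, pid in ENTITIES:
        in_period = (start is None or year >= start) and (end is None or year <= end)
        if canonical == key and in_period:
            return pid
    return None  # unknown location or year outside every known period


same_pid = geocode("USSR", 1950) == geocode("Soviet Union", 1950)
```

Keeping the year in the lookup is what lets one name resolve to different polygons before and after a border change.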

Dataset Ecosystem

Data-processing activity:
- ingestion, curation, integration
- discovery of dataset and data quality check
- analyze and clean potentially unstructured and noisy data
- data analytics and statistics
- data visualization
- Data APIs for advanced research
- linking data to other datasets (data munging)
- development of data-processing application

Dataset Quality Check

Benford’s law to test the quality of data
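
Benford's law states that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d). A minimal sketch of such a check, assuming a chi-square statistic is used to compare observed and expected first-digit counts (the slides do not say which test statistic is intended; the function names are illustrative):

```python
import math
from collections import Counter


def first_digit(value):
    """Return the leading non-zero digit of a number, or None if there is none."""
    for ch in repr(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None  # zero, NaN or infinity


def benford_chi_square(values):
    """Chi-square distance between observed first digits and Benford's law."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no usable values")
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = len(digits) * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2


# Powers of two are a classic Benford-like sequence
powers_chi2 = benford_chi_square([2 ** n for n in range(100)])
```

With 8 degrees of freedom, a statistic above roughly 15.5 would be flagged at the 5% level. A high value is only a hint that a variable deserves a closer look, since not every variable is expected to follow Benford's law.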

Linked Edit Rules: a methodology to publish, link, combine and execute edit rules on the Web as Linked Data in order to verify the consistency of statistical datasets and detect incorrectly filled data values (for example, characters in values where numbers are expected)

Chart visualization to get an overview of missing data values
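
A minimal sketch of a missing-values overview, assuming the CSV layout of the dataset sample with "NA" marking missing values. The function names and the text-based bar chart are illustrative only; the dashboard would draw a real chart.

```python
import csv
import io

# A few rows copied from the dataset sample
SAMPLE = '''\
"","Disputes","days","Workers Involved","Index","Log Index","location","year"
"1",NA,NA,NA,NA,NA,"Algeria","1927"
"2",NA,NA,NA,NA,NA,"Algeria","1928"
"43",42,25142,5349,0.079556548692188,-2.53128720587461,"Algeria","1969"
"44",57,25771,6363,0.0811251873931524,-2.51176179402955,"Algeria","1970"
'''


def missing_share(csv_text, na_token="NA"):
    """Return {column: share of rows whose value is empty or the NA token}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    shares = {}
    for column in rows[0]:
        missing = sum(1 for row in rows if row[column] in ("", na_token))
        shares[column] = missing / len(rows)
    return shares


def text_chart(shares, width=20):
    """Render one bar per column; longer bars mean more missing values."""
    return [f"{name:<18} {'#' * round(share * width)} {share:.0%}"
            for name, share in shares.items()]


shares = missing_share(SAMPLE)
chart = text_chart(shares)
```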

Data Exploration Workflow

- Draft Datasets are visible only to their owners
- with API tokens, researchers can access an interactive dashboard to get some insights about the data
- the dashboard can provide access to all variables from the dataset and visualize them on charts, graphs, historical maps and treemaps

After a dataset is ready to go public, it can be published in Collabs:
- guest users can download a copy of the dataset
- team members with permissions and an authorized API token can contribute to the dataset

Combining and aggregating datasets in data exploration tools

● tools can be developed to recognize the same variable under different names in different datasets (Year and Jaar)

● geocoding services can standardize different regions (Netherlands → NL, Amsterdam → AMS)

● all possible relationship paths can be ranked from the “best” (100%) to the “worst” (0%), for example, value “05” for variable “Month” can be recognized as “May”

● standardized datasets can be used as “reference” data for other datasets from other researcher groups and depot services (CHIA, Harvard DataVerse Network, MIT, DANS)
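
A minimal sketch of ranking candidate variable matches from best to worst. It assumes a small synonym table combined with plain string similarity from Python's `difflib`; this is a stand-in technique for illustration, not the matching method planned for Collabs, and the synonym table is hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table: foreign-language variable name -> reference name
SYNONYMS = {"jaar": "year"}


def rank_matches(column, reference_names):
    """Rank reference variable names for a column, best match (1.0) first."""
    key = column.strip().lower()
    key = SYNONYMS.get(key, key)
    scored = []
    for name in reference_names:
        # ratio() is 1.0 for identical strings and 0.0 for no common characters
        score = SequenceMatcher(None, key, name.lower()).ratio()
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_matches("Jaar", ["location", "year", "Workers Involved"])
best_name, best_score = ranking[0]
```

The scores can be shown to the researcher as percentages so that an uncertain match is confirmed by hand rather than applied silently.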

Infrastructure will be suitable for different datasets

Quantitative datasets store quantity data values (numbers):
- data can be measured (length, speed, height, age)
- example: current clio-infra datasets

Qualitative datasets store quality observations (descriptions):
- data can be observed (colors, textures, professions, groups)
- example: HISCO, world strikes dataset

Data visualization is different for different kinds of datasets:
- quantity can be plotted on charts, graphs, historical maps
- quality (hierarchy) can be visualized on treemaps and maps

Our solution for data exploration and analysis to extend DataVerse

● data quality check during upload/update of dataset
● automatic ingestion and recognition of years and locations in datasets
● integrated with geocoder and geoservice developed by Webmapper
● interactive dashboard to do visual exploration of variables from dataset (graph / histogram / scatterplot)
● data processing engine to plot data on historical maps (if dataset has geospatial data)
● correlation for continuous variables
● regression analysis for estimating the relationships among variables
● building treemaps for qualitative data analysis (hierarchical data coming soon)
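
The correlation and regression items above can be illustrated with a short sketch. It assumes Pearson correlation and ordinary least squares for two continuous variables, with missing values (None) dropped pairwise; the slides do not specify which estimators the tool will use, and the function name is illustrative.

```python
import math


def correlate(xs, ys):
    """Pearson r plus least-squares slope and intercept of y on x."""
    # Keep only observations where both variables are present
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete observations")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined correlation")
    slope = sxy / sxx
    return {
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


# Disputes and Workers Involved for Algeria, 1927 and 1969-1972 (from the sample)
disputes = [None, 42, 57, 70, 100]
workers = [None, 5349, 6363, 12276, 10706]
result = correlate(disputes, workers)
```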

The data processing engine

● should be able to split values from any dataset into a number of categories specified by the user (8 by default)
● the algorithm used to assign data values to categories should be selectable (percentile by default)
● the maximum possible number of categories should be determined for a specific dataset if the number specified by the user cannot be reached (for example, if there are only 2-3 categories of data values)
● data ranges should be defined so that data can be visualized on a chart or map at the right scale
● colors should be specified by the user (Color Brewing, see http://colorbrewer2.org)
● a legend should be generated and attached to all visualizations
● values with missing data should be shown as 'no data' regions on the map
● all data values should be delivered by the Data API to make the engine platform independent and able to communicate with other systems
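
A minimal sketch of the percentile classification described above: 8 categories by default, fewer when the data has fewer distinct values, and missing values kept apart as 'no data'. The function names and the exact quantile rule are assumptions for illustration, not the NLGIS engine's implementation.

```python
import bisect
import math


def percentile_breaks(values, categories=8):
    """Return ascending upper bounds that split the values into quantile classes."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return []
    # Never ask for more classes than there are distinct values
    k = min(categories, len(set(data)))
    breaks = []
    for i in range(1, k + 1):
        bound = data[math.ceil(i * len(data) / k) - 1]
        if not breaks or bound > breaks[-1]:  # skip duplicate bounds
            breaks.append(bound)
    return breaks


def classify(value, breaks):
    """Return the class index of a value, or None for 'no data'."""
    if value is None or not breaks:
        return None
    return min(bisect.bisect_left(breaks, value), len(breaks) - 1)


breaks = percentile_breaks(list(range(1, 17)))
classes = [classify(v, breaks) for v in (1, 3, 16, None)]
```

The class index can then select a color from a user-chosen palette, and the list of bounds gives the ranges needed for the legend.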

API Service (Data APIs)

Data APIs are the most important functionality of a well-equipped digital infrastructure:
● an easy way to analyze data in popular statistical packages (STATA, SPSS, Excel)
● use of common data science programming languages such as Python and R to perform more advanced research with external data science libraries
● analysis of data with toolboxes like Wolfram|Alpha and other discovery platforms (added value for the future)
● suitable for other researchers and developers who want to use advanced techniques and data mining tools that aren’t developed yet

Online Demo

We’ve got 3 days to build a demo of the MVP (Minimum Viable Product) and show the possibilities for integrating DataVerse with the data exploration tool (24 hours of Agile development):

Collabs web page with published dataset (Collection stuff):
http://dv.sandbox.socialhistoryservices.org/dataset.xhtml?persistentId=hdl:10622/644Z77

Interactive dashboard developed from scratch to explore dataset variables (based on the NLGIS data processing engine) and Geacron world maps (Research):
http://clearance.sandbox.socialhistoryservices.org/collabs/dashboard?dataset=hdl:10622/644Z77:23:24&action=map

Can be extended with various data analysis tools and any visualizations

Data Exploration dashboard based on API token (Demo)

Thank you!

Any suggestions?
