Date post: | 22-Jun-2015 |
Category: | Technology |
Upload: | incisiveevents |
DATA LIBERATION
Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier
Tony Hirst
Department of Communication and Systems
The Open University
data NOT information
Craft by Vicky Hugheston
[Disruptive Innovation?]
“First” generation: data catalogues
Breathing life into data…
=importData("CSV_URL")
Google Sheets
the spreadsheet becomes
A DATABASE
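Outside a spreadsheet, the same "CSV becomes a database" idea can be sketched in a few lines of Python. This is only an illustration: the CSV text, column names and table name below are made up, and stand in for whatever a real CSV_URL would return.

```python
import csv
import io
import sqlite3

# Hypothetical CSV text standing in for the contents of CSV_URL.
CSV_TEXT = """council,year,spend
Alpha,2014,120
Beta,2014,80
Alpha,2015,150
"""

def load_csv_into_sqlite(csv_text, table="data"):
    """Load CSV text into an in-memory SQLite table and return the connection."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    return conn

conn = load_csv_into_sqlite(CSV_TEXT)

# Once loaded, the data can be queried like any other database table.
totals = conn.execute(
    "SELECT council, SUM(CAST(spend AS INTEGER)) "
    "FROM data GROUP BY council ORDER BY council"
).fetchall()
print(totals)
```

The values are stored as text, which is why the query casts spend to an integer before summing.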
Google Charts
Visualisation API
“Second” generation: data management systems
DMS – Data Management System
BUT
There’s lots more data that’s locked up in web pages…
Scraping…
“grabbing web content in a machine readable
format and then processing it for your
own purposes”
DIY API
Original HTML web page
Accessible web page
Extract Information
-> data
Recreating the database that was used
to populate a (templated) page
Implied semantics
…quick’n’dirty: =importHTML("pageURL","table",N)
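A rough Python equivalent of pulling the Nth table out of a page, using only the standard library. The class and function names and the sample page are hypothetical. The sketch assumes simple, non-nested tables and, like the spreadsheet formula, counts tables from 1.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def import_html_table(html, n):
    """Return the n-th table (counting from 1) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[n - 1]

# Hypothetical page content; a real scraper would fetch this from pageURL.
PAGE = """
<html><body>
<table><tr><td>ignore</td></tr></table>
<table>
  <tr><th>Name</th><th>Votes</th></tr>
  <tr><td>Ann</td><td>10</td></tr>
  <tr><td>Bob</td><td> 7 </td></tr>
</table>
</body></html>
"""

rows = import_html_table(PAGE, 2)
print(rows)
```

The header row comes back as just another row, so the "implied semantics" of the table still have to be supplied by whoever uses the data.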
PDF scraping
Scrapers
Views
Scraper SQLite database
SQLite database Scraper
Sometimes the data is spread
across different files…
Row based aggregation
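Row based aggregation can be sketched as stacking the rows of several files that share the same columns. The file names and contents below are invented for illustration; the extra "source" column is an assumption, added so each row can still be traced back to its file.

```python
import csv
import io

# Hypothetical files with identical column headings.
FILES = {
    "spend_2014.csv": "council,spend\nAlpha,120\nBeta,80\n",
    "spend_2015.csv": "council,spend\nAlpha,150\n",
}

def stack_rows(files):
    """Append the rows of every file into one list, tagging each with its source."""
    combined = []
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            record = dict(row)
            record["source"] = name  # remember where the row came from
            combined.append(record)
    return combined

combined = stack_rows(FILES)
for record in combined:
    print(record)
```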
Sometimes the data is spread
across different websites…
…Normalisation…
Data Enrichment
Column Additions/Annotations
Sometimes the data is split
across different files…
Column based merge
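A column based merge can be sketched as a lookup on a shared key column. The datasets and the "council_id" key are hypothetical. This version keeps every row of the left-hand dataset and adds columns where a match exists; if the right-hand dataset repeats a key, the last row with that key wins.

```python
def merge_on_key(left, right, key):
    """Add the columns of `right` to the rows of `left` that share the same key value."""
    lookup = {row[key]: row for row in right}
    merged = []
    for row in left:
        extra = lookup.get(row[key], {})  # no match: nothing is added
        merged.append({**row, **extra})
    return merged

# Hypothetical datasets that describe the same things in different columns.
spend = [
    {"council_id": "E01", "spend": 120},
    {"council_id": "E02", "spend": 80},
]
population = [
    {"council_id": "E01", "population": 5000},
]

merged = merge_on_key(spend, population, "council_id")
print(merged)
```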
-> Data cleansing
Clustering…
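One common way to cluster messy text values is to reduce each value to a normalised "fingerprint" and group the values whose fingerprints collide. The sketch below is a simplified version written for illustration, not the exact method used by any particular tool, and the sample names are made up.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a normalised key: no accents, case, punctuation or word order."""
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = ascii_text.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    tokens = sorted(set(cleaned.split()))  # drop repeated words, ignore word order
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical variants of the kind found in hand-entered data.
names = [
    "Open University",
    "open university.",
    "University, Open",
    "Université Ouverte",
    "The Open University",
]

clusters = cluster(names)
print(clusters)
```

Values that differ by an extra word, such as "The Open University", do not collide here, which is why a person normally reviews each proposed cluster before merging.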
OpenRefine
http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/
/via Martin Hawksey/@mhawksey
“Finessing” a common identifier
Common identifiers (common KEYS) make
it MUCH easier to JOIN datasets by column
Book Title -> ISBN
I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc.
Reconciliation…
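Reconciliation can be sketched as matching a free-text name against a reference list that already carries identifiers. The reference names and the "ID-…" identifiers below are invented, and the approximate string matching from Python's difflib module is just one possible matching strategy, chosen here for brevity.

```python
import difflib

# Hypothetical reference list: canonical name -> identifier.
REFERENCE = {
    "The Open University": "ID-001",
    "University of Oxford": "ID-002",
    "University of Cambridge": "ID-003",
}

def reconcile(name, reference, cutoff=0.8):
    """Return (canonical name, identifier) for the closest reference entry, or None."""
    by_lower = {label.lower(): label for label in reference}
    matches = difflib.get_close_matches(
        name.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # nothing similar enough: leave for manual review
    label = by_lower[matches[0]]
    return label, reference[label]

print(reconcile("the open univeristy", REFERENCE))
print(reconcile("Hogwarts", REFERENCE))
```

Once each messy name has been swapped for its identifier, the dataset can be joined to others on that identifier instead of on the unreliable text.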
OpenRefine
Linked Data™
So who speaks SPARQL?
Diners - Journal Canteen by avlxyz
You DON’T have to….
Just think about how one piece of data might be related to another
through a common means of addressing them…
http://ouseful.info
@psychemedia