
“Docker” for (open)data

John Krauss
The Governance Lab at NYU

Brooklyn, New York, USA
john@thegovlab.org

Arnaud Sahuguet
The Governance Lab at NYU

Brooklyn, New York, USA
arnaudsahuguet@gmail.com

ABSTRACT
Open data has received a lot of attention. Some speak of open data as revolutionary, as much as open source was for software. The reality is somewhat different.

Someone – a civic hacker, a data journalist or a researcher in academia – willing to make sense of the data usually has to go through the following steps: (a) find the datasets they need, (b) download the datasets, (c) realize they are missing a few, and download those too, (d) create a schema and load the datasets into a database, (e) join the datasets into one or more views that permit analysis, and finally (f) run queries against the datasets. This is not only time consuming (usually half a day of work) but also extremely frustrating. The process must be repeated whenever the datasets get updated.

In this paper, we present a novel approach for packaging open data that we call Docker for Data, inspired by the eponymous cloud container solution. With Docker for Data, we package datasets into coherent and self-contained units that can be deployed with a few clicks and used within minutes instead of hours. We also present a concrete application in the context of an open data investigation centered on housing costs in New York City.

INTRODUCTION
Open data has received a lot of attention. Some speak of open data as revolutionary, as much as open source was for software. The Web in general, Android, and most of the technology sector would probably not be as successful without open-source technologies.

“Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike” [2]. According to McKinsey [9], this can translate into $3 to $5 trillion in economic impact. Open data also creates large opportunities for innovation [17] and social impact [8].

But in practice, the reality is somewhat different, as described by some leaders of the open data movement, e.g. [6, 15]. Here is a somewhat contrived view of the current situation. Someone – a civic hacker, a data journalist or a researcher in academia – willing to make sense of the data usually has to go through the following steps: (a) find the datasets they need, (b) realize that they are missing a few, (c) download the datasets, (d) create a schema and load the datasets into a database, (e) join the datasets into one or more views that make sense, and finally (f) run queries against the datasets. This is not only time consuming (usually half a day of work) but also extremely frustrating. And the process has to be repeated whenever the datasets get updated.

Bloomberg Data for Good Exchange Conference. 28-Sep-2015, New York City, NY, USA.

In this paper, we present a novel approach to packaging open data datasets called Docker for Data – inspired by the eponymous [3] cloud container solution – where we package datasets into coherent and self-contained units that can be deployed with a few clicks and used within minutes. The rest of this paper is organized as follows. We start with a motivating example of open data in the context of real estate speculation in New York City and the difficulties one faces when trying to make sense of the data. We then describe the architecture of Docker for Data and show how it can be applied to our example. Before we conclude, we present some related and future work in this space.

MOTIVATION
In the current state, if you are interested in exploring or answering questions using open data, you have a few options.

First, you can rely on existing open data portals like Socrata [14] or CKAN [1], which are the go-to solutions for most cities. But such portals are really dataset-centric and offer very limited query capabilities, one dataset at a time. So, if your exploration requires combining multiple datasets, you are out of luck.

Another option is to leverage cloud service providers like Amazon or Google. Either the datasets have already been uploaded, or you need to upload them yourself. Then you can leverage their respective “big-query” solutions. This comes with some limitations in terms of what you can do. It also implies that whatever proprietary data you want to combine or “secret query” you want to run will be uploaded to someone else’s cloud.

The last option is the DIY solution, as described in the introduction, which is both time consuming and frustrating.

The inspiration for this work was an interest in the housing market in New York City. Affordable housing has received a lot of attention in cities like New York City [4], San Francisco [16], etc., as the prices of both buying a building and renting an apartment have risen drastically in just a few years. People in the community, advocates and politicians all want to understand why the city is becoming so expensive, what trends can be discovered, and what they can learn from such trends to try to keep their communities in place.

Remarkably, New York City makes publicly available its entire register of deeds and mortgages, called ACRIS [10], as open data in a machine-readable format. Despite this, the data has seen little use in the community. As a simple data exploration starting point, putting points on a map showing where property values were increasing rapidly or where a certain bank was lending would require:

1. Identifying which tables you would need from the open data portal to get the transactions (one table), the names of buyers, sellers and banks (another table), and the lot identifiers for the properties sold or mortgaged (another table).

2. Downloading these tables, which total 5 GB of CSVs, from the data portal. This could take several hours if the downloads don’t fail; you may have to start over from scratch if one does.

3. Finding the geographic tables, which are linked to the open data portal but live elsewhere as shapefiles. You would need some background knowledge of how to join lot identifiers to geographic info in order to know this table was the one you need.

4. Downloading the shapefile and converting it into a database format for the join. You’ll also need to do this five times, as each borough is in a separate table.

5. Loading the CSVs into a database for the join. They’re too big to work with in a spreadsheet application. There’s no officially documented schema, so you’ll need to figure it out yourself or search around on the web for one. The load itself will take at least twenty minutes on standard hardware unless you have installed more specialized software like pgloader.

6. Adding the appropriate indices so the join is not very slow, and writing a view or derivative table that combines the four so you can query what you want to look for faster.

The data is open, but the process of using it is still so difficult as to discourage use. Even after a civic-minded coder goes through the above process, it’s unclear how they can share their transformations with others. If you are simply a data enthusiast (with limited hacking skills), the barrier to entry is simply too high.
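
To give a sense of what steps 4 through 6 amount to by hand, here is a minimal sketch, assuming a local Postgres/PostGIS instance, the psycopg2 driver and the shp2pgsql utility; the file, table and column names are illustrative, not the actual ACRIS schema.

# Minimal sketch of the manual workflow (steps 4-6); names are illustrative.
import subprocess
import psycopg2

conn = psycopg2.connect("dbname=acris")
cur = conn.cursor()

# Step 5: hand-write a schema (the real dataset has dozens of columns you
# must reverse-engineer) and bulk-load one of the downloaded CSVs.
cur.execute("""
    CREATE TABLE IF NOT EXISTS real_property_master (
        document_id   TEXT,
        document_date DATE,
        document_amt  NUMERIC
    )""")
with open("acris_real_property_master.csv") as f:
    cur.copy_expert(
        "COPY real_property_master FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# Step 4: convert and load one borough's tax-lot shapefile (repeat five times).
subprocess.run("shp2pgsql -s 2263 MN_mappluto.shp tax_lots | psql acris",
               shell=True, check=True)

# Step 6: index so the join is usable before building a combined view.
cur.execute("CREATE INDEX IF NOT EXISTS master_docid_idx "
            "ON real_property_master (document_id)")
conn.commit()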

ARCHITECTURE & IMPLEMENTATION
Docker [3] “allows you to package an application with all of its dependencies into a standardized unit for software development”. Borrowing from the Docker playbook (and DevOps more generally), we package datasets into self-contained data bundles that can be deployed quickly and easily.

Let’s start with a little terminology. We have some data sources, usually city or agency data portals running software like Socrata or CKAN. Each data source contains one or more datasets, usually CSV files. Docker for Data defines recipes that map datasets into data dumps.

For Docker for Data, we are using Postgres as the database. A recipe defines a relational schema for a given dataset and describes the translation from a CSV file or shapefile into a Postgres data dump. For each dataset there is a corresponding recipe. Using existing metadata from the dataset, the translation can be done automatically. Other recipes can be defined manually to combine datasets together, either as relational views or brand new tables.
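
As a rough illustration of what this automatic translation can rely on, the sketch below derives a schema from a Socrata view’s JSON metadata (the same .json endpoint referenced in the recipe shown later). The key names ("columns", "fieldName", "dataTypeName") and the type map are assumptions about the portal’s metadata format, not the actual d4d implementation.

# Sketch: derive a CREATE TABLE statement from Socrata view metadata.
import json
import urllib.request

TYPE_MAP = {"text": "TEXT", "number": "NUMERIC", "calendar_date": "TIMESTAMP"}

def schema_from_socrata(view_json_url: str, table: str) -> str:
    with urllib.request.urlopen(view_json_url) as resp:
        view = json.load(resp)
    cols = [
        '  "{}" {}'.format(c["fieldName"], TYPE_MAP.get(c["dataTypeName"], "TEXT"))
        for c in view["columns"]
    ]
    return "CREATE TABLE {} (\n{}\n);".format(table, ",\n".join(cols))

# Hypothetical invocation:
# print(schema_from_socrata(
#     "http://www.opendatacache.com/data.cityofnewyork.us/api/views/bnx9-e6tj.json",
#     "acris_real_property_master"))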

The GovLab maintains a build server, whose processes can be viewed at http://build.dockerfordata.com, which reads the recipes and uploads the SQL output to S3. The build server is also packaged as a Docker container and could easily be run by a third party.
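
The publish step of that build loop could look roughly like the following; the database name, bucket and key layout are assumptions for illustration, not the real build server’s code.

# Sketch: dump the table a recipe produced and push it to S3 for clients.
import subprocess
import boto3

def publish(table: str, bucket: str = "docker4data") -> None:
    dump_path = "/tmp/{}.sql.gz".format(table)
    # Dump just this table from the (assumed) build database and compress it...
    subprocess.run("pg_dump -t {} d4d | gzip > {}".format(table, dump_path),
                   shell=True, check=True)
    # ...then upload the dump where clients can fetch it.
    boto3.client("s3").upload_file(dump_path, bucket,
                                   "dumps/{}.sql.gz".format(table))

publish("acris_real_property_master")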

Our Docker client (deployed by you, the user) makes it easy to look for recipes, download the corresponding data dump locally, and load the data into the database.
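
Conceptually, installing a dataset on the client side boils down to fetching the pre-built dump and streaming it into the local Postgres container, roughly as sketched below; the URL layout and database name are assumptions.

# Sketch: what "installing" a dataset amounts to on the client side.
import subprocess
import urllib.request

def install(dataset: str) -> None:
    url = "https://docker4data.s3.amazonaws.com/dumps/{}.sql.gz".format(dataset)  # assumed layout
    local = "/tmp/{}.sql.gz".format(dataset)
    with urllib.request.urlopen(url) as resp, open(local, "wb") as out:
        out.write(resp.read())
    # Restore the dump into the local d4d Postgres instance.
    subprocess.run("gunzip -c {} | psql d4d".format(local), shell=True, check=True)

install("acris_real_property_master")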

The architecture of Docker for Data is depicted in Figure 1.

Figure 1: Our architecture.

REVISITING OUR HOUSING EXAMPLE
In this section, we revisit our motivating example and show how we are doing it in the context of Docker for Data. The source code is available from http://github.com/talos/docker4data.

The ACRIS deed and mortgage data is published by the City of New York on its open data portal. The build server is able to automatically generate recipes from the portal. This is one of the three automatically generated recipes:

source = data/socrata/data.cityofnewyork.us/acris_real_property_master/data

data: http://www.opendatacache.com/data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv
description: Document Details for Real Property Related Documents Recorded in ACRIS
maintainer:
  id: https://github.com/talos/docker4data
metadata:
  attribution: Department of Finance (DOF)
  category: City Government
  description: Document Details for Real Property Related Documents Recorded in ACRIS
  socrata:
    id: http://www.opendatacache.com/data.cityofnewyork.us/api/views/bnx9-e6tj.json
name: ACRIS - Real Property Master
status: needs review
table: acris_real_property_master

-- automatically generated

The other two recipes are very similar and can be found here and here.

ACRIS does not include geographical data, so it has to be combined with the city’s tax lot map. This is not hosted on the open data portal, but this user-contributed recipe adds it to Docker for Data:

source = data/contrib/us/ny/nyc/pluto/data.json

data: http://www.nyc.gov/html/dcp/download/bytes/nyc_pluto_14v1.zip
table: pluto

source = data/contrib/us/ny/nyc/pluto/before.sh

#!/bin/bash
unzip data
ls *.csv | tail -n 1 | xargs head -n 1 | sed -E 's/ +/ /g' > data.concatenated
ls *.csv | xargs tail -q -n +2 | sed 's/[^[:print:]]//g' | sed -E 's/ +/ /g' >> data.concatenated
mv data.concatenated data

With all the data sources available on Docker for Data, a final bundle can be defined that combines them into a usable table showing deed transfers in NYC from 1966 to the present:

source = data/contrib/us/ny/nyc/deeds/data.json

requirements:
  socrata/data.cityofnewyork.us/acris_real_property_master: latest
  socrata/data.cityofnewyork.us/acris_real_property_legals: latest
  socrata/data.cityofnewyork.us/acris_real_property_parties: latest
  socrata/data.cityofnewyork.us/acris_document_control_codes: latest
  contrib/us/ny/nyc/pluto: latest
description: All real property sales with location for New York City from 1966 to the present, derived from ACRIS
maintainer:
  id: https://github.com/talos/docker4data
metadata:
  category: City Government
table: deeds

and

source = data/contrib/us/ny/nyc/deeds/after.sql

CREATE TABLE deeds_master AS
SELECT DISTINCT (
    CASE substr(document_id, 0, 3)
      WHEN '20' THEN document_id
      WHEN 'FT' THEN '100' || substr(document_id, 4)
      WHEN 'BK' THEN '000' || substr(document_id, 4)
      ELSE document_id END)::BIGINT AS document_id,
  m.good_through_date, m.document_date,
  m.document_amt, m.recorded_datetime, m.modified_date,
  dcc.doc__type_description, dcc.doc__type AS doc_type
FROM "socrata.data.cityofnewyork.us.acris_real_property_master" m,
     "socrata.data.cityofnewyork.us.acris_document_control_codes" dcc
WHERE dcc.class_code_description = 'DEEDS AND OTHER CONVEYANCES' AND
      dcc.doc__type = m.doc_type;

DELETE FROM deeds_master USING deeds_master alias
WHERE deeds_master.document_id = alias.document_id AND
      deeds_master.good_through_date < alias.good_through_date;

CREATE UNIQUE INDEX deeds_master_docid ON deeds_master (document_id);

CREATE TABLE deeds_parties AS
SELECT DISTINCT m.document_id, p.good_through_date,
  CASE p.party_type
    WHEN '1' THEN dcc.party1_type
    WHEN '2' THEN dcc.party2_type
    WHEN '3' THEN dcc.party3_type
    ELSE p.party_type END AS party_type,
  p.name, p.addr1, p.addr2, p.country, p.city, p.state, p.zip
FROM "socrata.data.cityofnewyork.us.acris_real_property_parties" p,
     deeds_master m,
     "socrata.data.cityofnewyork.us.acris_document_control_codes" dcc
WHERE (CASE substr(p.document_id, 0, 3)
      WHEN '20' THEN p.document_id
      WHEN 'FT' THEN '100' || substr(p.document_id, 4)
      WHEN 'BK' THEN '000' || substr(p.document_id, 4)
      ELSE p.document_id END)::BIGINT = m.document_id AND
    m.good_through_date = p.good_through_date AND
    SUBSTR(p.document_id, 4) ~ '^[0-9]+$' AND
    dcc.doc__type = m.doc_type;

CREATE INDEX deeds_parties_docid ON deeds_parties (document_id);

CREATE TABLE deeds_legals AS
SELECT DISTINCT m.document_id, m.good_through_date,
  (borough * 1000000000) + (block * 10000) + lot AS bbl,
  l.easement, l.partial_lot, l.air_rights, l.subterranean_rights,
  l.property_type, l.addr_unit
FROM "socrata.data.cityofnewyork.us.acris_real_property_legals" l,
     deeds_master m
WHERE (CASE substr(l.document_id, 0, 3)
      WHEN '20' THEN l.document_id
      WHEN 'FT' THEN '100' || substr(l.document_id, 4)
      WHEN 'BK' THEN '000' || substr(l.document_id, 4)
      ELSE l.document_id END)::BIGINT = m.document_id AND
    m.good_through_date = l.good_through_date;

CREATE INDEX deeds_legals_docid ON deeds_legals (document_id);

CREATE TABLE deeds AS
SELECT m.*, l.easement, l.partial_lot, l.air_rights, l.subterranean_rights,
  l.property_type, l.addr_unit, party_type,
  p.name, p.addr1, p.addr2, p.country, p.city, p.state, p.zip,
  pl.bbl, pl.cd, pl.ct2010, pl.cb2010, pl.council, pl.zipcode, pl.address,
  pl.unitsres, pl.unitstotal, pl.yearbuilt, pl.condono, pl.geom
FROM deeds_legals l, deeds_master m, deeds_parties p,
     "contrib.us.ny.nyc.pluto" pl
WHERE l.document_id = m.document_id
  AND m.document_id = p.document_id
  AND l.bbl = pl.bbl;

The command to install Docker for Data is:

curl -s http://git.io/vYsiV | bash
source ~/.bash_profile

The command to download the dataset and load it into the local container is:

d4d install nycdeeds

Installing the deeds table will take about five minutes on a high-speed connection, with the resulting table having about 10 million rows.

To run a query, simply type the following command and you are inside a Postgres environment with all the tables loaded for you and ready to be queried:

d4d psql

For example, for a list of the top 100 addresses most often used when buying or selling properties, one would only need this query:

SELECT
  COUNT(DISTINCT document_id) num_transactions,
  COUNT(DISTINCT geom) num_properties,
  COUNT(DISTINCT name) num_names,
  MIN(document_date) first_purchase,
  MAX(document_date) last_purchase,
  addr1 address
FROM deeds
GROUP BY addr1
ORDER BY COUNT(DISTINCT geom) DESC
LIMIT 100;

Docker for Data is already being used by data activists in New York. The Real Estate Investment Cooperative (REIC) has used the real estate data from Docker for Data to visualize “flips” in the city, i.e. properties that have been resold at 50%+ markups in less than two years.
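
The exact definition REIC used is theirs, but the idea can be sketched as a self-join on the deeds table built above: the same lot (bbl) resold within two years for 1.5 times or more the earlier amount. The thresholds and the dbname=d4d connection string below are assumptions for illustration.

# Sketch of a "flips" query against the deeds table built above;
# the 1.5x / two-year thresholds illustrate the idea, not REIC's exact rule.
import psycopg2

FLIPS_SQL = """
SELECT DISTINCT a.bbl,
       a.document_date AS bought,
       b.document_date AS sold,
       b.document_amt / NULLIF(a.document_amt, 0) AS markup
FROM deeds a
JOIN deeds b
  ON a.bbl = b.bbl
 AND b.document_date > a.document_date
 AND b.document_date <= a.document_date + INTERVAL '2 years'
WHERE a.document_amt > 0
  AND b.document_amt >= 1.5 * a.document_amt
ORDER BY markup DESC
LIMIT 100;
"""

with psycopg2.connect("dbname=d4d") as conn, conn.cursor() as cur:
    cur.execute(FLIPS_SQL)
    for row in cur.fetchall():
        print(row)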

Figure 2: Map visualization using Docker for Data at REIC.

The interactive map can be browsed at http://bit.ly/1TiDTxB (full URL: https://docker4data.cartodb.com/viz/34453774-28da-11e5-8e42-0e0c41326911/public_map).

In the hands of REIC, open data from Docker for Data can be used to argue that an increase of 1% in the real property transfer tax could provide a sustainable stream of millions of dollars for affordable and cooperative housing in the areas most affected by speculative real estate investment.

RELATED AND FUTURE WORK

Related work
Socrata [14] and CKAN [1] are the two main open data portal software packages. As mentioned before, they are dataset-centric and offer very limited query capabilities; they focus more on data publishing. Downloading from a data portal is usually slow, as cities and agencies don’t like to invest too much in bandwidth. Hosted datasets, e.g. on Amazon or Google, provide prepackaged solutions that do not always offer enough flexibility. Open civic data is very location-centric, and such solutions are often weak in terms of GIS features. In both cases, the end user must rely on somebody else’s hosted solution; uploading proprietary data or logic is problematic.

Docker for Data is not the first attempt at packaging data in an end-user friendly way. The City of Philadelphia was experimenting with SQLite [19] bundles on its open data portal [7]. The PC-AXIS file format [18] is an attempt at making datasets optimized for OLAP applications, with rolling-up and drilling-down queries.

Future work
With minimal work, we could increase the number of automatically collected datasets. We currently only generate recipes automatically for datasets posted on Socrata data portals; it should be possible to add CKAN portals to the automatically generated mix.
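
CKAN exposes dataset metadata through its Action API (package_list, package_show), so a generator for CKAN portals could follow the same pattern as the Socrata one. The sketch below is only an illustration; the recipe dict it emits merely mirrors the fields of the recipes shown earlier and is not the d4d recipe format.

# Sketch: enumerate a CKAN portal's CSV resources and emit recipe-like dicts.
import json
import urllib.request

def ckan_recipes(portal: str):
    def call(action: str, **params):
        query = "&".join("{}={}".format(k, v) for k, v in params.items())
        url = "{}/api/3/action/{}{}".format(portal, action,
                                            "?" + query if query else "")
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["result"]

    for name in call("package_list"):
        package = call("package_show", id=name)
        for resource in package.get("resources", []):
            if resource.get("format", "").upper() == "CSV":
                yield {"data": resource["url"],
                       "description": package.get("notes", ""),
                       "table": name.replace("-", "_")}

# for recipe in ckan_recipes("https://demo.ckan.org"):
#     print(recipe)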

While it’s possible to add your own recipes to Docker for Data, the tooling could use improvement. Developing a tool parallel to the “d4d” client, called “b4d”, to make it easy to write and contribute new recipes would be essential to allowing people to contribute their work. The toolchain could take advantage of interactive tools like IPython to save a workflow as a recipe.

The client is still very simple and not clever enough to eliminate unneeded artifacts, requirements and temporary tables. This means that data can end up being duplicated on S3.

Since Docker for Data is packaged as a container, it would be possible to add additional modules, also packaged as containers, that supply visualization output or database administration outside of the command line. Recipes could contain pre-packaged templates and queries that can be activated with the addition of the necessary module.

Datasets are not currently versioned, and when updates happen the assumption must be to throw away the old data and replace it with the new. There could be efficiencies with projects like dat, or within S3 itself, to version and stream only changed sections of data.

Search and discovery are currently limited. A full-text search with ranking by relevance could be implemented on top of existing metadata.
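
Postgres’s built-in full-text search would be a natural fit here; the sketch below assumes a hypothetical recipe_metadata table with name and description columns, not anything currently shipped with d4d.

# Sketch: rank recipes against a search phrase with Postgres full-text search.
import psycopg2

SEARCH_SQL = """
SELECT table_name,
       ts_rank(to_tsvector('english', name || ' ' || description),
               plainto_tsquery('english', %(q)s)) AS rank
FROM recipe_metadata
WHERE to_tsvector('english', name || ' ' || description)
      @@ plainto_tsquery('english', %(q)s)
ORDER BY rank DESC
LIMIT 10;
"""

with psycopg2.connect("dbname=d4d") as conn, conn.cursor() as cur:
    cur.execute(SEARCH_SQL, {"q": "deeds mortgages"})
    for table_name, rank in cur.fetchall():
        print(table_name, round(rank, 3))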

Using the large collection of pre-collected data and the extensive schemas available, it should be possible to provide suggestions for possible joins between disparate datasets.
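
A naive first cut could simply look for columns that share a name and type across loaded tables, for example via information_schema, as sketched below; real heuristics would also need value overlap and key cardinality. The dbname=d4d connection and the public schema are assumptions.

# Sketch: suggest candidate join keys by matching column names and types
# across the tables currently loaded in the local database.
import psycopg2

SUGGEST_SQL = """
SELECT a.table_name, b.table_name, a.column_name, a.data_type
FROM information_schema.columns a
JOIN information_schema.columns b
  ON a.column_name = b.column_name
 AND a.data_type   = b.data_type
 AND a.table_name  < b.table_name
WHERE a.table_schema = 'public'
  AND b.table_schema = 'public';
"""

with psycopg2.connect("dbname=d4d") as conn, conn.cursor() as cur:
    cur.execute(SUGGEST_SQL)
    for left, right, column, data_type in cur.fetchall():
        print("{} JOIN {} ON {} ({})".format(left, right, column, data_type))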

CONCLUSION
As we advocated in a related paper [12], open data is really data “of the people, by the people, for the people”. The current effort to make lots of data open is great, but this is just tackling the first step: publishing. Making this data easy to download and use is really the next step. This is the problem we are trying to solve with Docker for Data, by taking some inspiration from the Docker container technology and applying it to open datasets.

This work is still at an early stage, but it has been received with great interest at hackathons, and the Real Estate Investment Cooperative (REIC) is using it to visualize anomalies in price markups.

With the increasing appetite for citizen science [5, 11, 13], we think that Docker for Data (and any similar effort) could be a worthy tool, making access to open data only a few clicks away.

We welcome your feedback at http://www.dockerfordata.com. The project is open source; contributions and forks are welcome at http://github.com/talos/docker4data.

References
[1] Dietrich, D. and Pollock, R. 2009. CKAN: Apt-get for the Debian of data. 26th Chaos Communication Congress, Berlin, Germany, 27–30 December 2009. 36.

[2] Dietrich, D. et al. 2009. Open data handbook. http://opendatahandbook.org.

[3] Docker. Docker: Build, ship, run. An open platform for distributed applications for developers and sysadmins. https://www.docker.com.

[4] Furman Center for Real Estate and Urban Policy. Affordable housing. http://furmancenter.org/research/area/affordable-housing.

[5] Haklay, M. 2012. Francois Grey’s 7 myths of citizen science. Muki Haklay’s personal blog. https://povesham.wordpress.com/2012/06/13/francois-greys-7-myths-of-citizen-science.

[6] Headd, M. 2015. I hate open data portals. Civic Innovations: The future is open. http://civic.io/2015/04/01/i-hate-open-data-portals.

[7] Headd, M. 2013. SQLite DBs as part of data releases. Twitter. https://twitter.com/mheadd/status/408395756744491008.

[8] Howard, A. 2014. More than economics: The social impact of open data. TechRepublic (31 Jul. 2014).

[9] Manyika, J. 2013. Open data: Unlocking innovation and performance with liquid information. McKinsey.

[10] New York City Department of Finance. Automated City Register Information System (ACRIS). http://a836-acris.nyc.gov/CP.

[11] Noveck, B.S. 2015. Smart citizens, smarter state: The technologies of expertise and the future of governing. Harvard University Press.

[12] Sahuguet, A. et al. 2014. Open civic data: Of the people, by the people, for the people. Data Engineering Bulletin (Dec. 2014).

[13] Silvertown, J. 2009. A new dawn for citizen science. Trends Ecol. Evol. 24, 9 (Sep. 2009), 467–471.

[14] Socrata. Socrata: Open data solutions for data transparency. http://www.socrata.com.

[15] Wellington, B. Why open data is still too closed – my TEDxNewYork talk. http://iquantny.tumblr.com/post/108236949969/why-open-data-is-still-too-closed-my-tedxnewyork.

[16] Wiener, S. 2015. More affordable housing – not a housing moratorium – is what we need in San Francisco. Medium. https://medium.com/@Scott_Wiener/more-affordable-housing-not-a-housing-moratorium-is-what-we-need-in-san-francisco-15df3ce5b7cd.

[17] Zuiderwijk, A. et al. 2014. Special issue on innovation through open data: Guest editors’ introduction. Journal of Theoretical and Applied Electronic Commerce Research 9, 2 (2014), i–xiii.

[18] 2013. PC-Axis file format. Statistiska centralbyrån. http://www.scb.se/pc-axis_file-format.

[19] SQLite. https://www.sqlite.org.

Page 2: “Docker” for (open)data · “Docker” for (open)data John Krauss The Governance Lab at NYU Brooklyn, New York, USA john@thegovlab.org Arnaud Sahuguet The Governance Lab at NYU

[16] etc as the price of both buying a building and renting anapartment have risen drastically in just a few years Peoplein the community advocates and politicians all want tounderstand why the city is becoming so expensive whattrends can be discovered and what they can learn from suchtrends to try to keep their communities in place

Remarkably New York City makes publicly available itsentire register of deeds and mortgages called ACRIS [10] asopen data in a machine-readable format Despite this thedata has seen little use in the community As a simple dataexploration starting point putting points on a map showingwhere property values were increasing rapidly or where acertain bank was lending would require

1 Identifying which tables you would need from the opendata portal to get the transactions (one table) thenames of buyers sellers and banks (another table) andthe lot identifiers for the properties sold or mortgages(another table)

2 Downloading these tables which is 5GB of CSVs fromthe data portal This could take several hours if thedownloads donrsquot fail You may have to start over fromscratch if one does

3 Find the geographic tables which are linked to theopen data portal but live elsewhere as shapefiles Youwould need some background knowledge of how to joinlot identifiers to geographic info in order to know thistable was the one you need

4 Download the shapefile and convert it into a databaseformat for the join Yoursquoll also need to do this fivetimes as each borough is in a separate table

5 Load the CSVs into a database for the join Theyrsquore toobig to work with in a spreadsheet application Therersquosno officially documented schema so yoursquoll need to figureit out yourself or search around on the web for one Theload itself will take at least twenty minutes on standardhardware unless you have installed more complicatedsoftware like pgloader

6 Add the appropriate indices so the join is not very slowWrite a view or derivative table that combines the fourso you can query what you want to look for faster

The data is open but the process of using it is still so difficultas to discourage use Even after a civic-minded coder goesthrough the above process itrsquos unclear how they can sharetheir transformations with others If you are simply a dataenthusiast (with limited hacking skills) the barrier of entryis simply too high

ARCHITECTURE amp IMPLEMENTATIONDocker [3] ldquoallows you to package an application with all ofits dependencies into a standardized unit for software develop-mentrdquo Borrowing from the Docker play-book (and DevOpsmore generally) we package datasets into self-contained databundles that be deployed quickly and easily

Letrsquos start with a little terminology We have some data

sources usually city or agency data portals running softwarelike Socrata or CKAN Each data source contains one ormore datasets usually CSV files Docker for data definesrecipes that map datasets into data dumps

For Docker for data we are using Postgres as the databaseA recipe defines a relational schema for a given datasetand describes the translation from a CSV file or shapefileinto a Posgres data dump For each dataset there is acorresponding recipe Using existing metadata from thedataset the translation can be done automatically Otherrecipes can be defined manually to combine datasets togethereither as relational views or brand new tables

The GovLab maintains a build server whose processes canbe viewed at httpbuilddockerfordatacom which readsthe recipes and uploads the SQL output to S3 The buildserver is also packaged as a Docker container and could easilybe run by a third party

Our docker client (deployed by you the user) makes it easyto look for recipes download the corresponding data dumplocally and load the data into the database

The architecture of the Docker for Data is described in Figure1

Figure 1 Our architecture

REVISITING OUR HOUSING EXAMPLEIn this section we revisit our motivating exampleand show how we are doing it in the context ofDocker for Data The source code is available fromhttpgithubcomtalosdocker4data

The ACRIS deed and mortgage data is published by the cityof New York on its open data portal The build server is

able to automatically generate recipes from the portal Thisis one of the three automatically generated recipes

source = datasocratadatacityofnewyorkusacris_real_property_masterdata

data httpwwwopendatacachecomdatacityofnewyorkusapiviewsbnx9-e6tjrowscsvdescription Document Details for Real Property Related Documents Recorded in ACRISmaintainer

id httpsgithubcomtalosdocker4datametadata

attribution Department of Finance (DOF)category City Governmentdescription Document Details for Real Property Related Documents Recorded in ACRISsocrata

id httpwwwopendatacachecomdatacityofnewyorkusapiviewsbnx9-e6tjjson

name ACRIS - Real Property Masterstatus needs reviewtable acris_real_property_master

-- automatically generated

The other two recipes are very similar and can be found hereand here

ACRIS does not include geographical data so it has to becombined with the cityrsquos tax lot map This is not hosted onthe open data portal but this user-contributed recipe addsit to Docker for Data

source = datacontribusnynycplutodatajson

data httpwwwnycgovhtmldcpdownloadbytesnyc_pluto_14v1ziptable pluto

source = datacontribusnynycplutobeforeshbinbash

unzip data

ls csv | tail -n 1 | xargs head -n 1 | sed -E s + g gt dataconcatenated

ls csv | xargs tail -q -n +2 | sed s[^[print]]g | sed -E s + g gtgt dataconcatenated

mv dataconcatenated data

With all the data sources available on Docker for Data a finalbundle can be defined that combines them into a usable tableshowing deed transfers in NYC from 1966 to the present

source = datacontribusnynycdeedsdatajson

requirements socratadatacityofnewyorkusacris_real_property_master latestsocratadatacityofnewyorkusacris_real_property_legals latestsocratadatacityofnewyorkusacris_real_property_parties latestsocratadatacityofnewyorkusacris_document_control_codes latestcontribusnynycpluto latest

description All real property sales with location for New York City from 1966 to the present derived from ACRISmaintainer

id httpsgithubcomtalosdocker4datametadata

category City Governmenttable deeds

and

source = datacontribusnynycdeedsaftersql

CREATE TABLE deeds_masterASSELECT DISTINCT (

CASE substr(document_id 0 3)WHEN 20 THEN document_idWHEN FT THEN 100 || substr(document_id 4)WHEN BK THEN 000 || substr(document_id 4)ELSE document_ID END)BIGINT as document_id

mgood_through_date mdocument_datemdocument_amt mrecorded_datetime mmodified_datedccdoc__type_description dccdoc__type as doc_type

FROM socratadatacityofnewyorkusacris_real_property_master msocratadatacityofnewyorkusacris_document_control_codes dcc

WHERE dccclass_code_description = DEEDS AND OTHER CONVEYANCES ANDdccdoc__type = mdoc_type

DELETE FROM deeds_master USING deeds_master aliasWHERE deeds_masterdocument_id = aliasdocument_id AND

deeds_mastergood_through_date lt aliasgood_through_date

CREATE UNIQUE INDEX deeds_master_docid ON deeds_master (document_id)

CREATE TABLE deeds_partiesASSELECT DISTINCT mdocument_id pgood_through_date

CASE pparty_type WHEN 1 THEN dccparty1_typeWHEN 2 THEN dccparty2_typeWHEN 3 THEN dccparty3_type

ELSE pparty_type END AS party_typepname paddr1 paddr2 pcountry pcity pstate pzip

FROM socratadatacityofnewyorkusacris_real_property_parties pdeeds_master msocratadatacityofnewyorkusacris_document_control_codes dcc

WHERE (CASE substr(pdocument_id 0 3)WHEN 20 THEN pdocument_idWHEN FT THEN 100 || substr(pdocument_id 4)WHEN BK THEN 000 || substr(pdocument_id 4)ELSE pdocument_id END)BIGINT = mdocument_id AND

mgood_through_date = pgood_through_date ANDSUBSTR(pdocument_id 4) ~ ^[0-9]+$ ANDdccdoc__type = mdoc_type

CREATE INDEX deeds_parties_docid ON deeds_parties (document_id)

CREATE TABLE deeds_legalsASSELECT DISTINCT mdocument_id mgood_through_date

(borough 1000000000) + (block 10000) + lot as bblleasement lpartial_lot lair_rights lsubterranean_rightslproperty_type laddr_unit

FROM socratadatacityofnewyorkusacris_real_property_legals ldeeds_master m WHERE (CASE substr(ldocument_id 0 3)

WHEN 20 THEN ldocument_idWHEN FT THEN 100 || substr(ldocument_id 4)WHEN BK THEN 000 || substr(ldocument_id 4)ELSE ldocument_id END)BIGINT = mdocument_id AND

mgood_through_date = lgood_through_dateCREATE INDEX deeds_legals_docid ON deeds_legals (document_id)

CREATE TABLE deeds ASSELECT m leasement lpartial_lot lair_rights lsubterranean_rights

lproperty_type laddr_unit party_typepname paddr1 paddr2 pcountry pcity pstate pzip plbbl plcdplct2010 plcb2010 plcouncil plzipcode pladdress plunitsresplunitstotal plyearbuilt plcondono plgeom

FROM deeds_legals l deeds_master m deeds_parties p contribusnynycpluto plWHERE ldocument_id = mdocument_id

AND mdocument_id = pdocument_idAND lbbl = plbbl

The command to install Docker for Data is

curl -s httpgitiovYsiV | bashsource ~bash_profile

The command to download the dataset and load it into thelocal container is

d4d install nycdeeds

Installing the deeds table will take about five minutes on ahigh-speed connection with the resulting table having about10 million rows

To run a query simply type the following command and youare inside a Postgres environment with all the tables loadedfor you and ready to be queried

d4d psql

For example for a list of the top 100 addresses most oftenused when buying or selling properties one would only need

this query

SELECTCOUNT(DISTINCT document_id) num_transactionsCOUNT(DISTINCT geom) num_propertiesCOUNT(DISTINCT name) num_namesMIN(document_date) first_purchaseMAX(document_date) last_purchaseaddr1 address

FROM deedsGROUP BY addr1ORDER BY COUNT(DISTINCT geom) DESCLIMIT 100

Docker for Data is already being used by data activistsin New York The Real Estate Investment Cooperative(REIC) has used the real estate data from Docker for Datato visualize ldquoflipsrdquo in the city or properties that have soldat 50+ markups in less than two years

Figure 2 Map visualization using Docker for Dataat REIC

The interactive map can be browsed at httpbitly1TiDTxB1

In the hands of REIC open data from Docker for Data canbe used to argue that an increase of 1 in the real propertytransfer tax could provide a sustainable stream of millions ofdollars for affordable and cooperative housing in areas mostaffected by speculative real estate investment

RELATED AND FUTURE WORKRelated workSocrata [14] and CKAN [1] are the two main opendata portalsoftware As mentioned before they are dataset centric andoffer very limited query capabilities They focus more on datapublishing Downloading from a data portal is usually slow ascities and agency donrsquot like to invest too much on bandwidthHosted datasets eg Amazon Google provide prepackagedsolutions that do not always offer enough flexibilities Opencivic data is very location-centric and such solutions are1httpsdocker4datacartodbcomviz34453774-28da-11e5-8e42-0e0c41326911public_map

often weak in terms of GIS features In both cases the enduser must rely on somebody else hosted solution Uploadingproprietary data or logic is problematic

Docker for Data is not the first attempt at packaging datain an end-user friendly way The city of Philadelphia wasexperimenting with Sqlite [19] bundles on its opendata portal[7] The PC-AXIS file format [18] is an attempt at makingdatasets optimized for OLAP applications with rolling-upand drilling-down queries

Future workWith minimal work we could increase the number of auto-matically collected datasets We only automatically generaterecipes for datasets posted on Socrata data portals It shouldbe possible to include CKAN portals into the automaticallygenerated mix

While itrsquos possible to add your own recipes to Docker forData the tooling could use improvement Developing aparallel tool to the client ldquod4drdquo tool called ldquob4drdquo to make iteasy to write and contribute new recipes would be essentialto allowing people to contribute their work The toolchaincould take advantage of interactive tools like iPython to savea workflow as a recipe

The client is still very simple and not clever enough toeliminate unneeded artifacts requirements and temporarytables This means that data can end up being duplicatedon S3

Since Docker for Data is packaged as a container it wouldbe possible to add additional modules also packaged ascontainers that supply visualization output or databaseadministration outside of the command line Recipes couldcontain pre-packaged templates and queries that can beactivated with the addition of the necessary module

Datasets are not currently versioned and when updateshappen the assumption must be to throw away the old dataand replace it with the new There could be efficiencies withprojects like dat or within S3 itself to version and streamonly changed sections of data

Search and discovery are currently limited A full-text searchwith ranking by relevance could be implemented on top ofexisting metadata

Using the large collection of pre-collected data and extensiveschema available it should be possible to provide suggestionsfor possible joins between disparate datasets

CONCLUSIONAs we advocated in a related paper [12] opendata is reallydata ldquoof the people by the people for the peoplerdquo Thecurrent effort to make lots of data open is great but thisis just tackling the first step publishing Make this dataeasy to download and use is really the next step This isthe problem we are trying to solve with Docker for Data bytaking some inspiration from the Docker container technologyand applying it to open datasets

This work is still at an early stage but it has been received by

great interestes at hackathons and the Real Estate InvestmentCooperative (REIC) is using it to visualize anomalies in pricemarkups

With the increasing appetite for citizen science [5 11 13]we think that Docker for Data (and any similar efforts) couldbe a worthy tool making access to open data only a fewclicks away

We welcome your feedback at httpwwwdockerfordatacomThe project is open source Contributions and forks arewelcome at httpgithubcomtalosdocker4data

References[1] Dietrich D and Pollock R 2009 CKAN Apt-get for thedebian of data 26th chaos communication congress berlingermany 27ndash30 december 2009 (2009) 36

[2] Dietrich D et al 2009 Open data handbookhttpopendatahandbook org

[3] Docker Docker Build ship run an open platform for dis-tributed applications for developers and sysadmins Dockerhttpswwwdockercom

[4] Furman Center for Real Estate and Urban Policy Af-fordable housing httpfurmancenterorgresearchareaaffordable-housing

[5] Haklay M 2012 Francois greyrsquos 7 myths of citizen scienceMuki haklayrsquos personal blog httpspoveshamwordpresscom20120613francois-greys-7-myths-of-citizen-science

[6] Headd M 2015 I hate open data portals Civic in-novations The future is open httpcivicio20150401i-hate-open-data-portals

[7] Headd M 2013 Sqlite DBs as part of data re-leases Twitter httpstwittercommheaddstatus408395756744491008

[8] Howard A 2014 More than economics The socialimpact of open data Tech Republic (31~jul 2014)

[9] Manyika J 2013 Open data Unlocking innovation andperformance with liquid information McKinsey

[10] New York City Department of Finance Automated cityregister information system (ACRIS) httpa836-acrisnycgovCP

[11] Noveck BS 2015 Smart citizens smarter state Thetechnologies of expertise and the future of governing HarvardUniversity Press

[12] Sahuguet A et al 2014 Open civic data Of the peopleby the people for the people Data Engineering Bulletin(Dec 2014)

[13] Silvertown J 2009 A new dawn for citizen scienceTrends Ecol Evol 24 9 (Sep 2009) 467ndash471

[14] Socrata Socrata open data solutions for data trans-

parency Socrata httpwwwsocratacom

[15] Wellington B Why open data is still too closed - myTEDxNewYork talk httpiquantnytumblrcompost108236949969why-open-data-is-still-too-closed-my-tedxnewyork

[16] Wiener S 2015 More affordable housing mdash not ahousing moratorium mdash is what we need in san francisco mdashmedium Medium httpsmediumcomScott_Wienermore-affordable-housing-not-a-housing-moratorium-is-what-we-need-in-san-francisco-15df3ce5b7cd

[17] Zuiderwijk A et al 2014 Special issue on innovationthrough open data Guest editorsrsquo introduction Journal oftheoretical and applied electronic commerce research 9 2(2014) indashxiii

[18] 2013 PC-Axis file format Statistiska centralbyraringnhttpwwwscbsepc-axis_file-format

[19] SQLite httpswwwsqliteorg

  • Introduction
  • Motivation
  • Architecture amp Implementation
  • Revisiting our housing example
  • Related and Future Work
    • Related work
    • Future work
      • Conclusion
      • References
Page 3: “Docker” for (open)data · “Docker” for (open)data John Krauss The Governance Lab at NYU Brooklyn, New York, USA john@thegovlab.org Arnaud Sahuguet The Governance Lab at NYU

able to automatically generate recipes from the portal Thisis one of the three automatically generated recipes

source = datasocratadatacityofnewyorkusacris_real_property_masterdata

data httpwwwopendatacachecomdatacityofnewyorkusapiviewsbnx9-e6tjrowscsvdescription Document Details for Real Property Related Documents Recorded in ACRISmaintainer

id httpsgithubcomtalosdocker4datametadata

attribution Department of Finance (DOF)category City Governmentdescription Document Details for Real Property Related Documents Recorded in ACRISsocrata

id httpwwwopendatacachecomdatacityofnewyorkusapiviewsbnx9-e6tjjson

name ACRIS - Real Property Masterstatus needs reviewtable acris_real_property_master

-- automatically generated

The other two recipes are very similar and can be found hereand here

ACRIS does not include geographical data so it has to becombined with the cityrsquos tax lot map This is not hosted onthe open data portal but this user-contributed recipe addsit to Docker for Data

source = datacontribusnynycplutodatajson

data httpwwwnycgovhtmldcpdownloadbytesnyc_pluto_14v1ziptable pluto

source = datacontribusnynycplutobeforeshbinbash

unzip data

ls csv | tail -n 1 | xargs head -n 1 | sed -E s + g gt dataconcatenated

ls csv | xargs tail -q -n +2 | sed s[^[print]]g | sed -E s + g gtgt dataconcatenated

mv dataconcatenated data

With all the data sources available on Docker for Data a finalbundle can be defined that combines them into a usable tableshowing deed transfers in NYC from 1966 to the present

source = datacontribusnynycdeedsdatajson

requirements socratadatacityofnewyorkusacris_real_property_master latestsocratadatacityofnewyorkusacris_real_property_legals latestsocratadatacityofnewyorkusacris_real_property_parties latestsocratadatacityofnewyorkusacris_document_control_codes latestcontribusnynycpluto latest

description All real property sales with location for New York City from 1966 to the present derived from ACRISmaintainer

id httpsgithubcomtalosdocker4datametadata

category City Governmenttable deeds

and

source = datacontribusnynycdeedsaftersql

CREATE TABLE deeds_masterASSELECT DISTINCT (

CASE substr(document_id 0 3)WHEN 20 THEN document_idWHEN FT THEN 100 || substr(document_id 4)WHEN BK THEN 000 || substr(document_id 4)ELSE document_ID END)BIGINT as document_id

mgood_through_date mdocument_datemdocument_amt mrecorded_datetime mmodified_datedccdoc__type_description dccdoc__type as doc_type

FROM socratadatacityofnewyorkusacris_real_property_master msocratadatacityofnewyorkusacris_document_control_codes dcc

WHERE dccclass_code_description = DEEDS AND OTHER CONVEYANCES ANDdccdoc__type = mdoc_type

DELETE FROM deeds_master USING deeds_master aliasWHERE deeds_masterdocument_id = aliasdocument_id AND

deeds_mastergood_through_date lt aliasgood_through_date

CREATE UNIQUE INDEX deeds_master_docid ON deeds_master (document_id)

CREATE TABLE deeds_partiesASSELECT DISTINCT mdocument_id pgood_through_date

CASE pparty_type WHEN 1 THEN dccparty1_typeWHEN 2 THEN dccparty2_typeWHEN 3 THEN dccparty3_type

ELSE pparty_type END AS party_typepname paddr1 paddr2 pcountry pcity pstate pzip

FROM socratadatacityofnewyorkusacris_real_property_parties pdeeds_master msocratadatacityofnewyorkusacris_document_control_codes dcc

WHERE (CASE substr(pdocument_id 0 3)WHEN 20 THEN pdocument_idWHEN FT THEN 100 || substr(pdocument_id 4)WHEN BK THEN 000 || substr(pdocument_id 4)ELSE pdocument_id END)BIGINT = mdocument_id AND

mgood_through_date = pgood_through_date ANDSUBSTR(pdocument_id 4) ~ ^[0-9]+$ ANDdccdoc__type = mdoc_type

CREATE INDEX deeds_parties_docid ON deeds_parties (document_id)

CREATE TABLE deeds_legalsASSELECT DISTINCT mdocument_id mgood_through_date

(borough 1000000000) + (block 10000) + lot as bblleasement lpartial_lot lair_rights lsubterranean_rightslproperty_type laddr_unit

FROM socratadatacityofnewyorkusacris_real_property_legals ldeeds_master m WHERE (CASE substr(ldocument_id 0 3)

WHEN 20 THEN ldocument_idWHEN FT THEN 100 || substr(ldocument_id 4)WHEN BK THEN 000 || substr(ldocument_id 4)ELSE ldocument_id END)BIGINT = mdocument_id AND

mgood_through_date = lgood_through_dateCREATE INDEX deeds_legals_docid ON deeds_legals (document_id)

CREATE TABLE deeds ASSELECT m leasement lpartial_lot lair_rights lsubterranean_rights

lproperty_type laddr_unit party_typepname paddr1 paddr2 pcountry pcity pstate pzip plbbl plcdplct2010 plcb2010 plcouncil plzipcode pladdress plunitsresplunitstotal plyearbuilt plcondono plgeom

FROM deeds_legals l deeds_master m deeds_parties p contribusnynycpluto plWHERE ldocument_id = mdocument_id

AND mdocument_id = pdocument_idAND lbbl = plbbl

The command to install Docker for Data is

curl -s httpgitiovYsiV | bashsource ~bash_profile

The command to download the dataset and load it into thelocal container is

d4d install nycdeeds

Installing the deeds table will take about five minutes on ahigh-speed connection with the resulting table having about10 million rows

To run a query simply type the following command and youare inside a Postgres environment with all the tables loadedfor you and ready to be queried

d4d psql

For example for a list of the top 100 addresses most oftenused when buying or selling properties one would only need

this query

SELECTCOUNT(DISTINCT document_id) num_transactionsCOUNT(DISTINCT geom) num_propertiesCOUNT(DISTINCT name) num_namesMIN(document_date) first_purchaseMAX(document_date) last_purchaseaddr1 address

FROM deedsGROUP BY addr1ORDER BY COUNT(DISTINCT geom) DESCLIMIT 100

Docker for Data is already being used by data activistsin New York The Real Estate Investment Cooperative(REIC) has used the real estate data from Docker for Datato visualize ldquoflipsrdquo in the city or properties that have soldat 50+ markups in less than two years

Figure 2 Map visualization using Docker for Dataat REIC

The interactive map can be browsed at httpbitly1TiDTxB1

In the hands of REIC open data from Docker for Data canbe used to argue that an increase of 1 in the real propertytransfer tax could provide a sustainable stream of millions ofdollars for affordable and cooperative housing in areas mostaffected by speculative real estate investment

RELATED AND FUTURE WORKRelated workSocrata [14] and CKAN [1] are the two main opendata portalsoftware As mentioned before they are dataset centric andoffer very limited query capabilities They focus more on datapublishing Downloading from a data portal is usually slow ascities and agency donrsquot like to invest too much on bandwidthHosted datasets eg Amazon Google provide prepackagedsolutions that do not always offer enough flexibilities Opencivic data is very location-centric and such solutions are1httpsdocker4datacartodbcomviz34453774-28da-11e5-8e42-0e0c41326911public_map

often weak in terms of GIS features In both cases the enduser must rely on somebody else hosted solution Uploadingproprietary data or logic is problematic

Docker for Data is not the first attempt at packaging datain an end-user friendly way The city of Philadelphia wasexperimenting with Sqlite [19] bundles on its opendata portal[7] The PC-AXIS file format [18] is an attempt at makingdatasets optimized for OLAP applications with rolling-upand drilling-down queries

Future workWith minimal work we could increase the number of auto-matically collected datasets We only automatically generaterecipes for datasets posted on Socrata data portals It shouldbe possible to include CKAN portals into the automaticallygenerated mix

While itrsquos possible to add your own recipes to Docker forData the tooling could use improvement Developing aparallel tool to the client ldquod4drdquo tool called ldquob4drdquo to make iteasy to write and contribute new recipes would be essentialto allowing people to contribute their work The toolchaincould take advantage of interactive tools like iPython to savea workflow as a recipe

The client is still very simple and not clever enough toeliminate unneeded artifacts requirements and temporarytables This means that data can end up being duplicatedon S3

Since Docker for Data is packaged as a container it wouldbe possible to add additional modules also packaged ascontainers that supply visualization output or databaseadministration outside of the command line Recipes couldcontain pre-packaged templates and queries that can beactivated with the addition of the necessary module

Datasets are not currently versioned and when updateshappen the assumption must be to throw away the old dataand replace it with the new There could be efficiencies withprojects like dat or within S3 itself to version and streamonly changed sections of data

Search and discovery are currently limited A full-text searchwith ranking by relevance could be implemented on top ofexisting metadata

Using the large collection of pre-collected data and extensiveschema available it should be possible to provide suggestionsfor possible joins between disparate datasets

CONCLUSIONAs we advocated in a related paper [12] opendata is reallydata ldquoof the people by the people for the peoplerdquo Thecurrent effort to make lots of data open is great but thisis just tackling the first step publishing Make this dataeasy to download and use is really the next step This isthe problem we are trying to solve with Docker for Data bytaking some inspiration from the Docker container technologyand applying it to open datasets

This work is still at an early stage but it has been received by

great interestes at hackathons and the Real Estate InvestmentCooperative (REIC) is using it to visualize anomalies in pricemarkups

With the increasing appetite for citizen science [5 11 13]we think that Docker for Data (and any similar efforts) couldbe a worthy tool making access to open data only a fewclicks away

We welcome your feedback at httpwwwdockerfordatacomThe project is open source Contributions and forks arewelcome at httpgithubcomtalosdocker4data

References[1] Dietrich D and Pollock R 2009 CKAN Apt-get for thedebian of data 26th chaos communication congress berlingermany 27ndash30 december 2009 (2009) 36

[2] Dietrich D et al 2009 Open data handbookhttpopendatahandbook org

[3] Docker Docker Build ship run an open platform for dis-tributed applications for developers and sysadmins Dockerhttpswwwdockercom

[4] Furman Center for Real Estate and Urban Policy Af-fordable housing httpfurmancenterorgresearchareaaffordable-housing

[5] Haklay M 2012 Francois greyrsquos 7 myths of citizen scienceMuki haklayrsquos personal blog httpspoveshamwordpresscom20120613francois-greys-7-myths-of-citizen-science

[6] Headd M 2015 I hate open data portals Civic in-novations The future is open httpcivicio20150401i-hate-open-data-portals

[7] Headd M 2013 Sqlite DBs as part of data re-leases Twitter httpstwittercommheaddstatus408395756744491008

[8] Howard A 2014 More than economics The socialimpact of open data Tech Republic (31~jul 2014)

[9] Manyika J 2013 Open data Unlocking innovation andperformance with liquid information McKinsey

[10] New York City Department of Finance Automated cityregister information system (ACRIS) httpa836-acrisnycgovCP

[11] Noveck BS 2015 Smart citizens smarter state Thetechnologies of expertise and the future of governing HarvardUniversity Press

[12] Sahuguet A et al 2014 Open civic data Of the peopleby the people for the people Data Engineering Bulletin(Dec 2014)

[13] Silvertown J 2009 A new dawn for citizen scienceTrends Ecol Evol 24 9 (Sep 2009) 467ndash471

[14] Socrata Socrata open data solutions for data trans-

parency Socrata httpwwwsocratacom

[15] Wellington B Why open data is still too closed - myTEDxNewYork talk httpiquantnytumblrcompost108236949969why-open-data-is-still-too-closed-my-tedxnewyork

[16] Wiener S 2015 More affordable housing mdash not ahousing moratorium mdash is what we need in san francisco mdashmedium Medium httpsmediumcomScott_Wienermore-affordable-housing-not-a-housing-moratorium-is-what-we-need-in-san-francisco-15df3ce5b7cd

[17] Zuiderwijk A et al 2014 Special issue on innovationthrough open data Guest editorsrsquo introduction Journal oftheoretical and applied electronic commerce research 9 2(2014) indashxiii

[18] 2013 PC-Axis file format Statistiska centralbyraringnhttpwwwscbsepc-axis_file-format

[19] SQLite httpswwwsqliteorg

  • Introduction
  • Motivation
  • Architecture amp Implementation
  • Revisiting our housing example
  • Related and Future Work
    • Related work
    • Future work
      • Conclusion
      • References
Page 4: “Docker” for (open)data · “Docker” for (open)data John Krauss The Governance Lab at NYU Brooklyn, New York, USA john@thegovlab.org Arnaud Sahuguet The Governance Lab at NYU

this query

SELECTCOUNT(DISTINCT document_id) num_transactionsCOUNT(DISTINCT geom) num_propertiesCOUNT(DISTINCT name) num_namesMIN(document_date) first_purchaseMAX(document_date) last_purchaseaddr1 address

FROM deedsGROUP BY addr1ORDER BY COUNT(DISTINCT geom) DESCLIMIT 100

Docker for Data is already being used by data activistsin New York The Real Estate Investment Cooperative(REIC) has used the real estate data from Docker for Datato visualize ldquoflipsrdquo in the city or properties that have soldat 50+ markups in less than two years

Figure 2: Map visualization using Docker for Data at REIC.

The interactive map can be browsed at http://bit.ly/1TiDTxB (full URL: https://docker4data.cartodb.com/viz/34453774-28da-11e5-8e42-0e0c41326911/public_map).

In the hands of REIC, open data from Docker for Data can be used to argue that an increase of 1% in the real property transfer tax could provide a sustainable stream of millions of dollars for affordable and cooperative housing in the areas most affected by speculative real estate investment.
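The revenue side of that argument is a back-of-the-envelope aggregate over the same data. The sketch below reuses the assumed document_amt column from above and simply applies an extra 1% to one illustrative year of recorded transfers:

-- Sketch: additional yearly revenue from a 1% increase in the transfer tax,
-- assuming document_amt holds the sale price (an assumed column name).
SELECT SUM(document_amt) * 0.01 AS extra_revenue_per_year
FROM deeds
WHERE document_date >= DATE '2014-01-01'
  AND document_date <  DATE '2015-01-01';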

RELATED AND FUTURE WORK

Related work
Socrata [14] and CKAN [1] are the two main open-data portal software packages. As mentioned before, they are dataset-centric and offer very limited query capabilities; they focus more on data publishing. Downloading from a data portal is usually slow, as cities and agencies don't like to invest too much in bandwidth. Hosted datasets (e.g., Amazon, Google) provide prepackaged solutions that do not always offer enough flexibility. Open civic data is very location-centric, and such solutions are often weak in terms of GIS features. In both cases, the end user must rely on somebody else's hosted solution, and uploading proprietary data or logic is problematic.

Docker for Data is not the first attempt at packaging data in an end-user-friendly way. The city of Philadelphia was experimenting with SQLite [19] bundles on its open-data portal [7]. The PC-Axis file format [18] is an attempt at making datasets optimized for OLAP applications with roll-up and drill-down queries.

Future work
With minimal work, we could increase the number of automatically collected datasets. We currently generate recipes automatically only for datasets posted on Socrata data portals; it should be possible to include CKAN portals in the automatically generated mix.

While it is possible to add your own recipes to Docker for Data, the tooling could use improvement. Developing a tool parallel to the "d4d" client, called "b4d", to make it easy to write and contribute new recipes would be essential to allowing people to contribute their work. The toolchain could take advantage of interactive tools like IPython to save a workflow as a recipe.

The client is still very simple and not clever enough to eliminate unneeded artifacts, requirements, and temporary tables. This means that data can end up being duplicated on S3.

Since Docker for Data is packaged as a container, it would be possible to add additional modules, also packaged as containers, that supply visualization, output, or database administration outside of the command line. Recipes could contain pre-packaged templates and queries that can be activated by adding the necessary module.

Datasets are not currently versioned; when updates happen, the assumption must be to throw away the old data and replace it with the new. There could be efficiencies with projects like dat, or within S3 itself, to version and stream only the changed sections of data.

Search and discovery are currently limited. A full-text search with ranking by relevance could be implemented on top of the existing metadata.
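As a minimal sketch of what such a search could look like, assuming a PostgreSQL backend and a hypothetical metadata table datasets(name, description) that Docker for Data does not currently expose under that name:

-- Sketch: rank dataset metadata by full-text relevance for a user query.
SELECT name,
       ts_rank(to_tsvector('english', coalesce(name, '') || ' ' || coalesce(description, '')),
               plainto_tsquery('english', 'housing sales')) AS relevance
FROM datasets
WHERE to_tsvector('english', coalesce(name, '') || ' ' || coalesce(description, ''))
      @@ plainto_tsquery('english', 'housing sales')
ORDER BY relevance DESC
LIMIT 10;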

Using the large collection of pre-collected data and the extensive schema available, it should be possible to suggest candidate joins between disparate datasets.
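A naive first pass, sketched below under the assumption that all loaded datasets live as tables in one PostgreSQL schema, is to treat columns that share a name across tables as candidate join keys:

-- Sketch: pairs of tables sharing a column name are candidate joins.
SELECT a.table_name  AS table_a,
       b.table_name  AS table_b,
       a.column_name AS candidate_join_key
FROM information_schema.columns a
JOIN information_schema.columns b
  ON  a.column_name = b.column_name
  AND a.table_name  < b.table_name
WHERE a.table_schema = 'public'
  AND b.table_schema = 'public'
ORDER BY a.column_name, table_a, table_b;

Checking column-type compatibility and sampling overlapping values would be needed to rank such suggestions usefully.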

CONCLUSION
As we advocated in a related paper [12], open data is really data "of the people, by the people, for the people". The current effort to make lots of data open is great, but this is just tackling the first step: publishing. Making this data easy to download and use is really the next step. This is the problem we are trying to solve with Docker for Data, by taking some inspiration from the Docker container technology and applying it to open datasets.

This work is still at an early stage, but it has been received with great interest at hackathons, and the Real Estate Investment Cooperative (REIC) is using it to visualize anomalies in price markups.

With the increasing appetite for citizen science [5, 11, 13], we think that Docker for Data (and any similar efforts) could be a worthy tool, making access to open data only a few clicks away.

We welcome your feedback at http://www.dockerfordata.com. The project is open source; contributions and forks are welcome at http://github.com/talos/docker4data.

References
[1] Dietrich, D. and Pollock, R. 2009. CKAN: apt-get for the Debian of data. 26th Chaos Communication Congress, Berlin, Germany, 27–30 December 2009. (2009), 36.

[2] Dietrich, D. et al. 2009. Open data handbook. http://opendatahandbook.org.

[3] Docker. Docker: build, ship, run. An open platform for distributed applications for developers and sysadmins. https://www.docker.com.

[4] Furman Center for Real Estate and Urban Policy. Affordable housing. http://furmancenter.org/research/area/affordable-housing.

[5] Haklay, M. 2012. Francois Grey's 7 myths of citizen science. Muki Haklay's personal blog. https://povesham.wordpress.com/2012/06/13/francois-greys-7-myths-of-citizen-science.

[6] Headd, M. 2015. I hate open data portals. Civic Innovations: The Future is Open. http://civic.io/2015/04/01/i-hate-open-data-portals.

[7] Headd, M. 2013. SQLite DBs as part of data releases. Twitter. https://twitter.com/mheadd/status/408395756744491008.

[8] Howard, A. 2014. More than economics: the social impact of open data. TechRepublic (31 Jul 2014).

[9] Manyika, J. 2013. Open data: unlocking innovation and performance with liquid information. McKinsey.

[10] New York City Department of Finance. Automated City Register Information System (ACRIS). http://a836-acris.nyc.gov/CP.

[11] Noveck, B.S. 2015. Smart Citizens, Smarter State: The Technologies of Expertise and the Future of Governing. Harvard University Press.

[12] Sahuguet, A. et al. 2014. Open civic data: of the people, by the people, for the people. Data Engineering Bulletin (Dec 2014).

[13] Silvertown, J. 2009. A new dawn for citizen science. Trends in Ecology & Evolution 24, 9 (Sep 2009), 467–471.

[14] Socrata. Socrata: open data solutions for data transparency. http://www.socrata.com.

[15] Wellington, B. Why open data is still too closed - my TEDxNewYork talk. http://iquantny.tumblr.com/post/108236949969/why-open-data-is-still-too-closed-my-tedxnewyork.

[16] Wiener, S. 2015. More affordable housing - not a housing moratorium - is what we need in San Francisco. Medium. https://medium.com/@Scott_Wiener/more-affordable-housing-not-a-housing-moratorium-is-what-we-need-in-san-francisco-15df3ce5b7cd.

[17] Zuiderwijk, A. et al. 2014. Special issue on innovation through open data: guest editors' introduction. Journal of Theoretical and Applied Electronic Commerce Research 9, 2 (2014), i–xiii.

[18] 2013. PC-Axis file format. Statistiska centralbyrån. http://www.scb.se/pc-axis_file-format.

[19] SQLite. https://www.sqlite.org.

