Scoda openrefine-directordata


A recipe for grabbing director information from OpenCorporates using OpenRefine, given an OpenCorporates company ID or OpenCorporates company page URL.

For more information, contact: schoolOfData.org

Here’s the sort of thing we’re starting with: a list of companies…

Here’s the sort of thing we want: lists of directors associated with each company (where that information is available).

The first step is to create a web address/URL to call the OpenCorporates API and ask it for data about a particular company. OpenRefine can create a new column populated with the contents of calls made to a URL contained in, or generated from, another column.

The URLs should take the form:

http://api.opencorporates.com/companies/JURISDICTION/COMPANY_ID

If you already have company page URLs in a column, add a column based on that column using: value.replace('http://','http://api.')

If you have JURISDICTION/COMPANY_ID in a column, use the formula: "http://api.opencorporates.com/companies/"+value
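The two URL recipes above can be sketched outside OpenRefine as follows. This is a minimal Python illustration, not the OpenRefine expression itself; the function names are hypothetical, and the page URL is assumed to start with http:// and to lack the api. prefix.

```python
API_BASE = "http://api.opencorporates.com/companies/"


def api_url_from_page_url(page_url):
    """Turn a company page URL into an API URL by adding the 'api.' host prefix.

    Assumes the page URL starts with 'http://' (hypothetical helper).
    """
    # Replace only the first occurrence so the rest of the URL is untouched.
    return page_url.replace("http://", "http://api.", 1)


def api_url_from_company_path(company_path):
    """Build an API URL from a 'JURISDICTION/COMPANY_ID' value."""
    return API_BASE + company_path
```

Both helpers should produce the same API URL for the same company, whichever form the starting column takes.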

The data comes back as JSON data, which we will need to process.

Each JSON result contains the data for a single company. The data relating to the directors can be found as a list down the path value.parseJson()['results']['company']['officers']

Let’s parse the JSON data and put the directors’ information into another column…

What we are aiming for is a representation of the form:

32866743::SIMON ALAN CONSTANT-GLEMAS::director::2010-04-07::null
32866744::KARIN JACQUELINE HAWKINS::director::2006-01-17::2012-02-22
32866745::ANDREW WILLIAM LONGDEN::director::2003-11-03::null
…

where we list director ID, name, position, appointment date, termination date.

This function will parse the data into a string of the form:

32866743::SIMON ALAN CONSTANT-GLEMAS::director::2010-04-07::null||32866744::KARIN JACQUELINE HAWKINS::director::2006-01-17::2012-02-22||32866745::ANDREW WILLIAM LONGDEN::director::2003-11-03::null||…

The function reads as follows: “for each officer, join their ID, name, position, start date and end date with ::, then join each of these director descriptions using ||”.
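The "join with ::, then join with ||" logic described above can be sketched in Python as below. This is an illustration rather than the OpenRefine expression used in the recipe. It assumes each entry in the officers list wraps its fields in an "officer" object with the keys id, name, position, start_date and end_date; those key names are an assumption and should be checked against the JSON actually returned.

```python
import json

# Assumed field names inside each officer record (not confirmed by the recipe).
OFFICER_KEYS = ("id", "name", "position", "start_date", "end_date")


def flatten_officers(raw_json):
    """Return officers as 'id::name::position::start::end' records joined by '||'."""
    officers = json.loads(raw_json)["results"]["company"]["officers"]
    records = []
    for entry in officers:
        officer = entry["officer"]  # assumed wrapper object
        fields = [officer.get(key) for key in OFFICER_KEYS]
        # Missing or empty values are written as the literal text 'null'.
        records.append("::".join("null" if f is None else str(f) for f in fields))
    return "||".join(records)
```

A company with no listed officers yields an empty string, which corresponds to a blank cell in the new column.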

The use of two different – and hopefully unique – delimiters means we can split the data on each delimiter type separately.

The parsed data is put into a new column in this combined list form.

We can then split the data so that we create a new row for each director using the delimiter we defined: ||

Note that values from the other columns will not be copied into any newly created rows – we will have to do that ourselves either now, or later.

For each director, we now want to split their details out across several columns, one for each data field (ID, name, position, appointment date, termination date).

We can do this by splitting on the other separator type we used: ::

The newly created columns are labeled with automatically generated names. It would probably make sense to rename them to something slightly more convenient.
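The two splitting steps (rows on ||, then columns on ::) can be sketched in Python as follows. The column names in FIELDS are hypothetical labels chosen for illustration; in OpenRefine the new columns get generated names.

```python
# Hypothetical column labels for the five fields, in the order they were joined.
FIELDS = ("director_id", "name", "position", "appointed", "terminated")


def split_directors(combined):
    """Split a combined cell into one dict per director."""
    rows = []
    for record in combined.split("||"):  # one row per director
        values = record.split("::")      # one column per field
        rows.append(dict(zip(FIELDS, values)))
    return rows
```

This only works cleanly because neither delimiter is expected to appear inside the data itself, which is why two distinct and unusual delimiters were chosen.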

Finally, we can do a little more tidying. For any columns we want to export, such as company name, or company ID, we can Fill down using the corresponding values from the original row the directors’ information was pulled from.
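The effect of Fill down can be sketched in Python as below: a blank cell takes the value of the nearest non-blank cell above it in the same column. This illustrates the idea only, not OpenRefine's implementation; the row layout and the "company" column name are assumptions.

```python
def fill_down(rows, column):
    """Return copies of rows with blank (None) cells in `column` filled from above."""
    filled = []
    last_seen = None
    for row in rows:
        if row.get(column) is None:
            # Blank cell: reuse the most recent value seen above it.
            row = {**row, column: last_seen}
        else:
            last_seen = row[column]
        filled.append(row)
    return filled
```

For example, after splitting directors into rows only the first row for each company carries the company ID, and filling down copies it onto the remaining rows for that company.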

If you want to know more, contact us…
