Mapping Corporate Networks With OpenCorporates

Post on 27-Jan-2015


OpenCorporates Co-Director Mapping

Mapping Corporate Sprawls

Tony Hirst, Dept of Communications and Systems, The Open University

As company filings start to appear as open data, opportunities may arise for watchdogs to start mining this data in support of their investigations and monitoring activities.

This presentation introduces several ideas relating to mapping network structures in order to learn something about the structure of “corporate sprawls”, corporate groupings defined on the basis of co-director relationships.

Social Media Mapping

Introducing “Graphs”

To introduce the idea of a network map, let’s have a look at a view we can construct over the Twitter social space…

Emergent Social Positioning

This network map shows Twitter users who are commonly followed by the followers of @TOGYnews.

Although hard to see at this scale, the map is actually constructed from labeled points connected by lines (in the jargon, “nodes connected by edges”).

The algorithm used to position the labeled nodes tries to place nodes that are heavily connected to each other close to each other. In a sense, we can view the diagram as a map, with regions that are highlighted using false colours identifying clusters of nodes that may in some sense be similar to each other based on the sharing of common followers.

Find the followers

[Diagram: accounts A and B linked to the “focus” account by “follows” / “is followed by” edges.]

The map is constructed using data grabbed from the Twitter API.

Using one or more “focus” users (a specific Twitter account, for example, or the set of users of a particular hashtag), we grab a list of their followers.

Find Friends of Followers

[Diagram: the focus, its followers A and B, and the “peer” accounts those followers follow.]

For each of the followers, we grab a list of their friends (or a sample thereof) – that is, a list of some or all of the people they follow on Twitter.

We can use this data to construct a network of people followed by the followers of the original focus.

It is typically at this point, where there is most relational information contained within the network, that we lay it out using automatic layout tools.

Find Common Friends of Followers

[Diagram: the focus, its followers A and B, and a “peer” account that the followers have in common.]

Drawing on the insight that people on Twitter are likely to follow accounts that are of interest to them, we can start to imagine the network as a projection of the interests of the people who are interested in one or more of the things the focus is associated with.

However, interests of followers may spread to a wide range of topics, so we look for consistency of interest, pruning the network to remove people who are not commonly followed by the followers of the focus. That is, we remove nodes who are followed by only a few of the followers of the focus.
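The pruning step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build the maps: the `friends_of` dictionary, the account names and the `min_followers` threshold are all made up for the example, standing in for data grabbed from the Twitter API.

```python
from collections import Counter

# Hypothetical data: each follower of the focus mapped to the accounts they follow.
friends_of = {
    "follower_a": {"peer_1", "peer_2", "peer_3"},
    "follower_b": {"peer_1", "peer_2"},
    "follower_c": {"peer_1", "peer_4"},
}

def commonly_followed(friends_of, min_followers=2):
    """Keep only accounts followed by at least `min_followers` of the followers."""
    counts = Counter()
    for friends in friends_of.values():
        counts.update(friends)
    return {account: n for account, n in counts.items() if n >= min_followers}

kept = commonly_followed(friends_of, min_followers=2)
print(sorted(kept.items()))
```

Raising `min_followers` gives a sparser map that concentrates on the accounts most of the followers have in common.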

Filter out not commonly followed

[Diagram: the focus and a commonly followed “peer” account.]

Having laid out the network map, we might now tidy it up a little by removing all the nodes that are not themselves followed by a significant number of the followers of the original focus.

Emergent Social Positioning

The result is a map that shows groups of people positioned according to the shared, presumed interests of their followers.

A More Principled Approach

It may also be possible to use metadata associated with social networks to develop additional insights.

A recent paper describes one way of mining social network data for information about people working for a particular company, and using public biographical information along with social connection data to map out the organisational structures of large companies.

Corporate Structure Maps

Introducing “Graphs”

A more principled way of looking at corporate structures at a company level may be to derive them from publicly available corporate information.

Companies & Directors

[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3.]

For example, if we can get hold of directorial appointment and termination data, we can start to construct maps that show how companies are connected by common directors, as well as which companies are co-directed by particular directors.

As with the emergent social positioning network maps, if particular directors have particular corporate interests, we may be able to identify particular organisational groupings in corporate sprawls made up from dozens of operating companies working across a range of business areas.

Company Records on OpenCorporates

One possible source of open company information is OpenCorporates.

OpenCorporates’ ambitious aim is to mint a unique corporate identifier for every corporate legal entity in the world [CHECK], as well as collating and normalising (or “harmonising”) information about company filings, trademarks, patents(?) and officers (that is, company directors, company secretaries and so on).

For GB registered companies, there is a growing repository of data relating to company directorships, which provides us with an opportunity to develop maps that show how companies are connected by virtue of having common directors.

Subsidiary Companies have “working” directors

Just a note – my experience in looking at data related to GB registered companies suggests that the directors of the “top”/nominal company in a large multinational grouping are “atypical” compared to the officers appointed to UK based operating companies in the same corporate sprawl, being appointed from the great and the good, or from senior officers who do not take directorships across operating divisions or companies, rather than representing directors of operating companies.

When seeding corporate sprawl trawlers – algorithms that try to identify companies that make up a corporate sprawl based on co-directorships – my experience suggests that it often makes sense to seed the search with one or more operating companies whose directors are likely to be directors of other operating companies, rather than with the “top level” company.

Co-Director Mapping

More Graphs

We can reuse the ideas that underpin the construction of the emergent social positioning graph to map out corporate structures based on director information.

Director Records on OpenCorporates

As well as corporate information pages, OpenCorporates maintains information pages about directorial appointments. At the moment, there are no authority files providing identifiers that identify the same physical person – each directorial appointment to a company gives the director a unique officer ID. It is possible to search for officers of other companies with the same name as a particular director, but there are no identifiers that link them as the same physical person. (That said, there does appear to be a slot in the metadata for authoritative identifiers.)

Start With One or More Seed Company

So how might we go about constructing a corporate sprawl?

Let’s start with one or more seed companies.

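[Diagram: seed company C1 linked by “has director” edges to directors D1 and D2; the slide reuses the layout and labels of the earlier “Find Friends of Followers” diagram.]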

The general shape of this diagram might remind you of something…?

For each of the seed companies, we grab a list of their directors.

We can use this data to construct a network of people who are directors or other officers of the original seed company or companies.

Find Directors of Seed Company(s)

Here’s another way of imagining it – a company surrounded by its directors.

[Diagram: companies C1 and C2 and directors D1 and D2, linked by “has director” and “is directed by” edges.]

For each of the directors, we run a search for them on OpenCorporates, to see what directorial appointments have been made to other companies for people of exactly the same name.

We can use this data to construct a network of companies directed by the directors of the original seed company.

For those companies that are directed by N or more of the directors associated with the seed company or companies (where N is typically 2) we might now say they are part of the corporate sprawl. The companies sharing fewer than N directors associated with companies admitted to the corporate sprawl are added to a list of possible candidate companies. As we find more directors associated with companies included in the sprawl, we might be able to “legitimise” membership of these companies within the sprawl.
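One way of sketching this membership rule in Python is shown below. It is an illustration under stated assumptions, not the scraper's actual code: the `appointments` dictionary stands in for the results of name searches on OpenCorporates, and the company and director labels are invented.

```python
from collections import defaultdict

# Hypothetical search results: director name -> companies that have an
# officer with exactly that name.
appointments = {
    "D1": {"C1", "C2"},
    "D2": {"C1", "C2", "C3"},
    "D3": {"C1", "C4"},
}

def classify(known_directors, appointments, sprawl, n=2):
    """Split companies into sprawl members (n or more known directors)
    and candidates (fewer than n known directors)."""
    shared = defaultdict(set)
    for director in known_directors:
        for company in appointments.get(director, ()):
            shared[company].add(director)
    members = set(sprawl) | {c for c, ds in shared.items() if len(ds) >= n}
    candidates = {c for c in shared if c not in members}
    return members, candidates

members, candidates = classify({"D1", "D2", "D3"}, appointments, sprawl={"C1"})
print(sorted(members), sorted(candidates))
```

Here C2 is admitted because it shares two directors with the seed company, while C3 and C4 stay on the candidate list until more of their directors turn up elsewhere in the sprawl.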

Find Companies With Two or More Seed Directors

We now have a larger set of companies, reflecting those companies who share N or more directors with the original seed company or companies.

[Diagram: companies C1 and C2 linked by “has director” edges to directors D1, D2 and D3.]

If we so decide, we can continue with this snowball discovery process, looking up further directors associated with companies we have included in our sprawl, with a view to trying to discover more companies that should be included in the sprawl.
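The snowball process might be sketched as a simple loop. The following is a toy version under loud assumptions: `company_directors` is an invented in-memory table, and `directors_of` and `companies_of` are hypothetical helpers standing in for OpenCorporates lookups (no real API calls are made).

```python
from collections import defaultdict

# Invented data standing in for OpenCorporates company and officer records.
company_directors = {
    "C1": {"D1", "D2"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
    "C4": {"D3"},
}

def directors_of(company):
    """Hypothetical lookup: the directors of a company."""
    return company_directors.get(company, set())

def companies_of(director):
    """Hypothetical lookup: companies with a director of exactly this name."""
    return {c for c, ds in company_directors.items() if director in ds}

def snowball(seeds, n=2, max_rounds=10):
    """Grow a sprawl from seed companies, admitting any company that shares
    n or more directors with companies already in the sprawl."""
    sprawl, candidates = set(seeds), set()
    for _ in range(max_rounds):
        known = set().union(*(directors_of(c) for c in sprawl))
        shared = defaultdict(set)
        for director in known:
            for company in companies_of(director):
                shared[company].add(director)
        admitted = {c for c, ds in shared.items() if len(ds) >= n} - sprawl
        candidates = {c for c in shared if c not in sprawl | admitted}
        if not admitted:
            break  # nothing new found, so the sprawl has stopped growing
        sprawl |= admitted
    return sprawl, candidates

sprawl, candidates = snowball({"C1"})
print(sorted(sprawl), sorted(candidates))
```

The `max_rounds` cap is a safety net I have added for the sketch: in a real trawl, matching on names alone can pull in unrelated companies, so it makes sense to limit how far the snowball rolls.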

Using this snowball approach, I have constructed a scraper on Scraperwiki that mines OpenCorporates, given one or more seed companies (or seed directors) to map out corporate sprawls, limiting myself to the capture of current directors and active companies registered in the UK.

(The code needs checking and is perhaps not as easy to use as it might be. Developing a more robust and user friendly tool may be worth exploring if this approach is seen to be useful.)

Companies & Directors

[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3.]

So – we can generate a network that connects companies with their directors, and grow this network out to identify companies that share several directors.

As with the emergent social positioning map, we can use automatic layout tools to try to position companies and directors close to each other based on their connectivity, producing a map over the corporate sprawl.

Companies

[Diagram: companies C1, C2 and C3 only.]

We can view this network in various ways. For example, we might choose to view just the companies.

PageRank

This map shows companies in a corporate sprawl grown out from Royal Dutch Shell.

Note the presence of BP in there – somehow, these two groupings are connected by shared directorships of some intermediate company.

Companies & Directors

[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3.]

One of the nice things about representing this sort of structure in an abstract mathematical or computational way is that we can wrangle it with code...

So for example, companies C1 and C2 are connected by a single shared director, whereas C2 and C3 are connected by two directors.

Companies Sharing Directors

[Diagram: companies C1, C2 and C3 connected by weighted edges.]

We can represent this by transforming the original bipartite (two types of node) graph, which connects directors to companies and companies to directors, into a graph that connects only companies, linking any two companies that share one or more directors.

The thickness of the line (or “edge”) connecting the companies represents its “weight”, which in this case is given by the number of shared directors between connected companies.
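As a rough sketch of this projection in Python: the director assignments below are my own invention, chosen so that they agree with the example in the notes (C1 and C2 share one director, C2 and C3 share two); the actual assignments behind the slide are not given.

```python
from itertools import combinations

# Assumed bipartite data: each company mapped to its set of directors.
company_directors = {
    "C1": {"D1"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
}

def project_companies(company_directors):
    """Company-to-company edges, weighted by the number of shared directors."""
    weights = {}
    for a, b in combinations(sorted(company_directors), 2):
        shared = company_directors[a] & company_directors[b]
        if shared:
            weights[(a, b)] = len(shared)
    return weights

company_weights = project_companies(company_directors)
print(company_weights)
```

Companies with no directors in common (C1 and C3 here) simply get no edge.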

Companies Sharing Two or More Directors

[Diagram: companies C2 and C3.]

We can also filter the graph, for example by adding together the weights of all the edges incident on a node, and throwing away all nodes for whom this sum is below a specified threshold value.

We might alternatively prune the network by removing (“cutting”) all edges below a specified weight, and then throwing away nodes that aren’t connected to other nodes. (For example, we might remove connections between companies that only share a single director, and then throw away companies that aren’t connected to any other companies. Which is to say, we cut out companies that don’t share two or more directors with any other single company. When you start working with graphs, you begin to realise quite how beautiful, and powerful, a way they are for working with data elements that are related to each other in some way.)
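Both filters are easy to sketch in Python over a dictionary of weighted edges. The edge weights below are assumed for illustration (C1 and C2 sharing one director, C2 and C3 sharing two), and the threshold values are arbitrary.

```python
from collections import Counter

# Assumed company-to-company edges, weighted by number of shared directors.
weights = {("C1", "C2"): 1, ("C2", "C3"): 2}

def filter_by_strength(weights, threshold):
    """Keep nodes whose summed incident edge weight reaches the threshold."""
    strength = Counter()
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    return {node for node, s in strength.items() if s >= threshold}

def cut_light_edges(weights, min_weight=2):
    """Cut edges below min_weight, then drop nodes left with no edges."""
    kept_edges = {edge: w for edge, w in weights.items() if w >= min_weight}
    kept_nodes = {node for edge in kept_edges for node in edge}
    return kept_edges, kept_nodes

strong_nodes = filter_by_strength(weights, threshold=2)
kept_edges, kept_nodes = cut_light_edges(weights, min_weight=2)
print(sorted(strong_nodes), kept_edges, sorted(kept_nodes))
```

On this tiny example both filters leave just C2 and C3, but on a real sprawl they can differ: a company with many single-director links can survive the first filter and still be cut by the second.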

PageRank

Here’s an example of the Shell corporate sprawl with the directors removed and edges connecting companies that share two or more directors. The labels are sized relative to the PageRank score of each node, which is a measure of how well connected the node is in the graph (the “importance” of each node depends on the “importance” of the nodes connected to it).
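To give a feel for how a PageRank score arises, here is a minimal power-iteration sketch for a small weighted, undirected graph. It is a generic textbook-style version, not necessarily how the scores in this view were computed; the damping factor of 0.85 is the conventional default and the edge weights are assumed.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank over undirected, weighted edges."""
    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    nodes = sorted(adj)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            total = sum(adj[node].values())
            # Each node passes its rank to neighbours in proportion to edge weight.
            for neighbour, w in adj[node].items():
                new_rank[neighbour] += damping * rank[node] * w / total
        rank = new_rank
    return rank

# Assumed edges: C1 and C2 share one director, C2 and C3 share two.
ranks = pagerank({("C1", "C2"): 1, ("C2", "C3"): 2})
print(max(ranks, key=ranks.get))
```

C2 comes out on top because both other companies feed into it, and C3 outranks C1 because its link to C2 is the heavier one.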

The lines also provide a background that highlights the connectivity, and structure, of the corporate elements.

Betweenness

In this view, I have resized the labels based on the betweenness centrality of each node. This network statistic highlights nodes that play an important role in connecting clusters or groupings of nodes. So for example, we see the suggestion that The Consolidated Petroleum Company and Shell Mex and BP Limited may be the companies that connect the Shell sprawl to the BP one.
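The idea of a node that “connects clusters” can be illustrated with a small Python sketch. It uses Brandes’ algorithm for unweighted graphs, which is my choice for the illustration and not necessarily how the scores in this view were computed; the two three-company clusters and the bridging company `x` are invented.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness centrality for an unweighted, undirected graph
    given as node -> set of neighbours (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected path was counted once from each end.
    return {v: b / 2 for v, b in score.items()}

# Invented example: two tightly knit clusters joined through company "x".
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "x"),
         ("x", "b1"), ("b1", "b2"), ("b1", "b3"), ("b2", "b3")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = betweenness(adj)
print(max(scores, key=scores.get))
```

The bridging company scores highest because every shortest path between the two clusters has to pass through it, which is the sort of role suggested for the companies that appear to connect the Shell and BP groupings.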

Betweenness (repositioned)

This is just a tweaking of the layout of the previous graph to try to highlight the separation of the different clusters.

Companies & Directors

[Diagram: companies C1, C2 and C3 linked to directors D1, D2 and D3.]

Just as we collapsed the network to show how companies could be linked directly by virtue of co-directorships, so we can collapse the network to show how directors are connected.

For example, director D1 is connected by a single shared company to directors D2 and D3, whereas D2 and D3 are connected by two companies.

Co-Directors

[Diagram: directors D1, D2 and D3 connected by weighted edges.]

Once again, we use line thickness (that is, edge weight) to denote how heavily connected directors are.
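The director-to-director view can be sketched by counting, for every pair of directors, the companies on which they sit together. The board memberships below are assumed, chosen to agree with the example in the notes (D1 shares one company with each of D2 and D3, while D2 and D3 share two).

```python
from collections import defaultdict
from itertools import combinations

# Assumed data: each company mapped to its set of directors.
company_directors = {
    "C1": {"D1"},
    "C2": {"D1", "D2", "D3"},
    "C3": {"D2", "D3"},
}

def co_director_weights(company_directors):
    """Director-to-director edges, weighted by number of shared companies."""
    weights = defaultdict(int)
    for directors in company_directors.values():
        for a, b in combinations(sorted(directors), 2):
            weights[(a, b)] += 1
    return dict(weights)

director_weights = co_director_weights(company_directors)
print(director_weights)
```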

PageRank

Here’s a view over connected directors in the Shell corporate sprawl.

[Workflow diagram: OpenCorporates, Scraperwiki db, JSON, D3.js, Networkx, Gexf, Gephi, sigma.js.]

As to how we get those graphs plotted? I built a crude workflow in Scraperwiki that gets data out of the scraped database and into a form that allows it to be visualised using the Gephi desktop tool or in a web page using different Javascript libraries (sigma.js or d3.js).
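As a hedged illustration of the JSON step, the sketch below turns a dictionary of weighted edges into a “nodes and links” structure of the kind commonly used in d3.js force-layout examples. The field names (`name`, `source`, `target`, `value`) follow that common convention and are an assumption; the exact format my workflow emits may differ.

```python
import json

def to_node_link(weights):
    """Convert {(a, b): weight} edges into a nodes/links dictionary,
    with links referring to nodes by their position in the node list."""
    names = sorted({node for edge in weights for node in edge})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"name": name} for name in names],
        "links": [
            {"source": index[a], "target": index[b], "value": w}
            for (a, b), w in sorted(weights.items())
        ],
    }

# Assumed edges: C1 and C2 share one director, C2 and C3 share two.
graph = to_node_link({("C1", "C2"): 1, ("C2", "C3"): 2})
graph_json = json.dumps(graph, indent=2)
print(graph_json)
```

For the Gephi route, a graph library that can write GEXF does the equivalent job, so the same edge data can feed both the desktop and the web views.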

This is Gephi – a cross-platform desktop tool that’s great for generating effective network visualisations. I have some tutorials and sample datasets if anyone wants to give it a whirl…

“Where” Next…?

- geocode registered addresses
- explore non-GB registered companies

So where can we take the OpenCorporates data next?

I have a couple of ideas:

- we can go spatial in a geographical sense and start to geocode the registered addresses of companies, to see whether any of them are located in offshore tax havens, for example, or to see whether there are different registered addresses that might lead us to yet more companies (by virtue of sharing common registered office addresses, rather than co-directors, for example);
- we could start trying to tie non-GB registered companies into the mix. At the moment, director information for other territories is sparse – might there be some other way we can look for connections?

And “When”?

- company timelines (set-up dates, renaming)
- explore director timelines (by company)
- explore director timelines (by director)

Another approach might be to start analysing corporate sprawls in a time dimension. There are several opportunities here:

- if we have access to company formation and dissolution dates, we can map out a timeline of a corporate sprawl, which might reveal how companies change name, directorship or association with other companies;
- if we get all the director information associated with a company, we can visualise how director appointments and terminations occurred across one or more companies, which might in turn reveal identifiable “features” that we might be able to associate with news or business restructuring events;
- if we track down companies a particular director appears to be associated with, we can start to develop “career timelines” of directors, showing how they have been associated with different corporate groupings over time (and maybe the odd company on the side…)

Linking out and in

- linking companies or directors with external datasets

Whilst it is possible to generate insight from the analysis of data that is contained just within OpenCorporates, there are likely to be many opportunities for using OpenCorporates to annotate other datasets, or for using external datasets to annotate OpenCorporates data.

Sankey Flow Diagrams

As this example starts to explore, we might try to reconcile company names as recorded in local spending data records with corporate entities identified within OpenCorporates to build up a better picture of how money flows into corporate sprawls.
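A very crude form of name reconciliation can be sketched as follows. Everything here is invented for illustration: the supplier names, the amounts, the `registry` lookup and its identifier, and the small `SUFFIXES` table. Real reconciliation needs rather more care (fuzzy matching, former names, manual checking).

```python
import re
from collections import defaultdict

# Invented equivalences between common company-name suffixes.
SUFFIXES = {"limited": "ltd", "company": "co"}

def normalise(name):
    """Lower-case, strip punctuation and standardise common suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower().replace("&", " and "))
    return " ".join(SUFFIXES.get(word, word) for word in cleaned.split())

# Hypothetical registry entry (name -> identifier) and spending records.
registry = {"Example Widgets Limited": "gb/00000001"}
spending = [
    ("EXAMPLE WIDGETS LTD.", 1200.0),
    ("Example Widgets Limited", 300.0),
    ("Unknown Supplier", 50.0),
]

lookup = {normalise(name): ident for name, ident in registry.items()}
totals, unmatched = defaultdict(float), []
for supplier, amount in spending:
    ident = lookup.get(normalise(supplier))
    if ident is None:
        unmatched.append(supplier)
    else:
        totals[ident] += amount

print(dict(totals), unmatched)
```

Summing the matched payments per company, and then per sprawl, gives the sort of flows a Sankey diagram can display.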

On a lobbying front, we might look for mentions of meetings between government officials and company officers, and then try to make mappings between government departments and operational areas of a corporate sprawl, and so on.

What do you think?

[ This is part of an ongoing informal exploration of the patterns and structures we can find across large open datasets.

For more information, follow:

- blog.ouseful.info
- @psychemedia

All comments welcome. ]