Local News Engine
Less time, more scrutiny
Problem - more local scrutiny data than we can keep up with
1 - solo reporters/journalists/bloggers
2 - papers with (sadly) diminishing editorial staff
3 - civil society
4 - councils themselves
The trends aren’t going to reverse
Massive increase in local accountability (devolution)
Increases in local data
People won’t scrutinise this stuff themselves
Armchair auditors didn’t work
Resources in media declining
Bad for communities, democracy
How can we do more scrutiny with fewer resources?
Indexed by subject Indexed by name
Local News Engine
Prototype funded by Google Digital News Initiative – €50,000
AT LAST I CAN BUY IN PROPER CODERS
Open Data Services Co-operative – world class
Pile up the newsworthy scrutiny data
My patch covers parts of two central London boroughs - Camden (very large) and Islington (very small)
Data about building or altering houses, opening or changing pubs, bars and clubs, sex shops, gambling establishments, people due to be in court.
Camden planning applications - data store download
Camden commercial licensing - scraped
Islington planning applications - scraped
Islington commercial licensing - scraped
Magistrates Court list (upcoming cases) - parsed from pdf to data
This is novel (we think)
Datastore v scrapers – no contest
Compare the time that the scrapers take to run and the range of data they include. To speed up the scrapers and to ensure that the data was comparable, we spun up some VMs on Google Compute Engine to run them.
Camden License: 38.4h runtime, data back to 2005
Camden Planning: 2 min runtime, data back to 2010
Islington License: 39.5h runtime, data back to 2006
Islington Planning: 16.2h runtime, data back to 2006
Sort out the newsworthy people
By names - a newsworthy person appearing in a newsworthy data set could be newsworthy. Taken (very) literally, everyone who has been in the newspaper is newsworthy.
Performed entity extraction on Camden New Journal and Islington Tribune, producing all the names of people and companies who had appeared in them.
Run geospatial search for all data with addresses in target area
Run list of 1,000-odd names from entity extraction as a search
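The name search step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the extracted names are already available as a plain list, and the `applicant` field, the record layout and the `find_name_matches` helper are hypothetical. The entity extraction itself is not shown.

```python
import re

def normalise(name: str) -> str:
    """Lower-case a name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def find_name_matches(names, records, field="applicant"):
    """Return (name, record) pairs where a known name appears in the record's field.

    `field` is a hypothetical column name; real datasets will differ.
    """
    wanted = {normalise(n): n for n in names}
    matches = []
    for record in records:
        # Pad with spaces so we only match whole words, not fragments.
        text = f" {normalise(record.get(field, ''))} "
        for key, original in wanted.items():
            if key and f" {key} " in text:
                matches.append((original, record))
    return matches

# Illustrative data only.
names = ["Cherry Smith", "Example Holdings Ltd"]
records = [
    {"id": 1, "applicant": "Mrs Cherry Smith"},
    {"id": 2, "applicant": "A. N. Other"},
    {"id": 3, "applicant": "EXAMPLE HOLDINGS LTD."},
]
matches = find_name_matches(names, records)
```

Exact whole-word matching like this is crude: it will miss initials and misspellings, and it will happily match two different people who share a name.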
My Kings Cross patch covers bits of two London Boroughs
Sort newsworthy places
By place - simple things happening in some places are news in their own right - eg a planning application or someone in court.
Users have a wide definition of what is an interesting place - for some the whole borough, for others a particular ward/street
All the data has reasonable address information
Define area of interest by wards (for now - this can be made more precise, down to SOAs)
Perform geospatial search
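One common way to run a geospatial search against a boundary is a point-in-polygon test. The sketch below assumes each record has already been geocoded to a longitude and latitude, which the slides do not confirm. The "ward" polygon is a made-up rectangle, not a real ward boundary, and `point_in_polygon` is an illustrative helper.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical, simplified boundary: a rectangle, not a real ward.
ward = [(-0.13, 51.53), (-0.11, 51.53), (-0.11, 51.54), (-0.13, 51.54)]
records = [
    {"id": "A", "lon": -0.12, "lat": 51.535},  # inside the rectangle
    {"id": "B", "lon": -0.20, "lat": 51.50},   # outside the rectangle
]
in_area = [r["id"] for r in records if point_in_polygon(r["lon"], r["lat"], ward)]
```

Swapping the ward polygon for a smaller one is all it takes to narrow the area of interest.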
Data Issues - DPA exemptions
Data is published by arms of government for public scrutiny. Special purposes exemption in DPA covers processing:
‘This exemption protects freedom of expression in journalism, art and literature (which are known as the ‘special purposes’).
The scope of the exemption is very broad and it can exempt from most provisions of the DPA, including subject access – but never principle 7 or the section 55 offence (unlawful obtaining etc of personal data).
However it does not give an automatic blanket exemption. In order for the exemption to apply:
the data must be processed only for journalism, art or literature,
it must be being processed with a view to publication,
you must have a reasonable belief that the publication is in the public interest, and
you must have a reasonable belief that compliance with the DPA is incompatible with journalism, art or literature.’
Data issues - access and licensing
Council data mainly had to be scraped - only one dataset in a modern data store.
Data therefore not properly licensed - asked the council, they relaxed
Court PDFs can be accessed by a journalist with good reason, but each court varies. Court information is very sensitive - it contains juveniles, cases with reporting restrictions etc. Must be handled with great care - risk of contempt and no-fault defamation.
British principle of open justice behind access, but poorly implemented.
Issues and questions
Ethics (for citizens) – extension of journalism ethics as scrutiny becomes
Despite open data, accessibility of local data is rubbish
Still requires good coding skills – ODS world class – code on Talk About Local GitHub
Court lists – Japanese puffer fish of data
Sorting Criteria (emerging)
Names
Broadly based on proper noun (‘named entity’) extraction from CNJ and Islington Tribune and people who crop up more than once.
‘A name will appear in the search results if ANY of its related entries match the search criteria. So, in the case of the SMITH record, Mrs Cherry Smith had a planning application in Caledonian ward in 2006 to cut down a tree, hence the match.
We’ve got a couple of ideas for solutions. One is to show on the result the date of the most recent match; another is to expose a date UI. They are of differing complexities, though.’
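The "ANY related entry" rule, and the idea of showing the date of the most recent match, might look something like this. It is a sketch under assumptions: the entry fields (`ward`, `date`, `type`), the exact dates and the second entry are invented for illustration, and the helper names are hypothetical.

```python
from datetime import date

def name_matches(entries, criteria):
    """A name matches if ANY of its related entries satisfies the criteria."""
    return any(criteria(e) for e in entries)

def most_recent_match(entries, criteria):
    """Date of the newest matching entry, or None if nothing matches."""
    dates = [e["date"] for e in entries if criteria(e)]
    return max(dates) if dates else None

# Illustrative entries for one name; only the 2006 Caledonian ward
# planning application comes from the slides, the rest is made up.
smith_entries = [
    {"ward": "Caledonian", "date": date(2006, 5, 1), "type": "planning"},
    {"ward": "Elsewhere", "date": date(2016, 3, 1), "type": "planning"},
]

def in_caledonian(entry):
    return entry["ward"] == "Caledonian"

matched = name_matches(smith_entries, in_caledonian)
latest = most_recent_match(smith_entries, in_caledonian)
```

Displaying `latest` next to the result would let a reader see at a glance that the match rests on a decade-old entry.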
Areas
‘We try to get one or more locations associated with a result (eg address of defendant, location of crime, location of planning application) and if one or more of those locations is either in the postcode prefix list "N1", "N7", "WC1", "NW1" or the words “islington” or “camden” appear in a field that we think might contain a description of a location, it matches.’
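The area rule quoted above can be sketched as below. The postcode prefixes and the two area words come from the quote; everything else is assumed, including the `postcode` and `description` field names and the choice to treat "N1" as the district N1 (so that N10 or N19 do not match), which the quote does not spell out.

```python
import re

POSTCODE_PREFIXES = ("N1", "N7", "WC1", "NW1")
AREA_WORDS = ("islington", "camden")

def outward_code(postcode):
    """Return the outward part of a UK postcode, e.g. 'N1 9AL' -> 'N1'.

    Relies on the inward part always being the last three characters.
    """
    compact = re.sub(r"\s+", "", postcode.upper())
    return compact[:-3] if len(compact) > 3 else compact

def postcode_in_area(postcode):
    """True if the outward code starts with a listed prefix not followed by a digit."""
    out = outward_code(postcode)
    return any(re.match(rf"{p}(?!\d)", out) for p in POSTCODE_PREFIXES)

def result_matches_area(locations):
    """A result matches if ANY of its locations is in the area of interest.

    `locations` is a list of dicts with optional 'postcode' and 'description'
    keys (hypothetical field names).
    """
    for loc in locations:
        postcode = loc.get("postcode")
        if postcode and postcode_in_area(postcode):
            return True
        description = loc.get("description", "").lower()
        if any(word in description for word in AREA_WORDS):
            return True
    return False
```

For example, `result_matches_area([{"postcode": "SE1 7PB"}, {"description": "Flat 2, Camden Road"}])` matches on the description alone. Plain substring matching on "camden" or "islington" will also pick up addresses elsewhere that merely contain those words, so some false positives are to be expected.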