Date post: | 26-Jan-2015 |
Category: |
Education |
Upload: | tony-hirst |
Taking this opportunity to explore some of the issues associated with whatever this thing called “data journalism” is…
1
I’m not a journalist, and don’t have any form of journalism training. But I do have an interest in ICT, and from that have an interest in “communication”. Let’s start with an easy(?!) question – what is journalism? One way of answering that question is to list some of the functions, or attributes, associated with it – informing, educating, holding to account, watchdog function, campaigning, contextualising for a particular audience.
2
Sensemaking seems to me to be an important part of it… In part contextualisation, in part identifying the bits that make the difference, the bits that make it important, the bits that make it news that people need to know… …and often with a particular audience in mind.
3
Critical judgement.
4
Second question: what is data? National statistics, sports results, polls, financial figures, health data, school league tables, etc etc. Is a book data? Or a speech? What if I split a speech up into separate words, count the occurrence of each unique word and then display the result as a “tag cloud”, or word frequency diagram?
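A minimal sketch of that word-counting step, in Python. The sample sentence and the function name `word_frequencies` are made up for illustration; a tag cloud tool would then draw each word at a size that depends on its count.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each unique word occurs in a piece of text."""
    # Lowercase the text and pull out runs of letters (and apostrophes),
    # so that "Data" and "data." are counted as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample "speech", invented for this example.
speech = "Data is a source. A source can answer questions, if we ask the data."
counts = word_frequencies(speech)

# The most frequent words are the ones a tag cloud would draw largest.
for word, n in counts.most_common(3):
    print(word, n)
```

Once the speech has been turned into a table of words and counts, it can be questioned like any other dataset.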
5
One way of thinking about data is that it is a particular sort of source, or a source that can respond to a particular style of questioning in a particular way. Another take on this is that many “data sources” are experts on a particular topic, experts that know a lot of a very particular class of facts.
6
7
So what is data journalism? If I were to ask you, the members of a school of journalism, “is this or that news article ‘journalism’?” I imagine one response might be, “well… it’s the output of a journalistic process.” But if I point at a map with some markers on it and ask: “is this map ‘data journalism’?”, you might answer: yes. Or at least, that’s what many of the early job ads for data journalists implied.
8
Sports journalism has sport as the topical contextual frame for some journalistic activity; political journalism has politics as the topical contextual frame for some journalistic activity; investigative journalism has a particular process as the contextual frame for some journalistic activity, a process that may be applied to particular topic areas. So for data journalism, does “data” relate to the topic or the process? Where we focus on data outputs, the implication is that the “topic” of data is the focus of the framing. But I think we need to reframe to consider the procedural role.
9
So as a starting point, let’s frame the idea that data journalism is a process-related epithet that implies one of the key sources in a journalistic activity is “data”.
10
11
By focusing on this notion of data journalism as relating to process, we can then start to explore with a little bit more criticality what the practice of data journalism might involve that identifies it as such. That is, how is practice influenced by the fact that it must engage with “data as a source”?
12
The inverted pyramid gives us one way of considering the data journalistic process, or at least identifying some of the steps involved in a data investigation. But there are many other ways of conceptualising the process – for example, finding stories and telling stories…
13
When it comes to finding stories, do we: a) want to find stories in a dataset we are provided with, or b) use data to help draw out a story lead we have already been tipped off to?
14
Anscombe’s Quartet is a toy dataset that first appeared in a 1973 paper by statistician Francis Anscombe. His paper – Graphs in Statistical Analysis – was based around the claim that “graphs are essential to good statistical analysis”.
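To see why the quartet supports that claim, here is a small Python sketch that summarises each of the four groups. The values are the quartet as it is commonly reproduced; they are typed in here rather than taken from the slides, and the `pearson` helper is written for this example only. The four groups come out with near-identical means and correlations, yet they look completely different when plotted.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet as commonly reproduced (transcribed by hand for this sketch).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# One (mean x, mean y, correlation) summary per group.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The summary numbers alone cannot tell the groups apart; only a chart of each group shows the line, the curve and the outliers.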
15
But this is where we start to hit some stumbling blocks.
16
And a big stumbling block is one that is often denied in higher education, which is the provision of skills, as compared to “higher level conceptual or academic understanding”. There is an old saw that we become better writers through reading more. But how much time do you invest in reading charts? Really reading them? I came across this beautifully titled book a few weeks ago – “Making Sense of Squiggly Lines”. The blurb on the back summarises the situation well: “Data points are just words, but when connected with a squiggly line they tell a story”.
17
18
In an ideal world, the process would be simple: have data, get story.
19
But it’s not that simple. It’s more likely that we need to engage with the dataset to try to tease the stories out of it, or facts and relationships from it that we can use to support the claims we make in a narration of some sort of story that is at least supported by the data, or contextualises it in a narrative way that is hopefully “truthy”.
20
One of the ways I like to work with data is to have a conversation with it – asking questions of it and then further questions based on the responses I get.
21
Sometimes it looks at first as if we have data in a form where we might be able to do something with it – then we realise it needs cleaning and reshaping. For example, in this case we have percentage signs contaminating numbers, data organised in separate sections (so how do we get a “well behaved” view over data from all the wards?), and different sorts of data: votes polled per candidate versus the size of the electorate in a particular ward, for example. Walkthrough: http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/
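The linked walkthrough uses OpenRefine. As a rough illustration of the same two chores in Python (stripping the percentage signs, and pulling the separate ward sections into one table), here is a sketch. The ward names, candidates and figures are all invented, and `clean_number` is a hypothetical helper, not part of any tool mentioned here.

```python
def clean_number(cell):
    """Turn a string such as '45.2%' or '1,204' into a float."""
    # Drop surrounding spaces, a trailing percent sign and thousands separators.
    cleaned = cell.strip().rstrip("%").replace(",", "")
    return float(cleaned)


# Invented example data: one section per ward, as it might appear in a
# published results sheet, with numbers still stored as messy strings.
sections = {
    "Ward A": [("Candidate 1", "1,204", "45.2%"), ("Candidate 2", "1,460", "54.8%")],
    "Ward B": [("Candidate 3", "980", "61.0%"), ("Candidate 4", "627", "39.0%")],
}

# One row per candidate, with the ward carried along as a column, gives a
# single "well behaved" table covering all the wards.
rows = [
    {
        "ward": ward,
        "candidate": name,
        "votes": clean_number(votes),
        "share": clean_number(share),
    }
    for ward, results in sections.items()
    for name, votes, share in results
]

for row in rows:
    print(row)
```

With every row in the same shape, questions such as "who polled the largest share in any ward?" become a simple sort or filter.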
22
23
24
Tidying data – or cleaning data – or more colloquially, “wrangling data” – refers to the process we need to engage in to turn a dataset we have found into one that is useable. Many published datasets are horrible. Really horrible. They don’t work as we might want or expect them to in the applications we tend to have to hand.
25
Take producing data visualisations, for example: have data, produce visualisation. No. That’s like saying: have two hours of rambling conversation with source, have 200 word story with strong quotes. No. Just: no. It doesn’t work like that. Yes, there are powerful charting tools available BUT they require the data to be clean and tidy and to be in the right shape for the tool. But it typically isn’t.
26
We have to wrangle it. Now wrangling is a technical job, and arguably a job for technicians – higher apprentices of the journalistic world – not graduate journalists. But I think our journalists are going to have to learn the equivalent of some machining in the mechanical world.
27
Just by the by, I didn’t draw those block diagrams, I wrote them.
28
I “wrote” these charts – you can see how at the top. That code is applied to a suitably shaped version of a dataset known as Anscombe’s Quartet. The data has been reshaped to a 3-column format: a column for the x values, which are plotted on the horizontal x-axes; a column for the y values, which are plotted on the vertical y-axes; and a column for the groups, which specify which panel, or facet, each point should be plotted in. The code defines the construction of those charts. Exactly. There is no magic. At least, no other magic.
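The code on the slide is not reproduced in these notes, so what follows is not it. It is a Python sketch of just the reshaping step, assuming the data starts out “wide”, with one pair of columns per group (the column names `x1`, `y1`, `x2`, `y2` and the function `to_long` are assumptions made for this example). Only the first two rows of the first two groups are shown.

```python
# A "wide" table: each row holds one point from every group side by side.
wide_rows = [
    {"x1": 10, "y1": 8.04, "x2": 10, "y2": 9.14},
    {"x1": 8, "y1": 6.95, "x2": 8, "y2": 8.14},
]


def to_long(rows, groups):
    """Reshape wide rows into (x, y, group) rows, one point per row."""
    long_rows = []
    for group in groups:
        for row in rows:
            long_rows.append(
                {"x": row[f"x{group}"], "y": row[f"y{group}"], "group": group}
            )
    return long_rows


long_rows = to_long(wide_rows, groups=[1, 2])

for row in long_rows:
    print(row)
```

In this long shape a charting tool can be told, in one short statement, to put x on the horizontal axis, y on the vertical axis, and to draw one panel per value of group.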
29
One of the first datasets I played with was MPs’ expenses data. Here are a couple of ways I started to chat with it – imagine talking to someone who knows about *all* the expenses claims put in by every MP over a parliamentary session… (The charts were created using an online interactive tool developed by IBM called Many Eyes.) The bar chart is ordered, for a particular expenses area, by total amount for each individual MP. The block histogram shows how many MPs made a total claim in a particular expenses area of a particular binned value. (A ‘bin’ is a range.)
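The two questions behind those charts can be sketched in a few lines of Python. The MP labels and amounts below are invented, and the bin width of 5000 is an arbitrary choice for the example; this is not how Many Eyes worked internally, just the same two views of the data.

```python
from collections import Counter

# Invented totals claimed by each MP in one expenses area.
totals = {"MP A": 4200, "MP B": 11850, "MP C": 7300, "MP D": 11200, "MP E": 3900}

# Question 1 (ordered bar chart): who claimed most in this area?
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Question 2 (block histogram): how many MPs fall in each range of totals?
BIN_WIDTH = 5000  # arbitrary width, chosen for this example


def bin_start(amount, width=BIN_WIDTH):
    """Return the lower edge of the bin (range) that an amount falls in."""
    return (amount // width) * width


histogram = Counter(bin_start(amount) for amount in totals.values())

for name, amount in ranked:
    print(name, amount)
for start in sorted(histogram):
    print(f"{start} to {start + BIN_WIDTH - 1}: {histogram[start]} MPs")
```

The ranking answers "who?", while the histogram answers "how typical is that?", which is often the follow-up question worth asking.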
Critical judgement – it applies to data too...
31
One of the things to mention about data mapping and visualisation techniques is that they often tell us things we already (think we) know; in that sense, they are not news. But they may also tell us things we know in new, visually appealing ways. And by making use of such ‘confirmatory’ visualisations and displays we can build confidence within an audience that they know how to interpret these sorts of representation.
32
As the audience becomes comfortable reading the charts and making sense of data, when there is something new or surprising in the data, the surprise manifests itself in the reading of the data or chart. For journalists working with data, developing a sense of familiarity with how to interpret and read data when it is just confirming what you already know helps to refine your senses for spotting things that are odd, noteworthy, or newsworthy. Taking a little bit of time each day to:
- read charts as if they were stories;
- look behind the data to find original sources, such as polls or data-containing news releases, and then compare the original release with the way it is reported, paying particular attention to the points that are highlighted, and how the data is contextualised;
will help you develop some of the skills you will need if you want to be able to identify, develop and treat some of the stories that your specialist source that is data can provide you with, if only you ask…
33
A scatterplot is another very powerful sort of chart – we can plot two sorts of value against each other to see if there are any groups, or trends. Some scatterplot tools allow you to size or colour nodes according to further dimensions. Colouring nodes by group (if sensible groups exist) can also help you see whether particular groups are clustered or group together in particular areas of the chart.
Maps can be used to pull out different sorts of relationships – for example, plotting markers in the centre of each MP’s ward coloured by the total value of travel expenses claim in a particular area, we can easily see whether or not an MP is claiming an amount significantly different to MPs in neighbouring wards. In this case – travel expenses – we might expect (at first glance at least) a homophilitic effect – folk a similar distance away from Westminster should presumably make similar sorts of travel claim? At second glance, we might then start to refine our questioning – does ward size (in terms of geographical area) or rurality have an effect? Does an MP travel to and from home more than neighbours (or perhaps claim more in terms of accommodation in London)?
35
Sometimes we need to provide quite a lot of explanation when it comes to making sense of even a simple data visualisation – “what am I supposed to be looking at?”
36
The other way of using data is to tell stories. But what does that even mean…?
37
38
In passing, it’s worth mentioning that one thing statistics does is help provide context. Is this number a big number in the greater scheme of things? Is this thing likely to happen by chance or is there a meaningful causal relationship between this thing and another thing? The chart in the corner is a reminder about how surprising probabilities can be. The chart shows the probability (y-axis) that two people share a birthday (the number of people is given on the x-axis). The chart shows that if there are 23 or more people in a room, there is more than a 50/50 chance that two of them will share a birthday (that is, share the same birth day and month, though not necessarily same birth year). How many people are in the room? If it’s more than 23 – I bet that at least two people share a birthday (at least in terms of day and month).
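The curve on that chart can be recomputed with a short Python sketch. It assumes 365 equally likely birthdays and ignores leap years, which is the usual simplification; the function name is made up for this example. The trick is to work out the chance that everyone’s birthday is different, and subtract that from 1.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday.

    Assumes every day of the year is equally likely and ignores leap years.
    """
    if people > days:
        return 1.0  # more people than days: a shared birthday is certain
    p_all_different = 1.0
    for k in range(people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different


for n in (10, 22, 23, 40):
    print(n, round(shared_birthday_probability(n), 3))
```

Under these assumptions the probability first passes one half at 23 people, which is the crossover the chart illustrates.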
39
40
42
43
44
45
46
47
48
49
50
51
52
53
54
55