Open Data: Analysis and Visualisation

Post on 12-Jan-2015


description

This presentation gives an overview of open data. It presents a number of case studies on the spatio-temporal analysis and visualisation of social media data (Twitter), and explains how to create a heat map visualisation in R.

transcript

Muhammad Adnan

Department of Geography, University College London

Web: http://www.uncertaintyofidentity.com

Twitter: @gisandtech

Open Data: Analysis and Visualisation

Dr. Muhammad Adnan
• Research Associate
– Working on an EPSRC funded project “Uncertainty of Identity”
– http://www.uncertaintyofidentity.com

Research Interests
• Data Mining
• Social Media Analysis
• Data Visualisation

Outline

• Open Data

• Crowd-Sourced Data (Social Media)

• Analysis and Visualisation Challenges

• Twitter Case Study
– Spatial Analysis
– Temporal Analysis

• R
– A brief introduction
– How to create heat maps

Open data

Data that is:
• Open and free to the public
• Complete
• Accessible
• Timely
• Machine processable
• Non-discriminatory

Dataset examples

• National budgets
• Car registries
• National roads
• Water heights
• Schools
• Weather
• Public transport
• Council tax bands
• And many more

Census Profiler
• http://www.censusprofiler.org/
• Users can visualise 2001 Census data

Education Profiler
• http://www.educationprofiler.org/
• Users can visualise education datasets

Open Data Profiler
• http://www.opendataprofiler.com/
• Users can visualise 60 different 2011 Census datasets

Crowd Sourced datasets

• Twitter
– Public streaming API can be used to download live tweets

• Foursquare
– Has an API which can be used to access the Foursquare data

• Facebook
– Facebook applications can access user information

• Flickr
• Wikipedia
• YouTube

How big are crowd sourced datasets?

• Facebook
– Number of active users: 850 Million
– Average daily uploaded photos: 360 Million
– Total data size: 30+ Petabytes

• Twitter
– Number of active users: 200 Million
– Daily tweets (posts): 350 Million

• Foursquare
– Number of active users: 15 Million
– Total check-ins: 1.5 Billion

What are the issues with these datasets?

• How representative are social media data sets of the Census or Electoral Roll data?

• Who: Ethnicity, gender, and age of social media users

• Where: Where social media conversations are happening and who is leading them
– Intelligence about where people are located and what they are doing

• When: What time of day conversations happen

Twitter (www.twitter.com)

• Online social-networking and microblogging service

• Launched in 2006

• Users can send messages of 140 characters or less

• Approximately 200 million active users

• 350 million tweets daily

• In 2012, UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets

Basic Analysis of the Twitter data

Data available through the Twitter API

• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text

Users can download a 1% sample of the live tweets through the API

Created with approx. 100 million tweets

4 million geo-tagged tweets downloaded during August and December, 2012


Hourly and Daily Twitter Activity in London

Hourly Twitter Activity in London

[Figure: seven panels, one per day from Monday to Sunday; x-axis: Hour (1 to 24); y-axis scale: 0 to 12,000]

Daily Twitter Activity in London

Analysis of User Names on Twitter

• A name is a statement of the person’s ethnic, linguistic, and cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo Mateos is a Spanish (Hispanic) name.

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

Real Names

Kevin Hodge

Andre Alves

Jose de Franco

Carolina Thomas, Dr.

Prof. Martha Del Val

Fabíola Sanchez Fernandes

Fake Names

JustinBieber_Home.

WHAT IS LOVE?

MysticMind

KIRILL_aka_KID

Vanessa

Petuna

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

Real Names

Kevin Hodge -> F: ‘Kevin’ ; S: ‘Hodge’

Andre Alves -> F: ‘Andre’ ; S: ‘Alves’

Jose De Franco -> F: ‘Jose’ ; S: ‘De Franco’

Carolina Thomas, Dr. -> F: ‘Carolina’ ; S: ‘Thomas’

Prof. Martha Del Val -> F: ‘Martha’ ; S: ‘Del Val’

Fabíola Sanchez Fernandes -> F: ‘Fabíola’ ; S: ‘Fernandes’
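The forename/surname split listed above can be sketched in R. This is only an illustration of the pattern in those examples, not the method used in the study: the function name `split_name` and the short lists of titles and surname particles are assumptions made for this sketch. A purely rule-based split like this cannot tell real names from fake ones, so a production pipeline would also need a name dictionary.

```r
# Split a display name into forename (F) and surname (S).
# The title and particle lists are illustrative assumptions, not from the slides.
split_name <- function(name) {
  titles    <- c("dr", "prof", "mr", "mrs", "ms")
  particles <- c("de", "del", "da", "van", "von")

  # Replace full stops and commas with spaces, then split on whitespace
  tokens <- strsplit(gsub("[.,]", " ", name), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Drop titles such as "Dr." or "Prof."
  tokens <- tokens[!(tolower(tokens) %in% titles)]

  # A single token cannot be split into forename and surname
  if (length(tokens) < 2) {
    return(c(forename = NA_character_, surname = NA_character_))
  }

  n <- length(tokens)
  # Keep a particle such as "De" or "Del" attached to the surname
  start <- if (n > 2 && tolower(tokens[n - 1]) %in% particles) n - 1 else n
  c(forename = tokens[1], surname = paste(tokens[start:n], collapse = " "))
}

print(split_name("Prof. Martha Del Val"))
print(split_name("Carolina Thomas, Dr."))
print(split_name("Vanessa"))
```

Single-word names such as "Vanessa" return NA for both parts, which makes them easy to filter out before any classification step.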

Where they tweet from:

Surname: JONES

Where they tweet from:

Surname: DEE

Where they tweet from:

Surname: SHAH


Predicting Ethnicity of Twitter Users by using their ‘Names’

Classifying Twitter Data to ethnic origins

• Applied ONOMAP (www.onomap.org) on FORENAME + SURNAME pairs

Kevin Hodge (English)

Pablo Mateos (Spanish)

Top 10 Ethnic Groups of Twitter Users

[Chart labels: English, Italian, Pakistani, Indian, Turkish, Greek, Bangladeshi, Spanish, German, French, Portuguese, Sikh]

Tweeting Activity by different Ethnic Groups

• Onomap groups were aggregated to match the appropriate groups from the Census

             London Total  White British  White Other  Indian  Pakistani  Bangladeshi  Black African  Chinese
Week Night   53611         71.35%         12.12%       2.63%   2.63%      1.82%        1.52%          1.74%
Week Day     80676         73.12%         11.80%       2.41%   2.41%      1.56%        1.25%          1.61%
Weekend      67351         72.86%         12.17%       2.61%   2.61%      1.67%        1.39%          1.73%

Comparison of Ethnic Groups between ‘2011 Census’ and ‘Twitter’

             London Total  White British  White Other  Indian  Pakistani  Bangladeshi  Black African  Chinese
2011 Census                44.89%         12.65%       6.64%   2.74%      2.72%        7.02%          1.52%

Comparison of the distribution of ethnicity with the 2011 Census

2011 Census Twitter

White British (Quintiles)

Gender and Age Analysis of Twitter Users by using their ‘forenames’

Gender Analysis of Twitter Users

[Bar chart: categories Male, Female, Unisex, Not Found; y-axis: 0% to 60%; series: Number of Tweets, Number of Unique Users]

Age estimation from ‘forenames’

[Chart: share of each forename (PAUL, BETTY, GUY, MUHAMMAD) by age group; x-axis: Age group (0-4 to 85+); y-axis: Percent (0% to 45%)]

Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office for National Statistics)

Age-Sex structure of Twitter Users and 2011 Census

Male Female

Tweets by different Land-use Categories

Temporal Activity: Tweets from different Land-use Categories

Ethnic Segregation of Twitter Users

Segregation Analysis

• To find out the level of integration/segregation of different types of Twitter users

• During different hours of the week and weekends

• Information Theory Index

 

Segregation Analysis

• The value of the information theory index is between 0 (low segregation) and 1 (high segregation).

Ethnic Group     H (Domestic buildings and gardens)  H (Week Nights)  H (Week Days)  H (Weekend)
British          0.483                               0.401            0.211          0.315
Irish            0.670                               0.571            0.357          0.475
White Other      0.630                               0.510            0.303          0.420
Pakistani        0.765                               0.679            0.488          0.633
Indian           0.748                               0.673            0.451          0.590
Bangladeshi      0.864                               0.834            0.671          0.784
Black Caribbean  0.831                               0.808            0.548          0.666
Black African    0.764                               0.704            0.492          0.640
Chinese          0.712                               0.608            0.403          0.524
Other            0.710                               0.593            0.374          0.497
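The slides name the information theory index but do not give its formula. The sketch below assumes the standard two-group form (Theil's H), comparing one ethnic group against everyone else across a set of areal units. The function name and the toy counts are hypothetical, and the study may have used a different variant.

```r
# Information theory index H for one group versus everyone else.
#   group: number of users from the group in each areal unit
#   total: number of all users in each areal unit (assumed > 0)
information_theory_index <- function(group, total) {
  # Entropy of a two-way split with proportion p; 0 * log(0) is treated as 0
  entropy <- function(p) {
    term <- function(q) ifelse(q > 0, q * log(1 / q), 0)
    term(p) + term(1 - p)
  }
  E  <- entropy(sum(group) / sum(total))  # entropy of the whole study area
  Ei <- entropy(group / total)            # entropy of each areal unit
  # Population-weighted average deviation of unit entropy from overall entropy
  sum(total * (E - Ei)) / (sum(total) * E)
}

# Toy data: two areal units with ten users each
fully_segregated <- information_theory_index(group = c(10, 0), total = c(10, 10))
evenly_mixed     <- information_theory_index(group = c(5, 5),  total = c(10, 10))
print(c(fully_segregated = fully_segregated, evenly_mixed = evenly_mixed))
```

The two toy cases hit the ends of the 0 to 1 range: a group confined to a single unit gives 1, and a group spread in the same proportion everywhere gives 0. The index is undefined when the group makes up none or all of the population, because the overall entropy is then zero.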

Extending the analysis to other cities

Tweet density map of London

Tweet density map of Paris

Tweet density map of New York City

Top 10 ethnic groups in London

Top 10 ethnic groups in Paris

Top 10 ethnic groups in NYC

[Chart labels: English, Spanish, German, Jewish, Irish, Italian, Portuguese]

Tweeting Activity by different Ethnic Groups (NYC)

[Chart labels: Scottish, Black Caribbean, Chinese, French, German, Turkish, Spanish, Italian, Portuguese]

Tweeting Activity by different Ethnic Groups (Paris)

[Chart labels: English, Polish]

Gender Analysis

Exploring the Languages on Twitter

Data available through the Twitter API

• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text

Twitter Languages (World)

Twitter Languages (Europe)

Twitter Language Maps


Temporal Analysis of the data sets

Temporal Analysis of the Twitter Data

• Data: 12 September, 2012 – 25 September, 2013

• We extracted a total of approx. 800 million tweets over the last year

• A temporal activity analysis of different cities could potentially reveal a lot of information about the residents of the city

• But Twitter data is not clean and has lots of problems!
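As a rough illustration of a temporal activity analysis, the sketch below counts tweets per weekday and hour in R. The data frame, its `created_at` column, and the four sample timestamps are hypothetical; they stand in for the tweet date and time field mentioned earlier.

```r
# Hypothetical sample of tweet timestamps (stored here in UTC)
tweets <- data.frame(
  created_at = as.POSIXct(c("2012-09-17 08:15:00", "2012-09-17 08:45:00",
                            "2012-09-17 21:05:00", "2012-09-18 08:30:00"),
                          tz = "UTC")
)

# POSIXlt exposes the weekday (0 = Sunday) and the hour (0 to 23) directly
lt <- as.POSIXlt(tweets$created_at)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Factors with fixed levels keep empty days and hours in the table
tweets$day  <- factor(day_names[lt$wday + 1], levels = day_names[c(2:7, 1)])
tweets$hour <- factor(lt$hour, levels = 0:23)

# 7 x 24 table of tweet counts: rows are days, columns are hours
activity <- table(tweets$day, tweets$hour)
print(activity["Monday", c("8", "21")])
```

A table of this shape can be converted to a matrix and plotted directly, for example as a heat map.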

Problems with the data

1) Extracting the data for individual cities or places

• Use of bounding boxes to extract the data
• New York City NE: 40.91762, -73.7004 SW: 40.47662, -74.2589

• http://isithackday.com could be used to find the bounding boxes of different cities
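A bounding-box filter can be sketched in a few lines of R. The sketch treats the two New York City coordinate pairs from the slide as the south-west and north-east corners of the box. The helper name `in_bounding_box` and the three sample tweets are made up for illustration.

```r
# TRUE for points inside the box defined by its south-west and north-east corners
in_bounding_box <- function(lat, lon, sw, ne) {
  lat >= sw[["lat"]] & lat <= ne[["lat"]] &
    lon >= sw[["lon"]] & lon <= ne[["lon"]]
}

# New York City corners, taken from the coordinates on the slide
nyc_sw <- c(lat = 40.47662, lon = -74.2589)
nyc_ne <- c(lat = 40.91762, lon = -73.7004)

# Hypothetical geo-tagged tweets: two in New York City, one in London
tweets <- data.frame(
  id  = 1:3,
  lat = c(40.7128, 51.5074, 40.6500),
  lon = c(-74.0060, -0.1278, -73.9500)
)

nyc_tweets <- tweets[in_bounding_box(tweets$lat, tweets$lon, nyc_sw, nyc_ne), ]
print(nyc_tweets)
```

A rectangular box is only an approximation of a city boundary, so it will also pick up tweets from neighbouring areas that fall inside the rectangle.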

Problems with the data

2) Tweet timestamps are in London time, either GMT or BST depending on the date. Converting them to the local time of other cities is very time-consuming.

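• 12 p.m. in London is 4 a.m. in Los Angeles for most of the year, because both cities move their clocks forward in summer.
• In the few weeks each year when Los Angeles is on daylight saving time and London is still on GMT, 12 p.m. in London is 5 a.m. in Los Angeles.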
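In R, most of this conversion can be delegated to the time zone database rather than handled by hand. The sketch below assumes the timestamps are strings in London local time and uses a hypothetical helper, `london_to_local`. Tokyo is used as the target because it does not observe daylight saving time, so any shift in the result comes entirely from London's switch between GMT and BST.

```r
# Convert timestamps recorded in London local time to another city's local time.
# "Europe/London" lets R decide from the date whether GMT or BST applies.
london_to_local <- function(timestamps, target_tz) {
  parsed <- as.POSIXct(timestamps, format = "%Y-%m-%d %H:%M:%S",
                       tz = "Europe/London")
  format(parsed, format = "%Y-%m-%d %H:%M:%S", tz = target_tz, usetz = TRUE)
}

stamps <- c("2013-01-15 12:00:00",   # GMT in force in London
            "2013-07-15 12:00:00")   # BST in force in London

tokyo <- london_to_local(stamps, "Asia/Tokyo")
print(tokyo)
```

The same function works for any name in the time zone database, such as "America/Los_Angeles", and it then also accounts for the target city's own daylight saving rules.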

Temporal Analysis of different cities

• Approx. 170 million tweets were sent from the following 30 cities.

[Bar chart: Number of Tweets (Millions) by city; y-axis: 0 to 40,000,000. Cities in chart order: Jakarta, Istanbul, Paris, Sao Paulo, New York City, London, Los Angeles, Rio de Janeiro, Mexico City, Riyadh, Tokyo, Chicago, Buenos Aires, Madrid, Dallas, Philadelphia, Manchester, Houston, Washington, Toronto, Boston, Seoul (Korea), Dubai, San Francisco, Osaka (Japan), Atlanta, Sydney, Melbourne, Glasgow, Dublin]

Temporal Analysis of different cities

[Charts, one slide each: London; London and Paris; Jakarta; Jakarta and Riyadh; Jakarta and Istanbul]

Introduction to R

What is R?

• The R statistical programming language is a free open source package based on the S language developed by Bell Labs.

• The language is very powerful for writing programs.

• Many statistical functions are already built in.

• Very easy to create maps and different visualizations.

• You will have to write some code to get things done!

• R is available at www.r-project.org

• Supports both 32 and 64 bit Windows PCs, Linux, Unix, and Mac OS operating systems


Getting Started

• The R GUI?


Interacting with R

Math:
> 1 + 1
[1] 2
> 1 + 1 * 7
[1] 8
> (1 + 1) * 7
[1] 14
> sqrt(16)
[1] 4

Variables:
> x <- 1
> x
[1] 1
> y <- 2
> y
[1] 2
> z <- x+y
> z
[1] 3

Importing Data

• How do we get data into R?

• First make sure your data is in an easy to read format such as CSV (Comma Separated Values)

• Use code (with plain straight quotes, not curly ones):
– D <- read.csv("path",sep=",",header=T)
– D <- read.table("path",sep=",",header=T)
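A self-contained way to try the import step is to write a small CSV file first and then read it back. The file contents and column names below are invented for the example.

```r
# Write a small, hypothetical data set to a temporary CSV file
path <- tempfile(fileext = ".csv")
writeLines(c("city,tweets",
             "London,120",
             "Paris,95",
             "Jakarta,310"), path)

# Read it back: the first row supplies the column names
D <- read.csv(path, sep = ",", header = TRUE)

# Inspect the structure of the imported data frame
str(D)
```

`read.csv` already defaults to a comma separator and a header row, so `read.csv(path)` alone would give the same result here. With `read.table` both arguments need to be supplied.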

Working with data.

• Accessing columns.
• D has our data in it… but you can’t see it directly.
• To select a column use D$column.
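The short sketch below shows a few ways to look inside a data frame such as D. The data frame is built inline with made-up values so that the example runs on its own.

```r
# Hypothetical data frame standing in for the imported data
D <- data.frame(city   = c("London", "Paris", "Jakarta"),
                tweets = c(120, 95, 310))

# Show the first rows to see what D contains
print(head(D))

# Select one column with the $ operator
print(D$tweets)

# Columns behave like vectors, so summary functions apply directly
average_tweets <- mean(D$tweets)
print(average_tweets)

# A column can also be used to filter rows
busy <- D[D$tweets > 100, ]
print(busy)
```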

Basic Graphics

• Histogram
– hist(D$wg)
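The slides do not show the contents of the `wg` column, so the sketch below fills it with simulated values purely to have something to plot.

```r
# Simulated values for a hypothetical 'wg' column
set.seed(42)
D <- data.frame(wg = rnorm(200, mean = 70, sd = 10))

# Draw the histogram; the returned object holds the bin breaks and counts
h <- hist(D$wg,
          main = "Distribution of wg",
          xlab = "wg",
          col  = "grey")

# Every observation falls into exactly one bin
print(sum(h$counts))
```

When run as a script rather than in the R GUI, the plot is written to a file by the default graphics device instead of appearing in a window.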

How to create a heat map in R?

• Three steps:
– Read a CSV file
– Choose the colours for the heat map
– Create the heat map

• Step 1: Read a CSV file
read.csv("FILE NAME", sep=",", header=T)

• Assign it to a variable
Input <- read.csv("FILE NAME", sep=",", header=T)
i.e. with ‘<’ (less than) and ‘-’ (dash) symbols.

• Step 2: Choose the colours for the heat map
colours <- c(0)  # create a variable to hold the colours
colours[1] <- "#FDD49E"
colours[2] <- "#FDBB84"
colours[3] <- "#FC8D59"
colours[4] <- "#EF6548"
colours[5] <- "#D7301F"
colours[6] <- "#B30000"
colours[7] <- "#7F0000"

• Step 3: Create the heat map
heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)
– Input1_matrix: the input data
– scale: whether to apply scaling on the data. Options are ‘row’, ‘column’ (abbreviated above to ‘col’), and ‘none’.
– Rowv, Colv: leave them as they are! (NA switches off the row and column dendrograms.)
– col: the colours
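The three steps can be put together in one short script. Two things here are assumptions rather than part of the slides: the input is generated inline instead of being read from a CSV file, and `Input1_matrix` is produced with `data.matrix()`, since the slides do not show how the data frame was turned into a matrix.

```r
# Step 1 (stand-in): hypothetical tweet counts by hour (rows) and day (columns).
# With a real file this would be: Input <- read.csv("FILE NAME", sep = ",", header = TRUE)
set.seed(1)
Input <- data.frame(
  Mon = rpois(24, 50), Tue = rpois(24, 55), Wed = rpois(24, 60),
  Thu = rpois(24, 58), Fri = rpois(24, 70),
  row.names = sprintf("%02d:00", 0:23)
)

# heatmap() needs a numeric matrix, so convert the data frame (assumed step)
Input1_matrix <- data.matrix(Input)

# Step 2: the seven colours from the slides, from light to dark
colours <- c("#FDD49E", "#FDBB84", "#FC8D59", "#EF6548",
             "#D7301F", "#B30000", "#7F0000")

# Step 3: draw the heat map without dendrograms, scaling each column
hm <- heatmap(Input1_matrix, scale = "column",
              Rowv = NA, Colv = NA, col = colours)
```

Building the colour vector in a single `c()` call gives the same seven colours as assigning them one element at a time.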

Any Questions?

• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
– Spatial Analysis
– Temporal Analysis
• R
– A brief introduction
– How to create heat maps