Analysing the digital traces of Social Media users

Muhammad Adnan, Guy Lansley, Paul Longley

Consumer Data Research Centre, Department of Geography, University College London

Web: www.uncertaintyofidentity.com ; www.cdrc.ac.uk

Twitter: @gisandtech

Introduction

• Recent years have witnessed rapid growth in the use of online services

• Online shopping, bank transactions, social networking services

• Issues related to cyber-crime, identity fraud, and hacking

• ‘Uncertainty of Identity’ project: Combining real and virtual world datasets to better understand the identity of individuals

• Real world (Census, Demographic Classifications)

• Virtual world (Email addresses, Social media accounts)

Introduction

• Geodemographics

• Census data represent the night time geography

• Social media datasets can be used to provide day and travel time geographies

• Spatial and temporal analysis of social media users

• Activity pattern analysis

• Tweet content analysis

• Develop tools for Identity analysis

• E-mail addresses

• Social media accounts

Outline

• Some popular social media services

• Twitter

• Introduction

• Case Study 1: Social Media Geodemographics

• Case Study 2: Activity pattern analysis

• Temporal analysis of Twitter activity around different world cities

• Case Study 3: Twitter Geographic Profiler

• An Uncertainty of Identity tool

Some popular social media services

• Facebook

• 2 billion total users

• 1.28 billion active users

• Google Plus

• 1.6 billion total users

• 540 million active users

• Twitter

• More than 1 billion total users

• 255 million active users

(1) Mediabistro. 2014. Social Media Stats 2014. Retrieved 17th November, 2014 from http://www.mediabistro.com/alltwitter/social-media-statistics-2014_b57746

Twitter (www.twitter.com)

• Online social networking and micro-blogging web service

• Users can send messages of 140 characters or less

• Approx. 500 million tweets daily

• 78% of Twitter’s active users are on mobile

• 44% of users have never sent a tweet (inactive users)

• Twitter API: for downloading live tweet data

Data available through the Twitter API

• User Creation Date

• Followers

• Friends

• User ID

• Language

• Location

• Name

• Screen Name

• Time Zone

• Geo Enabled

• Latitude

• Longitude

• Tweet date and time

• Tweet text

• A database of 1.4 billion social media messages

• September, 2012 – February, 2014

• Geo-tagged tweets

• Latitude / Longitude

Case Study 1: Social Media Geodemographics

Social Media Geodemographics

• Geodemographics

• “Analysis of people by where they live” (2)

• Night time characteristics of the population

• Social Media Geodemographics

• Moving beyond the night time geography

• Who: Ethnicity, Gender, and Age of social media users

• When: What time of day conversations happen

• Where: Where social media conversations happen

(2) Sleight, P. (2004). Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business.

Twitter data for the case study

• Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)

• Sent by 385,050 unique users

• 155,249 users sent 5 or more tweets (7.6 million tweets)

Flows of people and information

• Entropy is a measure of uncertainty in a random variable

• Shannon Entropy

• 7.6 million tweets were aggregated to 4,765 LSOAs

• Entropy was calculated

• High values indicate high flows of people and information

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
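As a rough illustration of the entropy calculation, the sketch below computes Shannon entropy for each area. It assumes, hypothetically, that the random variable is the share of an area's tweets contributed by each distinct user; the slides do not state which distribution was used. The function names and the (area, user) record layout are illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(counts, base=2):
    """Shannon entropy of a list of non-negative counts."""
    total = sum(counts)
    probabilities = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in probabilities)


def entropy_by_area(records):
    """records: iterable of (area_code, user_id) pairs, one per tweet.

    Assumption: entropy is taken over the per-user tweet shares in each area.
    """
    per_area = {}
    for area, user in records:
        per_area.setdefault(area, Counter())[user] += 1
    return {area: shannon_entropy(list(users.values()))
            for area, users in per_area.items()}


# Hypothetical toy data: area "A" is shared evenly by two users,
# area "B" is dominated by a single user.
records = [("A", "u1"), ("A", "u2"), ("B", "u1"), ("B", "u1")]
print(entropy_by_area(records))
```

Under this assumption an area visited by many different users scores high, while an area where one account produces all the tweets scores zero.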


[Map panels: Morning (6am – 11.59am); Afternoon (12pm – 5.59pm); Evening (6pm – 11.59pm); Night (12 midnight – 6.59am)]


Variables for creating a geo-temporal classification

1. Residence

• Where Twitter users live

2. Ethnicity

• Probable ethnic origins of Twitter users

3. Age

• Probable age of Twitter users

4. Land Use Category of a Tweet message

• Residential; Non-domestic building; Park etc.

5. Temporal Scales

• Day, Afternoon, Night, Peak travel hours

Residence of Twitter Users

• A 170 m × 170 m grid was used to find the probable residence of users

• A probable residence was found for 75,522 users
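One way the grid step could work is sketched below. The slides do not say how a cell is chosen, so this assumes the probable residence is the 170 m cell from which a user tweets most often, that coordinates are already projected to metres (for example British National Grid), and that users need at least 5 tweets. All names are hypothetical.

```python
from collections import Counter, defaultdict

CELL_SIZE = 170.0  # grid resolution in metres, as stated on the slide


def grid_cell(easting, northing, cell=CELL_SIZE):
    """Index of the grid cell containing a projected coordinate."""
    return (int(easting // cell), int(northing // cell))


def probable_residence(tweets, min_tweets=5):
    """tweets: iterable of (user_id, easting, northing).

    Assumption: the modal cell is taken as the probable residence.
    Users with fewer than min_tweets tweets are skipped.
    """
    cells = defaultdict(Counter)
    for user, easting, northing in tweets:
        cells[user][grid_cell(easting, northing)] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in cells.items()
            if sum(counts.values()) >= min_tweets}
```

A stricter variant might count only night-time tweets, which would tie the result more closely to the night-time geography discussed earlier.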

Extracting demographic attributes of Twitter users by using their forenames and surnames

A name is a statement of the bearer’s cultural, ethnic, and linguistic identity (3)

(3) Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PLoS ONE (Public Library of Science) 6 (9) e22943.

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

• Approx. 68% of the accounts have real names

Fake Names

Castor 5.

WHAT IS LOVE?

MysticMind

KIRILL_aka_KID

Vanessa

Justin Bieber Home

Real Names

Kevin Hodge

Andre Alves

Jose de Franco

Carolina Thomas, Dr.

Prof. Martha Del Val

Fabíola Sanchez Fernandes

Onomap: Names to Ethnicity classification

• Onomap was created by clustering names of 1 billion individuals around the world

• Applied ONOMAP (www.onomap.org) on forename – surname pairs

Kevin Hodge (English)

Pablo Mateos (Spanish)

…


Top 10 Ethnic Groups of Twitter Users

• A total of 67 ethnic groups were identified

• Monica dataset provided by CACI Ltd, UK

• Supplemented with UK birth certificate records

Age estimation from ‘forenames’

Age distribution of Twitter users

Twitter Users vs. 2011 Census (Greater London)

(4) Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)

Land-use Categories

• Every tweet message was assigned a land-use category


1. Residence
• V1: Tweet made near probable London residence
• V2: Tweeter lives ‘outside the UK’
• V3: Tweeter lives in the rest of the UK outside London

2. Total Number of Tweets
• V4: Total number of tweets made by the user

3. Ethnicity
• V5: West European
• V6: East European
• V7: Greek or Turkish
• V8: South East Asian
• V9: Other Asian
• V10: African & Caribbean
• V11: Jewish
• V12: Chinese
• V13: Other minority

4. Age
• V14: <=20
• V15: 21 - 30
• V16: 31 - 40
• V17: 41 - 50
• V18: 50+

5. Tweets outside the UK
• V19: In West Europe (not including UK)
• V20: In East Europe
• V21: In North America
• V22: In Central or South America
• V23: In Australasia
• V24: In Africa
• V25: In Middle East
• V26: In Asia
• V27: In Paris

6. Number of countries visited
• V28: Number of countries tweeter has visited

7. London Land Use Category
• V29: Residential location
• V30: Non-domestic buildings
• V31: Transport links and locations
• V32: Green-spaces
• V33: All other land uses

8. 2011 London Output Area Classification
• V34: Intermediate Lifestyles
• V35: High Density and High Rise Flats
• V36: Settled Asians
• V37: Urban Elites
• V38: City Vibe
• V39: London Life-Cycle
• V40: Multi-Ethnic Suburbs
• V41: Ageing-City Fringe

9. Temporal Scales
• V42: Morning Peak Hours
• V43: Week Day
• V44: Afternoon
• V45: Week Night
• V46: Weekend

• Segmentations were created by using K-means clustering algorithm

• K-means tries to find cluster centroids by minimising the within-cluster sum of squares

• Seven clusters

• Group A: London Residents

• Group B: Commuting Professionals

• Group C: Student Lifestyle

• Group D: The Daily Grind

• Group E: Spectators

• Group F: Visitors

• Group G: Workplace and tourist activity

Computing the geo-temporal classifications

$$V = \sum_{x=1}^{n} \sum_{y=1}^{n} (z_x - \mu_y)^2$$
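The clustering step can be illustrated with a minimal k-means (Lloyd's algorithm) in plain Python. This is a generic sketch and not the authors' implementation: it assumes each user is a tuple of variables already scaled to a common range, picks starting centroids at random, and reports the within-cluster sum of squares that k-means tries to reduce. The function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random distinct starting points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean_point(members)
    wcss = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, wcss
```

Because the result depends on the starting centroids, a run like this is normally repeated with several seeds and the solution with the smallest sum of squares is kept.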

Group A: London Residents

• Tweets made near primary residential locations

• Tweets made on weeknights or weekends

[Chart: values of variables V1 to V46 for Group A, on a 0 to 1 scale]

Group B: Commuting Professionals

• Tweets made from

• Transport locations

• ‘Urban Elites’ LOAC classification

• Tweets made by individuals of intermediate age (21-30)

[Chart: values of variables V1 to V46 for Group B, on a 0 to 1 scale]

Group F: Visitors

• Tweeters live outside London

• Tweets originated from residential land uses

• Mixed age groups

[Chart: values of variables V1 to V46 for Group F, on a 0 to 1 scale]

Group G: Workplace and tourist activity

• Tweets sent from non-domestic buildings

• Full range of Twitter age cohorts

• Tweets originate from a mix of residents and international visitors

[Chart: values of variables V1 to V46 for Group G, on a 0 to 1 scale]

Social Media Geodemographics

• Geo-temporal demographic classifications

• Census (night time geography)

• Social media data (day and travel time geography)

• Issues of representation

• An insight into the residential and travel geographies of individuals

• An insight into the spatial activity patterns of different kinds of social media users

Case Study 2: Analysis of Twitter activity around world cities

(5) Muhammad Adnan, Alistair Leak, Paul Longley. “A geocomputational analysis of Twitter activity around different world cities”. Geospatial Information Science.

Activity Pattern Analysis

• Comparison of the use of Twitter between different cities

• Weekly patterns of activity

• Seasonal shifts

• Data: 19th September, 2012 – 25th September, 2013

• Point-in-polygon operations were performed to extract data for different cities around the world

• Approx. 170 million tweets were sent from the top 30 cities
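A point-in-polygon test of the kind mentioned above can be sketched with the classic ray-casting method. The slides do not say which implementation or city boundaries were used, so the polygon here is a hypothetical list of (longitude, latitude) vertices and the function names are illustrative.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def tweets_in_city(tweets, polygon):
    """Keep the (lon, lat, text) tweets that fall inside the city polygon."""
    return [t for t in tweets if point_in_polygon(t[0], t[1], polygon)]
```

For hundreds of millions of tweets, a bounding-box check before the polygon test, or a spatial index, would avoid most of the per-edge work.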

Top 30 cities on Twitter

[Chart: number of tweets (millions) for each of the top 30 cities; axis from 0 to 40]

• Approx. 170 million tweets were sent from the following 30 cities.

Time zone issue

• By default, the Twitter API returns tweet timestamps in GMT (UTC)

• Timestamps were converted from GMT to each city’s local time zone

Date & Time (GMT)            Date & Time (UTC +1)
Wed Dec 05 00:04:23 2012     Wed Dec 05 01:04:23 2012
Wed Dec 05 00:06:29 2012     Wed Dec 05 01:06:29 2012
Wed Dec 05 00:07:35 2012     Wed Dec 05 01:07:35 2012
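The conversion in the table can be reproduced with Python's standard library. This sketch assumes timestamps in the format shown above and interprets them as GMT; the fixed UTC+1 offset and the function name are chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout as it appears in the table above (assumed).
TWEET_FORMAT = "%a %b %d %H:%M:%S %Y"


def to_local(timestamp_gmt, tz):
    """Parse a GMT timestamp string and convert it to the given time zone."""
    gmt_time = datetime.strptime(timestamp_gmt, TWEET_FORMAT)
    return gmt_time.replace(tzinfo=timezone.utc).astimezone(tz)


utc_plus_1 = timezone(timedelta(hours=1))
local = to_local("Wed Dec 05 00:04:23 2012", utc_plus_1)
print(local.strftime(TWEET_FORMAT))  # Wed Dec 05 01:04:23 2012
```

A fixed offset ignores daylight saving time. For a year-long dataset, passing a named zone such as `zoneinfo.ZoneInfo("Europe/Paris")` as `tz` would apply the seasonal changes automatically.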

Temporal Analysis of Twitter Cities

[Figure panels: Jakarta; Istanbul; Paris; Sao Paulo, Brazil; New York City; London]

Temporal Analysis of Twitter Cities

[Figure panels: Riyadh; Tokyo; Madrid; Buenos Aires, Argentina]

Temporal Analysis of Twitter Cities

[Figure panels: London; London and Paris; Jakarta; Jakarta and Riyadh; New York City; New York City and Tokyo]

Case Study 3: Twitter Geographic Profiler (a part of Uncertainty of Identity Toolkit)

Introduction

• The Uncertainty of Identity Toolkit is a framework for the identification and profiling of individuals from their

• Social media accounts

• E-mail addresses

• Twitter Geographic Profiler

• Maps ethno-cultural communities of a person’s friends

• Extracting identities of Twitter users

• Mapping them to probable ethnic origins

• Could have potential applications in targeted marketing

Twitter Geographic Profiler

• Given an individual’s Twitter Username or ID

• Extracts the information of individual’s friends

• Extracts the forename-surname pairs of the friends

• Maps forename-surname pairs to Onomap

• Builds an ethno-cultural profile of the person’s friends

• Maps the geographic distribution

Data available through the Twitter API

• User ID

• User Creation Date

• Followers

• Friends

• Language

• Location

• Name

• Screen Name or User Name

• Time Zone

• Geo Enabled

• Latitude

• Longitude

• Tweet date and time

• Tweet text

Twitter: getting the ids and usernames

• Given a person’s Twitter username, we use the Twitter API to get the list of friends’ ids

– A maximum of 15 requests every 15 minutes is allowed

– Each query can return up to 5,000 ids

– Generally enough to download all the ids

• Using the ids, we fetch the name associated with each id

– Limited to 180 requests every 15 minutes

– Returns a single string from which we need to extract the forename and surname tokens

– Not necessarily a valid forename + surname!

• E.g., “University of Birmingham”, “John1965”, “What is Love”, “Mystic_mind”

Twitter: getting forename-surname pairs

• The name field was divided into tokens

• Forenames and surnames were detected by matching the string tokens against a database of forename-surname pairs from 26 countries

• Users were discarded

– where the tokens could not be matched against a valid forename and surname
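The matching step above might look like the following sketch. The real database of names from 26 countries is not available here, so two small hypothetical sets stand in for it, and the rule that the surname must appear after the forename is an assumption made for illustration.

```python
import re

# Hypothetical stand-ins for the forename and surname database.
FORENAMES = {"kevin", "andre", "jose", "carolina", "martha", "fabíola"}
SURNAMES = {"hodge", "alves", "franco", "thomas", "val", "sanchez", "fernandes"}


def match_name(name_field, forenames=FORENAMES, surnames=SURNAMES):
    """Return (forename, surname) or None if the tokens cannot be matched."""
    # Split on anything that is not a letter, so "KIRILL_aka_KID" and
    # "Prof. Martha Del Val" are both broken into plain word tokens.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", name_field)]
    first = next((i for i, t in enumerate(tokens) if t in forenames), None)
    if first is None:
        return None
    # Assumption: the surname is the last known surname after the forename.
    for i in range(len(tokens) - 1, first, -1):
        if tokens[i] in surnames:
            return tokens[first], tokens[i]
    return None


for example in ["Kevin Hodge", "Prof. Martha Del Val", "KIRILL_aka_KID"]:
    print(example, "->", match_name(example))
```

Accounts whose name field returns `None` would be the ones discarded in the last bullet.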

Onomap: from names to ethnicity

• ONOMAP (www.onomap.org) was applied to forename – surname pairs

Kevin Hodge (English)

Pablo Mateos (Spanish)

…


Friends’ Ethnicity Histogram

Once the entire list of friends’ forename + surname pairs has been parsed, we can easily estimate the distribution over the set of possible ethno-cultural groups of the Twitter user's friends
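Estimating that distribution amounts to counting labels and normalising. The sketch below assumes each friend has already been given a group label (or `None` when no classification was returned); the function name and sample labels are hypothetical.

```python
from collections import Counter


def ethnicity_distribution(group_labels):
    """Relative frequency of each ethno-cultural group among the friends.

    Friends without a classification (None) are left out of the total.
    """
    counts = Counter(label for label in group_labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.most_common()}


labels = ["English", "English", "Spanish", None]
print(ethnicity_distribution(labels))
```

The resulting dictionary is what a histogram of the friends' ethno-cultural groups would plot.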

surname. A forename or surname is a statement of the bearer’s cultural, ethnic, and linguistic identity [4]. The tool uses an efficient approach to identify the presence of surnames as substrings in an e-mail address. Then, it predicts the probable ethnicity, and maps the geographical distribution of the surname in the UK. For this purpose, this tool uses data from three different data sources, namely Onomap, Worldnames [6], and the 2007 Register of Electors for the United Kingdom.

Worldnames is an online service which maps the geographical distribution of a searched surname around 26 different countries of the world. It was created by using the names data extracted from the telephone directories and electoral registers from different countries.

This paper is organized as follows. Section 2 of the paper describes the Twitter geographic profiler tool and discusses the use and privacy implications of the tool. Section 3 describes the E-mail Address Profiler tool and the underlying suffix tree construction algorithm. Finally, section 4 concludes the paper.

2. TWITTER GEOGRAPHIC PROFILER

This tool builds a map of the ethno-cultural communities of a person's friends. That is, we want to determine the distribution over the set of possible ethno-cultural groups of the friends of a given individual. To this end, we integrate information from two sources, namely Twitter and Onomap. Note that the same ideas can be applied to data collected from other Online Social Networks (OSNs), such as Facebook or Foursquare¹. However, different OSNs capture social interactions around different and sometimes specific themes, e.g., Foursquare’s venues. In this work, we decide to focus on Twitter data because of the general context of the interactions, i.e., they are not restricted to a specific theme or interest, and because, unlike Facebook, information is easily accessible through the Twitter API².

More specifically, given the Twitter username of the person being analysed, we download the list of (surname, forename) pairs of his or her friends. We then map this list of names to a list of ethno-cultural groups, according to the classification of Onomap. We also map the surnames to the most probable countries of origin. With these lists to hand, we estimate respectively the distribution of a user’s friends over the set of possible ethno-cultural groups and over the set of countries. In the following subsections we report the implementation details of the tool and its applications and implications in terms of users' privacy.

Finally, note that the social graph of Twitter is directed, in the sense that the friendship relation is not necessarily reciprocated. As a consequence, there are two lists associated with each user, one for the accounts that the user is following and one for the accounts that follow the user, i.e., his or her followers. In this work, we consider the first as representing the list of a user's friends. Subsection 2.1 describes the system implementation of this tool and Subsection 2.2 discusses the use of the tool and its privacy implications.

¹ Foursquare is a Location-Based Social Network (LBSN). LBSNs are based on the concept of check-in, where a user can register in a certain location and share this information with friends. Moreover, the user can leave recommendations and comments about the visited venues.

² https://dev.twitter.com/docs/api

2.1 System Implementation

Given the Twitter username of an individual, we first probe the list of his/her friends’ ids using the GET friends/ids method. As with most of the methods in the API, the number of requests that can be performed in a certain time interval is bounded. More specifically, we are only allowed to send 15 requests every 15 minutes. Note also that with each query we can only get up to 5000 user ids. However, we find that generally one single request is sufficient to download the complete list of friends. Given the list of ids, we use the GET users/lookup method to fetch the (surname, forename) pair associated with each id. This returns up to 100 user profiles, given a list of ids as input. Note that the request rate of the GET users/lookup method is currently limited to 180 requests every 15 minutes. We should stress that these rate limitations do not prevent us from parsing the complete list of friends of a user, as the distribution of the number of accounts followed by Twitter users has been shown to be approximately a power law distribution [7]. As a consequence, the majority of the users actually follow a limited number of profiles, which are then accessible even with the rate limitation in place.
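The arithmetic behind these limits can be sketched as follows. The code makes no API calls; it only shows how a list of friend ids would be split into batches of 100 for the lookup method and how many 15-minute windows those requests would occupy. The constants come from the text above, while the function names are hypothetical.

```python
import math

LOOKUP_BATCH_SIZE = 100           # profiles returned per users/lookup request
LOOKUP_REQUESTS_PER_WINDOW = 180  # requests allowed every 15 minutes


def batches(ids, size=LOOKUP_BATCH_SIZE):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def windows_needed(n_ids, batch_size=LOOKUP_BATCH_SIZE,
                   per_window=LOOKUP_REQUESTS_PER_WINDOW):
    """Number of 15-minute rate-limit windows needed to look up n_ids."""
    n_requests = math.ceil(n_ids / batch_size)
    return math.ceil(n_requests / per_window)


friend_ids = list(range(5000))  # one full friends/ids response
print(len(list(batches(friend_ids))), "lookup requests,",
      windows_needed(len(friend_ids)), "window(s)")
```

A full response of 5000 ids needs 50 lookup requests, well inside a single window, which is consistent with the observation that the limits rarely get in the way.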

With the list of (surname, forename) pairs to hand, we query Onomap to get the ethno-cultural classification associated with each (surname, forename) pair, and call the SearchSurnameTopCountries method to get the list of the countries where an instance of a given surname was observed. Each element of the latter list is attributed with the relative frequency of the corresponding surname in that country, so as to take into account differences in population counts. Given this ranking, we then classify each surname as originating from the corresponding highest ranked country. As for SearchEthnicity, the method returns the most probable classification for both the surname and the forename, as well as the overall classification for the pair. Finally, in order to query the Onomap API in an asynchronous way, we make use of the grequests³ package.

³ https://github.com/kennethreitz/grequests

Figure 1: Screenshot of the Twitter Geographic Profiler. The bottom part of the screen shows the histogram of the Twitter user's friends’ ethno-cultural groups.

Friends’ Geographic Origins

Map showing the geographic origin of the Twitter user's friends’ surnames as assigned by our tool. Below the map the user is shown a list of the top 10 countries with the respective frequency.

However, we decide to limit the number of simultaneous asynchronous requests to 50 in order to avoid congesting Onomap's server.
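The idea of capping simultaneous requests can be illustrated with the standard library alone. This sketch swaps in `asyncio` and a semaphore in place of the asynchronous HTTP package named in the text, and replaces the real Onomap call with a short sleep, so every name here is a hypothetical stand-in.

```python
import asyncio

MAX_CONCURRENT = 50  # cap on simultaneous requests, as in the text


async def classify(pair, semaphore, stats):
    """Stand-in for one classification request."""
    async with semaphore:  # waits while 50 requests are already in flight
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.001)  # placeholder for the real HTTP call
        stats["active"] -= 1
        return pair, "UNCLASSIFIED"  # placeholder result


async def classify_all(pairs, limit=MAX_CONCURRENT):
    """Run all requests concurrently, never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(classify(pair, semaphore, stats) for pair in pairs))
    return results, stats["peak"]


pairs = [("Hodge", "Kevin")] * 200
results, peak = asyncio.run(classify_all(pairs))
print(len(results), "results; peak concurrency:", peak)
```

The `peak` counter is only there to make the cap observable; a real client would drop it and return the parsed responses.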

Once the entire list of friends’ (surname, forename) pairs has been parsed, we can easily estimate the distribution over the set of possible ethno-cultural groups and over the set of countries of the Twitter user's friends. Figs. 1 and 2 respectively show the histogram of the ethno-cultural groups and a map visualising the countries of origin of the friends of a sample user. In the map a darker (brighter) color denotes a higher (lower) probability of having a friend originating from that country.

Note that, when we extract the (surname, forename) pairs using the GET users/lookup method, a filtering system needs to be put in place to discard invalid strings. In fact, while in other OSNs such as Facebook the user is forced to enter the surname and the forename in two separate fields, in Twitter the users are required to enter their name (or any alternative identifying string) in a single Username field. As a consequence, we need to parse the username string to separate it into its constituent tokens. Then, we need to apply some heuristic to detect the (surname, forename) pair among the extracted tokens. In this work we mark as invalid any string that is composed of a single token. If this is the case, we skip the profile of the corresponding friend.

If the string contains two or more tokens, we take the first one to be the forename and the last one to be the surname. Moreover, when a (surname, forename) pair is sent to Onomap, an error message will indicate if the system is unable to parse the surname or the forename, or both. In the latter case we stop the computation and we proceed to parse the next (surname, forename) pair.
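The heuristic just described is short enough to sketch directly. The function name is hypothetical and splitting on whitespace is an assumption, since the text does not say how tokens are delimited.

```python
def extract_pair(name_field):
    """Return (surname, forename), or None for single-token strings."""
    tokens = name_field.split()
    if len(tokens) < 2:
        return None  # invalid: the friend's profile is skipped
    return tokens[-1], tokens[0]  # last token = surname, first = forename


for example in ["Kevin Hodge", "Jose de Franco", "MysticMind", "WHAT IS LOVE?"]:
    print(repr(example), "->", extract_pair(example))
```

The last example shows why a second filter is needed: the heuristic happily returns a pair for a string that is not a name, and it is the classification service's error message that weeds it out.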

2.2 Discussion & Privacy Implications

Although relatively simple, the above tool can be used in a number of applications that leverage the ethno-cultural information of a person's friends. To start with, note that in addition to what is described above we can query Onomap to classify the (surname, forename) pair of the Twitter user whose friends list is being analysed. Hence, given a large enough sample of users, we can estimate the average friendship distribution of a given ethno-cultural group.

This in turn can be used to measure the multiculturality of a given community. For example, one may compute the Shannon entropy of the ethno-cultural distribution of a community (or an individual) to get a readily interpretable measure of how open the community (the individual) is to other groups, in terms of bonding. Intuitively, the more peaked the distribution is, the lower the Shannon entropy and the less prone a community is to bond with a large number of groups. Similarly, by computing the Jensen-Shannon divergence between the average distributions [8] and applying multidimensional scaling [9] to the resulting distance matrix one can embed these communities in the Euclidean space for the purpose of visualising and grouping similar ethno-cultural groups.
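Both measures mentioned here follow from the same entropy function. The sketch below assumes each distribution is a dictionary mapping a group label to its probability and uses base-2 logarithms, so the Jensen-Shannon divergence lies between 0 and 1; the names and sample values are hypothetical.

```python
import math


def entropy(dist, base=2):
    """Shannon entropy of a {label: probability} distribution."""
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)


def js_divergence(p, q, base=2):
    """Jensen-Shannon divergence: H(M) - (H(P) + H(Q)) / 2, M = (P + Q) / 2."""
    labels = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in labels}
    return entropy(mixture, base) - 0.5 * (entropy(p, base) + entropy(q, base))


peaked = {"English": 0.9, "Spanish": 0.1}
mixed = {"English": 0.25, "Spanish": 0.25, "Chinese": 0.25, "Greek": 0.25}
print(entropy(peaked), entropy(mixed))  # the peaked profile scores lower
print(js_divergence(peaked, mixed))
```

Computing `js_divergence` for every pair of groups gives the matrix that multidimensional scaling would take as input.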

However, note that we expect the level of exposure to different ethno-cultural groups to vary across the geographical space. That is, on average a resident of London is likely to have friendships spanning a wider spectrum of communities than a resident of Swansea⁴, due to the substantial mixture of ethnic groups living in London. As a consequence, the above analysis should be performed within a limited geographical space. Luckily, it has been shown that roughly 50% of Twitter users have a location assigned in their profile, and the vast majority of these locations are at town level [10], thus such an analysis would indeed be feasible.

Given the friendships distribution of a given ethno-cultural group, it is also possible to use outlier detection techniques [11] to identify individuals or groups of individuals that stand out in terms of the ethno-cultural groups they bond with. Potentially, one can also infer the ethnicity of an individual whose name is unknown but for which a list of friend names is available.

To understand the extent of the privacy implications of our tool, we should stress that the default behaviour of Twitter is to set the profile of a user as public. Although the setting can be changed to private, thus making it impossible for our tool to operate on the profile, when testing our tool we did not encounter any private profile. Consequently, we can safely assume that we are able to download the list of names of a user’s friends and perform our ethno-cultural profiling.

As for the limitations of the current implementation of our tool, we observed that the Twitter data contains a large amount of noise, which can considerably affect the results of the computation. The source of this noise is twofold. Firstly, the need to extract the surname and forename tokens from a single string introduces unwanted uncertainty. In this sense, more sophisticated natural language processing techniques should be investigated to extract the correct (surname, forename) pair. Secondly, we note that a considerable number of accounts followed by the Twitter users is actually represented by news feeds, e.g., BBC, CNN, etc., celebrities, notable academics and

⁴ http://www.swansea.gov.uk/index.cfm?articleid=44946

Figure 2: Map showing the geographical origin of the Twitter user's friends’ surnames as assigned by our tool. Below the map the user is shown a list of the top 10 countries with the respective frequency.

Twitter Geographic Profiler

• Potential applications include

– Measure the level of segregation/integration of a given individual (community) as the Shannon entropy of the (average) friends’ ethnicity histogram

– Outlier detection: identify uncommon behaviors, e.g., individuals that stand out in terms of the ethno-cultural groups they bond with

• Limitations

– Twitter data is very noisy

– Request limits

• Social media datasets can be used to create Geo-temporal demographic classifications

• Day and travel time geographies

• Activity patterns

• Temporal analysis can identify some interesting patterns of a geographical area

• Weekly patterns of activity

• Seasonal shifts

• Twitter Geographic Profiler: Identification and profiling of ethno-cultural characteristics of individuals

• From their Twitter accounts

Conclusion

• Study of privacy implications on social media services

• Facebook, Foursquare

• Future work: Consumer Data Research Centre

• Use of social media for retail sector

• Spatial and temporal catchments of the social media users

Conclusion

• E.g. Day-time catchment

1. Identify the unique IDs of users frequently transmitting from a particular location within a given time or date range

2. Request their other activity through Twitter’s API, filtered by time/date

3. Aggregate
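The three steps can be sketched on data that has already been downloaded. Instead of requesting further activity from the API, this hypothetical version reuses one list of (user, longitude, latitude, hour, area) tuples; the bounding box, the working hours and the threshold for "frequently" are all assumptions.

```python
from collections import Counter


def in_bbox(lon, lat, bbox):
    """bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def daytime_catchment(tweets, bbox, hours=range(9, 18), min_visits=3):
    """tweets: iterable of (user_id, lon, lat, hour, area_code)."""
    # Step 1: users who tweet often from the location during work hours.
    visits = Counter(user for user, lon, lat, hour, _ in tweets
                     if hour in hours and in_bbox(lon, lat, bbox))
    regulars = {user for user, n in visits.items() if n >= min_visits}
    # Steps 2 and 3: their other work-hour tweets, aggregated by area.
    return Counter(area for user, lon, lat, hour, area in tweets
                   if user in regulars and hour in hours
                   and not in_bbox(lon, lat, bbox))
```

The returned counts per area are what a catchment map would shade; changing `hours` gives the catchment for a different time window.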

Time catchments

The Twitter work-day time catchment of Bishopsgate

Activity at Bishopsgate in 2013

[Figure panels: Waterloo; St Pancras; Victoria; Paddington; London Bridge; Liverpool Street; Kings Cross; Euston]

Natural History Museum

Residential catchment of Twitter users

• First establish which users have tweeted from inside the building

• Create a customer catchment by identifying all of these users’ tweets sent from domestic land uses

• E.g. ASDA in Clapham Junction

The Twitter residential catchment of ASDA Supermarket at Clapham Junction

Any Questions?

Thanks for Listening

Date post:	06-Jul-2015