Date post: | 06-Jul-2015 |
Category: |
Social Media |
Upload: | muhammad-adnan |
Analysing the digital traces of Social Media users
Muhammad Adnan, Guy Lansley, Paul Longley
Consumer Data Research Centre, Department of Geography, University College London
Web: www.uncertaintyofidentity.com ; www.cdrc.ac.uk
Twitter: @gisandtech
Introduction
• Recent years have witnessed rapid growth in the use of
online services
• Online shopping, bank transactions, social networking services
• Issues related to cyber-crime, identity fraud, and hacking
• ‘Uncertainty of Identity’ project: Combining real and virtual
world datasets to better understand the identity of individuals
• Real world (Census, Demographic Classifications)
• Virtual world (Email addresses, Social media accounts)
Introduction
• Geodemographics
• Census data represent the night time geography
• Social media datasets can be used to provide day and travel time
geographies
• Spatial and temporal analysis of social media users
• Activity pattern analysis
• Tweet content analysis
• Develop tools for Identity analysis
• E-mail addresses
• Social media accounts
Outline
• Some popular social media services
• Introduction
• Case Study 1: Social Media Geodemographics
• Case Study 2: Activity pattern analysis
• Temporal analysis of Twitter activity around different world cities
• Case Study 3: Twitter Geographic Profiler
• An Uncertainty of Identity tool
Some popular social media services
• Facebook
• 2 billion total users
• 1.28 billion active users
• Google Plus
• 1.6 billion total users
• 540 million active users
• Twitter
• More than 1 billion total users
• 255 million active users
(1) Mediabistro. 2014. Social Media Stats 2014. Retrieved 17th November, 2014 from http://www.mediabistro.com/alltwitter/social-media-statistics-2014_b57746.
Twitter (www.twitter.com)
• Online social networking and micro-blogging web service
• Users can send messages of 140 characters or less
• Approx. 500 million tweets daily
• 78% of Twitter’s active users are on mobile
• 44% of users have never sent a tweet (inactive users)
• Twitter API: for downloading live tweet data
Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
• A database of 1.4 billion social media messages
• September, 2012 – February, 2014
• Geo-tagged tweets
• Latitude / Longitude
Case Study 1: Social Media Geodemographics
Social Media Geodemographics
• Geodemographics
• “Analysis of people by where they live” (2)
• Night time characteristics of the population
• Social Media Geodemographics
• Moving beyond the night time geography
• Who: Ethnicity, Gender, and Age of social media users
• When: What time of day conversations happen
• Where: Where social media conversations happen
(2) Sleight, P. (2004). Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business.
Twitter data for the case study
• Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)
• Sent by 385,050 unique users
• 155,249 users sent 5 or more tweets (7.6 million tweets)
Flows of people and information
• Entropy is a measure of uncertainty in a random variable
• Shannon Entropy
• 7.6 million tweets were aggregated to 4,765 LSOAs
• Entropy was calculated
• High values indicate high flows of people and information
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
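As a rough illustration of the entropy calculation above, the following sketch computes Shannon entropy per area from tweet records. The slides do not say which random variable the entropy was taken over, so the choice here (the distribution of an area's tweets over distinct users) is an assumption, as are the function names and the `(area_code, user_id)` record layout.

```python
import math
from collections import Counter


def shannon_entropy(counts, base=2):
    """H(X) = -sum(p_i * log_b(p_i)) for a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


def entropy_by_area(tweets, base=2):
    """tweets: iterable of (area_code, user_id) pairs (hypothetical layout).

    Assumption: entropy is taken over the share of an area's tweets
    contributed by each distinct user.
    """
    per_area = {}
    for area, user in tweets:
        per_area.setdefault(area, Counter())[user] += 1
    return {
        area: shannon_entropy(list(user_counts.values()), base)
        for area, user_counts in per_area.items()
    }
```

Under this assumption, an area whose tweets all come from one user scores 0, while an area whose tweets are spread evenly over many users scores highest.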
Flows of people and information
[Figure panels: Morning (6am – 11.59am); Afternoon (12pm – 5.59pm); Evening (6pm – 11.59pm); Night (12 midnight – 5.59am)]
Variables for creating a geo-temporal classification
1. Residence
• Where Twitter users live
2. Ethnicity
• Probable ethnic origins of Twitter users
3. Age
• Probable age of Twitter users
4. Land Use Category of a Tweet message
• Residential; Non-domestic building; Park etc.
5. Temporal Scales
• Day, Afternoon, Night, Peak travel hours
Residence of Twitter Users
• A 170 m × 170 m grid was used to find the probable residence of users
• A probable residence was found for 75,522 users
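One way the grid step could work is sketched below. The slides only state that a 170 m grid was used, so the rule applied here (the cell a user tweets from most often is taken as the probable residence), the minimum of five tweets, and the use of projected coordinates in metres are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

CELL_SIZE_M = 170.0  # grid resolution quoted on the slide


def grid_cell(easting, northing, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to an integer (column, row) cell."""
    return (math.floor(easting / cell_size), math.floor(northing / cell_size))


def probable_residence(tweets, min_tweets=5, cell_size=CELL_SIZE_M):
    """tweets: iterable of (user_id, easting_m, northing_m) (hypothetical layout).

    Assumption: a user's probable residence is their most frequently
    used cell; users with fewer than min_tweets tweets are skipped.
    """
    cells_by_user = defaultdict(Counter)
    for user, easting, northing in tweets:
        cells_by_user[user][grid_cell(easting, northing, cell_size)] += 1

    residences = {}
    for user, cell_counts in cells_by_user.items():
        if sum(cell_counts.values()) >= min_tweets:
            residences[user] = cell_counts.most_common(1)[0][0]
    return residences
```

A stricter variant might count only night-time tweets, in keeping with the night-time geography idea, but the slides do not say whether this was done.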
Extracting demographic attributes of Twitter users by
using their forenames and surnames
A name is a statement of the bearer’s cultural, ethnic, and linguistic
identity (3)
(3) Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PloS ONE (Public Library of Science) 6 (9)
e22943.
Analysing Names on Twitter
• Some examples of NAME variations on Twitter
• Approx. 68% of the accounts have real names
Fake Names
Castor 5.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Justin Bieber Home
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
Onomap: Names to Ethnicity classification
• Onomap was created by clustering names of 1 billion individuals
around the world
• Applied ONOMAP (www.onomap.org) on forename – surname pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
…
…
…
Top 10 Ethnic Groups of Twitter Users
• A total of 67 ethnic groups were identified
Age estimation from ‘forenames’
• Monica dataset provided by CACI Ltd, UK
• Supplemented with UK birth certificate records
Age distribution of Twitter users
Twitter Users vs. 2011 Census (Greater London)
(4) Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
Land-use Categories
• Every tweet message was assigned a land-use category
Variables for creating a geo-temporal classification
1. Residence V1: Tweet made near probable London residence
V2: Tweeter lives ‘outside the UK’
V3: Tweeter lives in the rest of the UK outside London
2. Total Number of Tweets V4: Total number of tweets made by the user
3. Ethnicity V5: West European
V6: East European
V7: Greek or Turkish
V8: South East Asian
V9: Other Asian
V10: African & Caribbean
V11: Jewish
V12: Chinese
V13: Other minority
4. Age V14: <=20
V15: 21 - 30
V16: 31 - 40
V17: 41 - 50
V18: 50+
5. Tweets outside the UK V19: In West Europe (not including UK)
V20: In East Europe
V21: In North America
V22: In Central or South America
V23: In Australasia
V24: In Africa
V25: In Middle East
V26: In Asia
V27: In Paris
Variables for creating a geo-temporal classification
6. Number of countries visited V28: Number of countries tweeter has visited
7. London Land Use Category V29: Residential location
V30: Non-domestic buildings
V31: Transport links and locations
V32: Green-spaces
V33: All other land uses
8. 2011 London Output Area Classification V34: Intermediate Lifestyles
V35: High Density and High Rise Flats
V36: Settled Asians
V37: Urban Elites
V38: City Vibe
V39: London Life-Cycle
V40: Multi-Ethnic Suburbs
V41: Ageing-City Fringe
9. Temporal Scales V42: Morning Peak Hours
V43: Week Day
V44: Afternoon
V45: Week Night
V46: Weekend
Computing the geo-temporal classifications
• Segmentations were created using the K-means clustering algorithm
• K-means tries to find cluster centroids by minimising the within-cluster sum of squares
$$V = \sum_{x=1}^{n} \sum_{y=1}^{n} (z_x - \mu_y)^2$$
• Seven clusters
• Group A: London Residents
• Group B: Commuting Professionals
• Group C: Student Lifestyle
• Group D: The Daily Grind
• Group E: Spectators
• Group F: Visitors
• Group G: Workplace and tourist activity
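The clustering step can be illustrated with a minimal K-means (Lloyd's algorithm) sketch in plain Python. This is not the authors' implementation: the random initialisation, the iteration cap, and the function names are assumptions, and the actual study clustered 46 variables into seven groups rather than the toy two-dimensional points used in the usage note.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    wcss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10). In practice the variables would normally be standardised first and several random starts compared, since K-means only finds a local minimum.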
Group A: London Residents
• Tweets made near primary residential locations
• Tweets made on weeknights or weekends
[Figure: Group A profile across variables V1 to V46, scale 0 to 1]
Group B: Commuting Professionals
• Tweets made from
• Transport locations
• ‘Urban Elites’ LOAC classification
• Tweets made by individuals of intermediate age (21-30)
[Figure: Group B profile across variables V1 to V46, scale 0 to 1]
Group F: Visitors
• Tweeters live outside London
• Tweets originated from residential land uses
• Mixed age groups
[Figure: Group F profile across variables V1 to V46, scale 0 to 1]
Group G: Workplace and tourist activity
• Tweets sent from non-domestic buildings
• Full range of Twitter age cohorts
• Tweets originate from a mix of residents and international visitors
[Figure: Group G profile across variables V1 to V46, scale 0 to 1]
Social Media Geodemographics
• Geo-temporal demographic classifications
• Census (night time geography)
• Social media data (day and travel time geography)
• Issues of representation
• An insight into the residential and travel geographies of individuals
• An insight into the spatial activity patterns of different kinds of
social media users
Case Study 2: Analysis of Twitter activity around
world cities
(5) Muhammad Adnan, Alistair Leak, Paul Longley. “A geocomputational analysis of Twitter activity around different world cities”. Geospatial
Information Science.
Activity Pattern Analysis
• Comparison of the use of Twitter between different cities
• Weekly patterns of activity
• Seasonal shifts
• Data: 19th September, 2012 – 25th September, 2013
• Point-in-polygon operations were performed to extract data
for different cities around the world
• Approx. 170 million tweets were sent from the top 30 cities
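A point-in-polygon test of the kind mentioned above can be sketched with the standard ray-casting method. The slides do not say which method or software was used, so this is only an illustration; the polygon is assumed to be a simple list of `(lon, lat)` vertices, and points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east from
    the point crosses; an odd count means the point is inside.

    polygon: list of (lon, lat) vertices, without repeating the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def tweets_in_city(tweets, polygon):
    """Keep tweets whose (lon, lat), held in positions 0 and 1, fall inside."""
    return [t for t in tweets if point_in_polygon(t[0], t[1], polygon)]
```

With a square boundary `[(0, 0), (10, 0), (10, 10), (0, 10)]`, the point (5, 5) is kept and (15, 5) is dropped. For hundreds of millions of tweets, a bounding-box pre-filter or a spatial index would usually be applied before the exact test.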
Top 30 cities on Twitter
[Figure: bar chart of the number of tweets (millions) sent from each of the top 30 cities]
Time zone issue
• By default, the Twitter API returns timestamps in GMT (UTC)
• Data was converted from GMT to the corresponding local time
zones
Date & Time (GMT) | Date & Time (UTC +1)
Wed Dec 05 00:04:23 2012 | Wed Dec 05 01:04:23 2012
Wed Dec 05 00:06:29 2012 | Wed Dec 05 01:06:29 2012
Wed Dec 05 00:07:35 2012 | Wed Dec 05 01:07:35 2012
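The conversion shown in the table can be sketched with Python's standard `datetime` module. The timestamp format string mirrors the table above rather than the raw API field, and the fixed-offset approach is an assumption made to keep the example self-contained; it ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Format of the timestamps as printed in the table above (assumed).
TWEET_FORMAT = "%a %b %d %H:%M:%S %Y"


def gmt_to_local(timestamp, utc_offset_hours):
    """Convert a GMT timestamp string to a fixed UTC offset.

    A fixed offset ignores daylight saving time; a named time zone
    (for example via the standard zoneinfo module) would handle that.
    """
    parsed = datetime.strptime(timestamp, TWEET_FORMAT).replace(tzinfo=timezone.utc)
    local = parsed.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime(TWEET_FORMAT)
```

Calling `gmt_to_local("Wed Dec 05 00:04:23 2012", 1)` reproduces the first row of the table.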
Temporal Analysis of Twitter Cities
[Figure panels: Jakarta; Istanbul; Paris; Sao Paulo, Brazil; New York City; London; Riyadh; Tokyo; Madrid; Buenos Aires, Argentina]
[Figure panels, detailed views: London; London and Paris; Jakarta; Jakarta and Riyadh; New York City; New York City and Tokyo]
Case Study 3: Twitter Geographic Profiler (a part of
Uncertainty of Identity Toolkit)
Introduction
• Uncertainty of Identity Toolkit is a framework for the
identification and profiling of individuals from their
• Social media accounts
• E-mail addresses
• Twitter Geographic Profiler
• Maps ethno-cultural communities of a person’s friends
• Extracting identities of Twitter users
• Mapping them to probable ethnic origins
• Could have potential applications in targeted marketing
Twitter Geographic Profiler
• Given an individual’s Twitter Username or ID
• Extracts the information of individual’s friends
• Extracts the forename-surname pairs of the friends
• Maps forename-surname pairs to Onomap
• Builds an ethno-cultural profile of the person’s friends
• Maps the geographic distribution
Data available through the Twitter API
• User ID
• User Creation Date
• Followers
• Friends
• Language
• Location
• Name
• Screen Name or User Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
Twitter: getting the ids and usernames
• Given a Twitter username of a person, we use the Twitter
API to get the list of friends’ ids
– A max of 15 requests every 15 minutes is allowed
– Each query can get up to 5000 ids
– Generally enough to download all the ids
• Using the ids, we fetch the name associated to each id
– Limited to 180 requests every 15 min
– Returns a single string from which we need to extract the name
and surname tokens
– Not necessarily a valid forename + surname!
• E.g., “University of Birmingham”, “John1965”, “ What is Love”,
“Mystic_mind”
Twitter: getting forename-surname pairs
• Name field was divided into different tokens
• Forenames and surnames were detected by matching the
string tokens against a database of forename–surname
pairs from 26 countries
• Users discarded
– where tokens were not matched against valid forename and
surname
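The token-matching step above might look roughly like the following. The tiny name sets stand in for the 26-country name database, which is not reproduced here, and the title list and matching rule (first known forename, last known surname) are assumptions for illustration.

```python
# Hypothetical stand-ins for the real forename and surname database.
KNOWN_FORENAMES = {"kevin", "andre", "jose", "carolina", "martha", "pablo"}
KNOWN_SURNAMES = {"hodge", "alves", "franco", "thomas", "val", "mateos"}
TITLES = {"dr", "prof", "mr", "mrs", "ms"}  # assumed list of titles to drop


def tokenize(name):
    """Lower-case the name, replace non-letters with spaces, drop titles."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in name)
    return [t for t in cleaned.lower().split() if t not in TITLES]


def extract_name_pair(name, forenames=KNOWN_FORENAMES, surnames=KNOWN_SURNAMES):
    """Return (forename, surname), or None when the user should be discarded."""
    tokens = tokenize(name)
    forename = next((t for t in tokens if t in forenames), None)
    surname = next(
        (t for t in reversed(tokens) if t in surnames and t != forename), None
    )
    if forename is None or surname is None:
        return None
    return (forename, surname)
```

With these sample sets, "Prof. Martha Del Val" yields `("martha", "val")`, while "MysticMind" and "WHAT IS LOVE?" return `None` and would be discarded.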
Onomap: from names to ethnicity
• ONOMAP (www.onomap.org) was applied on forename –
surname pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
…
…
…
Friends’ Ethnicity Histogram
Once the entire list of friends name + surname pairs has been parsed, we can
easily estimate the distribution over the set of possible ethno-cultural groups of
the Twitter user's friends
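Once each friend has a group label, the histogram is a matter of counting and normalising. A minimal sketch, assuming the labels arrive as a plain list of strings (the function name is illustrative):

```python
from collections import Counter


def ethnicity_distribution(groups):
    """Turn a list of group labels into {label: share of friends},
    ordered from the most to the least frequent group."""
    counts = Counter(groups)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.most_common()}
```

For instance, the labels `["English", "English", "Spanish", "English"]` give a share of 0.75 for English and 0.25 for Spanish.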
surname. A forename or surname is a statement of the bearer’s
cultural, ethnic, and linguistic identity [4]. The tool uses an
efficient approach to identify the presence of surnames as
substrings in an e-mail address. Then, it predicts the probable
ethnicity, and maps the geographical distribution of the surname
in the UK. For this purpose, this tool uses data from three
different data sources, namely Onomap, Worldnames [6], and the
2007 Register of Electors for the United Kingdom.
Worldnames is an online service which maps the geographical
distribution of a searched surname around 26 different countries
of the world. It was created by using the names data extracted
from the telephone directories and electoral registers from
different countries.
This paper is organized as follows. Section 2 of the paper
describes the Twitter geographic profiler tool and discusses the
use and privacy implications of the tool. Section 3 describes the
E-mail Address Profiler tool and the underlying suffix tree
construction algorithm. Finally, section 4 concludes the paper.
2. TWITTER GEOGRAPHIC PROFILER
This tool builds a map of the ethno-cultural communities of a
person's friends. That is, we want to determine the distribution
over the set of possible ethno-cultural groups of the friends of a
given individual. To this end, we integrate information from two
sources, namely Twitter and Onomap. Note that the same ideas
can be applied to data collected from other Online Social
Networks (OSNs), such as Facebook or Foursquare¹. However,
different OSNs capture social interactions around different and
sometimes specific themes, e.g., Foursquare’s venues. In this
work, we decide to focus on Twitter data because of the general
context of the interactions, i.e., they are not restricted to a specific
theme or interest, and because, unlike Facebook, information is
easily accessible through the Twitter API².
More specifically, given the Twitter username of the person being
analysed, we download the list of (surname, forename) pairs of
his or her friends. We then map this list of names to a list of
ethno-cultural groups, according to the classification of Onomap.
We also map the surnames to the most probable countries of
origin. With these lists to hand, we estimate respectively the
distribution of a user’s friends over the set of possible ethno-
cultural groups and over the set of countries. In the following
subsections we report the implementation details of the tool and
its applications and implications in terms of users' privacy.
Finally, note that the social graph of Twitter is directed, in the
sense that the friendship relation is not necessarily reciprocated.
As a consequence, there are two lists associated with each user,
one for the accounts that the user is following and one for the
accounts that follow the user, i.e., his or her followers. In this
work, we consider the first as representing the list of a user's
friends. Subsection 2.1 describes the system implementation of
this tool and Subsection 2.2 discusses the use of the tool and its
privacy implications.
¹ Foursquare is a Location-Based Social Network (LBSN). LBSNs are
based on the concept of check-in, where a user can register in a certain
location and share this information with friends. Moreover, the user can
leave recommendations and comments about the visited venues.
² https://dev.twitter.com/docs/api
2.1 System Implementation
Given the Twitter username of an individual, we first probe the
list of his/her friends’ ids using the GET friends/ids method. As
with most methods in the API, the number of requests that can
be performed in a certain time interval is bounded. More
specifically, we are only allowed to send 15 requests every 15
minutes. Note also that with each query we can only get up to
5000 user ids. However, we find that generally a single request
is sufficient to download the complete list of friends. Given the
list of ids, we use the GET users/lookup method to fetch the
(surname, forename) pair associated with each id. This returns up
to 100 user profiles, given a list of ids as input. Note that the request
rate of the GET users/lookup method is currently limited to 180
requests every 15 minutes. We should stress that these rate
limitations do not prevent us from parsing the complete list of friends
of a user, as the distribution of the number of accounts followed
by Twitter users has been shown to be approximately a power law
distribution [7]. As a consequence, the majority of users
actually follow a limited number of profiles, which are then
accessible even with the rate limitation in place.
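To make the rate-limit arithmetic concrete, the sketch below batches friend ids into groups of 100 and estimates how many 15-minute windows a profile would need. The limits are the ones quoted in the text; the function names are hypothetical, and no real API call is made.

```python
import math

# Limits as quoted in the text above.
IDS_PER_FRIENDS_REQUEST = 5000
FRIENDS_REQUESTS_PER_WINDOW = 15
USERS_PER_LOOKUP_REQUEST = 100
LOOKUP_REQUESTS_PER_WINDOW = 180


def chunk(ids, size=USERS_PER_LOOKUP_REQUEST):
    """Split a list of ids into batches of at most `size` ids each."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def requests_needed(n_friends):
    """Estimate requests and 15-minute windows for one user's friend list."""
    id_requests = max(1, math.ceil(n_friends / IDS_PER_FRIENDS_REQUEST))
    lookup_requests = math.ceil(n_friends / USERS_PER_LOOKUP_REQUEST)
    return {
        "id_requests": id_requests,
        "lookup_requests": lookup_requests,
        "id_windows": math.ceil(id_requests / FRIENDS_REQUESTS_PER_WINDOW),
        "lookup_windows": math.ceil(lookup_requests / LOOKUP_REQUESTS_PER_WINDOW),
    }
```

A user following 1,200 accounts needs one ids request and twelve lookup requests, all within a single window. Only users following more than 18,000 accounts (180 requests × 100 profiles) would spill into a second lookup window.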
With the list of (surname, forename) pairs to hand, we query
Onomap to get the ethno-cultural classification associated with
each (surname, forename) pair, and the
SearchSurnameTopCountries method to get the list of the
countries where an instance of a given surname was observed.
Each element of the latter list is attributed with the relative
frequency of the corresponding surname in that country, so as to
take into account differences in population counts. Given this
ranking, we then classify each surname as originating from the
corresponding highest ranked country. As for SearchEthnicity, the
method returns the most probable classification for both the
surname and the forename, as well as the overall classification for
the pair. Finally, in order to query the Onomap API in an
asynchronous way, we make use of the grequests³ package.
³ https://github.com/kennethreitz/grequests
Figure 1: Screenshot of the Twitter Geographic Profiler. The
bottom part of the screen shows the histogram of the Twitter
user's friends ethno-cultural groups.
Friends’ Geographic Origins
Map showing the geographic origin of the Twitter user's friends’ surnames as assigned by our tool. Below the map, the user is shown a list of the top 10 countries with their respective frequencies.
However, we decide to limit the number of simultaneous
asynchronous requests to 50 in order to avoid congesting
Onomap's server.
Once the entire list of friends (surname, forename) pairs has been
parsed, we can easily estimate the distribution over the set of
possible ethno-cultural groups and over the set of countries of the
Twitter user's friends. Figs. 1 and 2 respectively show the
histogram of the ethno-cultural groups and a map visualising the
countries of origin of the friends of a sample user. In the map a
darker (brighter) color denotes a higher (lower) probability of
having a friend originating from that country.
Note that, when we extract the (surname, forename) pairs using
the GET users/lookup method, a filtering system needs to be put
in place to discard invalid strings. In fact, while in other OSNs
such as Facebook the user is forced to enter the surname and the
forename in two separate fields, in Twitter the users are required
to enter their name (or any alternative identifying string) in a
single Username field. As a consequence, we need to parse the
username string to separate it into its constituent tokens. Then, we
need to apply some heuristic to detect the (surname, forename)
pair among the extracted tokens. In this work we mark as invalid
any string that is composed of a single token. If this is the case,
we skip the profile of the corresponding friend.
If the string contains two or more tokens, we take the first one to
be the forename and the last one to be the surname. Moreover,
when a (surname, forename) pair is sent to Onomap, an error
message will indicate if the system is unable to parse the surname
or the forename, or both. In the latter case we stop the
computation and we proceed to parse the next (surname,
forename) pair.
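The first-token and last-token heuristic described above is short enough to sketch directly. The function name is illustrative, and whitespace splitting is an assumption about how tokens were separated.

```python
def split_name(display_name):
    """Return (surname, forename) using the first and last tokens,
    or None when the string has fewer than two tokens."""
    tokens = display_name.split()
    if len(tokens) < 2:
        return None  # single-token strings are marked invalid
    return (tokens[-1], tokens[0])
```

Here `split_name("Fabíola Sanchez Fernandes")` gives `("Fernandes", "Fabíola")` and `split_name("MysticMind")` gives `None`. A string such as "WHAT IS LOVE?" still passes this heuristic, which is why the error message returned by Onomap serves as a second filter.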
2.2 Discussion & Privacy Implications
Although relatively simple, the above tool can be used in a
number of applications that leverage the ethno-cultural
information of a person's friends. To start with, note that in
addition to what is described above we can query Onomap to
classify the (surname, forename) pair of the Twitter user whose
friends list is being analysed. Hence, given a large enough sample
of users, we can estimate the average friendship distribution of a
given ethno-cultural group.
This in turn can be used to measure the multiculturality of a given
community. For example, one may compute the Shannon entropy
of the ethno-cultural distribution of a community (or an
individual) to get a readily interpretable measure of how open
the community (the individual) is to other groups, in terms of
bonds. Intuitively, the more peaked the distribution is, the
lower the Shannon entropy and the less prone a community is to
bond with a large number of groups. Similarly, by computing the
Jensen-Shannon divergence between the average distributions [8]
and applying multidimensional scaling [9] to the resulting
distance matrix, one can embed these communities in
Euclidean space for the purpose of visualising and grouping
similar ethno-cultural groups.
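Both measures mentioned above can be sketched in a few lines, assuming each distribution is a dictionary mapping a group label to its probability. The function names and the base-2 logarithm are choices made for the example; the Jensen-Shannon divergence is computed as the entropy of the mixture minus the mean of the two entropies.

```python
import math


def entropy(dist, base=2):
    """Shannon entropy of a {label: probability} distribution."""
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)


def jensen_shannon_divergence(p, q, base=2):
    """JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2, where M = (P + Q) / 2."""
    labels = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in labels}
    return entropy(mixture, base) - 0.5 * (entropy(p, base) + entropy(q, base))
```

With base 2 the divergence lies between 0 (identical distributions) and 1 (no shared groups). A fully peaked distribution such as `{"English": 1.0}` has entropy 0, while an even split over two groups has entropy 1.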
However, note that we expect the level of exposure to different
ethno-cultural groups to vary across the geographical space. That
is, on average a resident of London is likely to have friendships
spanning a wider spectrum of communities than a resident
of Swansea⁴, due to the substantial mixture of ethnic groups living
in London. As a consequence, the above analysis should be
performed within a limited geographical space. Luckily, it has
been shown that roughly 50% of Twitter users have a location
assigned in their profile, and the vast majority of these locations
are at town level [10], thus such an analysis would indeed be
feasible.
Given the friendships distribution of a given ethno-cultural group,
it is also possible to use outlier detection techniques [11] to
identify individuals or groups of individuals that stand out in terms
of the ethno-cultural groups they bond with. Potentially, one can
also infer the ethnicity of an individual whose name is unknown
but for which a list of friend names is available.
To understand the extent of the privacy implications of our tool,
we should stress that the default behaviour of Twitter is to set the
profile of a user as public. Although the setting can be changed to
private, thus making it impossible for our tool to operate on the
profile, when testing our tool we did not encounter any private
profile. Consequently, we can safely assume to be able to
download the list of names of a user’s friends and perform our
ethno-cultural profiling.
As for the limitations of the current implementation of our tool,
we observed that the Twitter data contains a large amount of
noise, which can considerably affect the results of the
computation. The source of this noise is twofold. Firstly, the need
to extract the surname and forename tokens from a single
string introduces unwanted uncertainty. In this sense, more
sophisticated natural language processing techniques should be
investigated to extract the correct (surname, forename) pair.
Secondly, we note that a considerable number of accounts
followed by Twitter users are actually news
feeds, e.g., BBC, CNN, etc., celebrities, notable academics and
⁴ http://www.swansea.gov.uk/index.cfm?articleid=44946
Figure 2: Map showing the geographical origin of the Twitter
user's friends’ surnames as assigned by our tool. Below the
map the user is shown a list of the top 10 countries with the
respective frequency.
Twitter Geographic Profiler
• Potential applications include
– Measure the level of segregation/integration of a given individual
(community) as the Shannon entropy of the (average) friends’
ethnicity histogram
– Outlier detection: identify uncommon behaviors, e.g., individuals
that stand out in terms of the ethno-cultural groups they bond with
• Limitations
– Twitter data is very noisy
– Request limits
Conclusion
• Social media datasets can be used to create geo-temporal
demographic classifications
• Day and travel time geographies
• Activity patterns
• Temporal analysis can identify some interesting patterns of a
geographical area
• Weekly patterns of activity
• Seasonal shifts
• Twitter Geographic Profiler: Identification and profiling of
ethno-cultural characteristics of individuals
• From their Twitter accounts
Conclusion
• Study of privacy implications on social media services
• Facebook, FourSquare
• Future work: Consumer Data Research Centre
• Use of social media for retail sector
• Spatial and temporal catchments of the social media users
Time catchments
• E.g. Day-time catchment
1. Identify the unique ID of users frequently transmitting from a particular location at a given time or date range
2. Request their other activity through Twitter’s API, filter by time/date
3. Aggregate
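The three catchment steps can be sketched on in-memory data. In this illustration each tweet is a dictionary with `user`, `location`, and `hour` keys, which is an assumed layout; the threshold for "frequently" is also an assumption, and step 2 is simulated by filtering the same list rather than calling Twitter's API.

```python
from collections import Counter


def frequent_users(tweets, location, hours, min_tweets=2):
    """Step 1: users who tweeted from `location` at least `min_tweets`
    times during the given hours (threshold is an assumption)."""
    counts = Counter(
        t["user"] for t in tweets
        if t["location"] == location and t["hour"] in hours
    )
    return {user for user, n in counts.items() if n >= min_tweets}


def catchment(tweets, users, origin, hours=None):
    """Steps 2 and 3: collect those users' tweets sent away from `origin`
    (optionally restricted to `hours`) and aggregate them by location."""
    return Counter(
        t["location"] for t in tweets
        if t["user"] in users
        and t["location"] != origin
        and (hours is None or t["hour"] in hours)
    )
```

For a work-day catchment of Bishopsgate, one might call `frequent_users(tweets, "Bishopsgate", range(9, 18))` and pass the result to `catchment(tweets, users, "Bishopsgate")`; the resulting counts per location are what would be mapped.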
[Figure: The Twitter work-day time catchment of Bishopsgate]
[Figure: Activity at Bishopsgate in 2013]
[Figure panels: Waterloo; St Pancras; Victoria; Paddington; London Bridge; Liverpool Street; Kings Cross; Euston; Natural History Museum]
Residential catchment of Twitter users
• First establish which users have tweeted from inside the building
• Create a customer catchment by identifying all of these users’ tweets sent from domestic land uses
• E.g. ASDA in Clapham Junction
The Twitter residential catchment of ASDA
Supermarket at Clapham Junction
Any Questions ?
Thanks for Listening