Social and economical networks from (big-)data
Esteban Moro (@estebanmoro)
Master City Science, April 2016
Summary
1. Intro to Social/Geo Big Data
2. Sources of Social/Geo Big Data
3. Tools for Social/Geo Big Data
4. Applications of Big Data in Social and Economic problems
5. Outlook
1. Intro to Social/Geo Big Data
The data explosion
The three V’s
90% of the data today was created in the last 2 years
Volume
Volume: http://blogs.msdn.com/b/data__knowledge__intelligence/archive/2013/02/18/big-data-big-deal.aspx
Velocity
Variety
The three layers of resources
Data is not information, nor is it value
Data, Information, Knowledge, Decision, Action
ML, SNA, NLP
Data is not information, nor is it value
NLP, SNA
Tweets about an event, brand, or person
Linguistic analysis of their content
Content classification. Alert generation
McKinsey Global Institute Big Data Report 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
We are what we repeatedly do
Situation Behavior Observation
> Big Data, better answers
• Improve our understanding of well-known problems
• Different geo/temporal scales: real time (nowcasting/forecasting), small areas
> Big Data, big new questions
• Unknown/unsolved problems
2. Sources of social/geo big data
Geo and Social Data Sources
Axes: Frequency, Semantics
• Social networks: Twitter, Facebook, Foursquare, etc.
• Google: points of interest, searches, etc.
• Financial data: transfers, credit card transactions
• Mobile phone: CDRs (calls/SMS), network events, etc.; phone sensors
Maps
• Raster images
• Google Maps & OpenStreetMap
• Static maps + routes
• http://maps.google.com/maps/api/staticmap
• http://open.mapquestapi.com/guidance/v1/
• CartoDB
• https://cartodb.com
Social media data sources
Who, What, Where, With whom, When
Mining the Social Web, O’Reilly: http://shop.oreilly.com/product/0636920030195.do
Social media data sources
2M geolocated tweets in Madrid
Social media data sources
Where, Who, How, What
Mining the Social Web, O’Reilly: http://shop.oreilly.com/product/0636920030195.do
Social media data sources
Shops & Services, Food, Professional
Mobile phone data
Where, When, With whom
Credit card
Where, When, What
How much does Big Data cost? Sources of data
• Free APIs (http://dev.twitter.com)
• Data vendors: GNIP, Datasift
• Data cost is a function of volume and query complexity.
• Volume: 10k tweets = $1
• Complexity: 1 unit = $0.20
• Typical queries (a word/hashtag) in a week = $100s
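As a rough illustration of the pricing rule above, the sketch below adds a volume term ($1 per 10k tweets) and a complexity term ($0.20 per unit). The additive form, the function name `estimate_query_cost` and the sample query are assumptions made for illustration; real vendor pricing may combine these factors differently.

```python
def estimate_query_cost(num_tweets, complexity_units,
                        usd_per_10k_tweets=1.00, usd_per_unit=0.20):
    """Rough cost in USD, assuming volume and complexity costs simply add up."""
    volume_cost = (num_tweets / 10_000) * usd_per_10k_tweets
    complexity_cost = complexity_units * usd_per_unit
    return volume_cost + complexity_cost


# Hypothetical query: 2 million matching tweets in a week, 5 complexity units
weekly_cost = estimate_query_cost(2_000_000, 5)
print(f"Estimated weekly cost: ${weekly_cost:.2f}")
```

With these made-up inputs the volume term dominates, which fits the slide's order of magnitude of hundreds of dollars for a typical weekly query.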
Other sources of Big Data: data monetization
http://insights.wired.com/profiles/blogs/monetizing-data-milking-the-new-cash-cow
Other sources of Big Data: data monetization
https://www.commerce360.es
http://dynamicinsights.telefonica.com
Other sources of data: http://insideairbnb.com
Other sources of data (pictures, Panoramio): http://www.sightsmap.com
Other sources of data (pictures, Flickr): https://www.flickr.com/photos/walkingsf/sets/72157627140310742/with/5925795351/
Other sources of data (pictures, NASA): http://www.citylab.com/tech/2014/05/the-economic-data-hidden-in-satellite-views-of-city-lights/371660/
3. Applications of Big Data
What can we do with social/geo big data?
a) Build models of user behavior:
• Geo-social activity
• Geo-individual recommendation
• Geomarketing
• Fraud detection
• Dynamic insurance pricing
b) Build models of area activity:
• Optimal distribution of resources (retail, banks)
• Event detection
• Measuring fluxes between areas (traffic, transport, health, etc.)
• Macroeconomic indices of areas
IV Conference on the scientific analysis of mobile phone datasets (2015)
250 participants :: 140 institutions, 32 countries, 5 continents
2015
Crowds: real-time event detection in cities; estimating attendance of events
Cities: energy consumption; predicting crime hotspots; health catchment areas; census estimation
Economies: loan repayment; food consumption and poverty indices; microcredit approval; labor market
Societies: spread of diseases; social influence; privacy; product adoption; marketing
Mobility: mobility prediction; impact of the sharing economy; optimization of public transportation
Mobility, Content, Activity, Social
DIGITAL IN SPAIN, JAN 2015: a snapshot of the country’s key digital statistical indicators (We Are Social, @wearesocialsg)
• Total population: 46.5 million (urbanisation: 77%). Figure represents total national population, including children
• Active internet users: 35.7 million (penetration: 77%). Figure includes access via fixed and mobile connections
• Active social media accounts: 22.0 million (penetration: 47%). Figure represents active user accounts, not unique users
• Mobile connections: 50.3 million (vs. population: 108%). Figure represents mobile subscriptions, not unique users
• Active mobile social accounts: 17.8 million (penetration: 38%). Figure represents active user accounts, not unique users
• Sources: Wikipedia; InternetLiveStats, InternetWorldStats; Facebook, Tencent, VKontakte, LiveInternet; GSMA Intelligence
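The penetration percentages on this slide can be checked by dividing each indicator by the total population. The short sketch below does that with the slide's own figures; the variable and function names are illustrative choices, not part of the original material.

```python
# Figures in millions, as shown on the slide (Spain, January 2015)
total_population = 46.5
indicators = {
    "Active internet users": 35.7,
    "Active social media accounts": 22.0,
    "Mobile connections": 50.3,
    "Active mobile social accounts": 17.8,
}


def share_of_population(value_millions, population_millions):
    """Return the indicator as a whole-number percentage of the population."""
    return round(100 * value_millions / population_millions)


shares = {name: share_of_population(value, total_population)
          for name, value in indicators.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% of population")
```

Mobile connections exceed 100% because the figure counts subscriptions rather than unique users.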
TOP ACTIVE SOCIAL PLATFORMS, JAN 2015 (We Are Social, @wearesocialsg)
• Survey-based data: figures represent users’ own claimed / reported activity
• Categories: social network; messenger / chat app / VoIP
• Values shown: 42%, 33%, 20%, 17%, 12%, 11%, 10%, 9%, 9%, 7%
• Platforms labeled: Facebook Messenger, Skype, Google+, Shazam
• Source: GlobalWebIndex, Q4 2014. Figures represent percentage of the total national population using the platform in the past month.
TIME SPENT WITH MEDIA, JAN 2015 (We Are Social, @wearesocialsg)
• Average daily use of the internet via a PC or tablet (internet users): 3H 58M
• Average daily use of the internet via a mobile phone (mobile internet users): 1H 51M
• Average daily use of social media via any device (social media users): 1H 54M
• Average daily television viewing time (internet users who watch TV): 2H 31M
• Survey-based data: figures represent users’ own claimed / reported activity
• Note that average times are based solely on people who use each medium, and do not factor in non-users
• Source: GlobalWebIndex, Q4 2014. Based on a survey of internet users aged 16-64.
Opinion: political opinion; product/brand opinion
Cities: tourism activity; event detection
Economies: unemployment; microcredit approval; human resources
Social: influencer detection; community analysis; social mobilization
Mobility: tourism in cities; world-wide transport
Mobility, Content, Activity, Social
Dynamic population estimation
Deville, P., et al. (2014). Dynamic population mapping using mobile phone data. PNAS 111(45), 15888–15893. http://doi.org/10.1073/pnas.1408439111
Purchasing behavior during holidays
BBVA + MIT
Mobility inside cities
Habidatum
Mobility between cities
A. Llorente, E. Moro et al. (2014)
Event detection
Orange
Transport
http://cargocollective.com/juanfrans
Tourism
http://www.centrodeinnovacionbbva.com/bbvatourism
Real estate
http://www.urbandataanalytics.com/2014/03/12/las-edades-de-madrid/
Real estate
https://mcorella.cartodb.com/viz/2858ca72-e1ec-11e5-bfd8-0ea31932ec1d/public_map
http://analytics.afi.es/AfiAnalytics/noticias/1503332/1491511/0/es-tu-casa-grande-o-pequena-y-las-de-tu-barrio.html
Real estate
http://www.datanami.com/2015/08/12/inside-the-zestimate-data-science-at-zillow/
Real estate
http://www.amazon.com/Zillow-Talk-Rules-Real-Estate/dp/1455574740
Health
Prediction of air quality in cities (http://www.bsc.es/caliope/es)
Health
Correlation between content in social networks and symptoms
[Figure: incidence (per 100k users) versus weeks since Jan 2012; panels: flu, allergy, headache, fever; incidence levels: high, medium, low]
Health
Correlation between content in social networks and symptoms
Table 2. Abbreviation, full name, description, type and source of all indicators analyzed. All are represented as density variables of the system.
Abbreviation | Full name | Description | Type | Source
ARTEM | Artemisia | Pollen count of Artemisia (grains / m3 of air) | Pollen | Spanish Aerobiology Committee
ALTER | Alternaria | Pollen count of Alternaria (grains / m3 of air) | Pollen | Spanish Aerobiology Committee
3. Results
First, Figure 1 shows all the time series that were captured and built. They fall into three categories: deaths from circulatory, respiratory and digestive causes; official ILI cases and ILI-related searches in Google; and health-related time series from Twitter, covering first-person mentions of symptoms, treatments, ILI and the common cold. A further group of time series, which can be seen as factors or possible predictors of the health time series, describes the composition and quality of the air: pollutants and pollens.
Figure 3. Time series correlation matrix with statistical significance, clustered in three groups: a first group at the top left for the autumn-winter season, a second group at the center of the matrix for the summer season, and a final group for big particles in the air. Blank entries correspond to correlations that are not statistically significant at the 95% confidence level.
To determine whether the captured time series were correlated, a Pearson correlation matrix was calculated (Figure 3). It shows a clear positive correlation among the health-related time series, which also correlate positively with some pollutants (NO, NO2, NOX, CO and C6H6) and some pollens (ALNUS, CUPRE, FRAXI, MERCU and ULNUS). All these time series share a similar seasonality during the cold months of the year and form a clear group within the correlation matrix. A second group, formed mostly by pollens and O3, peaks during the hottest months of the year. Finally, there is a third cluster whose variables correlate very strongly with one another but show zero or very small correlation with the rest of the time series.
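A minimal sketch of such a correlation matrix is given here, with non-significant entries left blank (`None`) as in Figure 3. The three weekly series are synthetic stand-ins, not the study's data, and the significance test uses the Fisher z-transform with a normal approximation, which is an assumption: the excerpt does not say which test the authors applied.

```python
import math
import random
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def correlation_matrix(series, alpha=0.05):
    """Map (name_a, name_b) to r, or to None when r is not significant."""
    matrix = {}
    for a in series:
        for b in series:
            r = pearson(series[a], series[b])
            significant = p_value(r, len(series[a])) < alpha
            matrix[a, b] = r if significant else None
    return matrix


# Synthetic weekly series over two years: a winter-peaking signal plus noise
rng = random.Random(42)
cold = [math.cos(2 * math.pi * week / 52) for week in range(104)]
series = {
    "flu_mentions": [100 + 50 * c + rng.gauss(0, 10) for c in cold],
    "no2": [40 + 10 * c + rng.gauss(0, 3) for c in cold],
    "o3": [60 - 20 * c + rng.gauss(0, 5) for c in cold],  # peaks in summer
}

matrix = correlation_matrix(series)
for (a, b), r in matrix.items():
    print(a, b, "blank" if r is None else f"{r:.2f}")
```

In this toy setup the winter-peaking series correlate positively with each other and negatively with the summer-peaking one, which mirrors the seasonal grouping described in the text.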
Figure 4. Geospatial representation of the logarithmic transformation of the total mentions of health-related time series from Twitter between 2013 and 2014.
As a next step toward deeper insight, Figure 4 presents a spatial analysis: the logarithmic transformation of the total number of health-related mentions on Twitter by Spanish municipality. Big cities show the highest proportion; this is attributed to scale-free processes in which large populations act as nodes of attraction that produce a high number of mentions. Among the health-related mentions, headache symptoms are the most widely spread geographically, followed by ILI, common cold and fever mentions, while respiratory mentions are concentrated in densely populated areas.
Health
Correlation between content in social networks and symptoms
[Figure: incidence (per 100,000 users) by day of the week, Monday to Sunday; panels: headache, backache]
Political opinion
• Example: identifying supporters during political campaigns. Catalan elections 2010
[Figure: parties plotted on two axes ranging from -1.0 to 1.0; party labels include PSC, CiU, ERC, PPC, ICV, C's and SOL]
Political opinion
General Strike Spain March 12
References
• Reviews on mobile phone applications
• Blondel, V. D., Decuyper, A., & Krings, G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1), 10. http://doi.org/10.1140/epjds/s13688-015-0046-0
• UN Global Pulse (2013). Mobile Phone Network Data for Development.
• Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6
• Naboulsi, D., Fiore, M., Ribot, S., & Stanica, R. (n.d.). Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 1–1. http://doi.org/10.1109/COMST.2015.2491361
• Conferences
• NetMob http://netmob.org
• NetSci http://netsci2016.net
References
• Mining the Social Web, O’Reilly http://shop.oreilly.com/product/0636920030195.do
• Applications
• Pinheiro, C. A. R. (2011). Social network analysis in telecommunications. John Wiley & Sons.
• Morselli, C., ed. (2013). Crime and Networks. Routledge.
3. Tools for social/geo big data
Libraries
• There are many frameworks to study social networks
• In general we have:
• Analysis platforms: they implement most of the algorithms for graph analysis:
• Local metrics (degree, clustering)
• Centrality metrics (betweenness, closeness, etc.)
• Community finding algorithms
• Visualization libraries:
• Display graphs in different forms (layout, colors, etc.)
• Graph databases: allow (distributed) storage, queries and some types of analysis of (big) graph data.
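To make the local and centrality metrics concrete, the dependency-free sketch below computes degree, the local clustering coefficient and closeness centrality on a small made-up graph. The graph and function names are hypothetical; analysis platforms provide optimized versions of these metrics, and betweenness and community finding are left out to keep the example short.

```python
from collections import deque

# Hypothetical undirected friendship graph stored as an adjacency dict
graph = {
    "ana": {"bob", "cat", "dan"},
    "bob": {"ana", "cat"},
    "cat": {"ana", "bob"},
    "dan": {"ana", "eve"},
    "eve": {"dan"},
}


def degree(g, node):
    """Number of neighbors of a node."""
    return len(g[node])


def clustering(g, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = list(g[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in g[nbrs[i]])
    return 2 * links / (k * (k - 1))


def closeness(g, node):
    """(n - 1) over the sum of shortest-path distances; assumes a connected graph."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(g) - 1) / sum(dist.values())


for name in graph:
    print(name, degree(graph, name),
          round(clustering(graph, name), 2), round(closeness(graph, name), 2))
```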
3 layers of graph technologies
Graph databases
• Network data can be stored in many kinds of databases
• However, in recent years interest in graph databases has grown steadily
http://db-engines.com/en/ranking_categories
Graph databases
• They are databases that use graph structures for queries. Data is represented using nodes, edges and their properties
• Each node knows its neighbors
• They make it very easy to implement graph queries like:
• Find the neighbors of a node
• Find the path between two nodes
• In a typical relational database, those queries require several “joins”:
http://neo4j.com/developer/graph-db-vs-rdbms/
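The contrast can be sketched in plain Python: when each node keeps its own neighbor set, neighbor and path queries are direct traversals, whereas a relational-style edge table needs one self-join per extra hop. All names and data below are hypothetical, and the code only imitates the two storage styles rather than using a real database.

```python
from collections import deque

# Hypothetical "knows" relationships as relational-style rows (source, target)
edge_table = [("ana", "bob"), ("bob", "cat"), ("cat", "dan"), ("ana", "eve")]

# Graph-style storage: each node keeps its own set of neighbors
adjacency = {}
for a, b in edge_table:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def neighbors(node):
    """Direct lookup of a node's neighbors."""
    return adjacency.get(node, set())


def shortest_path(start, goal):
    """Breadth-first search; returns the list of nodes, or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in neighbors(u):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def two_hop_via_join(start):
    """Two-hop targets from the edge table, read as directed rows: one self-join."""
    return {d for (a, b) in edge_table for (c, d) in edge_table
            if a == start and b == c}


print(neighbors("ana"))
print(shortest_path("ana", "dan"))
print(two_hop_via_join("ana"))
```

Each additional hop in the relational version adds another pass over the edge table, which is the cost of the "joins" that the slide refers to.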
Graph databases
• Some examples
• Neo4j (commercial/open-source): probably the most widely used. It has its own query language (Cypher). It can be accessed from many other languages (R, Python, Java) http://neo4j.com
• Sparksee (commercial): built for high performance and scalability. http://sparsity-technologies.com
• Titan (Apache): distributed graph database, built to store and query graphs with billions of nodes and edges. http://thinkaurelius.github.io/titan/
Neo4j
• It can be used through a REST API (HTTP)
• It has its own query language: Cypher
Neo4j Cypher
http://neo4j.com/developer/cypher-query-language/
Application: Panama papers
1. You can download the full Panama papers database in Neo4j format
2. https://offshoreleaks.icij.org/pages/database
3. Count the number of nodes / number of relationships
Application: Panama papers
1. Show the relationships of the President of Azerbaijan (Ilham Aliyev) and his children
2. https://panamapapers.icij.org/20160404-azerbaijan-hidden-wealth.html
3. Search for all the officers named “Aliyev”
Application: Panama papers
1. Show all the companies (entities) related to them
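The three exercises (count nodes and relationships, search officers by name, list related entities) can be rehearsed on a toy in-memory graph before running them against the real database. Everything below is hypothetical: the records, the labels and the relationship types are invented placeholders, not taken from the ICIJ data, and in Neo4j these steps would be written as Cypher queries instead.

```python
# Hypothetical toy records; none of this comes from the ICIJ database
nodes = {
    1: {"label": "Officer", "name": "Alice Example"},
    2: {"label": "Officer", "name": "Bob Example"},
    3: {"label": "Officer", "name": "Carol Sample"},
    4: {"label": "Entity", "name": "Northwind Holdings Ltd"},
    5: {"label": "Entity", "name": "Bluewater Trading Inc"},
}
relationships = [  # (source node id, relationship type, target node id)
    (1, "SHAREHOLDER_OF", 4),
    (2, "DIRECTOR_OF", 4),
    (2, "SHAREHOLDER_OF", 5),
    (3, "DIRECTOR_OF", 5),
]

# Step 1: count nodes and relationships
# (in Cypher, counting nodes is typically written: MATCH (n) RETURN count(n))
node_count = len(nodes)
relationship_count = len(relationships)


def find_officers(surname):
    """Step 2: ids of officers whose name contains the given surname."""
    return sorted(node_id for node_id, node in nodes.items()
                  if node["label"] == "Officer"
                  and surname.lower() in node["name"].lower())


def related_entities(officer_ids):
    """Step 3: names of entities linked to any of the given officers."""
    ids = set(officer_ids)
    return sorted({nodes[target]["name"]
                   for source, _, target in relationships
                   if source in ids and nodes[target]["label"] == "Entity"})


officers = find_officers("Example")
print(node_count, relationship_count)
print(officers, related_entities(officers))
```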
Analysis libraries
• Built in many programming languages
• Boost Graph Library (BGL) is probably the oldest and best known. Built in C++ and optimized to be general, fast and efficient.
• SNAP (Stanford Network Analysis Platform), written in C++ and optimized for massive graphs. (Jure Leskovec)
• NetworkX (Python): library for studying graphs and networks. Reasonably efficient for large networks and their visualization https://networkx.github.io
Analysis libraries
• graph-tool (Python): module for manipulation and statistical analysis of graphs. Based heavily on BGL to achieve comparable performance. (Tiago P. Peixoto) https://graph-tool.skewed.de
• igraph (Python, C and R): library written in C that is also available as Python and R packages. It implements most algorithms. http://igraph.org
• networkDynamic (R): to analyze temporal networks
Analysis libraries
• Other platforms for the analysis of massive graphs (distributed)
• Giraph (Apache): graph processing with high scalability. Used by Facebook, compatible with Hadoop. http://giraph.apache.org
• Pregel (commercial): Google’s graph platform
• GraphLab (commercial): graph-based, high-performance, distributed computational framework (including Machine Learning Toolkits) https://graphlab.org
• GraphX (Apache): distributed graph processing framework on top of Apache Spark. Has many powerful algorithms for graph analysis. http://spark.apache.org/graphx/
Visualization libraries
• Most of the analysis libraries contain visualization tools or modules to visualize graphs.
• Apart from those, there are other tools specialized in the visualization of graphs
• Gephi is probably the best known: it is interactive visualization software (includes some analysis metrics). Works on Windows, Linux and Mac OS X. It is the “Photoshop” for graphs ☺ http://gephi.org
• Pajek is a Windows program to visualize and analyze big graphs. http://vlado.fmf.uni-lj.si/pub/networks/pajek/
• Linkurious: graph visualization on top of Neo4j http://linkurio.us
Visualization libraries
• Graphviz: open-source library to visualize graph data http://www.graphviz.org
• Sigma.js is a JavaScript library to visualize graphs on the web. http://sigmajs.org
• Vis.js is a general JavaScript visualization library that also has tools to visualize graphs. http://visjs.org/
• lightning-viz.org provides API-based access to reproducible web visualizations
• D3.js also has some graph visualization tools. Examples:
• http://christophergandrud.github.io/networkD3/
• http://bl.ocks.org/mbostock/4062045
• https://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
Gephi
• Allows you to modify and customize the visualization of graphs interactively
• It has many layout algorithms
• Contains some graph metrics:
• Centrality
• PageRank
• Connected components
• Etc.
• Allows you to import/export graphs in many different formats.
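Two of the metrics listed above, connected components and PageRank, can be sketched in a few lines of plain Python. This is an illustrative implementation on a made-up undirected graph and says nothing about how Gephi computes them; the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, and nodes without neighbors are not handled.

```python
def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adj[u] - component)
        seen |= component
        components.append(component)
    return components


def pagerank(adj, damping=0.85, iterations=100):
    """Power iteration; assumes every node has at least one neighbor."""
    n = len(adj)
    rank = {u: 1 / n for u in adj}
    for _ in range(iterations):
        new_rank = {u: (1 - damping) / n for u in adj}
        for u, nbrs in adj.items():
            for v in nbrs:
                new_rank[v] += damping * rank[u] / len(nbrs)
        rank = new_rank
    return rank


# Hypothetical undirected graph with two components
toy_graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"e"},
    "e": {"d"},
}

components = connected_components(toy_graph)
ranks = pagerank(toy_graph)
print([sorted(c) for c in components])
print(max(ranks, key=ranks.get))
```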
Some references
• About graph databases
• Wikipedia: http://en.wikipedia.org/wiki/Graph_database
• Book: Graph Databases (O’Reilly) http://graphdatabases.com
• Graph database ranking: http://db-engines.com/en/ranking/graph+dbms
• About Neo4j
• Learning Neo4j (book): http://neo4j.com/book-learning-neo4j/
• GraphAcademy (by Neo4j): http://neo4j.com/graphacademy/ has some online courses
• About analysis libraries
• igraph: Statistical Analysis of Network Data with R (book) http://www.amazon.com/Statistical-Analysis-Network-Data-Use/dp/1493909827/
• GraphX: A gentle introduction to GraphX in Spark http://www.sparktutorials.net/analyzing-flight-data:-a-gentle-introduction-to-graphx-in-spark
Some references
• About visualization
• Gephi:
• Learn how to use Gephi https://gephi.org/users/
• Introduction to Network Analysis and Visualization http://www.martingrandjean.ch/gephi-introduction/
Simple examples
igraph
References
• Online material
▪ The igraph book (incomplete)
▪ igraph wikidot
▪ Simple manual in Spanish
• Books
▪ Statistical Analysis of Network Data with R
NetworkX
networkDynamic
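For readers who want the core idea of a temporal network without the R package, the sketch below stores each edge with an activity interval and extracts the network at a time point or over a window. It is only loosely modeled on the idea of edge activity spells with an onset and a terminus; the call records, the half-open interval convention and the function names are assumptions for illustration, not the networkDynamic API.

```python
# Hypothetical call records: (caller, callee, onset, terminus)
# Each edge is treated as active on the half-open interval [onset, terminus)
spells = [
    ("ana", "bob", 0, 5),
    ("bob", "cat", 3, 8),
    ("ana", "cat", 7, 10),
    ("ana", "bob", 9, 12),
]


def edges_active_at(t):
    """Edges with a spell covering time point t."""
    return sorted({(u, v) for u, v, onset, terminus in spells
                   if onset <= t < terminus})


def edges_active_during(start, end):
    """Edges with any spell overlapping the window [start, end)."""
    return sorted({(u, v) for u, v, onset, terminus in spells
                   if onset < end and terminus > start})


def edge_durations():
    """Total active time per edge, summed over all of its spells."""
    totals = {}
    for u, v, onset, terminus in spells:
        totals[u, v] = totals.get((u, v), 0) + (terminus - onset)
    return totals


print(edges_active_at(4))
print(edges_active_during(5, 9))
print(edge_durations())
```

Aggregating over wider windows turns the same records into progressively denser static snapshots, which is one simple way to move between time scales.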
References
About temporal networks
▪ Holme, P., & Saramaki, J. (2012). Temporal networks. Physics Reports, 519(3), 97–125.
▪ Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6
About the networkDynamic, tsna and ndtv packages
▪ Package examples for networkDynamic https://cran.r-project.org/web/packages/networkDynamic/vignettes/networkDynamic.pdf
▪ Package Vignette for ndtv https://cran.r-project.org/web/packages/ndtv/vignettes/ndtv.pdf
▪ Package Vignette for tsna https://cran.r-project.org/web/packages/tsna/vignettes/tsna_vignette.html
Tutorials
▪ Temporal network tools in statnet: networkDynamic, ndtv and tsna