Post on 15-Jan-2016
Toolbox of a data scientist: multiple approaches to work with behavioural data
Philippe J. Giabbanelli, PhD
Data Insight Meetup, February 5th 2015
Outline
1 – What’s data science?
2 – What questions can we ask of behavioural data?
3 – How do we use data science tools to get answers?
Food behaviours Drinking behaviours Insurgencies
What’s data science? Visualization, data mining, simulation and modelling
Imagine that people have completed some kind of questionnaire. Typically you get an Excel spreadsheet. And you’d like to understand what relates to the target behaviour.
Tableau
Imagine that you have a very complex system, where tons of variables interact… You may want to look at it as a network.
Gephi
What if you have a lot of text instead?
Here I am primarily concerned with visualization as seen from a data scientist’s viewpoint. I would use…
Tool                                  Data
Tableau, Qlik, Spotfire               Relational (spreadsheet)
Gephi or Visone                       Network
Datawatch                             Streaming relational
Many-eyes                             A bit of everything
GeoTime                               Spatial data over time
Jigsaw, CZSaw, InSpire, Leximancer    Text
Viz as data scientist ≠ Making pretty pictures
If you’re producing a visual for an audience, you show what you found. When you start with viz as a data scientist, you want to find something!
Visual Capitalist
Abusing the tool
If you watch CSI, you’ll see that when they search for a fingerprint match, the software shows all fingerprints it has!
Wasting computer resources for useless displays
Proper statistical testing
If it looks like your data is normally distributed, that must be it, right?
Relying on visuals instead of doing proper statistics
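As one way to back up a visual impression with a test, here is a minimal sketch of the Jarque-Bera normality test using only the Python standard library. The two samples are hypothetical and built from evenly spaced quantiles so the example is deterministic; the test itself is my choice for illustration, not one named in the talk, and its chi-square approximation is only reliable for reasonably large samples.

```python
import math
from statistics import NormalDist, fmean


def jarque_bera(values):
    """Return (JB statistic, p-value) for the null hypothesis of normality."""
    n = len(values)
    mean = fmean(values)
    # Central moments of order 2, 3 and 4 (population versions).
    m2 = fmean((v - mean) ** 2 for v in values)
    m3 = fmean((v - mean) ** 3 for v in values)
    m4 = fmean((v - mean) ** 4 for v in values)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under the null, JB is asymptotically chi-square with 2 degrees of
    # freedom, whose survival function is exactly exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical data: evenly spaced quantiles of two known distributions.
n = 200
probs = [(i + 0.5) / n for i in range(n)]
bell_shaped = [NormalDist().inv_cdf(p) for p in probs]
right_skewed = [-math.log(1.0 - p) for p in probs]  # exponential quantiles

for name, sample in [("bell-shaped", bell_shaped), ("right-skewed", right_skewed)]:
    jb, p = jarque_bera(sample)
    verdict = "reject normality" if p < 0.05 else "no evidence against normality"
    print(f"{name}: JB={jb:.2f}, p={p:.4f} -> {verdict}")
```

A histogram of either sample can look "roughly bell-shaped" at a glance; the test gives a number you can report alongside the picture.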
Abusing the tool
When all you have is a hammer, everything starts looking like a nail.
What’s data science? Visualization, data mining, simulation and modelling
Imagine that you’re working for CSI (again!) and you want to identify the dude in the picture.
When you know what you’re after, and it can be mathematically expressed, data mining helps.
[Figure: decision tree separating binge drinkers from non-binge drinkers. Nodes A and D split on rules (A: ≥ often vs. < often; D: ≥ very often vs. < very often); nodes B and C split on communication (thresholds: daily, weekly). Axes: rules (often, very often) and communication (never, weekly, daily).]
Suggested tools: RapidMiner, Weka
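To show how a tool can learn threshold rules like those in the decision tree on this slide, here is a minimal sketch of a greedy tree learner in plain Python. The survey rows, the numeric coding of "rules" (0 to 3) and "communication" (0 to 2), and the labels are all hypothetical; real tools such as Weka use more refined split criteria than the raw error count used here.

```python
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Rows misclassified if we predict the majority label for all of them."""
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_split(rows, labels):
    """Return (errors, feature, threshold) for the best split, or None."""
    best = None
    for feature in rows[0]:
        # Skip the smallest value so both sides of the split are non-empty.
        for threshold in sorted({row[feature] for row in rows})[1:]:
            low = [l for r, l in zip(rows, labels) if r[feature] < threshold]
            high = [l for r, l in zip(rows, labels) if r[feature] >= threshold]
            score = errors(low) + errors(high)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def grow(rows, labels, depth=2):
    """A leaf is a label; a node is (feature, threshold, low_branch, high_branch)."""
    if depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    split = best_split(rows, labels)
    if split is None:
        return majority(labels)
    _, feature, threshold = split
    low = [(r, l) for r, l in zip(rows, labels) if r[feature] < threshold]
    high = [(r, l) for r, l in zip(rows, labels) if r[feature] >= threshold]
    return (feature, threshold,
            grow([r for r, _ in low], [l for _, l in low], depth - 1),
            grow([r for r, _ in high], [l for _, l in high], depth - 1))


def predict(tree, row):
    """Follow the threshold tests down to a leaf label."""
    while isinstance(tree, tuple):
        feature, threshold, low, high = tree
        tree = high if row[feature] >= threshold else low
    return tree


# Hypothetical questionnaire answers: (rules, communication, label).
data = [(0, 0, "binge"), (1, 0, "binge"), (0, 1, "binge"), (1, 1, "binge"),
        (0, 2, "non-binge"), (1, 2, "non-binge"), (2, 0, "non-binge"),
        (2, 1, "non-binge"), (3, 0, "non-binge"), (3, 2, "non-binge"),
        (2, 2, "non-binge"), (3, 1, "non-binge")]
rows = [{"rules": r, "communication": c} for r, c, _ in data]
labels = [label for _, _, label in data]

tree = grow(rows, labels)
print(tree)
```

The learned tree is a set of nested "if ≥ threshold" tests, which is why the result can be read and challenged by a domain expert.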
Data mining involves automatically testing lots of hypotheses by searching for combinations of
variables that might show a correlation.
Which variables are in the winning combination? You partly do data mining to answer this question…
A. Wood, Data Manager
« For every variable that you seek to collect, provide a detailed rationale. »
V. Lo, Ethics Board
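The idea of searching combinations of variables for a correlation can be sketched as follows. The variable names and values are hypothetical, and the Bonferroni-adjusted threshold at the end is a common precaution I am adding for illustration: the more pairs you test, the more likely one looks correlated by chance.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical questionnaire columns, one value per respondent.
survey = {
    "drinks_per_week": [0, 2, 5, 7, 9, 12, 3, 8],
    "peer_drinking": [0, 1, 2, 3, 4, 5, 1, 4],
    "hours_of_sleep": [8, 7, 7, 6, 8, 6, 7, 7],
    "age": [19, 22, 20, 25, 21, 23, 24, 20],
}

# Test every pair of variables and rank pairs by absolute correlation.
ranked = sorted(
    ((abs(pearson(survey[a], survey[b])), a, b)
     for a, b in combinations(survey, 2)),
    reverse=True,
)
n_tests = len(ranked)
adjusted_alpha = 0.05 / n_tests  # Bonferroni: stricter threshold per test

for strength, a, b in ranked:
    print(f"{a} ~ {b}: |r| = {strength:.2f}")
print(f"{n_tests} hypotheses tested; per-test alpha = {adjusted_alpha:.4f}")
```

Every variable added to the questionnaire multiplies the number of pairs to test, which is one reason to justify each variable before collecting it.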
What’s data science? Visualization, data mining, simulation and modelling
I offered coupons to some customers. Would they spend more? Who should I target?
I raised prices of fast foods. Would it curb obesity? Who would benefit the most?
I put people on antiretroviral therapy when they don’t have AIDS. Would it help? For whom?
There are lots of big questions for which you don’t necessarily have all the data. Also, methods that help you understand what happened may not be helpful to know what may happen if…
Imagine that you want to change the urban environment to see if it helps people exercise more.
You hopefully won’t be doing that.
Rather you might want to create a virtual environment that simplifies reality so you
can test your hypothesis safely.
There are lots of ways to do modelling, depending on desired
spatial & individual resolution.
The most common approaches are agent-based modelling and
system dynamics.
Tool              Approach
AnyLogic          ABM / SD
NetLogo           ABM
Vensim, iThink    SD
Also: The emergence of Computational Sociology (J. of Math. Soc., ’95); Why model? (JASSS ’08)
Data science as a technique: visualization; modelling & simulation; data mining & machine learning
Applications: defense; health (chronic diseases, infectious diseases)
Why?
Tell me what people will do in the future!
Applications of Data Science
How would climate change policies impact the health of Canadians by 2030?
Inputs (simulated data for 2030): dietary patterns, built environment, socio-economics
Systems model
Outputs (expected health impacts): physical health, well-being
There are many reasons other than prediction to do data science.
Explaining
To simulate far into the future, you need to understand what you have now and how it changes.
2014 2024 2044
1 - Explain 2 - Predict
“Electrostatics explains lightning, but we cannot predict when or where the next bolt will strike.”
“Plate tectonics explains earthquakes, but does not permit us to predict the time and place of their occurrence.”
Schelling’s model of segregation
A preference that one's neighbors be of the same color, or even a preference for a mixture "up to some
limit", could lead to total segregation.
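A minimal sketch of a Schelling-style simulation may make the mechanism concrete. Everything specific here is an assumption for illustration: a 20×20 wrap-around grid, 10% empty cells, two groups "A" and "B", a mild preference (`wanted=0.3`, i.e. agents are content with 30% similar neighbours), and unhappy agents moving to a random empty cell.

```python
import random


def neighbours(pos, size):
    """The eight cells around pos on a grid that wraps at the edges."""
    x, y = pos
    return [((x + dx) % size, (y + dy) % size)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def same_fraction(grid, pos, size):
    """Share of occupied neighbouring cells holding the same group as pos."""
    occupied = [grid[n] for n in neighbours(pos, size) if grid[n] is not None]
    if not occupied:
        return 1.0
    return sum(g == grid[pos] for g in occupied) / len(occupied)


def mean_similarity(grid, size):
    """Average share of same-group neighbours over all agents."""
    agents = [p for p, g in grid.items() if g is not None]
    return sum(same_fraction(grid, p, size) for p in agents) / len(agents)


def simulate(size=20, empty_share=0.1, wanted=0.3, max_rounds=100, seed=1):
    rng = random.Random(seed)
    cells = [(x, y) for x in range(size) for y in range(size)]
    n_empty = int(len(cells) * empty_share)
    n_each = (len(cells) - n_empty) // 2
    groups = ["A"] * n_each + ["B"] * n_each + [None] * (len(cells) - 2 * n_each)
    rng.shuffle(groups)
    grid = dict(zip(cells, groups))
    before = mean_similarity(grid, size)
    for _ in range(max_rounds):
        unhappy = [p for p, g in grid.items()
                   if g is not None and same_fraction(grid, p, size) < wanted]
        if not unhappy:
            break
        for pos in unhappy:
            # Move the agent to a randomly chosen empty cell.
            empties = [p for p, g in grid.items() if g is None]
            target = rng.choice(empties)
            grid[target], grid[pos] = grid[pos], None
    return before, mean_similarity(grid, size), grid


before, after, grid = simulate()
print(f"mean share of same-group neighbours: {before:.2f} -> {after:.2f}")
```

Comparing the two printed numbers for different values of `wanted` is a quick way to explore how far mild individual preferences can push aggregate clustering.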
There are many reasons other than prediction to do data science.
What are the core dynamics in my problem?
Where are the gaps? Where do I need to collect data?
What would happen if?
How can we best do monitoring and surveillance?
Illuminate core dynamics
“There is increasing evidence that social influence and social network structures are significant factors in obesity.”
Eating Exercising
To what extent could social influences account for the dynamics of obesity?
Let’s tackle the question using modelling & simulation.
Motivating question: to what extent is this model supported by interviewees?
Let’s tackle this question using interactive visualizations.
We measured the strength of a relationship between two factors as the number of responses in the interviews that used words relevant to both factors.
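The measure described above can be sketched as a co-occurrence count. The factor vocabularies and interview responses below are hypothetical, and the tokenisation (lower-casing and splitting on spaces) is deliberately naive; the actual study may have matched words differently.

```python
from itertools import combinations

# Hypothetical vocabularies: words considered relevant to each factor.
factor_words = {
    "eating": {"eat", "food", "meal", "snack"},
    "exercising": {"exercise", "gym", "walk", "sport"},
    "social influence": {"friends", "family", "peers"},
}

# Hypothetical interview responses.
responses = [
    "My friends and I eat a snack after the gym",
    "I walk to work when my family is away",
    "Food is expensive",
    "We eat out with friends",
]


def factors_in(response):
    """Factors for which the response uses at least one relevant word."""
    words = set(response.lower().split())
    return {f for f, vocab in factor_words.items() if words & vocab}


# Strength of a relationship = number of responses mentioning both factors.
strength = {pair: 0 for pair in combinations(sorted(factor_words), 2)}
for response in responses:
    for pair in combinations(sorted(factors_in(response)), 2):
        strength[pair] += 1

for (a, b), count in strength.items():
    print(f"{a} <-> {b}: {count}")
```

The resulting counts can be used as edge weights when drawing the factors as a network.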
Explaining
Process
You select peers with whom to drink…
…and then, their drinking habits influence yours.
Structure
Can we explain why people engage in binge drinking? Let’s start with modelling and simulation, and make some hypotheses.
If we assume:
• that individuals select similar peers
• that individuals are prompted to drink if at least a fraction of their peers drink
• that one’s context, known from drinking motives, may deter or promote drinking
Then we can correctly infer the behaviour of half of the binge drinkers and 4 out of 5 non-binge drinkers.
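The three assumptions can be turned into a small simulation. The sketch below is one possible reading, not the published model: people, motive scores, context values, the number of peers (`k=2`) and the base threshold (0.5) are all hypothetical. Peers are the most similar others, a person drinks when the share of drinking peers reaches a threshold, and context shifts that threshold.

```python
def select_peers(people, k=2):
    """Assumption 1: each person picks the k others with the closest motive score."""
    peers = {}
    for name, info in people.items():
        others = sorted(
            (o for o in people if o != name),
            key=lambda o: (abs(people[o]["motive"] - info["motive"]), o),
        )
        peers[name] = others[:k]
    return peers


def step(people, peers, drinking, base_threshold=0.5):
    """Assumptions 2 and 3: drink if enough peers drink, given one's context."""
    updated = {}
    for name, info in people.items():
        share = sum(drinking[p] for p in peers[name]) / len(peers[name])
        # A promoting context (positive) lowers the threshold; a deterring one raises it.
        threshold = base_threshold - info["context"]
        updated[name] = share >= threshold
    return updated


def run(people, drinking, k=2, max_steps=20):
    """Repeat the influence step until nobody changes behaviour."""
    peers = select_peers(people, k)
    for _ in range(max_steps):
        updated = step(people, peers, drinking)
        if updated == drinking:
            break
        drinking = updated
    return drinking


# Hypothetical population: motive score (0 to 9) and context (-1 to 1).
people = {
    "Ana": {"motive": 9, "context": 0.0},
    "Ben": {"motive": 8, "context": 0.0},
    "Cat": {"motive": 7, "context": 0.0},
    "Dan": {"motive": 2, "context": 0.0},
    "Eve": {"motive": 1, "context": -0.6},  # deterring context
    "Fay": {"motive": 0, "context": 0.5},   # promoting context
}
initial = {"Ana": True, "Ben": True, "Cat": False,
           "Dan": False, "Eve": False, "Fay": False}

final = run(people, initial)
print(final)
```

Comparing `final` against observed behaviour, person by person, is how one would obtain accuracy figures of the kind quoted above.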
But without making any assumptions ourselves, if we just used data mining we would get roughly the same accuracy. The computer would build an explanation for us.
March 2011: Emergence → Escalation → Early 2012: Militarisation
Monitoring
The situation might change as you are intervening.
How can you monitor changes and adapt?
Visualizations allow the analyst to interactively explore the data and improve the model.
The model guides the analyst in the exploration of the new data.
There is a lot of potential in the tight coupling of techniques (e.g., modelling / interactive visualizations) but currently you’d have to come up with a technical solution yourself for that.
Data science in the world
Techniques: visualization; modelling & simulation; data mining & machine learning
Applications: defense; health (chronic diseases, infectious diseases)
Challenges
Interdisciplinary: shock of cultures
Getting good quality data
Needing to understand a very wide range of tools
Continuously needing to improve the tools
Challenges – Need new tools
Challenges – Interdisciplinary
In my field, good papers are published in conferences.
In my field, good papers are published in journals.
In my field, we just put data on our website for others.
In my field, we own the data and selectively share it.
Why don’t I just pick a book and learn your whole field?
Why don’t I just watch a couple videos to learn your job?
We need to build mutual trust and accommodate each other in a system that’s unsupportive.
Challenges – Getting good data
There is a lot of data out there. But most is unstructured (text, video…) and hard to deal with.
There are public repositories for data, but much of what they hold is lists of junk, locations, or population-level data split at best by age and gender.
http://ukdataservice.ac.uk
http://data.gouv.fr
http://data.gov
http://adsfree.com
http://kaggle.com
Kaggle
PJ Giabbanelli
Investigator Scientist, University of Cambridge (@Addenbrooke’s)
Get in touch? giabba@sfu.ca
Founder, Vancouver Computational Modelling
• PJ Giabbanelli. Modelling the spatial and social dynamics of insurgency. Security Informatics ’14. (Simulation & Modelling in Defense)
• Pratt, Giabbanelli & Mercier. Detecting unfolding crises with visual analytics and conceptual maps: emerging phenomena and big data. Proc. of IEEE ISI ’13. (Visual Analytics + Simulation & Modelling in Defense)
• Crutzen & Giabbanelli. Using classifiers to identify binge drinkers based on drinking motives. Substance Use & Misuse ’14. (Data mining in Health)
• Giabbanelli et al. Modeling the influence of social networks and environment on energy balance and obesity. Journal of Computational Science ’12. (Simulation & Modelling in Health)