Making an impact with data science
Jordan Engbers, PhD
Chief Scientist, Desid Labs Inc.
CTO, Systolik Inc.
Outline
Who am I?
What is data science?
Making data products
Where do you go from X?
Are you doing good?
The Goal
To have a discussion around how to create meaningful impact with data science
Who am I?
How did I get here?
2004
Bioinformatics
Multidisciplinary Program
- Computer Science
- Biomedical Science
2008
Neuroscience
Just starting … no bitterness yet
2013
Clinical Data Science
Big Data
Data Science
Data Analytics
Predictive Analytics
- Data management for clinical researchers
- International clinical trials
- Software development
- Data science with clinical registries and administrative health data (THIN)
2015
Desid Labs Inc.
Data Science consulting company offering end-to-end data science services
Science-as-a-Service
desidlabs.com
2015/16
Desid Labs Inc.
Systolik Inc.
Taking Apps to Heart
Cardiovascular Information Systems
Focus on Analytics within Cardiovascular Care
systolik.com
My random walk
music
ministry
bioinformatics
neuroscience
clinical data science
entrepreneur
web programming
humanities
development
biology
informatics
business
healthcare
computation
machine learning
big data
Take Away
There is no set path to becoming a data scientist
Focus on:
Developing a scientific mindset
Strengthening your “metaskills”
Exploring many disciplines
Should you listen to me?
I am not speaking as an authority
I am here to share what I have learned and to help move people forward in data science
So:
- Don’t take what I say at face value
- Test for yourself
- Challenge what you hear
- Come up with new and better ideas
What is Data Science?
http://higheredublog.com/data-science-as-a-masters-a-brief-overview/
[Figure: Venn diagram of Computer Science, Math + Statistics, and Domain Expertise. Labels: software, research, machine learning, data scientist (unicorn), science]
http://www.kdnuggets.com/2015/02/history-data-science-infographic.html
What is Data Science?
Wikipedia says that:
“...interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics…”
“...Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high performance computing.”
What is a Data Scientist?
“Data scientists use their data and analytical ability to
find and interpret rich data sources;
manage large amounts of data despite hardware, software, and bandwidth constraints;
merge data sources;
ensure consistency of datasets;
create visualizations to aid in understanding data;
build mathematical models using the data; and
present and communicate the data insights/findings.”
Is data science just a set of methodologies?
The purpose of a scientific discipline
Do the following descriptions make sense?
- Astronomy is the field of science that uses telescopes
- Chemistry is about mixing chemicals and torturing undergrads
- Statistics uses maths
Nope.
- Astronomy is the study of celestial objects and processes that allows us to understand the universe
- Chemistry examines the composition, structure, properties and change of matter to help us understand the physical world
- Statistics allows us to use data more effectively by studying the collection, analysis, interpretation, and organization of data….
Methods are invented to serve the field, not as a purpose in themselves.
Is data science just statistics “rebranded”?
"Data scientist is just a sexed up word for statistician." - Nate Silver
“Statistical modelling - two cultures” - Leo Breiman
“50 Years of Data Science” - David Donoho
Summary: data science is just an expanded form of statistics
But see:
“What ‘50 years of data science’ leaves out” - Sean Owen, Cloudera
What is the purpose of data science?
Data Science is about decisions
We democratize data access to empower all employees to make data-informed decisions, give everybody the ability to use experiments to correctly measure the impact of their decisions, and turn insights on user preferences into data products that improve the experience of using Airbnb
- Scaling Knowledge at Airbnb
That is more than statistics:
- Need to understand business processes
- Requires data engineering approaches to provide the environment
- Requires software engineering to create platforms to measure the impact and develop the data products
Data science is the scientific discipline focused on determining how data can drive better decisions across a wide set of domains
Scientific discipline - not just data analysis, but science
“...determining how data...” - methodologies, statistics, computer science
“...can drive better decisions…” - domain knowledge, science, engineering, social sciences...
How does a focus on decisions change our approach?
1) Takes the focus away from specific methodologies (we do deep learning too!) to using the appropriate methodologies to achieve a larger overarching goal - better decisions
   a) Side effect is we get to use a larger array of disciplines
      i) Systems theory
      ii) Psychology
2) Focus on making good data products that change decisions
   a) Focusing on data products takes us away from “scripts” and towards an engineered approach to data product manufacturing
Data science is not rebranded statistics.
Data science is a multidisciplinary field that seeks to understand how data can be used to improve decision making.
Statistics is just a part of the approach.
Making Data Products
What is a data product?
[Diagram labels: World, Experience, learning (data, information, knowledge, wisdom), data product, Decision, Desired Outcome, and six Other Outcomes]
Data products are the mechanism by which data science creates impact
Scientific Method
Framework for finding value in data
Data is a raw resource. Converting data to a data product requires experimentation, exploration and learning. This is the domain of science.
Agile Development
Process for creation in the face of uncertainty
Agile processes allow software teams to meet changing requirements, but stay on track and create effective products.
Engineered Products
Practices for ensuring high quality products
It is one thing to make an R script to analyse a dataset. It is another to have a resilient, auditable, scalable data product.
Desid Labs Approach
“Data science - more than just R scripts” - unofficial Desid Labs motto
Levels of data products
Reporting
Dashboards
Prediction
AI (Autonomous)
Intelligent Decision-making Support Systems
Other dimensions
Complexity of UI
Complexity, size, and speed of data, information, and knowledge (3V’s)
This branches into the field of AI and decision making
Start with Herbert Simon
Learning from the other doctors (MD, not PhD)
Clinical Decision Rules (Dr. Ian Stiell)
1) Derivation
2) Validation
   a) Cross-validation (should be standard practice!)
   b) Prospective validation - this is the real experiment
3) Implementation
4) Studying barriers to adoption
These steps help determine the validity of your data product
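The cross-validation step above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, and `fit_threshold` is a hypothetical one-feature rule standing in for whatever model is being derived. It shows the mechanics of holding out each fold in turn, not any particular clinical decision rule.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and split them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Hypothetical rule: predict 1 when x exceeds the midpoint of the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def cross_validate(xs, ys, k=5, seed=0):
    """Return the held-out accuracy of the rule for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training rows only, then score on the rows left out.
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Synthetic data (an assumption): class 1 tends to have larger x than class 0.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

scores = cross_validate(xs, ys, k=5)
print("fold accuracies:", [round(s, 2) for s in scores])
print("mean accuracy:", round(mean(scores), 2))
```

Cross-validation only reuses data you already have. Prospective validation still needs new data collected after the rule is fixed.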
More than just R scripts
“It’s one thing to create an excellent fraud detection model in R, and quite another to build:
● Fault-tolerant ingest of live data at scale that could represent fraudulent actions
● Real-time computation of features based on the data stream
● Serialization, versioning and management of a fraud detection model
● Real-time prediction of fraud based on computed features at scale
● Learning over all historical data
● Incremental update of the production model in near-real-time
● Monitoring, testing, productionization of all of the above”
- Sean Owen, Cloudera
These are the sorts of things to think about when it comes to implementing your data product
Where do you go from X?
Activities
- Intelligence Gathering: What is the question?; Where is the data?; What is the data?
- Data Preparation: Get the data; Store the data; Transform the data; Load the data
- Modeling: Feature engineering; Preprocessing; Machine learning algorithms; Validation (Phase I - Cross Validation)
- Design: Visualization; Reducing feature set; Creating a plan for integrating
- Production: Movement to production stack; Versioning and management; Monitoring, testing, deployment
Sources of experience shown against these activities: Coursera, Kaggle, Hackathon, Research & Open Data, Data Science Job
Skills & Knowledge
- Domain Knowledge
- Data munging; Distributed computing; Storage; Sampling; Digital signal processing; Handling missing data; Filtering; Databases
- Machine Learning; Algorithmic Complexity; GPU optimization; Programming; Statistics; Probabilities
- Web development; Psychology; UI/UX; Software engineering
- Devops; Testing; Debugging; Enterprise languages; Cloud computing
Learn by doing
1) Figure out where you are in the spectrum
2) Determine what experience you need to expand in either direction
3) Find projects that will give you that experience
   a) Online competitions
   b) Hackathons
   c) Freelance work
   d) Your own projects
   e) Data journalism
   f) Data for Good (!)
Post production
Treat your data product as a hypothesis about the world
● Collect prospective data on its use
● Perform cohort analyses on people who make decisions based on the data
● Consider A/B testing
● Consider canary testing
● Set a point where you will analyze the data (X people, X amount of time)
● Answer the question - did it make a difference?
● Did it make the right difference?
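One way to ask "did it make a difference?" after an A/B test is a two-proportion z-test. The sketch below is a minimal version under assumptions: the counts are invented for illustration, the outcome is a simple yes/no per decision, and the normal approximation is taken to be adequate for the sample sizes.

```python
from math import sqrt, erfc

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (difference in proportions, z statistic, p-value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: group A decided without the data product, group B with it.
diff, z, p = two_proportion_z_test(success_a=120, n_a=1000, success_b=150, n_b=1000)
print(f"difference={diff:.3f} z={z:.2f} p={p:.3f}")
```

Fix the sample size or stopping point before looking at the result, as the list above suggests. A small p-value says the difference is unlikely to be chance. It does not say whether it was the right difference.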
Are you doing good?
“...science and technology have been unable to keep pace with the second-order effects caused by their first-order victories.”
- Gerald Weinberg
How do we know that our data products are having the desired effect?
Data is cleaned, features determined, model created (AUC: 0.88!), implementation tested, UI designed, UX tested, integrated into production system, monitored.
Everything is done
Pat on the back - walk away
Next month’s headline:
What happened?
- An algorithm is only as good as its data
- An algorithm learns from the data - data is a representation of the real world including its flaws
- The real world is complex and there can be non-linear effects
Obviously Data for Evil (Commission)
Predatory advertising
Surveillance of dissidents, activists
Identity theft
Social Engineering
Gray areas
Web lining
Databases in elections to determine wedge issues
Surveillance for security reasons
Targeted advertising
Data for Good … right? (Omission)
Model to determine who will respond best to social assistance
- What if the data is from an area with strong historical racism?
- (Don’t use variables/features that could impart racial bias)
Automatic tagging of photos
- What are the consequences of the algorithm being wrong?
- (Need to balance sensitivity and specificity)
Apps to help first responders (geolocation)
- Will providing a service to some people limit access based on arbitrary technology choices?
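The sensitivity/specificity balance mentioned for photo tagging can be made concrete with a small sketch. The scores and labels below are made up for illustration. The point is only that moving the decision threshold trades one kind of error for the other.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Compute sensitivity and specificity when predicting 1 for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    sensitivity = tp / (tp + fn)  # share of true positives that were caught
    specificity = tn / (tn + fp)  # share of true negatives that were left alone
    return sensitivity, specificity

# Hypothetical model scores and true labels (1 = tag applies, 0 = it does not).
scores = [0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Which threshold is acceptable depends on the consequences of each kind of mistake. That is a question about the domain, not about the model.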
How Big Data Enables Economic Harm to Consumers, Especially to Low-Income and Other Vulnerable Sectors of the Population
Algorithms aren’t biased - but data is
Historical data encompasses our societal biases
Algorithms learn from that data and inherit these biases
https://www.fordfoundation.org/ideas/equals-change-blog/posts/can-computers-be-racist-big-data-inequality-and-discrimination/
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899
https://www.propublica.org/article/when-big-data-becomes-bad-data
https://theconversation.com/big-data-algorithms-can-discriminate-and-its-not-clear-what-to-do-about-it-45849
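A first, purely descriptive check for inherited bias is to compare how often a model selects people from different groups. The sketch below assumes hypothetical group labels and decisions. The ratio it reports is a screening signal that something deserves a closer look, not proof of discrimination or of its absence.

```python
from collections import defaultdict

def selection_rates(groups, decisions):
    """Return the share of positive decisions for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [selected, total]
    for group, decision in zip(groups, decisions):
        counts[group][0] += int(decision)
        counts[group][1] += 1
    return {group: selected / total for group, (selected, total) in counts.items()}

def disparity_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group received a positive decision")
    return min(rates.values()) / highest

# Hypothetical model output: group A is selected 8 times in 10, group B 4 in 10.
groups = ["A"] * 10 + ["B"] * 10
decisions = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6

rates = selection_rates(groups, decisions)
print("selection rates:", rates)
print("disparity ratio:", disparity_ratio(rates))
```

Dropping the group variable from the model does not make this check unnecessary, because other features can carry the same information.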
So what do we do?
Possibilities:
● Strengthen User Control of Personal Data
● Enforce Structural Changes in Market to Increase Competition
● Directly Regulate Big Data Platforms to Prohibit Harmful Practices
● Investing in the technical capacity of public interest lawyers, and developing a greater cohort of public interest technologists
● Pressing for “algorithmic transparency.”
● Exploring effective regulation of personal data
● Ethical code of conduct for data science
These are strategic suggestions - they suggest the what, but not the how
We need a solution that keeps pace with the tech
1) Systematic scientific process should be applied
   Equivalent of peer review
2) Agile development and testing
   Ensure models are implemented correctly
3) Systems modeling
   Understand the second-order effects of the system
4) Monitoring
   Validation of our model in the world
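As a rough illustration of the monitoring step, the sketch below tracks accuracy over a sliding window of recent predictions and flags when it drops below an agreed baseline. The class name, window size, baseline, and tolerance are all hypothetical choices made for this example. A real deployment would pick them from its own validation results.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window, baseline, tolerance):
        self.outcomes = deque(maxlen=window)  # True where prediction matched reality
        self.baseline = baseline              # accuracy seen during validation
        self.tolerance = tolerance            # how much of a drop we accept

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True once the window is full and accuracy is below baseline - tolerance."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.baseline - self.tolerance

monitor = RollingAccuracyMonitor(window=5, baseline=0.9, tolerance=0.1)

# Early in production the model agrees with what actually happened.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
print("degraded after good period:", monitor.degraded())

# Later the world shifts and the predictions start to miss.
for predicted, actual in [(1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)
print("accuracy:", monitor.accuracy(), "degraded:", monitor.degraded())
```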
Conclusions
Data science is about decisions.
The creation of data products involves many disciplines
Determine where you are at, then expand your skills
Approach data science with care and thought - it is easier to hurt than help
If you are interested in specifics about methodologies, sign up for the Desid Labs newsletter:
desidlabs.com