Principal Data Scientist
Booz Allen Hamilton
http://www.boozallen.com/datascience
Kirk Borne @KirkDBorne
The Journey to Data Science Maturity: Sailing the Seven C’s
Get these slides here: http://www.kirkborne.net/ApexSystems2019/
This talk is *not* about this…
The unicorns of the new data world…
https://www.figure-eight.com/figure-eight-2018-data-scientist-report/
http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined/
… but this talk is about this…
https://twitter.com/dez_blanchfield/status/645139875440668672
OUTLINE
• Big Data Preliminaries
• Data Literacy & Ethics
• Data Science
• Data Storytelling
• The “7 C’s” (actually 12)
Source: https://www.expertsystem.com/government-data-mining/
http://www.boozallen.com/datascience @KirkDBorne
Oceans of Data: Sink or Surf!
https://hadoopilluminated.com/hadoop_illuminated/Big_Data.html
We’ve had data and databases for years!
So, why all the fuss right now?
• Data Science and Machine Learning are hot right now due to the enormous interest in the new massive data collections for their potential to enable significant new discoveries and to deliver new value (and opportunities) to organizations.
• Data Science is ready for “prime time” because it engages three technologies that are now sufficiently mature:
1) Massive data collection
2) More powerful computing (and virtual / distributed computing = Cloud)
3) More sophisticated, more powerful ML algorithms.
Defining Big Data
• We collect evidence (data) to answer our questions about the world around us … How? Why? What if?
– We are curious creatures… and that is how we end up in a world of BIG DATA!
• Big Data refers to data collections in which “everything is now being quantified and data-fied” (= full-population samples of everything = The End of Demographics!)
– Examples: Social networks (Twitter, YouTube), search & online histories, web logs, financial and e-commerce transactions, environment & health monitors (wearable devices, EHRs), IoT, Astronomy,…
– Huge quantities of data are now being collected and used everywhere.
Source for graphic: http://hinalockim.blogspot.com/2012/08/6th-week-cognitive-learning.html
Some Quick Definitions:
• Statistics = the practice (and science) of collecting and analyzing numerical data.
• Machine Learning (ML) = mathematical algorithms that learn from experience (= pattern recognition from previous evidence).
• Data Mining = application of ML algorithms to data.
• Artificial Intelligence (AI) = application of ML algorithms to robotics and machines = taking actions based on data ( #bots ).
• Data Science = application of scientific method to discovery from data (including statistics, ML, and more: visual analytics, computer vision, computational modeling, semantics, graphs, network analysis, NLU, data indexing schemes [Google!], …).
• Analytics = the products of machine learning & data science. For example: Health Analytics, Marketing Analytics, Behavior Analytics, Predictive Analytics (predictive models).
“The 2 most important things in Data Science are the Data and the Science!”
An “Easy Button” for Extracting Value from Data through Data Science
• Pattern Discovery (Detection) – D2D: data-to-discovery
• Pattern Recognition – D2D: data-to-decisions
• Pattern Exploration – D2D: data-to-dollars (innovation)
• Pattern Exploitation – D2V: Data-to-Value (action)
– D2A: Data-to-Action (value)
Example: Google! (now Alphabet Inc.)
• Google’s mission statement (since their beginning):
– “To organize the world's information and make it universally accessible & useful.”
– Ka-Ching!
– This was achieved through… Mathematical Algorithms for search! (PageRank)
• Recent advances in “search” (pattern recognition) algorithms:
– Voice-based search and response (Google Assistant, Siri, Cortana, Alexa)
– “Search” for data! (not just keywords and text, but patterns in data)
http://diginomica.com/2016/08/16/thoughtspots-search-for-data-analytics-finds-results-fast/
– Deep Learning used to find those patterns (in video, images, audio, networks,…)
– But… sometimes the algorithm gets it wrong …
“Good judgment comes from experience, and experience comes from bad judgment.”
Actually, you want to get at least two models wrong in gradient descent optimization!
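One way to read the gradient descent remark: a single wrong model only tells you how far off you are, while a second wrong model tells you which direction is better. The sketch below is an illustration under assumptions, not something from the talk: the one-parameter model, the quadratic `loss`, and the step sizes are all made up.

```python
def loss(w):
    """Hypothetical loss: how wrong a one-parameter model is (best at w = 3)."""
    return (w - 3.0) ** 2


def descend(w, steps=200, learning_rate=0.1, h=1e-4):
    """Gradient descent where each slope is estimated from two 'wrong' models."""
    for _ in range(steps):
        # Two imperfect models: the current one and a slightly nudged one.
        slope = (loss(w + h) - loss(w)) / h
        # Move in the direction that reduced the error.
        w -= learning_rate * slope
    return w


w_final = descend(w=0.0)
print(f"loss at start: {loss(0.0):.3f}, loss at end: {loss(w_final):.2e}")
```

Each update compares two imperfect models, so the "bad judgment" is what supplies the direction of improvement.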
Data Literacy
“Data Literacy includes the ability to read, work with, analyze, and argue with data.” (Jordan Morrow, Qlik)
http://www.dataliteracynetwork.org/definitions.html
Source: http://bit.ly/2mEzJsr
Data Literacy in 2 parts: Data Science and Data Ethics
http://www.kirkborne.net/cds151/
1) How to use data correctly
2) How to use data ethically
http://dilbert.com/strip/2000-11-13
http://dilbert.com/strip/2008-05-07
Huge quantities of data are now being used everywhere!
“With great power comes great responsibility.”
– Spider-Man’s uncle (…or… Voltaire)
Source: http://verix.com/6-new-articles-on-harnessing-the-power-of-big-data/
Huge quantities of data are now being used everywhere!
“We were so preoccupied with whether or not we could, we didn't stop to think if we should.” – Dr. Ian Malcolm (Jurassic Park)
Data Ethics Example:
• App purchases
Source: https://www.youtube.com/watch?v=9nazm3_OXac
Data Literacy Matters!
Source: https://lovestats.wordpress.com/dman/
Quote from H.G. Wells (1903; writer) …
“Statistical thinking will one day be as
necessary for efficient citizenship as
the ability to read and write.”
Well, that day is here now!
Statistical & Data Literacy Matter!
Quote from Ronald Coase (economist) …
“If you torture your data long enough,
it will confess to anything.”
Quote from Steven Wright (comedian) …
“42.7% of all statistics are made up
on the spot.”
Quote from somebody (?) …
“It is now beyond any doubt that
cigarettes are the biggest cause of
statistics”
https://www.geckoboard.com/learn/data-literacy/statistical-fallacies/
In our rush to build and validate our predictive models, we are often too quick to overlook our own cognitive biases and other data fallacies, such as:
“Correlation does not imply Causation!”
https://bit.ly/2pPnUSu
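A minimal sketch of how a hidden common cause can produce a strong correlation between two things that do not affect each other. Everything here is an assumption for illustration: the data is synthetic, and the variable names (`temperature`, `ice_cream`, `sunburns`) and coefficients are invented.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


random.seed(0)
# Synthetic data: temperature drives both quantities; neither drives the other.
temperature = [random.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
sunburns = [0.5 * t + random.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream, sunburns)
controlled = pearson(
    residuals(ice_cream, temperature), residuals(sunburns, temperature)
)
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {controlled:.2f}")
```

The raw correlation is strong, yet it largely disappears once the shared driver is removed, which is the pattern to look for before reading a correlation as a cause.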
Feature Selection and Projection
Feature Selection is important in order to disambiguate different classes.
More importantly, Class Discovery depends on choosing the right projection and selecting the right features!
Source: https://www.quora.com/How-was-classification-as-a-learning-machine-developed
Projection Matters
Your chosen data attributes represent a low-dimension projection of the full truth. The feature space (dimensions) in which you explore your data is a form of cognitive bias … it matters!
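A small sketch of why the projection matters, using hypothetical 2-D points invented for illustration. The two classes overlap when viewed along either original axis, but separate cleanly along a derived feature (`y - x`). The helper `best_cut_accuracy` is an assumed, deliberately simple separability score, not a standard library function.

```python
def best_cut_accuracy(values_a, values_b):
    """Best accuracy of any single threshold on a 1-D feature (either orientation)."""
    points = sorted(set(values_a) | set(values_b))
    # Candidate cuts: midpoints between neighbouring values, plus one below all.
    cuts = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    cuts.append(points[0] - 1.0)
    total = len(values_a) + len(values_b)
    best = 0.0
    for t in cuts:
        correct = sum(v > t for v in values_a) + sum(v <= t for v in values_b)
        best = max(best, correct / total, 1 - correct / total)
    return best


# Hypothetical data: both classes span the same x range.
class_a = [(x, x + 1) for x in range(10)]
class_b = [(x, x - 1) for x in range(10)]


def project(points, feature):
    """Reduce each 2-D point to a single number."""
    return [feature(x, y) for x, y in points]


def score(feature):
    return best_cut_accuracy(project(class_a, feature), project(class_b, feature))


acc_x = score(lambda x, y: x)         # view along x only
acc_y = score(lambda x, y: y)         # view along y only
acc_diff = score(lambda x, y: y - x)  # a better-chosen projection

print(f"x alone: {acc_x:.2f}, y alone: {acc_y:.2f}, y - x: {acc_diff:.2f}")
```

The classes are identical in the data throughout; only the chosen view changes whether they can be told apart.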
Source: http://www.transformativeinsights.co.nz/blog/new-perspective-on-conflict
The Data Science of Feature-rich Chocolate Brownies
http://www.datasciencebowl.com/data-science-of-chocolate-brownies/
High-Variety Data enables better (and tastier) analytics models
Variety is the spice of discovery!
Our mission as data scientists: To discover Value in Big Data through the ethical application of Data Science and Machine Learning
Recommended Reading: The Data Science Playbook
https://www.boozallen.com/s/insight/publication/data-science-playbook.html
Source for graphic: https://data-flair.training/blogs/machine-learning-applications/
Predictive Analytics is currently the most significant application of Machine Learning (*)
(*) The set of mathematical algorithms that learn (patterns) from experience (data)
Source for graphic: https://www.altexsoft.com/blog/datascience/machine-learning-strategy-7-steps/
Predictive Analytics is everywhere in Business Data and Machine Learning (AI) Strategy Discussions
Typical Machine Learning Applications in Our Lives:
• PREDICT: Your Purchase Preferences, Recommender Systems, Credit Scoring, Smart Phone auto-complete …
• OPTIMIZE: Your Thermostat, Your Commute Time and Routing, Personalized Learning …
• DISCOVER: Your Health Issues (wearables), Your Best Deal (Bed & Breakfast or Restaurant) …
• DETECT: Your Social Sentiment, Identity Theft, Credit Card Fraud …
Typical Machine Learning Applications in the Enterprise:
• PREDICT: Predict outcomes, events, prices, costs, risks, product demand …
• OPTIMIZE: Optimize processes, products, and people (delivery of services, supplies, personnel) …
• DISCOVER: Discover insights in social media, documents, quarterly business reports, customer call records...
• DETECT: Detect fraud, anomalies in safety events, behaviors, outbreaks, data usage (GDPR), systems (cybersecurity breaches) …
4 Types of Insights Discovery from Data:
(Graphic by S. G. Djorgovski, Caltech)
1) Class Discovery: Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: (insights discovery) – Find trends, patterns, dependencies in data that reveal the governing principles or behavioral patterns (the object’s “DNA”).
3) Outlier / Anomaly / Novelty / Surprise Discovery: Find the new, surprising, unexpected one-in-a-[million / billion / trillion] object, event, or behavior.
4) Association (or Link) Discovery: (Graph and Network Analytics) – Find both the usual and the unusual (interesting) data associations / links / connections across the entities in your domain.
How does a Data Scientist build a model of a complex dynamic system?
We might start by modeling a complex system like this…
We can add more features to model the system with higher fidelity …
Pattern Discovery is easy, but Pattern Exploitation requires more data science…
Source for graphic: http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html
Generalization is key!
(The Goldilocks model)
The most generally useful model captures the fundamental pattern in the data and takes into account the natural variance in the data.
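The Goldilocks idea can be sketched by comparing three models on held-out data. This is an illustration under assumptions: the data is synthetic (a straight line plus noise), and a nearest-neighbour "memorizer" stands in for an over-flexible model in place of whatever the slide graphic showed.

```python
import random

random.seed(1)


def make_data(n):
    """Hypothetical data: a straight-line pattern plus natural variance."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + random.gauss(0, 1)) for x in xs]


train, test = make_data(500), make_data(500)


def fit_constant(data):
    """Too simple: ignores x and always predicts the mean (underfits)."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Just right: least-squares straight line."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((x - mx) * (y - my) for x, y in data) / sum(
        (x - mx) ** 2 for x, _ in data
    )
    return lambda x: my + slope * (x - mx)


def fit_memorizer(data):
    """Too flexible: repeats the nearest training point, noise included."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


for name, fit in [("constant", fit_constant), ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name:>9}: train MSE {mse(model, train):6.2f}, "
          f"test MSE {mse(model, test):6.2f}")
```

The memorizer is perfect on the data it has seen and worse than the line on data it has not, while the constant model is poor on both. The line captures the pattern and tolerates the variance.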
Developing insight and scientific intuition are connected and essential, and slow...
Insight: the capacity to gain an accurate
and deep intuitive understanding of a
person or thing.
Developing Scientific Intuition: "People are particularly bad at Statistical Thinking, which requires careful weighing of evidence by the slower analytic mind."
https://www.sigmaxi.org/news/article/2018/12/12/from-the-president-developing-scientific-intuition
https://amzn.to/2RI1mlS
Remember this …
Recommended Reading: The Field Guide To Data Science
https://www.boozallen.com/s/insight/publication/field-guide-to-data-science.html
5 Levels of Analytics Maturity in Data-Driven Applications
1) Descriptive Analytics
– Hindsight (What happened?)
2) Diagnostic Analytics
– Oversight (real-time / What is happening? Why did it happen?)
3) Predictive Analytics
– Foresight (What will happen?)
4) Prescriptive Analytics
– Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)
5) Cognitive Analytics
– Right Sight (the 360 view: what is the right question to ask for this set of data in this context = Game of Jeopardy)
– Finds the right insight, the right action, the right decision, … right now!
– Moves beyond simply providing answers, to generating new questions and hypotheses.
Predictive vs Prescriptive: What’s the Difference?
PREDICTIVE Analytics:
Find a function (i.e., the model) f(d,t) that predicts the value of some predicted variable y = f(d,t) at a future time t, given the set of conditions found in the training data {d}.
=> Given {d}, find y.
PRESCRIPTIVE Analytics:
Find the conditions {d’} that will produce a prescribed (desired, optimum) value y at a future time t, using the previously learned conditional dependencies among the variables in the predictive function f(d,t).
=> Given y, find {d’}.
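The two directions can be sketched with one learned function. This is a simplified illustration under assumptions: the training data is invented, the time dependence t is left out, the model is a straight line, and the `prescribe` helper is a hypothetical grid search over candidate conditions.

```python
# Hypothetical training data {d}: (condition d, observed outcome y).
training = [(d, 3.0 * d + 10.0) for d in range(10)]


def fit_line(data):
    """Learn y = f(d) by least squares (the predictive model)."""
    n = len(data)
    md = sum(d for d, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((d - md) * (y - my) for d, y in data) / sum(
        (d - md) ** 2 for d, _ in data
    )
    return lambda d: my + slope * (d - md)


def prescribe(f, target_y, candidates):
    """Pick the candidate condition whose predicted outcome is closest to target_y."""
    return min(candidates, key=lambda d: abs(f(d) - target_y))


f = fit_line(training)

# Predictive: given conditions d, find y.
predicted_y = f(12.0)

# Prescriptive: given a desired y, find conditions d'.
candidates = [i * 0.5 for i in range(101)]  # d' from 0.0 to 50.0
best_d = prescribe(f, target_y=100.0, candidates=candidates)

print(f"predicted y at d = 12: {predicted_y:.1f}")
print(f"d' for target y = 100: {best_d:.1f}")
```

The prescriptive step reuses the same learned f and runs it in the other direction. With more than one condition variable, the search would be over combinations of conditions rather than a single grid.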
Confucius says…
“Study your past to know
your future”
Baseball philosopher Yogi Berra says…
“The future ain’t what it
used to be.”
Zoom deeper into your data for both Predictive and Prescriptive Power Discovery!
(from the Booz Allen “Field Guide to Data Science”)
“What is going on in that neighborhood
on Saturday evenings between 6pm and 8pm?”
Source for graphic: https://www.boozallen.com/s/insight/publication/field-guide-to-data-science.html
Association Discovery Example #1
◼ Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.
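Association discovery of this kind is usually summarized with three numbers: support, confidence, and lift. A minimal sketch, assuming a handful of made-up shopping baskets (not real store data):

```python
# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread"},
    {"milk", "chips"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """How much more often the pair occurs than if the items were independent."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({"diapers"}, {"beer"})
print(f"support:    {support(rule[0] | rule[1]):.3f}")
print(f"confidence: {confidence(*rule):.3f}")
print(f"lift:       {lift(*rule):.3f}")
```

A lift above 1 means the two items appear together more often than chance alone would suggest, which is what makes the association worth a closer look.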
Association Discovery Example #2
◼ Amazon.com mines its customers’ purchase logs to recommend books to you: “People who bought this book also bought this other one.”
Association Discovery Example #3
◼ Netflix mines its video rental history database to recommend rentals to you based upon other customers who rented movies similar to yours.
Association Discovery Example #4
◼ Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida.
◼ Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many strawberry pop tarts compared to everything else.
Strawberry pop tarts???
http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html
http://www.hurricaneville.com/pop_tarts.html
http://bit.ly/1gHZddA
https://www.pinterest.com/pin/248683210647831264/
Curious
Source: https://infocus.dellemc.com/william_schmarzo/design-thinking-innovation/
Creative (design thinking)
https://jaywalker-digital.ch/en/ebook-how-to-apply-design-thinking-to-data-science/
https://blog.westmonroepartners.com/when-design-thinking-meets-data-science/
Computational
Data Scientists survey results: https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html
Collaborative
Source: https://www.boozallen.com/s/insight/publication/data-science-playbook.html
Critical Thinker
Source: https://plus.google.com/collection/UVYWTB
Source: https://www.adam-eason.com/critical-thinking-importance-ways-improve/
Source: https://successatschool.org/advicedetails/964/critical-thinking-skills
Community Focus
https://datasciencebowl.com/
Courageous Problem-Solver
https://middlesexconsulting.com/lesson-from-winston-churchill/
Cool under pressure (tolerance for ambiguity)
https://www.slideserve.com/mave/shades-of-gray-ambiguity-tolerance-statistical-thinking
Consultative (customer-focus)
Creating value at the “pull” of the customer!
“A pull strategy becomes more important than push because you want to create enough value so that the customer comes to you!”
https://www.marketing91.com/pull-strategy-in-marketing/
Compassion (empathy)
Source: https://www.jitbit.com/news/customer-service-skills/
Communicator (data storytelling)
Continuous Life-long Learner
… or just follow this guy on Twitter …
@KirkDBorne
Booz Allen Hamilton
SAILING THE “7 SEAS” OF DATA SCIENCE: The Individual’s Journey to Data Science Maturity
The “Seven” Seas (C’s):
1) Cognitively Curious (ask questions … the right questions!)
2) Creative (design thinker)
3) Courageous problem-solver (rocks the culture, willingness to fail)
4) Cool under pressure (tolerance for ambiguity)
5) Continuous life-long learner (hackathons, online classes, …)
6) Communicator (data storyteller)
7) Collaborative (“data science is a team sport”)
+ 5 more:
8) Critical Thinker
9) Computational
10) Consultative
11) Community-focus
12) Compassion (Empathy)
DATA SCIENTISTS ARE EXPLORERS –– EXPLORING VAST AND ENDLESS SEAS OF DATA!
“If you want to build a ship, don’t drum up people to gather wood and don’t assign them tasks and work, but rather teach them to yearn for the vast and endless sea.” - Antoine de Saint-Exupery
https://www.pinterest.com/pin/377106168772298092/
http://www.nytimes.com/2008/04/11/world/europe/11exupery.html
Come for the Data. Stay for the Science!
Thank you!
Dr. Kirk Borne, Principal Data Scientist, Booz Allen Hamilton
Twitter: @KirkDBorne or Email: [email protected]
Get slides here: http://www.kirkborne.net/ApexSystems2019/