Introduction to exploratory statistics
Jean Paul Maalouf
linkedin.com/in/jean-paul-maalouf
Illustrated with XLSTAT
www.xlstat.com
Oct. 19, 2016
PLAN
• XLSTAT: who are we?
• Statistics: categories
• Reminder: Variables, individuals, Descriptive Statistics
• Toward exploratory data analysis: scatter plot colored by group
• Exploratory statistics & Data Mining
• Principal Component Analysis (PCA): concept and practice
• Agglomerative Hierarchical Clustering (AHC): concept and practice
All the data in this class were made up unless
otherwise specified
XLSTAT: Who are we?
XLSTAT is a user-friendly
statistical add-on software
for Microsoft Excel®
XLSTAT: a growing software and team
Years shown on the timeline: 1993, 1996, 2000, 2006, 2009, 2015, 2016
- Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born
- XLSTAT makes its first sale on the Internet
- New version, VBA interface, C++ computations, 7 languages
- New products, new website, growing and dynamic team
- The company Addinsoft is created
- New offers adapted to business needs
- XLSTAT 365: Cloud version of XLSTAT for Excel 365
- XLSTAT Free: free limited edition
XLSTAT in a few numbers
- 200+ statistical features: general or field-oriented solutions
- 50k users across the world: companies, education, research
- 16 employees, always receptive to the needs of users
- 120k visits/month on the website
- Easy tutorials available in 5 languages
- 7 languages
- 400 downloads/day
Statistics: 4 categories
Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts (mean, standard deviation, boxplots...)
Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer (PCA, AHC...)
Tests: I want to accept / reject a very precise hypothesis assuming error risks (t tests, ANOVA, correlation tests, chi-square...)
Modeling: I want to understand the way a phenomenon evolves according to a set of parameters (regression, ANOVA, ANCOVA...)
Sessions: Nov. 9, Nov. 30. Recording (valid until Oct. 21)
Reminder: variables, individuals, descriptive statistics
Variables, individuals
Variable
An element that can take different values
Qualitative variable
A variable that cannot be quantified. Examples:
socioprofessional category, geographical origin,
type of licence, blood type...
Quantitative variable
A variable that can be quantified. Examples: invoice
amount, number of likes on Facebook, sugar
concentration, height...
Individual
Elementary statistical unit. Can be described with
variables. Examples: customers, surveyed people,
patients, laboratory mice...
Data set: online shoe selling platform
(Table: each row is an individual, each column is a variable.)
Descriptive statistics: commonly used tools according to the situation
- 1 qual. variable: flat sorting (frequency table), mode, pie charts
- 1 quant. variable: center (mean / median); dispersion (variance / std. deviation / quartiles); box plot
- 1 qual. variable x 1 qual. variable: cross tabulation (contingency table)
- 1 quant. variable x 1 quant. variable: scatter plot
- 1 quant. variable x 1 qual. variable: quantitative descriptive statistics per category of the qualitative variable; multiple box plot chart
- 1 quant. variable x 1 quant. variable x 1 qual. variable: scatter plot with points colored according to the categories of the qualitative variable
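A few of the tools listed above can be sketched in Python with the standard library only. This is a minimal illustration, not the XLSTAT procedure: the values and the variable names (`invoice`, `origin`) are made up for the example.

```python
from collections import Counter
from statistics import mean, median, stdev, quantiles

# Hypothetical data: one quantitative and one qualitative variable.
invoice = [120.0, 80.0, 95.0, 150.0, 60.0, 110.0]
origin = ["Earth", "Mars", "Earth", "Pluto", "Mars", "Earth"]

# 1 qualitative variable: flat sorting (frequency table) and mode.
counts = Counter(origin)
mode = counts.most_common(1)[0][0]

# 1 quantitative variable: center and dispersion.
center = {"mean": mean(invoice), "median": median(invoice)}
spread = {"std": stdev(invoice), "quartiles": quantiles(invoice, n=4)}

# 1 quantitative x 1 qualitative variable: statistics per category.
by_group = {}
for value, group in zip(invoice, origin):
    by_group.setdefault(group, []).append(value)
group_means = {group: mean(values) for group, values in by_group.items()}

print(counts, mode, center, spread, group_means)
```

The same grouping step is what a multiple box plot chart shows graphically: one summary of the quantitative variable per category.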
Toward exploratory data analysis: scatter plot
colored by group
- Invoice amount decreases with time spent
on the website.
- Plutonians spend more money on the website
compared to others.
- Martians and humans form a relatively
homogeneous group
- ...
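The observations above can also be checked numerically. The sketch below assumes a small made-up data set (the records, the group labels and the helper name `pearson` are hypothetical): it computes the overall correlation between time on site and invoice amount, then the mean invoice amount per group, which is the information the colors add to the scatter plot.

```python
from statistics import mean

# Hypothetical (time on site in minutes, invoice amount, group) records.
records = [
    (5, 200, "Plutonian"), (8, 190, "Plutonian"), (12, 170, "Plutonian"),
    (20, 90, "Martian"), (25, 80, "Martian"), (30, 60, "Martian"),
    (22, 85, "Human"), (27, 75, "Human"), (32, 65, "Human"),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

times = [r[0] for r in records]
amounts = [r[1] for r in records]
r_time_amount = pearson(times, amounts)  # negative: amount falls as time rises

# Mean invoice amount per group, i.e. what the colors reveal on the chart.
groups = {}
for _, amount, group in records:
    groups.setdefault(group, []).append(amount)
mean_by_group = {g: mean(v) for g, v in groups.items()}

print(round(r_time_amount, 2), mean_by_group)
```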
Imagine applying the same kind of reasoning to a higher number of variables...
Time for exploratory statistics (or Exploratory Data Analysis)
Example: Principal Component Analysis (PCA)
We want to analyze multiple variables (dimensions) at a time, the same way we did with the 2D scatter plot.
Exploratory statistics
I want to easily extract information from a large data set without necessarily having a precise question to answer.
Exploratory statistics: a few words
Exploratory statistics
Look for information in a multivariate data set, without having very precise expectations. Exploratory tools are part of Data Mining.
First thing you can do: concentrate the information of big datasets in a few dimensions
Examples: Principal Component Analysis, Correspondence Analysis…
Second thing you can do: classification ( = clustering = segmentation)
Examples: Agglomerative Hierarchical Clustering, k-means…
Principal Component Analysis (PCA)
I’d like to summarize a big data set in a few simple charts
We’ll be able to investigate:
- Relationships among variables
- Proximity among individuals
- How individuals relate to variables
PCA: concept
Initial dataset, then artificial data set synthesized by PCA
The information is redistributed so that most of it is concentrated on a few dimensions, ordered from the highest (+) to the lowest (-) amount of information.
PCA jargon:
dimension = axis = factor
information = variability = inertia
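To make the idea of "concentrating information on a few dimensions" concrete, here is a minimal sketch of a normalized PCA in plain Python. It is an illustration under stated assumptions, not XLSTAT's implementation: the data are made up, the function names are hypothetical, and the eigenvalues of the correlation matrix are obtained by power iteration with deflation, chosen here only because it needs no external library.

```python
from statistics import mean, pstdev

# Hypothetical data: rows are individuals, columns are quantitative variables
# (height in cm, weight in kg, time on site in minutes).
data = [
    [160, 55, 40],
    [165, 60, 35],
    [170, 68, 30],
    [175, 72, 22],
    [180, 80, 18],
    [185, 90, 10],
]

def standardize(rows):
    """Center each column and scale it to unit variance."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of standardized rows: Z'Z / n."""
    n, p = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / n for j in range(p)]
            for i in range(p)]

def top_eigenpair(m, iterations=500):
    """Largest eigenvalue and its eigenvector (power iteration)."""
    p = len(m)
    v = [1.0 / p ** 0.5] * p
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    value = sum(v[i] * m[i][j] * v[j] for i in range(p) for j in range(p))
    return value, v

def pca_inertia(rows, n_axes):
    """Share of total inertia carried by the first n_axes principal axes."""
    m = correlation_matrix(standardize(rows))
    p = len(m)
    total = sum(m[i][i] for i in range(p))  # trace = number of variables
    shares = []
    for _ in range(n_axes):
        value, v = top_eigenpair(m)
        shares.append(value / total)
        # Deflation: remove the axis just found before looking for the next.
        m = [[m[i][j] - value * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    return shares

shares = pca_inertia(data, n_axes=2)
print([round(s, 3) for s in shares])
```

Because the three made-up variables are strongly linked, the first axis carries most of the inertia: this is the redistribution described above. Power iteration assumes the leading eigenvalues are distinct; a production tool would use a full eigendecomposition.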
What PCA looks like in practice
Chart 1: correlation circle
- Acute angle: positively-linked variables (e.g. weight & height)
- Right angle: uncorrelated variables (e.g. height & shoe size)
- Obtuse angle: negatively-linked variables (e.g. weight & time spent on site)
Vector length reflects how well the variable is represented in the selected plane (F1/F2 here)
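The angle rule has a simple geometric basis: the cosine of the angle between two centered variables equals their Pearson correlation. The sketch below illustrates this with made-up values (the lists and the function name `angle_between` are hypothetical). On a correlation circle the reading is only reliable for variables whose vectors are close to the circle, since the chart is a projection.

```python
import math
from statistics import mean

def angle_between(xs, ys):
    """Angle in degrees between two centered variables.

    Its cosine equals the Pearson correlation of xs and ys.
    """
    mx, my = mean(xs), mean(ys)
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.degrees(math.acos(cosine))

# Hypothetical values for five customers.
height = [160, 165, 170, 175, 180]
weight = [55, 62, 66, 74, 80]          # rises with height
time_on_site = [40, 33, 31, 22, 15]    # falls as weight rises
shoe_size = [40, 37, 42, 37, 40]       # built to be uncorrelated with height

print(angle_between(weight, height))        # acute: positive link
print(angle_between(height, shoe_size))     # about 90 degrees: no linear link
print(angle_between(weight, time_on_site))  # obtuse: negative link
```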
Interpreting the axes
Chart 1: correlation circle
- F1 reflects:
- High weight & height (right)
- Long time spent on site (left)
- F2 is strongly related to shoe size:
- Big shoes (top)
- Small shoes (bottom)
What PCA looks like in practice
Chart 1: correlation circle; chart 2: observations
Right side of F1: weight +, height +, time on site -
Left side of F1: weight -, height -, time on site +
PCA: explorations ...
- Weight increases with height
- Shoe size is unrelated to weight / height
- Time spent on site decreases with weight & height
- Derrick has big feet. Shaun has small feet.
- Looks like there are two clusters in the data
- And so on...
PCA tutorial link
PCA works only with quantitative data. Other exploratory methods are available for other situations.
It was easy to detect two clusters of customers. Nice for marketing!
According to our PCA, customers can be split into two clusters characterized by height, weight and time spent on site. This may help us define tailored marketing campaigns.
But what if groups were not that easy to define visually?
Agglomerative Hierarchical Clustering (AHC)
I want to cluster ( = classify = segment) individuals into homogeneous groups ( = segments = clusters = classes)
Agglomerative Hierarchical Clustering (AHC)
How to cluster consumers into different groups?
Illustration with 2 variables
EXAMPLE: sensory analysis, chocolate consumers survey
AHC – how it works on 2 variables
(Animation on a 2-variable scatter plot, one axis being Age: the groups are merged step by step, from 19 groups down to 2 groups and finally 1 group.)
Choosing a “cutting” level: segments are now defined
This can obviously be generalized to more than 2 variables
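The merging process can be sketched in a few lines of Python. This is a naive illustration, not the algorithm as implemented in XLSTAT: the slides do not say which dissimilarity or agglomeration method was used, so Euclidean distance and average linkage are assumptions here, and the data and the function names are made up.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean Euclidean distance between the members of two groups."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def ahc(points, n_groups):
    """Merge the two closest groups repeatedly until n_groups remain.

    Returns the final groups (lists of row indices) and the
    dissimilarity level at which each merge happened.
    """
    groups = [[i] for i in range(len(points))]  # one group per individual
    merge_levels = []
    while len(groups) > n_groups:
        a, b = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: average_linkage(
                groups[pair[0]], groups[pair[1]], points),
        )
        merge_levels.append(average_linkage(groups[a], groups[b], points))
        merged = groups[a] + groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)]
        groups.append(merged)
    return groups, merge_levels

# Hypothetical consumers described by 2 variables: (age, liking score).
consumers = [(20, 8), (22, 9), (25, 7), (50, 3), (52, 2), (55, 4), (38, 9)]
groups, levels = ahc(consumers, n_groups=3)
print(sorted(sorted(g) for g in groups))
print([round(level, 2) for level in levels])
```

Stopping at `n_groups` plays the role of the "cutting" level: it amounts to cutting the tree just above the last merge that was performed. The merge levels are what a dendrogram draws on its dissimilarity axis.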
Agglomerative Hierarchical Clustering (AHC)
What it looks like in XLSTAT:
The higher the “vertical distance” between two individuals (or groups), the more different they are.
Here we could split the individuals into 3 or 4 homogeneous groups
[Figure: dendrogram. Vertical axis: dissimilarity (0 to 250); one leaf per individual, labeled by first name.]
Agglomerative Hierarchical Clustering (AHC)
3-cluster split:
Okay. And now what?
Let’s describe the 3 groups to see how we could take action on a marketing scale
AHC tutorial link
[Figure: dendrogram showing the 3-cluster split. Vertical axis: dissimilarity (0 to 250).]
How can I describe segments?
Things become quite straightforward when you extract class membership from the AHC results
Describing the segments
Things you can do:
- Split individuals into classes and run descriptive statistics on each segment
- Use class membership as a supplementary variable in a PCA
- Use parallel coordinates plots
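The first option, descriptive statistics per segment, can be sketched as follows once each individual carries a class label. The records, the variable names and the function `describe_by_cluster` are made up for the illustration.

```python
from statistics import mean

# Hypothetical individuals: (cluster label, age, brand loyalty score 0-10).
individuals = [
    (1, 45, 8), (1, 50, 9),
    (2, 22, 3), (2, 25, 4),
    (3, 60, 9), (3, 58, 7),
]
variables = ["age", "brand_loyalty"]

def describe_by_cluster(rows, names):
    """Mean of each variable within each cluster."""
    by_cluster = {}
    for label, *values in rows:
        by_cluster.setdefault(label, []).append(values)
    return {
        label: {name: mean(col) for name, col in zip(names, zip(*members))}
        for label, members in sorted(by_cluster.items())
    }

summary = describe_by_cluster(individuals, variables)
for label, stats in summary.items():
    print(label, stats)
```

Comparing the rows of `summary` is the numerical counterpart of reading one box plot per cluster.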
Describing clusters: descriptive statistics
- Consumers from clusters 1 & 3 are more loyal to brands than those from cluster 2
- Consumers from cluster 2 are younger
Describing clusters: parallel coordinates plot
Cluster 3: older consumers, loyal to brands, who prefer bitter chocolate and are not online buyers...
Cluster 2: younger consumers who prefer frozen chocolate, are sensitive to prices and care less about brands
Consequences:
- Promote branded bitter chocolate to older consumers
- Promote cheaper chocolates to younger consumers
- …
…
Tutorial link
[Figure: parallel coordinates plot, one line per cluster (1, 2, 3). Variables: Brand loyalty, Price sensitivity, Online buyer, Bitter, Frozen, Crunchy, Age; vertical scale from 0 to 1.]
In summary...
Description Exploration Tests Modeling
I want to summarize
small data sets (1-3
variables) using
simple statistics or
charts. Leads to
hypotheses.
I want to easily extract
information from a
large data set without
necessarily having a
precise question to
answer. Leads to
hypotheses.
I want to validate /
reject a very precise
hypothesis assuming
error risks. (t tests,
ANOVA, correlation
tests, chi-square...)
I want to understand
the way a phenomenon
evolves according to a
set of parameters.
(regression, ANOVA,
ANCOVA...)
Nov. 9 Nov. 30
Recording (valid until
Oct. 21)
Exploratory statistics: Take Home Message
Exploratory statistics
They help you gain insight into large data sets
They give a synthetic view of large data sets
Examples: Principal Component Analysis, Correspondence Analysis, MDS…
They allow clustering data sets
Examples: Agglomerative Hierarchical Clustering, k-means
Choose an appropriate exploratory data analysis tool according to your situation
Data exploration inspired many hypotheses. Are they valid?
Statistical tests
See you on Nov. 9!
www.xlstat.com/fr/formation
Thanks for attending!
All the tools we saw are available in all XLSTAT solutions
Survey time…