Technische Universität München
Dr. Antonio Vetrò
With feedback and some material from: Dr. Daniel Méndez Fernández, Prof. Dr. Dr. h.c. Manfred Broy, Prof. Dr. Angelo Vulpiani, Prof. Dr. Francesco Sylos Labini
Limitations of traditional scientific approaches and new paradigms
Examples from statistics, physics, and implications for Software Engineering Research
Outline
– Limitations on the usage of the p-value (Perspective from Statistics), and implications for software engineering research
– Limitations on predictions based on past data (Perspective from Physics), and implications for software engineering research
– Actions
Outline
– Limitations on the usage of the p-value (Perspective from Statistics), and implications for software engineering research
– Limitations on predictions based on past data (Perspective from Physics), and implications for software engineering research
– Actions
Hypothesis Testing
• Research hypothesis: a statement of what the researcher believes will be the outcome of an experiment or a study.
• Statistical hypotheses: a more formal structure derived from the research hypothesis; a null hypothesis is set up and rejected in favor of the alternative one if it is not supported by the data.
• Memento: we can reject hypotheses but not confirm them!
Definitions from: Marco Torchiano, Empirical Software Engineering Course for PhD students, Politecnico di Torino. Example: own.
Example:
• Research hypothesis: Java classes containing code smells are more bug prone.
• Statistical hypothesis, given:
– X = bug counts from a sample of classes with code smells
– Y = bug counts from a sample of classes without code smells

H0: μx ≤ μy versus HA: μx > μy (one-tailed)
Hypothesis testing in our example
Example:
• Research hypothesis: Java classes containing code smells are more bug prone.
• Statistical hypothesis we aim to reject: the number of bugs in classes with code smells is less than or equal to the number of bugs in classes without code smells.
[Figure: distributions of the number of bugs in classes with code smells (123) and in classes without code smells (250); 373 data points in total.]
Data: Hadoop v. 0.14.0, from Antonio Vetrò, "Hadoop data", DOI 10.13140/2.1.2413.1204
The P-value: recalling the basics (1/2)
• The p-value is the probability that we would have seen data at least as extreme as ours just by chance, if the null hypothesis is true.
• In our example, the null hypothesis says that classes with code smells have no more bugs than classes without code smells: H0: μx ≤ μy.
• Rationale: we want to know the probability of observing our data (which suggests μx > μy) purely by chance under H0.
• E.g., p-value < 0.001 means: P(empirical data | null hypothesis) < 0.001.
• In hypothesis testing, a null hypothesis (H0) is rejected in favor of the alternative one (HA) when the p-value is lower than a predefined threshold on the probability of making an error.
Figure: http://en.wikipedia.org/wiki/One-_and_two-tailed_tests
Error and Power
Type I error (also known as "α"):
– rejecting the null when the effect isn't real
– the probability of finding an effect that isn't real (false positive)
– if we require p-value < .05 for statistical significance, this means that 1 time in 20 we will find a positive result just by chance

Type II error (also known as "β"):
– failing to reject the null when the effect is real
– the probability of missing an effect (false negative)

Power (the flip side of the type II error: 1 − β):
– the probability of seeing a true effect if one exists, i.e. the probability of not making a type II error
– when we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%); see the Monte Carlo sketch below
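To make α, β, and power concrete, here is a minimal Monte Carlo sketch in Python (the sample size and effect size are illustrative assumptions, not taken from the slides): it estimates the type I error rate and the power of a one-sided two-sample t-test by simulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sim, n, alpha = 5_000, 100, 0.05   # assumed: 100 observations per group

def rejection_rate(effect):
    """Fraction of simulated experiments in which H0: mu_x <= mu_y is rejected."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(loc=effect, scale=1.0, size=n)  # e.g., classes with smells
        y = rng.normal(loc=0.0, scale=1.0, size=n)     # e.g., classes without
        _, p = stats.ttest_ind(x, y, alternative="greater")
        rejections += p < alpha
    return rejections / n_sim

# With no real effect, the rejection rate is the type I error rate (~alpha):
print("type I error:", rejection_rate(effect=0.0))    # ~0.05
# With a real effect, the rejection rate is the power (1 - beta):
print("power:       ", rejection_rate(effect=0.35))   # ~0.8 for this setup
```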
Your statistical decision vs. the true state of the null hypothesis:

Decision | H0 true: classes with code smells are not more bug prone | H0 false: classes with code smells are more bug prone
Reject H0 (we conclude classes with code smells are more bug prone) | Type I error (α) | Correct
Do not reject H0 (we conclude classes with code smells are not more bug prone) | Correct | Type II error (β)
The P-value: recalling the basics (2/2)
Compute the p-value in our example*:

Source: Probability and Statistics for Engineers and Scientists, Sheldon Ross. * Normal distributions with unknown (assumed equal) variances; n and m are the sample sizes for X and Y; 95% confidence level.

• The test statistic is T = (X̄ − Ȳ) / (Sp √(1/n + 1/m)), where Sp is the pooled sample standard deviation, with n + m − 2 degrees of freedom.
• If the value of the test statistic T is v, then the p-value is P(T_{n+m−2} ≥ v).
Confidence level = 0.95, critical value t0.95,371 = 1.649.
Observed value of the test statistic: v = 6.98 > 1.649.
p-value = P(T371 ≥ 6.98) = 6.891e-12.
H0 is rejected; but this tells us nothing about causality. We need explanations.
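For illustration, the same one-tailed test can be run in a few lines of Python. This is only a sketch: the synthetic Poisson bug counts below stand in for the real Hadoop data (the sample sizes match the slide, the rates are made up).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data: 123 classes with code smells, 250 without (as above);
# the Poisson rates are invented for the sake of a runnable example.
bugs_smelly = rng.poisson(lam=1.2, size=123)   # X: classes with code smells
bugs_clean = rng.poisson(lam=0.2, size=250)    # Y: classes without code smells

# One-tailed two-sample t-test of H0: mu_x <= mu_y vs HA: mu_x > mu_y,
# assuming (as on the slide) normality and equal unknown variances.
t_stat, p_value = stats.ttest_ind(bugs_smelly, bugs_clean,
                                  equal_var=True, alternative="greater")
print(f"T = {t_stat:.2f}, p = {p_value:.3e}")
# Reject H0 at the 95% confidence level if T exceeds the critical value
# t_{0.95, 123+250-2} = t_{0.95, 371} = 1.649 (equivalently, if p < 0.05).
```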
The P-value and significance of the test
By convention, p-values of <0.05 are often accepted as “statistically significant” in the scientific literature; but this is an arbitrary cut-off.
A cut-off of p<0.05 means that in about 5 of 100 experiments, a result would appear significant just by chance (“Type I error”).
The “correct” level of significance to use in a given situation depends on the individual circumstances involved in that situation.
For instance, if rejecting a null hypothesis H0 would trigger large costs that would be wasted if H0 were indeed true, then we might elect to be more stringent and choose a significance level of 0.05 or 0.01.

Also, if we initially feel strongly that H0 is correct, then we would require very stringent data evidence to the contrary to reject it (that is, we would set a very low significance level in this situation).

For exploratory studies, we might be less conservative and choose level 0.10.
Low p-values: alternative actions

In practice, a low p-value leads to rejecting the null hypothesis as false. However, we have two more options:
– reject the observation as an outlier
– accept that we have made a rare observation, which is still possible given the hypothesis

The next slide explains why.
The p-value: a general limitation

"It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place."

Probabilities after the experiment are computed with the Bayes factor.
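One common calibration, not from the slides but consistent with them, is the minimum Bayes factor bound of Sellke, Bayarri, and Berger: for p < 1/e, the Bayes factor in favor of the null is at least −e·p·ln p. Combined with the prior odds of a real effect, it gives an upper bound on the post-experiment probability that the effect is real. A minimal sketch:

```python
import math

def max_posterior_prob(p_value, prior_odds):
    """Upper bound on P(real effect | data), via the minimum Bayes factor
    bound BF(H0) >= -e * p * ln(p), valid for p < 1/e."""
    bf_h0 = -math.e * p_value * math.log(p_value)  # evidence for H0 is at least this
    posterior_odds = prior_odds / bf_h0            # odds of a real effect, at most
    return posterior_odds / (1 + posterior_odds)

# A "significant" p = 0.05 with 1-to-10 prior odds that the effect is real:
print(max_posterior_prob(0.05, prior_odds=0.1))    # ~0.20: far from certainty
```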
Multiple hypotheses in the era of Big Data
The large availability of data permits testing multiple hypotheses at once. For each hypothesis, two classifications apply: whether it was actually null or actually non-null, and whether the test rejected it or not. In counts:

 | not rejected | rejected | total
hypotheses actually null | · | a | ·
hypotheses actually non-null | · | · | ·
total | total # not rejected | total # rejected (R) | total # hypotheses

(a = rejected hypotheses that were actually null, i.e. false discoveries; a and R are used on the next slides.)
Single case testing situation

[Figure: distributions of the test statistic under the null and under the alternative, with the type I error region (α) and the power (1 − β) marked.]
Multiple case testing situation
False discovery rate: FDR = a / R, where a is the number of falsely rejected (actually null) hypotheses and R is the total number of rejected hypotheses.

Goal: control the FDR.
Why it is important to control FDR (1/2)
Source: E. J. Candès.

[Figure: an example with 1000 hypotheses and 100 potential discoveries, contrasting two testing outcomes, Case A and Case B.]
Why is it important to control FDR (2/2)
Control per-comparison type I error (PCER):
– a.k.a. "uncorrected testing": many type I errors
– P(FDi > 0) ≤ α marginally for all 1 ≤ i ≤ N

Control family-wise type I error (FWER):
– e.g., Bonferroni: use per-comparison significance level α/m
– guarantees P(FD > 0) ≤ α
– very stringent: many type II errors

Control false discovery rate (FDR):
– first defined by Benjamini & Hochberg (BH, 1995, 2000); see the algorithm on the next slide
– guarantees FDR ≡ E(FD / D) ≤ α

Source: Christopher R. Genovese, Dept. of Statistics, Carnegie Mellon University
Benjamini & Hochberg in a nutshell
Modified from: http://www.unc.edu/courses/2007spring/biol/145/001/docs/lectures/Nov12.html
Sort the m p-values in increasing order: p(1) ≤ p(2) ≤ ... ≤ p(m).

k         | 1    | 2     | 3     | ... | m
p-value   | p(1) | p(2)  | p(3)  | ... | p(m)
threshold | α*/m | 2α*/m | 3α*/m | ... | mα*/m

Compare each p-value p(k) against its corresponding threshold kα*/m, and let k̂ be the largest k such that p(k) ≤ kα*/m. Decision rule: if k̂ ≥ 1, reject the hypotheses that correspond to p(1), p(2), ..., p(k̂) and fail to reject the hypotheses that correspond to the rest.
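A sketch of the procedure in Python (function and variable names are ours; the logic is exactly the decision rule above):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.10):
    """Return a boolean mask marking the hypotheses rejected by BH at level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                      # sorts p(1) <= p(2) <= ... <= p(m)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # p(k) <= k * alpha / m ?
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_hat = np.max(np.nonzero(below)[0])   # largest k with p(k) <= k*alpha/m
        reject[order[:k_hat + 1]] = True       # reject hypotheses 1, ..., k_hat
    return reject
```

Applied to the 20 p-values of the example on the next slide, this rejects the five smallest ones (k̂ = 5).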
Fix and control FDR: example

Let's assume we have multiple (independent) studies investigating the bug proneness of classes with code smells, with the same constructs. We simulate the obtained p-values by extracting 20 random values from a uniform distribution with min = 0.001 and max = 0.025, and we set our desired α* = 0.10.

k  | p-value | threshold (kα*/m) | p(k) ≤ threshold? (1 = yes)
1  | 0.013   | 0.005             | 0
2  | 0.016   | 0.010             | 0
3  | 0.017   | 0.015             | 0
4  | 0.019   | 0.020             | 1
5  | 0.021   | 0.025             | 1
6  | 0.037   | 0.030             | 0
7  | 0.041   | 0.035             | 0
8  | 0.045   | 0.040             | 0
9  | 0.048   | 0.045             | 0
10 | 0.052   | 0.050             | 0
11 | 0.060   | 0.055             | 0
12 | 0.068   | 0.060             | 0
13 | 0.087   | 0.065             | 0
14 | 0.102   | 0.070             | 0
15 | 0.106   | 0.075             | 0
16 | 0.109   | 0.080             | 0
17 | 0.118   | 0.085             | 0
18 | 0.136   | 0.090             | 0
19 | 0.148   | 0.095             | 0
20 | 0.149   | 0.100             | 0

The largest k with p(k) ≤ kα*/m is k̂ = 5, so the hypotheses corresponding to the five smallest p-values are rejected.

Publication bias may still affect the results!
Observations / food for thought

Controlling the FDR increases power while maintaining control over the error. It is a useful technique for putting together findings from multiple studies; in software engineering it is especially convenient for simulations, but...

What about experimental and observational research? In software engineering the replicability of studies is often difficult: it is very hard to obtain the same conditions and run multiple tests.
Other pitfalls of the p-value and control actions (I)

1. Unimportant effects may be statistically significant if a study is large (and therefore has a small standard error and extreme precision).
Pay attention to effect size and confidence intervals.
2. Statistical significance does not imply a cause-effect relationship.
Interpret results in the context of the study design.
3. A significance level of 0.05 means that your false positive rate for one test is 5%: if you run more than one test, your false positive rate will be higher than 5%, namely 1 − 0.95^(number of tests) for independent tests (see the sketch below).

Control study-wide type I error by planning a limited number of tests. Distinguish between planned and exploratory tests in the results. Correct for multiple comparisons (see the correction methods on the previous slides).
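A quick numeric check of point 3, assuming independent tests:

```python
# Probability of at least one false positive when running n independent
# tests, each at significance level alpha = 0.05:
alpha = 0.05
for n_tests in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:3d} tests -> {fwer:.2f}")
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64, 50 -> 0.92
```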
Other pitfalls of the p-value and control actions (II)

4. Results that are not statistically significant should not be interpreted as "evidence of no effect", but as "no evidence of effect": studies may miss effects if they are insufficiently powered (lack precision).

Design adequately powered studies, and interpret null results in the context of the study's power.

5. "The effect was significant in the treatment group, but not significant in the control group" does not imply that the groups differ significantly.

Use proper statistical tests for within-group and between-group differences.
Food for thought (see the last slide for actions)
The root cause of our problem is a philosophy of scientific inference that is supported by the statistical methodology in dominant use. This philosophy might best be described as a form of “naïve inductivism,” a belief that all scientists seeing the same data should come to the same conclusions.
Goodman, S. N. Epidemiology 12, 295–297 (2001).
Outline
– Limitations on the usage of the p-value (Perspective from Statistics), and implications for software engineering research
– Limitations on predictions based on past data (Perspective from Physics), and implications for software engineering research
– Actions
Summary of the 1st part of the discussion

Statistical tests are a powerful method to give scientific foundations to theories and to provide evidence for them from the observations it is possible to get.

However, they have some limitations:
– not all assumptions always hold (e.g., normality in parametric tests)
– often we don't have enough data to draw proper conclusions

Nowadays the large availability of data allows the application of better techniques to reduce the error in drawing conclusions from statistical tests:
– new paradigm: control the false discovery rate rather than the type I error
– however, this is not always applicable to Software Engineering, due to the intrinsic complexity of the phenomena involved and multiple confounding factors

Examples from physics can help understand why.
Inductive approach in the era of Big Data
Paradigm: "collect data first, ask questions later" (is that science?). It translates into "inference from data" with no a priori questions:
– e.g., regression, pattern discovery, and hypothesis building from historical data; often applied for prediction purposes

Important questions for prediction:
– Which are the relevant variables?
– What kind of laws regulate the system?
– What kind of perspective do we take: deterministic or probabilistic?

Different situations:
A. Evolution laws of the system exist and are known.
B. Evolution laws of the system exist and are not known.
C. We don't know whether the system has some laws.
Deterministic approach: the problem of chaos (Poincaré's work)
The problem of chaos (Poincaré's work): an example from physics

In a deterministic chaotic system, a small error in the initial conditions grows exponentially (Lyapunov exponent λ > 0):

|δx(t)| ≈ |δx(0)| e^(λt)

That means a system can be predicted with a tolerance Δ only within a certain time window, which depends on λ: T ∼ (1/λ) ln(Δ / |δx(0)|).

Let's take a simple prediction function, the logistic map x(t+1) = 4 x(t) (1 − x(t)). Its Lyapunov exponent is λ = ln 2 ≃ 0.693, so its prediction error doubles at every step.

Example with the logistic map: despite the very similar initial conditions (|x(0) − x′(0)| = 4 × 10−6), after t = 16 the two trajectories are completely different ("butterfly effect").

[Figure: two regimes of behavior, stability / high predictability vs. chaos / low predictability.]
Source: Sylos Labini
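The butterfly effect above is easy to reproduce. A minimal sketch of the logistic-map example (the starting point x(0) = 0.3 is an arbitrary choice; the initial separation 4 × 10−6 is the one quoted above):

```python
# Two trajectories of the logistic map x(t+1) = 4 x(t) (1 - x(t)), lambda = ln 2.
x, x_prime = 0.3, 0.3 + 4e-6          # nearly identical initial conditions
for t in range(1, 21):
    x = 4 * x * (1 - x)
    x_prime = 4 * x_prime * (1 - x_prime)
    print(f"t = {t:2d}   |x - x'| = {abs(x - x_prime):.3e}")
# The separation roughly doubles at every step (~ 2^t * 4e-6) and reaches
# order 1 around t = 16: from there on, the trajectories are uncorrelated.
```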
Prediction window

Predicting eclipses and tides is easier because those phenomena are less chaotic, i.e. the Lyapunov exponent λ is lower (and so the predictability window is large). That is why ancient populations (such as the Maya) could understand the periodicity of the planets' movements without having a physical reference model.

The atmosphere is a much more chaotic system, and its predictability window is quite short (i.e. λ is higher); see Lorenz's efforts.

Source: Sylos Labini
Problems and their relation to Software Engineering:
– the equations of the phenomena are not always known (do they exist?)
– often we don't even have a set of variables which describes the phenomenon

However, large amounts of data about the past can be used to predict the future with a certain probability, or to discover patterns in the data.
Probabilistic approach: the method of analogs

Predicting the future from the past: An old problem from a modern perspective. Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., American Journal of Physics, 80, 1001-1008 (2012), DOI: http://dx.doi.org/10.1119/1.4746070

• Most prediction algorithms work under the following basic idea:
• We know the past, i.e. a series (x1, x2, ..., xM) where xj = x(j∆t).
• We want to forecast the future, i.e. xM+t.
• We look back in the past for a situation similar to the present (time M), i.e. a vector xk with k < M and |xk − xM| < ε.
• We predict at time M + t as x̂M+t = xk+t (a sketch follows below).
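A minimal sketch of the method of analogs (the function is ours, and the toy series is a logistic-map trajectory used only so the example runs end to end):

```python
import numpy as np

def predict_by_analog(past, t, eps):
    """Find the past state most similar to the present (|x_k - x_M| < eps,
    k < M) and forecast x_{M+t} as x_{k+t}."""
    M, x_now = len(past), past[-1]
    # keep only analogs whose future (k + t) lies inside the recorded series
    candidates = [k for k in range(M - t) if abs(past[k] - x_now) < eps]
    if not candidates:
        return None   # no analog found: the series is too short (see below)
    k = min(candidates, key=lambda k: abs(past[k] - x_now))
    return past[k + t]

# Toy series: a logistic-map trajectory.
x, series = 0.3, []
for _ in range(10_000):
    series.append(x)
    x = 4 * x * (1 - x)
series = np.array(series)

print("forecast:", predict_by_analog(series, t=1, eps=1e-3))
print("truth:   ", 4 * series[-1] * (1 - series[-1]))
```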
Considerations

In an ergodic system, the average return time to a set A is proportional to the system's characteristic time τ0 and inversely proportional to the probability of A (Kac's lemma):

⟨τ(A)⟩ ∼ τ0 / P(A)

For a set of linear dimension O(ε), P(A) ∼ ε^D, where D is the number of variables involved.*

Good news: to find an analog of the present with precision ε, we need to go back in time about τ0 ε^(−D).

Bad news: to find such an analog, the length of the recorded series must be of the same order (e.g., precision 5%: length ≈ 6 × 10^7).
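A back-of-the-envelope sketch of how the required record length blows up with the number of variables D (τ0 set to 1):

```python
# Required record length ~ eps^(-D) to find an analog with precision eps.
eps = 0.05
for D in range(1, 8):
    print(f"D = {D}: series length ~ {eps ** -D:.1e}")
# Already at D = 6 this is ~6.4e7, the order of magnitude quoted above.
```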
Which are the relevant variables?
Do we take a deterministic or probabilistic perspective?
What kind of laws regulate our systems?

* For details see Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., DOI: http://dx.doi.org/10.1119/1.4746070
Considerations (continued)

In very complex systems (e.g., earthquakes) the state vector is not known a priori, and the data are not enough to reach appreciable precision in predictions (because it is almost impossible to find an analog back in the past).

Example: the parable of Google Flu.
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani
SCIENCE, Vol. 343, 14 March 2014
ILI: influenza-like illness. CDC: Centers for Disease Control and Prevention, which bases its estimates on surveillance reports from laboratories across the United States.
Implications for Software Engineering Research

In Software Engineering the space of variables is very large and not known a priori.

This is related to the initial important questions:
• Which are the relevant variables?
• What kind of laws regulate the system?
• What kind of perspective do we take: deterministic or probabilistic?

It is extremely difficult to find analogs in the past: software projects are barely comparable to each other. Even projects with the same people and the same objectives would always follow different processes, obey different psychological factors, etc.

As a consequence, experiments are difficult to reproduce or replicate.
Reproducibility vs. replicability (for experiments)

Reproducibility (requires change) vs. replicability (avoids change: ceteris paribus); is replicability a "poor substitute for reproducibility"?

Nature initiative:
– no space limitations on Methods sections
– statisticians help review papers and measures
– raw data online is encouraged
– checklist for life-science submissions

Other ongoing initiatives:
– The Recomputation Manifesto
– ARRIVE (Animal Research: Reporting of In Vivo Experiments)
– National Institutes of Health of the United States (NIH)
Outline
– Limitations on the usage of the p-value (Perspective from Statistics), and implications for software engineering research
– Limitations on predictions based on past data (Perspective from Physics), and implications for software engineering research
– Actions
Actions (proposals): let's discuss

a) Stop aiming at absolute generalization and focus on specific studies:
– focus on the systems' underlying mechanisms
– provide "engineering solutions" which solve very specific problems

b) Don't give up on generalization and, being aware of the limitations of the different approaches, stress:
– rigor and transparency in the methodology of data collection and analysis (aim at reproducibility first, rather than replicability)
– a universal language, i.e. a commonly agreed set of variables to represent the phenomena under study
– details on (standardized) context information

c) Develop theories first (see the Oberseminar on the Role of Mathematics and Logical Theories):
– a theory also delivers explanations of the underlying mechanisms
– empirically test theories or provide sound evidence for them

…
Open questions / food for thought

Which software engineering phenomena can we study by applying empirical methods?

In which circumstances is it useful to study those phenomena with empirical methods?

What else should we check (methodology, relevance, etc.) before starting an empirical evaluation?
References and sources

Perspective from Statistics:
– R. Foygel Barber and E. J. Candès, Controlling the false discovery rate via knockoffs.
– Sheldon M. Ross, Introduction to Probability and Statistics for Engineers and Scientists, Elsevier.
– Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, and Keying Ye, Probability & Statistics for Engineers & Scientists.
– Bradley Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, ISBN 9781107619678, Jan 2013.
– Goodman, S. N., Epidemiology 12, 295–297 (2001).

Perspective from Physics:
– Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., Predicting the future from the past: An old problem from a modern perspective, American Journal of Physics, 80, 1001-1008 (2012), DOI: http://dx.doi.org/10.1119/1.4746070
– Francesco Sylos Labini, Big Data, Complexity and Scientific Method.
– Chris Anderson, The End of Theory.
– L. F. Richardson, Weather Prediction by Numerical Process (Cambridge University Press, 1922).

Replicability:
– Nature, Reproducibility initiative (checklist).
– Chris Drummond, Replicability is not Reproducibility: Nor is it Good Science.
– The Recomputation Manifesto.