Statistics Overview

What is statistics?

Inference and uncertainty: this is what statistics is all about.

Statistics consists of a body of methods for collecting and analyzing data. (Agresti & Finlay, 1997)

Developed for interpreting and drawing conclusions from collected data

The major objective of statistics is to make inferences about a population from the analysis of sample data

What does statistics provide?

Design: Planning and carrying out research studies

Description: Summarizing and exploring data

Inference: Making predictions and generalizing about phenomena represented by the data

Population vs. sample

Steps in Planning Statistical analysis

Terms and Terminologies

Population- The total group of individuals or items that the researcher is interested in studying.

Sample- A group of individuals selected from the population

Parameter- is a characteristic of a population

Statistic- is a characteristic of a sample

Variable- characteristic or attribute that can assume different values.

Variate- A random variable taken from a known probability distribution


Descriptive statistics- summarize and describe the main features of the data. E.g. frequencies, means, standard deviations

Exploratory statistics- Usually presented in the form of graphs to reveal patterns in the data.

Inferential statistics- are used to draw inferences about a population from a random sample


Qualitative variable- Also known as categorical variable. Usually measured on a nominal scale.

Quantitative variable- Measured on a numeric scale. Interval and ratio scales are quantitative; ordinal scales only rank categories, although they are often analysed with quantitative methods

Discrete variable- takes separate values that can be counted in a finite amount of time.

Continuous variable- can take any value within a range, so its possible values cannot be counted: you would never finish counting them


Random sampling

Systematic sampling

Convenience sampling

Stratified sampling

Cluster sampling

Sampling Error- is the difference between the sample measure and the corresponding population measure
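The terms above (population, sample, parameter, statistic, sampling error) can be illustrated with a small sketch. The population below is simulated and entirely hypothetical; the names and numbers are illustrative assumptions, not data from the slides.

```python
import random
import statistics

# Hypothetical population: 1,000 body weights in grams (simulated for illustration).
rng = random.Random(42)
population = [rng.gauss(500, 50) for _ in range(1000)]

# Parameter: a characteristic of the whole population.
population_mean = statistics.mean(population)

# Statistic: the same characteristic computed on a random sample.
sample = rng.sample(population, 30)
sample_mean = statistics.mean(sample)

# Sampling error: sample measure minus the corresponding population measure.
sampling_error = sample_mean - population_mean
print(f"parameter={population_mean:.1f} statistic={sample_mean:.1f} error={sampling_error:+.1f}")
```

Drawing a different sample would usually give a different statistic, and therefore a different sampling error, while the parameter stays fixed.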

Descriptive vs. Inferential statistics

Descriptive statistics consist of methods for organizing and summarizing information (Weiss, 1999)

Inferential statistics consist of methods for drawing conclusions about a population, and measuring the reliability of those conclusions, based on information obtained from a sample of the population. (Weiss, 1999)
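As a rough illustration of the descriptive side, the sketch below organizes and summarizes a tiny data set with Python's standard library. The species labels and wing lengths are made up for this example.

```python
import statistics
from collections import Counter

# Hypothetical observations: species recorded and wing lengths in mm.
species = ["tern", "gull", "tern", "tern", "gull", "heron"]
wing_mm = [251.0, 302.5, 248.0, 255.5, 298.0, 410.0]

frequencies = Counter(species)          # how many of each category?
mean_wing = statistics.mean(wing_mm)    # how much, on average?
sd_wing = statistics.stdev(wing_mm)     # sample standard deviation (spread)

print(frequencies.most_common())
print(f"mean={mean_wing:.1f} mm, sd={sd_wing:.1f} mm")
```

Nothing here generalizes beyond the six observations; that step would belong to inferential statistics.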

Types of Statistical Approaches

Descriptive Statistics- Describes your data

- How many?

- How much?

Exploratory Statistics- represented in the form of graphs

- Is there any pattern?

- Are data points clustered or stretched?

Types of Statistical Approaches

Inferential Statistics

- Are there any differences?

- What is the relationship?

- What is the effect?

- Model building

- What determines what?

Distributions

Distribution shapes: positively skewed, symmetric, negatively skewed

Distributions

Normal Probability distribution

The mean, median and mode are equal

Bell-shaped curve, symmetrical around the mean

The total area under the curve (the total probability) is 1

Denoted by X ~ N(μ, σ²), where μ is the mean and σ is the standard deviation

Normal Probability Distribution

Areas under a normal distribution curve
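The areas under a normal curve can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

def area_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"within +/-{k} SD: {area_within(k):.4f}")
```

The results match the familiar rule of thumb: roughly 68% of values lie within 1 standard deviation, 95% within 1.96, and 99.7% within 3.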

Common types of Probability distributions

Other important types of distribution

1. Poisson

2. Binomial

Poisson Distribution

Used to represent the number of independent events of a specified type, each with a low probability of occurrence (< 10%), in some specified interval of time or space.

Example: cases of flu

Denoted by P(X = k) = λ^k e^(−λ) / k!, where λ is the mean number of events per interval

Poisson Distribution
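A minimal sketch of Poisson probabilities, assuming the usual probability mass function P(X = k) = λ^k e^(−λ) / k!. The rate of 2 flu cases per week is a made-up value used only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # hypothetical mean number of flu cases per week
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```

Summing the probabilities over all possible counts gives 1, as for any probability distribution.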

Binomial Distribution

An experiment that consists of n independent, repeated trials, each of which can end in only one of two ways arbitrarily labeled success or failure.

The probability that any trial ends in a success is p (and hence q = 1 − p for a failure).

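Denoted by P(X = k) = C(n, k) p^k q^(n − k)

where k is the number of successes in n trials and C(n, k) = n! / (k! (n − k)!)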

Binomial Distribution
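The binomial probabilities can be sketched as follows, assuming the standard formula C(n, k) p^k q^(n − k) with q = 1 − p. The function name `binomial_pmf` and the coin-toss example are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure on a single trial
    return comb(n, k) * p ** k * q ** (n - k)

# Hypothetical example: exactly 5 heads in 10 tosses of a fair coin.
print(binomial_pmf(5, 10, 0.5))
```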

Central Limit Theorem

Sampling distribution of means

As the sample size n increases without limit, the shape of the distribution of the sample means taken with replacement from a population with mean μ and standard deviation σ will approach a normal distribution.

This distribution will have a mean μ and a standard deviation σ/√n
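The theorem can be explored with a small simulation. The sketch below draws repeated samples from a deliberately skewed (exponential) population with mean 1 and standard deviation 1; the population, the sample size, and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return statistics.mean([rng.expovariate(1.0) for _ in range(n)])

n = 50
means = [sample_mean(n) for _ in range(2000)]

observed_mean = statistics.mean(means)   # should be close to the population mean, 1
observed_sd = statistics.stdev(means)    # should be close to sigma / sqrt(n)
expected_sd = 1.0 / n ** 0.5

print(f"mean of sample means={observed_mean:.3f}")
print(f"SD of sample means={observed_sd:.3f}, sigma/sqrt(n)={expected_sd:.3f}")
```

Even though the raw scores are strongly skewed, a histogram of `means` would look approximately bell-shaped.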


Importance of Central limit theorem

- We can describe the sampling distribution of the mean for any variable without having to sample the population of raw scores indefinitely.

Types of Variables

Nominal

Ordinal

Interval

Ratio

Types of Variables

Sampling Techniques

Random sampling

Systematic sampling

Stratified sampling

Cluster sampling

Other sampling techniques- Convenience sampling, Sequential sampling, Double sampling and multi-stage sampling
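Three of the techniques listed above can be sketched in a few lines. The sampling frame, the `habitat` stratum, and the function names below are hypothetical and serve only to show how the selection rules differ.

```python
import random

rng = random.Random(0)

# Hypothetical sampling frame: 100 individuals, each tagged with a habitat stratum.
frame = [{"id": i, "habitat": "forest" if i < 60 else "wetland"} for i in range(100)]

def simple_random(units, n):
    """Random sampling: every unit has the same chance of selection."""
    return rng.sample(units, n)

def systematic(units, n):
    """Systematic sampling: random start, then every k-th unit."""
    k = len(units) // n
    start = rng.randrange(k)
    return units[start::k][:n]

def stratified(units, n, key):
    """Stratified sampling: sample each stratum in proportion to its size."""
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    chosen = []
    for members in strata.values():
        share = round(n * len(members) / len(units))
        chosen.extend(rng.sample(members, share))
    return chosen

print([u["id"] for u in systematic(frame, 10)])
```

Cluster sampling would instead pick whole groups at random and keep every unit inside the chosen groups.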

Theory of Probability

Experiment

Outcome

Sample space

Event

Theory of Probability

P(A) = (number of possible outcomes in which event A occurs) / (total number of possible outcomes in the sample space)

where P(A) = probability that event A will occur

0 ≤ P(A) ≤ 1

P(A) + P(B) + ... + P(N) = 1, when A, B, ..., N are mutually exclusive and together cover the sample space

P(A or B) = P(A) + P(B) >> disjoint events

P(A and B) = P(A) × P(B) >> joint event, independent events

Theory of probability

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) >> contingent joint event

P(A ∩ B) = P(A) + P(B) − P(A ∪ B) >> contingent joint event

P(A|B) = P(A ∩ B) / P(B) >> conditional probability of A given B

P(B|A) = P(A ∩ B) / P(A) >> conditional probability of B given A
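These rules can be verified by direct counting on a small sample space. The sketch below uses one roll of a fair die; the events A and B are arbitrary choices made for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # one roll of a fair die
A = {2, 4, 6}                    # event A: an even number
B = {4, 5, 6}                    # event B: a number greater than 3

def P(event):
    """Outcomes in the event divided by outcomes in the sample space."""
    return Fraction(len(event & sample_space), len(sample_space))

p_union = P(A) + P(B) - P(A & B)   # addition rule for events that can overlap
p_a_given_b = P(A & B) / P(B)      # conditional probability of A given B

print(p_union, p_a_given_b)
```

Using `Fraction` keeps the arithmetic exact, so the addition rule can be compared directly with the probability of the union counted from the sample space.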

Definition of Probability

A probability measure is a rule, say P, which associates with each event contained in a sample space S a number such that the following properties are satisfied:

1. For any event A, P(A) ≥ 0.

2. P(S) = 1 (since S contains all the outcomes, S always occurs).

3. P(not A) + P(A) = 1.

4. If A and B are mutually exclusive events (that cannot occur simultaneously), then P(A or B) = P(A) + P(B) and P(A and B) = 0. If A and B are instead independent events (that are not linked in any way), then P(A and B) = P(A) × P(B).

Note: Many elementary probability theorems (rules) follow directly from these definitions.

Confidence Intervals

The range around any hypothetical value of the mean (μ) within which 95% of the means of all samples of size n taken from that population will occur.

Denoted by x̄ ± 1.96 σ/√n

This is the 95% confidence interval for the mean when the population variable X is normally distributed and σ is known

Understanding Z-statistic


Distribution of the Z statistic: the difference between the sample mean and the population mean, divided by the standard error of the mean (SEM), obtained by taking the means of a large number of small samples from a normal distribution. The 95% confidence interval obtained by taking the means of a large number of small samples from a normally distributed population with known parameters is indicated by the black horizontal bar enclosed within ±1.96 SEM. By chance, 95% of the sample means will fall within the range −1.96 to +1.96 SEM, with the remaining 5% outside this range


With larger sample sizes, the 95% confidence intervals get smaller
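The effect of sample size on the interval width can be sketched as follows, assuming a normally distributed variable with known σ so that the interval is the sample mean ± 1.96 σ/√n. The function name and the numbers (mean 50, σ = 8) are hypothetical.

```python
from statistics import NormalDist

def ci_known_sigma(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5          # z times the standard error
    return sample_mean - half_width, sample_mean + half_width

for n in (10, 40, 160):
    low, high = ci_known_sigma(50.0, 8.0, n)
    print(f"n={n}: {low:.2f} to {high:.2f}")
```

Each fourfold increase in n halves the width of the interval, because the standard error shrinks with the square root of the sample size.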

P-Value

It is defined as the probability of getting the observed result, or a more extreme result, if the null hypothesis is true. In other words it is the measure of the likelihood of the result given the null hypothesis is true or the statistical significance of the claim.

P-values range from 0 to 1

P-Value

"P=0.030" is a shorthand way of saying "The probability of getting 17 or fewer male chickens out of 48 total chickens, IF the null hypothesis is true that 50 percent of chickens are male, is 0.030."
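The chicken example can be reproduced with a one-tailed exact binomial calculation: sum the probabilities of 0, 1, ..., 17 males out of 48 when each chick has a 50 percent chance of being male. The helper name `binomial_cdf` is an assumption; this is a sketch, not necessarily the method used to obtain the quoted value.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability of 17 or fewer males out of 48 if H0 (50 percent male) is true.
p_value = binomial_cdf(17, 48, 0.5)
print(round(p_value, 3))
```

The result agrees with the quoted P=0.030 to three decimal places.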

It is a usual convention in biology to use a critical P-value of 0.05 (often called alpha, α)

P-Value

This p-value measures how likely it was that you would have gotten your sample results if the null hypothesis were true.

The farther out your test statistic is on the tails of the standard normal distribution, the smaller the p-value will be, and the more evidence you have against the null hypothesis being true

Interpreting P-value

If the p-value is greater than or equal to α, you fail to reject Ho.

If the p-value is less than α, reject Ho.

p-values on the borderline (very close to α) are treated as marginal results


- Here's how to interpret your results for any given alpha level

To make a proper decision about whether or not to reject Ho, you determine your cutoff probability for your p-value before doing a hypothesis test; this cutoff is called an alpha level (α).

Typical values for α are 0.05 or 0.01


- How to interpret your results if you use an alpha level of 0.05

If the p-value is less than 0.01 (very small), the results are considered highly statistically significant: reject Ho.

If the p-value is between 0.05 and 0.01 (but not close to 0.05), the results are considered statistically significant: reject Ho.

If the p-value is close to 0.05, the results are considered marginally significant: the decision could go either way.

If the p-value is greater than (but not close to) 0.05, the results are considered non-significant: don't reject Ho.

Biological vs statistical hypotheses

Biological and statistical hypotheses:

- Biological null hypothesis: "Sexual selection by females has not caused male chickens to evolve bigger feet than females."

- Statistical null hypothesis: "Male chickens don't have a different average foot size than females."

Statistical Hypothesis

Statistical Hypothesis- a statement about the probability distribution of populations, tested using one or more data samples

Hypothesis H0: All data samples originate from the same population (or the single data sample is consistent with a given theoretical distribution).

Hypothesis H1: Some data samples do not originate from the same population (or the single data sample is not consistent with the given theoretical distribution).

Statistical Inference and Hypothesis Testing

What do we mean by chance?

What do we mean by unlikely?

What do we mean by effect?

Hypothesis and Significance Testing

Hypothesis- is a statement about some characteristic of a variable or a collection of variables. (Agresti & Finlay, 1997).

Significance test- is a way of statistically testing a hypothesis by comparing the data to values predicted by the hypothesis

The Process of Hypothesis Testing

Samples selected at random from very different populations may not necessarily be different. Simply by chance, the samples from populations 1 and 2 may be similar, so you might mistakenly conclude that the two populations are also similar

The Mechanism of Hypothesis Testing

Even a random sample may not necessarily be a good representative of the population. Two samples have been taken at random from the same population. By chance, sample 1 contains a group of relatively large fish, while those in sample 2 are relatively small.

Type I & Type II errors

Test Statistics and your decision

Type I & Type II errors

Four possible results of hypothesis testing

Parametric statistics

Also known as classical statistics

Parametric tests are designed for analysing data from a known distribution

ANOVA (1920s and 30s), Multiple Regression (1800s), T-tests (1900s), Pearson Correlation (1880s) are parametric statistical methods

Parametric statistics

General Assumptions of Parametric Statistical Tests

1. The sample of n subjects is randomly selected from the population.

2. The variables are continuous and from the normal distribution

3. The measurement of each variable is based on interval or ratio data

Non parametric Statistics

Sometimes called distribution free statistics

Do not require data to be normally distributed

In general, a less powerful test than the analogous parametric test

No normality assumption

Uses less information

Spearman's Rho (1904), Kendall's Tau (1938), Kruskal-Wallis (1950s), Wilcoxon Signed-Ranks Matched Pairs (1940s) are non-parametric statistical methods

Parametric vs Non Parametric

Parametric test              Non-parametric analog
T-test (unpaired)            Wilcoxon rank sum test
Paired t-test                Wilcoxon signed rank test
ANOVA                        Kruskal-Wallis test
Repeated measures ANOVA      Friedman test

The parametric tests are called parametric because, when we calculate the p-value, we use the parameters of the normal distribution: mean and standard deviation

The non-parametric tests do not estimate these parameters, but instead are based on ranks
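To show what "based on ranks" means in practice, the sketch below replaces raw values with their ranks and computes the rank sum for one group, the quantity at the heart of the Wilcoxon rank sum test. The two groups are made-up numbers, and no p-value is computed here.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

# Hypothetical measurements for two independent groups.
group_a = [3.1, 4.7, 5.0]
group_b = [2.2, 2.9, 4.7, 3.0]

all_ranks = ranks(group_a + group_b)
rank_sum_a = sum(all_ranks[:len(group_a)])
print(all_ranks, rank_sum_a)
```

Because only the ordering of the values matters, an extreme outlier changes the rank sum no more than any other largest value would.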

Hypothesis and Statistical Tests

Main goal of a statistical or hypothesis test: what is the probability of getting a result like my observed data, if the null hypothesis were true?

Evaluate and compare groups of data

To determine whether a hypothesis can be retained or should be rejected and modified

can refer to a single group

can also refer to two groups

Steps for a hypothesis Test

1. Set up the null and alternative hypotheses: Ho and Ha.

2. Take a random sample of individuals from the population and calculate the sample statistics (means and standard deviations).

3. Convert the sample statistic to a test statistic by changing it to a standard score (all formulas for test statistics are provided later in this chapter).

4. Find the p-value for your test statistic.

5. Examine your p-value and make your decision.
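The five steps can be walked through with a one-sample z-test. This is a sketch under stated assumptions: the data, the hypothesized mean of 100, and the known population standard deviation of 15 are all invented for illustration.

```python
from statistics import NormalDist, mean

# Step 1: set up the hypotheses. Ho: mu = 100, Ha: mu != 100 (hypothetical).
mu0 = 100.0
sigma = 15.0  # population standard deviation, assumed known

# Step 2: take a sample and calculate the sample statistic (made-up data).
sample = [112, 98, 105, 120, 101, 109, 95, 117, 108, 103]
x_bar = mean(sample)

# Step 3: convert the sample statistic to a standard score.
z = (x_bar - mu0) / (sigma / len(sample) ** 0.5)

# Step 4: find the two-tailed p-value for the test statistic.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: examine the p-value and make the decision.
alpha = 0.05
decision = "reject Ho" if p_value < alpha else "fail to reject Ho"
print(f"z={z:.3f}, p={p_value:.3f}: {decision}")
```

With these numbers the p-value is well above 0.05, so the sample gives no grounds for rejecting Ho.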

Structure of Hypothesis Tests

1. Choose the appropriate test.

2. Establish the null and alternate hypotheses.

3. Decide on an acceptable error rate α.

4. Compute the test statistic from the data.

5. Compute the p-value.

6. Reject the null hypothesis if p < α.

Sampling Distributions

Major parametric test statistics -

Z distribution

T distribution

Chi-square

F distribution

Sample size is the key

Sampling test Distributions

Four common probability distributions of sample statistics: z, t, chi-square, and F

Z distribution

Represents the probability distribution of a random variable that is the ratio of the difference between a sample statistic and its population value to the standard deviation of the population statistic

Student's t Distribution

Chi-square Distribution

represents the probability distribution of a variable that is the square of values from a standard normal distribution

bounded by 0 and infinity

used for interval estimation of population variances

can also be used to determine the probability of obtaining a sample difference (or one smaller or larger) between observed values and those predicted by a model

F Distribution

represents the probability distribution of a variable that is the ratio of two independent chi-square variables, each divided by its df (degrees of freedom) (Hays 1994).

Because variances are distributed as χ², the F distribution is used for testing hypotheses about ratios of variances.

bounded by zero and infinity.

Used to determine the probability of obtaining a sample variance ratio (or one larger) for a specified value of the true ratio between variances

Hypothesis Testing

Null Hypothesis (H0) & Alternate Hypothesis (H1)

H0: μ = μ0 / H1: μ ≠ μ0 (two-tailed test)

H0: μ = μ0 / H1: μ > μ0 or μ < μ0 (one-tailed test)

Types of hypothesis tests

Associations and Differences

Relationship between variables Associations and Differences

Association- The relationship between the wing length and weight of a growing bird

Difference- The relationship between the mean tail length of Gull-billed Tern and the mean tail length of Common Tern.

Difference of mean tests

One sample t-test

Two independent samples t-test

t = (x̄ − μ) / SE, with SE = s/√n for a single sample

where t represents the test statistic (the observed difference in units of its standard error)
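A minimal sketch of the one-sample case, assuming the usual form t = (sample mean − hypothesized mean) / (s/√n) with n − 1 degrees of freedom. The measurements and the hypothesized mean of 5.0 are made up; the p-value would then be looked up in a t table or computed with a statistics package.

```python
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mean = mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    t = (statistics.mean(sample) - mu0) / standard_error
    return t, n - 1

# Hypothetical measurements tested against a hypothesized mean of 5.0.
t, df = one_sample_t([5.1, 4.9, 5.6, 5.8, 5.2, 5.4], 5.0)
print(f"t={t:.3f} with {df} degrees of freedom")
```

For two independent samples the numerator becomes the difference between the two sample means and the standard error combines both sample variances.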

Paired samples t-test

K independent samples (k > 2)

- ANOVA (Analysis of Variance)

One way ANOVA

Two way ANOVA

Difference of mean tests (Non parametric)

- One sample

Runs test

- Two independent samples

Kolmogorov-Smirnov test

Mann Whitney U test

Difference of mean tests (Non parametric)

- Paired samples

Wilcoxon signed ranks test

McNemar's test

Marginal homogeneity test

- K independent samples

Kruskal-Wallis test

Friedman's rank test

Test of Proportions, ratios and indices

Chi-square test

Goodness of fit
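A goodness-of-fit statistic compares observed counts with the counts expected under the null hypothesis. The sketch below reuses the chicken example from the P-value slides (17 males, and therefore 31 females, out of 48, with 24 of each expected). The tail-probability shortcut via `math.erfc` is valid only for 1 degree of freedom and is used here to avoid an external library.

```python
import math

def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [17, 31]       # males, females
expected = [24.0, 24.0]   # 50 percent of 48 each under the null hypothesis

chi2 = chi_square_stat(observed, expected)

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square={chi2:.3f}, p={p_value:.3f}")
```

This test is two-sided (it ignores the direction of the deviation), so its p-value differs from the one-tailed binomial probability quoted earlier.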

Correlation

Pearson's product moment correlation (r)

To investigate the linear relationship between two continuous variables

r ranges from −1 to +1
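The correlation coefficient can be computed from first principles, assuming the standard definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and example values are illustrative.

```python
def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative relationship
```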

Correlation

Scatter plots with various correlations

Regression

Prediction is made on the assumption the hypothesis is correct

Simple linear regression

Investigate relationships- Dependent and independent variable

Best fit linear line describes relation between X and Y

Regression coefficient (slope) / Coefficient of determination (R²)
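A minimal sketch of simple linear regression by ordinary least squares, returning the slope (regression coefficient), the intercept, and the coefficient of determination. The data points are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give the proportion of variance explained.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: X is the independent variable, Y the dependent variable.
slope, intercept, r_squared = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"Y = {intercept:.2f} + {slope:.2f} X, R squared = {r_squared:.4f}")
```

R squared close to 1 means the fitted line accounts for almost all of the variation in Y.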

Regression

Regression lines by gender and parity status for predicting weight at 1 month of age in term babies

Classification of some hypothesis tests

Summary of Statistical Tests

Common Errors of statistical analysis

Samples are not random

Sample size is too low for any meaningful interpretation

Non-independence of sample data

Overuse of non-parametric statistics, even with low sample size

Failure to do a graphical exploration

Common Errors of statistical analysis

Power analysis and effect size

Interpreting simple correlation as cause and effect

Use of complex model and multivariate statistics without verifying the merit of the data

Power of a test

Measure of the likelihood of a test reaching a correct conclusion: the probability of rejecting the null hypothesis when it is actually false
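Power can be computed directly for a simple case. The sketch below assumes an upper-tailed z-test with known σ; the function name and all numeric values are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test of H0: mu = mu0 when the true mean is mu1."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)            # rejection cutoff on the z scale
    shift = (mu1 - mu0) / (sigma / n ** 0.5)   # true effect in standard errors
    return 1 - z.cdf(critical - shift)

for n in (25, 100):
    print(n, round(power_one_sided_z(100, 105, 15, n), 3))
```

Power rises with sample size and with the size of the true effect; when the true mean equals the null value, the rejection rate falls back to α.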
