Introduction to Basic Statistics

Post on 03-Jan-2016



Introduction to Basic Statistics

Mean

The mean is the sum of the values of a set of data divided by the number of values in that data set.

$$\bar{x} = \frac{\Sigma x}{n}$$

$\bar{x}$ = mean (pronounced “X-bar”)

x = individual data value

n = number of data values in the data set

Σ = summation of a set of values

Data Set: 3 7 12 17 21 21 23 27 32 36 44

Sum of the values = 243

Number of values = 11

$$\bar{x} = \frac{\Sigma x}{n} = \frac{243}{11} \approx 22.09$$
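The same calculation can be checked in a few lines of Python. This is only a sketch: the variable names are illustrative, and the cross-check uses `statistics.mean` from the standard library.

```python
from statistics import mean

data = [3, 7, 12, 17, 21, 21, 23, 27, 32, 36, 44]

total = sum(data)    # sum of the values
n = len(data)        # number of values in the data set
x_bar = total / n    # mean = sum of the values / number of values

print(total, n, round(x_bar, 2))
print(mean(data))    # standard-library cross-check
```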

Mode

The most frequently occurring value in a set of data is the mode.

Symbol… M

Data Set: 27 17 12 7 21 44 23 3 36 32 21

Data Set (arranged in order): 3 7 12 17 21 21 23 27 32 36 44

Mode = 21

Note: If two numbers of equal frequency stand out, then the data set is “bimodal.” If more than two numbers of equal frequency stand out, then the data set is “multi-modal.”
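As a rough illustration, the mode can be found by counting how often each value occurs. The sketch below uses `statistics.multimode`, which returns every value tied for the highest frequency, so it also covers the bimodal case. The second data set is made up purely to show a tie.

```python
from collections import Counter
from statistics import multimode

data = [27, 17, 12, 7, 21, 44, 23, 3, 36, 32, 21]

counts = Counter(data)     # how many times each value occurs
modes = multimode(data)    # all values with the highest frequency

print(counts[21])          # 21 occurs twice, every other value once
print(modes)

# Hypothetical data with two equally frequent values (a "bimodal" set)
print(multimode([1, 1, 2, 2, 3]))
```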

Median

The median is the value that occurs in the middle of a set of data that has been arranged in numerical order.

Symbol… $\tilde{x}$ (pronounced “X-tilde”)

Data Set: 27 17 12 7 21 44 23 3 36 32 21

Data Set (arranged in order): 3 7 12 17 21 21 23 27 32 36 44

Median = 21

Note: A data set that contains an odd number of values always has a middle value, which is the median. For an even number of values, the two middle values are averaged, and the result is the median.
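The odd and even cases from the note can be sketched as follows. The helper name `middle_value` is illustrative, and the four-value data set is made up to show the averaging rule. `statistics.median` from the standard library gives the same result.

```python
from statistics import median

def middle_value(values):
    """Median: sort the values, then take the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

data = [27, 17, 12, 7, 21, 44, 23, 3, 36, 32, 21]

print(middle_value(data))            # odd number of values
print(middle_value([3, 7, 12, 17]))  # even number of values (hypothetical data)
print(median(data))                  # standard-library cross-check
```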

Range

The range is the difference between the largest and smallest values that occur in a set of data.

Symbol… R

Data Set: 3 7 12 17 21 21 23 27 32 36 44

Range = 44 - 3 = 41
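In code, the range is simply the largest value minus the smallest. A minimal sketch, with the name `data_range` chosen to avoid shadowing Python's built-in `range`:

```python
data = [3, 7, 12, 17, 21, 21, 23, 27, 32, 36, 44]

largest = max(data)
smallest = min(data)
data_range = largest - smallest   # range = largest value - smallest value

print(largest, smallest, data_range)
```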

Standard Deviation

Two classes took a recent quiz. There were 10 students in each class, and each class had an average score of 81.5.

Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?

The answer is… No.

The average (mean) does not tell us anything about the distribution or variation in the grades.

Here are Dot-Plots of the grades in each class:

[Figure: dot plots of the grades in each class, with the mean marked]

So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.

Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?

Well, for example, let's say from a set of data, the average is 17.95 and the range is 23.

But what if the data looked like this:

[Figure: plot of the data with the callouts “Here is the average” and “And here is the range”]

But really, most of the numbers are in this area, and are not evenly distributed throughout the range.

The Standard Deviation is a number that measures how far away each number in a set of data is from their mean.

If the Standard Deviation is large, it means the numbers are spread out from their mean.

If the Standard Deviation is small, it means the numbers are close to their mean.

Here are the scores on the math quiz for Team A:

72 76 80 80 81 83 84 85 85 89

Average: 81.5

The Standard Deviation measures how far away each number in a set of data is from their mean.

For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?

72 - 81.5 = -9.5

Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?

89 - 81.5 = 7.5

So, the first step to finding the Standard Deviation is to find all the distances from the mean.

Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
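The distances can be computed with one list comprehension. The sketch below also shows that the signed distances add up to zero, because the negatives and positives cancel around the mean, so they cannot simply be averaged as they are. Variable names are illustrative.

```python
scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

mean = sum(scores) / len(scores)         # 81.5
distances = [x - mean for x in scores]   # signed distance of each score from the mean

print(distances)
print(sum(distances))                    # negatives and positives cancel
```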

Next, you need to square each of the distances to turn them all into positive numbers.

Score   Distance from Mean   Distance Squared
72      -9.5                 90.25
76      -5.5                 30.25
80      -1.5                  2.25
80      -1.5                  2.25
81      -0.5                  0.25
83       1.5                  2.25
84       2.5                  6.25
85       3.5                 12.25
85       3.5                 12.25
89       7.5                 56.25

Add up all of the squared distances.

90.25 + 30.25 + 2.25 + 2.25 + 0.25 + 2.25 + 6.25 + 12.25 + 12.25 + 56.25 = 214.5

Sum: 214.5

Divide by (n - 1), where n represents the number of values you have.

$$\frac{214.5}{10 - 1} \approx 23.8$$

Finally, take the square root of the average squared distance.

$$\sqrt{23.8} \approx 4.88$$

This is the Standard Deviation: 4.88
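The whole procedure (distances, squares, sum, divide by n - 1, square root) fits in a short function. This is a sketch with an illustrative function name. Dividing by n - 1 gives what is usually called the sample standard deviation, which is also what `statistics.stdev` computes, so it serves as a cross-check.

```python
import math
from statistics import stdev

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n
    squared = [(x - mean) ** 2 for x in values]  # squared distances from the mean
    variance = sum(squared) / (n - 1)            # divide the sum by (n - 1)
    return math.sqrt(variance)                   # take the square root

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
sd = sample_standard_deviation(team_a)

print(round(sd, 2))
print(round(stdev(team_a), 2))   # standard-library cross-check
```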

Now find the Standard Deviation for the other class grades.

Score   Distance from Mean   Distance Squared
57      -24.5                600.25
65      -16.5                272.25
83        1.5                  2.25
94       12.5                156.25
95       13.5                182.25
96       14.5                210.25
98       16.5                272.25
93       11.5                132.25
71      -10.5                110.25
63      -18.5                342.25

Sum: 2280.5

$$\frac{2280.5}{10 - 1} \approx 253.4, \qquad \sqrt{253.4} \approx 15.92$$

Now, let's compare the two classes again:

                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.92
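The comparison can be reproduced with the standard library. In this sketch the two score lists are taken from the slides, and `statistics.stdev` divides by n - 1, as the steps above do. The second class comes out at about 15.918, which rounds to 15.92.

```python
from statistics import mean, stdev

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

for name, scores in (("Team A", team_a), ("Team B", team_b)):
    # Same average, very different spread
    print(name, mean(scores), round(stdev(scores), 2))
```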

A histogram is a common data distribution graph that is used to show the frequency with which specific values, or values within ranges, occur in a set of data.

A forensic engineer might use a histogram to show the most common, or average, dimension that exists among a group of identical manufactured parts.

Histogram

[Figure: histogram of the following data values, plotted on an X-axis running from -6 to 6]

0 3 -1 3 2 -1 -1 1 2 -3 0 1 0 1 -2 1 2 -4 -1 1 0 -2 0 0

Specific values, called data elements, are plotted along the X-axis of the graph.

[Figure: histogram with the X-axis labeled “Data Elements”, running from -6 to 6]

Large sets of data are often divided into a limited number of groups. These groups are called class intervals.

[Figure: histogram with the X-axis divided into the class intervals -6 to -16, -5 to 5, and 6 to 16]

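The number of data elements in each class interval is shown by the frequency, which is plotted along the Y-axis of the graph.

[Figure: histogram with Frequency (1, 3, 5, 7) on the Y-axis and the class intervals -6 to -16, -5 to 5, and 6 to 16 on the X-axis]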
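To make the idea of frequency concrete, the sketch below tallies how often each data element occurs and prints a sideways text histogram. It assumes that the 24 values shown with the first histogram slide are the data being plotted. The star-bar layout is just an illustration.

```python
from collections import Counter

# Assumed to be the values plotted on the first histogram slide
values = [0, 3, -1, 3, 2, -1, -1, 1, 2, -3, 0, 1,
          0, 1, -2, 1, 2, -4, -1, 1, 0, -2, 0, 0]

frequency = Counter(values)   # data element -> number of occurrences

# One row per data element along the axis from -6 to 6
for element in range(-6, 7):
    print(f"{element:>3} | {'*' * frequency[element]}")
```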

Normal Distribution

“Is the data normal?”

Translation… does the greatest frequency of the data values occur about the mean value?

[Figure: histogram with Frequency on the Y-axis and Data Elements (-6 to 6) on the X-axis, with the mean value marked]

“Is the process normal?”

Further translation… does the data form a bell-shaped curve when plotted on a histogram?

[Figure: histogram with Frequency on the Y-axis and Data Elements (-6 to 6) on the X-axis]
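The first “translation” can be checked informally by comparing the most frequent value with the mean. The sketch below does only that. It is not a formal test of normality, and it assumes the 24 values from the histogram slide are the data in question.

```python
from statistics import mean, multimode

# Assumed to be the values plotted on the histogram slide
values = [0, 3, -1, 3, 2, -1, -1, 1, 2, -3, 0, 1,
          0, 1, -2, 1, 2, -4, -1, 1, 0, -2, 0, 0]

center = mean(values)                         # mean value of the data
peaks = multimode(values)                     # value(s) with the greatest frequency
gap = min(abs(p - center) for p in peaks)     # distance from the nearest peak to the mean

print(center, peaks, gap)
```

A small gap only says that the peak sits near the mean. The bell shape itself still has to be judged from the histogram.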

Basic Biostat 5: Probability Concepts 47

Chapter 5: Probability Concepts


In Chapter 5:

5.1 What is Probability?

5.2 Types of Random Variables

5.3 Discrete Random Variables

5.4 Continuous Random Variables

5.5 More Rules and Properties of Probability

Definitions

• Random variable ≡ a numerical quantity that takes on different values depending on chance

• Population ≡ the set of all possible values for a random variable

• Event ≡ an outcome or set of outcomes

• Probability ≡ the relative frequency of an event in the population … alternatively… the proportion of times an event is expected to occur in the long run

Example

• In a given year: 42,636 traffic fatalities (events) in a population of N = 293,655,000

• Random sample population

• Probability of event = relative freq in pop = 42,636 / 293,655,000 = .0001452 ≈ 1 in 6887
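The arithmetic in this example is a single division. A minimal sketch, with illustrative variable names:

```python
events = 42_636            # traffic fatalities in the year
population = 293_655_000   # N

probability = events / population   # relative frequency in the population
one_in = 1 / probability            # the "1 in ..." form

print(f"{probability:.7f}")
print(round(one_in))
```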


Example: Probability

• Assume 20% of the population has a condition

• Repeatedly sample the population

• The proportion of observations positive for the condition approaches 0.2 after a very large number of trials
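The long-run behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions: each observation is drawn independently, the function name and the seed are arbitrary, and the exact proportions printed depend on the random number generator.

```python
import random

def positive_proportion(n_trials, prevalence=0.2, seed=0):
    """Proportion of simulated observations that are positive for the condition."""
    rng = random.Random(seed)   # fixed seed so repeated runs give the same result
    positives = sum(rng.random() < prevalence for _ in range(n_trials))
    return positives / n_trials

# The proportion tends to settle near 0.2 as the number of trials grows
for n in (10, 100, 10_000, 100_000):
    print(n, positive_proportion(n))
```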

Random Variables

• Random variable ≡ a numerical quantity that takes on different values depending on chance

• Two types of random variables
  – Discrete random variables (countable set of possible outcomes)
  – Continuous random variables (unbroken chain of possible outcomes)

Example: Discrete Random Variable

• Treat 4 patients with a drug that is 75% effective

• Let X ≡ the [variable] number of patients that respond to treatment

• X is a discrete random variable that can be either 0, 1, 2, 3, or 4 (a countable set of possible outcomes)

Example: Discrete Random Variable

• Discrete random variables are understood in terms of their probability mass function (pmf)

• pmf ≡ a mathematical function that assigns probabilities to all possible outcomes for a discrete random variable.

• This table shows the pmf for our “four patients” example:

x         0        1        2        3        4
Pr(X=x)   0.0039   0.0469   0.2109   0.4219   0.3164
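The slide does not say how the table was computed. Its values are consistent with a binomial model in which each of the 4 patients responds independently with probability 0.75. That independence is an assumption of the sketch below, and the function name is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr(X = k) for n independent trials, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_patients = 4
effectiveness = 0.75

pmf = [binomial_pmf(k, n_patients, effectiveness) for k in range(n_patients + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))

print(sum(pmf))   # the probabilities of all possible outcomes add up to 1
```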

Random Variable

• A random variable x takes on a defined set of values with different probabilities.

• For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth.

• For example, if you poll people about their voting preferences, the percentage of the sample that responds “Yes on Proposition 100” is also a random variable (the percentage will be slightly different every time you poll).

• Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over (“frequentist” view)

Random variables can be discrete or continuous

• Discrete random variables have a countable number of outcomes
  – Examples: Dead/alive, treatment/placebo, dice, counts, etc.

• Continuous random variables have an infinite continuum of possible values.
  – Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.

Probability functions

• A probability function maps the possible values of x against their respective probabilities of occurrence, p(x)

• p(x) is a number from 0 to 1.0.

• The area under a probability function is always 1.

Discrete example: roll of a die

[Figure: bar chart of p(x) against x = 1, 2, 3, 4, 5, 6, with every bar at height 1/6]

$$\sum_{\text{all } x} P(x) = 1$$

Probability mass function (pmf)

x     p(x)
1     p(x=1)=1/6
2     p(x=2)=1/6
3     p(x=3)=1/6
4     p(x=4)=1/6
5     p(x=5)=1/6
6     p(x=6)=1/6
Sum   1.0

Cumulative distribution function

x     P(x≤A)
1     P(x≤1)=1/6
2     P(x≤2)=2/6
3     P(x≤3)=3/6
4     P(x≤4)=4/6
5     P(x≤5)=5/6
6     P(x≤6)=6/6
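Both tables for the die can be generated from the pmf: the cumulative distribution function is a running total of the probabilities. The sketch below uses `fractions.Fraction` to keep the sixths exact. Note that `Fraction` prints values in lowest terms, so 2/6 appears as 1/3.

```python
from fractions import Fraction
from itertools import accumulate

faces = range(1, 7)

# pmf: each face of a fair die has probability 1/6
pmf = {x: Fraction(1, 6) for x in faces}

# cdf: P(x <= A) is the running total of the pmf up to and including A
cdf = dict(zip(faces, accumulate(pmf[x] for x in faces)))

for x in faces:
    print(x, pmf[x], cdf[x])

print(sum(pmf.values()))   # all probabilities together
```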

• For evidence yielding full single-source DNA profiles, laboratories will typically use random match probabilities, which are based on genotype frequency estimates, while others will use a likelihood ratio under the primary hypothesis that the suspect is the source of the DNA profile versus the alternate hypothesis that an unrelated, untested individual from the general population was the DNA donor.