Date post: | 03-Jan-2016 |
Upload: | teagan-miller |
Introduction to Basic Statistics
Mean
Symbol: $\bar{x}$ (pronounced “X-bar”)
The mean is the sum of the values of a set of data divided by the number of values in that data set.
$\bar{x} = \frac{\sum x}{n}$
x = individual data value
n = number of data values in the data set
Σ = summation of a set of values
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Sum of the values = 243
Number of values = 11
$\bar{x} = \frac{\sum x}{n} = \frac{243}{11} \approx 22.09$
Mode
Symbol: M
The most frequently occurring value in a set of data is the mode.
Data Set (as collected): 27 17 12 7 21 44 23 3 36 32 21
Data Set (sorted): 3 7 12 17 21 21 23 27 32 36 44
Mode = 21
Note: If two values tie for the highest frequency, then the data set is “bimodal.” If more than two values tie for the highest frequency, then the data set is “multi-modal.”
Median
Symbol: $\tilde{x}$ (pronounced “X-tilde”)
The median is the value that occurs in the middle of a set of data that has been arranged in numerical order.
Data Set (as collected): 27 17 12 7 21 44 23 3 36 32 21
Data Set (sorted): 3 7 12 17 21 21 23 27 32 36 44
Median = 21
Note: For an odd number of values, the median is the single middle value. For an even number of values, the two middle values are averaged, and the result is the median.
Range
Symbol: R
The range is the difference between the largest and smallest values that occur in a set of data.
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Range = 44 - 3 = 41
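The four measures above can be checked with a short Python sketch. It is only an illustration: it uses the standard-library `statistics` module on the data set from the slides, and the variable names are assumptions chosen for readability.

```python
import statistics

# Data set from the slides (order does not matter for these measures).
data = [27, 17, 12, 7, 21, 44, 23, 3, 36, 32, 21]

mean = sum(data) / len(data)         # sum of the values / number of values
median = statistics.median(data)     # middle value once the data is sorted
modes = statistics.multimode(data)   # every value tied for the highest frequency
data_range = max(data) - min(data)   # largest value minus smallest value

print(f"Mean   = {mean:.2f}")
print(f"Median = {median}")
print(f"Mode   = {modes}")
print(f"Range  = {data_range}")
```

`multimode` returns a list, so a bimodal or multi-modal data set would show more than one value here.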
Standard Deviation
Two classes took a recent quiz. There were 10 students in each class, and each class had an average score of 81.5.
Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the quiz?
The answer is… No.
The average (mean) does not tell us anything about the distribution or variation in the grades.
Here are dot plots of the grades in each class:
[Figure: dot plots of the grades for each class, with the mean marked.]
So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.
Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?
Well, for example, let's say that from a set of data, the average is 17.95 and the range is 23.
But what if the data looked like this:
[Figure: plot of the data with the average and the range marked.]
But really, most of the numbers are clustered in one area, and are not evenly distributed throughout the range.
The Standard Deviation is a number that measures how far away the numbers in a set of data are from their mean.
If the Standard Deviation is large, it means the numbers are spread out from their mean.
If the Standard Deviation is small, it means the numbers are close to their mean.
Here are the scores on the math quiz for Team A:
72 76 80 80 81 83 84 85 85 89
Average: 81.5
The Standard Deviation measures how far away each number in a set of data is from their mean.
For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?
72 - 81.5 = -9.5
Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the Standard Deviation is to find all the distances from the mean.

Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
Next, you need to square each of the distances to turn them all into positive numbers.

Score   Distance from Mean   Distance Squared
72      -9.5                 90.25
76      -5.5                 30.25
80      -1.5                  2.25
80      -1.5                  2.25
81      -0.5                  0.25
83       1.5                  2.25
84       2.5                  6.25
85       3.5                 12.25
85       3.5                 12.25
89       7.5                 56.25
Add up all of the squared distances.
Sum of squared distances = 214.5
Divide by (n - 1), where n represents the number of values you have.
$\frac{214.5}{10 - 1} \approx 23.8$
Finally, take the square root of that result (the average squared distance).
$\sqrt{23.8} \approx 4.88$
This is the Standard Deviation for Team A: 4.88
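The steps above can be sketched in Python, one line per step. This is a minimal illustration on the Team A scores, not a general-purpose routine; the variable names are assumptions.

```python
import math

# Team A quiz scores from the slides.
team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

n = len(team_a)
mean = sum(team_a) / n                    # average score
distances = [x - mean for x in team_a]    # step 1: distance of each score from the mean
squared = [d ** 2 for d in distances]     # step 2: square each distance
total = sum(squared)                      # step 3: add up the squared distances
variance = total / (n - 1)                # step 4: divide by (n - 1)
std_dev = math.sqrt(variance)             # step 5: take the square root

print(f"Mean = {mean}")
print(f"Sum of squared distances = {total}")
print(f"Sum / (n - 1) = {variance:.1f}")
print(f"Standard Deviation = {std_dev:.2f}")
```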
Now find the Standard Deviation for the other class's grades.

Score   Distance from Mean   Distance Squared
57      -24.5                600.25
65      -16.5                272.25
83        1.5                  2.25
94       12.5                156.25
95       13.5                182.25
96       14.5                210.25
98       16.5                272.25
93       11.5                132.25
71      -10.5                110.25
63      -18.5                342.25

Sum of squared distances = 2280.5
$\frac{2280.5}{10 - 1} \approx 253.4$
$\sqrt{253.4} \approx 15.92$
Now, let's compare the two classes again.

                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.92
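The comparison can be reproduced with a brief sketch, assuming Python's standard-library `statistics.stdev`, which divides by (n - 1) as in the steps worked through for each class. The list names `team_a` and `team_b` are illustrative.

```python
import statistics

# Quiz scores for the two classes, as listed in the slides.
team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

for name, scores in (("Team A", team_a), ("Team B", team_b)):
    average = statistics.mean(scores)
    spread = statistics.stdev(scores)   # sample standard deviation: divides by (n - 1)
    print(f"{name}: average = {average}, standard deviation = {spread:.2f}")
```

Both classes have the same average, but the standard deviation for Team B is more than three times larger, which is the difference the average alone hides.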
Histogram
A histogram is a common data distribution graph that is used to show the frequency with which specific values, or values within ranges, occur in a set of data.
A forensic engineer might use a histogram to show the most common, or average, dimension that exists among a group of identical manufactured parts.
[Figure: histogram with the X-axis running from -6 to 6.]
Data Set: 0 3 -1 3 2 -1 -1 1 2 -3 0 1 0 1 -2 1 2 -4 -1 1 0 -2 0 0
Specific values, called data elements, are plotted along the X-axis of the graph.
[Figure: histogram X-axis labeled “Data Elements,” with ticks from -6 to 6.]
Large sets of data are often divided into a limited number of groups. These groups are called class intervals.
[Figure: histogram X-axis divided into the class intervals -16 to -6, -5 to 5, and 6 to 16.]
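The number of data elements at each value, or within each class interval, is shown by the frequency, which is plotted along the Y-axis of the graph.
[Figure: histogram with Frequency (1, 3, 5, 7) on the Y-axis and the class intervals -16 to -6, -5 to 5, and 6 to 16 on the X-axis.]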
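As a rough illustration of how the frequencies behind a histogram are tallied, the sketch below counts the 24 data elements listed with the histogram slides, first per value and then per class interval. It prints a text histogram rather than drawing a chart; the names and the text layout are assumptions.

```python
from collections import Counter

# The 24 data elements listed with the histogram slides.
data = [0, 3, -1, 3, 2, -1, -1, 1, 2, -3, 0, 1,
        0, 1, -2, 1, 2, -4, -1, 1, 0, -2, 0, 0]

# Frequency of each data element (the height of each histogram bar).
frequency = Counter(data)
for value in range(-6, 7):
    print(f"{value:>3} | {'*' * frequency[value]}")

# Frequency within each class interval (bounds are inclusive).
intervals = [(-16, -6), (-5, 5), (6, 16)]
by_interval = {
    f"{low} to {high}": sum(1 for v in data if low <= v <= high)
    for low, high in intervals
}
print(by_interval)
```

With class intervals this wide, every value in this small data set falls in the middle group, which shows why the choice of interval width affects how much detail a histogram reveals.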
Normal Distribution
“Is the data normal?”
Translation: does the greatest frequency of the data values occur about the mean value?
[Figure: histogram of Frequency against Data Elements (-6 to 6), with the mean value marked.]
“Is the process normal?”
Further translation: does the data form a bell-shaped curve when plotted on a histogram?
[Figure: histogram of Frequency against Data Elements (-6 to 6).]
Basic Biostat 5: Probability Concepts 47
Chapter 5: Probability Concepts
In Chapter 5:
5.1 What is Probability?
5.2 Types of Random Variables
5.3 Discrete Random Variables
5.4 Continuous Random Variables
5.5 More Rules and Properties of Probability
Definitions
• Random variable ≡ a numerical quantity that takes on different values depending on chance
• Population ≡ the set of all possible values for a random variable
• Event ≡ an outcome or set of outcomes
• Probability ≡ the relative frequency of an event in the population; alternatively, the proportion of times an event is expected to occur in the long run
Example
• In a given year: 42,636 traffic fatalities (events) in a population of N = 293,655,000
• Randomly sample the population
• Probability of event = relative frequency in the population = 42,636 / 293,655,000 = 0.0001452 ≈ 1 in 6887
Example: Probability
• Assume 20% of the population has a condition
• Repeatedly sample the population
• The proportion of observations positive for the condition approaches 0.2 after a very large number of trials
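The long-run behavior described in this example can be illustrated with a small simulation. This is a sketch under stated assumptions: each observation is drawn independently with a 20% chance of being positive, the seed and the checkpoint sizes are arbitrary choices, and the names are hypothetical.

```python
import random

rng = random.Random(42)   # fixed seed so the run is repeatable (arbitrary choice)
prevalence = 0.20         # assumed share of the population with the condition

checkpoints = {10, 100, 1_000, 10_000, 100_000}
proportions = {}
positives = 0

for trial in range(1, 100_001):
    # One random observation: positive with probability equal to the prevalence.
    if rng.random() < prevalence:
        positives += 1
    if trial in checkpoints:
        proportions[trial] = positives / trial
        print(f"After {trial:>7} observations: proportion positive = {proportions[trial]:.4f}")
```

Early proportions can wander well away from 0.2; the later ones settle close to it.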
Random Variables
• Random variable ≡ a numerical quantity that takes on different values depending on chance
• Two types of random variables
  – Discrete random variables (countable set of possible outcomes)
  – Continuous random variables (unbroken chain of possible outcomes)
Example: Discrete Random Variable
• Treat 4 patients with a drug that is 75% effective
• Let X ≡ the [variable] number of patients that respond to treatment
• X is a discrete random variable that can be 0, 1, 2, 3, or 4 (a countable set of possible outcomes)
Example: Discrete Random Variable
• Discrete random variables are understood in terms of their probability mass function (pmf)
• pmf ≡ a mathematical function that assigns probabilities to all possible outcomes for a discrete random variable.
• This table shows the pmf for our “four patients” example:
x 0 1 2 3 4
Pr(X=x) 0.0039 0.0469 0.2109 0.4219 0.3164
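The probabilities in this table can be reproduced with a short sketch, assuming each of the 4 patients responds independently with probability 0.75. That independence assumption gives the binomial formula, which the slide does not name but which matches its numbers.

```python
from math import comb

n = 4      # number of patients treated
p = 0.75   # probability that any one patient responds

# Binomial pmf: Pr(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

for x, prob in pmf.items():
    print(f"Pr(X={x}) = {prob:.4f}")

print(f"Total = {sum(pmf.values()):.4f}")
```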
Random Variable
• A random variable x takes on a defined set of values with different probabilities.
• For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth.
• For example, if you poll people about their voting preferences, the percentage of the sample that responds “Yes on Proposition 100” is also a random variable (the percentage will be slightly different every time you poll).
• Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over (“frequentist” view)
Random variables can be discrete or continuous
• Discrete random variables have a countable number of outcomes
  – Examples: dead/alive, treatment/placebo, dice, counts, etc.
• Continuous random variables have an infinite continuum of possible values.
  – Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.
Probability functions
• A probability function maps the possible values of x against their respective probabilities of occurrence, p(x)
• p(x) is a number from 0 to 1.0.
• The area under a probability function is always 1.
Discrete example: roll of a die
[Figure: bar chart of p(x) against x = 1, 2, 3, 4, 5, 6, with every bar at height 1/6.]
$\sum_{\text{all } x} P(x) = 1$
Probability mass function (pmf)
x    p(x)
1    p(x=1) = 1/6
2    p(x=2) = 1/6
3    p(x=3) = 1/6
4    p(x=4) = 1/6
5    p(x=5) = 1/6
6    p(x=6) = 1/6
Total: 1.0
Cumulative distribution function
x P(x≤A)
1 P(x≤1)=1/6
2 P(x≤2)=2/6
3 P(x≤3)=3/6
4 P(x≤4)=4/6
5 P(x≤5)=5/6
6 P(x≤6)=6/6
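The two tables for the die can be built side by side with a small sketch. It assumes a fair six-sided die and uses exact fractions so the running totals match the table entries; the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = range(1, 7)

# Probability mass function: each face of a fair die has probability 1/6.
pmf = {x: Fraction(1, 6) for x in outcomes}

# Cumulative distribution function: running total of the pmf, P(x <= A).
cdf = dict(zip(outcomes, accumulate(pmf[x] for x in outcomes)))

for x in outcomes:
    print(f"x = {x}: p(x) = {pmf[x]}, P(x <= {x}) = {cdf[x]}")

print(f"Sum of p(x) over all x = {sum(pmf.values())}")
```

`Fraction` reduces automatically, so 2/6 prints as 1/3 and 6/6 prints as 1.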
• For evidence yielding full single-source DNA profiles, some laboratories will typically use random match probabilities, which are based on genotype frequency estimates, while others will use a likelihood ratio comparing the primary hypothesis that the suspect is the source of the DNA profile against the alternate hypothesis that an unrelated, untested individual from the general population was the DNA donor.