Lecture 1 1
Lecture I
Definition 1. Statistics is the science of
collecting, organizing, summarizing, and analyzing
information in order to draw conclusions.
It is a process consisting of 3 parts.
May 14, 2012
First part: collecting data
• Identify the research objective: the group that
is to be studied is called the population. A member
of the population is called an individual, element,
or subject.
• Collect the information needed to answer the
questions posed: typically look at a subset of the
population, called a sample.
Example: what is a placebo, experimental group,
treatment, control group, double-blind
experiment, single-blind experiment
Second part: Analyze and present
the information
This step is called descriptive statistics, or
exploratory data analysis. It uses tables, charts,
graphs, etc. to describe the data collected.
Third part: Draw conclusions from
the information
This part is called inferential statistics.
We cannot learn everything about the population
just by looking at a sample! But we might be
able to say something with a certain level of
confidence.
Types of data
Definition 2. The characteristics of the
individuals within the population that we
decide to study are called variables.
Definition 3. A characteristic of a population is
called a parameter, while a characteristic of a
sample is called a statistic.
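To make the parameter/statistic distinction concrete, here is a minimal sketch. The population values are made up for illustration, and the mean is just one possible characteristic; any other summary would do.

```python
import random
import statistics

# Hypothetical population: ages of N = 10 individuals (made-up values).
population = [21, 34, 29, 45, 52, 38, 27, 41, 33, 60]

# Parameter: a characteristic computed from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same characteristic computed from a sample only.
rng = random.Random(0)              # fixed seed so the run is repeatable
sample = rng.sample(population, 4)  # 4 individuals chosen without replacement
sample_mean = statistics.mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The statistic will usually differ from the parameter, and it changes from sample to sample.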
Definition 4. An observation is the set of
values of the variables for a given individual.
Variables can be classified into two groups:
Definition 5. Qualitative or categorical
variables allow for classification of individuals
based on some attribute or characteristic.
Quantitative variables provide numerical
measures of individuals. Arithmetic operations
can be performed on the values of a quantitative
variable and provide meaningful results.
Examples:
The distribution of a variable tells us what
values it takes and how often it takes these values.
Quantitative variables can be classified into two
types:
Definition 6. A discrete variable is a
quantitative variable whose possible values can
be counted, for example 0, 1, 2, 3, 4, 5.
Examples:
A continuous variable is a quantitative variable
that has an infinite number of possible values that
are not countable.
Examples:
The list of observations a variable assumes is
called data. Data could be classified in the same
categories as variables.
Data can be obtained from four sources:
1. A census
2. Existing sources
3. Survey sampling
4. Designed experiments
Definition 7. A census is a list of all
individuals in a population along with certain
characteristics of each individual.
Existing data: Don’t collect data that have
already been collected.
Survey sampling is used when no attempt is made to
influence the value of the variable of interest.
Examples: Polling, ....
Data obtained from a survey sample lead to an
observational study. Such studies are sometimes
referred to as ex post facto (after the fact)
studies because the value of the variable of
interest has already been established.
A designed experiment (or experimental
study) applies a treatment to individuals
(referred to as experimental units) and
attempts to isolate the effects of the treatment on
a response variable.
Observational studies are very useful tools for
determining whether there is a relation between
two variables, but a designed experiment is
required to isolate the cause of the relation.
If control is possible, an experiment should be
performed. If control is not possible or not
necessary, then observational studies are
appropriate.
The design of experiments
We will discuss obtaining data through an
experiment.
A designed experiment is a controlled study in
which one or more treatments are applied to
experimental units. The experimenter then
observes the effect of varying these treatments on
a response variable. Control, manipulation,
randomization, and replication are the key
ingredients of a well-designed experiment.
The experimental unit, or subject, is the
equivalent of the individual in the sample. It is a
well-defined item to which a treatment
(condition) is applied.
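As an illustration of the randomization ingredient, the sketch below randomly splits experimental units into a treatment group and a control group. The function name `assign_groups`, the subject labels, and the even two-way split are assumptions made for the example, not a prescribed design.

```python
import random

def assign_groups(units, seed=None):
    """Randomly split experimental units into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)       # chance, not the experimenter, decides the order
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical experimental units.
units = [f"subject_{i}" for i in range(1, 11)]
groups = assign_groups(units, seed=1)
print(groups)
```

Because the split is decided by chance, the experimenter cannot (even unintentionally) steer particular subjects toward the treatment.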
A response variable is a quantitative or
qualitative variable that represents our variable of
interest.
A predictor variable is a characteristic
purported to explain differences in the response
variable.
Sampling
How can a researcher obtain accurate information
about the population through the sample while
minimizing the costs?
There are 5 types of sampling:
• simple random sampling
• stratified sampling
• systematic sampling
• cluster sampling
• convenience sampling
In the first four cases the sampling methods are
based on planned randomization. The surveyor
does not have a choice as to who is in the study.
Simple random sampling
Definition 8. If the population is of size N and
we want a sample of size n (n < N), simple
random sampling is achieved if every possible
sample of size n is equally likely to occur.
The sample is then called a simple random sample.
Examples:
How do we obtain such a sample?
1. Using a hat, if the population is small!
2. Using random numbers, if the population is
large:
(a) Number the individuals in the population
from 1 to N. (This means that we must
have the frame, the list of all individuals
in the population!)
(b) Select n random numbers between 1 and N
using a table of random numbers or using
your calculator.
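The random-number approach can be sketched in a few lines of Python, with the standard library's `random.sample` standing in for the table or calculator. The function name `simple_random_sample` and the values N = 500, n = 10 are assumptions for illustration.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Pick n distinct labels from the frame 1..N, every subset equally likely."""
    rng = random.Random(seed)
    frame = range(1, N + 1)       # step (a): individuals numbered 1 to N
    return rng.sample(frame, n)   # step (b): n random labels, no repeats

chosen = simple_random_sample(N=500, n=10, seed=42)
print(sorted(chosen))
```

`random.sample` draws without replacement, so no individual can be selected twice.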
Using the table:
• Select a starting point.
• Look for numbers that have as many digits as
N has.
• If a number is repeated, discard it.
• If a number is larger than N, discard it.
• Stop when you obtain n numbers.
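The table procedure above can be mimicked as follows. This is only a sketch: the digit string is made up rather than taken from a real random-number table, the function name `sample_from_digit_table` is hypothetical, and discarding 0 is an added assumption (labels run from 1 to N, so 0 names nobody).

```python
def sample_from_digit_table(digits, N, n, start=0):
    """Read a string of random digits in blocks as wide as N and keep valid labels."""
    width = len(str(N))                  # numbers with as many digits as N has
    chosen = []
    for pos in range(start, len(digits) - width + 1, width):
        number = int(digits[pos:pos + width])
        if number == 0 or number > N:    # larger than N (or 0): discard
            continue
        if number in chosen:             # repeated: discard
            continue
        chosen.append(number)
        if len(chosen) == n:             # stop once we have n numbers
            break
    return chosen

# Blocks read: 83, 42, 10, 97, 55, 03, 12, ...
print(sample_from_digit_table("8342109755031207", N=50, n=4))  # [42, 10, 3, 12]
```

If the digits run out before n labels are found, the function returns the shorter list; with a real table one would simply keep reading.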
Stratified Sampling
Definition 9. A stratified sample is obtained
by separating the population into nonoverlapping
groups called strata and then obtaining a simple
random sample from each stratum. The
individuals within each stratum should be
homogeneous (or similar) in some way.
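A minimal sketch of this definition: group the individuals by stratum, then take a simple random sample inside each group. The student data, the function name `stratified_sample`, and the choice of the same sample size in every stratum are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in individuals:              # separate into nonoverlapping groups
        strata[stratum_of(person)].append(person)
    sample = []
    for name in sorted(strata):             # fixed order keeps runs repeatable
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample

# Hypothetical population: (student, class year) pairs.
students = [(f"s{i}", year)
            for i, year in enumerate(["freshman"] * 6 + ["senior"] * 4)]
picked = stratified_sample(students, stratum_of=lambda s: s[1],
                           n_per_stratum=2, seed=3)
print(picked)
```

Every stratum is guaranteed to appear in the sample, which a simple random sample of the whole population does not guarantee.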
Definition 10. A systematic sample is
obtained by selecting every kth individual from
the population. The first individual selected
corresponds to a random number between 1 and k.
• Does not require a frame!
• How do we obtain a systematic sample without
a frame? How do we establish k?
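One possible answer, sketched under the assumption that the population size N is known at least approximately: take k as N divided by n, rounded down, so that n selections fit inside the population. The function name `systematic_sample` and this choice of k are illustrative, not the only option.

```python
import random

def systematic_sample(N, n, seed=None):
    """Select every kth individual, starting from a random point in 1..k."""
    rng = random.Random(seed)
    k = N // n                    # assumed rule for k: floor(N / n)
    first = rng.randint(1, k)     # random start between 1 and k (inclusive)
    return [first + i * k for i in range(n)]

positions = systematic_sample(N=1000, n=20, seed=7)
print(positions)
```

Only positions are needed (for example, every kth customer leaving a store), which is why no frame is required.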
Cluster sampling
Definition 11. A cluster sample is obtained
by selecting all individuals within a randomly
selected collection or group of individuals.
How do we obtain a cluster sample?
• Randomly select the clusters (using simple
random sampling, for example).
• Survey all the individuals in the selected
clusters.
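The two steps can be sketched as follows. The city blocks, household labels, and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then include every individual in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)    # step 1: random clusters
    return {name: list(clusters[name]) for name in chosen}  # step 2: everyone in them

# Hypothetical population grouped into city blocks.
blocks = {
    "block_A": ["a1", "a2", "a3"],
    "block_B": ["b1", "b2"],
    "block_C": ["c1", "c2", "c3", "c4"],
    "block_D": ["d1"],
}
surveyed = cluster_sample(blocks, n_clusters=2, seed=5)
print(surveyed)
```

Note that the sample size is not fixed in advance: it depends on which clusters happen to be chosen.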
Other questions:
• How do I cluster the population?
• How many individuals in a cluster?
• How many clusters do I sample?
Sources of error in sampling
There are two types of errors:
• Sampling errors
• Nonsampling errors
Sampling error is the error that results from
using sampling to estimate information regarding
a population. This type of error occurs because a
sample gives incomplete information about the
population.
We can control the amount of sampling error
through an appropriately designed survey or
experiment.
Nonsampling errors (including selection bias) are
errors that result from the survey process. They
are very difficult to control. Examples:
1) Incomplete frame (certain segments of the
population are underrepresented)
2) Nonresponse of the individuals selected
3) Inaccurate responses (trained interviewers are
needed to avoid this)
4) Data-entry errors
5) Bias in the selection of individuals
6) Poorly designed questions: Do you use an open
question or a closed question?
7) Poorly worded questions: a question needs to
be balanced, not vague, and with the words in the
right order.
Organizing Categorical Data
We are interested in the number of individuals
that occur in each category.
Definition 12. A frequency distribution lists
the number of occurrences (or the count) for each
category of data. The relative frequency is the
proportion or percent of observations within
each category and is found using the formula
\[ \text{Relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}} \]
A relative frequency distribution lists the
relative frequency of each category of data.
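Both distributions can be computed in one pass. The sketch below uses made-up categorical data, and the function name `frequency_table` is an assumption for illustration.

```python
from collections import Counter

def frequency_table(observations):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(observations)    # frequency of each category
    total = sum(counts.values())      # sum of all frequencies
    return {category: (count, count / total)
            for category, count in counts.items()}

# Hypothetical data: favorite color of 8 individuals.
colors = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]
table = frequency_table(colors)
for category, (freq, rel) in table.items():
    print(f"{category:6s} frequency={freq} relative frequency={rel:.3f}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check on the arithmetic.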
Examples:
Definition 13. A bar graph is constructed by
labeling each category of data on the horizontal
axis and the frequency or relative frequency of
the category on the vertical axis. A rectangle of
equal width is drawn for each category. The
height of the rectangle is equal to the category’s
frequency or relative frequency.
A Pareto chart is a bar graph whose bars are
drawn in decreasing order of frequency or relative
frequency.
Definition 14. A side-by-side bar graph is
used when we want to compare two sets of data.
Careful: we should use relative frequencies
when drawing a side-by-side bar graph! (Why?)
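One way to see why: when the two data sets have different sizes, raw counts can point the wrong way. The numbers below are made up, and the helper `relative_frequencies` is a hypothetical name.

```python
def relative_frequencies(counts):
    """Convert a {category: frequency} table to relative frequencies."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Hypothetical survey answers from two groups of very different sizes.
group_a = {"yes": 50, "no": 150}     # 200 individuals
group_b = {"yes": 100, "no": 900}    # 1000 individuals

rel_a = relative_frequencies(group_a)
rel_b = relative_frequencies(group_b)

# Raw counts: group B has twice as many "yes" answers as group A.
print("counts:", group_a["yes"], "vs", group_b["yes"])
# Relative frequencies: "yes" is actually far more common in group A.
print("relative:", rel_a["yes"], "vs", round(rel_b["yes"], 2))
```

Bars drawn from the raw counts would make "yes" look more popular in group B only because group B is larger.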
Examples:
Definition 15. A pie chart is a circle divided
into sectors. Each sector represents a category of
data. The area of each sector is proportional to
the frequency of the category.
Remark: The size of the angle of each sector of
the pie chart is given by
percentage × 360° (with the percentage written as
a proportion, e.g. 25% = 0.25)
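As a small worked example of the remark, the sketch below turns a frequency table into sector angles. The counts are made up and the function name `sector_angles` is hypothetical.

```python
def sector_angles(counts):
    """Angle in degrees of each pie-chart sector: relative frequency times 360."""
    total = sum(counts.values())
    return {category: count / total * 360
            for category, count in counts.items()}

# Hypothetical frequency distribution.
angles = sector_angles({"red": 4, "blue": 3, "green": 1})
print(angles)  # {'red': 180.0, 'blue': 135.0, 'green': 45.0}
```

The angles add up to 360°, just as the relative frequencies add up to 1.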