Post on 04-Jun-2018
8/13/2019 Data Prep and Descriptive Stats
Data Preparation
Steps in Data Preparation
Editing
Coding
Entering Data
Data Tabulation
Reviewing Tabulations
Statistically adjusting the data
Editing
Carefully checking survey data for:
Completeness (no omissions)
Legibility (non-ambiguous)
Right informant
Consistency, e.g. charging something when the person does not own a charge card
Accuracy
The most important purpose is to eliminate, or at least reduce, the number of errors in the raw data.
Solutions
1. Ideally, re-interview the respondent.
2. Eliminate all unacceptable surveys (casewise deletion), if the sample is large and few surveys are unacceptable.
3. In calculations, consider only the cases with complete responses (pairwise deletion). This means that some statistics will be based on different sample sizes.
4. Code illegible or missing answers into a "no valid response" category.
5. Substitute a neutral value, typically the mean response to the variable, so that the mean remains unchanged.
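The deletion and substitution options above can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the survey rows, the variable names `q1` and `q2`, and the use of `None` to mark a missing answer are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing answer (assumed convention).
rows = [
    {"q1": 4, "q2": 5},
    {"q1": 2, "q2": None},
    {"q1": None, "q2": 3},
    {"q1": 5, "q2": 1},
]

def casewise(rows):
    """Casewise deletion: keep only surveys with no missing answers."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_mean(rows, var):
    """Pairwise deletion: use every available answer for this one variable."""
    return mean(r[var] for r in rows if r[var] is not None)

def mean_substitute(rows, var):
    """Replace missing answers with the variable's mean response."""
    m = pairwise_mean(rows, var)
    return [m if r[var] is None else r[var] for r in rows]

complete = casewise(rows)                # 2 of the 4 surveys survive
q1_mean = pairwise_mean(rows, "q1")      # based on 3 answers
q2_mean = pairwise_mean(rows, "q2")      # also 3 answers, but different surveys
q2_filled = mean_substitute(rows, "q2")  # mean of the filled column is unchanged
```

Note that `q1_mean` and `q2_mean` are each computed from three answers, but not the same three surveys, which is the "different sample sizes" caveat of pairwise deletion.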
Coding
The process of systematically and consistently assigning each response a numerical score.
The key to a good coding system is for the coding categories to be mutually exclusive and the entire system to be collectively exhaustive.
To be mutually exclusive, every response must fit into only one category.
To be collectively exhaustive, all possible responses must fit into one of the categories.
Exhaustive means that you have covered the entire range of the variable with your measurement.
Coding
Coding Missing Numbers: needed when respondents fail to complete portions of the survey.
Whatever the reason for incomplete surveys, you must indicate that no response was provided by the respondent.
For single-digit responses code as 9; for two-digit responses code as 99.
Coding Open-Ended Questions: When open-ended questions are used, you must create categories.
All responses must fit into a category.
Similar responses should fall into the same category.
e.g. Who services your car? ______________
Possible categories: self, garage, husband, wife, friend, relative, etc.
To make it collectively exhaustive, add an "other" or "none of the above" category.
Only a few responses, i.e. < 10%, should fit into this category.
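As a rough sketch of how such a scheme might be applied, the Python below maps free-text answers to one numeric code each and reports the share landing in the catch-all category. The numeric codes, the helper names, and the sample answers are hypothetical; only the category labels come from the slide.

```python
# Hypothetical numeric codes for "Who services your car?" (assumed, not from the slides).
CATEGORY_CODES = {"self": 1, "garage": 2, "husband": 3, "wife": 4,
                  "friend": 5, "relative": 6}
OTHER_CODE = 7  # catch-all that makes the scheme collectively exhaustive

def code_response(answer):
    """Map a free-text answer to exactly one numeric category code."""
    return CATEGORY_CODES.get(answer.strip().lower(), OTHER_CODE)

def other_share(answers):
    """Fraction of answers that fell into the catch-all category."""
    codes = [code_response(a) for a in answers]
    return codes.count(OTHER_CODE) / len(codes)

# Made-up answers for illustration only.
answers = ["Self", "garage", " Garage ", "wife", "friend", "self",
           "husband", "relative", "garage", "self", "my neighbour"]
share = other_share(answers)

# If 10% or more land in "other", the category list probably needs revising.
needs_new_categories = share >= 0.10
```

Because a dictionary lookup returns a single code, each answer falls into exactly one category (mutually exclusive), and the fallback code guarantees every answer is coded (collectively exhaustive).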
Precoded Questionnaires: Sometimes you can place codes on the actual questionnaire, which simplifies data entry.
This:
Are you: Male Female
How satisfied are you with our product?
___Very Satisfied
___Somewhat Satisfied
___Somewhat Dissatisfied
___Very Dissatisfied
___No opinion
Becomes this:
Are you: (1) Male (2) Female
How satisfied are you with our product?
_1__Very Satisfied
_2__Somewhat Satisfied
_3__Somewhat Dissatisfied
_4__Very Dissatisfied
_5__No opinion
1. Are you solely responsible for taking care of your automotive service needs? ___ Yes ___ No
2. If No, who performs the simple maintenance? ___________
3. If scheduled maintenance is done on your automobile, how do you keep track of what has been done?
Not tracked
Auto dealer records
Mental recollection
Other
4. How often is your automobile serviced?
Once per month
Once every three months
Once every six months
Once per year
Other _______________
Code Book
Col. No. | Question No. | Question Description | Range of permissible values
1-3 | ID # | N/A | 001-200
4 | 1 | Responsible for maintenance | 0 = no, 1 = yes, 9 = blank
5 | 2 | Who performs simple maintenance | 0 = husband, 1 = boyfriend, 2 = father, 3 = mother, 4 = relative, 5 = friend, 6 = other, 9 = blank
5 | 3 | How maintenance tracked | 0 = not tracked, 1 = auto dealer records, 2 = personal records, 3 = mental recollection, 4 = other, 9 = blank
6 | 4 | How often maintenance performed | Once per: 0 = month, 1 = 3 months, 2 = 6 months, 3 = year, 4 = other, 9 = blank
7 | 4 | Other for how often |
In questions that permit multiple responses, each possible response option should be assigned a separate column.
6. Which magazines do you read? Choose all that apply.
Time
Reader's Digest
MacLean's
National Geographic
Chatelaine
Col. No. | Question No. | Question Description | Range of permissible values
15 | 6 | Time | 0 = read, 1 = not read
16 | 6 | Reader's Digest | 0 = read, 1 = not read
17 | 6 | MacLean's | 0 = read, 1 = not read
18 | 6 | National Geographic | 0 = read, 1 = not read
19 | 6 | Chatelaine | 0 = read, 1 = not read
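A small Python sketch of the one-column-per-option idea follows. The function name and the example respondent are hypothetical; the 0 = read, 1 = not read convention is taken from the code book above, even though the reverse convention is also common.

```python
# Options in code book column order (columns 15 to 19).
MAGAZINES = ["Time", "Reader's Digest", "MacLean's",
             "National Geographic", "Chatelaine"]

# Convention copied from the code book: 0 = read, 1 = not read.
READ, NOT_READ = 0, 1

def code_multiple_response(ticked):
    """Return one code per magazine, in column order, for one respondent."""
    return [READ if magazine in ticked else NOT_READ for magazine in MAGAZINES]

# Hypothetical respondent who ticked two of the five options.
row = code_multiple_response({"Time", "Chatelaine"})
```

Each respondent always yields five values, so the data file keeps a fixed width regardless of how many boxes were ticked.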
For rank order questions, separate columns are also needed.
7. Please rank the following brands of toothpaste in order of preference (1-5):
Crest
Colgate
Aquafresh
Arm & Hammer
Pepsodent
Col. No. | Question No. | Question Description | Range of permissible values
20 | 7 | Crest rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
21 | 7 | Colgate rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
22 | 7 | Aquafresh rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
23 | 7 | A & H rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
25 | 7 | Pepsodent rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
Entering Data
Problems can occur during data entry, such as transposing numbers and inputting an infeasible code (e.g. out of range).
E.g. for a score on a range of 1-5, the values 0, 6, 7, and 8 are unacceptable or out of range (might be due to a transcription error).
Always check the data-entry work.
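One way to check data-entry work is an automated range check against the permitted codes. The sketch below is illustrative only: the function name and the entered values are made up, and it assumes a 1-5 scale with 9 reserved for a blank answer.

```python
def find_entry_errors(values, valid_codes):
    """Return (position, value) pairs for entries outside the permitted codes."""
    return [(i, v) for i, v in enumerate(values) if v not in valid_codes]

# Assumed coding: a 1-5 scale, with 9 meaning "blank".
scale_codes = {1, 2, 3, 4, 5, 9}

# Hypothetical column of entered data containing two infeasible codes.
entered = [3, 5, 0, 4, 9, 7, 1]
errors = find_entry_errors(entered, scale_codes)
```

A range check catches infeasible codes such as 0 or 7, but not a transposition that happens to produce another valid code, so re-checking a sample of entries against the original questionnaires is still worthwhile.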
Descriptive Statistics
Five types of statistical analysis
Descriptive: What are the characteristics of the respondents?
Inferential: What are the characteristics of the population?
Differences: Are two or more groups the same or different?
Associative: Are two or more variables related in a systematic way?
Predictive: Can we predict one variable if we know one or more other variables?
Descriptive Statistics
Summarization of a collection of data in a clear and understandable way
the most basic form of statistics
lays the foundation for all statistical knowledge
The tradeoff in descriptive statistics
If you use fewer statistics to describe the distribution of a
variable, you lose information but gain clarity.
When should one use fewer statistics?
When dropping the number of statistics would leave more information per remaining statistic.
When the information you drop is unimportant to one's research question.
Type of measurement | Type of descriptive analysis
Nominal, two categories | Frequency table; proportion (percentage)
Nominal, more than two categories | Frequency table; category proportions (percentages); mode
Type of measurement | Type of descriptive analysis
Ordinal | Rank order; median
Interval | Arithmetic mean
Ratio | Means
Data Tabulation
Tabulation: The organized arrangement of data in a table format that is easy to read and understand.
Tabulate the data to count the number of responses to each question.
Simple Tabulation: The tabulating of results of only one variable; it informs you how often each response was given.
Frequency Distribution: A distribution of data that summarizes the number of times a certain value of a variable occurs, expressed in terms of percentages.
Frequency Tables
The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable.
How many users of a certain brand can be called loyal?
What percentage of the market are heavy users and light users?
How many consumers are aware of a new product?
What brand is the "Top of Mind" of the market?
More on relative frequency distributions
Rules for relative frequency distributions:
Make sure each observation is in one and only one category.
Use categories of equal width.
Choose an appealing number of categories.
Provide labels
Double-check your graph.
Definitions:
A histogram is a relative frequency distribution of a quantitative variable.
A bar graph is a relative frequency distribution of a qualitative variable.
WebSurveyor Bar Chart
How did you find your last job?
Source | Count | Percentage
Networking | 643 | 55.2%
Print ad | 213 | 18.3%
Online recruitment site | 179 | 15.4%
Placement firm | 112 | 9.6%
Temporary agency | 18 | 1.5%
[Bar chart: count per source on a horizontal axis from 0 to 700]
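The percentages in this chart are relative frequencies: each count divided by the total number of responses. A minimal Python sketch, using the counts from the chart (the function name is an assumption for illustration):

```python
# Counts taken from the bar chart above.
counts = {
    "Networking": 643,
    "Print ad": 213,
    "Online recruitment site": 179,
    "Placement firm": 112,
    "Temporary agency": 18,
}

def relative_frequencies(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total_responses = sum(counts.values())
percentages = relative_frequencies(counts)
```

The five counts sum to 1165 responses, and the computed percentages reproduce the labels on the chart.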
How many times per week do you use mouthwash?
1__ 2__ 3__ 4__ 5__ 6__ 7__
Responses: 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7
Times per week | Frequency
1 | 2
2 | 3
3 | 5
4 | 7
5 | 5
6 | 3
7 | 2
[Bar chart: frequency of each response, 1 to 7 times per week]
Normal Distribution
Curve is basically bell shaped
Normal Distributions
Curve is basically bell shaped, extending from −∞ to ∞
symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails
Mean, median and mode coincide
They differ in how spread out they are.
The area under each curve is 1.
The height of a normal distribution can be specified mathematically in terms of two parameters: the mean ($\mu$) and the standard deviation ($\sigma$).
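For reference, the height of the curve at a score $x$ is given by the standard normal density function. The slides do not show this formula; it is the usual textbook form, written here with the same $\mu$ and $\sigma$:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad -\infty < x < \infty
```

Changing $\mu$ shifts the curve left or right, while changing $\sigma$ makes it wider or narrower; in both cases the area under the curve remains 1.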
Kurtosis: how peaked a distribution is. A value of zero indicates a normal distribution, positive numbers indicate a peaked distribution, and negative numbers indicate a flatter distribution.
[Figure: a peaked distribution and a flat distribution]
Thanks, Scott!
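A kurtosis value that is zero for a normal distribution is usually called excess kurtosis. The Python sketch below uses one common moment-based definition (fourth central moment divided by the squared variance, minus 3); statistical packages may apply small-sample corrections, so their output can differ slightly. The two data sets are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: evenly spread scores versus scores piled on one value.
flat = [1, 2, 3, 4, 5]
peaked = [3, 3, 3, 3, 3, 3, 1, 5]

flat_kurtosis = excess_kurtosis(flat)      # negative: flatter than normal
peaked_kurtosis = excess_kurtosis(peaked)  # positive: more peaked than normal
```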
Summary statistics
Central tendency
Dispersion or variability: a quantitative measure of the degree to which scores in a distribution are spread out or clustered together
Descriptive Analysis: Measures of Central Tendency
Mode: the number that occurs most often in a string (nominal data)
Median: half of the responses fall above this point, half fall below this point (ordinal data)
Mean: the average (interval/ratio data)
Mode
the most frequent category, e.g. users 25%, non-users 75%
Advantages:
meaning is obvious
the only measure of central tendency that can be used with nominal data
Disadvantages:
many distributions have more than one mode, i.e. are "multimodal"
greatly subject to sample fluctuations
therefore not recommended to be used as the only measure of central tendency
Median
the middle observation of the data
Number of times per week consumers use mouthwash:
1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7
[Figure: frequency distribution of mouthwash use per week, running from light user to heavy user, with the mode, median, and mean marked]
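The three measures can be checked on the mouthwash data with Python's standard `statistics` module. This is only a quick sketch; the variable names are arbitrary.

```python
from statistics import mean, median, mode

# Mouthwash uses per week, rebuilt from the frequency table:
# 1 (x2), 2 (x3), 3 (x5), 4 (x7), 5 (x5), 6 (x3), 7 (x2).
uses = [1] * 2 + [2] * 3 + [3] * 5 + [4] * 7 + [5] * 5 + [6] * 3 + [7] * 2

n = len(uses)               # 27 respondents
most_common = mode(uses)    # most frequent value
middle = median(uses)       # 14th of the 27 sorted values
average = mean(uses)        # 108 / 27
```

Because this distribution is symmetric around 4, the mode, median, and mean all come out to the same value, as they do for a normal distribution.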
The Mean (average value)
sum of all the scores divided by the number of scores.
a good measure of central tendency for roughly symmetric distributions
can be misleading in skewed distributions, since it can be greatly influenced by extreme scores; in that case other statistics, such as the median, may be more informative
formula: $\mu = \sum X / N$ (population)
$\bar{X} = \sum x_i / n$ (sample)
where $\mu$ / $\bar{X}$ is the population/sample mean
and $N$ / $n$ is the number of scores.
Normal Distributions with Different Means
[Figure: normal curves centered at different means]
Measures of Dispersion or Variability
Minimum, Maximum, and Range (highest value minus the lowest value)
Variance
Standard Deviation (a measure of distance from the mean)
Distribution of Final Course Grades in MGMT 3220Y
Grade | F | D | C | B | A
Frequency | 3 | 10 | 20 | 23 | 12
[Bar chart: frequency (0 to 25) by grade, annotated with the range and with the points −1 SD and +1 SD]
Variance
The difference between an observed value and the mean is called the deviation from the mean.
The variance is the mean squared deviation from the mean,
i.e. you subtract each value from the mean, square each result and then take the average.
Because it is squared it can never be negative.
$\sigma^2 = \sum (\bar{X} - x_i)^2 / n$
Measures of Dispersion
Suppose we are testing the new flavor of a fruit punch
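Dislike 1 2 3 4 5 Like
Data:
1. 3
2. 5
3. 3
4. 5
5. 3
6. 5
[Dot plot: one mark per respondent on the 1 to 5 scale]
$\bar{X} = 4$, $\sigma^2 = 1$, $S = 1$
$\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$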
Measures of Dispersion
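Dislike 1 2 3 4 5 Like
Data:
1. 5
2. 4
3. 5
4. 5
5. 5
6. 4
[Dot plot: one mark per respondent on the 1 to 5 scale]
$\bar{X} \approx 4.67$, $\sigma^2 \approx 0.22$, $S \approx 0.47$
$\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$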
Measures of Dispersion
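Dislike 1 2 3 4 5 Like
Data:
1. 1
2. 5
3. 1
4. 5
5. 1
6. 5
[Dot plot: one mark per respondent on the 1 to 5 scale]
$\bar{X} = 3$, $\sigma^2 = 4$, $S = 2$
$\sigma^2 = \sum (\bar{X} - x_i)^2 / n$, $S = \sqrt{\sum (\bar{X} - x_i)^2 / n}$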
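The three fruit punch examples can be checked with a short Python sketch of the variance formula shown on the slides, which divides by n. The `sample` switch, which divides by n − 1 instead, is an addition for comparison and is not part of the slides' formula; the function and dictionary names are arbitrary.

```python
from math import sqrt

def variance(values, sample=False):
    """Mean squared deviation from the mean.

    Divides by n, as on the slides; sample=True divides by n - 1 instead
    (an assumption added here for comparison only).
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((mean - v) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)

# The three sets of fruit punch ratings from the slides.
ratings = {
    "first": [3, 5, 3, 5, 3, 5],
    "second": [5, 4, 5, 5, 5, 4],
    "third": [1, 5, 1, 5, 1, 5],
}

results = {
    name: (variance(data), sqrt(variance(data)))
    for name, data in ratings.items()
}
second_sample_variance = variance(ratings["second"], sample=True)
```

With the divide-by-n formula, the first and third data sets give variances of 1 and 4 and standard deviations of 1 and 2. The second data set gives a variance of about 0.22 with n, and about 0.27 (standard deviation about 0.52) with n − 1, so it is worth checking which divisor a given software package uses.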
Normal Distributions with Different SD
[Figure: normal curves with three different standard deviations, labeled 1, 2, and 3]
Cross Tabulation
A statistical technique that involves tabulating the results of two or more variables simultaneously
informs you how often each response was given
shows relationships among and between variables
frequency distribution for each subgroup compared to the frequency distribution for the total sample
variables must be nominally scaled
Cross-tabulation
Helps answer questions about whether two or more variables of interest are linked:
Is the type of mouthwash user (heavy or light) related to gender?
Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?
Is income level associated with gender?
Cross-tabulation determines association, not causality.
Dependent and Independent Variables
The variable being studied is called the dependent variable or response variable.
A variable that influences the dependent variable is called an independent variable.
Cross-tabulation
Cross-tabulation of two or more variables is possible if the variables are discrete:
The frequency of one variable is subdivided by the other variable's categories.
Generally a cross-tabulation table has:
Row percentages
Column percentages
Total percentages
Which one is better?
DEPENDS on which variable is considered as independent.
Contingency Table
A contingency table shows the conjoint distribution of two discrete variables.
This distribution represents the probability of observing a case in each cell.
Probability is calculated as:
$P = \dfrac{\text{Observed cases}}{\text{Total cases}}$
Cross tabulation
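GROUPINC * Gender Crosstabulation
GROUPINC (income) | Statistic | Gender, 1st category | Gender, 2nd category | Total
1st group | Count | 10 | 9 | 19
1st group | % within GROUPINC | 52.6% | 47.4% | 100.0%
1st group | % within Gender | 55.6% | 18.8% | 28.8%
1st group | % of Total | 15.2% | 13.6% | 28.8%
2nd group | Count | 5 | 25 | 30
2nd group | % within GROUPINC | 16.7% | 83.3% | 100.0%
2nd group | % within Gender | 27.8% | 52.1% | 45.5%
2nd group | % of Total | 7.6% | 37.9% | 45.5%
3rd group | Count | 3 | 14 | 17
3rd group | % within GROUPINC | 17.6% | 82.4% | 100.0%
3rd group | % within Gender | 16.7% | 29.2% | 25.8%
3rd group | % of Total | 4.5% | 21.2% | 25.8%
Total | Count | 18 | 48 | 66
Total | % within GROUPINC | 27.3% | 72.7% | 100.0%
Total | % within Gender | 100.0% | 100.0% | 100.0%
Total | % of Total | 27.3% | 72.7% | 100.0%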
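The row, column, and total percentages in this output can be reproduced from the counts alone. The Python sketch below is a minimal illustration; the function name and the nested-list layout (rows are the three income groups, columns are the two gender categories) are assumptions made for the example.

```python
# Counts from the GROUPINC * Gender crosstabulation:
# rows are the three income groups, columns are the two gender categories.
counts = [[10, 9], [5, 25], [3, 14]]

def crosstab_percentages(counts):
    """Return (row %, column %, total %) tables, rounded to one decimal."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    row_pct = [[round(100 * v / rt, 1) for v in row]
               for row, rt in zip(counts, row_totals)]
    col_pct = [[round(100 * v / ct, 1) for v, ct in zip(row, col_totals)]
               for row in counts]
    total_pct = [[round(100 * v / grand_total, 1) for v in row]
                 for row in counts]
    return row_pct, col_pct, total_pct

row_pct, col_pct, total_pct = crosstab_percentages(counts)
```

Row percentages correspond to "% within GROUPINC", column percentages to "% within Gender", and total percentages to "% of Total", which is also the cell probability of observed cases divided by total cases.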
Chi-square Test for Independence
The Chi-square test for independence determines whether two variables are associated or not.
H0: Two variables are independent
H1: Two variables are not independent
Chi-square test results are unstable if the expected count in a cell is lower than 5.
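As an illustrative sketch, the Pearson chi-square statistic can be computed by hand for the GROUPINC by Gender counts from the cross-tabulation slide. The function name is hypothetical, the 0.05 significance level is an assumption, and 5.991 is the standard critical value for 2 degrees of freedom at that level; a statistics package would normally report an exact p-value instead.

```python
def chi_square(observed):
    """Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Counts from the crosstabulation: income groups (rows) by gender (columns).
observed = [[10, 9], [5, 25], [3, 14]]
statistic, dof, expected = chi_square(observed)

CRITICAL_VALUE_05_DF2 = 5.991  # chi-square critical value, alpha = 0.05, 2 df
reject_independence = statistic > CRITICAL_VALUE_05_DF2

# Cells whose expected count falls below 5, where results may be unstable.
small_expected = [e for row in expected for e in row if e < 5]
```

For these counts the statistic exceeds the critical value, so independence would be rejected at the assumed 0.05 level, but one cell has an expected count below 5, so the result should be read with the caution noted above.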