Data Analysis Using SPSS
By
Dr. R. RAVANAN, Associate Professor
Department of Statistics, Presidency College, Chennai – 600 005
E-mail: [email protected]
Mobile: 98403 75672 / 94442 21627
What is SPSS?
– Statistical Package for Social Science
– General-purpose statistical software
– Consists of three components:
  – Data window – data entry and database (.sav)
  – Output window – all output from an SPSS session (.lst)
  – Syntax window – command lines (.sps)
Data Entry & Preparation
– Data entry: new or recalled (SPSS or non-SPSS)
– Data definition
– Data manipulation and variable development
Data Definition
Purpose: give meanings to the numbers for ease of reading the output
Involves:
– Data format
– Variable name
– Value labels
– Missing values
Command: Data → Data Definition
Data Manipulation
Recoding
– To give new values to old values (especially reversing negatively worded questions)
– To form a nominal variable from continuous data
Variable Development
– To form new variables as combinations or functions of old ones
Command: Transform → Recode / Compute
Data Analysis – Descriptive
Purpose: to describe each variable – what is the current level of the variable of interest?
Statistics: frequency, mean, minimum, maximum, standard deviation, quartiles
Command: Analyze → Frequencies / Descriptives
Data Analysis – Descriptive
Frequencies for two or more nominal variables
Command: Analyze → Summarize → Crosstabs
Means of variables by subgroups defined by one or more nominal variables
Command: Analyze → Compare Means → Means (use of levels)
Parametric Tests of Differences
When: the dependent variable is continuous and we want to test differences across groups
Command: Analyze → Compare Means → Independent-Samples T Test / Paired-Samples T Test / One-Way ANOVA
Non-Parametric Tests of Differences
When: the dependent variable is ordinal, or the normality assumption is not met
Command: Analyze → Nonparametric Tests → 2 Independent Samples / 2 Related Samples / K Independent Samples / K Related Samples
Parametric Two-Way ANOVA
When: continuous dependent variable and related groups
Command: Analyze → General Linear Model → Simple
Note: fixed factor effect
Bivariate Relationship
When: covariation between two variables
Correlation: when both are continuous or ordinal
Command: Analyze → Correlate → Bivariate (with option for Spearman if both are ordinal)
Regression Analysis
When: to establish the relationship between one continuous dependent variable and a number of continuous independent variables
Command: Analyze → Regression → Linear (use Statistics, Save options)
Issues: assumptions of regression – normality; constant variance; independence of independent variables; independence of error terms
Regression Analysis
Issues (cont.):
– Outliers and leverage values
– Choice of selection method for independent variables – Enter, Backward, Forward, Stepwise
– Dummy independent variables
Options: residual analysis; influence statistics; collinearity diagnostics; normality plots
Regression Analysis
Interpretation
– Goodness of model: R², F-statistic, adjusted R², standard error
– Strength of influence of independent variables: beta and standardized beta
Reliability Analysis
When: before forming a composite index for a variable from a number of items
Command: Analyze → Scale → Reliability Analysis (with options for descriptives of item, scale, and scale if item deleted)
Interpretation: an alpha value greater than 0.7 is good; more than 0.5 is acceptable; delete some items if necessary
Measures of Reliability
Internal consistency (of items in a scale):
1. Average inter-item correlation: if the average inter-item correlation > 0.6, then standardize the items and add them together as an index.
2. Cronbach's alpha, which measures the "internal consistency of items in a scale" (Garson, G.D., 1999) and is
α = (k / (k − 1)) × (1 − Σ sᵢ² / s_T²)
where k = number of items, sᵢ² = variance of item i, and s_T² = variance of the total score.
Factor Analysis
When: to reduce the number of variables to underlying dimensions
Command: Analyze → Data Reduction → Factor (options: rotation, save factor scores)
Issues: assumption of sufficient correlations between the variables (Bartlett's test; anti-image matrix; KMO measure of sampling adequacy)
Discriminant Analysis
When: the dependent variable is nominal and the purpose is to predict group membership on the basis of independent variables
Command: Analyze → Classify → Discriminant (options: classify by summary tables; Select – for holdout and analysis samples)
Issues: similar to regression
Discriminant Analysis
Interpretation
– Goodness of analysis: hit ratio – compared to maximum chance, proportional chance and Press's Q
– Univariate results: to establish the discriminating variables
Exercise 1: t-TEST FOR A SINGLE MEAN
Problem:
The satisfaction levels with their current job of 12 employees are given below.
Test whether the level of satisfaction is above the average level at the 1% level of significance.

Emp. No             1   2   3   4   5   6   7   8   9   10  11  12
Satisfaction level  S   HS  N   HS  D   S   HS  N   S   HS  S   HS

Solution:
1. Null Hypothesis: The level of satisfaction of employees is equal to the average level.
2. Alternate Hypothesis: The level of satisfaction of employees is above the average level.
3. Test Statistic: The t test for a single mean is
t = (x̄ − μ₀) / (s / √n), with n − 1 degrees of freedom.
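The computation behind this test can be checked outside SPSS. A minimal Python sketch follows; the numeric coding of the labels (HS = 5, S = 4, N = 3, D = 2, with 3 as the hypothesised "average" level) is an assumption, since the exercise does not state one:

```python
import math

# Assumed Likert coding (not given in the exercise): HS=5, S=4, N=3, D=2
coding = {"HS": 5, "S": 4, "N": 3, "D": 2}
labels = ["S", "HS", "N", "HS", "D", "S", "HS", "N", "S", "HS", "S", "HS"]
x = [coding[lab] for lab in labels]

n = len(x)
mean = sum(x) / n
# Sample standard deviation (n - 1 in the denominator)
s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
mu0 = 3  # hypothesised average satisfaction level
t = (mean - mu0) / (s / math.sqrt(n))  # t ≈ 3.77 with 11 df
```

Under this coding the One-Sample T Test in SPSS should report the same t value.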
Exercise 2: t-TEST FOR DIFFERENCE OF TWO MEANS (INDEPENDENT SAMPLES)
Problem:
The Marks obtained by a group of 9 regular students and another group of 11 part-time course students in a test are given below:
Regular 70 78 75 71 73 59 78 69 72
Part -Time 62 70 71 62 60 56 69 64 72 68 66
Examine whether the marks obtained by regular and part-time students differ significantly at 5% level of significance.
Solution:
1. Null Hypothesis: There is no significant difference between the average marks obtained by regular and part-time students.
2. Alternate Hypothesis: There is a significant difference between the average marks obtained by regular and part-time students.
3. Test Statistic: The t test for the difference of two means is
t = (x̄₁ − x̄₂) / √(s²(1/n₁ + 1/n₂)), with n₁ + n₂ − 2 degrees of freedom, where s² is the pooled variance.
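As a check on the pooled-variance t statistic, a short Python sketch using the exercise data:

```python
import math

regular = [70, 78, 75, 71, 73, 59, 78, 69, 72]
part_time = [62, 70, 71, 62, 60, 56, 69, 64, 72, 68, 66]

def mean(v):
    return sum(v) / len(v)

def ss(v):  # sum of squared deviations about the mean
    m = mean(v)
    return sum((x - m) ** 2 for x in v)

n1, n2 = len(regular), len(part_time)
# Pooled variance (assumes equal population variances)
s2 = (ss(regular) + ss(part_time)) / (n1 + n2 - 2)
t = (mean(regular) - mean(part_time)) / math.sqrt(s2 * (1 / n1 + 1 / n2))
# t ≈ 2.56 with 18 df
```

This mirrors the "equal variances assumed" row of SPSS's Independent-Samples T Test output.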
Exercise 3: PAIRED ‘t’ TEST FOR DIFFERENCE OF TWO MEANS (DEPENDENT SAMPLES)
Problem: A Company arranged an intensive training course for its team of salesmen. A random sample of 10 salesmen was selected and the value (in ‘000) of their sales made in the weeks immediately before and after the course are shown in the following table:
Salesmen 1 2 3 4 5 6 7 8 9 10
Sales Before 12 23 5 18 10 21 19 15 8 14
Sales After 18 22 15 21 13 22 17 19 12 16
Test whether there is evidence of an increase in mean sales.
Solution:
1. Null Hypothesis: There is no significant difference in mean sales before and after the training course.
2. Alternate Hypothesis: Mean sales after the training course are significantly higher than before.
3. Test Statistic: The paired t test for the difference of two means is
t = d̄ / (s_d / √n), with n − 1 degrees of freedom, where d̄ and s_d are the mean and standard deviation of the paired differences.
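A quick Python sketch of the same computation on the paired differences:

```python
import math

before = [12, 23, 5, 18, 10, 21, 19, 15, 8, 14]
after = [18, 22, 15, 21, 13, 22, 17, 19, 12, 16]

d = [a - b for a, b in zip(after, before)]  # paired differences
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
t = d_bar / (s_d / math.sqrt(n))  # t ≈ 2.76 with 9 df
```

SPSS's Paired-Samples T Test reports the same statistic (up to sign, depending on which variable is subtracted).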
Exercise 4: F-TEST FOR EQUALITY OF TWO VARIANCES
Problem: The times taken by workers in performing a job under two methods are given below:
Method I   20 16 26 27 23 22
Method II  27 33 42 35 32 34 38
Test whether there is any significant difference between the variances of the two time distributions.
Solution:
1. Null Hypothesis: There is no significant difference between the variances of Method I and Method II with regard to the time distribution.
2. Alternate Hypothesis: There is a significant difference between the variances of Method I and Method II with regard to the time distribution.
3. Test Statistic: The F test for equality of variances is
F = s₁² / s₂² (larger variance in the numerator), with (n₁ − 1, n₂ − 1) degrees of freedom.
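The variance ratio can be computed directly; a minimal sketch:

```python
method1 = [20, 16, 26, 27, 23, 22]
method2 = [27, 33, 42, 35, 32, 34, 38]

def var(v):  # sample variance, n - 1 in the denominator
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

v1, v2 = var(method1), var(method2)
F = max(v1, v2) / min(v1, v2)  # larger variance in the numerator; F ≈ 1.37
```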
Exercise 5: ANOVA (ONE-WAY CLASSIFICATION)
Problem:
The following table gives the yields of 15 sample plots under three varieties of seed.

Variety A  20 20 23 16 20
Variety B  18 20 17 15 25
Variety C  25 28 22 28 32

Test whether there is a significant difference in the average yield of the three varieties of seed.
Solution:
1. Null Hypothesis: There is no significant difference between the average yields of the three varieties of seed.
2. Alternate Hypothesis: There is a significant difference between the average yields of the three varieties of seed.
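The one-way ANOVA F ratio for these data can be verified by hand; a Python sketch of the between/within decomposition:

```python
groups = {"A": [20, 20, 23, 16, 20],
          "B": [18, 20, 17, 15, 25],
          "C": [25, 28, 22, 28, 32]}

all_vals = [x for g in groups.values() for x in g]
N, k = len(all_vals), len(groups)
grand_mean = sum(all_vals) / N

# Between-group and within-group sums of squares
SSB = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
SSW = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)

F = (SSB / (k - 1)) / (SSW / (N - k))  # F ≈ 8.39 with (2, 12) df
```

This is what Analyze → Compare Means → One-Way ANOVA reports in its ANOVA table.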
Exercise 6: ANOVA (TWO-WAY CLASSIFICATION)
Problem: Perform a two-way ANOVA and test for the differences between varieties as well as between blocks for the following data.

Variety  Block 1  Block 2  Block 3  Block 4
A        52       56       48       44
B        43       41       45       38
C        39       39       41       41

1. Null Hypothesis: There is no significant difference in mean yields between varieties or between blocks.
2. Alternate Hypothesis: There is a significant difference in mean yields between varieties as well as between blocks.
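The two-way decomposition (rows = varieties, columns = blocks, one observation per cell) can be sketched as:

```python
data = [[52, 56, 48, 44],   # Variety A
        [43, 41, 45, 38],   # Variety B
        [39, 39, 41, 41]]   # Variety C

r, c = len(data), len(data[0])
total = sum(sum(row) for row in data)
CF = total ** 2 / (r * c)                                   # correction factor
TSS = sum(x ** 2 for row in data for x in row) - CF
SSR = sum(sum(row) ** 2 for row in data) / c - CF           # varieties
SSC = sum(sum(row[j] for row in data) ** 2 for j in range(c)) / r - CF  # blocks
SSE = TSS - SSR - SSC

MSE = SSE / ((r - 1) * (c - 1))
F_var = (SSR / (r - 1)) / MSE   # F for varieties ≈ 9.03 with (2, 6) df
F_blk = (SSC / (c - 1)) / MSE   # F for blocks ≈ 0.92 with (3, 6) df
```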
Exercise 7: CHI SQUARE TEST FOR GOODNESS OF FIT
Problem: A company keeps records of accidents. During a recent safety review, a random sample of 60 accidents was selected and classified by the day of the week on which they occurred.
Day              Monday  Tuesday  Wednesday  Thursday  Friday
No. of accidents 8       12       9          14        17

Test whether there is any evidence that accidents are more likely on some days than others.
Solution:
1. Null Hypothesis: Accidents are equally distributed over the days of the week.
2. Alternate Hypothesis: Accidents are not equally distributed over the days of the week.
3. Test Statistic: The chi-square test for goodness of fit is
χ² = Σ (O − E)² / E, with k − 1 degrees of freedom, where O and E are the observed and expected frequencies.
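Under the null hypothesis each day expects 60/5 = 12 accidents, and the statistic works out exactly; a short sketch:

```python
observed = [8, 12, 9, 14, 17]           # Monday .. Friday
expected = sum(observed) / len(observed)  # 12 accidents per day under H0

chi2 = sum((o - expected) ** 2 / expected for o in observed)
# chi2 = 54/12 = 4.5 with k - 1 = 4 degrees of freedom
```

The same value appears in SPSS under Nonparametric Tests → Chi-Square.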
Exercise 8: CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES
Problem: The following table gives data relating to the condition of the child and the condition of the home. Test whether the two attributes are independent.

Condition of Home  Condition of Child
                   Clean   Dirty
Clean              70      50
Fairly clean       80      20
Dirty              35      45
Solution:
1. Null Hypothesis: There is no association between the condition of the child and the condition of the home.
2. Alternate Hypothesis: There is an association between the condition of the child and the condition of the home.
3. Test Statistic: The chi-square test for independence of attributes is
χ² = Σ (O − E)² / E, where E = (row total × column total) / grand total, with (r − 1)(c − 1) degrees of freedom.
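The expected frequencies and the statistic can be computed from the marginal totals; a minimal sketch:

```python
table = [[70, 50],   # home clean
         [80, 20],   # home fairly clean
         [35, 45]]   # home dirty

row_tot = [sum(r) for r in table]
col_tot = [sum(r[j] for r in table) for j in range(2)]
grand = sum(row_tot)

# chi-square with expected counts E = row total * column total / grand total
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / grand) ** 2
           / (row_tot[i] * col_tot[j] / grand)
           for i in range(3) for j in range(2))
# chi2 ≈ 25.65 with (3-1)(2-1) = 2 degrees of freedom
```

SPSS produces this in the Crosstabs procedure with the Chi-square statistic option ticked.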
Exercise 9: TEST FOR SIGNIFICANCE OF CORRELATION COEFFICIENT
Problem:
Find the correlation coefficient between income and expenditure of a family for the following data. Also test whether the correlation coefficient is significant.
Income ( in hundreds)
60 58 45 65 56 38 70
Expenditure (in hundreds)
55 50 40 60 62 45 63
Solution: First find the coefficient of correlation using the formula
r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²)
1. Null Hypothesis: There is no relationship between income and expenditure of the family.
2. Alternate Hypothesis: There is a relationship between income and expenditure of the family.
3. Test Statistic: The t test for the coefficient of correlation is
t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom.
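Both r and its t statistic can be verified in a few lines of Python:

```python
import math

income = [60, 58, 45, 65, 56, 38, 70]
expenditure = [55, 50, 40, 60, 62, 45, 63]

n = len(income)
mx, my = sum(income) / n, sum(expenditure) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(income, expenditure))
sxx = sum((x - mx) ** 2 for x in income)
syy = sum((y - my) ** 2 for y in expenditure)

r = sxy / math.sqrt(sxx * syy)               # r ≈ 0.83
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # t ≈ 3.33 with 5 df
```

SPSS's Bivariate Correlations procedure reports r together with its significance level.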
Exercise 10: REGRESSION ANALYSIS
Problem: The following table gives the annual food expenditure, annual income and family size of 10 families. Fit a multiple regression equation of food expenditure on annual income and family size.

Family  Annual Food Expenditure ('000)  Annual Income ('000)  Family Size
1       5.2                             28                    3
2       5.1                             26                    3
3       5.6                             32                    2
4       4.6                             24                    1
5       11.3                            54                    4
6       8.1                             29                    2
7       7.8                             44                    3
8       5.8                             30                    2
9       5.1                             40                    1
10      18.0                            82                    6

The regression model is
ŷ = a + b₁x₁ + b₂x₂, where y = food expenditure, x₁ = annual income and x₂ = family size.
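With only two predictors, the least-squares coefficients can be obtained from the normal equations in closed form; a pure-Python sketch as a check on SPSS's Linear Regression output:

```python
expenditure = [5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18.0]
income = [28, 26, 32, 24, 54, 29, 44, 30, 40, 82]
size = [3, 3, 2, 1, 4, 2, 3, 2, 1, 6]

def mean(v):
    return sum(v) / len(v)

def s_uv(u, v):  # corrected sum of cross-products about the means
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Normal equations for two predictors, solved directly
Sxx, Szz = s_uv(income, income), s_uv(size, size)
Sxz = s_uv(income, size)
Sxy, Szy = s_uv(income, expenditure), s_uv(size, expenditure)

det = Sxx * Szz - Sxz ** 2
b1 = (Szz * Sxy - Sxz * Szy) / det   # coefficient of income  ≈ 0.161
b2 = (Sxx * Szy - Sxz * Sxy) / det   # coefficient of size    ≈ 0.876
a = mean(expenditure) - b1 * mean(income) - b2 * mean(size)  # ≈ -0.956
```

So the fitted equation is approximately ŷ = −0.96 + 0.16 x₁ + 0.88 x₂.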
Non-Parametric Tests
One-sample tests:
– Binomial test
– Chi-square test for goodness of fit
– Kolmogorov–Smirnov one-sample test
Two independent samples:
– Fisher exact test
– Chi-square test for independence of attributes
– Median test
– Mann–Whitney U test
– Kolmogorov–Smirnov two-sample test
Non-Parametric Tests
Two dependent samples:
– McNemar test
– Sign test
– Wilcoxon matched-pairs signed-rank test
– Walsh test
More than two independent samples:
– Kruskal–Wallis one-way analysis of variance
– Chi-square test for k independent samples
– Extension of the median test
More than two dependent samples:
– Friedman two-way analysis of variance
– Cochran Q test
Mann–Whitney U Test
The Mann–Whitney U statistic is
U = n₁n₂ + n₁(n₁ + 1)/2 − R₁
where n₁ and n₂ are the two sample sizes and R₁ is the sum of the ranks of sample 1 in the combined ranking; the smaller of U and n₁n₂ − U is referred to the table.
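A small worked example in Python with hypothetical data (the two samples below are invented for illustration, and chosen with no ties so simple integer ranks suffice):

```python
# Hypothetical scores for two small groups (illustrative only)
g1 = [12, 15, 9, 20]
g2 = [18, 25, 11, 30, 22]

combined = sorted(g1 + g2)
rank = {v: i + 1 for i, v in enumerate(combined)}  # no ties in this example

n1, n2 = len(g1), len(g2)
R1 = sum(rank[v] for v in g1)            # rank sum of sample 1
U1 = n1 * n2 + n1 * (n1 + 1) // 2 - R1
U = min(U1, n1 * n2 - U1)                # smaller U is referred to the table
```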
Wilcoxon Test
The Wilcoxon matched-pairs statistic is T = sum of the ranks with the less frequent sign, where the non-zero paired differences are ranked by absolute value. For large samples,
z = (T − n(n + 1)/4) / √(n(n + 1)(2n + 1)/24)
where n = number of non-zero differences.
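A sketch of computing T on hypothetical paired differences (the data are invented; tied absolute differences receive their average rank):

```python
d = [5, -2, 7, 3, -1, 4, 6, -3]   # hypothetical non-zero paired differences
abs_d = sorted(abs(x) for x in d)

def avg_rank(v):
    # average rank of absolute value v among all |differences| (handles ties)
    idxs = [i + 1 for i, a in enumerate(abs_d) if a == v]
    return sum(idxs) / len(idxs)

pos = sum(avg_rank(abs(x)) for x in d if x > 0)
neg = sum(avg_rank(abs(x)) for x in d if x < 0)
T = min(pos, neg)   # rank sum with the less frequent sign
```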
Kruskal–Wallis One-Way Analysis
The Kruskal–Wallis test statistic is
H = (12 / (N(N + 1))) Σⱼ (Rⱼ² / nⱼ) − 3(N + 1)
where
Rⱼ = sum of the ranks of group j
N = total number of observations
nⱼ = number of observations in group j
k = number of groups
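A worked sketch of H on hypothetical data (three invented groups with no ties):

```python
groups = [[7, 12, 14], [15, 18, 21], [25, 27, 30]]  # hypothetical, no ties

combined = sorted(v for g in groups for v in g)
rank = {v: i + 1 for i, v in enumerate(combined)}
N = len(combined)

H = (12 / (N * (N + 1))
     * sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
     - 3 * (N + 1))
# H is referred to the chi-square table with k - 1 degrees of freedom
```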
Friedman Two-Way Analysis
The Friedman test statistic is
χ²ᵣ = (12 / (Nk(k + 1))) Σⱼ Rⱼ² − 3N(k + 1)
where
Rⱼ = sum of the ranks of item j
N = number of rows (subjects)
k = number of items
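The formula above can be illustrated with a hypothetical rank matrix (4 invented subjects each ranking 3 items):

```python
# Hypothetical ranks given by 4 subjects (rows) to 3 items (columns)
ranks = [[1, 2, 3],
         [2, 1, 3],
         [1, 2, 3],
         [1, 3, 2]]

N, k = len(ranks), len(ranks[0])
R = [sum(row[j] for row in ranks) for j in range(k)]  # column rank sums

chi2_r = 12 / (N * k * (k + 1)) * sum(r ** 2 for r in R) - 3 * N * (k + 1)
# referred to the chi-square table with k - 1 degrees of freedom
```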