
Getting started with SPSS Maths and Statistics Help Centre


community project encouraging academics to share statistics support resources

All stcp resources are released under a Creative Commons licence

MASH SPSS sessions

Getting started with SPSS



Data sets used in this booklet
Statistical Analysis Cycle
Introduction to data
  Data types
  What is SPSS?
    Opening an Excel file in SPSS
      Titanic data
      Exercise 1: Were wealthy people more likely to survive on the Titanic?
    Labelling values
  Summarising categorical data
    Output in SPSS
      Exercise 2: Who are the most dangerous drivers?
Research question 1: Were wealthy people more likely to survive on the Titanic?
  Bar Charts
    Tidying up a bar chart
  Adjusting variables
    Reducing the number of categories
    Changing continuous to categorical variables
      Exercise 3
Summary statistics and graphs: Continuous data
  Averages
  Measures of spread
  Which summary statistics should be used
    Ex 4: Comparison of continuous data by group
      Exercise 5
Research question 2: Which of three diets was best?
  Calculations using variables
    Summary statistics for groups in tables
    Scatterplots
  Summary of descriptive and graphical statistics
Research question 3: Which variables are strongly related to birthweight?
  Exercise 6
  Exercise 7
Getting SPSS on your home computer
MASH contact details
Solutions to exercises

Data sets used in this booklet

All the data needed for this booklet is contained in the Excel file ‘all_data_for_MASH_workshops’. Download this file from the MASH workshops web page (www.sheffield.ac.uk/mash/workshops) and save it on your computer before use. Once saved, close the file.

Datasets:

Dataset | Description
Titanic | List of 1309 passengers on board the Titanic when it sank, with details such as gender, class and whether they survived.
Diet | 78 people were put on one of three diets, with the goal of determining which diet was best.
Birthweight | Details for a number of babies and their parents, such as weight and length of babies at birth and weight and height of the mother.

Statistical Analysis Cycle

Introduction to data

SECONDARY data is data collected by someone else, e.g. data from the National Student Survey.

PRIMARY data is data collected by the researcher, e.g. by producing a questionnaire. If you are producing a questionnaire, think very carefully about the questions.

QUANTITATIVE DATA is numeric and a variety of statistical techniques can be used to summarise and analyse the data.

QUALITATIVE data is collected using open ended questions such as ‘What do you like best about your course?’.

For all types of quantitative data, the data is likely to end up in a spreadsheet with individuals/subjects in rows and each column representing a variable, e.g. the answer to Q1 from a questionnaire or heart rate after running for 5 minutes.

A variable is just a measurement which varies between subjects e.g. height or the answer to a question.

One variable per column

One subject per row



Data types

In order to choose suitable summary statistics and analysis for the data, it is also important to distinguish between continuous (numerical) measurements and categorical variables. The choice of variable necessary to answer the main research questions should be considered at the planning rather than the analysis stage.

NOMINAL data is categorical data with no order. The labels just name the category.
Examples: Department; Marital status; ‘What is your favourite animal?’ (Dog, Cat, Horse, Hamster, Fish, Other)

ORDINAL data has a recognisable order, e.g. 1st, 2nd, 3rd.
Likert scales are ordinal, e.g. Strongly disagree to Strongly agree.
The categories can be numbered but the numbers are no different to names.
The gap between 1st and 2nd may be different to the gap between 2nd and 3rd.

DISCRETE data can only take whole numbers.
Examples: number of children, how many times you have been on holiday this year.

CONTINUOUS data can be measured on any scale.
Examples: height, anything that can have decimals.
Discrete data is usually treated as continuous in analysis.

In most situations, the key distinction is between continuous/scale/ measurement data and categorical variables. Different summary statistics, charts and statistical tests are needed for the two types of variables. If discrete variables have a fairly large range of numbers, they can be treated as continuous for analysis purposes.

Data variables:

Measurements/scale (appear as meaningful numbers)
  Continuous: takes any value, e.g. height
  Discrete/count: takes whole numbers, e.g. number of children in a family

Categorical (appear as categories)
  Ordinal: meaningfully ordered, e.g. ‘agree strongly’ to ‘disagree strongly’ questions
  Nominal: no meaningful order, e.g. eye colour

What is SPSS?

SPSS is similar to Excel but it’s easier to produce charts and carry out analysis. To open SPSS, select IBM SPSS Statistics from ‘All programs’. Before SPSS opens, an additional screen appears. You can open a dataset from this screen but it’s easiest to just select ‘Type in data’ every time. Data can be opened after SPSS is opened.

Version 21 and below: select ‘Type in data’ on this screen.

In version 22, select ‘New Dataset’ and ‘OK’.

Example of data sheet in SPSS

Variable headings can only appear at the top in the blue boxes.

Unlike Excel, you can only have one dataset on each page of SPSS. A new file must be created for each individual data set.

Opening an Excel file in SPSS

Important note: The Excel file must have only one row of headings for SPSS to open it correctly.

If SPSS is not open, open SPSS. When prompted to open a file, select ‘Type in data’.

To open any file in SPSS, select File → Open → Data. Here we are opening the ‘Titanic’ data which is currently in Excel, so select ‘Excel’ as ‘Type of file’. Note: The Excel file must not be open on your computer.

SPSS only opens one sheet of data at a time so select the required sheet containing the Titanic data.

Once the data is in SPSS, save the SPSS data file using File → Save as. Save again after making changes to the data.

Titanic data

The ship ‘The Titanic’ sank in 1912 along with most of its passengers and crew. The data set that we have contains information on 1309 passengers.

Exercise 1: Were wealthy people more likely to survive on the Titanic?

Once the data set is open on your computer, give the following variables suitable labels, label the values for categorical variables and select the correct data type.

Variable name | Variable label | Value label | Data type
pclass | Class | 1 = 1st, 2 = 2nd, 3 = 3rd |
survived | | 0 = Died, 1 = Survived |
Residence | Country of Residence | 0 = American, 1 = British, 2 = Other |
age | | |
sibsp | Number of siblings/spouses | |
parch | Number of parents/children on board | |
fare | Price of ticket | |
Gender | Gender | 0 = male, 1 = female |

a) Which variables would you use to investigate the research question ‘Were wealthy people more likely to survive the sinking of the Titanic?’

There are two sheets for each dataset. The ‘Data View’ sheet is where the numbers are entered and the ‘Variable View’ sheet is where the variables are named and defined. The option to choose between Data View and Variable View is in the bottom left hand corner. For data in categories, type numbers in the Data View sheet and then label the numbers in ‘Variable View’.

Select Variable View to label the variables/values.

There should be one row per person, not one row per group.

Variable View: Label the variables

The variable name has restrictions: it cannot contain spaces or certain characters. Use the ‘Label’ column to give sensible variable descriptions which will appear in all output. If the label is blank, the variable name will appear in output.

For example, sibsp is ‘Number of siblings/spouses on board’, parch is ‘Number of parents/children on board’ and fare is ‘Price of ticket’.

Labelling values

It is best to have your categories coded as numbers for analysis in SPSS but for your output, people need to know what the numbers mean. Go to the ‘Values’ column in Variable View and let the mouse hover until you see a blue square. Clicking the square opens the ‘Value labels’ box. Put the number in the value box and the label for that number in the label box. Click on ‘Add’ after each label and ‘OK’ when finished. For example, for survived, enter 0 = Died and 1 = Survived, clicking ‘Add’ after each one.

Also, when using secondary data, watch for odd values, such as -99 indicating a missing value. These can be identified in the ‘Missing’ column so they are not taken into account in any analysis.
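The idea behind value labels and missing-value codes can be sketched outside SPSS. The short Python example below is only an illustration: the list of codes is made up, and treating -99 as the missing code is an assumption taken from the example above rather than a feature of the Titanic file.

```python
# Codes that should be treated as missing (assumed here to be -99 only).
MISSING_CODES = {-99}

# Value labels for the 'survived' variable: number -> label.
SURVIVED_LABELS = {0: "Died", 1: "Survived"}


def label_value(code, labels, missing_codes=MISSING_CODES):
    """Return the label for a code, or None if the code marks a missing value."""
    if code in missing_codes:
        return None
    return labels[code]


# Hypothetical coded data, as it might be typed into Data View.
survived_codes = [1, 0, 0, -99, 1]

labelled = [label_value(code, SURVIVED_LABELS) for code in survived_codes]
valid = [value for value in labelled if value is not None]

print(labelled)                      # labels as they would appear in output
print(len(valid), "valid values")    # missing values are left out of analysis
```

As in SPSS, the numbers stay in the data and the labels are only applied when results are displayed.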


Note: There are two variables for gender. ‘Sex’ is a string variable (words) whereas ‘Gender’ has 0 for males and 1 for females, so ‘Gender’ should be used during analysis.

Variable Type: SPSS only analyses Numeric variables. String means it’s a word. The width is the number of numbers/letters allowed for that variable.

Decimals: When typing in data, the default number of decimals is 2. Change this to 0 for categorical and discrete data.

The Measure column is where the data type is entered. Continuous and discrete data are called ‘Scale’ in SPSS. SPSS won’t allow certain analyses for the wrong type of variable.

Summarising categorical data

The simplest way to summarise a single categorical variable is by using frequencies or percentages.

Analyse → Descriptive statistics → Frequencies

Move the variables to be summarised from the list on the left hand side to the right using the arrow in the middle. Here, move the variable for the number of parents/children on board and survival to the right hand side and click ‘OK’ to run the analysis.

Output in SPSS

Charts, tables and analysis appear in a separate ‘output’ window in SPSS. The output window is brought to the front of the screen when analysis/charts etc. are requested. The left hand column shows all of the output produced in that session. The output file has to be saved separately from the data file.

To go back to the data file, select it on the bottom toolbar.

In the frequency table, use the Valid Percent column as it does not include missing values.
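To see why the Valid Percent column differs from the Percent column, the sketch below builds a small frequency table in Python. The responses are hypothetical and the function name is ours; it is meant only to show the arithmetic, not to reproduce the SPSS output.

```python
from collections import Counter


def frequency_table(values, missing=None):
    """Return ({category: (count, percent, valid_percent)}, number_missing)."""
    total = len(values)
    valid = [v for v in values if v is not missing]
    counts = Counter(valid)
    table = {}
    for category, count in sorted(counts.items()):
        percent = 100 * count / total              # out of all rows
        valid_percent = 100 * count / len(valid)   # out of non-missing rows
        table[category] = (count, round(percent, 1), round(valid_percent, 1))
    return table, total - len(valid)


# Hypothetical responses: None marks a missing value.
survived = ["Died", "Survived", "Died", None, "Died", "Survived", None, "Died"]

table, n_missing = frequency_table(survived)
for category, (count, percent, valid_percent) in table.items():
    print(f"{category:<10} n={count}  percent={percent}  valid percent={valid_percent}")
print("Missing:", n_missing)
```

With two of the eight values missing, ‘Died’ is 50% of all rows but 66.7% of the valid rows, which is the figure that should be reported.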



As SPSS produces a lot of output for analysis and you may produce several charts before you decide which one is best, copying the output you require for your project and pasting into a Word document is preferable.

Quick question: What percentage of people survived the sinking of the Titanic?

Exercise 2: Who are the most dangerous drivers?

Often we are interested in looking at the relationship between two variables. We start by investigating how age and gender relate to the number of car accidents in the UK. Stacked or multiple bar charts can summarise this type of information. The following multiple bar chart is taken from an article in the Guardian.

http://www.theguardian.com/politics/reality-check/2013/oct/11/dangerous-drivers-how-old-uk-age-18

a) Which gender is most likely to have an accident?

b) Which age group is most likely to have an accident?

c) The point of the chart should have been to look at how likely people were to have an accident by age and gender. What is wrong with the chart regarding addressing this research question?



Research question 1: Were wealthy people more likely to survive on the Titanic?

In general, using percentages to summarise categorical data is preferable, although with small numbers percentages can be misleading, e.g. ‘100% of people agree that mascara A is better than mascara B’ when only 2 people have been asked!

Suitable charts for categorical data are bar charts and pie charts.

A contingency table is a way of summarising two categorical variables. However, care needs to be taken when comparing groups of different sizes.

If class had an effect on survival, a higher percentage of people in one class would have survived. If class had no effect, roughly the same percentage would have survived in each class.

To break down survival by class, a crosstabulation or contingency table is needed. Percentages are usually preferable to frequencies but remember to include counts for small sample sizes. Choose either row or column percentages carefully.

Analyse → Descriptive statistics → Crosstabs

1) Select the variable class and move it to the ‘Row’ box. Move survival to the ‘Column’ box.
2) Move selected variables using the arrow.
3) Select ‘Cells’ to get the % options. Choose row %’s.
4) Select ‘OK’ when finished and the table appears in the output window.
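The row percentages requested in the Crosstabs steps can be illustrated with a few lines of Python. This is a sketch using a handful of made-up records, not the real Titanic data, and the function name is hypothetical.

```python
from collections import Counter


def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and return percentages within each row."""
    counts = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    columns = sorted({col for _, col in pairs})
    table = {}
    for row in sorted(row_totals):
        table[row] = {
            col: round(100 * counts[(row, col)] / row_totals[row], 1)
            for col in columns
        }
    return table


# Hypothetical (class, outcome) records.
records = (
    [("1st", "Survived")] * 3 + [("1st", "Died")] * 1
    + [("3rd", "Survived")] * 2 + [("3rd", "Died")] * 6
)

table = row_percentages(records)
for passenger_class, row in table.items():
    print(passenger_class, row)
```

Because class is in the rows, each row adds up to 100%, so the survival percentages can be compared directly even though the classes contain different numbers of people.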


Bar Charts

Plotting graphs in SPSS is much easier than in Excel. All graphs can be accessed through

Graphs → Legacy Dialogs

There is a chart builder option but the legacy dialogs options are more user friendly. To display the information from the cross-tabulation graphically, use either a stacked or clustered bar chart. Both of these can be accessed through

Graphs → Legacy Dialogs → Bar

When defining the chart, choose one variable to go across the x-axis and one variable to split the bars. The screenshots also highlight an option that turns the bars into 100% for each class.

Tidying up a bar chart

Double click on the chart to open an editing window.

The font in graphs is usually small so adjust the axis titles etc. Select each axis and change the font size to 12. The axis titles and percentages displayed on the bars can also be changed in this way.

Use the data label option (highlighted in the screenshot) to add labels to the bars. % is more useful than the count, so move it to the ‘Displayed’ box and remove ‘Count’. Use ‘Number Format’ to reduce to 0 decimal places.

Finally, give the chart a title and change the label on the y axis from ‘Count’ to ‘Percentage’.

When finished, close the chart editor to return to the main output window. Right click on the chart in the output window, then copy and paste it into Word. Sometimes you may need to select ‘Copy Special’ to move charts.

Pasting as a picture enables easy resizing of graphs/ output in Word.

It is clear from the bar chart that the percentage of those dying increased as class lowered. 38% of passengers in 1st class died compared to 74% in 3rd class. Is this a significant difference? To answer this, hypothesis testing is needed.



Adjusting variables

Reducing the number of categories

Sometimes categories can be merged if not all the information is needed. For example, a common summary is to calculate the percentage who agreed from a Likert scale, i.e. % agree or strongly agree compared to everything else.

Use ‘Recode into different variables’ rather than ‘Recode into same variables’ so that the re-coding can be checked.

If there are numerous variables to be recoded in the same way, transfer several variables at the same time. Each variable needs an individual name though. Click ‘Change’ after each new name.

Here a new variable is created where 0 = 3rd class and 1 = 1st or 2nd class.

Transform → Recode into different variables

1) Move ‘class’ across.
2) Give the new variable a name, then click ‘Change’.
3) Enter each old value and its new value. You must click ‘Add’ after each change to add it to the ‘Old → New’ box.

Select ‘Continue’ and then ‘OK’ to produce the new variable. Then label 0 = 3rd class and 1 = 1st or 2nd class in the value label box in Variable View. Finally do a cross-tabulation of the old and new variables to check the re-coding is correct.

In the cross-tabulation of the old variable against the new variable, all 1st and 2nd class passengers have been correctly recoded as ‘1st or 2nd class’.
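The recode-then-check routine can be sketched in Python as follows. The passenger classes below are hypothetical and the variable names are ours; the point is that the original variable is kept so that old and new values can be cross-tabulated.

```python
from collections import Counter

# Old value -> new value, mirroring the 'Old -> New' box.
# 1 = 1st or 2nd class, 0 = 3rd class.
RECODE = {1: 1, 2: 1, 3: 0}

# Hypothetical passenger classes, not the real Titanic data.
pclass = [1, 3, 2, 3, 3, 1, 2, 3]

# Recode into a different variable so the original is kept for checking.
class_group = [RECODE[value] for value in pclass]

# Cross-tabulate old against new: each old value should map to one new value.
check = Counter(zip(pclass, class_group))
for (old, new), count in sorted(check.items()):
    print(f"old={old}  new={new}  count={count}")
```

If a combination such as old=2, new=0 appeared in the check, the recoding rule would have been entered incorrectly.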



Changing continuous to categorical variables

Although it is not recommended as information is lost, continuous (scale) variables can be categorised. Here we will create a new variable identifying children of 12 and under within the Titanic data set.

1. Move ‘age’ across.
2. Give the new variable a name, then click ‘Change’.
3. Old values of age up to 12 are now going to be 1.
4. Enter the new value.
5. You must click ‘Add’ after each change to add it to the ‘Old → New’ box.

Go to Variable View and label 0 as ‘Adult’ and 1 as ‘Child’.

Use ‘Crosstabs’ for the old and new variable to check the re-coding is correct, i.e. age vs Child to see that all those of 12 and under are classified as a child.

Exercise 3

Were Americans more likely to survive than the British? Produce suitable summary statistics/charts to investigate this.



Summary statistics and graphs: Continuous data

Continuous variables can be summarised using statistics such as the mean, median, standard deviation, minimum and maximum values. For continuous data, plotting a histogram gives an idea of the shape and spread of the distribution and helps assess whether the variable is normally distributed. Box-plots can also be used and are particularly useful when comparing groups. The minimum and maximum help check for outliers and possible data entry errors.

Averages

Mode: The value which occurs most often
Mean: Sum of the values / number of values
Median: The middle value of ordered data

Measures of spread

Example data: 7 7 8 8 9 10 13 13 13 14 30

Range = maximum value – minimum value = 30 – 7 = 23
Quartiles: These divide the data into 4 parts. 25% of values are below the lower quartile and 25% are above the upper quartile. The median is the 2nd quartile, so 50% of subjects are below the median. In the example data, the lower quartile is 8, the median is 10 and the upper quartile is 13.
Interquartile range = Upper quartile – lower quartile = 13 – 8 = 5

Quick question: 2 out of 3 people earn less than the average income

1. True
2. False
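The averages and measures of spread for the eleven example values can be checked with Python’s standard `statistics` module. This is only a sketch: statistical packages use slightly different rules for quartiles, and the `method="exclusive"` rule used here happens to reproduce the quartiles of 8 and 13 quoted above.

```python
import statistics

data = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]

mean = statistics.mean(data)        # sum of the values / number of values
median = statistics.median(data)    # middle value of the ordered data
mode = statistics.mode(data)        # value which occurs most often
data_range = max(data) - min(data)  # maximum - minimum

# quantiles() with n=4 returns the three cut points: lower quartile,
# median and upper quartile.
lower_q, _, upper_q = statistics.quantiles(data, n=4, method="exclusive")
iqr = upper_q - lower_q

print("mean", mean, "median", median, "mode", mode)
print("range", data_range, "quartiles", lower_q, upper_q, "IQR", iqr)
```

The single value of 30 pulls the mean (12) above the median (10), while the interquartile range is unaffected by it.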



Variance: Average of the squared deviations from the mean. A deviation is the difference between a single value and the mean.

$$\text{Standard deviation} = \sqrt{\frac{\text{sum of squared differences}}{\text{no. observations} - 1}}$$

Both histograms on the left show approximately the same mean but the second has a much smaller standard deviation as it is less spread out.

Calculating means and standard deviation

Example of calculating the mean and standard deviation, where X = exam score:

[Table: Subject ID, X and deviations from the mean. The outlier contributes most to the sum of squared deviations.]

$$\text{Mean} = \frac{\text{sum of scores}}{\text{number of observations}} = \frac{132}{11} = 12$$

$$\text{SD} = \sqrt{\frac{\text{sum of squared deviations from the mean}}{\text{no. observations} - 1}} = \sqrt{\frac{426}{10}} = \sqrt{42.6} = 6.5$$
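The standard deviation calculation can be followed step by step in Python. The sketch below assumes the worked example uses the eleven values listed under ‘Measures of spread’; this is an assumption, although those values do sum to 132 and give a mean of 12.

```python
import math

# Assumed example data (the values listed under 'Measures of spread').
scores = [7, 7, 8, 8, 9, 10, 13, 13, 13, 14, 30]

n = len(scores)
mean = sum(scores) / n

# A deviation is the difference between a single value and the mean.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_sq = sum(squared_deviations)

variance = sum_sq / (n - 1)   # divide by no. observations - 1
sd = math.sqrt(variance)

# Share of the sum of squares that comes from the largest deviation.
outlier_share = max(squared_deviations) / sum_sq

print(f"mean = {mean}, sum of squared deviations = {sum_sq}, SD = {sd:.1f}")
print(f"largest single contribution = {outlier_share:.0%}")
```

Under this assumption, the value of 30 alone accounts for about three quarters of the sum of squared deviations, which is why outliers inflate the standard deviation so much.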



Which summary statistics should be used

Means and standard deviations are commonly used to summarise continuous data although for skewed data, the median and quartiles are more appropriate. Skewness can be assessed by plotting a histogram of the data. With a large sample of symmetric data, the histogram peaks roughly in the middle. If the histogram peaks at one end or the other, the data is skewed. The histogram below shows male height which is normally distributed. This means that most people are in the middle and the spread is fairly symmetrical about the mean. For normally distributed data, the mean and the median are similar.

Normally distributed data: mean and median are similar
Positively skewed distribution: mean > median
Negatively skewed distribution: mean < median

Quick question solution:

TRUE if you assume average is the mean: two thirds of people earn less than the MEAN wage. As the chart below shows, the data is very skewed. There are a lot of people earning a low wage and a few very high earners pulling the mean up. In this situation, the median better represents the population as a whole.

Chart from ‘How does your wage compare with an MP’s’ http://news.bbc.co.uk/1/hi/8072031.stm

Ex 4: Comparison of continuous data by group

Did the cost of a ticket affect chances of survival?

a) Is there a big difference in average ticket price by group?

b) Which group has data which is more spread out?

c) Is the data skewed?

d) Is the mean or median a better summary measure?

Cost of ticket by survival:

Statistic | Died | Survived
Mean | 23.35 | 49.36
Median | 10.50 | 26.00
Standard Deviation | 34.15 | 68.65
Interquartile range | 18.15 | 46.56
Minimum | 0.00 | 0.00
Maximum | 263.00 | 512.33

DATA: The data set ‘diet’ contains information on 78 people who undertook one of three diets. There is background information such as age and gender as well as weights before and after the diet.

Open the data set from Excel. Go into Variable View and make sure that each variable is correctly categorised, e.g. nominal. Note: continuous is called ‘Scale’ in SPSS. It is important that variables are correctly categorised as SPSS will only carry out some analyses on certain variable types.

There are several ways to produce summary statistics and charts. This option uses ‘Explore’, which contains the most summary statistics, to compare weight before the diet for males and females.

Analyse → Descriptive statistics → Explore

Put ‘Pre-weight’ as the dependent variable and ‘Gender’ in the factor list. The summary statistics will be produced for each gender separately.

Exercise 5:

a) Fill in the following table using the summary statistics table in the output.

Statistic | Female = 0 | Male = 1
Minimum | -70 |
Maximum | 82 |
Mean | 64 |
Median | 66 |
Standard Deviation | 21.6 |

b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?

A box-plot shows the spread of a distribution of values. The box contains the middle 50% of values: its edges are the lower and upper quartiles and the central line is the median. Outliers are plotted as separate points.

c) How could the chart be improved and is there anything odd?

Research question 2: Which of three diets was best?

Before the next section, change the error of -70 to 70 (kg). Outliers should not normally be changed unless they are clearly data entry errors, as in this case.

Give the variables sensible labels and label gender with 0 = Female and 1 = Male.

Re-run Explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?

Statistic | Female with outlier | Female after changing outlier
Minimum | -70 |
Maximum | 82 |
Mean | 64 |
Median | 66 |
Standard Deviation | 21.6 |

Calculations using variables

Producing the charts for gender and weight before the diet was useful for demonstrating SPSS but the main question of interest is ‘Which diet led to greater weight loss?’. How could this be assessed? To answer this, a new variable ‘weight lost’ (weight before – weight after) would be useful. As spaces are not allowed in variable names, use weightLOST as a name and give a better name in the label section in Variable View.

To do this use Transform → Compute variable.

Move ‘Preweight’ into the ‘Numeric Expression’ box, select ‘-’ and then move ‘Weight6week’ across. Selecting ‘All’ gives you a lot of options for calculations, e.g. the mean of several variables.

After putting the calculation into the ‘Numeric Expression’ box, select ‘OK’ and the new variable will appear last in the Data View and Variable View sheets. Before carrying out the official test of a difference, use summary statistics and charts to look at the differences.
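The calculation carried out by Compute variable is a row-by-row subtraction. The Python sketch below shows the same idea on three made-up people; the keys mirror the variable names used in this section but the values are hypothetical.

```python
# Hypothetical records, one dictionary per person (one row per subject).
people = [
    {"Diet": 1, "Preweight": 60.0, "Weight6week": 58.0},
    {"Diet": 2, "Preweight": 72.0, "Weight6week": 69.5},
    {"Diet": 3, "Preweight": 80.0, "Weight6week": 74.0},
]

for person in people:
    # Equivalent of the numeric expression: Preweight - Weight6week
    person["weightLOST"] = person["Preweight"] - person["Weight6week"]

weight_lost = [person["weightLOST"] for person in people]
print(weight_lost)
```

A positive value means the person lost weight and a negative value means they gained weight, so it is worth checking the sign of a few rows by hand.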



Summary statistics for groups in tables

SPSS has a table function which can produce more complicated tables although it is a little temperamental and frustrating at times!

To open the table window: Analyse → Tables → Custom Tables

Drag variables to either the row or column bars to include them in the table. If you want to create sub categories, drag the categorical variable to the front of the variable already in the table. By default, SPSS will choose means to summarise continuous (scale) variables and counts to summarise categorical variables. It is vital that variables are correctly defined as scale or categorical.

1) Move ‘weightLOST’ to the ‘Rows’ section and ‘Diet’ to the ‘Columns’ section.
2) Select the summary statistics you require. Selecting the ‘Summary Statistics’ button opens a window where the statistics displayed can be chosen. The button will only highlight when a variable is selected in the main window, so make sure weightLOST is highlighted in yellow in the central window. Select ‘Standard deviation’ and ‘Count’ from the options and click ‘Apply to all’.
3) Choose ‘Columns’ in the ‘Position’ options for a better display. To make the summary statistics appear down the side, select ‘Rows’ instead of ‘Columns’ from the position box.

Which diet seems the best and which diet has the most variation in weight loss?
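A table of mean, standard deviation and count for each group can be sketched in Python as below. The weight losses are invented for illustration and are not the workshop data, so the output says nothing about which diet is actually best.

```python
import statistics
from collections import defaultdict

# Hypothetical (diet, weight lost in kg) pairs.
records = [
    (1, 2.0), (1, 4.0), (1, 3.0),
    (2, 1.0), (2, 5.0), (2, 3.0),
    (3, 6.0), (3, 4.0), (3, 5.0),
]

# Group the weight losses by diet.
by_diet = defaultdict(list)
for diet, lost in records:
    by_diet[diet].append(lost)

# Mean, sample standard deviation and count for each diet.
summary = {
    diet: {
        "mean": statistics.mean(values),
        "sd": round(statistics.stdev(values), 2),
        "count": len(values),
    }
    for diet, values in sorted(by_diet.items())
}

for diet, stats in summary.items():
    print(diet, stats)
```

In this made-up example diets 1 and 2 have the same mean but diet 2 has twice the standard deviation, which is the kind of difference in variation the table is designed to show.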



Scatterplots

A scatterplot helps assess a relationship between two continuous (scale) variables by plotting a point for each individual based on their scores on the two variables. The closer the points fit a diagonal line, the stronger the relationship.

The scatterplot below shows a negative relationship between a person’s weight and the number of kilometres they run per week, i.e. the more they run, the lighter they generally are, giving a general linear trend downwards. There is one clear outlier who runs a lot but also weighs a lot.

Things to look for in a scatterplot:

How strong is the relationship? The closer the points form a line, the stronger the relationship.

Is there a negative or positive relationship?

Is the relationship linear? Do the points form a straight line?

Are there any outliers that could be data entry errors?
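The strength and direction seen in a scatterplot are often summarised by a correlation coefficient. The sketch below calculates Pearson’s correlation by hand in Python; the choice of Pearson’s coefficient is an assumption, and the runners’ data is invented to mimic the downward trend described above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: close to -1 or +1 when points lie near a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical runners: kilometres run per week and weight in kg.
km_per_week = [5, 10, 15, 20, 25, 30]
weight_kg = [82, 80, 75, 74, 70, 66]

r = pearson_r(km_per_week, weight_kg)
print(f"r = {r:.2f}")
```

A negative value of r corresponds to a downhill scatter and a positive value to an uphill scatter. A single outlier can change r noticeably, so always look at the plot as well.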


A scatterplot can be colour coded by a third categorical variable using the ‘Set marker by’ option within the Graphs → Legacy Dialogs scatterplot menu.

Here, we will look at the relationship between weight before and weight after the diet with different shapes for males and females.

Double click on the chart to open the edit window. To change the shape of the scatter, click on the scatter, then again on just one of the females to open the properties window. Change the marker type and size.

It is clear from the scatterplot that there is a strong positive relationship between a person’s weight before and after the diet. A positive relationship (uphill scatter) means that as the x (horizontal) variable (weight before diet) increases so does the y (vertical) variable. In a negative relationship, y decreases as x increases.



Summary of descriptive and graphical statistics

Chart | Variable type | Purpose | Summary statistics
Pie chart or bar chart | One categorical variable | Shows frequencies/proportions/percentages | Class percentages
Stacked/multiple bar chart | Two categorical variables | Compares proportions within groups | Compare percentages within groups
Histogram | One continuous variable | Shows distribution of results | Mean and standard deviation
Scatter graph | Two continuous variables | Shows relationship between two variables and helps detect outliers | Correlation coefficient
Line chart | Continuous over time; continuous by group | Displays changes over time; comparison of group means | Frequencies; means
Confidence interval plot | Continuous dependent / categorical independent | Comparison of group means | Means and confidence intervals

Research question 3: Which variables are strongly related to birthweight?

Exercise 6:

a) Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table below.

b) What is the average birthweight? Is birthweight normally distributed?

c) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four categories: 1 = Non-smoker, 2 = Light smoker (1 – 10 a day), 3 = Moderate smoker (11 – 20 a day) and 4 = Heavy smoker (21+ a day).

d) Summarise birthweight by smoking category using suitable statistics and a graph.

e) Produce a scatterplot of birthweight and gestational age by smoking category. What is the relationship between the variables?

Variable | Label | Variable type
id | Baby ID |
headcir | Head circumference (cm) |
leng | Length of baby (inches) |
weight | Baby's weight |
gest | Gestational age |
mage | Maternal age |
mnocig | No. cigarettes smoked per day by mother |
mheight | Maternal height |
mppwt | Mother's pre-pregnancy weight |
fage | Father's age |
fedyrs | Years father was in education |
fnocig | No. cigarettes smoked per day by father |
fheight | Father's height |
lowbwt | Low birth weight baby (1 = under 5lbs) |



Exercise 7: Enter the following data into SPSS:

Women | | | | Men | | |
Age | Housework (hrs per week) | Marital status | Hours worked per week | Age | Housework (hrs per week) | Marital status | Hours worked per week
46 | 6 | Married | 35 | 55 | 10 | Married | 28
62 | 8 | Married | 7 | 61 | 0 | Married | 39
42 | 30 | Married | 7 | 39 | 2 | Married | 49
36 | 25 | Married | 18 | 38 | 3 | Married | 40
58 | 30 | Married | 23 | 58 | 4 | Married | 40
36 | 21 | Married | 22 | 31 | 6 | Married | 41
32 | 10 | Married | 24 | 54 | 7 | Married | 42
35 | 14 | Married | 32 | 33 | 4 | Separated | 45
33 | 3 | Married | 36 | 62 | 6 | Divorced | 38
41 | 12 | Married | 36 | 62 | 6 | Widowed | 37
31 | 14 | Separated | 22 | 31 | 2 | Never married | 35
50 | 25 | Divorced | 10 | 32 | 18 | Never married | 25
31 | 15 | Widowed | 15 | 42 | 20 | Never married | 35

a) Investigate the relationship between the amount of housework someone carries out per week and each of the other variables using suitable charts. For scatterplots, have different markers for males and females.

b) Create a new binary variable from ‘Hours worked per week’ to indicate whether someone is full time or part time. Classify part time as under 30 hours.

c) Summarise the amount of housework carried out per week by working full/part time using a table and a plot and interpret.



Getting SPSS on your home computer

Go to the software download page and enter your university login and password:

https://cics.dept.shef.ac.uk/software/

To download software or renew licence codes, click the SPSS Statistics 19-22 button on the page that comes up. You will receive an email containing a download link, a licence code, installation instructions and legal information. The download sometimes takes a long time!

MASH contact details

Book an appointment or access help sheets via our webpage: https://www.shef.ac.uk/mash

Statistics appointments are 10am – 1pm every day in term time with an additional session 4-7pm Wednesdays. For appointments outside of term time see our website or email [email protected].




Solutions to exercises

Exercise 1: Identify the type of variables and key questions of interest for the Titanic dataset

Variable name | Variable label | Value label | Data type
pclass | Class | 1 = 1st, 2 = 2nd, 3 = 3rd | Ordinal
survived | | 0 = Died, 1 = Survived | Nominal
Residence | Country of Residence | 0 = American, 1 = British, 2 = Other | Nominal
age | | | Scale
sibsp | Number of siblings/spouses | | Scale
parch | Number of parents/children on board | | Scale
fare | Price of ticket | | Scale
Gender | Gender | 0 = male, 1 = female | Nominal (binary)

Were wealthy people more likely to survive? Which variables would you use to investigate this question?

Survival is the outcome. Wealthy could be measured using either class or price of ticket.

Exercise 2: Who are the most dangerous drivers?

In terms of raw frequencies, males and middle-aged people have more accidents. This may be because there are more male and middle-aged drivers on the road, so percentages are better than frequencies.

Given that there are different numbers of drivers in each category and the categories are different widths, the best way to summarise is to compare the proportion of drivers within each category who have accidents. It is clear that male drivers consistently have more accidents and that younger drivers are more likely to have accidents.
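The difference between comparing frequencies and comparing proportions can be seen in a short calculation. This is a minimal Python sketch; the age bands and counts are hypothetical figures invented for illustration and are not the values from the drivers data set.

```python
# Hypothetical counts (illustrative only; not the booklet's data)
drivers = {"17-24": 2000, "25-59": 12000, "60+": 4000}
accidents = {"17-24": 300, "25-59": 900, "60+": 200}

# Percentage of drivers within each age band who had an accident
rates = {band: 100 * accidents[band] / drivers[band] for band in drivers}

# The band with the most accidents is not the band with the highest rate
most_accidents = max(accidents, key=accidents.get)
highest_rate = max(rates, key=rates.get)

for band in drivers:
    print(band, accidents[band], f"{rates[band]:.1f}%")
print("Most accidents:", most_accidents)
print("Highest accident rate:", highest_rate)
```

With these made-up numbers the middle band has the most accidents simply because it contains the most drivers, while the youngest band has the highest percentage of drivers having accidents.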



Exercise 3: Investigate whether nationality and survival were related

56% of Americans survived compared to 32% of British passengers and 32% of other nationalities.

Exercise 4: Comparison of continuous data by group

Did the cost of a ticket affect chances of survival?

a) Is there a big difference in average ticket price by group? Yes. The mean and median ticket prices are much higher in the group who survived.

b) Which group has data which is more spread out? The standard deviation is double in the group who survived, so there is much more variation in that group.

c) Is the data skewed? Yes, it is very positively skewed. There are a lot of people with cheap tickets and not so many with expensive tickets.

d) Is the mean or median a better summary measure? The median, as the data is very skewed.

Cost of ticket | Died | Survived
Mean | 23.35 | 49.36
Median | 10.50 | 26.00
Interquartile range | 18.15 | 46.56
Minimum | 0.00 | 0.00
Maximum | 263.00 | 512.33



Exercise 5:

a) Fill in the following table using the summary statistics table in the output.

Statistic | Female = 0 | Male = 1
Minimum | -70 | 71
Maximum | 82 | 88
Mean | 64 | 79
Median | 66 | 79
Standard deviation | 21.6 | 5

b) Interpret the summary statistics by gender. Which group has the higher mean and which group is more spread out?

Standard deviation: The standard deviation for men (5) is much smaller than the standard deviation for women (21.6), so the weights for women are more spread out. However, the data entry error needs to be corrected and the statistics run again.

Averages: Females had a mean weight of 64kg and a median of 66kg before the diet. The difference between the two measures suggests that the data may be skewed. Males had the higher mean, with a mean and median pre-diet weight of 79kg, suggesting that their data is roughly symmetric, which is consistent with a normal distribution.

Minimum/maximum: Are there any extreme outliers? Someone weighed -70kg before the diet, which is clearly an error. Outliers cannot always be removed or changed, but here the real weight is clearly 70kg, so make that adjustment and re-run the analysis. What effect has this had on the summary statistics?

c) How could the chart be improved and is there anything odd? Better labelling of variables. Someone weighed -70kg, which is clearly wrong.

Before the next section, change the error of -70 to 70. Outliers should not normally be changed unless they are clearly data entry errors, as in this case. Give the variables sensible labels and label gender with 0 = Female and 1 = Male.

Re-run explore to see how the change has affected the summary statistics. Which summary statistics have changed the most?

Statistic | Female with outlier | Female after changing outlier
Minimum | -70 | 58
Maximum | 82 | 82
Mean | 64 | 67
Median | 66 | 67

The mean, standard deviation, minimum and maximum are more influenced by outliers than the median and interquartile range.
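The effect of a single data entry error on each summary statistic can be checked with a small calculation. This is a minimal Python sketch using the standard `statistics` module; the ten weights are hypothetical values chosen for illustration, not the diet data, and `summarise` is a helper name introduced here. Quartiles are calculated with Python's default method, which may differ slightly from the values SPSS reports.

```python
import statistics


def summarise(values):
    """Return the mean, median, standard deviation and interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


# Hypothetical weights in kg (illustrative only)
correct = [58, 62, 64, 66, 67, 68, 70, 72, 75, 82]

# The same data with 70 mistyped as -70
with_error = [-70 if w == 70 else w for w in correct]

clean_summary = summarise(correct)
error_summary = summarise(with_error)

for name in ("mean", "median", "sd", "iqr"):
    change = error_summary[name] - clean_summary[name]
    print(f"{name}: change of {change:.2f} caused by the error")
```

In this made-up example the mean and standard deviation shift far more than the median and interquartile range, which is the pattern described above.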



Exercise 6:

a) Open the data set ‘birthweight’ from Excel. Label the variables with the labels in the table provided.

All the variables are continuous/discrete apart from ‘Low birth weight’, which is binary.

b) What is the average birthweight? Is birthweight normally distributed?

The smallest baby in the data set was 3.3 pounds and the largest 11.4 pounds. The mean birthweight is 7.52 pounds and the median 7.6 pounds. The histogram shows that birthweight is approximately normally distributed.

c) Recode the variable mnocig (cigarettes smoked by the mother per day) into the following four categories: 1 = Non-smoker, 2 = Light smoker (1 – 10 per day), 3 = Moderate smoker (11 – 20 per day) and 4 = Heavy smoker (21+ per day).

d) Summarise birthweight by smoking category using suitable statistics and a graph.

The means of the groups are similar, ranging from 6.97 pounds for moderate smokers to 7.73 pounds for non-smokers. The standard deviations are similar, suggesting a similar spread of birthweights within each category.

For the plots, either a confidence interval plot or a boxplot would be a useful representation of the differences between the groups.
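The logic of the recode and the group summary can be sketched outside SPSS as follows. This is a minimal Python illustration, not SPSS syntax; the function name `smoking_category` and the eight records are hypothetical and are not taken from the birthweight data set. Only the category boundaries come from the exercise.

```python
import statistics
from collections import defaultdict

LABELS = {
    1: "Non-smoker",
    2: "Light smoker",
    3: "Moderate smoker",
    4: "Heavy smoker",
}


def smoking_category(cigs_per_day):
    """Recode cigarettes per day into the four categories in the exercise."""
    if cigs_per_day == 0:
        return 1  # non-smoker
    if cigs_per_day <= 10:
        return 2  # light smoker: 1 to 10 a day
    if cigs_per_day <= 20:
        return 3  # moderate smoker: 11 to 20 a day
    return 4      # heavy smoker: 21 or more a day


# Hypothetical (cigarettes per day, birthweight in pounds) records
records = [(0, 7.9), (0, 7.5), (5, 7.4), (10, 7.0),
           (12, 6.8), (20, 7.2), (25, 7.1), (35, 6.9)]

# Group birthweights by the recoded smoking category
by_category = defaultdict(list)
for cigs, weight in records:
    by_category[smoking_category(cigs)].append(weight)

# Count, mean and standard deviation for each category
for code in sorted(by_category):
    weights = by_category[code]
    print(LABELS[code], len(weights),
          round(statistics.mean(weights), 2),
          round(statistics.stdev(weights), 2))
```

Checking the boundary values (0, 1, 10, 11, 20 and 21) is a quick way to confirm that the ranges in a recode do not overlap or leave gaps.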



The boxplots show that the medians for the four groups are fairly similar and the interquartile range (middle 50% of the values) is of a similar width. Each boxplot is fairly symmetrical about the median, suggesting the values are approximately normally distributed within each group.

Produce a scatterplot of birthweight and gestational age. What is the relationship between the two?

There is a moderate positive relationship between gestational age and birthweight: as gestational age increases, birthweight tends to increase. There is no clear relationship between smoking and either weight or gestational age. There is one oddity though. A standard pregnancy is 40 weeks. Most women are induced by 42 weeks, but there seem to be quite a few above 42 weeks. It’s likely that this is old data, perhaps from a time when gestational age estimation was less accurate.

Exercise 7: Enter the following data into SPSS:

The data should have been entered like this and the categorical numbers labelled.

a) Investigate the relationship between the amount of housework someone carries out per week and each of the other variables using suitable charts. For scatterplots, have different markers for males and females.



The graph suggests a strong negative relationship between weekly hours of work and hours of housework. This means that the more hours someone works, the less housework they do. For males, both the amount of housework they do and the hours they work are less spread out.

There doesn’t appear to be a relationship between age and the amount of housework someone does.

The highest medians are for those never married and those who are divorced. The data for those never married is very skewed. However, the sample size is small so not much can be concluded. How many are in each category?



The summary statistics show that there are only 2 or 3 people in most of the categories so using summary statistics could be misleading. Merging suitable categories would be advisable.

Produce a plot comparing those working full/ part time for hours of housework and interpret.

Hours per week on housework by working status

Statistic | Part time | Full time
Mean | 18.73 | 6.33
Median | 18.00 | 6.00

There is clearly a difference in the amount of housework carried out per week between those working full and part time. Those working part time carry out 19 hours of housework a week on average compared to 6 hours a week by those working full time. The amount of housework is more spread out for part-time workers (SD = 8 compared to SD = 5 for full-time workers).
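These figures can be reproduced from the Exercise 7 data. The following minimal Python sketch uses the housework and working hours of the 26 people in the exercise table, in the order the rows appear, with part time classified as under 30 hours as instructed. The variable names are introduced here for illustration, and the code is a cross-check rather than the SPSS procedure itself.

```python
import statistics

# (housework hours per week, hours worked per week) for the 26 people
# in the Exercise 7 table, in the order the rows appear
people = [
    (6, 35), (10, 28), (8, 7), (0, 39), (30, 7), (2, 49), (25, 18),
    (3, 40), (30, 23), (4, 40), (21, 22), (6, 41), (10, 24), (7, 42),
    (14, 32), (4, 45), (3, 36), (6, 38), (12, 36), (6, 37), (14, 22),
    (2, 35), (25, 10), (18, 25), (15, 15), (20, 35),
]

FULL_TIME_HOURS = 30  # part time is classified as under 30 hours

# Split housework hours by the new binary working-status variable
part_time = [housework for housework, worked in people
             if worked < FULL_TIME_HOURS]
full_time = [housework for housework, worked in people
             if worked >= FULL_TIME_HOURS]

for label, group in (("Part time", part_time), ("Full time", full_time)):
    print(label,
          "n =", len(group),
          "mean =", round(statistics.mean(group), 2),
          "median =", statistics.median(group),
          "SD =", round(statistics.stdev(group), 1))
```

The group sizes (11 part time, 15 full time) are worth reporting alongside the means, since both groups are small.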
