Workshop #2
Combined and Comparative Metrics
Human Computer Interaction / COG3103, 2016 Fall
Class hours: Monday 1-3 pm / Wednesday 2-3 pm
Lecture room: Widang Hall 209
5th December
Independent & Dependent Variables
Workshop #3 COG_Human Computer Interaction 2
• Independent variables:
– The things you manipulate or control for
– Aspects of the study that you manipulate
– Chosen based on the research question
– e.g.:
• Characteristics of participants (e.g., age, sex, relevant experience)
• Different designs or prototypes being tested
• Tasks
• Dependent variables:
– The things you measure
– Describe what happened as a result of the study
– Something you measure as a result of, or as dependent on, how you manipulate the independent variables
– e.g.:
• Task success
• Task time
• SUS score
• etc.
Need to have a clear idea of what you plan to manipulate and what you plan to measure
Designing a Usability Study
• RQ 1
– Research question: differences in performance between males and females
– Independent variable: gender
– Dependent variable: task completion time
• RQ 2
– Research question: differences in satisfaction between novice and expert users
– Independent variable: experience level
– Dependent variable: satisfaction
Types of Data
• Nominal (aka Categorical)
– e.g., Male, Female; Design A, Design B.
• Ordinal
– e.g., Rank ordering of 4 designs tested from Most Visually Appealing to
Least Visually Appealing.
• Interval
– e.g., 7-point scale of agreement: “This design is visually appealing.
Strongly Disagree . . . Strongly Agree”
• Ratio
– e.g., Time, Task Success %
NOMINAL DATA
• Definition
– Unordered groups or categories
– Without order, you cannot say one category is better than another
• May describe characteristics of users, i.e., independent variables that allow you to segment the data
– Windows versus Mac users
– Geographical location
– Males versus females
• What about dependent variables?
– Number of users who clicked on A vs. B
– Task success
• Usage
– Counts and frequencies
ORDINAL DATA
• Definition
– Ordered groups and categories
– Data is ordered in a certain way, but the intervals between measurements are not meaningful
• Ordinal data often comes from self-reported data on questionnaires
– Website rated as excellent, good, fair, or poor
– Severity rating of problem encountered as high, medium, or low
• Usage
– Looking at frequencies
– Calculating an average is meaningless (the distance between high and medium may not be the same as between medium and low)
INTERVAL DATA
• Definition
– Continuous data where differences between the measurements are meaningful
– Zero point on the scale is arbitrary
• System Usability Scale (SUS)
– Example of interval data
– Based on self-reported data from a series of questions about overall usability
– Scores range from 0 to 100
• Higher score indicates better usability
• Distance between points is meaningful because it indicates an increase or decrease in perceived usability
• Usage
– Able to calculate descriptive statistics such as average, standard deviation, etc.
– Inferential statistics can be used to generalize to a population
Ordinal vs. Interval Rating Scales
• Are these two scales different?
• Top scale is ordinal. You should only calculate frequencies of each
response.
• Bottom scale can be considered interval. You can also calculate
means.
RATIO DATA
• Definition
– Same as interval data, with the addition of an absolute zero
– Zero has inherent meaning
• Example
– The difference between a person aged 35 and one aged 38 is the same as the difference between people aged 12 and 15
– With time to completion, you can say that one participant is twice as fast as another
• Usage
– Most analyses you do work with both ratio and interval data
– The geometric mean is an exception: it requires ratio data
Statistics for each Data Type
Confidence Intervals
• Assume this was your time data for a study with 5 participants:
Does that make a difference in your answer?
Calculating Confidence Intervals

=CONFIDENCE(<alpha>,<std dev>,<n>)

– <alpha> is normally .05 (for a 95% confidence interval)
– <std dev> is the standard deviation of the set of numbers (9.6 in this example)
– <n> is how many numbers are in the set (5 in this example)
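The same margin of error can be sketched in Python with only the standard library. The numbers mirror the slide's example (standard deviation 9.6, n = 5); `confidence_margin` is an illustrative helper name, not a standard function.

```python
import math
from statistics import NormalDist

def confidence_margin(alpha, std_dev, n):
    # Margin of error for the mean under a normal distribution --
    # equivalent to Excel's =CONFIDENCE(alpha, std_dev, n).
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return z * std_dev / math.sqrt(n)

# The slide's example: standard deviation 9.6, n = 5 participants
margin = confidence_margin(0.05, 9.6, 5)
print(round(margin, 2))  # → 8.41
```

You would then report the mean plus or minus this margin, e.g. as error bars on a chart.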
Excel Example: Showing Error Bars
Binary Success
• Pass/fail (or other binary criteria)
• 1’s (success) and 0’s (failure)
Confidence Interval for Task Success
• When you look at task success data across participants for a single task, the data is commonly binary:
– Each participant either passed or failed the task.
• In this situation, you need to calculate the confidence interval using
the binomial distribution.
Example
– The easiest way to calculate the confidence interval is to use Jeff Sauro’s web calculator:
– http://www.measuringusability.com/wald.htm
1=success, 0=failure. So, 6/8 succeeded, or 75%.
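As a sketch of what such a calculator computes: the adjusted-Wald (Agresti-Coull) interval works well for the small samples typical of usability tests. The helper name `adjusted_wald_ci` is illustrative, not code from the calculator itself.

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    # Adjusted-Wald (Agresti-Coull) confidence interval for a proportion:
    # add z^2/2 pseudo-successes and z^2 pseudo-trials, then apply Wald.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid [0, 1] range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# The slide's example: 6 of 8 participants succeeded (75%)
low, high = adjusted_wald_ci(6, 8)
print(f"{low:.1%} .. {high:.1%}")  # → 40.1% .. 93.7%
```

So even with an observed 75% success rate, 8 participants only pin the true rate down to roughly 40-94%.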
Chi-square
• Allows you to compare actual and expected frequencies for
categorical data.
=CHITEST(<actual range>,<expected range>)
Excel Example
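A minimal sketch of the statistic underneath =CHITEST, using hypothetical click data (40 vs. 60 clicks against an expected 50/50 split); Excel goes one step further and returns the corresponding p-value.

```python
def chi_square_stat(observed, expected):
    # Pearson's chi-square statistic: sum of (O - E)^2 / E over all cells.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 100 users clicked either Design A or Design B.
# Observed: 40 clicked A, 60 clicked B; expected under "no preference": 50/50.
stat = chi_square_stat([40, 60], [50, 50])
print(stat)  # → 4.0

# With 1 degree of freedom, the .05 critical value is 3.841;
# 4.0 > 3.841 suggests a real preference between the designs.
```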
Comparing Means: T-test
• Independent samples (between subjects)
– e.g., Apollo websites, task times
• Paired samples (within subjects)
– e.g., Haptic mouse study
T-tests in Excel

=TTEST(<array1>,<array2>,x,y)

x = 2 (for a two-tailed test) in almost all cases
y = 2 for independent samples; y = 1 for paired samples
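The t statistics behind both =TTEST variants can be sketched in plain Python. The task-time data here are hypothetical, and Excel additionally converts the t statistic into a p-value.

```python
import math
from statistics import mean, stdev

def t_independent(a, b):
    # Pooled-variance t statistic (the form behind =TTEST(...,2,2)).
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def t_paired(a, b):
    # Paired-samples t statistic (=TTEST(...,2,1)): a t-test on the differences.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical task times (seconds) for two conditions, 5 participants each
a = [10, 12, 9, 11, 13]
b = [8, 11, 9, 10, 12]
print(round(t_independent(a, b), 3))  # → 1.0
print(round(t_paired(a, b), 3))      # → 3.162
```

Note how the paired test is far more sensitive here: the same data yields a much larger t because each participant serves as their own baseline.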
Comparing Multiple Means
• Analysis of Variance (ANOVA)
Excel example (study comparing 4 navigation approaches for a website): “Tools” > “Data Analysis” > “Anova: Single Factor”
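The F statistic that “Anova: Single Factor” reports can be sketched as follows, using three hypothetical groups of task scores (the navigation-study data itself is not in the transcript).

```python
from statistics import mean

def one_way_anova_f(*groups):
    # One-way ANOVA F statistic: between-group mean square
    # divided by within-group mean square.
    all_vals = [x for g in groups for x in g]
    grand = mean(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical task scores for three navigation designs
f = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(round(f, 2))  # → 13.0
```

A large F means the variation between group means is big relative to the variation within groups; Excel converts it to a p-value for you.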
SUS
• Developed at Digital Equipment Corp.
• Consists of ten items.
• Adapted here by replacing “system” with “website”.
• Each item is a statement (positive or negative) rated on a five-point scale from “Strongly Disagree” to “Strongly Agree.”
• For details see
http://www.usabilitynet.org/trump/documents/Suschapt.doc
SUS
(Each item is rated on five circles from Strongly Disagree to Strongly Agree)
1. I think I would like to use this website frequently. O O O O O
2. I found the website unnecessarily complex. O O O O O
3. I thought the website was easy to use. O O O O O
4. I think I would need Tech Support to be able to use this website. O O O O O
5. I found the various functions in this website were well integrated. O O O O O
6. I thought there was too much inconsistency in this website. O O O O O
7. I would imagine that most people would learn to use this website very quickly. O O O O O
8. I found the website very cumbersome to use. O O O O O
9. I felt very confident using the website. O O O O O
10. I needed to learn a lot about this website before I could effectively use it. O O O O O
SUS Scoring
• SUS yields a single number representing a composite measure of the
overall usability of the system being studied. Note that scores for
individual items are not meaningful on their own.
• To calculate the SUS score:
– Each item's score contribution will range from 0 to 4.
– For items 1, 3, 5, 7, and 9, the score contribution is the scale position minus 1.
– For items 2, 4, 6, 8, and 10, the contribution is 5 minus the scale position.
– Multiply the sum of the scores by 2.5 to obtain the overall SUS score.
• SUS scores have a range of 0 to 100.
http://www.measuringux.com/SUS_Calculation.xls
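The scoring rules above can be sketched directly; the response set below is hypothetical, not from the slides.

```python
def sus_score(responses):
    # responses: ten scale positions (1-5), in questionnaire order.
    assert len(responses) == 10
    # Odd-numbered items (positive statements): position - 1
    # Even-numbered items (negative statements): 5 - position
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5  # scale the 0-40 total up to 0-100

# Hypothetical responses from one fairly satisfied participant
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 5, 1]))  # → 85.0
```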
SUS Scoring Example
Total = 22; SUS Score = 22 × 2.5 = 55
SUS Usage
– “SUS has been made freely available for use in usability assessment, and
has been used for a variety of research projects and industrial
evaluations; the only prerequisite for its use is that any published report
should acknowledge the source of the measure.”
SUS Data from 50 Studies
[Histogram: frequency distribution of average SUS scores for 129 conditions from 50 studies, in bins <=40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100]

Percentiles: 10th = 47.4, 25th = 56.7, 50th = 68.9, 75th = 76.7, 90th = 81.2; Mean = 66.4
http://www.measuringux.com/SUS-scores.xls
Combined Metrics
• Often it’s useful to combine different metrics to get an overall usability
measure.
• The challenge is combining metrics that have different scales, e.g.:
– Task completion: % correct
– Task time: Seconds
– Subjective rating: SUS score
• Two common techniques:
– Combine using percentages
– Combine using z-scores
Combine Based on Percentages
• The basic idea is to convert each of the metrics to a percentage and then average them together.
• For each metric to be transformed, you want:
– 0% to represent the worst possible score
– 100% to represent the best possible score
• Some metrics already are a percentage:
– SUS scores
– % correct tasks
Sample Data
Original data:

Participant #   Time per Task (sec)   Tasks Completed (of 15)   Rating (0-4)
1               65                    7                         2.4
2               50                    9                         2.6
3               34                    13                        3.1
4               70                    6                         1.7
5               28                    11                        3.2
6               52                    9                         3.3
7               58                    8                         2.5
8               60                    7                         1.4
9               25                    9                         3.8
10              55                    10                        3.6
Averages        49.7                  8.9                       2.8
Sample Data
Original data with percentage transformations added:

P#    Time  Tasks  Rating  Time %  Tasks %  Rating %  Average
1     65    7      2.4     38%     47%      60%       48%
2     50    9      2.6     50%     60%      65%       58%
3     34    13     3.1     74%     87%      78%       79%
4     70    6      1.7     36%     40%      43%       39%
5     28    11     3.2     89%     73%      80%       81%
6     52    9      3.3     48%     60%      83%       64%
7     58    8      2.5     43%     53%      63%       53%
8     60    7      1.4     42%     47%      35%       41%
9     25    9      3.8     100%    60%      95%       85%
10    55    10     3.6     45%     67%      90%       67%
Avg   49.7  8.9    2.8     57%     59%      69%       62%

(Time in seconds per task; Tasks completed of 15; Rating on a 0-4 scale)
Excel spreadsheet
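The transformation in the table appears to be: Time % = fastest time / time (so the fastest participant defines 100%), Tasks % = completed / 15, and Rating % = rating / 4. A sketch under that assumption:

```python
from statistics import mean

# The study's data, as listed in the table
times   = [65, 50, 34, 70, 28, 52, 58, 60, 25, 55]            # seconds per task
tasks   = [7, 9, 13, 6, 11, 9, 8, 7, 9, 10]                   # completed, of 15
ratings = [2.4, 2.6, 3.1, 1.7, 3.2, 3.3, 2.5, 1.4, 3.8, 3.6]  # 0-4 scale

best_time = min(times)  # the fastest participant defines 100%

rows = []
for t, k, r in zip(times, tasks, ratings):
    time_pct   = best_time / t  # shorter time -> higher percentage
    tasks_pct  = k / 15
    rating_pct = r / 4
    rows.append((time_pct, tasks_pct, rating_pct,
                 mean([time_pct, tasks_pct, rating_pct])))

# Participant 1 comes out at 38%, 47%, 60% -> 48%, matching the table
print(["{:.0%}".format(v) for v in rows[0]])  # → ['38%', '47%', '60%', '48%']
```

Note that this time transformation is relative to the best observed time; an alternative is to anchor 100% to a predefined target time.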
Combine Using Z-scores
• Another method sometimes used is z-score transformation:
– Convert each participant’s score for each metric to a z-score.
• Z-scores are based on the normal distribution.
• They have a mean of 0 and standard deviation of 1.
• Use the STANDARDIZE function in Excel.
– Average the z-scores for each person to get an overall z-score.
• Make sure all scales go the same direction.
– Must decide whether each score is going to be given equal weight.
Z-score Transformation Example
P#    Time   Tasks  Rating  z Time  z Time*-1  z Tasks  z Rating  Average z
1     65     7      2.4     0.98    -0.98      -0.91    -0.46     -0.78
2     50     9      2.6     0.02    -0.02      0.05     -0.20     -0.06
3     34     13     3.1     -1.01   1.01       1.97     0.43      1.14
4     70     6      1.7     1.30    -1.30      -1.39    -1.35     -1.35
5     28     11     3.2     -1.39   1.39       1.01     0.56      0.99
6     52     9      3.3     0.15    -0.15      0.05     0.69      0.20
7     58     8      2.5     0.53    -0.53      -0.43    -0.33     -0.43
8     60     7      1.4     0.66    -0.66      -0.91    -1.73     -1.10
9     25     9      3.8     -1.59   1.59       0.05     1.32      0.98
10    55     10     3.6     0.34    -0.34      0.53     1.07      0.42
Avg   49.7   8.9    2.8     0.00    0.00       0.00     0.00      0.00
SD    15.57  2.08   0.79    1.00    1.00       1.00     1.00      0.90
=STANDARDIZE(B2,$B$12,$B$13)
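The same transformation, including the sign flip on time so that higher always means better, can be sketched with the standard library (`statistics.stdev` is the sample standard deviation, matching Excel's STDEV):

```python
from statistics import mean, stdev

def standardize(values):
    # Like Excel's =STANDARDIZE(x, mean, stdev), applied to a whole column.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

times   = [65, 50, 34, 70, 28, 52, 58, 60, 25, 55]
tasks   = [7, 9, 13, 6, 11, 9, 8, 7, 9, 10]
ratings = [2.4, 2.6, 3.1, 1.7, 3.2, 3.3, 2.5, 1.4, 3.8, 3.6]

# Flip the sign on time so that "higher = better" holds for every metric
z_time   = [-z for z in standardize(times)]
z_tasks  = standardize(tasks)
z_rating = standardize(ratings)

# Equal-weight overall z-score per participant
overall = [mean(zs) for zs in zip(z_time, z_tasks, z_rating)]
print(round(overall[0], 2))  # → -0.78, matching participant 1 in the table
```

Because each z-score column has mean 0, the overall scores also average to 0: the combined metric only ranks participants relative to this study's mean, unlike the percentage method.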
What We Have Been Through

• Ticket buyer (casual new user, for occasional personal use)
  Walk-up ease of use for new user / Initial user performance
  BT1: Buy special event ticket; average time on task; target: 3 minutes
• Ticket buyer (casual new user, for occasional personal use)
  Walk-up ease of use for new user / Initial user performance
  BT2: Buy movie ticket; average number of errors; target: <1
• Ticket buyer (casual new user, for occasional personal use)
  Initial customer satisfaction / First impression
  Questions Q1-Q10 in questionnaire XYZ; average rating across users and across questions; target: 7.5/10
[Diagram labels: Research Question, IV, Prototype, DV, Scales of Measures, Constructs, mapped to what was covered this week versus the homework]
The Data Set
Participant #   Time per Task (sec)   Tasks Completed (of 15)   Rating (0-4)
1               65                    7                         2.4
2               50                    9                         2.6
3               34                    13                        3.1
4               70                    6                         1.7
5               28                    11                        3.2
6               52                    9                         3.3
7               58                    8                         2.5
8               60                    7                         1.4
9               25                    9                         3.8
10              55                    10                        3.6
Averages        49.7                  8.9                       2.8
Standard Dev.   15.57                 2.08                      0.79
Percentages
P#    Time   Tasks  Rating  Time %  Tasks %  Rating %  Average
1     65     7      2.4     38%     47%      60%       48%
2     50     9      2.6     50%     60%      65%       58%
3     34     13     3.1     74%     87%      78%       79%
4     70     6      1.7     36%     40%      43%       39%
5     28     11     3.2     89%     73%      80%       81%
6     52     9      3.3     48%     60%      83%       64%
7     58     8      2.5     43%     53%      63%       53%
8     60     7      1.4     42%     47%      35%       41%
9     25     9      3.8     100%    60%      95%       85%
10    55     10     3.6     45%     67%      90%       67%
Avg   49.7   8.9    2.8     57%     59%      69%       62%
SD    15.57  2.08   0.79    0.23    0.14     0.20      0.16
Standardized Evaluation Results
P#    Time   Tasks  Rating  Time%  Tasks%  Rating%  Avg%   zTime  zTime*-1  zTasks  zRating  Avg z
1     65     7      2.4     38%    47%     60%      48%    0.98   -0.98     -0.91   -0.46    -0.78
2     50     9      2.6     50%    60%     65%      58%    0.02   -0.02     0.05    -0.20    -0.06
3     34     13     3.1     74%    87%     78%      79%    -1.01  1.01      1.97    0.43     1.14
4     70     6      1.7     36%    40%     43%      39%    1.30   -1.30     -1.39   -1.35    -1.35
5     28     11     3.2     89%    73%     80%      81%    -1.39  1.39      1.01    0.56     0.99
6     52     9      3.3     48%    60%     83%      64%    0.15   -0.15     0.05    0.69     0.20
7     58     8      2.5     43%    53%     63%      53%    0.53   -0.53     -0.43   -0.33    -0.43
8     60     7      1.4     42%    47%     35%      41%    0.66   -0.66     -0.91   -1.73    -1.10
9     25     9      3.8     100%   60%     95%      85%    -1.59  1.59      0.05    1.32     0.98
10    55     10     3.6     45%    67%     90%      67%    0.34   -0.34     0.53    1.07     0.42
Avg   49.7   8.9    2.8     57%    59%     69%      62%    0.00   0.00      0.00    0.00     0.00
SD    15.57  2.08   0.79    0.23   0.14    0.20     0.16   1.00   1.00      1.00    1.00     0.90

(P# = Participant; Time in seconds per task; Tasks completed of 15; Rating on a 0-4 scale)
These standardized evaluation results can be used throughout the iterative design process.