Workshop #2
Combined and Comparative Metrics
Human Computer Interaction / COG3103, 2016 Fall
Class hours: Monday 1-3 pm / Wednesday 2-3 pm
Lecture room: Widang Hall 209
5th December
Independent & Dependent Variables
Workshop #3 COG_Human Computer Interaction 2
• Independent variables:
– The things you manipulate or control for
– Aspects of the study that you manipulate
– Chosen based on the research question
– e.g.:
• Characteristics of participants (e.g., age, sex, relevant experience)
• Different designs or prototypes being tested
• Tasks
• Dependent variables:
– The things you measure
– Describe what happened as a result of the study
– Something you measure as a result of, or as dependent on, how you manipulate the independent variables
– e.g.:
• Task success
• Task time
• SUS score
• etc.
Need to have a clear idea of what you plan to manipulate and what you plan to measure
Designing a Usability Study
• RQ 1
– Research question: differences in performance between males and females
– Independent variable: gender
– Dependent variable: task completion time
• RQ 2
– Research question: differences in satisfaction between novice and expert users
– Independent variable: experience level
– Dependent variable: satisfaction
Types of Data
• Nominal (aka Categorical)
– e.g., Male, Female; Design A, Design B.
• Ordinal
– e.g., Rank ordering of 4 designs tested from Most Visually Appealing to
Least Visually Appealing.
• Interval
– e.g., 7-point scale of agreement: “This design is visually appealing.
Strongly Disagree . . . Strongly Agree”
• Ratio
– e.g., Time, Task Success %
NOMINAL DATA
• Definition
– Unordered groups or categories
– Without order, you cannot say one category is better than another
• May describe characteristics of users, i.e., independent variables that allow you to segment the data
– Windows versus Mac users
– Geographical location
– Males versus females
• What about dependent variables?
– Number of users who clicked on A vs. B
– Task success
• Usage
– Counts and frequencies
ORDINAL DATA
• Definition
– Ordered groups and categories
– Data is ordered in a certain way, but the intervals between measurements are not meaningful
• Ordinal data often comes from self-reported data on questionnaires
– Website rated as excellent, good, fair, or poor
– Severity rating of problem encountered as high, medium, or low
• Usage
– Looking at frequencies
– Calculating an average is meaningless (the distance between high and medium may not be the same as between medium and low)
INTERVAL DATA
• Definition
– Continuous data where differences between the measurements are meaningful
– Zero point on the scale is arbitrary
• System Usability Scale (SUS)
– Example of interval data
– Based on self-reported data from a series of questions about overall usability
– Scores range from 0 to 100
• Higher score indicates better usability
• Distance between points is meaningful because it indicates an increase or decrease in perceived usability
• Usage
– Able to calculate descriptive statistics such as average, standard deviation, etc.
– Inferential statistics can be used to generalize to a population
Ordinal vs. Interval Rating Scales
• Are these two scales different?
• Top scale is ordinal. You should only calculate frequencies of each
response.
• Bottom scale can be considered interval. You can also calculate
means.
RATIO DATA
• Definition
– Same as interval data, with the addition of an absolute zero
– Zero has inherent meaning
• Example
– The difference between a person aged 35 and one aged 38 is the same as the difference between people aged 12 and 15
– With time to completion, you can say that one participant is twice as fast as another
• Usage
– Most analyses you do work with both ratio and interval data
– The geometric mean is an exception: it requires ratio data
Statistics for each Data Type
Confidence Intervals
• Assume this was your time data for a study with 5 participants:
Does that make a difference in your answer?
Calculating Confidence Intervals

=CONFIDENCE(<alpha>,<std dev>,<n>)

– <alpha> is normally .05 (for a 95% confidence interval)
– <std dev> is the standard deviation of the set of numbers (9.6 in this example)
– <n> is how many numbers are in the set (5 in this example)
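The same margin of error can be sketched in Python with only the standard library. The numbers mirror the slide's example (standard deviation 9.6, n = 5); `confidence_margin` is an illustrative helper name, not a standard function.

```python
import math
from statistics import NormalDist

def confidence_margin(alpha, std_dev, n):
    # Margin of error for the mean under a normal distribution --
    # equivalent to Excel's =CONFIDENCE(alpha, std_dev, n).
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return z * std_dev / math.sqrt(n)

# The slide's example: standard deviation 9.6, n = 5 participants
margin = confidence_margin(0.05, 9.6, 5)
print(round(margin, 2))  # → 8.41
```

You would then report the mean plus or minus this margin, e.g. as error bars on a chart.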
Excel Example: Showing Error Bars
Binary Success
• Pass/fail (or other binary criteria)
• 1’s (success) and 0’s (failure)
Confidence Interval for Task Success
• When you look at task success data across participants for a single task, the data is commonly binary:
– Each participant either passed or failed the task.
• In this situation, you need to calculate the confidence interval using
the binomial distribution.
Example
– The easiest way to calculate the confidence interval is to use Jeff Sauro’s web calculator:
– http://www.measuringusability.com/wald.htm
1=success, 0=failure. So, 6/8 succeeded, or 75%.
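As a sketch of what such a calculator computes: the adjusted-Wald (Agresti-Coull) interval works well for the small samples typical of usability tests. The helper name `adjusted_wald_ci` is illustrative, not code from the calculator itself.

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    # Adjusted-Wald (Agresti-Coull) confidence interval for a proportion:
    # add z^2/2 pseudo-successes and z^2 pseudo-trials, then apply Wald.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid [0, 1] range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# The slide's example: 6 of 8 participants succeeded (75%)
low, high = adjusted_wald_ci(6, 8)
print(f"{low:.1%} .. {high:.1%}")  # → 40.1% .. 93.7%
```

So even with an observed 75% success rate, 8 participants only pin the true rate down to roughly 40-94%.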
Chi-square
• Allows you to compare actual and expected frequencies for
categorical data.
=CHITEST(<actual range>,<expected range>)
Excel Example
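A minimal sketch of the statistic underneath =CHITEST, using hypothetical click data (40 vs. 60 clicks against an expected 50/50 split); Excel goes one step further and returns the corresponding p-value.

```python
def chi_square_stat(observed, expected):
    # Pearson's chi-square statistic: sum of (O - E)^2 / E over all cells.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 100 users clicked either Design A or Design B.
# Observed: 40 clicked A, 60 clicked B; expected under "no preference": 50/50.
stat = chi_square_stat([40, 60], [50, 50])
print(stat)  # → 4.0

# With 1 degree of freedom, the .05 critical value is 3.841;
# 4.0 > 3.841 suggests a real preference between the designs.
```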
Comparing Means: T-test
• Independent samples (between subjects)
– e.g., Apollo websites, task times
• Paired samples (within subjects)
– e.g., Haptic mouse study
T-tests in Excel

=TTEST(<array1>,<array2>,x,y)

x = 2 (for a two-tailed test) in almost all cases
y = 2 for independent samples; y = 1 for paired samples
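The t statistics behind both =TTEST variants can be sketched in plain Python. The task-time data here are hypothetical, and Excel additionally converts the t statistic into a p-value.

```python
import math
from statistics import mean, stdev

def t_independent(a, b):
    # Pooled-variance t statistic (the form behind =TTEST(...,2,2)).
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def t_paired(a, b):
    # Paired-samples t statistic (=TTEST(...,2,1)): a t-test on the differences.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical task times (seconds) for two conditions, 5 participants each
a = [10, 12, 9, 11, 13]
b = [8, 11, 9, 10, 12]
print(round(t_independent(a, b), 3))  # → 1.0
print(round(t_paired(a, b), 3))      # → 3.162
```

Note how the paired test is far more sensitive here: the same data yields a much larger t because each participant serves as their own baseline.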
Comparing Multiple Means
• Analysis of Variance (ANOVA)
Excel example (study comparing 4 navigation approaches for a website): “Tools” > “Data Analysis” > “Anova: Single Factor”
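The F statistic that “Anova: Single Factor” reports can be sketched as follows, using three hypothetical groups of task scores (the navigation-study data itself is not in the transcript).

```python
from statistics import mean

def one_way_anova_f(*groups):
    # One-way ANOVA F statistic: between-group mean square
    # divided by within-group mean square.
    all_vals = [x for g in groups for x in g]
    grand = mean(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical task scores for three navigation designs
f = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(round(f, 2))  # → 13.0
```

A large F means the variation between group means is big relative to the variation within groups; Excel converts it to a p-value for you.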
SUS
• Developed at Digital Equipment Corp.
• Consists of ten items.
• Adapted here by replacing “system” with “website”.
• Each item is a statement (positive or negative) rated on a five-point scale from “Strongly Disagree” to “Strongly Agree.”
• For details see
http://www.usabilitynet.org/trump/documents/Suschapt.doc
SUS
(Each item is rated on five circles from Strongly Disagree to Strongly Agree)
1. I think I would like to use this website frequently. O O O O O
2. I found the website unnecessarily complex. O O O O O
3. I thought the website was easy to use. O O O O O
4. I think I would need Tech Support to be able to use this website. O O O O O
5. I found the various functions in this website were well integrated. O O O O O
6. I thought there was too much inconsistency in this website. O O O O O
7. I would imagine that most people would learn to use this website very quickly. O O O O O
8. I found the website very cumbersome to use. O O O O O
9. I felt very confident using the website. O O O O O
10. I needed to learn a lot about this website before I could effectively use it. O O O O O
SUS Scoring
• SUS yields a single number representing a composite measure of the
overall usability of the system being studied. Note that scores for
individual items are not meaningful on their own.
• To calculate the SUS score:
– Each item's score contribution will range from 0 to 4.
– For items 1, 3, 5, 7, and 9, the score contribution is the scale position minus 1.
– For items 2, 4, 6, 8, and 10, the contribution is 5 minus the scale position.
– Multiply the sum of the scores by 2.5 to obtain the overall SUS score.
• SUS scores have a range of 0 to 100.
http://www.measuringux.com/SUS_Calculation.xls
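The scoring rules above can be sketched directly; the response set below is hypothetical, not from the slides.

```python
def sus_score(responses):
    # responses: ten scale positions (1-5), in questionnaire order.
    assert len(responses) == 10
    # Odd-numbered items (positive statements): position - 1
    # Even-numbered items (negative statements): 5 - position
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5  # scale the 0-40 total up to 0-100

# Hypothetical responses from one fairly satisfied participant
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 5, 1]))  # → 85.0
```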
SUS Scoring Example
Total = 22; SUS Score = 22 × 2.5 = 55
SUS Usage
– “SUS has been made freely available for use in usability assessment, and
has been used for a variety of research projects and industrial
evaluations; the only prerequisite for its use is that any published report
should acknowledge the source of the measure.”
SUS Data from 50 Studies
[Histogram: frequency distribution of average SUS scores for 129 conditions from 50 studies, in bins <=40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100]

Percentiles: 10th = 47.4, 25th = 56.7, 50th = 68.9, 75th = 76.7, 90th = 81.2; Mean = 66.4
http://www.measuringux.com/SUS-scores.xls
Combined Metrics
• Often it’s useful to combine different metrics to get an overall usability
measure.
• The challenge is combining metrics that have different scales, e.g.:
– Task completion: % correct
– Task time: Seconds
– Subjective rating: SUS score
• Two common techniques:
– Combine using percentages
– Combine using z-scores
Combine Based on Percentages
• The basic idea is to convert each of the metrics to a percentage and then average them together.
• For each metric to be transformed, you want:
– 0% to represent the worst possible score
– 100% to represent the best possible score
• Some metrics already are a percentage:
– SUS scores
– % correct tasks
Sample Data
Original data:

Participant #   Time per Task (sec)   Tasks Completed (of 15)   Rating (0-4)
1               65                    7                         2.4
2               50                    9                         2.6
3               34                    13                        3.1
4               70                    6                         1.7
5               28                    11                        3.2
6               52                    9                         3.3
7               58                    8                         2.5
8               60                    7                         1.4
9               25                    9                         3.8
10              55                    10                        3.6
Averages        49.7                  8.9                       2.8
Sample Data
Original data with percentage transformations added:

P#    Time  Tasks  Rating  Time %  Tasks %  Rating %  Average
1     65    7      2.4     38%     47%      60%       48%
2     50    9      2.6     50%     60%      65%       58%
3     34    13     3.1     74%     87%      78%       79%
4     70    6      1.7     36%     40%      43%       39%
5     28    11     3.2     89%     73%      80%       81%
6     52    9      3.3     48%     60%      83%       64%
7     58    8      2.5     43%     53%      63%       53%
8     60    7      1.4     42%     47%      35%       41%
9     25    9      3.8     100%    60%      95%       85%
10    55    10     3.6     45%     67%      90%       67%
Avg   49.7  8.9    2.8     57%     59%      69%       62%

(Time in seconds per task; Tasks completed of 15; Rating on a 0-4 scale)
Excel spreadsheet
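The transformation in the table appears to be: Time % = fastest time / time (so the fastest participant defines 100%), Tasks % = completed / 15, and Rating % = rating / 4. A sketch under that assumption:

```python
from statistics import mean

# The study's data, as listed in the table
times   = [65, 50, 34, 70, 28, 52, 58, 60, 25, 55]            # seconds per task
tasks   = [7, 9, 13, 6, 11, 9, 8, 7, 9, 10]                   # completed, of 15
ratings = [2.4, 2.6, 3.1, 1.7, 3.2, 3.3, 2.5, 1.4, 3.8, 3.6]  # 0-4 scale

best_time = min(times)  # the fastest participant defines 100%

rows = []
for t, k, r in zip(times, tasks, ratings):
    time_pct   = best_time / t  # shorter time -> higher percentage
    tasks_pct  = k / 15
    rating_pct = r / 4
    rows.append((time_pct, tasks_pct, rating_pct,
                 mean([time_pct, tasks_pct, rating_pct])))

# Participant 1 comes out at 38%, 47%, 60% -> 48%, matching the table
print(["{:.0%}".format(v) for v in rows[0]])  # → ['38%', '47%', '60%', '48%']
```

Note that this time transformation is relative to the best observed time; an alternative is to anchor 100% to a predefined target time.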
Combine Using Z-scores
• Another method sometimes used is z-score transformation:
– Convert each participant’s score for each metric to a z-score.
• Z-scores are based on the normal distribution.
• They have a mean of 0 and standard deviation of 1.
• Use the STANDARDIZE function in Excel.
– Average the z-scores for each person to get an overall z-score.
• Make sure all scales go the same direction.
– Must decide whether each score is going to be given equal weight.
Z-score Transformation Example
P#    Time   Tasks  Rating  z Time  z Time*-1  z Tasks  z Rating  Average z
1     65     7      2.4     0.98    -0.98      -0.91    -0.46     -0.78
2     50     9      2.6     0.02    -0.02      0.05     -0.20     -0.06
3     34     13     3.1     -1.01   1.01       1.97     0.43      1.14
4     70     6      1.7     1.30    -1.30      -1.39    -1.35     -1.35
5     28     11     3.2     -1.39   1.39       1.01     0.56      0.99
6     52     9      3.3     0.15    -0.15      0.05     0.69      0.20
7     58     8      2.5     0.53    -0.53      -0.43    -0.33     -0.43
8     60     7      1.4     0.66    -0.66      -0.91    -1.73     -1.10
9     25     9      3.8     -1.59   1.59       0.05     1.32      0.98
10    55     10     3.6     0.34    -0.34      0.53     1.07      0.42
Avg   49.7   8.9    2.8     0.00    0.00       0.00     0.00      0.00
SD    15.57  2.08   0.79    1.00    1.00       1.00     1.00      0.90
=STANDARDIZE(B2,$B$12,$B$13)
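The same transformation, including the sign flip on time so that higher always means better, can be sketched with the standard library (`statistics.stdev` is the sample standard deviation, matching Excel's STDEV):

```python
from statistics import mean, stdev

def standardize(values):
    # Like Excel's =STANDARDIZE(x, mean, stdev), applied to a whole column.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

times   = [65, 50, 34, 70, 28, 52, 58, 60, 25, 55]
tasks   = [7, 9, 13, 6, 11, 9, 8, 7, 9, 10]
ratings = [2.4, 2.6, 3.1, 1.7, 3.2, 3.3, 2.5, 1.4, 3.8, 3.6]

# Flip the sign on time so that "higher = better" holds for every metric
z_time   = [-z for z in standardize(times)]
z_tasks  = standardize(tasks)
z_rating = standardize(ratings)

# Equal-weight overall z-score per participant
overall = [mean(zs) for zs in zip(z_time, z_tasks, z_rating)]
print(round(overall[0], 2))  # → -0.78, matching participant 1 in the table
```

Because each z-score column has mean 0, the overall scores also average to 0: the combined metric only ranks participants relative to this study's mean, unlike the percentage method.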
What We Have Been Through

• Ticket buyer (casual new user, for occasional personal use)
  Walk-up ease of use for new user / Initial user performance
  BT1: Buy special event ticket; average time on task; target: 3 minutes
• Ticket buyer (casual new user, for occasional personal use)
  Walk-up ease of use for new user / Initial user performance
  BT2: Buy movie ticket; average number of errors; target: <1
• Ticket buyer (casual new user, for occasional personal use)
  Initial customer satisfaction / First impression
  Questions Q1-Q10 in questionnaire XYZ; average rating across users and across questions; target: 7.5/10
[Diagram labels: Research Question, IV, Prototype, DV, Scales of Measures, Constructs, mapped to what was covered this week versus the homework]
The Data Set
Participant #   Time per Task (sec)   Tasks Completed (of 15)   Rating (0-4)
1               65                    7                         2.4
2               50                    9                         2.6
3               34                    13                        3.1
4               70                    6                         1.7
5               28                    11                        3.2
6               52                    9                         3.3
7               58                    8                         2.5
8               60                    7                         1.4
9               25                    9                         3.8
10              55                    10                        3.6
Averages        49.7                  8.9                       2.8
Standard Dev.   15.57                 2.08                      0.79
Percentages
P#    Time   Tasks  Rating  Time %  Tasks %  Rating %  Average
1     65     7      2.4     38%     47%      60%       48%
2     50     9      2.6     50%     60%      65%       58%
3     34     13     3.1     74%     87%      78%       79%
4     70     6      1.7     36%     40%      43%       39%
5     28     11     3.2     89%     73%      80%       81%
6     52     9      3.3     48%     60%      83%       64%
7     58     8      2.5     43%     53%      63%       53%
8     60     7      1.4     42%     47%      35%       41%
9     25     9      3.8     100%    60%      95%       85%
10    55     10     3.6     45%     67%      90%       67%
Avg   49.7   8.9    2.8     57%     59%      69%       62%
SD    15.57  2.08   0.79    0.23    0.14     0.20      0.16
Standardized Evaluation Results
P#    Time   Tasks  Rating  Time%  Tasks%  Rating%  Avg%   zTime  zTime*-1  zTasks  zRating  Avg z
1     65     7      2.4     38%    47%     60%      48%    0.98   -0.98     -0.91   -0.46    -0.78
2     50     9      2.6     50%    60%     65%      58%    0.02   -0.02     0.05    -0.20    -0.06
3     34     13     3.1     74%    87%     78%      79%    -1.01  1.01      1.97    0.43     1.14
4     70     6      1.7     36%    40%     43%      39%    1.30   -1.30     -1.39   -1.35    -1.35
5     28     11     3.2     89%    73%     80%      81%    -1.39  1.39      1.01    0.56     0.99
6     52     9      3.3     48%    60%     83%      64%    0.15   -0.15     0.05    0.69     0.20
7     58     8      2.5     43%    53%     63%      53%    0.53   -0.53     -0.43   -0.33    -0.43
8     60     7      1.4     42%    47%     35%      41%    0.66   -0.66     -0.91   -1.73    -1.10
9     25     9      3.8     100%   60%     95%      85%    -1.59  1.59      0.05    1.32     0.98
10    55     10     3.6     45%    67%     90%      67%    0.34   -0.34     0.53    1.07     0.42
Avg   49.7   8.9    2.8     57%    59%     69%      62%    0.00   0.00      0.00    0.00     0.00
SD    15.57  2.08   0.79    0.23   0.14    0.20     0.16   1.00   1.00      1.00    1.00     0.90

(P# = Participant; Time in seconds per task; Tasks completed of 15; Rating on a 0-4 scale)
These standardized evaluation results can be used throughout the iterative design process.