Date post: | 11-Jan-2016 |
Upload: | ashlyn-gallagher |
SAURO/LEWIS
1 USABILITY TESTING
RESEARCH METHODS IN HCI
HCI RESEARCHERS EMPLOY EMPIRICAL METHODS, TECHNIQUES FOR INVESTIGATING THE WORLD AND COLLECTING EVIDENCE TO PROVE OR DISPROVE THEIR HYPOTHESES ABOUT HOW PEOPLE INTERACT WITH COMPUTERS, AND ABOUT THE USABILITY OF INTERFACES.
LAB EXPERIMENT
AN ARTIFICIAL SITUATION, CREATED BY AND HIGHLY
CONTROLLED BY THE EXPERIMENTER, THAT TYPICALLY COMPARES
ALTERNATIVE USER INTERFACES OR MEASURES HOW
USABILITY VARIES WITH SOME DESIGN PARAMETER.
EXAMPLE: A TEST OF FONT READABILITY, DONE BY BRINGING SUBJECTS
INTO THE EXPERIMENTER’S LAB, ASKING THEM TO READ
TEXT SELECTIONS DISPLAYED WITH
DIFFERENT FONTS, AND TIMING THEIR READING
SPEED.
FIELD STUDY
A REAL SITUATION IN
THE ACTUAL ENVIRONMENT
WHERE PEOPLE USE THE INTERFACE
BEING CONSIDERED, USING REAL TASKS
(RATHER THAN TASKS CONCOCTED BY THE
EXPERIMENTER).
IN HCI, INITIAL FIELD STUDIES JUST
OBSERVE WITHOUT INTERVENING (E.G.,
CONTEXTUAL INQUIRY), WHILE
FINAL FIELD STUDIES DELIVER THE NEW UI AND SEE HOW IT’S
USED.
SURVEY
A QUESTIONNAIRE,
CONDUCTED BY PAPER, PHONE, WEB, OR IN
PERSON.
IN GENERAL, THE RESULTS OF A SURVEY TEND TO APPLY MORE
STRONGLY TO THE WHOLE POPULATION OF PEOPLE RELEVANT TO THE STUDY, SINCE IT IS FAR CHEAPER TO
SURVEY A LARGE NUMBER OF PEOPLE,
AND GOOD STATISTICAL SAMPLING TECHNIQUES EXIST TO
MAKE THE RESULTS MORE
GENERALIZABLE.
2 USABILITY TESTING
[FIGURE: RESEARCH METHODS PLACED ON TWO AXES, OBTRUSIVE VS. UNOBTRUSIVE AND ABSTRACT VS. CONCRETE, WITH REGIONS LABELED FIELD STUDY, SURVEY, AND LAB EXPERIMENT]
IN FIELD STUDIES, SUBJECTS DO THEIR OWN TASKS IN THEIR OWN ENVIRONMENTS.
IN ORDER TO MAKE STRONG STATISTICAL CLAIMS, LAB EXPERIMENTS USE SIMPLIFIED AND HIGHLY CONTROLLED TASKS.
SURVEYS ARE GENERALIZABLE, BUT SUBJECTS ARE AWARE THAT THEY ARE BEING STUDIED AND MAY RESPOND ACCORDINGLY.
3 USABILITY TESTING
QUANTIFYING USABILITY
USABILITY IS THE EXTENT TO WHICH USERS CAN UTILIZE A SYSTEM’S FUNCTIONALITY.
DIMENSIONS OF USABILITY
LEARNABILITY (IS THE SYSTEM EASY TO LEARN?)
EFFICIENCY (ONCE LEARNED, IS THE SYSTEM FAST TO USE?)
RECOVERABILITY (ARE ERRORS FEW AND RECOVERABLE?)
SATISFACTION (IS THE SYSTEM ENJOYABLE TO USE?)
4 USABILITY TESTING
USABILITY TESTING CONSIDERATIONS
NUMEROUS VARIABLES AFFECT THE VALIDITY OF USABILITY TESTS.
SAMPLE SIZE
HOW MANY PARTICIPANTS ARE NEEDED TO ENSURE THE VALIDITY OF THE TEST?
RANDOMNESS
DO NON-PARTICIPANTS HAVE FUNDAMENTALLY DIFFERENT CHARACTERISTICS THAN PARTICIPANTS?
REPRESENTATIVENESS
HOW WELL DOES THE SAMPLE POPULATION REPRESENT THE PARENT POPULATION?
DATA COLLECTION
SHOULD THE DATA BE GATHERED REMOTELY OR IN A MODERATED LAB SESSION?
COMPLETION RATE
HOW MANY PARTICIPANTS SUCCESSFULLY COMPLETE THE ASSIGNED TASK DURING A USABILITY TEST?
TASK TIME
HOW LONG DOES A USER SPEND ON AN ACTIVITY DURING A USABILITY TEST?
5 USABILITY TESTING
CONTROLLED EXPERIMENT
1. START WITH A TESTABLE HYPOTHESIS
• FOR EXAMPLE: “THE MACINTOSH MENU BAR, WHICH IS ANCHORED AT THE TOP OF THE SCREEN, IS FASTER TO ACCESS THAN THE WINDOWS MENU BAR, WHICH IS SEPARATED FROM THE TOP OF THE SCREEN BY A WINDOW TITLE BAR.”
2. CHOOSE THE INDEPENDENT VARIABLES TO MANIPULATE TO TEST THE HYPOTHESIS
• IN THIS CASE, THE Y-POSITION OF THE MENU BAR.
• OTHER POSSIBILITIES: USER CLASSES (NOVICES VS. EXPERTS, MAC USERS VS. WINDOWS USERS), MENU ITEM ARRANGEMENT (ALPHABETIZED VS. FUNCTIONALLY-GROUPED).
3. MEASURE THE DEPENDENT VARIABLES TO TEST THE HYPOTHESIS
• TIME, ERROR RATE, NON-ERROR EVENT COUNT (E.G., NUMBER OF TIMES MENU ITEM IS EXPANDED), USER SATISFACTION (USUALLY VIA A QUESTIONNAIRE).
4. USE STATISTICAL TESTS TO ACCEPT OR REJECT THE HYPOTHESIS
• ANALYZE HOW CHANGES IN THE INDEPENDENT VARIABLES AFFECTED THE DEPENDENT VARIABLES, AND WHETHER THOSE EFFECTS WERE SIGNIFICANT (I.E., UNLIKELY TO HAVE ARISEN BY CHANCE ALONE).
6 USABILITY TESTING
SCHEMATIC VIEW OF EXPERIMENT DESIGN
[FIGURE: A PROCESS BOX, Y = F(X), WITH INPUT X (INDEPENDENT VARIABLES) AND OUTPUT Y (DEPENDENT VARIABLES)]
IDEALLY, THE GOAL IS TO DETERMINE THE PRECISE EFFECT THAT THE INDEPENDENT VARIABLES HAVE ON THE DEPENDENT VARIABLES.
[FIGURE: THE SAME PROCESS BOX WITH ADDITIONAL INPUTS FOR UNKNOWN/UNCONTROLLED VARIABLES, SO THAT Y DEPENDS ON X AND ON THOSE EXTRA VARIABLES]
IN REALITY, HOWEVER, THERE ARE A NUMBER OF UNKNOWN OR UNCONTROLLED VARIABLES THAT ALSO IMPACT THE DEPENDENT VARIABLES (E.G., IN THE MENU BAR EXAMPLE, THE POINTING DEVICE BEING USED, THE ORIGINAL POSITION OF THE MOUSE POINTER, THE SURFACE ON WHICH THE MOUSE IS BEING DRAGGED, THE USER’S LEVEL OF FATIGUE, THE USER’S PREVIOUS EXPERIENCE WITH A PARTICULAR TYPE OF MENU BAR, ETC.).
THE PURPOSE OF EXPERIMENT DESIGN IS TO ELIMINATE (OR AT LEAST TO RENDER HARMLESS) THE EFFECT OF THE UNKNOWN AND UNCONTROLLED VARIABLES, IN ORDER TO ENABLE CONCLUSIONS TO BE DRAWN REGARDING THE EFFECT OF THE INDEPENDENT VARIABLES ON THE DEPENDENT VARIABLES.
7 USABILITY TESTING
DESIGN OF THE MENU BAR EXPERIMENT
WHAT USER POPULATION SHOULD BE SAMPLED?
• MAC USERS VS. WINDOWS USERS?
• YOUNG USERS VS. OLD USERS?
• LEFT-HANDED USERS VS. RIGHT-HANDED USERS?
HOW SHOULD THE TEST BE IMPLEMENTED?
• USING REAL MAC AND WINDOWS INTERFACES?
• IMPLEMENT A SEPARATE INTERFACE THAT AVOIDS CONFOUNDING VARIABLES (SIZE OF THE MENU BAR, READING SPEED OF THE FONT, MOUSE ACCELERATION PARAMETERS, ETC.)?
WHAT TASKS SHOULD THE USERS BE ASSIGNED?
• REALISTIC TASKS (E.G., E-MAIL) THAT CAN BE GENERALIZED BUT MAY PRODUCE DATA “NOISE”?
• ARTIFICIAL TASKS THAT WOULD PRODUCE RELIABLE BUT UNREALISTIC RESULTS?
HOW SHOULD THE TIME VARIABLE BE MEASURED?
• FROM WHEN THE USER IS TOLD WHAT TO DO (“CLICK EDIT”) TO WHEN THE TASK IS COMPLETED?
• FROM THE TIME THE USER STARTS TO MOVE THE MOUSE UNTIL THE TASK IS FINISHED?
IN WHAT ORDER SHOULD TASKS AND INTERFACE CONDITIONS BE ASSIGNED?
• WILL THE USER EXPERIENCE FASTER REACTION TIMES WITH PRACTICE?
• WILL THE USER BECOME FATIGUED IF THE CONDITIONS DON’T VARY?
WHAT HARDWARE SHOULD BE USED?
• SHOULD EVERY USER USE THE SAME COMPUTER?
• SHOULD THE INTERACTIVE DEVICE (MOUSE, TRACKBALL, TOUCHPAD, JOYSTICK) VARY?
8 CONFIDENCE INTERVALS
CONFIDENCE
USUALLY, WHEN WE WANT INFORMATION ABOUT A POPULATION (E.G., ALL AMAZON.COM USERS, ALL SENIOR CITIZENS ON FACEBOOK), THE BEST WE CAN DO IS ESTIMATE, BASED ON A MUCH SMALLER SAMPLE.
A CONFIDENCE INTERVAL IS A RANGE OF VALUES WITH A SPECIFIC PROBABILITY OF CONTAINING THE POPULATION VALUE WE SEEK TO ESTIMATE.
THREE MAIN FACTORS AFFECT THE CONFIDENCE INTERVAL:
1. THE CONFIDENCE LEVEL (I.E., HOW CONFIDENT DO YOU NEED TO BE?)
A 90% CONFIDENCE INTERVAL IS NOTICEABLY NARROWER THAN A 95% CONFIDENCE INTERVAL, WHICH NARROWS DOWN THE RANGE OF ESTIMATED VALUES BUT INCREASES THE CHANCES OF MAKING AN ERROR.
2. THE VARIABILITY (I.E., HOW MUCH DOES THE DATA FLUCTUATE?)
VARIABILITY IS ESTIMATED VIA THE SAMPLE’S STANDARD DEVIATION; THE HIGHER THE VARIABILITY IS, THE WIDER THE CONFIDENCE INTERVAL WILL BE.
3. THE SAMPLE SIZE (I.E., HOW MUCH DATA CAN YOU ACCUMULATE?)
THE CONFIDENCE INTERVAL WIDTH AND THE SAMPLE SIZE HAVE AN INVERSE SQUARE ROOT RELATIONSHIP (E.G., TO CUT THE CONFIDENCE INTERVAL IN HALF, YOU’D NEED TO QUADRUPLE THE SAMPLE SIZE).
9 CONFIDENCE INTERVALS
COMPLETION RATE CONFIDENCE INTERVALS
THE STANDARD FORMULA FOR THE CONFIDENCE INTERVAL FOR THE PERCENTAGE OF A POPULATION THAT WILL BE ABLE TO COMPLETE A PARTICULAR TASK IS:
$$\hat{p} \pm z_{(1-\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
WHERE:
$\hat{p}$ is the proportion of the sample that completed the task
$n$ is the sample size
$z_{(1-\alpha/2)}$ is the critical value from the normal distribution for the confidence level
Confidence level | Critical value
0.80 | 1.28
0.90 | 1.645
0.95 | 1.96
0.99 | 2.575
10 CONFIDENCE INTERVALS
COMPLETION RATE EXAMPLE
FORTY-EIGHT STUDENTS ARE ASKED TO FIND THE CLASS SCHEDULES PAGE ON THE NEWLY REDESIGNED SIUE WEB SITE, BUT ONLY THIRTY-FOUR ARE ABLE TO DO SO.
WHAT WOULD BE THE 95% CONFIDENCE INTERVAL FOR THE PROPORTION OF THE ENTIRE STUDENT POPULATION ABLE TO PERFORM THIS TASK?
$$\hat{p} = \frac{34}{48} \approx 0.708, \qquad n = 48, \qquad z_{(1-\alpha/2)} = 1.96$$
$$\hat{p} \pm z_{(1-\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.708 \pm 1.96 \sqrt{\frac{0.708(1-0.708)}{48}} = 0.708 \pm 0.129$$
SO WE CAN BE 95% CONFIDENT THAT BETWEEN 57.9% AND 83.7% OF THE STUDENTS WILL BE ABLE TO FIND THE CLASS SCHEDULES PAGE ON THE NEW SITE.
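THE CALCULATION ABOVE CAN BE SKETCHED IN A FEW LINES OF PYTHON. THE FUNCTION NAME wald_interval IS AN ILLUSTRATIVE CHOICE, NOT SOMETHING FROM THE SLIDES, AND THE DEFAULT z OF 1.96 ASSUMES A 95% CONFIDENCE LEVEL.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Standard completion-rate interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# The SIUE example: 34 of 48 students found the page.
low, high = wald_interval(34, 48)
print(f"{low:.1%} to {high:.1%}")
```

THE PRINTED LIMITS CAN DIFFER FROM THE SLIDE’S IN THE LAST DIGIT, BECAUSE THE SLIDE ROUNDS THE PROPORTION AND THE MARGIN BEFORE COMBINING THEM.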
11 CONFIDENCE INTERVALS
A SLIGHT ADJUSTMENT
RESEARCH HAS SHOWN THAT WHEN THE SAMPLE COMPLETION RATE IS EXTREME (TOO CLOSE TO 0% OR 100%), A MORE ACCURATE FORMULA FOR THE CONFIDENCE INTERVAL IS NEEDED.
$$\hat{p}_{adj} \pm z_{(1-\alpha/2)} \sqrt{\frac{\hat{p}_{adj}(1-\hat{p}_{adj})}{n_{adj}}}$$
WHERE:
$$\hat{p}_{adj} = \frac{x + \frac{z_{(1-\alpha/2)}^2}{2}}{n_{adj}}, \qquad n_{adj} = n + z_{(1-\alpha/2)}^2$$
$x$ is the number who completed the task in the sample
FOR OUR PREVIOUS EXAMPLE, WHERE THE COMPLETION RATE WAS NOT THAT EXTREME (0.708), THE ADJUSTED 95% CONFIDENCE INTERVAL COMPUTES TO BETWEEN 56.7% AND 81.8%, NOT THAT DIFFERENT FROM THE ORIGINAL INTERVAL OF 57.9% TO 83.7%.
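A MINIMAL SKETCH OF THE ADJUSTED FORMULA, FOR CHECKING THE ARITHMETIC BY HAND. THE NAME adjusted_wald_interval AND THE CLAMPING OF THE LIMITS TO THE RANGE 0 TO 1 ARE ASSUMPTIONS ADDED FOR ILLUSTRATION; THE SLIDES ONLY GIVE THE FORMULA ITSELF.

```python
import math


def adjusted_wald_interval(successes, n, z=1.96):
    """Adjusted interval: add z^2/2 successes and z^2 trials before applying the usual formula."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamping is an added convenience so extreme samples never report limits outside [0, 1].
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


low, high = adjusted_wald_interval(34, 48)
print(f"{low:.1%} to {high:.1%}")
```

THE ADJUSTMENT PULLS THE CENTER OF THE INTERVAL TOWARD 50%, WHICH MATTERS MOST WHEN ALL OR NONE OF A SMALL SAMPLE COMPLETES THE TASK.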
12 CONFIDENCE INTERVALS
CONTINUOUS DATA
WHEN SAMPLE SIZES ARE SMALL AND DATA IS CONTINUOUS (E.G., RATINGS VALUES INSTEAD OF COMPLETION BOOLEANS), USING THE NORMAL DISTRIBUTION CAN BE VERY INACCURATE, SO THE t-DISTRIBUTION IS USED TO ACCOUNT FOR HOW WIDELY THE SAMPLE DATA FLUCTUATES.
$$\bar{x} \pm t_{(1-\alpha/2)} \frac{s}{\sqrt{n}}$$
WHERE:
$\bar{x}$ is the sample mean
$n$ is the sample size
$s$ is the sample standard deviation
$t_{(1-\alpha/2)}$ is the critical value from the t distribution for n−1 degrees of freedom and the specified confidence level
13 CONFIDENCE INTERVALS
REMEMBER THE t STATISTIC?
FIRST, RECALL THAT THE z STATISTIC IS USED TO ANALYZE A SAMPLE WHEN THE POPULATION’S MEAN AND STANDARD DEVIATION ARE KNOWN.
$$z = \frac{M - \mu}{\sigma / \sqrt{n}}$$
WHERE:
$M$ is the sample mean
$n$ is the sample size
$\sigma$ is the population standard deviation
$\mu$ is the population mean
The denominator, $\sigma / \sqrt{n}$, known as the standard error of the mean, is the standard deviation of the means of all size-n samples of the population.
SO, THE z STATISTIC IS THE NUMBER OF STANDARD ERROR UNITS THAT A SAMPLE’S MEAN IS FROM THE POPULATION’S MEAN, ASSUMING A NORMAL DISTRIBUTION.
USING A STANDARD NORMAL DISTRIBUTION TABLE, THE CORRESPONDING p-VALUE CAN BE LOOKED UP, INDICATING THAT THE PROBABILITY IS 1-p THAT A SIZE-n SAMPLE WOULD HAVE A MEAN CLOSER TO μ THAN THE SAMPLE IN QUESTION.
14 CONFIDENCE INTERVALS
POPULATION CRISIS
THE t STATISTIC ALLOWS RESEARCHERS TO USE SAMPLE DATA TO TEST HYPOTHESES ABOUT AN UNKNOWN POPULATION MEAN.
THE PARTICULAR ADVANTAGE OF THE t STATISTIC IS THAT IT DOES NOT REQUIRE ANY KNOWLEDGE OF THE STANDARD DEVIATION OF THE POPULATION.
THUS, THE t STATISTIC CAN BE USED TO TEST HYPOTHESES ABOUT A COMPLETELY UNKNOWN POPULATION, I.E., BOTH μ (THE POPULATION MEAN) AND σ (THE POPULATION STANDARD DEVIATION) ARE UNKNOWN, AND THE ONLY AVAILABLE INFORMATION ABOUT THE POPULATION COMES FROM THE SAMPLE.
ALL THAT IS REQUIRED FOR A HYPOTHESIS TEST WITH t IS A SAMPLE AND A REASONABLE HYPOTHESIS ABOUT THE POPULATION MEAN.
15 CONFIDENCE INTERVALS
THE t STATISTIC
LIKE THE z STATISTIC, THE t STATISTIC FORMS A RATIO.
THE NUMERATOR CONSISTS OF THE OBTAINED DIFFERENCE BETWEEN THE SAMPLE MEAN AND THE HYPOTHESIZED POPULATION MEAN.
THE DENOMINATOR IS THE ESTIMATED STANDARD ERROR (BASED ON THE SAMPLE’S STANDARD DEVIATION, NOT THE POPULATION’S), WHICH MEASURES HOW MUCH DIFFERENCE IS EXPECTED BY CHANCE.
$$t = \frac{M - \mu}{s / \sqrt{n}}$$
NOTE THAT WHEN LOOKING UP THE p-VALUE IN A t DISTRIBUTION TABLE, THE t STATISTIC’S DEPENDENCE ON THE SAMPLE SIZE REQUIRES THAT YOU USE THE DEGREES OF FREEDOM (n-1) TO REFERENCE THE CORRECT t STATISTIC.
16 CONFIDENCE INTERVALS
CONFIDENCE INTERVAL FOR RATING SCALES
FOR EXAMPLE, ASSUME THAT
THE SUS SCORES FOR A PARTICULAR SOFTWARE
SYSTEM ARE LISTED BELOW:
SO, WE CAN BE 95% CONFIDENT THAT THE
POPULATION’S SUS SCORE FOR THIS SYSTEM IS BETWEEN 79.92 AND
89.37.
17 CONFIDENCE INTERVALS
CONFIDENCE INTERVAL FOR TASK TIMES
TASK TIME DATA TENDS TO BE POSITIVELY SKEWED BECAUSE...
(A) THERE’S A NATURAL LOWER BOUND FOR HOW LONG IT TAKES TO PERFORM A TASK.
(B) SOME USERS WILL TAKE AN EXCEPTIONALLY LONG TIME TO COMPLETE A TASK.
UNDER THESE CIRCUMSTANCES, IT IS MORE INFORMATIVE TO USE THE GEOMETRIC MEAN (I.E., THE EXPONENTIAL OF THE
ARITHMETIC MEAN OF THE LOGARITHM OF THE DATA) INSTEAD OF THE ARITHMETIC
MEAN.
95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN
(USING THE ARITHMETIC MEAN OF THE DATA)
95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN
(USING THE GEOMETRIC MEAN OF THE DATA)
95% CONFIDENCE INTERVAL FOR THE LOGARITHM OF THE POPULATION MEAN (USING THE ARITHMETIC MEAN OF
THE LOGARITHM OF THE DATA)
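THE LOG-TRANSFORM APPROACH DESCRIBED ABOVE CAN BE SKETCHED AS FOLLOWS. THE TASK TIMES IN THE EXAMPLE ARE HYPOTHETICAL (THE SLIDE’S DATA IS NOT AVAILABLE), THE FUNCTION NAME IS ILLUSTRATIVE, AND THE CRITICAL VALUE 2.262 IS THE STANDARD TWO-TAILED 95% t-VALUE FOR 9 DEGREES OF FREEDOM.

```python
import math


def geometric_mean_interval(times, t_crit):
    """Return (lower, geometric mean, upper) by building a t interval on the log scale."""
    logs = [math.log(t) for t in times]
    n = len(logs)
    mean_log = sum(logs) / n
    var_log = sum((v - mean_log) ** 2 for v in logs) / (n - 1)  # sample variance of the logs
    margin = t_crit * math.sqrt(var_log / n)
    # Exponentiate to return to the original time units.
    return (math.exp(mean_log - margin), math.exp(mean_log), math.exp(mean_log + margin))


# Hypothetical task times in seconds, with one slow participant skewing the data.
times = [40, 36, 53, 56, 110, 48, 34, 44, 30, 40]
low, center, high = geometric_mean_interval(times, t_crit=2.262)
print(f"geometric mean {center:.1f} s, interval {low:.1f} to {high:.1f} s")
```

BECAUSE THE INTERVAL IS BUILT ON THE LOG SCALE, IT IS ASYMMETRIC AROUND THE GEOMETRIC MEAN ONCE CONVERTED BACK TO SECONDS.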
18 BENCHMARKS
COMPARING TO BENCHMARKS
FREQUENTLY, THE GOAL WHEN TESTING A SOFTWARE INTERFACE IS NOT DETERMINING A RELIABLE CONFIDENCE INTERVAL, BUT TESTING AGAINST A PARTICULAR GOAL OR BENCHMARK.
FOR INSTANCE, YOU MIGHT WANT TO DETERMINE THAT A CERTAIN MINIMUM COMPLETION RATE WILL OCCUR, THAT A SPECIFIC MAXIMUM TASK TIME IS NOT EXCEEDED, OR THAT A PARTICULAR SATISFACTION SCORE IS ACHIEVED.
19 BENCHMARKS
TWO-TAILED & ONE-TAILED TESTS
WHEN BOTH SIDES OF A CONFIDENCE INTERVAL MATTER, A TWO-TAILED TEST IS PERFORMED, WHERE THE CONFIDENCE INTERVAL IS SYMMETRICAL AND THE PROBABILITIES OF VALUES BEING ABOVE THE UPPER LIMIT AND OF VALUES BEING BELOW THE LOWER LIMIT ARE EACH α/2 (I.E., HALF OF ONE MINUS THE CONFIDENCE LEVEL).
WHEN TESTING AGAINST A BENCHMARK, ONLY ONE SIDE OF THE OUTCOME MATTERS, SO A ONE-TAILED TEST IS USED, WHICH MEANS THAT THE α VALUE MUST BE DOUBLED WHEN LOOKING UP A TWO-TAILED CRITICAL VALUE IN ORDER TO ACHIEVE THE APPROPRIATE CONFIDENCE LEVEL (E.G., THE TWO-TAILED 90% CRITICAL VALUE SERVES AS THE ONE-TAILED 95% CRITICAL VALUE).
20 BENCHMARKS
BINOMIAL DISTRIBUTION
THE BINOMIAL DISTRIBUTION IS THE DISCRETE PROBABILITY DISTRIBUTION OF THE NUMBER OF SUCCESSES IN A SEQUENCE OF INDEPENDENT YES/NO EXPERIMENTS.
IF THE PROBABILITY OF A SUCCESS IS p, THEN THE PROBABILITY OF GETTING k SUCCESSES IN n ATTEMPTS IS:
$$\binom{n}{k} p^{k} (1-p)^{(n-k)}$$
21 BENCHMARKS
BENCHMARKED COMPLETION RATES
FOR REASONABLY SMALL (LESS THAN 30) SAMPLE SIZES, THE BINOMIAL DISTRIBUTION SHOULD BE USED TO DETERMINE WHETHER A BENCHMARK IS MET.
$$p(x) = \sum_{x=b}^{n} \left[ \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{(n-x)} \right]$$
WHERE:
$b$ is the minimum benchmark for how many will complete the task
$n$ is the sample size
$p$ is the desired population completion rate for the task
FOR EXAMPLE, THE EXCEL CALCULATION BELOW DEMONSTRATES THAT IF b=26 OUT OF n=29 USERS SUCCESSFULLY COMPLETE A CERTAIN TASK (A TEST COMPLETION RATE OF 90%), THEN THE PROBABILITY IS 86% THAT THE POPULATION COMPLETION RATE IS AT LEAST p=80%. (THE SUM EVALUATES TO ABOUT 0.14, THE CHANCE OF SEEING AT LEAST 26 SUCCESSES IF THE TRUE RATE WERE ONLY 80%, AND 1 − 0.14 = 0.86.)
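THE BINOMIAL SUM FOR THIS EXAMPLE CAN BE REPRODUCED WITH PYTHON’S STANDARD LIBRARY. THE FUNCTION NAME prob_at_least IS ILLUSTRATIVE, AND READING THE 86% AS ONE MINUS THE UPPER-TAIL SUM IS AN INTERPRETATION OF THE SLIDE’S EXCEL RESULT, WHICH IS NOT SHOWN.

```python
from math import comb


def prob_at_least(b, n, p):
    """Probability of b or more successes in n trials when the true success rate is p."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(b, n + 1))


# 26 of 29 users succeeded; the benchmark completion rate is 80%.
tail = prob_at_least(26, 29, 0.80)
confidence = 1 - tail
print(f"tail probability {tail:.3f}, confidence {confidence:.0%}")
```

A SMALL TAIL PROBABILITY MEANS THE OBSERVED RESULT WOULD BE UNUSUAL IF THE POPULATION RATE WERE ONLY AT THE BENCHMARK, WHICH IS WHAT SUPPORTS THE CLAIM THAT THE BENCHMARK IS EXCEEDED.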
22 BENCHMARKS
LARGE-SAMPLE BENCHMARKED COMPLETION RATES
FOR LARGER SAMPLE SIZES (AT LEAST 15 SUCCESSES AND AT LEAST 15 FAILURES), A NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION SHOULD BE USED TO DETERMINE WHETHER A BENCHMARK IS MET.
$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$$
WHERE:
$n$ is the number of users tested
$p$ is the desired population completion rate
$\hat{p}$ is the observed completion rate
FOR EXAMPLE, IF 139 OF 173 VISITORS TO A WEB SITE COMPLETED A SHIPPING ADDRESS FORM CORRECTLY, THEN THE EXCEL CALCULATION BELOW DEMONSTRATES THAT THERE IS A 95% CHANCE THAT AT LEAST 75% OF ALL USERS WILL BE ABLE TO DO SO.
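A SHORT SKETCH OF THE NORMAL APPROXIMATION FOR THIS EXAMPLE. THE FUNCTION NAMES ARE ILLUSTRATIVE, AND THE STANDARD NORMAL CDF IS COMPUTED FROM math.erf RATHER THAN LOOKED UP IN A TABLE OR SPREADSHEET.

```python
import math


def benchmark_z(successes, n, p_benchmark):
    """z = (p_hat - p) / sqrt(p * (1 - p) / n), using the benchmark rate p in the standard error."""
    p_hat = successes / n
    return (p_hat - p_benchmark) / math.sqrt(p_benchmark * (1 - p_benchmark) / n)


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


z = benchmark_z(139, 173, 0.75)
print(f"z = {z:.3f}, one-tailed confidence = {normal_cdf(z):.1%}")
```

THE ONE-TAILED CONFIDENCE IS THE AREA BELOW z, SINCE ONLY EXCEEDING THE BENCHMARK MATTERS HERE.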
23 BENCHMARKS
BENCHMARKED SATISFACTION SCORES
TO COMPARE AN INTERFACE’S SATISFACTION SCORE (E.G., FROM A SUS QUESTIONNAIRE) TO A BENCHMARK, THE t-DISTRIBUTION IS UTILIZED.
$$t = \frac{M - \mu}{s / \sqrt{n}}$$
FOR EXAMPLE, RECENT CPR TRAINING APPS HAVE AVERAGED SUS SCORES OF 70.7. A SAMPLE OF 14 USERS TESTED A BETA VERSION OF A NEW CPR TRAINING APPLICATION AND GAVE IT A MEAN SUS SCORE OF 73, WITH A STANDARD DEVIATION OF 11.9.
$$t = \frac{M - \mu}{s / \sqrt{n}} = \frac{73 - 70.7}{11.9 / \sqrt{14}} \approx 0.723$$
A ONE-TAILED t-TEST WITH 13 DEGREES OF FREEDOM AND A t-VALUE OF 0.723 INDICATES THAT WE CAN BE 76% CONFIDENT THAT THE NEW APP HAS AN AVERAGE GREATER THAN THE INDUSTRY AVERAGE OF 70.7.
24 BENCHMARKS
BENCHMARKED TASK TIMES
TO COMPENSATE FOR THE POSITIVE
SKEWNESS OF THE TIME DATA, THE T-TEST FOR TASK TIMES IS PERFORMED
WITH LOGARITHMS.
SO, FOR EXAMPLE, THERE IS A 56% PROBABILITY THAT THE POPULATION’S MEAN
TASK TIME WOULD BE LESS THAN TWO MINUTES.
25 COMPARISONS
USABILITY COMPARISON TESTS
ASSUME THAT TWO EARLY PROTOTYPES OF AN INTERFACE HAVE BEEN DEVELOPED, ONE USING LEFT NAVIGATION AND THE OTHER USING TOP NAVIGATION.
IF INDIVIDUALS IN ONE SAMPLE POPULATION EXPERIENCE NOTICEABLY FEWER NAVIGATION PROBLEMS THAN INDIVIDUALS IN THE OTHER SAMPLE POPULATION, THEN WE WOULD HAVE EVIDENCE THAT ONE APPROACH IS MORE EFFECTIVE THAN THE OTHER. HOWEVER, IT IS ALSO POSSIBLE THAT THE DIFFERENCE BETWEEN THE TWO SAMPLE POPULATIONS IS SIMPLY SAMPLING ERROR.
26 COMPARISONS
WITHIN-SUBJECTS TEST
HYPOTHESIS: THE CALENDAR BUTTON ON THE LEFT NAVIGATION INTERFACE IS FASTER TO ACCESS THAN IT IS ON THE TOP NAVIGATION INTERFACE.
DESIGN: WITHIN-SUBJECTS, WITH RANDOMIZED ORDER OF ASSIGNMENT OF INTERFACE TO SUBJECTS
BASED ON THE TABULATED DATA, THE TOP INTERFACE SEEMS TO BE FASTER (508 MS ON AVERAGE) THAN THE LEFT INTERFACE (584 MS), BUT GIVEN THE NOISE IN THE MEASUREMENTS (I.E., SOME OF THE LEFT INTERFACE TRIALS ARE ACTUALLY FASTER THAN SOME OF THE TOP INTERFACE TRIALS), HOW DO WE KNOW WHETHER THE TOP INTERFACE IS REALLY FASTER?
LEFT INTERFACE TOP INTERFACE
625 MS 647 MS
480 MS 503 MS
621 MS 559 MS
633 MS 586 MS
694 MS 458 MS
599 MS 380 MS
505 MS 477 MS
527 MS 409 MS
651 MS 589 MS
505 MS 472 MS
THIS IS THE FUNDAMENTAL QUESTION UNDERLYING STATISTICAL ANALYSIS:
ESTIMATING THE AMOUNT OF EVIDENCE IN SUPPORT OF A HYPOTHESIS, EVEN IN THE
PRESENCE OF NOISE.
27 COMPARISONS
WITHIN-SUBJECTS TEST ANALYSIS
THE P VALUE FOR THE TWO-TAILED t-TEST IS 0.025, WHICH MEANS THAT, IF THERE WERE NO REAL DIFFERENCE BETWEEN THE LEFT AND TOP INTERFACES, A DIFFERENCE AT LEAST AS LARGE AS THE ONE OBSERVED WOULD HAPPEN ONLY 2.5% OF THE TIME PURELY BY CHANCE, LEADING TO THE CONCLUSION THAT THE DIFFERENCE BETWEEN THE INTERFACES IS STATISTICALLY SIGNIFICANT.
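THE PAIRED (WITHIN-SUBJECTS) t STATISTIC BEHIND THIS RESULT CAN BE SKETCHED FROM THE TABULATED TIMES. THE FUNCTION NAME paired_t IS ILLUSTRATIVE; THE SKETCH COMPUTES ONLY THE t STATISTIC AND DEGREES OF FREEDOM, AND COMPARES THEM WITH THE STANDARD TWO-TAILED 5% CRITICAL VALUE FOR 9 DEGREES OF FREEDOM (2.262) INSTEAD OF COMPUTING AN EXACT p-VALUE.

```python
import math
from statistics import mean, stdev

# Access times in ms from the slide's table; row i holds the same subject in both lists.
LEFT = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]
TOP = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]


def paired_t(a, b):
    """t statistic and degrees of freedom for the per-subject differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


t_stat, df = paired_t(LEFT, TOP)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("significant at the 5% level" if abs(t_stat) > 2.262 else "not significant at the 5% level")
```

WORKING WITH PER-SUBJECT DIFFERENCES REMOVES THE VARIATION BETWEEN SUBJECTS, WHICH IS WHY THE WITHIN-SUBJECTS DESIGN IS MORE SENSITIVE THAN COMPARING TWO INDEPENDENT GROUPS.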
28 COMPARISONS
BETWEEN-SUBJECTS TEST
AN INDEPENDENT-MEASURES OR BETWEEN-SUBJECTS EXPERIMENT DESIGN ALLOWS RESEARCHERS TO EVALUATE THE MEAN DIFFERENCE BETWEEN TWO POPULATIONS USING DATA FROM TWO SEPARATE SAMPLES.
AS WITH ALL HYPOTHESIS TESTS, THE GENERAL PURPOSE OF THE INDEPENDENT-MEASURES t-TEST IS TO DETERMINE WHETHER THE SAMPLE MEAN DIFFERENCE OBTAINED IN A RESEARCH STUDY INDICATES A REAL MEAN DIFFERENCE BETWEEN THE TWO POPULATIONS OR WHETHER THE OBTAINED DIFFERENCE IS SIMPLY THE RESULT OF SAMPLING ERROR.
29 COMPARISONS
BETWEEN-SUBJECTS TEST ANALYSIS
IF THE SAME DATA HAD BEEN ACCUMULATED FOR A BETWEEN-SUBJECTS EXPERIMENT, THEN THE P VALUE FOR THE TWO-TAILED t-TEST WOULD BE 0.047, WHICH MEANS THAT, IF THERE WERE NO REAL DIFFERENCE BETWEEN THE LEFT INTERFACE AND TOP INTERFACE, A DIFFERENCE AT LEAST AS LARGE AS THE ONE OBSERVED WOULD HAPPEN ONLY 4.7% OF THE TIME PURELY BY CHANCE.
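FOR COMPARISON, HERE IS A SKETCH OF THE INDEPENDENT-MEASURES t STATISTIC ON THE SAME NUMBERS, TREATED AS TWO SEPARATE GROUPS OF TEN SUBJECTS. IT ASSUMES A POOLED-VARIANCE (EQUAL VARIANCE) TEST, WHICH THE SLIDES DO NOT STATE EXPLICITLY, AND THE FUNCTION NAME IS ILLUSTRATIVE. THE STANDARD TWO-TAILED 5% CRITICAL VALUE FOR 18 DEGREES OF FREEDOM IS 2.101.

```python
import math
from statistics import mean, variance

# Same times (ms) as the within-subjects table, now treated as two unrelated groups.
LEFT = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]
TOP = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_variance = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, df


t_stat, df = independent_t(LEFT, TOP)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

THE SAME 76 MS DIFFERENCE GIVES A SMALLER t HERE THAN IN THE PAIRED ANALYSIS, BECAUSE SUBJECT-TO-SUBJECT VARIATION NOW INFLATES THE STANDARD ERROR.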
30 COMPARISONS
WEB-SCALE USABILITY RESEARCH
THE WEB ENABLES EXPERIMENTS ON A LARGER SCALE, FOR LESS TIME AND MONEY, THAN EVER BEFORE.
WEB SITES WITH MILLIONS OF VISITORS (E.G., GOOGLE, AMAZON, FACEBOOK) ARE CAPABLE OF ANSWERING QUESTIONS ABOUT THE DESIGN, USABILITY, AND OVERALL VALUE OF NEW FEATURES SIMPLY BY DEPLOYING THEM AND WATCHING WHAT HAPPENS.
CONSIDER THESE TWO VERSIONS OF A WEB PAGE, FOR A SITE THAT SELLS CUSTOMIZED REPORTS ABOUT SEX OFFENDERS LIVING IN YOUR AREA.
THE GOAL OF THE PAGE IS TO GET VISITORS TO FILL OUT THE YELLOW FORM AND BUY THE REPORT.
BOTH VERSIONS CONTAIN THE SAME INFO; THEY JUST PRESENT IT IN DIFFERENT WAYS.
IN FACT, THE VERSION ON THE RIGHT IS A REVISED DESIGN, WHICH WAS INTENDED TO IMPROVE THE DESIGN BY USING TWO FAT COLUMNS, SO THAT MORE CONTENT COULD BE BROUGHT “ABOVE THE FOLD” AND THE USER WOULDN’T HAVE TO DO AS MUCH SCROLLING.
WHICH DESIGN IS MORE EFFECTIVE FOR THE END GOAL OF THE WEB SITE – CONVERTING VISITORS INTO SALES?
31 COMPARISONS
A/B TESTING
TO DETERMINE WHICH DESIGN WAS MORE EFFECTIVE, THE DESIGNERS CONDUCTED AN EXPERIMENT: HALF OF THE VISITORS TO THEIR WEB SITE WERE RANDOMLY ASSIGNED TO SEE ONE VERSION OF THE PAGE, AND THE OTHER HALF SAW THE OTHER VERSION.
THE USERS WERE THEN TRACKED TO SEE HOW MANY OF EACH ACTUALLY FILLED OUT THE FORM TO BUY THE REPORT.
IN THIS CASE, THE REVISED DESIGN ACTUALLY FAILED – 244 USERS BOUGHT THE REPORT FROM THE ORIGINAL VERSION, BUT ONLY 114 USERS BOUGHT THE REPORT FROM THE REVISED VERSION.
THE IMPORTANT POINT HERE IS NOT WHICH ASPECTS OF THE DESIGN CAUSED THE FAILURE (WHICH IS UNKNOWN, SINCE SEVERAL THINGS CHANGED IN THE REDESIGN); THE POINT IS THAT THE WEB SITE CONDUCTED A RANDOMIZED EXPERIMENT AND COLLECTED DATA THAT ACTUALLY TESTED THE REVISION.
THIS KIND OF EXPERIMENT IS OFTEN CALLED AN A/B TEST.
32 COMPARISONS
ANOTHER A/B TESTING EXAMPLE
IN THIS EXAMPLE, A SHOPPING CART FOR A WEB SITE, A NUMBER OF CHANGES HAVE BEEN MADE BETWEEN THE ORIGINAL VERSION (LEFT) AND THE REVISED VERSION (RIGHT).
TESTING THIS REDESIGN WITH AN A/B TEST PRODUCED A STARTLING DIFFERENCE IN REVENUE: USERS WHO SAW THE CART ON THE LEFT SPENT TEN TIMES AS MUCH AS USERS WHO SAW THE CART ON THE RIGHT!
THE DESIGNERS OF THIS SITE EXPLORED FURTHER AND DISCOVERED THAT THE PROBLEM WAS THE “COUPON CODE” BOX ON THE RIGHT, WHICH LED USERS TO WONDER WHETHER THEY WERE PAYING TOO MUCH IF THEY DIDN’T HAVE A COUPON, AND ABANDON THE CART.
WITHOUT THE COUPON CODE BOX, THE REVISED VERSION ACTUALLY EARNED MORE REVENUE THAN THE ORIGINAL VERSION.
33 COMPARISONS
MICROSOFT HELP A/B TESTING EXAMPLE
AT THE END OF EVERY PAGE IN MICROSOFT’S
ONLINE HELP IS A QUESTION ASKING
FOR FEEDBACK ABOUT THE HELP ARTICLE; IF THE USER PRESSES
ANY OF THE BUTTONS, IT
DISPLAYS A TEXTBOX ASKING FOR MORE
DETAILS.
34 COMPARISONS
REVISING MICROSOFT HELP
THE PROPOSED REVISION TO THIS INTERFACE AT LEFT WAS MOTIVATED BY TWO ARGUMENTS:
(1) IT GIVES MORE FINE-GRAINED QUANTITATIVE FEEDBACK THAN THE YES/NO QUESTION; AND
(2) IT IS MORE EFFICIENT FOR THE USER, BECAUSE IT TAKES ONLY ONE CLICK RATHER THAN THE MINIMUM TWO CLICKS OF THE ORIGINAL INTERFACE.
WHEN THESE TWO INTERFACES WERE A/B TESTED ON MICROSOFT’S SITE, HOWEVER, IT TURNED OUT THAT THE 5-STAR INTERFACE PRODUCED AN ORDER OF MAGNITUDE FEWER RATINGS – AND MOST OF THEM WERE EITHER 1 STAR OR 5 STARS, SO THEY WEREN’T EVEN FINE-GRAINED.
35 COMPARISONS
WEB-BASED A/B TESTING
IN THE CONTEXT OF USABILITY STUDIES, A/B TESTING IS SIMILAR TO CONTROLLED EXPERIMENTS.
• CHOOSE AN INDEPENDENT VARIABLE WITH TWO CONDITIONS. (MORE CONDITIONS ARE OKAY, E.G., A/B/C TESTING)
• CHOOSE DEPENDENT VARIABLE(S) TO MEASURE. (E.G., TIME, ERRORS, SUCCESS RATE, REVENUE)
• DURING A TESTING INTERVAL, RANDOMLY ASSIGN ARRIVING USERS TO ONE CONDITION OR THE OTHER. (THE WEB SITE ITSELF DOES THIS!)
• DO STATISTICAL TESTING.
A/B TESTING OCCURS WITH REAL USERS ON A DEPLOYED SYSTEM, SO BUGS CAN HAVE REAL CONSEQUENCES.
RATHER THAN STARTING WITH A 50/50 SPLIT BETWEEN TEST CONDITIONS, IT’S SAFER TO RAMP UP SLOWLY BY STARTING WITH 99.9/0.1, MOVING TO 99/1, ETC.
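ONE COMMON WAY TO IMPLEMENT THE RANDOM ASSIGNMENT AND THE SLOW RAMP-UP IS TO HASH EACH VISITOR’S IDENTIFIER INTO A BUCKET, SKETCHED BELOW. THIS IS AN ILLUSTRATION, NOT THE METHOD OF ANY SITE MENTIONED IN THE SLIDES: THE FUNCTION NAME, THE EXPERIMENT LABEL, THE USER ID FORMAT, AND THE RAMP STEPS BEYOND 0.1% AND 1% ARE ALL HYPOTHETICAL.

```python
import hashlib


def assign_condition(user_id, treatment_fraction, experiment="example-experiment"):
    """Deterministically map a visitor to 'A' (original) or 'B' (revised design)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1); a given visitor always lands in the same spot.
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "B" if bucket < treatment_fraction else "A"


# Hypothetical ramp-up: 0.1% of visitors see B, then 1%, then 10%, then 50%.
RAMP_STEPS = [0.001, 0.01, 0.10, 0.50]

for fraction in RAMP_STEPS:
    seen_b = sum(assign_condition(f"user-{i}", fraction) == "B" for i in range(10_000))
    print(f"target {fraction:.1%}: {seen_b} of 10000 simulated visitors saw B")
```

BECAUSE THE BUCKET DEPENDS ONLY ON THE EXPERIMENT LABEL AND THE USER ID, A VISITOR WHO SEES VERSION B AT ONE RAMP STEP KEEPS SEEING IT AS THE FRACTION GROWS.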
36 COMPARISONS
A/A TESTING
TO TEST THE INFRASTRUCTURE OF AN EXPERIMENT, A/A TESTING DIVIDES USERS INTO TWO GROUPS WITH THE SAME CONDITION FOR EACH GROUP (I.E., A/B TESTING WITH A SINGLE CONDITION FOR BOTH GROUPS).
A/A TESTS ILLUSTRATE HOW DATA FLUCTUATE, WITH EXPERIMENTAL RESULTS THAT MIGHT SEEM SUBSTANTIAL, BUT WHICH ARE NOT STATISTICALLY SIGNIFICANT (IF THE USERS ARE SPLIT CORRECTLY AND THERE ARE NO POTENTIALLY MISLEADING BIASES IN THE EXPERIMENT).
37 COMPARISONS
ISSUES WITH A/B TESTING
THE WEB-SCALE NATURE OF A/B TESTING LEADS TO SEVERAL POTENTIAL ISSUES THAT ARE NOT COMMONLY ENCOUNTERED IN SMALLER-SCALE LAB EXPERIMENTS.
ETHICS
A/B TESTING NEVER ASKS THE USER’S PERMISSION TO BE INVOLVED IN THE TEST AND DOESN’T OBTAIN INFORMED CONSENT.
MYSTERY
A/B TESTING LEADS TO CONCLUSIONS REGARDING BOTTOM-LINE INDICATORS, BUT RARELY PROVIDES REAL EXPLANATIONS.
LONGEVITY
A/B TESTS RUN FOR DAYS OR WEEKS, BUT THE LONG-TERM EFFECTS OF A DESIGN MIGHT NOT BE SEEN UNTIL USERS GET MORE ACCUSTOMED TO IT.
REMOTE USABILITY TESTING, WHERE THE USER’S BEHAVIOR IS ACTUALLY MONITORED, IS STILL IN THE EARLY STAGES.
• REMOTE SYNCHRONOUS TESTING, USING WEBCAMS, HAS BEEN SHOWN TO BE JUST AS EFFECTIVE AS FACE-TO-FACE TESTING.
• REMOTE ASYNCHRONOUS TESTING, WHERE USERS REPORT CRITICAL USABILITY PROBLEMS THEMSELVES, TENDS TO SLOW THE USERS DOWN TREMENDOUSLY AND RESULT IN FEWER REPORTED ERRORS.
• AN ALTERNATIVE REMOTE ASYNCHRONOUS TESTING APPROACH, WITH INSTRUMENTATION INSTALLED ON THE WEB SITE TO TRACK EACH USER’S ACTIONS, SHOWS THE DETAILS OF THE INTERACTION, BUT REVEALS LITTLE ABOUT THE USER’S GOALS OR INTENTIONS.
38 SAMPLE SIZES
DETERMINING SAMPLE SIZE
WHEN CONDUCTING A USABILITY TEST, HOW LARGE SHOULD YOU MAKE THE SAMPLE SIZE?
ESSENTIALLY, IF YOU CAN ESTIMATE THE CRITICAL DIFFERENCE FROM THE TEST (I.E., d = THE SMALLEST DIFFERENCE BETWEEN THE OBTAINED AND TRUE VALUE THAT YOU NEED TO DETECT), THE SAMPLE’S STANDARD DEVIATION s (WHICH MIGHT BE ESTIMATED FROM PREVIOUS SIMILAR EXPERIMENTS), AND THE CRITICAL t-VALUE FOR THE DESIRED LEVEL OF STATISTICAL CONFIDENCE, THEN THE FORMULA FOR t:
$$t = \frac{d}{s / \sqrt{n}}$$
COULD BE SOLVED FOR n, THE NEEDED SAMPLE SIZE.
UNLIKE THE z-VALUE, HOWEVER, WHICH USES A NORMAL DISTRIBUTION, ESTIMATING THE t-VALUE COMPLICATES MATTERS BY ALSO BEING DEPENDENT ON THE DEGREES OF FREEDOM (FOR A ONE-SAMPLE t-TEST, df = n-1). TO OVERCOME THIS PROBLEM, AN ITERATIVE PROCEDURE IS SUGGESTED…
39 SAMPLE SIZES
DETERMINING SAMPLE SIZE: ITERATIVE PROCEDURE
1. USE THE z-SCORE WITH THE DESIRED LEVEL OF CONFIDENCE (FROM A UNIT NORMAL TABLE) AS AN INITIAL ESTIMATE OF THE t-VALUE.
2. SOLVE THE ABOVE EQUATION FOR n.
3. USE A t-DISTRIBUTION TABLE TO FIND THE t-SCORE FOR THAT VALUE OF n (WITH df = n-1).
4. RECALCULATE n BY USING THIS NEW t-VALUE IN THE EQUATION ABOVE.
5. REVISE THE t-SCORE FROM THE t-DISTRIBUTION TABLE.
6. CONTINUE THIS ITERATION UNTIL TWO CONSECUTIVE CYCLES YIELD THE SAME n VALUE.
40 SAMPLE SIZES
SAMPLE SIZE EXAMPLE
ASSUME THAT YOU HAVE BEEN USING A 100-POINT ITEM AS A POST-TASK MEASURE OF EASE-OF-USE IN PAST USABILITY TESTS. ONE OF THE TASKS THAT YOU ROUTINELY CONDUCT IS SOFTWARE INSTALLATION. FOR THE MOST RECENT USABILITY STUDY OF THE CURRENT VERSION OF THE SOFTWARE PACKAGE, THE VARIANCE OF THIS MEASUREMENT ON THE 100-POINT SCALE IS 25 (I.E., s = 5).
YOU’RE PLANNING YOUR FIRST USABILITY STUDY WITH A NEW VERSION OF THE SOFTWARE, AND YOU WANT TO GET AN ESTIMATE OF THIS MEASURE WITH 90% CONFIDENCE AND TO BE WITHIN 2.5 POINTS OF THE TRUE VALUE.
LET’S CALCULATE HOW MANY PARTICIPANTS YOU NEED TO RUN IN THE STUDY.
SOLVING THE BASIC t FORMULA FOR n YIELDS:
$$n = \frac{t^2 s^2}{d^2}$$
THE QUESTION INDICATES THAT s = 5 AND d = 2.5, SO AN APPROPRIATE t-VALUE NEEDS TO BE DETERMINED.
41 SAMPLE SIZES
SAMPLE SIZE EXAMPLE (CONTINUED)
FOR TWO-SIDED TESTING WITH A 90% CONFIDENCE INTERVAL (I.E., 5% IN EACH TAIL), A UNIT NORMAL TABLE INDICATES THAT A z-VALUE OF 1.645 WOULD MAKE A GOOD FIRST ESTIMATE FOR THE t-VALUE. USING THE ABOVE FORMULA, THIS YIELDS AN n-VALUE OF 10.8241, WHICH ROUNDS UP TO 11.
SWITCHING TO A t-DISTRIBUTION TABLE, n = 11 (I.E., df = 10) GIVES US A t-VALUE OF 1.812 FOR A 2-TAILED 90% CONFIDENCE INTERVAL, WHICH PRODUCES AN n-VALUE OF 13.133376 IN THE FORMULA, ROUNDING UP TO 14.
USING n = 14 (df = 13) YIELDS A t-VALUE OF 1.771, YIELDING AN n-VALUE OF 12.545764, ROUNDING UP TO 13.
USING n = 13 (df = 12) YIELDS A t-VALUE OF 1.782, YIELDING AN n-VALUE OF 12.702096, AGAIN ROUNDING UP TO 13.
THEREFORE, THE FINAL SAMPLE SIZE ESTIMATE FOR THIS STUDY IS 13 PARTICIPANTS.
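THE ITERATION ABOVE CAN BE SKETCHED AS A SHORT LOOP. TO STAY SELF-CONTAINED, THE SKETCH USES A LOOKUP TABLE HOLDING ONLY THE THREE t-VALUES QUOTED IN THIS EXAMPLE; A REAL CALCULATION WOULD NEED A FULL t-TABLE OR A STATISTICS LIBRARY. THE FUNCTION AND TABLE NAMES ARE ILLUSTRATIVE.

```python
import math

Z_90 = 1.645  # two-tailed 90% critical value from the unit normal table
# Two-tailed 90% critical t-values, only for the degrees of freedom quoted in the example.
T_CRIT_90 = {10: 1.812, 12: 1.782, 13: 1.771}


def required_n(critical_value, s, d):
    """n = t^2 * s^2 / d^2, before rounding up."""
    return critical_value ** 2 * s ** 2 / d ** 2


def iterate_sample_size(s, d, z=Z_90, t_table=T_CRIT_90):
    """Start from the z-value, then re-estimate n with t-values until n stops changing."""
    n = math.ceil(required_n(z, s, d))
    while True:
        t = t_table[n - 1]  # df = n - 1; raises KeyError if the small table lacks that df
        n_next = math.ceil(required_n(t, s, d))
        if n_next == n:
            return n
        n = n_next


print(iterate_sample_size(s=5, d=2.5))
```

THE LOOP STOPS AS SOON AS TWO CONSECUTIVE CYCLES AGREE, WHICH FOR THESE INPUTS HAPPENS AFTER THE SAME SEQUENCE OF ESTIMATES (11, 14, 13, 13) WORKED THROUGH ABOVE.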
42 SAMPLE SIZES
WEAK ARGUMENTS FOR LARGE SAMPLES
“IF THE POPULATION IS LARGE, THEN THE SAMPLE NEEDS TO BE LARGE.”
• THE VARIANCE IN STATISTICAL SAMPLING IS DETERMINED BY THE SAMPLE SIZE, NOT THE SIZE OF THE OVERALL POPULATION. THE EVALUATION OF A DESIGN ELEMENT’S QUALITY IS INDEPENDENT OF HOW MANY PEOPLE ARE GOING TO USE IT.
“THE MORE FEATURES IN THE INTERFACE, THE LARGER THE SAMPLE SIZE.”
• WHEN THE INTERFACE IS LOADED WITH FEATURES, MORE TESTS ARE NEEDED, NOT MORE USERS IN EACH TEST. TEST SUBJECTS WILL BE OVERWHELMED IF ASKED TO EVALUATE TOO MANY FEATURES.
“THE INTERFACE IS BEING DESIGNED TO ACCOMMODATE MANY TARGET AUDIENCES.”
• THIS ONLY REQUIRES LARGER SAMPLE SIZES IF THE DIFFERENT TARGET AUDIENCES WILL USE THE INTERFACE IN VERY DIFFERENT WAYS (E.G., BUYERS VS. SELLERS, TEACHERS VS. STUDENTS, DOCTORS VS. PATIENTS).
43 USABILITY QUESTIONNAIRES
USABILITY QUESTIONNAIRES
USING STANDARDIZED QUESTIONNAIRES FOR USABILITY STUDIES OFFERS SEVERAL ADVANTAGES.
OBJECTIVITY
USABILITY PRACTITIONERS ARE ABLE TO INDEPENDENTLY VERIFY THE MEASUREMENT STATEMENTS OF OTHERS.
REPLICABILITY
STUDIES CAN EASILY BE REPLICATED, IMPROVING THEIR RELIABILITY.
QUANTIFICATION
RESULTS CAN BE REPORTED IN FINER DETAIL AND WITH MORE OBJECTIVITY.
ECONOMY
DEVELOPING STANDARDIZED MEASURES TAKES WORK, BUT REUSING THEM IS INEXPENSIVE.
COMMUNICATION
STANDARDIZED MEASURES FACILITATE COMMUNICATION BETWEEN PRACTITIONERS.
44 USABILITY QUESTIONNAIRES
POST-STUDY USABILITY QUESTIONNAIRES
THE PSSUQ IS A 16-ITEM SURVEY THAT MEASURES
USERS’ PERCEIVED
SATISFACTION WITH A PRODUCT
OR SYSTEM.
The Post-Study System Usability Questionnaire (Version 3)
Strongly Agree
Strongly Disagree
1 2 3 4 5 6 7 NA
1. Overall, I am satisfied with how easy it is to use this system.
2. It was simple to use this system.
3. I was able to complete the tasks and scenarios quickly using this system.
4. I felt comfortable using this system.
5. It was easy to learn to use this system.
6. I believe I could become productive quickly using this system.
7. The system gave error messages that clearly told me how to fix problems.
8. Whenever I made a mistake using the system, I could recover easily and quickly.
9. The information (such as on-line help, on-screen messages, and other documentation) provided with this system was clear.
10. It was easy to find the information I needed.
11. The information was effective in helping me complete the tasks and scenarios.
12. The organization of information on the system screens was clear.
13. The interface of this system was pleasant.
14. I liked using the interface of this system.
15. This system has all the functions and capabilities I expect it to have.
16. Overall, I am satisfied with this system.
AN OVERALL SATISFACTION SCORE IS OBTAINED BY AVERAGING ALL 16 ITEMS, WITH SUB-SCALE SCORES OBTAINED BY AVERAGING THE ITEMS FOR SYSTEM QUALITY (ITEMS 1-6), INFORMATION QUALITY (ITEMS 7-12), AND INTERFACE QUALITY (ITEMS 13-16).
THE PSSUQ IS SUSCEPTIBLE TO “ACQUIESCENCE BIAS”, THE TENDENCY OF PEOPLE TO AGREE WITH A STATEMENT RATHER THAN DISAGREE WITH IT.
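A MINIMAL SCORING SKETCH FOR ONE PARTICIPANT’S PSSUQ RESPONSES. IT ASSUMES THAT THE OVERALL SCORE IS THE MEAN OF ALL ANSWERED ITEMS AND THAT “NA” RESPONSES (REPRESENTED HERE AS None) ARE SIMPLY LEFT OUT OF EACH AVERAGE; BOTH CHOICES, AND ALL NAMES IN THE CODE, ARE ASSUMPTIONS MADE FOR ILLUSTRATION.

```python
# Zero-based index ranges for the three sub-scales (items 1-6, 7-12, 13-16).
SUBSCALES = {
    "system_quality": range(0, 6),
    "information_quality": range(6, 12),
    "interface_quality": range(12, 16),
}


def mean_ignoring_na(values):
    """Average the answered items; None stands for an 'NA' response."""
    answered = [v for v in values if v is not None]
    if not answered:
        raise ValueError("no answered items to average")
    return sum(answered) / len(answered)


def score_pssuq(responses):
    """Return sub-scale and overall scores for 16 responses on the 1-7 scale (lower is better)."""
    if len(responses) != 16:
        raise ValueError("the PSSUQ (Version 3) has 16 items")
    scores = {name: mean_ignoring_na([responses[i] for i in items])
              for name, items in SUBSCALES.items()}
    scores["overall"] = mean_ignoring_na(responses)
    return scores


# Hypothetical participant who skipped item 7 (no error messages were encountered).
example = [2, 2, 3, 2, 1, 2, None, 3, 3, 4, 3, 2, 2, 2, 3, 2]
print(score_pssuq(example))
```

SINCE 1 MEANS “STRONGLY AGREE” ON THIS SCALE, LOWER SCORES INDICATE GREATER SATISFACTION.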
45 USABILITY QUESTIONNAIRES
INTERPRETING QUESTIONNAIRE RESULTS
PSYCHOMETRIC ANALYSIS OF USABILITY
QUESTIONNAIRES IS CONDUCTED TO
DETERMINE THEIR RELIABILITY, VALIDITY,
AND SENSITIVITY.
PSSUQ-3 Norms (Means and 99% Confidence Intervals)
Item | Lower Limit | Mean | Upper Limit
1. Overall, I am satisfied with how easy it is to use this system. | 2.60 | 2.85 | 3.09
2. It was simple to use this system. | 2.45 | 2.69 | 2.93
3. I was able to complete the tasks and scenarios quickly using this system. | 2.86 | 3.16 | 3.45
4. I felt comfortable using this system. | 2.40 | 2.66 | 2.91
5. It was easy to learn to use this system. | 2.07 | 2.27 | 2.48
6. I believe I could become productive quickly using this system. | 2.54 | 2.86 | 3.17
7. The system gave error messages that clearly told me how to fix problems. | 3.36 | 3.70 | 4.05
8. Whenever I made a mistake using the system, I could recover easily and quickly. | 2.93 | 3.21 | 3.49
9. The information (such as on-line help, on-screen messages, and other documentation) provided with this system was clear. | 2.65 | 2.96 | 3.27
10. It was easy to find the information I needed. | 2.79 | 3.09 | 3.38
11. The information was effective in helping me complete the tasks and scenarios. | 2.46 | 2.74 | 3.01
12. The organization of information on the system screens was clear. | 2.41 | 2.66 | 2.92
13. The interface of this system was pleasant. | 2.06 | 2.28 | 2.49
14. I liked using the interface of this system. | 2.18 | 2.42 | 2.66
15. This system has all the functions and capabilities I expect it to have. | 2.51 | 2.79 | 3.07
16. Overall, I am satisfied with this system. | 2.55 | 2.82 | 3.09
FOR EXAMPLE, THE PSSUQ-3 NORMS AT LEFT SHOW THAT ALL ITEMS HAVE MEANS THAT FALL BELOW THE SCALE MIDPOINT OF 4, INDICATING THAT THE SCALE MIDPOINT SHOULD NOT BE USED EXCLUSIVELY AS A REFERENCE FROM WHICH TO JUDGE PARTICIPANTS’ PERCEPTIONS OF USABILITY.
ALSO NOTE THE RELATIVELY POOR RATINGS ASSOCIATED WITH ITEM 7, WHICH REFLECT THE DIFFICULTY OF PROVIDING USABLE ERROR MESSAGES IN A SOFTWARE PRODUCT, AS WELL AS THE OVERALL DISSATISFACTION THAT SUCH ERRORS CAUSE IN USERS.
46 USABILITY QUESTIONNAIRES
POST-TASK USABILITY QUESTIONNAIRES
WHILE POST-STUDY SURVEYS PROVIDE INFORMATION
REGARDING THE GENERAL SATISFACTION OF USERS WITH AN INTERFACE, BRIEF MINI-SURVEYS OF USER REACTION TO
SPECIFIC TASKS IN SPECIFIC SCENARIOS ARE OFTEN MORE USEFUL WHEN ATTEMPTING TO DIAGNOSE MORE FOCUSED
PROBLEMS.
The After-Scenario Questionnaire (Version 1)
Strongly Agree
Strongly Disagree
1 2 3 4 5 6 7 NA
1. Overall, I am satisfied with the ease of completing the tasks in this scenario.
2. Overall, I am satisfied with the amount of time it took to complete the tasks in this scenario.
3. Overall, I am satisfied with the support information (online help, messages, documentation) when completing the tasks.
EXAMPLE SCENARIOS AND TASKS FOR OFFICE SOFTWARE SYSTEMS:
MAIL SCENARIO #1
• OPEN A NOTE
• SEND REPLY
• DELETE NOTE
MAIL SCENARIO #2
• OPEN A NOTE
• FORWARD W/REPLY
• SAVE RESPONSE
• DELETE ORIGINAL
ADDRESS SCENARIO
• CREATE NEW LISTING
• MODIFY OLD LISTING
• DELETE UNMODIFIED LISTING
FILE SCENARIO
• RENAME FILE
• COPY FILE
• DELETE FILE
EDITOR SCENARIO
• LOCATE DOCUMENT
• EDIT DOCUMENT
• OPEN NOTE
• COPY NOTE’S TEXT INTO DOCUMENT
• SAVE DOCUMENT
• PRINT DOCUMENT
47 USABILITY QUESTIONNAIRES
TRIANGULATION
ANY GIVEN RESEARCH METHOD HAS ADVANTAGES AND LIMITATIONS.
• LAB EXPERIMENTS ARE ABSTRACT AND OBTRUSIVE, AND MAY NOT BE REPRESENTATIVE OF THE REAL WORLD.
• FIELD STUDIES CANNOT BE CONTROLLED, SO IT’S HARD TO MAKE STRONG, PRECISE CLAIMS REGARDING COMPARATIVE USABILITY.
• SELF-REPORTING (VIA QUESTIONNAIRES) IS OFTEN BIASED BY REACTIVITY (E.G., THE SUBJECTS TRY TO BE POLITE OR TO SAY WHAT THEY THINK THEY SHOULD SAY, INSTEAD OF THE TRUTH).
ONE WAY TO DEAL WITH THIS PROBLEM IS VIA TRIANGULATION, USING MULTIPLE METHODS TO TACKLE THE SAME RESEARCH QUESTION.
IF THEY ALL SUPPORT YOUR CLAIM, THEN YOU HAVE MUCH STRONGER EVIDENCE, WITHOUT AS MANY BIASES.