empirical user studies
Karrie Karahalios, Eric Gilbert
6 April 2007
some slides courtesy of Brian Bailey and John Hart
• Conduct user study to gain more precise measure of the
usability of an interface or system
• Complements low-fidelity techniques
• Requires a larger investment than low-fi prototyping
• Provide positive experience for users!
Empirical User Studies
• Measure performance, error rate, learnability and retention,
satisfaction, tolerable network delay…
• adapt to your particular interface and context
• Compare results to usability goals
• Identify usability issues and resolve them
Overview of Doing Empirical User Studies
• Develop materials
• Prepare for the study
• Conduct the study
• Analyze results and iterate
• Learn from the experience
Prepare for the Study
• Identify usability goals
• Develop experimental tasks and design
• Recruit users
• Instrument software/hardware
Identify Usability Goals
• Identify questions you want answered
• questions should be specific and measurable
• can a user perform each task in < 30s?
• after only five minutes of instruction, can a user perform
each task with < 2 errors?
• are users rating the interface at least a '3' for overall
satisfaction on a 5-point scale?
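As a sketch of what checking goals like these looks like against logged data (all numbers below are made-up example values, not results from a real study):

```python
# Hypothetical goal check; the data are invented for illustration.
task_times_s = [22.4, 28.1, 25.0, 31.2, 19.8]   # seconds to finish each task
errors_per_task = [0, 1, 3, 0, 1]               # errors after 5 min of instruction
satisfaction = [4, 3, 5, 3, 4]                  # ratings on a 5-point scale

goal_time_met = all(t < 30 for t in task_times_s)        # each task in < 30s?
goal_errors_met = all(e < 2 for e in errors_per_task)    # < 2 errors per task?
mean_satisfaction = sum(satisfaction) / len(satisfaction)
goal_satisfaction_met = mean_satisfaction >= 3           # mean rating >= 3?
```

A report would then state which goals were met and by how much each missed goal falls short.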
Develop Experimental Design
• Structure of experiment
• what will users do, in what order, where, etc.
• Between groups (randomly assigned to treatment groups)
• Control group
• Experimental group
• Within groups
• Each user performs under all conditions
• Order randomized
• Cheaper because it uses fewer participants
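The two designs can be sketched as assignment procedures. The menu conditions and participant IDs below are hypothetical:

```python
import random

conditions = ["static", "dynamic", "radial"]          # hypothetical conditions
participants = [f"P{i}" for i in range(1, 13)]        # twelve hypothetical users
rng = random.Random(42)                               # seeded for repeatability

# Between groups: each user is randomly assigned to exactly one condition.
between = {p: rng.choice(conditions) for p in participants}

# Within groups: each user sees all conditions, in a randomized order
# to counter learning and fatigue effects.
within = {}
for p in participants:
    order = conditions[:]
    rng.shuffle(order)
    within[p] = order
```

Note the within-groups dictionary holds a full ordering per user, which is why the design needs fewer participants for the same number of observations.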
• What gets changed and what is its effect?
• Independent variables
• the variables you manipulate
e.g. # of menu items, lighting conditions, mouse vs. keys
• Dependent variables
• measured part
e.g. speed of menu choice, reaction time to stimuli
• Variable type matters
• Typically want about 8 – 12 users
• depends on desired confidence in the results
• 12 is the magic number for the ANOVA test (more later)
Recruit Users
• Recruiting could be the most challenging aspect of the study
• expect about a 0.1% to 10% response rate
• may need IRB approval, especially if you want to publish
• Give users a compelling reason to participate
• It is important to target your user population.
• example: if you are developing for Firefox, make sure that
you use people already familiar with Firefox.
• Beyond that, it is also important to recruit a diversity of user
types:
• diverse users can tell you important things about your system,
and help surface problems a narrow sample would miss
Instrument Software/Hardware
• Log performance and errors (if possible)
• Determine media capture needs
• ensure that you have access to equipment
• manage physical layout of the testing space
• Anything else that you need?
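A minimal sketch of software instrumentation: a timestamped event logger that writes a CSV per session. The class, event names, and file path are hypothetical, not part of any particular toolkit:

```python
import csv
import time

class StudyLogger:
    """Records timestamped study events and saves them as CSV."""

    def __init__(self, path):
        self.path = path
        self.rows = []

    def log(self, participant, event, detail=""):
        # Record one event, e.g. task start/end or an error.
        self.rows.append((time.time(), participant, event, detail))

    def save(self):
        with open(self.path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "participant", "event", "detail"])
            writer.writerows(self.rows)

# Example session (hypothetical participant and events):
log = StudyLogger("session_P01.csv")
log.log("P01", "task_start", "task 1")
log.log("P01", "error", "clicked wrong menu item")
log.log("P01", "task_end", "task 1")
log.save()
```

Logging to a simple flat file like this makes later analysis (task times, error counts) straightforward.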
Conduct the Study
• Give user an overview of the study
• Introduce your system, allow for practice
• Have users work through the tasks
• Collect experimental measures (e.g., performance and error
rates)
• Fill out questionnaire, if any
• Debrief the user
• Entire session should last less than 60 minutes
Tell the User At Least:
• Purpose of the study, but not necessarily details of what you
are testing
• What they will be doing (the tasks)
• They are not being tested; the interface/system is
• They can quit at any time, and doing so will not affect their
relationship with you, the university, the company, etc.
• About the equipment in the room
• Whether their face and/or actions will be recorded
• How to think aloud (if you are collecting verbal data)
• If you will or will not be available to answer questions
Make Users Feel Comfortable
• Offer breaks at boundary points
• Offer to send results in aggregate form, or allow users to see
them afterward
• Develop understandable instructions
• Do not “defend” your interface
• Do not make subjective comments about users, ease or
difficulty of tasks, etc.
Analyze Results and Iterate
• Analyze data using statistical methods (ANOVAs and Chi-
Squared tests common)
• take a stats course, e.g., Stat 320, for more detail
• did you meet the goals? How far from the goals are you?
t-tests and ANOVAs
• t-tests compare two random samples and determine if the
samples are statistically significantly different
• e.g., are dynamic menus better than static menus?
• ANOVAs (analysis of variance) compare n random samples
and determine if the samples are statistically significantly
different
• e.g., which is best: dynamic, static, or radial menus?
• Both assume the samples come from normal distributions
and both produce p-values.
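A sketch of running both tests with SciPy's `stats` module; the menu-selection times below are invented for illustration:

```python
from scipy import stats

# Made-up selection times (seconds) for three menu conditions.
static_menu  = [2.9, 3.1, 3.4, 2.8, 3.3, 3.0, 3.2, 2.7]
dynamic_menu = [2.4, 2.6, 2.2, 2.8, 2.5, 2.3, 2.7, 2.4]
radial_menu  = [2.6, 2.9, 2.5, 3.0, 2.8, 2.7, 2.6, 2.9]

# t-test: are dynamic menus faster than static menus? (two samples)
t, p_t = stats.ttest_ind(static_menu, dynamic_menu)

# One-way ANOVA: do the three menu types differ at all? (n samples)
f, p_f = stats.f_oneway(static_menu, dynamic_menu, radial_menu)
```

Each call returns a test statistic and a p-value; the p-value is what gets compared against the significance threshold discussed next.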
The Normal Distribution
• Bell curve
• y = exp(-x²)
• Arises from sums of many independent random quantities
• e.g., sum of dice rolls
• Total time = t_find + t_home + t_click
• Total # of errors
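The dice example can be simulated directly: sums of several independent rolls cluster into a bell curve, which is why totals like task time often look roughly normal. A minimal sketch:

```python
import random
import statistics

# Sum 10 dice, many times; the sums approximate a normal distribution.
rng = random.Random(0)
sums = [sum(rng.randint(1, 6) for _ in range(10)) for _ in range(10000)]

mean = statistics.mean(sums)     # expected near 10 * 3.5 = 35
stdev = statistics.stdev(sums)   # expected near sqrt(10 * 35/12) ≈ 5.4
```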
p-values
• p = probability value
• The probability that the difference you observe in an
experiment is due to random chance
• An expression of the confidence of your result
• Typically, a difference is called statistically significant when
p < 0.05.
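The p < 0.05 convention can be seen by simulation: when two conditions are actually identical, about 5% of experiments still cross the threshold by chance. A stdlib-only sketch using the pooled two-sample t statistic (2.074 is the two-tailed 5% critical value for df = 22, i.e. two groups of 12):

```python
import math
import random
import statistics

rng = random.Random(1)
reps, hits = 2000, 0
for _ in range(reps):
    # Two samples drawn from the SAME distribution: no real difference.
    a = [rng.gauss(0, 1) for _ in range(12)]
    b = [rng.gauss(0, 1) for _ in range(12)]
    # Pooled two-sample t statistic for equal group sizes.
    sp2 = (statistics.variance(a) + statistics.variance(b)) / 2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (2 / 12))
    if abs(t) > 2.074:          # |t| beyond the 5% critical value
        hits += 1

false_positive_rate = hits / reps   # lands near 0.05
```

This is exactly why a single p < 0.05 result is treated as evidence, not proof.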
• Some ANOVAs produce partial eta-squared values in
addition to p-values.
• They are becoming widespread in HCI literature.
• You may see them soon in a usability report.
• Partial eta-squared values offer a practical measure of effect
size
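For a one-way design, partial eta-squared is the ratio SS_effect / (SS_effect + SS_error). A sketch with illustrative (invented) data:

```python
# Three hypothetical groups of menu-selection times (seconds).
groups = [
    [2.9, 3.1, 3.4, 2.8],   # e.g. static
    [2.4, 2.6, 2.2, 2.8],   # dynamic
    [2.6, 2.9, 2.5, 3.0],   # radial
]
grand = sum(sum(g) for g in groups) / sum(len(g) for g in groups)

# Between-groups sum of squares (the effect).
ss_effect = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
# Within-groups sum of squares (the error).
ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

partial_eta_sq = ss_effect / (ss_effect + ss_error)   # between 0 and 1
```

Unlike a p-value, this number says how much of the variance the manipulation accounts for, which is why it reads as a practical (effect-size) measure.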
Advantages of Empirical User Studies
• Measure performance (time, error rate)
• Measure user satisfaction
• Give realistic experience of the interface
• realistic system response
• move among tasks seamlessly
• designers not in control, the user is
• Focus will be on the details
• most big issues should already be resolved
Disadvantages of Empirical User Studies
• Users typically must come to the lab
• makes it more difficult to recruit them
• users may have anxiety
• Large setup effort involved
• software instrumentation, hardware setup, questionnaire
design, IRB approval, etc.
• Prototype may crash
An Example of How This Gets Used in Practice
• “The Impact of Delayed Visual Feedback on Collaborative
Performance” by Darren Gergle, presented at CHI 06.
• What is the relationship between delayed visual feedback and
collaboration? How much network delay can be tolerated?
• e.g., architectural planning, telesurgery, and remote repair
The Collaborative Puzzle Task
• The experimental task was for a helper to guide a worker
through a visual puzzle over a network connection
Independent Variable
• Only one: visual delay in the helper's view window
• Delay sampled from this distribution [60 - 3300ms]:
• T_n = T_(n-1) * e^0.05, with T_1 = 60
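The sampled delay levels can be reconstructed from the formula: start at 60 ms and multiply by e^0.05 until the 3300 ms ceiling. (How many of these levels the paper actually used is not stated here; this just enumerates the schedule.)

```python
import math

# Delay levels: T_1 = 60 ms, each subsequent level multiplied by e^0.05,
# capped at the 3300 ms upper bound given above.
delays = [60.0]
while delays[-1] * math.exp(0.05) <= 3300:
    delays.append(delays[-1] * math.exp(0.05))
```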
Dependent Variable
• Only one: task performance time
• Participants were asked to perform the puzzle task as quickly
and accurately as possible.
Quantitative Analysis Using ANOVA
• “For delays between 60ms and 939ms, we found no
evidence to indicate any impact of delayed visual feedback
on task performance (SE = 2.87, F(1,610) = .028, p = .87)."
• p > 0.05, so the samples are not significantly different
• "However, for delay rates between 939ms and 1798ms there
is a significant impact on task performance (F(1,610) = 13.57, p
< .001)."
• Since p < 0.001, this result is highly significant
Graph of Delay vs. Performance