Evaluation
1
Evaluation
• Personal evaluation
• Software validation
• Software evaluation
2
Personal evaluation
• What have I achieved?
• Have I achieved what I set out to achieve?
• Where have I fallen short?
• Why?
• What could I have done better?
• Assumes an a priori statement of what you
hope/expect/intend to achieve
3
Self evaluation in your dissertation
Dissertation plan
• Introduction
• Background
• Success criteria
• Design
• Realisation
• Evaluation/Testing
• Conclusions & Further Work

• Ch 3 lays out success criteria by which success
of project is to be judged
• Ch 6 will review work done in Ch 5 with respect
to these criteria, including reflection on overall
validity of the approach
• But this is not “software evaluation”
4
Program validation
• Systematically check all functions in your
program/application
• Systematically check all sequences of
inputs etc.
• Does your program/application do what
you think it is supposed to do?
• This is important, but ...
• This is not “software evaluation”
5
Software evaluation
Note: We are using the term “software” in a
very vague sense: it could include a
program, a web application, any sort of
implementation that does something
• Evaluate the appropriateness of the
software with respect to its intended use
• Large range of aspects of software that
can be evaluated
6
Evaluation evaluation
• In your dissertation you are asked to evaluate
what you have achieved
• Your research could (should?) include an
evaluation element
• So you will need to evaluate your evaluation
• Your evaluation might have negative results, but
still be an informative experiment which you can
evaluate positively
• Your research could even be to compare
evaluation schemes!
7
A case study
• Last year a student of mine did a project which
was a comparative evaluation of a number of
speech synthesis devices
• His dissertation discussed
– Factors in setting up a comparative evaluation
– A description of the actual evaluation
– A discussion of the results
• His personal evaluation then considered how
well the experiment (i.e. the evaluation) had
been conducted
8
Software evaluation
• Functionality – does it do what it is supposed to
do?
• Reliability – does it do the same thing under the
same conditions?
• Usability – is it user-friendly?
• Efficiency – cost, speed, etc.
• Maintainability – can you modify it? Is it robust?
• Portability – can it be transferred from one
environment/platform to another?
9
Software evaluation
• Evaluating commercial software is different from
evaluating something you have constructed
– Even if you have constructed it from commercially
available components
• Again, note the difference between validation
and evaluation
– Especially concerning “functionality”
• Also, evaluation is not the same as a software
review, as found eg in a magazine
10
Stakeholders
• Developers
– Researchers
– Commercial developers
• End-users
– Actual end-users (is this a single type?)
– Their managers (buyers)
• Vendors
• Investors
11
Evaluation types
• Feasibility / Suitability
– For any of the above stakeholders
• Internal evaluation
– For development
– Iterative testing, to evaluate progress
– Adequacy evaluation
– Diagnostic evaluation (debugging)
– Black box vs. glass box evaluation
12
Evaluation types
• Declarative evaluation
– How well does it perform?
– Comparison with a “gold standard” ideal performance
– Comparison with a baseline “wooden block”
• Usability evaluation
– How long does each step take?
– Is it “natural”, intuitive?
– Is it easy to learn to use?
– Is it well documented?
13
Evaluation types
• Operational evaluation
– ROI
– Compatibility with other software
– Consistency of interfaces
• Internal
• With respect to “standards” (eg Microsoft)
– Failsofts
– Role of humans
– Preparation, throughput, correction, output
– Backup
• Documentation
• Support
• Corporate situation of provider
14
Framework for evaluation
• Definition of the relevant quality
characteristics – what is it you want to
evaluate? Be specific
• Definition of attributes pertinent to this
quality
• Definition of a measure able to provide
values for these attributes
• Definition of a method whereby the
measure can be made
15
Framework for evaluation
Important to be sure that
• The quality to be evaluated is genuinely a
quality that is claimed of the software
• The attribute to be measured does reflect
the quality in question
• The measure does genuinely measure
that attribute (and not some other one)
• The method is sufficient to deliver a
meaningful measure
16
Example: spell checker
• Function:
– (a) identify wrongly-spelled word
– (b) suggest an appropriate correction
– (among other features)
• Quality: ability to do (a)
• Attribute: success rate in performance of that
task
• Measure: “Precision”: percentage of the wrongly-spelled
words in a document that are correctly identified (strictly,
this ratio is usually called recall)
• Method: give it a text with some wrongly-spelled
words and count how many it spots
17
Example: spell checker
• Good evaluation, but not A*
• Success means
– Identifying misspelled words (true positives)
– Ignoring correctly spelled words (true negatives)
• So is the measure really appropriate? We are only
counting true positives and false negatives: we are not
giving credit for the true negatives, nor penalising false
positives
• The method is underspecified:
– How much text?
– What sort of text?
– Should we take into account what we know about spell checking
(a certain class of error is very hard to detect)?
– Should we classify misspellings and measure different classes
separately?
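The points above can be sketched in code. This is a minimal illustration, not a prescribed method: the function name, the error classes ("non-word", "real-word") and the hand-labelled sample are all hypothetical. It counts all four outcomes for each word and reports the detection rate separately for each class of misspelling.

```python
from collections import Counter

def score_spell_checker(tokens):
    """Tally outcomes for a list of (is_misspelled, was_flagged, error_class).

    error_class is None for correctly spelled words. All names here are
    illustrative assumptions, not part of any real spell checker API.
    """
    counts = Counter()
    per_class = {}  # error_class -> (number found, number present)
    for is_misspelled, was_flagged, error_class in tokens:
        if is_misspelled and was_flagged:
            counts["tp"] += 1   # misspelling correctly identified
        elif is_misspelled:
            counts["fn"] += 1   # misspelling missed
        elif was_flagged:
            counts["fp"] += 1   # correct word wrongly flagged
        else:
            counts["tn"] += 1   # correct word correctly ignored
        if is_misspelled:
            found, total = per_class.get(error_class, (0, 0))
            per_class[error_class] = (found + int(was_flagged), total + 1)
    return counts, per_class

# Hypothetical hand-labelled sample of six words
sample = [
    (True, True, "non-word"),     # e.g. "teh", flagged
    (True, True, "non-word"),
    (True, False, "real-word"),   # e.g. "form" typed for "from", missed
    (False, True, None),          # correct word wrongly flagged
    (False, False, None),
    (False, False, None),
]

counts, per_class = score_spell_checker(sample)
detection_rate = counts["tp"] / (counts["tp"] + counts["fn"])
false_alarm_rate = counts["fp"] / (counts["fp"] + counts["tn"])
```

Reporting the false alarm rate next to the detection rate penalises a checker that simply flags everything, and the per-class tally shows whether the misses are concentrated in one hard class of error.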
18
Attributes
• Different types imply different
measures/methods
• Example: dish-washers
Name  Racks  Options*  Water consumption  Noise level  Cleanliness
ABC   2      a,b       10                 noisy        ***
EFG   3      b         6                  quiet        *
PQR   2      a         5                  very noisy   **
* a = pre-wash rinse cycle; b = independent rinse cycle
19
Methods and measures
• Objective measures
– Measuring, counting, timing
– Doing a specific task
– In case of usability issues, need to evaluate
with a number of subjects (not just do it
yourself)
– Comparison against a gold standard
• Precision: P = correct / total proposed
• Recall: R = correct / total possible
• Other measures also considering false positives and
negatives
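As a small worked sketch of comparison against a gold standard, the code below computes precision and recall over sets of items, plus the F-measure (the harmonic mean of the two) as one example of a combined score. The function names and the word lists are hypothetical; F-measure is a standard choice, not one named on the slide.

```python
def precision_recall(proposed, gold):
    """Compare the items a system proposed with a gold-standard set."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    # Precision: what fraction of the proposed items were correct?
    precision = correct / len(proposed) if proposed else 0.0
    # Recall: what fraction of the possible items were found?
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: four real misspellings, three words flagged by the system
gold = {"teh", "recieve", "adress", "wierd"}
proposed = {"teh", "recieve", "Birmingham"}

p, r = precision_recall(proposed, gold)
f = f_measure(p, r)
```

Here two of the three proposed items are correct and two of the four possible items are found, so precision is 2/3 and recall is 1/2.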
20
Methods and measures
• Subjective measures
– Interview after use
– Feedback questionnaire
• Rating scales (usually 5 or 7 points, + DK, N/A)
• Open-ended questions?
• Questions should relate to some specific point
• Repeat (some) questions in a disguised way
– Performance analysis
• Video the session, analyse afterwards
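A minimal sketch of how answers to one rating-scale question might be summarised, assuming a 5-point scale with "DK" and "N/A" recorded as strings. The function name and the responses are made up for illustration. Non-answers are counted separately rather than treated as ratings, and the median is used because rating scales are ordinal.

```python
from collections import Counter
from statistics import median

NON_ANSWERS = {"DK", "N/A"}  # don't know / not applicable

def summarise_item(responses):
    """Summarise the responses to a single rating-scale question."""
    ratings = [r for r in responses if r not in NON_ANSWERS]
    return {
        "n_rated": len(ratings),
        "n_non_answers": len(responses) - len(ratings),
        "median": median(ratings) if ratings else None,
        "distribution": dict(sorted(Counter(ratings).items())),
    }

# Hypothetical responses from eight subjects (1 = very poor, 5 = very good)
responses = [4, 5, "DK", 3, 4, "N/A", 2, 4]
summary = summarise_item(responses)
```

The same function could be run on a question and on its disguised repeat, to check whether subjects answer the two consistently.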
21
Methods and measures
• Don’t try to measure too many different things
with the same instrument
• Though this can be possible to some extent
• But extraneous factors need to be controlled
carefully
• Problem of statistical significance:
– Do you have enough subjects to know that the
differences (and similarities) are not just random
fluctuations?
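One way to approach the significance question with a small number of subjects is a permutation test, sketched below under stated assumptions: two independent groups, one measurement per subject, and the difference in means as the statistic. The function name and the timing data are hypothetical, and this is only one of several possible tests.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical task completion times in seconds for two interfaces
times_a = [310, 295, 342, 288, 305, 330]
times_b = [275, 301, 260, 290, 282, 270]

p_value = permutation_test(times_a, times_b)
```

A small p-value suggests the difference is unlikely to be a random fluctuation. A large one does not show that the interfaces are equivalent, only that these subjects do not provide enough evidence of a difference.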
22
Example
• Simulated doctor-patient interviews with
patients with limited English, using
computer-based communication device
with symbols and digitised speech
– two devices (laptop+mousepad, tablet+stylus)
– doctors and nurses
– literate and illiterate patients
23
24
Example
• General question: could they get to the end of
the consultation? (How did we “measure” this?)
• Objective measures
– How long did it take?
– How many questions did they ask?
– How many answers were (apparently) correctly
understood?
• Subjective measures
– Feedback questionnaire with satisfaction ratings
– Open-ended questions about specific issues
25
Subjects
• Many types of evaluation require volunteers
– How many do you need?
– Where will you get them from?
– Are they suitable?
• Exclusion factors: eg prior familiarity with your topic
• Need to control for irrelevant differences in their profile
– How will you guarantee their cooperation?
– Ethical issues
• Officially, you need ethics clearance for any experiments
involving living beings!
• In any case, important that volunteers know what they are
letting themselves in for
• Also important that you don’t waste people’s time, eg
evaluating a useless task (for example as a baseline)
26
Summary
• What are you trying to evaluate?
– Be specific, not general (a question such as “What
do you think of this interface?” is too general)
• What is the best way to measure what you
are interested in?
• How feasible is it to do what you want?
• [After Easter]: How to write it all up!
27
Next session
• No class next week
• First week after Easter (19 Apr)
– No class on Thursday
– Instead, practical sessions on Library
Resources with Barry White
– choose one of three sessions
• each at 2pm-4pm
• Wed 18, Thur 19 or Fri 20 April
• in the Joule Library
• Do we need a sign-up sheet?
28