Evaluation

Shared by: yunyi
Posted: 11/14/2011
Language: English
Pages: 28

Evaluation

• Personal evaluation

• Software validation

• Software evaluation










Personal evaluation

• What have I achieved?

• Have I achieved what I set out to achieve?

• Where have I fallen short?

• Why?

• What could I have done better?

• Assumes an a priori statement of what you hope/expect/intend to achieve




Self evaluation in your dissertation

Dissertation plan

• Introduction

• Background

• Success criteria

• Design

• Realisation

• Evaluation/Testing

• Conclusions & Further Work

• Ch 3 lays out success criteria by which success of project is to be judged

• Ch 6 will review work done in Ch 5 with respect to these criteria, including reflection on overall validity of the approach

• But this is not “software evaluation”






Program validation

• Systematically check all functions in your program/application

• Systematically check all sequences of inputs etc.

• Does your program/application do what you think it is supposed to do?

• This is important, but ...

• This is not “software evaluation”


Software evaluation

Note: We are using the term “software” in a very vague sense: it could include a program, a web application, any sort of implementation that does something

• Evaluate the appropriateness of the software with respect to its intended use

• Large range of aspects of software that can be evaluated


Evaluation evaluation

• In your dissertation you are asked to evaluate what you have achieved

• Your research could (should?) include an evaluation element

• So you will need to evaluate your evaluation

• Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively

• Your research could even be to compare evaluation schemes!




A case study

• Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices

• His dissertation discussed

– Factors in setting up a comparative evaluation

– A description of the actual evaluation

– A discussion of the results

• His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted




Software evaluation

• Functionality – does it do what it is supposed to do?

• Reliability – does it do the same thing under the same conditions?

• Usability – is it user-friendly?

• Efficiency – cost, speed, etc.

• Maintainability – can you modify it? Is it robust?

• Portability – can it be transferred from one environment/platform to another?


Software evaluation

• Evaluating commercial software is different from evaluating something you have constructed

– Even if you have constructed it from commercially available components

• Again, note the difference between validation and evaluation

– Especially concerning “functionality”

• Also, evaluation is not the same as a software review, as found eg in a magazine




Stakeholders

• Developers

– Researchers

– Commercial developers

• End-users

– Actual end-users (is this a single type?)

– Their managers (buyers)

• Vendors

• Investors


Evaluation types

• Feasibility / Suitability

– For any of the above stakeholders

• Internal evaluation

– For development

– Iterative testing, to evaluate progress

– Adequacy evaluation

– Diagnostic evaluation (debugging)

– Black box vs. glass box evaluation




Evaluation types

• Declarative evaluation

– How well does it perform?

– Comparison with a “gold standard” ideal performance

– Comparison with a baseline “wooden block”

• Usability evaluation

– How long does each step take?

– Is it “natural”, intuitive?

– Is it easy to learn to use?

– Is it well documented?




Evaluation types

• Operational evaluation

– ROI

– Compatibility with other software

– Consistency of interfaces

• Internal

• With respect to “standards” (eg Microsoft)

– Failsofts

– Role of humans

– Preparation, throughput, correction, output

– Backup

• Documentation

• Support

• Corporate situation of provider


Framework for evaluation

• Definition of the relevant quality characteristics – what is it you want to evaluate? Be specific

• Definition of attributes pertinent to this quality

• Definition of a measure able to provide values for these attributes

• Definition of a method whereby the measure can be made


Framework for evaluation

Important to be sure that

• The quality to be evaluated is genuinely a quality that is claimed of the software

• The attribute to be measured does reflect the quality in question

• The measure does genuinely measure that attribute (and not some other one)

• The method is sufficient to deliver a meaningful measure


Example: spell checker

• Function:

– (a) identify wrongly-spelled word

– (b) suggest an appropriate correction

– (among other features)

• Quality: ability to do (a)

• Attribute: success rate in performance of that task

• Measure: “Precision”: percentage of wrongly-spelled words correctly identified in a document

• Method: give it a text with some wrongly-spelled words and count how many it spots
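The four definitions above can be written down explicitly before any measuring is done. The following is only a minimal sketch: the class name `EvaluationPlan`, the function `detection_rate` and the word lists are hypothetical and are not taken from any real spell checker.

```python
from dataclasses import dataclass


@dataclass
class EvaluationPlan:
    """The four framework definitions, stated explicitly."""
    quality: str
    attribute: str
    measure: str
    method: str


plan = EvaluationPlan(
    quality="ability to identify wrongly-spelled words",
    attribute="success rate in performing that task",
    measure="percentage of wrongly-spelled words correctly identified",
    method="run the checker on a text with known misspellings and count",
)


def detection_rate(misspelled: set[str], flagged: set[str]) -> float:
    """Percentage of the known misspellings that the checker flagged."""
    if not misspelled:
        raise ValueError("the test text must contain at least one misspelling")
    spotted = misspelled & flagged
    return 100.0 * len(spotted) / len(misspelled)


# Hypothetical test data: the words we deliberately misspelled in the text,
# and the words an imagined checker flagged when run on that text.
known_misspellings = {"recieve", "seperate", "occured", "definately"}
flagged_by_checker = {"recieve", "occured", "definately", "colour"}

rate = detection_rate(known_misspellings, flagged_by_checker)
print(f"{plan.measure}: {rate:.0f}%")
```

Note that the imagined checker also flagged “colour”, which was not one of the planted misspellings, and that this has no effect on the score.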




Example: spell checker

• Good evaluation, but not A*

• Success means

– Identifying misspelled words (true positives)

– Ignoring correctly spelled words (true negatives)

• So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives

• The method is underspecified:

– How much text?

– What sort of text?

– Should we take into account what we know about spell checking (a certain class of error is very hard to detect)?

– Should we classify misspellings and measure different classes separately?
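The point about true negatives and false positives can be made concrete by counting all four outcomes for every word in the test text. This is a sketch under stated assumptions: the sentence, the planted misspellings and the checker's output are invented for illustration. In the usual terminology, a measure built only from true positives and false negatives is called recall; precision is the one that penalises false positives, and accuracy gives credit for true negatives.

```python
def confusion_counts(words, misspelled, flagged):
    """Classify every word of the test text into one of four outcomes."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for word in words:
        wrong = word in misspelled
        hit = word in flagged
        if wrong and hit:
            counts["tp"] += 1   # misspelled and flagged
        elif wrong:
            counts["fn"] += 1   # misspelled but missed
        elif hit:
            counts["fp"] += 1   # correct but flagged anyway
        else:
            counts["tn"] += 1   # correct and left alone
    return counts


def scores(c):
    """Three measures that weight the four outcomes differently.

    Assumes the text has at least one misspelling and the checker
    flagged at least one word, so no denominator is zero.
    """
    total = sum(c.values())
    return {
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        "accuracy": (c["tp"] + c["tn"]) / total,
    }


# Hypothetical ten-word test text with three planted misspellings.
words = ["we", "recieve", "the", "colour", "samples",
         "in", "seperate", "boxes", "as", "occured"]
misspelled = {"recieve", "seperate", "occured"}
flagged = {"recieve", "occured", "colour"}  # imagined checker output

counts = confusion_counts(words, misspelled, flagged)
results = scores(counts)
print(counts)
print(results)
```

Which of the three is the right measure depends on which quality is being claimed for the software, which is exactly the question the framework asks.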


Attributes

• Different types imply different measures/methods

• Example: dish-washers

Name   Racks   Options*   Water consumption   Noise level   Cleanliness
ABC    2       a,b        10                  noisy         ***
EFG    3       b          6                   quiet         *
PQR    2       a          5                   very noisy    **

* a = pre-wash rinse cycle; b = independent rinse cycle
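One way to see why attribute types matter is to encode the dish-washer table and notice that each column needs a different kind of comparison. This is a sketch only. The ordering of the noise labels and the reading of the stars as an ordinal score are assumptions, and the unit of water consumption is not given in the table.

```python
from dataclasses import dataclass

# Assumed ordering of the noise labels, from best to worst.
NOISE_ORDER = ["quiet", "noisy", "very noisy"]


@dataclass(frozen=True)
class DishWasher:
    name: str
    racks: int            # a count
    options: frozenset    # a set of features: present or absent
    water: float          # a measured quantity (unit not stated)
    noise: str            # an ordinal label
    stars: int            # an ordinal rating


MACHINES = [
    DishWasher("ABC", 2, frozenset("ab"), 10, "noisy", 3),
    DishWasher("EFG", 3, frozenset("b"), 6, "quiet", 1),
    DishWasher("PQR", 2, frozenset("a"), 5, "very noisy", 2),
]

# Measured quantity: numeric comparison is meaningful.
least_water = min(MACHINES, key=lambda m: m.water).name
# Ordinal label: only the rank order is meaningful, not differences.
quietest = min(MACHINES, key=lambda m: NOISE_ORDER.index(m.noise)).name
cleanest = max(MACHINES, key=lambda m: m.stars).name
# Feature set: the only sensible question is membership.
with_option_b = [m.name for m in MACHINES if "b" in m.options]

print(least_water, quietest, cleanest, with_option_b)
```

No single machine wins on every attribute, so any overall ranking needs an explicit decision about how the attributes are weighted.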










Methods and measures

• Objective measures

– Measuring, counting, timing

– Doing a specific task

– In case of usability issues, need to evaluate with a number of subjects (not just do it yourself)

– Comparison against a gold standard

• Precision: P = correct / total returned

• Recall: R = correct / total possible

• Other measures also considering false positives and negatives
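One widely used measure that combines precision and recall, and so is sensitive to both false positives and false negatives, is the F-measure. The slide does not name it, so treat this as an illustrative sketch; the counts below are invented.

```python
def precision(correct: int, returned: int) -> float:
    """Share of the system's answers that match the gold standard."""
    return correct / returned


def recall(correct: int, possible: int) -> float:
    """Share of the gold-standard answers that the system found."""
    return correct / possible


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision.
    """
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: the system returned 12 answers, 6 of them correct,
# and the gold standard contains 8 answers in total.
p = precision(correct=6, returned=12)
r = recall(correct=6, possible=8)
f1 = f_measure(p, r)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

A system can push recall to 1.0 by returning everything, or push precision up by returning very little; the harmonic mean is low unless both are reasonable.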


Methods and measures

• Subjective measures

– Interview after use

– Feedback questionnaire

• Rating scales (usually 5 or 7 points, + DK, N/A)

• Open-ended questions?

• Questions should relate to some specific point

• Repeat (some) questions in a disguised way

– Performance analysis

• Video the session, analyse afterwards
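Questionnaire ratings need some care when they are summarised. The sketch below assumes a 5-point scale, treats DK and N/A as non-answers rather than as numbers, and assumes the disguised repeat of a question is a reverse-worded version of it. All names and data are hypothetical.

```python
from collections import Counter
from statistics import median

NON_ANSWERS = {"DK", "N/A"}


def summarise_ratings(responses):
    """Summarise one question, keeping non-answers out of the statistics."""
    ratings = [r for r in responses if r not in NON_ANSWERS]
    return {
        "answered": len(ratings),
        "non_answers": len(responses) - len(ratings),
        # Ratings are ordinal, so the median is safer than the mean.
        "median": median(ratings) if ratings else None,
        "distribution": Counter(ratings),
    }


def is_inconsistent(rating, reversed_rating, points=5, tolerance=1):
    """Compare a rating with the same subject's answer to a reverse-worded
    repeat of the question (e.g. "easy to use" vs. "hard to use")."""
    mirrored = points + 1 - reversed_rating
    return abs(rating - mirrored) > tolerance


# Hypothetical answers from seven subjects to one question.
summary = summarise_ratings([4, 5, "DK", 3, 4, "N/A", 2])
print(summary)

# One subject rated "easy to use" 5 and "hard to use" 5: worth a closer look.
print(is_inconsistent(5, 5))
```

Subjects flagged as inconsistent are not necessarily wrong, but their answers may indicate that a question was misread.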




Methods and measures

• Don’t try to measure too many different things with the same instrument

• Though this can be possible to some extent

• But extraneous factors need to be controlled carefully

• Problem of statistical significance:

– Do you have enough subjects to know that the differences (and similarities) are not just random fluctuations?
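One simple way to ask whether a difference between two groups could be a random fluctuation is a permutation test. This is a technique chosen here for illustration, not one prescribed by the slides, and the timings below are invented. The idea is to shuffle the group labels many times and see how often a difference at least as large as the observed one appears by chance.

```python
import random
from statistics import mean


def permutation_test(a, b, trials=10_000, seed=0):
    """Estimate how often a random relabelling of the subjects gives a
    difference in means at least as large as the one observed."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / trials


# Hypothetical task completion times in seconds under two conditions.
condition_a = [310, 295, 340, 360, 325, 350]
condition_b = [280, 300, 270, 265, 290, 285]

p_value = permutation_test(condition_a, condition_b)
print(f"estimated p-value: {p_value:.3f}")
```

A small value suggests the difference is unlikely to be a fluctuation. With very few subjects even a large difference may not reach that point, which is the concern raised above.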




Example

• Simulated doctor-patient interviews with patients with limited English, using computer-based communication device with symbols and digitised speech

– two devices (laptop+mousepad, tablet+stylus)

– doctors and nurses

– literate and illiterate patients






Example

• General question: could they get to the end of the consultation? (How did we “measure” this?)

• Objective measures

– How long did it take?

– How many questions did they ask?

– How many answers were (apparently) correctly understood?

• Subjective measures

– Feedback questionnaire with satisfaction ratings

– Open-ended questions about specific issues
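The objective measures listed above lend themselves to a simple per-session record that can be summarised for each device. This is a sketch with invented figures: the field names are hypothetical, and it assumes one answer per question so that the understood rate is answers understood divided by questions asked.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    device: str               # e.g. "laptop" or "tablet"
    duration_s: int           # how long did it take?
    questions_asked: int      # how many questions did they ask?
    answers_understood: int   # how many answers were apparently understood?
    completed: bool           # did they reach the end of the consultation?


def summarise_by_device(sessions):
    """Aggregate the objective measures separately for each device."""
    summary = {}
    for device in sorted({s.device for s in sessions}):
        group = [s for s in sessions if s.device == device]
        asked = sum(s.questions_asked for s in group)
        understood = sum(s.answers_understood for s in group)
        summary[device] = {
            "completion_rate": sum(s.completed for s in group) / len(group),
            "mean_duration_s": mean(s.duration_s for s in group),
            "understood_rate": understood / asked,
        }
    return summary


# Hypothetical sessions, not data from the study described above.
sessions = [
    Session("laptop", 600, 20, 15, True),
    Session("laptop", 900, 30, 21, False),
    Session("tablet", 480, 18, 16, True),
    Session("tablet", 520, 22, 20, True),
]

summary = summarise_by_device(sessions)
for device, figures in summary.items():
    print(device, figures)
```

With only two sessions per device, as here, figures like these describe the sessions but say nothing reliable about the devices.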




Subjects

• Many types of evaluation require volunteers

– How many do you need?

– Where will you get them from?

– Are they suitable?

• Exclusion factors: eg prior familiarity with your topic

• Need to control for irrelevant differences in their profile

– How will you guarantee their cooperation?

– Ethical issues

• Officially, you need ethics clearance for any experiments involving living beings!

• In any case, important that volunteers know what they are letting themselves in for

• Also important that you don’t waste people’s time, eg evaluating a useless task (for example as a baseline)


Summary

• What are you trying to evaluate?

– Be specific; avoid general questions, eg “What do you think of this interface?”

• What is the best way to measure what you are interested in?

• How feasible is it to do what you want?



• [After Easter]: How to write it all up!


Next session

• No class next week

• First week after Easter (19 Apr)

– No class on Thursday

– Instead, practical sessions on Library Resources with Barry White

– choose one of three sessions

• each at 2pm-4pm

• Wed 18, Thur 19 or Fri 20 April

• in the Joule Library

• Do we need a sign-up sheet?



