              Evaluation
• Personal evaluation
• Software validation
• Software evaluation




          Personal evaluation
•   What have I achieved?
•   Have I achieved what I set out to achieve?
•   Where have I fallen short?
•   Why?
•   What could I have done better?
•   Assumes an a priori statement of what you
    hope/expect/intend to achieve

 Self evaluation in your dissertation
Dissertation plan:
• Introduction
• Background
• Success criteria
• Design
• Realisation
• Evaluation/Testing
• Conclusions & Further Work
Notes:
• Ch 3 lays out the success criteria by which the
  success of the project is to be judged
• Ch 6 will review the work done in Ch 5 with
  respect to these criteria, including reflection
  on the overall validity of the approach
• But this is not “software evaluation”


         Program validation
• Systematically check all functions in your
  program/application
• Systematically check all sequences of
  inputs etc.
• Does your program/application do what
  you think it is supposed to do?
• This is important, but ...
• This is not “software evaluation”
        Software evaluation
Note: We are using the term “software” in a very
  broad sense: it could be a program, a web
  application, or any sort of implementation that
  does something
• Evaluate the appropriateness of the
  software with respect to its intended use
• Large range of aspects of software that
  can be evaluated
        Evaluation evaluation
• In your dissertation you are asked to evaluate
  what you have achieved
• Your research could (should?) include an
  evaluation element
• So you will need to evaluate your evaluation
• Your evaluation might have negative results, but
  still be an informative experiment which you can
  evaluate positively
• Your research could even be to compare
  evaluation schemes!

                A case study
• Last year a student of mine did a project which
  was a comparative evaluation of a number of
  speech synthesis devices
• His dissertation covered
  – Factors in setting up a comparative evaluation
  – A description of the actual evaluation
  – A discussion of the results
• His personal evaluation then considered how
  well the experiment (i.e. the evaluation) had
  been conducted

          Software evaluation
• Functionality – does it do what it is supposed
  to do?
• Reliability – does it do the same thing under the
  same conditions?
• Usability – is it user-friendly?
• Efficiency – cost, speed, etc.
• Maintainability – can you modify it? Is it robust?
• Portability – can it be transferred from one
  environment/platform to another?
          Software evaluation
• Evaluating commercial software is different from
  evaluating something you have constructed
  – Even if you have constructed it from commercially
    available components
• Again, note the difference between validation
  and evaluation
  – Especially concerning “functionality”
• Also, evaluation is not the same as a software
  review, as found eg in a magazine

              Stakeholders
• Developers
  – Researchers
  – Commercial developers
• End-users
  – Actual end-users (is this a single type?)
  – Their managers (buyers)
• Vendors
• Investors
            Evaluation types
• Feasibility / Suitability
  – For any of the above stakeholders
• Internal evaluation
  – For development
  – Iterative testing, to evaluate progress
  – Adequacy evaluation
  – Diagnostic evaluation (debugging)
  – Black box vs. glass box evaluation

              Evaluation types
• Declarative evaluation
  – How well does it perform?
  – Comparison with a “gold standard” ideal performance
  – Comparison with a baseline “wooden block”
• Usability evaluation
  –   How long does each step take?
  –   Is it “natural”, intuitive?
  –   Is it easy to learn to use?
  –   Is it well documented?

                Evaluation types
• Operational evaluation
  – ROI (return on investment)
  – Compatibility with other software
  – Consistency of interfaces
       • Internal
       • With respect to “standards” (eg Microsoft)
  –   Failsofts
  –   Role of humans
  –   Preparation, throughput, correction, output
  –   Backup
       • Documentation
       • Support
       • Corporate situation of provider
     Framework for evaluation
• Definition of the relevant quality
  characteristics – what is it you want to
  evaluate? Be specific
• Definition of attributes pertinent to this
  quality
• Definition of a measure able to provide
  values for these attributes
• Definition of a method whereby the
  measure can be made
    Framework for evaluation
Important to be sure that
• The quality to be evaluated is genuinely a
  quality that is claimed of the software
• The attribute to be measured does reflect
  the quality in question
• The measure does genuinely measure
  that attribute (and not some other one)
• The method is sufficient to deliver a
  meaningful measure
       Example: spell checker
• Function:
  – (a) identify wrongly-spelled words
  – (b) suggest appropriate corrections
  – (among other features)
• Quality: ability to do (a)
• Attribute: success rate in performance of that
  task
• Measure: “Precision”: percentage of wrongly-
  spelled words correctly identified in a document
• Method: give it a text with some wrongly-spelled
  words and count how many it spots

        Example: spell checker
• Good evaluation, but not A*
• Success means
   – Identifying misspelled words (true positives)
   – Ignoring correctly spelled words (true negatives)
• So is the measure really appropriate? We are only
  counting true positives and false negatives: we are not
  giving credit for the true negatives, nor penalising false
  positives
• The method is underspecified:
   – How much text?
   – What sort of text?
   – Should we take into account what we know about spell checking
     (a certain class of error is very hard to detect)?
   – Should we classify misspellings and measure different classes
     separately?
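
To make this concrete, here is a minimal sketch in Python of a scoring
harness that records all four outcomes, not just the true positives.
The checker output and the gold standard are invented for illustration.

    # Sketch: scoring a spell checker against a gold standard.
    # `flagged` = words the checker marked as misspelled (invented output);
    # `gold` = words that really are misspelled, per a hand-checked key.

    def score(tokens, flagged, gold):
        """Count all four outcomes, not just the true positives."""
        tp = fp = tn = fn = 0
        for word in tokens:
            if word in gold:
                if word in flagged:
                    tp += 1  # misspelling correctly spotted
                else:
                    fn += 1  # misspelling missed
            else:
                if word in flagged:
                    fp += 1  # correct word wrongly flagged
                else:
                    tn += 1  # correct word rightly ignored
        return tp, fp, tn, fn

    # Toy run with two seeded misspellings.
    tokens = ["we", "recieve", "mail", "and", "seperate", "it"]
    gold = {"recieve", "seperate"}       # seeded errors
    flagged = {"recieve", "mail"}        # pretend checker output
    print(score(tokens, flagged, gold))  # (1, 1, 3, 1)

Counting all four outcomes makes it possible to penalise false
positives and credit true negatives, which the plain “percentage of
misspellings spotted” measure cannot do.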
                        Attributes
• Different types imply different
  measures/methods
• Example: dish-washers
   Name    Racks   Options*   Water consumption   Noise level   Cleanliness
   ABC       2      a, b             10            noisy            ***
   EFG       3      b                 6            quiet             *
   PQR       2      a                 5            very noisy       **

           * a = pre-wash rinse cycle; b = independent rinse cycle
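
The table also illustrates why attribute types matter in practice:
numeric attributes (water consumption) can be compared directly, while
ordinal ones (noise level) first need an agreed ranking. A small
sketch, with the data structure and ranking assumed for illustration:

    # Mixed attribute types from the dish-washer table: numeric values
    # compare directly; ordinal ones only via an agreed ranking.
    NOISE_RANK = {"quiet": 0, "noisy": 1, "very noisy": 2}

    machines = {
        "ABC": {"water": 10, "noise": "noisy"},
        "EFG": {"water": 6, "noise": "quiet"},
        "PQR": {"water": 5, "noise": "very noisy"},
    }

    # Numeric: lowest water consumption.  Ordinal: quietest machine.
    print(min(machines, key=lambda m: machines[m]["water"]))              # PQR
    print(min(machines, key=lambda m: NOISE_RANK[machines[m]["noise"]]))  # EFG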




      Methods and measures
• Objective measures
  – Measuring, counting, timing
  – Doing a specific task
  – In case of usability issues, need to evaluate
    with a number of subjects (not just do it
    yourself)
  – Comparison against a gold standard
      • Precision: P = correct / total
        (out of what the system returned)
      • Recall: R = correct / possible
        (out of what it should have found)
     • Other measures also considering false positives and
       negatives
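
In code these formulas are a one-liner each; F1, the harmonic mean of
the two, is added here as one common example of a measure that balances
false positives against false negatives (an assumption, not named in
the slides):

    # Precision and recall from the outcome counts of the earlier
    # spell-checker sketch, plus F1 as one combined measure.
    def precision(tp, fp):
        return tp / (tp + fp)  # correct / total returned

    def recall(tp, fn):
        return tp / (tp + fn)  # correct / total possible

    def f1(tp, fp, fn):
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r)

    print(precision(1, 1), recall(1, 1), f1(1, 1, 1))  # 0.5 0.5 0.5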
        Methods and measures
• Subjective measures
  – Interview after use
  – Feedback questionnaire
    •   Rating scales (usually 5 or 7 points, + DK, N/A)
    •   Open-ended questions?
    •   Questions should relate to some specific point
    •   Repeat (some) questions in a disguised way
  – Performance analysis
    • Video the session, analyse afterwards
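
As a sketch of how such questionnaire data might be summarised: mean
ratings per question on a 5-point scale, with DK/N-A responses reported
separately rather than silently dropped (the questions and responses
are invented):

    # Summarise 5-point rating-scale responses; "DK"/"NA" answers do not
    # count towards the mean but are reported so they are not lost.
    responses = {
        "easy to learn": [4, 5, 3, "DK", 4],
        "feels natural": [2, 3, "NA", 3, 2],
    }

    for question, answers in responses.items():
        ratings = [a for a in answers if isinstance(a, int)]
        skipped = len(answers) - len(ratings)
        mean = sum(ratings) / len(ratings)
        print(f"{question}: mean {mean:.2f} "
              f"from {len(ratings)} ratings ({skipped} DK/NA)")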

       Methods and measures
• Don’t try to measure too many different things
  with the same instrument
• Though this is possible to some extent,
  extraneous factors need to be controlled
  carefully
• Problem of statistical significance:
  – Do you have enough subjects to know that the
    differences (and similarities) are not just random
    fluctuations?
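
One simple way to answer that question is a permutation test: shuffle
the group labels many times and see how often a difference at least as
large as the observed one arises by chance. A minimal sketch with
invented scores:

    import random

    # Scores for two groups of subjects (invented numbers).
    group_a = [7, 6, 8, 7, 9, 6]
    group_b = [5, 6, 5, 7, 4, 6]

    def mean_diff(a, b):
        return sum(a) / len(a) - sum(b) / len(b)

    observed = mean_diff(group_a, group_b)
    pooled = group_a + group_b
    extreme = 0
    trials = 10_000
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean_diff(a, b)) >= abs(observed):
            extreme += 1

    # A small p suggests the difference is not just random fluctuation.
    print(f"p = {extreme / trials:.3f}")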

                 Example
• Simulated doctor-patient interviews with patients
  who have limited English, using a computer-based
  communication device with symbols and digitised
  speech
  – two devices (laptop+mousepad, tablet+stylus)
  – doctors and nurses
  – literate and illiterate patients


                    Example
• General question: could they get to the end of
  the consultation? (How did we “measure” this?)
• Objective measures
  – How long did it take?
  – How many questions did they ask?
  – How many answers were (apparently) correctly
    understood?
• Subjective measures
  – Feedback questionnaire with satisfaction ratings
  – Open-ended questions about specific issues

                        Subjects
• Many types of evaluation require volunteers
  – How many do you need?
  – Where will you get them from?
  – Are they suitable?
     • Exclusion factors: eg prior familiarity with your topic
     • Need to control for irrelevant differences in their profile
  – How will you guarantee their cooperation?
  – Ethical issues
     • Officially, you need ethics clearance for any experiments
       involving living beings!
     • In any case, important that volunteers know what they are
       letting themselves in for
     • Also important that you don’t waste people’s time, eg
       evaluating a useless task (for example as a baseline)
                 Summary
• What are you trying to evaluate?
  – Be specific, not general – eg not just “What do
    you think of this interface?”
• What is the best way to measure what you
  are interested in?
• How feasible is it to do what you want?

• [After Easter]: How to write it all up!
                  Next session
• No class next week
• First week after Easter (19 Apr)
  – No class on Thursday
  – Instead, practical sessions on Library
    Resources with Barry White
  – choose one of three sessions
     •   each at 2pm-4pm
     •   Wed 18, Thur 19 or Fri 20 April
     •   in the Joule Library
     •   Do we need a sign-up sheet?

				