Evaluating Visualizations

CS5764: Information Visualization
Chris North

Evaluating Visualizations
• Empirical:
  – Usability Test
     • Observation, problem identification
  – Controlled Experiment
     • Formal controlled scientific experiment
     • Comparisons, statistical analysis
• Analytic:
  – Expert Review
     • Examination by visualization expert
  – Heuristic Evaluation
     • Principles, Guidelines
     • Algorithmic
Usability Test vs. Controlled Experiment
• Usability test:
     •   Formative: helps guide design
     •   Single UI, early in design process
     •   Few users
     •   Usability problems, incidents
     •   Qualitative feedback from users
• Controlled experiment:
     •   Summative: measure final result
     •   Compare multiple UIs
     •   Many users, strict protocol
     •   Independent & dependent variables
     •   Quantitative results, statistical significance
Controlled Experiments
What is Science?
• Measurement
• Modeling
Scientific Method
1. Form hypothesis
2. Collect data
3. Analyze
4. Accept/reject hypothesis

• How to “prove” a hypothesis in science?
   • Easier to disprove things, by counterexample
   • Null hypothesis = opposite of the hypothesis
   • Disprove the null hypothesis
   • Hence, the hypothesis is “proved”
Empirical Experiment
• Typical question:
   • Which visualization is better in which situations?
   • Example: Spotfire vs. TableLens
Cause and Effect
• Goal: determine “cause and effect”
   • Cause = visualization tool (Spotfire vs. TableLens)
   • Effect = user performance time on task T
• Procedure:
   • Vary the cause
   • Measure the effect
• Problem: random variation
   [Diagram: real world → collected data, with random variation in between, leading to uncertain conclusions]
   • Cause = vis tool OR random variation?
Stats to the Rescue
• Goal:
   • Show that the measured effect is unlikely to result from random variation
• Hypothesis:
   • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
• Null hypothesis:
   • Visualization tool has no effect (e.g. Spotfire = TableLens)
   • Hence: cause = random variation
• Stats:
   • If the null hypothesis were true, the measured effect would occur with probability < 5%
   • But the measured effect did occur! (e.g. measured effect >> random variation)
• Hence:
   • Null hypothesis unlikely to be true
   • Hence, hypothesis likely to be true
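A minimal sketch of this logic in Python, with hypothetical timing numbers: simulate the null hypothesis by shuffling the tool labels, and count how rarely random variation alone reproduces the measured effect.

```python
import numpy as np

# Hypothetical performance times in seconds (8 users per tool).
spotfire = np.array([41.0, 38.0, 45.0, 52.0, 39.0, 47.0, 44.0, 50.0])
tablelens = np.array([33.0, 29.0, 36.0, 41.0, 30.0, 38.0, 35.0, 40.0])
observed = spotfire.mean() - tablelens.mean()

# Null hypothesis: the tool has no effect, so the labels are
# interchangeable. Shuffle the labels many times and count how often
# random variation alone produces a difference this large.
rng = np.random.default_rng(0)
pooled = np.concatenate([spotfire, tablelens])
trials = 10_000
extreme = 0
for _ in range(trials):
    rng.shuffle(pooled)
    if abs(pooled[:8].mean() - pooled[8:].mean()) >= abs(observed):
        extreme += 1
p = extreme / trials
print(f"observed difference = {observed:.1f} s, p = {p:.4f}")
# p < 0.05: the effect is unlikely under the null hypothesis,
# so we reject it and keep the hypothesis that the tool matters.
```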
                         Variables
• Independent Variables (what you vary), and
  treatments (the variable values):
     • Visualization tool
              » Spotfire, TableLens, Excel
     • Task type
              » Find, count, pattern, compare
     • Data size (# of items)
              » 100, 1000, 1000000
• Dependent Variables (what you measure)
     •   User performance time
     •   Accuracy, Errors
     •   Subjective satisfaction (survey)
     •   HCI metrics
Example: 2 x 3 design

                        Ind Var 2: Task Type
                        Task 1    Task 2    Task 3
Ind Var 1:   Spotfire     .         .         .
Vis. Tool    TableLens    .         .         .

• Cells hold the measured user performance times (dep var)
• n users per cell (enumerated in the sketch below)
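A small sketch of how the design enumerates into cells, assuming the treatment names above and a hypothetical n = 20 users per cell.

```python
from itertools import product

# The design from the table above, with a hypothetical n per cell.
vis_tools = ["Spotfire", "TableLens"]        # Ind Var 1: 2 treatments
task_types = ["Task1", "Task2", "Task3"]     # Ind Var 2: 3 treatments
n = 20                                       # users per cell (assumed)

# A 2 x 3 factorial design has one cell per treatment combination;
# each cell will hold n measured performance times (the dep var).
cells = {combo: [] for combo in product(vis_tools, task_types)}
print(len(cells), "cells x", n, "users =", len(cells) * n, "measurements")
```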
Groups
• “Between-subjects” variable
   • 1 group of users for each variable treatment
   • Group 1: 20 users, Spotfire
   • Group 2: 20 users, TableLens
   • Total: 40 users, 20 per cell
• “Within-subjects” (repeated) variable
   • All users perform all treatments
   • Counterbalance the order effect (see the sketch below)
   • Group 1: 20 users, Spotfire then TableLens
   • Group 2: 20 users, TableLens then Spotfire
   • Total: 40 users, 40 per cell
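A minimal counterbalancing sketch, with hypothetical participant IDs: alternating users between the two orders lets practice and fatigue effects cancel out across tools.

```python
# Hypothetical participant pool; alternate users between the two orders
# so that practice and fatigue effects cancel out across tools.
users = [f"user{i:02d}" for i in range(40)]
orders = (("Spotfire", "TableLens"), ("TableLens", "Spotfire"))
assignment = {user: orders[i % 2] for i, user in enumerate(users)}

# Every user performs both treatments, so each of the 40 users
# contributes a data point to both cells: 40 per cell.
group1 = [u for u, o in assignment.items() if o == orders[0]]
print(len(group1), "users run Spotfire first")
```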
Issues
• Eliminate or measure extraneous factors
• Randomization
• Fairness
   • Identical procedures, …
• Bias
• User privacy, data security
• IRB (Institutional Review Board)
                        Procedure
• For each user:
     • Sign legal forms
     • Pre-Survey: demographics
     • Instructions
              » Do not reveal true purpose of experiment
     • Training runs
     • Actual runs
              » Give task, measure performance
     • Post-Survey: subjective measures


• Repeat × n users
Data
• Measured dependent variables
• Spreadsheet layout, one row per user (reshaped in the sketch below):

   User    Spotfire                    TableLens
           task 1   task 2   task 3    task 1   task 2   task 3
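A sketch of getting from this wide spreadsheet layout to the long format most stats routines expect, using pandas; the numbers and column names are hypothetical.

```python
import pandas as pd

# A hypothetical wide-format sheet matching the layout above:
# one row per user, one column per (tool, task) cell.
df = pd.DataFrame({
    "user": ["u1", "u2"],
    "Spotfire_task1": [41, 38], "Spotfire_task2": [55, 52],
    "TableLens_task1": [30, 33], "TableLens_task2": [50, 49],
})

# Long format (one measurement per row) is what most stats routines want.
long = df.melt(id_vars="user", var_name="cell", value_name="time")
long[["tool", "task"]] = long["cell"].str.split("_", expand=True)
print(long[["user", "tool", "task", "time"]])
```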
             Step 1: Visualize it
•   Dig out interesting facts
•   Qualitative conclusions
•   Guide stats
•   Guide future experiments
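A minimal sketch of this first step, assuming hypothetical timing data: boxplots show every user's value rather than just two averages, which is what surfaces the interesting facts and guides the stats.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-user times (seconds) for one task, 20 users per tool.
rng = np.random.default_rng(1)
spotfire = rng.normal(45, 8, 20)
tablelens = rng.normal(36, 7, 20)

# Boxplots show every user's spread and outliers, not just two averages.
plt.boxplot([spotfire, tablelens])
plt.xticks([1, 2], ["Spotfire", "TableLens"])
plt.ylabel("Performance time (s)")
plt.show()
```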
Step 2: Stats

                        Ind Var 2: Task Type
                        Task 1    Task 2    Task 3
Ind Var 1:   Spotfire    37.2      54.5     103.7
Vis. Tool    TableLens   29.8      53.2     145.4

• Cells hold the average user performance times (dep var)
TableLens better than Spotfire?

[Bar chart: average performance time (secs), Spotfire vs. TableLens]

• Problem with averages: lossy
   • Compares only 2 numbers
   • What about the 40 data values? (Show me the data!)
The real picture

[Plot: all 40 individual performance times (secs), Spotfire vs. TableLens]

• Need stats that compare all the data
                       Statistics
• t-test
      • Compares 1 dep var on 2 treatments of 1 ind var
• ANOVA: Analysis of Variance
      • Compares 1 dep var on n treatments of m ind vars


• Result:
      • p = probability that difference between treatments is random
        (null hypothesis)
      • “statistical significance” level
      • typical cut-off: p < 0.05
      • Hypothesis confidence = 1 - p
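A sketch of both tests with SciPy, on simulated data; the numbers are hypothetical, not from the study above.

```python
import numpy as np
from scipy import stats

# Hypothetical times (seconds), 20 users per tool.
rng = np.random.default_rng(2)
spotfire = rng.normal(45, 8, 20)
tablelens = rng.normal(37, 8, 20)

# t-test: 1 dep var, 2 treatments of 1 ind var.
t, p = stats.ttest_ind(spotfire, tablelens)
print(f"t-test: t = {t:.2f}, p = {p:.4f}")   # p < 0.05 => significant

# One-way ANOVA handles n treatments (e.g. adding Excel as a third tool).
excel = rng.normal(55, 9, 20)
f, p = stats.f_oneway(spotfire, tablelens, excel)
print(f"ANOVA: F = {f:.2f}, p = {p:.4f}")
```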
Excel
[Screenshot: running the analysis in Excel]
p < 0.05
• Woohoo!
• Found a “statistically significant” difference
• Averages determine which is “better”
• Conclusion:
   • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
   • Vis tool has an effect on user performance for task T …
   • “95% confident that TableLens is better than Spotfire …”
   • NOT “TableLens beats Spotfire 95% of the time”
   • 5% chance of being wrong!
   • Be careful about generalizing
                         p > 0.05
• Hence, no difference?
     • Vis Tool has no effect on user performance for task T…?
     • Spotfire = TableLens ?


• NOT!
     •   Did not detect a difference, but could still be different
     •   Potential real effect did not overcome random variation
     •   Provides evidence for Spotfire = TableLens, but not proof
     •   Boring, basically found nothing
• How to do better?
   • More users (see the power-analysis sketch below)
   • Better tasks, data, …
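On the “more users” point, a power analysis estimates how many users a future experiment would need. A minimal sketch with statsmodels, assuming a medium effect size.

```python
from statsmodels.stats.power import TTestIndPower

# How many users per group are needed to detect a medium effect
# (Cohen's d = 0.5) at alpha = 0.05 with 80% power?
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"about {n:.0f} users per group")   # roughly 64
```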
Data Mountain
• Robertson, “Data Mountain” (Microsoft)

Comparison of Info Vis Systems
• Kobsa

Cleveland’s Rules for Secondary Tasks
• Chewar et al.
Shortcomings of this Method?
Usability Testing
Usability Test vs. Controlled Experiment
• Usability test:
     •   Formative: helps guide design
     •   Single UI, early in design process
     •   Few users
     •   Usability problems, incidents
     •   Qualitative feedback from users
• Controlled experiment:
     •   Summative: measure final result
     •   Compare multiple UIs
     •   Many users, strict protocol
     •   Independent & dependent variables
     •   Quantitative results, statistical significance
Usability Specification Table

Scenario task             Worst case   Planned target   Best case (expert)   Observed
Find the most expensive   1 min.       10 sec.          3 sec.               ??? sec.
house for sale
…
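A trivial sketch of checking an observed time against the specification levels in the table; the values are hypothetical.

```python
# Spec levels from the table above, in seconds (hypothetical values).
spec = {"worst_case": 60, "planned_target": 10, "best_case": 3}
observed = 14.0   # a hypothetical measured time

if observed <= spec["planned_target"]:
    print("meets the planned target")
elif observed <= spec["worst_case"]:
    print("usable, but keep iterating")
else:
    print("fails the usability specification")
```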
            Usability Test Setup
• Set of benchmark tasks
     • Easy to hard, specific to open-ended
     • Coverage of different UI features
     • E.g. “find the 5 most expensive houses for sale”
• Consent forms
• Not needed unless video-taping the user’s face (new rule)
• Experimenters:
     • Facilitator: instructs user
     • Observers: take notes, collect data, video tape screen
• Executor: runs the prototype if it is faked
• Users
     • 3-5 users, quality not quantity
           Usability Test Procedure
• Goal: mimic real life
      • Do not cheat by showing them how to use the UI!
• Initial instructions
      • “We are evaluating the system, not you.”
• Repeat:
      •   Give user a task
      •   Ask user to “think aloud”
      •   Observe, note mistakes and problems
      •   Avoid interfering, hint only if completely stuck
• Interview
      • Verbal feedback
      • Questionnaire
• ~1 hour / user
             Usability Lab
• E.g. McBryde 102
                          Data
• Note taking
     • E.g. “&%$#@ user keeps clicking on the wrong button…”
• Verbal protocol: think aloud
     • E.g. user expects that button to do something else…
• Rough quantitative measures
• HCI metrics: e.g. task completion time, …
• Interview feedback and surveys
• Video-tape screen & mouse
• Eye tracking, biometrics?
Analyze
• Initial reaction:
   • “stupid user!”, “that’s developer X’s fault!”, “this sucks”
• Mature reaction:
   • “how can we redesign the UI to solve that usability problem?”
   • the user is always right

• Identify usability problems
   • Learning issues: e.g. can’t figure out or didn’t notice a feature
   • Performance issues: e.g. arduous, tiring to solve tasks
   • Subjective issues: e.g. annoying, ugly
• Problem severity: critical vs. minor
Cost-Importance Analysis

Problem   Importance   Solutions   Cost   Ratio I/C
…

• Importance 1-5: (task effect, frequency)
   • 5 = critical, major impact on user, frequent occurrence
   • 3 = user can complete task, but with difficulty
   • 1 = minor problem, small speed bump, infrequent
• Ratio = importance / cost
   • Sort by this ratio (see the sketch below)
   • 3 categories: must fix, next version, ignored
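A minimal sketch of the sort, with hypothetical problems and costs.

```python
# Hypothetical problems as (description, importance 1-5, cost in days).
problems = [
    ("zoom feature never discovered",  5, 2),
    ("ugly default color scheme",      1, 1),
    ("labels overlap at 1000 items",   3, 4),
]

# Sort by importance/cost so high-impact, cheap fixes come first.
for name, importance, cost in sorted(
        problems, key=lambda p: p[1] / p[2], reverse=True):
    print(f"I/C = {importance / cost:4.2f}  {name}")
```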
Refine UI
• Simple solutions vs. major redesigns
• Solve problems in order of importance/cost
• Example:
   • Problem: user didn’t know he could zoom in to see more…
   • Potential solutions:
      – Better zoom button icon, tooltip
      – Add a zoom bar slider (like moosburg)
      – Icons for different zoom levels: boundaries, roads, buildings
      – NOT: more “help” documentation!!! You can do better.
• Iterate
   • Test, refine, test, refine, test, refine, …
   • Until? The UI meets the usability specification
Is there Science in Visualization?

[Diagram: Phenomenon, Engineering, and Science linked in a cycle through
Measurement and Modeling …analogy to biology]
Measurement in the Science of Vis.

[Diagram: chain of phenomena, each with its measures]
• System, algorithm: frame-rate, capacity, …
• Visual: realism, data/ink, …
• Perception, interaction: task time, accuracy, …
• Inference, insight: ?
• Goal, problem solving: market, ?

• Measurement methods: 2 kinds of holes
• Need new instruments and methods for measurement
Example: Insight-based Evaluation
• Problem: current measurements focus on low-level task performance and accuracy
   • Do benchmark tasks represent reality?
   • What about insight?

• Idea: treat tasks as a dependent variable
   • What do users learn from this visualization?
   • Realistic scenario, open-ended, think aloud
   • Insight coding
   • Information-rich results

• Short-term vs. long-term
       Insight-based Evaluation
• Saraiya, North, Duca
Project phase 5: Evaluation
• 1. Insight results of your data analysis
   • What insights did you find in the data?
   • How did your tool help/hinder?
• 2. Insight-based usability study
   • Run on a few users, only your tool
   • What insights did they find in the data?
   • How did your tool help/hinder?
• 3. Refinements to your tool
   • Refinements you made
   • Refinements you wish you could make
                 Homework #6
• Draw a graph
     • No auto-layout tools allowed


• Report:
     • 1. graph picture
     • 2. process you used

				