									Information Visualization
in Data Mining
S.T. Balke
Department of Chemical Engineering
and Applied Chemistry
University of Toronto
Motivation

   Data visualization
    – relies primarily on human cognition for value
      discovery;
    – permits direct incorporation of human ingenuity
      and analytic capabilities into data mining;
    – can deal effectively with very large
      quantities of data;
    – powerfully combines with machine-based
      discovery techniques.
Uses

   Explorative Analysis
    – Data cleaning
    – Provide hypotheses
   Confirmative Analysis
    – Confirm or reject hypotheses
   Presentation
    – Communicate your work
http://www.alz.washington.edu/DATA2001/GERALD1/sld011.htm
Calculated Properties of
the Anscombe Data Sets

     mean of the x values = 9.0

     mean of the y values = 7.5

     equation of the least-squares
     regression line: y = 3 + 0.5x

     sum of squared deviations of x
     (about the mean) = 110.0
Calculated Properties of
the Anscombe Data Sets

    regression sum of squares
    (variation in y accounted for by x) = 27.5

    residual sum of squares (about
    the regression line) = 13.75

    correlation coefficient = 0.82

    coefficient of determination = 0.67
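
The properties listed above can be checked numerically. The sketch below assumes the standard published values of Anscombe's four data sets (the slides do not reproduce them) and uses plain Python; the names ANSCOMBE, summarize, and SUMMARIES are illustrative only.

```python
# Anscombe's quartet, assuming the standard published values.
# Sets I-III share the same x values; set IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics of a simple least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    regression_ss = slope * sxy          # variation in y explained by x
    residual_ss = syy - regression_ss    # variation left about the line
    r = sxy / (sxx * syy) ** 0.5
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "slope": slope, "intercept": intercept,
        "sxx": sxx, "regression_ss": regression_ss,
        "residual_ss": residual_ss, "r": r, "r_squared": r * r,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets should print nearly the same numbers, which is the point of the example: the statistics agree, and only a plot shows how different the data sets are.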
The Anscombe Data
Marey, 1885
Snow’s Cholera
Map, 1855
http://pupgg.princeton.edu/disk20/anonymous/groth/lick/licknorth.gif
       Graphical Excellence
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

       Graphical displays should:
        show the data
        induce the viewer to think about the substance, not the
          methodology
        avoid distorting what the data says
        present many numbers in a small space
        make large data sets coherent
        encourage the eye to compare different pieces of data
        reveal the data at several levels of detail (broad overview to
          fine structure)
        serve a reasonably clear purpose: description, exploration,
          tabulation, or decoration
        be closely integrated with the statistical and verbal
          descriptions of the data set.
         Graphical Excellence

            Gives the viewer the greatest number
             of ideas in the shortest time with the
             least ink in the smallest space.
            Nearly always multivariate.
            Requires telling the truth about the
             data.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
                                         Lie Factor=14.8


(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
       Lie Factor

       \[ \text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} \]

       \[ \text{Lie Factor} = \frac{(5.3 - 0.6) \times 100 / 0.6}{(27.5 - 18.0) \times 100 / 18.0} = \frac{783}{53} \approx 14.8 \]

                         Require: 0.95<Lie Factor<1.05
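
The Lie Factor calculation can be sketched in a few lines of Python. The four numbers come from the slide; reading 0.6 and 5.3 as the sizes drawn in the graphic and 18.0 and 27.5 as the data values is an inference from the quoted result of 14.8. The function names are illustrative only.

```python
def relative_change(start, end):
    """Fractional change from start to end, e.g. 0.5 for a 50% increase."""
    return (end - start) / start


def lie_factor(graphic_start, graphic_end, data_start, data_end):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = relative_change(graphic_start, graphic_end)
    actual = relative_change(data_start, data_end)
    return shown / actual


def is_acceptable(factor, low=0.95, high=1.05):
    """Apply the requirement 0.95 < Lie Factor < 1.05."""
    return low < factor < high


# Assumed reading of the slide's numbers: graphic sizes 0.6 -> 5.3,
# data values 18.0 -> 27.5.
EXAMPLE_FACTOR = lie_factor(0.6, 5.3, 18.0, 27.5)
print(round(EXAMPLE_FACTOR, 1), is_acceptable(EXAMPLE_FACTOR))
```

A graphic that drew the same data in proportion would give a factor of 1.0 and pass the check.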




(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
      Using Area for One
      Dimensional Data

                                                 Lie Factor=2.8




(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
      More guidelines:

         The number of information-carrying
          (variable) dimensions depicted should
          not exceed the number of dimensions
          in the data.
         No legends: use labels on graph
         Graphics must not quote data out of
          context.
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
       Data Ink Ratio

       \[ \text{Data-ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}} \]



      Data-ink ratio = proportion of a graphic’s ink devoted to the
                       non-redundant display of data-information.

      Data-ink ratio = 1.0 - (proportion of a graphic that can be erased
                       without loss of data-information)
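
The two forms of the definition are equivalent, as the short Python sketch below illustrates. The ink amounts are hypothetical (for instance, counts of inked pixels); they are not taken from the slides or from Tufte.

```python
def data_ink_ratio(data_ink, total_ink):
    """Proportion of a graphic's ink devoted to non-redundant data display."""
    return data_ink / total_ink


def data_ink_ratio_from_erasable(erasable_ink, total_ink):
    """Same ratio, computed as 1.0 minus the proportion that can be erased."""
    return 1.0 - erasable_ink / total_ink


# Hypothetical measurements: 1200 units of ink in total, of which 300
# (heavy grid lines, redundant frame) could be erased without losing data.
TOTAL_INK = 1200
ERASABLE_INK = 300
DATA_INK = TOTAL_INK - ERASABLE_INK

print(data_ink_ratio(DATA_INK, TOTAL_INK))
print(data_ink_ratio_from_erasable(ERASABLE_INK, TOTAL_INK))
```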



(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
       Maximize Data Density

       \[ \text{data density of a graphic} = \frac{\text{number of entries in the data matrix}}{\text{area of data graphic}} \]
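
As a small worked example of the data density formula, the Python sketch below uses hypothetical figures: a data matrix of 11 rows by 2 columns drawn in a graphic 4 units wide and 3 units high. The function name and the dimensions are assumptions for illustration.

```python
def data_density(n_rows, n_columns, width, height):
    """Entries in the data matrix per unit area of the graphic."""
    entries = n_rows * n_columns
    area = width * height
    return entries / area


# Hypothetical case: 11 observations of 2 variables in a 4 x 3 graphic.
EXAMPLE_DENSITY = data_density(11, 2, 4.0, 3.0)
print(round(EXAMPLE_DENSITY, 2))

# Shrinking the same graphic to 2 x 1.5 quadruples the density.
SHRUNK_DENSITY = data_density(11, 2, 2.0, 1.5)
print(round(SHRUNK_DENSITY, 2))
```

Halving both dimensions of the graphic quadruples the density, which is why small, well-resolved graphics score highly on this measure.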




(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
        Beware Chartjunk


   NO:

   “Isn’t it remarkable that the computer can be programmed
   to draw like that.”

   YES:

   “My, what interesting data!”


(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
How to Say Nothing with
Information Visualization
http://www.crs4.it/~zip/13ways.html


   Never include a color legend.
   Avoid annotation.
   Never mention error characteristics of the
    visualization method.
   When in doubt, smooth.
   Don’t say how long it required to plot.
   Never compare your results with other data
    visualization techniques.
   Never cite references for the data.
   Claim generality but show results from a single data
    set.
   Use viewing angle to hide blemishes in 3D objects.
An Overview of
Information Visualization
Methods
http://www.informatik.uni-halle.de/~keim/tutorials.html
Methods of Interest

   Scatterplot Matrices
   Parallel Coordinates
   Pixel Oriented Methods
   Icon based Methods
   Dimensional Stacking
   Treemap
Assignment 1: see
handout
Some websites of
interest:
   http://dmoz.org/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
   http://www.cs.man.ac.uk/~ngg/InfoViz/Projects_and_Products/Visualization/



Try a search at google.com using the
  following key words together:
name_of_method download software

								