IEEE TRANSACTIONS ON COMPUTERS, VOL. C-18, NO. 5, MAY 1969, p. 401

A Nonlinear Mapping for Data Structure Analysis

JOHN W. SAMMON, JR.

Abstract: An algorithm for the analysis of multivariate data is presented along with some experimental results. The algorithm is based upon a point mapping of N L-dimensional vectors from the L-space to a lower-dimensional space such that the inherent data "structure" is approximately preserved.

Index Terms: Clustering, dimensionality reduction, mappings, multidimensional scaling, multivariate data analysis, nonparametric, pattern recognition, statistics.

INTRODUCTION

The purpose of this paper is to describe the nonlinear mapping algorithm (NLM) which has been found to be highly effective in the analysis of multivariate data. The analysis problem is to detect and identify "structure" which may be present in a list of N L-dimensional vectors. Here the word structure refers to geometric relationships among subsets of the data vectors in the L-space. Some examples of structure are hyperspherical and hyperellipsoidal clusters, and linear and certain nonlinear relationships among the vectors of some subset.

The algorithm is based upon a point mapping of the N L-dimensional vectors from the L-space to a lower-dimensional space such that the inherent structure of the data is approximately preserved under the mapping. The approximate structure preservation is maintained by fitting N points in the lower-dimensional space such that their interpoint distances approximate the corresponding interpoint distances in the L-space. We shall be primarily interested in mappings to 2- and 3-dimensional spaces since the resultant data configuration can easily be evaluated by human observations in 3 or less dimensions.

THE NONLINEAR MAPPING

Suppose that we have N vectors in an L-space designated X_i, i = 1, ..., N, and corresponding to these we define N vectors in a d-space (d = 2 or 3) designated Y_i, i = 1, ..., N. Let the distance¹ between the vectors X_i and X_j in the L-space be defined by d_ij* = dist[X_i, X_j] and the distance between the corresponding vectors Y_i and Y_j in the d-space be defined by d_ij = dist[Y_i, Y_j].

Let us now randomly² choose an initial d-space configuration for the Y vectors and denote the configuration as follows:

$$
Y_1 = \begin{bmatrix} y_{11} \\ \vdots \\ y_{1d} \end{bmatrix}, \quad
Y_2 = \begin{bmatrix} y_{21} \\ \vdots \\ y_{2d} \end{bmatrix}, \quad \ldots, \quad
Y_N = \begin{bmatrix} y_{N1} \\ \vdots \\ y_{Nd} \end{bmatrix}.
$$

Next we compute all the d-space interpoint distances d_ij, which are then used to define an error E, which represents how well the present configuration of N points in the d-space fits the N points in the L-space, i.e.,

$$
E = \frac{1}{\sum_{i<j} d_{ij}^{*}} \sum_{i<j}^{N} \frac{\left[ d_{ij}^{*} - d_{ij} \right]^{2}}{d_{ij}^{*}} \tag{1}
$$

Note that the error is a function of the d × N variables y_pq, p = 1, ..., N, q = 1, ..., d. The next step in the NLM algorithm is to adjust the y_pq variables or equivalently change the d-space configuration so as to decrease the error. We use a steepest descent procedure to search for a minimum of the error (see Appendix I for further details).

SOME COMPUTER RESULTS

We have exercised the nonlinear mapping algorithm on several data sets in order to test and evaluate the utility of the program in detecting and identifying structure in data. Some of the results obtained for several different artificially generated data sets³ are reported for the case where d = 2. We have also run the algorithm on many real data sets and have achieved highly satisfactory results; however, for demonstration purposes it is useful to work with artificially generated data in order that we can compare our results with the known data structure. The test data sets were as follows.

1) Straight Line Data: These data consisted of nine points distributed along a line in a 9-dimensional space. The data points were spaced evenly along the line with an interpoint Euclidean distance of √9 units. The initial 2-space configuration was chosen randomly.

Manuscript received August 26, 1968; revised February 2, 1969.
The author was with Rome Air Development Center, Griffiss AFB, Rome, N. Y. He is now with Computer Symbolic, Inc., Rome, N. Y.

¹ Any distance measure could be used; however, if we have no a priori knowledge concerning the data, we would have no reason to prefer any metric over the Euclidean metric. Thus, this algorithm uses the Euclidean distance measure.

² For the purpose of this discussion it is convenient to think of the starting configuration as being selected randomly; however, in practice the initial configuration for the vectors is found by projecting the L-dimensional data orthogonally onto a d-space spanned by the d original coordinates with the largest variances.

³ One exception is data set 3 which is a classical data set. This data set was not artificially generated.
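The error E of equation (1) and the starting configuration described in footnote 2 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the author's program: the function names `sammon_error` and `max_variance_init` are hypothetical, and the line direction (1, 1, ..., 1) chosen for data set 1 is an assumption, since the paper gives only the spacing of √9 units.

```python
import math
from itertools import combinations
from statistics import pvariance

def sammon_error(X, Y):
    """Mapping error E of equation (1); X and Y are equal-length lists of points."""
    pairs = list(combinations(range(len(X)), 2))          # all (i, j) with i < j
    d_star = [math.dist(X[i], X[j]) for i, j in pairs]    # d_ij* in the L-space
    d = [math.dist(Y[i], Y[j]) for i, j in pairs]         # d_ij in the d-space
    # Assumes no two input points coincide, so every d_ij* is nonzero.
    total = sum((ds - dd) ** 2 / ds for ds, dd in zip(d_star, d))
    return total / sum(d_star)

def max_variance_init(X, d=2):
    """Footnote 2 start: keep the d original coordinates with the largest variance."""
    L = len(X[0])
    variances = [pvariance([x[q] for x in X]) for q in range(L)]
    keep = sorted(range(L), key=lambda q: variances[q], reverse=True)[:d]
    return [[x[q] for q in keep] for x in X]

# Data set 1 under the assumed direction (1, ..., 1): neighbors are sqrt(9) = 3 apart.
X = [[float(k)] * 9 for k in range(9)]
Y0 = max_variance_init(X, d=2)
print(sammon_error(X, Y0))        # error of the starting configuration

# Rescaling the projected line so that neighbors are again 3 apart removes the error.
Y1 = [[3 / math.sqrt(2) * y for y in row] for row in Y0]
print(sammon_error(X, Y1))        # essentially zero
```

The second call shows why a 1-dimensional data set can be mapped to the 2-space with essentially zero error: a line in the plane can reproduce every interpoint distance of the original line.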

2) Circular Data: The data consisted of nine points, eight of which were spaced evenly (45° apart) along a circle of radius 2.5 units in a 2-dimensional space. The ninth point was placed at the center of the circle. The initial 2-space configuration was chosen randomly.

3) Iris Data: This data set is fairly well-known since it was used by Fisher [4] in several statistical experiments. The data were originally obtained by making four measurements on Iris flowers. These measurements were then used to classify three different species of Iris flowers. Fifty sample vectors were obtained from each of the three species. Thus, the data set consists of 150 points distributed in a 4-dimensional space.

4) Gaussian Data Distributed at the Vertices of a Simplex: These data consist of 75 points distributed in a 4-dimensional space. There are five spherical Gaussian distributions which have their respective mean vectors located at the vertices of a 4-dimensional simplex.⁴ The intervertex distance is √5/4 units and each covariance matrix is diagonal with a standard deviation along every coordinate of 0.2 unit. Fifteen points were generated from each of the five Gaussian distributions, making a total of 75 points.

5) Helix Data: This data set consisted of 30 points distributed along a 3-dimensional helix. The parametric equations for this helix are

$$
X = \cos Z, \qquad Y = \sin Z, \qquad Z = \frac{\sqrt{2}}{2}\, t.
$$

The points are distributed at one-unit intervals along the curve (i.e., t = 0, 1, 2, ..., 29).

6) Nonlinear Data: This data set consisted of 29 points distributed evenly along a 5-dimensional curve. The parametric equations for this curve are

$$
X = \cos Z, \qquad Y = \sin Z, \qquad U = 0.5 \cos 2Z, \qquad V = 0.5 \sin 2Z, \qquad Z = \frac{\sqrt{2}}{2}\, t.
$$

The points were distributed at one-unit intervals along the curve (i.e., t = 0, 1, ..., 28).

Figs. 1 through 9 display the results obtained using the nonlinear mapping algorithm. Convergence was essentially obtained for each case in twenty or less iterations using the gradient searching technique described in Appendix I. As was expected, the data structure inherent in both data sets 1 and 2 was faithfully reproduced under the mapping (see Figs. 1 and 2). This was expected since the data sets are 1- and 2-dimensional, respectively, and therefore a mapping to a 2-space can be accomplished with zero error. In both cases the final error was 10⁻¹⁶.

Observing the result of the Iris data mapping (Fig. 3), we can essentially detect the three species of Iris. The final error was 2 × 10⁻³ which is considered quite small.

The results obtained on the simplex data (data set 4) were quite interesting. The result of the mapping clearly showed the presence of five clusters (see Fig. 4). However, when we compare this result with the projection of the same data onto the 2-space defined by the two largest eigenvectors of the estimated data covariance matrix, we can only detect four clusters. Two of the clusters overlap completely in the 2-space which fits the data in the least squares sense (see Fig. 5). The resulting NLM error was 0.05. The same experiment has been conducted using Gaussian data distributed at the vertices of higher-dimensional simplexes. Figs. 6 and 7 show the NLM and principal eigenvector plots, respectively, for a 19-dimensional Gaussian simplex distribution. These experiments indicate that for some data sets, the NLM is superior to eigenvector projections for data structure analysis.

The results shown in Figs. 8 and 9 clearly indicate the "string structure" in data sets 5 and 6, respectively. The mapping error for the helix data was 6 × 10⁻⁴. The error for data set 6 was 1.6 × 10⁻³.

The utility of any data analysis technique is somehow more convincing when applied to "real" data as opposed to artificially generated data, presuming, of course, that the analysis results are correct. For this reason, the application of the NLM algorithm to an experiment in document retrieval by content is reported here.

The experiment, conducted jointly by Rome Air Development Center (RADC) and the University of Colorado, involved the construction of a document classification space (referred to as the C-space) where every document in the library was represented as a 17-dimensional vector. The construction technique devised by Ossorio [8], [9] describes a mapping of 1125 preselected words and phrases into the C-space. Documents, or equivalently retrieval requests, were located in the space by computing the vector average of the corresponding key words or phrases which were contained in the document or the request. Retrieval was accomplished by rank-ordering the relevance of the library documents to a given request. The relevance measure was computed using the Euclidean metric between the document vectors and the request vector, the concept being that document vectors which are close in the C-space are related by content and therefore should be retrieved together.

⁴ The vertices of a simplex are all equidistant from one another as well as from the origin.
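The retrieval scheme just described (locate each document and request at the vector average of its key terms, then rank documents by Euclidean distance to the request) can be sketched in Python. This is a minimal illustration, not the RADC system: the function names, the 3-dimensional toy space, and every term vector below are hypothetical values invented for the example, whereas the experiment used a 17-dimensional space built from 1125 words and phrases.

```python
import math

def vector_average(vectors):
    """Componentwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def locate(terms, term_vectors):
    """Place a document or request at the average of its known key-term vectors."""
    # Assumes at least one term of the text appears in term_vectors.
    known = [term_vectors[t] for t in terms if t in term_vectors]
    return vector_average(known)

def rank_documents(request_terms, documents, term_vectors):
    """Return document names ordered from nearest to farthest from the request."""
    request = locate(request_terms, term_vectors)
    located = {name: locate(terms, term_vectors) for name, terms in documents.items()}
    return sorted(located, key=lambda name: math.dist(located[name], request))

# Hypothetical toy classification space (3 dimensions, 4 key terms).
term_vectors = {
    "radar":    [1.0, 0.0, 0.0],
    "antenna":  [0.8, 0.2, 0.0],
    "compiler": [0.0, 1.0, 0.0],
    "parser":   [0.0, 0.9, 0.1],
}
documents = {
    "doc_a": ["radar", "antenna"],
    "doc_b": ["compiler", "parser"],
    "doc_c": ["radar", "compiler"],
}
ranking = rank_documents(["antenna"], documents, term_vectors)
print(ranking)
```

In this toy space the request lies closest to the document built from the two radar-related terms, which is the behavior the relevance measure is meant to capture.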
Fig. 1. Nonlinear mapping. Data: 9 pts. along a straight line. Original dimension = 9. Starting configuration: random. Mapping error = 10⁻¹⁶.

Fig. 2. Nonlinear mapping. Data: 9 pts. on a circle. Original dimension = 2. Starting configuration: random. Mapping error = 10⁻¹⁶.

Fig. 3. Nonlinear mapping. Data: Iris data. Original dimension = 4. Starting configuration: maximum variance coordinate plane. Mapping error = 0.002.

Fig. 4. Nonlinear mapping. Data: 75 pts. from 5 Gaussian distributions. Original dimension = 4. Starting configuration: maximum variance coordinate plane. Mapping error = 0.05.

Fig. 5. Eigenvector plot. Data: 75 pts. from 5 Gaussian distributions. Original dimension = 4. Projection on the two principal eigenvectors.
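The eigenvector projection used for comparison in Fig. 5 (projecting the data onto the two eigenvectors of the estimated covariance matrix that have the largest eigenvalues) can be sketched as follows. This is an illustrative sketch that assumes NumPy is available; the function name `principal_projection` and the random demonstration data are hypothetical and are not the paper's data sets.

```python
import numpy as np

def principal_projection(X, d=2):
    """Project the rows of X onto the d covariance eigenvectors with largest eigenvalues."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)       # estimated L x L covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :d]              # columns belonging to the d largest eigenvalues
    return centered @ top

# Hypothetical demo data: 20 points in 3 dimensions whose third coordinate is constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X[:, 2] = 5.0
P = principal_projection(X, d=2)
print(P.shape)
```

Because the demo data really lie in a plane, this linear projection keeps every interpoint distance. For data that do not lie in a plane, such as the simplex data of Fig. 5, a linear projection can place distinct clusters on top of one another, which is the effect the comparison with the NLM illustrates.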

                                                                                                                                                   NONLINEAR MAPPING
                                                                                                                                              Data: 100 pts. from 2OGaussiOn-spherical
                                                                                                                                                    distributions= 19
                                                                                                                                              Origina I d men sion a 19
                                                                                                                                               Starting Conf ig ura t ion s Random
                 4.850
                                                    4.850
                                                                                 H
                                                                                                              .....ff~                        ~~~~~~~~~
                                                                                                                                               MaDPing Error 14
                                                                                                                                                  KK.K
                                                                                      H                              E                                           K                              A
                 4.484--                                                                                                                                                              A     A       A

                 4.30:                                                                                                           T
                                                                                     0                                                   T
                 4.11.
                                                                                                                                                                                            L
                                                                                 ]                 |                     i                      |        Ey                  L L
                 3.934 - -0205             t
                                           l
                                                                                                                                                                 ~~
                                                                       ~~~R
                                                                                                                                                    -~



                                                                                                                                                        NN
                                       4
                                           M   66
                                                                   R
                                                               _______                                   E~~~~~Fig 6
                             M
                 3.384                                                                                 .-1G
                                                                                                          G
                                                                                                                             G
                                                                                                                    G                                                                               DD
                 3.20:
                                                    B          B                                                    G
                 3.01 8                    B            B.S
                 2.834
                                                                                 F                                                                  S        S
                 2.65:
                                      .304                .608                .912             1.216              __50__24
                                                                                                               C1.0  84 2.2
                                                                                                                        212
                                                                                                                                                                                 __
                                                                                                                                                                                      43_276
                                                                                                                                                                                      243 276                     300
                                                                                                                                                                                                                   .4


                                                                                                         Fig. 6.

[Figure: eigenvector plot. Data: 100 points from 20 Gaussian spherical distributions. Original dimension = 19. Projection on the two principal eigenvectors.]

                                                                                                          Fig. 7.
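
The projection named in the Fig. 7 legend can be sketched in a few lines. This is only an illustration under stated assumptions: the sizes (20 spherical Gaussian clusters, 100 points, 19 dimensions) follow the legend, while the even split of 5 points per cluster, the cluster means, the spreads, the random seed, and the helper name `principal_projection` are hypothetical and are not taken from the paper.

```python
import numpy as np

def principal_projection(X, n_components=2):
    """Project the rows of X onto the leading eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = np.cov(Xc, rowvar=False)                # (dim, dim) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)                    # assumed seed
n_clusters, per_cluster, dim = 20, 5, 19          # 5 per cluster is an assumption
means = rng.normal(scale=3.0, size=(n_clusters, dim))        # assumed cluster means
X = np.vstack([m + rng.normal(scale=0.5, size=(per_cluster, dim)) for m in means])

Y, top = principal_projection(X)                  # Y is the 2-space scatter to plot
print(Y.shape, top)
```

Because the projection is linear, it can only show the two directions of largest variance; structure that lies along the remaining directions is discarded rather than fitted.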

   Briefly, the C-space construction proceeded as follows. First, the subject content covered by the 188 documents in the experimental library was subjectively partitioned into 23 technical fields (see Appendix II for a listing of these fields). Several experts representing each field rated the relevance of each of the 1125 words or phrases to his field, using a scale from 0 to 8. The ratings by the experts within each field were then averaged to obtain a word-by-field relevance matrix designated X; the ijth element of X represents the relevance of word or phrase i to field j. (It is convenient to think of the 1125 words or phrases as being represented by vectors in a 23-dimensional space spanned by the 23 coordinate fields.) Next, a 23 × 23 field correlation matrix C was computed, where the ijth element represented the correlation between the ith and jth fields. C was then factored using the minimum residual method and rotated to a Varimax criterion. Seventeen orthogonal factors were then selected to define the 17-dimensional C-space.
   All 1125 word and phrase vectors were mapped into the 17-dimensional C-space using a simple nonlinear formula which tended to emphasize large coordinate
[Figure: nonlinear mapping. Data: 30 points along a helix. Original dimension = 3. Starting configuration: random.]

                                                                                      Fig. 9.
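
The kind of mapping shown in Fig. 9 can be sketched with the iteration given in Appendix I. The block below is a minimal illustration, not the author's FORTRAN program: the helix parametrization (x = cos z, y = sin z, z = (√2/2)t for t = 0, ..., 29), the iteration count, the seed, the `eps` guard, and the function names `pairwise_dist`, `mapping_error`, and `nlm` are assumptions introduced here. Only the error E, the update rule, and the value of MF follow Appendix I.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mapping_error(dstar, d):
    """E = (1/c) * sum_{i<j} (d*_ij - d_ij)^2 / d*_ij, with c = sum_{i<j} d*_ij."""
    iu = np.triu_indices(dstar.shape[0], k=1)
    return float(((dstar[iu] - d[iu]) ** 2 / dstar[iu]).sum() / dstar[iu].sum())

def nlm(X, d_out=2, n_iter=200, mf=0.35, seed=0, eps=1e-9):
    """Sketch of the Appendix I iteration; assumes X has no duplicate rows."""
    n = X.shape[0]
    dstar = pairwise_dist(X)
    c = dstar[np.triu_indices(n, k=1)].sum()
    np.fill_diagonal(dstar, 1.0)          # placeholder; the j == p terms are masked below
    Y = np.random.default_rng(seed).normal(size=(n, d_out))   # random starting configuration
    errors = []
    for _ in range(n_iter):
        d = pairwise_dist(Y)
        errors.append(mapping_error(dstar, d))
        np.fill_diagonal(d, 1.0)
        d = np.maximum(d, eps)            # guard: coincident points would blow up the partials
        diff = Y[:, None, :] - Y[None, :, :]          # y_pq - y_jq
        delta = dstar - d                             # d*_pj - d_pj
        w = 1.0 / (dstar * d)
        np.fill_diagonal(w, 0.0)                      # exclude j == p
        grad = -(2.0 / c) * ((w * delta)[:, :, None] * diff).sum(axis=1)
        inner = delta[:, :, None] - (diff ** 2 / d[:, :, None]) * (1.0 + delta / d)[:, :, None]
        curv = -(2.0 / c) * (w[:, :, None] * inner).sum(axis=1)
        Y = Y - mf * grad / np.maximum(np.abs(curv), eps)   # y(m+1) = y(m) - MF * delta_pq
    errors.append(mapping_error(dstar, pairwise_dist(Y)))
    return Y, errors

# Assumed helix: 30 points in 3 dimensions.
t = np.arange(30)
z = (np.sqrt(2) / 2) * t
X = np.column_stack([np.cos(z), np.sin(z), z])

Y, errors = nlm(X)
print(f"initial E = {errors[0]:.4f}, final E = {errors[-1]:.6f}")
```

All coordinates are updated together from the configuration at iteration m, which matches the form of the update rule in Appendix I; the final error depends on the random start.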

projections and minimize small coordinate projections. Finally, the 188 documents were located in the C-space by algebraically averaging the word or phrase vectors corresponding to the words or phrases which appeared in the documents.⁵

   ⁵ The entire document was never searched for key words or phrases. Rather, for one half of the documents only the abstracts were used, and for the remainder several paragraphs from each document were used.

   In order to evaluate the C-space as a potential method for document indexing, several individuals were asked to generate English queries (see Appendix III for the pertinent queries used here) which were then keypunched, automatically scanned for key word or phrase content, and finally mapped into the C-space. Each requester was then asked to identify those documents of the entire 188 which he felt were most relevant to his query. The C-space was then evaluated by examining the rank ordering of the retrieved documents to compare them to the list of relevant documents specified by the requester. The results of this evaluation can be found in Ossorio [8].
   The nonlinear mapping algorithm was used to evaluate the "structure" of the documents in the C-space. Specifically, we were interested in how the documents considered relevant to a particular request were clustered, and further, how these clusters were interrelated to each other and to the entire library. To accomplish

this analysis, all 188 17-dimensional vectors were used as the input data to the NLM. The numerals 1 through 5 were used in the resulting 2-dimensional mapping to designate the documents labeled relevant to queries 1 through 5. In addition, the symbol D was used to designate the remaining library documents. It is important to note that the NLM algorithm did not utilize the numeric query designations in computing the mapping. Only at the time of plotting the final 2-space configuration of the 188 points were the numeric and symbolic designators used to distinguish the data. The error in achieving the NLM shown in Fig. 10 was 0.062, which was considered to be acceptable for adequate 2-space representation.

Fig. 10. Nonlinear mapping: photograph of CRT display. Data: 1 = eight Request 1 vectors; 2 = seven Request 2 vectors; 3 = sixteen Request 3 vectors; 4 = thirteen Request 4 vectors; 5 = seven Request 5 vectors. Starting configuration: maximum variance coordinate plane. Mapping error = 0.062.

   The following facts were obtained upon investigation of the NLM result.
   1) The documents considered relevant to a given request were clustered, lending evidence to support the hypothesis that related documents have C-space vectors which are close.
   2) There does not appear to be any natural C-space structure relating subsets of documents. Instead, the documents tend to be uniformly distributed throughout the space.
   3) Clusters 2 and 3 tend to overlap, yet they are well separated from clusters 4 and 5. This can easily be accounted for since requests 2 and 3 are both concerned with the common subject of statistical data analysis, whereas 4 and 5 involve completely different subjects. In general, the intercluster relationships seem consistent with their respective subject relationships.

   In summary, we have found the NLM algorithm to be of considerable value in aiding us in our understanding of the C-space as well as other document spaces. Presently we are planning to incorporate a similar mapping technique in an on-line document retrieval system in order to improve the retrieval via geometric means.
   The experimental system will operate as follows. The on-line user would examine the 30 highest-ranked documents by retrieving and reading their abstracts. He would then indicate those he considered relevant. Next, a scatter diagram similar to Fig. 10 would be presented upon the CRT display where each of the 30 documents would be indicated by an I or an R, depending upon its relevance. In addition, the original query vector will be displayed as a Q. After examining the relative positions of the documents in the mapping, the user would select (using a light pen) one or more relevant documents to be used to generate a new query vector(s). The concept is that the query vector can be moved to highly relevant regions of the document space by interacting at a display console with a geometric representation of the space.

RELATIONSHIP OF NLM TO OTHER STRUCTURE ANALYSIS ALGORITHMS

   A mapping algorithm which bears a relationship to the NLM algorithm is one developed by Shepard [11] and later improved by Kruskal [5], [6]. Briefly, the Shepard-Kruskal algorithm seeks to find a configuration of points in a t-space such that the resultant interpoint distances preserve a monotonic relationship to a given set of interelement similarities (or dissimilarities). Specifically, they wish to analyze a set of interelement similarities (or dissimilarities) given by $S_{ij}$, $i = 1, \ldots, N$, $j = 1, \ldots, N$. Suppose these similarities are ordered in increasing magnitude, such that

\[ S_{p_1 q_1} < S_{p_2 q_2} < \cdots \]

The Kruskal-Shepard algorithm seeks to find a set of N t-dimensional vectors $y_i$, $i = 1, \ldots, N$, such that the order of the interpoint distances $d_{ij} = \mathrm{dist}[y_i, y_j]$ deviates as little as possible from the monotonic ordering of the corresponding similarities. Although the mathematical formulations are similar, the underlying mapping criteria are quite different.
   Ball [1] has compiled an excellent survey of clustering and clumping algorithms which are useful in solving the "structure analysis" problem. However, it has been our experience in using clustering techniques that these algorithms suffer to some extent from the following four deficiencies.
   1) When using a particular algorithm, the resulting cluster configuration is highly dependent upon a set of control parameters which must be fixed by the user. Some examples of such parameters are:
      a) the similarity measure;
      b) various similarity thresholds;
      c) number of iterations required;
      d) thresholds which control the increase or reduction of the number of clusters;
      e) the minimum number of vectors required to define a cluster.
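
The dependence of a cluster configuration on a user-set threshold can be made concrete with a small sketch. The algorithm below is a simple sequential "leader" scheme chosen here for brevity; it is not Isodata or any algorithm surveyed by Ball [1], and the synthetic data, the threshold values, the seed, and the name `leader_clusters` are all assumptions made for illustration.

```python
import numpy as np

def leader_clusters(X, threshold):
    """Assign each vector to the nearest existing center within `threshold`;
    otherwise the vector starts a new cluster and becomes its center."""
    centers, labels = [], []
    for x in X:
        dists = [float(np.linalg.norm(x - c)) for c in centers]
        if dists and min(dists) <= threshold:
            labels.append(int(np.argmin(dists)))
        else:
            centers.append(x)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers)

rng = np.random.default_rng(1)                       # assumed seed
blob_means = ([0.0, 0.0], [6.0, 0.0], [0.0, 6.0])    # three well-separated groups
X = np.vstack([rng.normal(loc=m, scale=0.2, size=(20, 2)) for m in blob_means])

# Same data, same algorithm; only the distance threshold changes.
counts = {thr: len(leader_clusters(X, thr)[1]) for thr in (0.2, 2.0, 20.0)}
print(counts)
```

With a very small threshold the three groups are fragmented into many clusters, with a very large one they merge into a single cluster, and only an intermediate value recovers the three groups, which is the sense in which the user must already know something about the data.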
   When choosing the control parameters for complex data, the user must either have a good deal of a priori information regarding the "structure" of his data, or he must apply the algorithm many times for different values of the control parameters. This second alternative is, at best, tedious.
   2) Most of the existing clustering algorithms are particularly sensitive to hyperspherical structure and are inefficient in detecting more complex relationships in the data.
   3) Perhaps the most serious deficiency involving present-day clustering algorithms is that there do not exist really good ways for evaluating a resultant cluster configuration.
   4) When two clusters are close, the vectors between tend to form a bridge and cause spurious mergers [7].

   We feel that the nonlinear mapping is a highly promising structure analysis algorithm since it suffers little from the listed clustering deficiencies. Consider the following facts concerning the algorithm.
   1) The routine does not depend upon any control parameters that would require a priori knowledge about the data. Specifically, the user must set only the number of iterations and the convergence constant (MF in Appendix I).
   2) It is highly efficient in identifying hyperspherical, hyperellipsoidal, and other complex data structures.
   3) The resulting mapping (scatter diagram) is easily evaluated by the researcher, thereby taking advantage of the human ability to detect and identify data structure.
   4) The problem concerning extraneous data and spurious mergers is not present since humans easily eliminate troublesome data points by making global evaluations (machines have difficulty performing this function).
   5) The algorithm is simple and efficient.

LIMITATIONS AND EXTENSIONS

   There are, of course, limitations to every algorithm and the nonlinear mapping is no exception. There exist two limitations which we are presently investigating. The first has to do with the reliability of the scatter diagram in displaying extremely complex high-dimensional structure. It is conceivable that the minimum mapping error is too large (E ≫ 0.1) and the 2-dimensional scatter plot fails to portray the true structure. However, we feel that for data structures composed of superpositions of hyperspherical and hyperellipsoidal clusters, the nonlinear mapping algorithm will, in general, display adequate representations of the true data "structure."
   The second limitation of the nonlinear mapping algorithm is related to the number of vectors that it can handle. Since we must compute and store the interdistance matrix, which consists of N(N-1)/2 elements, we are limited at present to N < 250 vectors.⁶ In those cases where N > 250, we suggest using a data compression technique to reduce the data set to less than 250 vectors. Specifically, we propose to use the Isodata [2] clustering algorithm to perform data compression. This is actually a natural function of clustering since we replace several vectors with a typical representative vector (i.e., the cluster center). Our previous objections to present-day clustering algorithms do not apply here since we are only concerned with fitting the data with 250 cluster centers. We are specifically not using the clustering algorithm to detect structure.

   ⁶ The nonlinear map is programmed in FORTRAN IV and runs on a GE-635 computer equipped with 128 K of core. The computation time can be estimated by

\[ T \approx (1.1 \times 10^{-5}) \cdot \frac{[\text{illegible}]}{2} \]

   minutes, where I = number of iterations and N = number of vectors.

   We have used the NLM to analyze multivariate data from two or more classes for the purpose of determining how well the classes can be discriminated from one another. In these cases, it is recommended that the dimensionality be reduced to the smallest number of variables which still preserve discrimination.⁷ In many problems certain measurements provide little discriminatory information; yet if these measurements are included, the NLM will attempt to "fit" interpoint distances along these "noisy" directions as well as along discriminating directions. In truly high-dimensional problems, the resulting mapping may show considerable overlap between classes and still a high degree of discrimination may be possible. This phenomenon occurred when analyzing a 4-class, 24-dimensional data set. The resulting NLM (the final error was 0.5, which was considered high) showed considerable overlap among the data from three of the classes; yet, using a piecewise linear discrimination technique (based upon the use of a Fisher's linear discriminant between all pairs of classes), 94 percent correct classification was achieved. In this case, the NLM did not give an incorrect result since the classes greatly overlapped in approximately 20 dimensions and mildly overlapped in the remaining space. The NLM weighted all coordinates equally in an attempt to fit the interpoint distances, and therefore the resulting mapping indicated the predominant overlap which actually existed.

   ⁷ A number of techniques may be used for this purpose. We often use the following:
      a) Discriminant measure

\[ M(X) = \sum_{i<j} \frac{(\mu_i - \mu_j)^2}{\sigma_i^2 + \sigma_j^2} \]

      b) Interpoint measure

\[ M(X) = \frac{1}{\sigma_X^2} \sum_{i<j} \frac{1}{N_i N_j} \sum_{p=1}^{N_i} \sum_{q=1}^{N_j} \left( X_p^{(i)} - X_q^{(j)} \right)^2 \]

      where
         $\mu_i$ = mean of class i along X
         $\sigma_i^2$ = variance of class i along X
         $\sigma_X^2$ = variance of all data along X
         $X_p^{(i)}$ = the pth sample from the ith class along X
         $N_i$ = number of samples from the ith class.
      c) Multilinear discriminant defined in Wilks [14].

   The NLM algorithm described here is one of many algorithms which are being programmed and incorporated into a large on-line graphics-oriented computer system, entitled the On-Line Pattern Analysis and Recognition System (OLPARS) [10].⁸ Once the NLM algorithm is incorporated into the OLPARS system, the on-line user will be able to designate a data set, and from the graphics console execute the NLM. The user shall specify a mapping to a 2-space or a 3-space. For d = 2, the resultant scatter diagram will be displayed upon the CRT; for d = 3, a perspective scatter plot will be displayed. If the 3-space option is selected, the user will be able to dynamically analyze the resultant perspective scatter diagram by selecting various rotations of the three space. When the user selects d = 2, he will be given the capability to designate subsets of data (via piecewise linear boundaries drawn on the CRT) representing a collection of points in the scatter diagram which exhibit structure, and thereby partition the initial data list into structured subsets.

   ⁸ For other examples of interactive pattern analysis systems, see Ball and Hall [3], Stanley et al. [12], and Walters [13].

APPENDIX I

   Let E(m) be defined as the mapping error after the mth iteration, i.e.,

\[ E(m) = \frac{1}{c} \sum_{i<j}^{N} \frac{\left[ d_{ij}^{*} - d_{ij}(m) \right]^2}{d_{ij}^{*}} \]

where

\[ c = \sum_{i<j}^{N} d_{ij}^{*} \]

and

\[ d_{ij}(m) = \sqrt{ \sum_{k=1}^{d} \left[ y_{ik}(m) - y_{jk}(m) \right]^2 }. \]

The new d-space configuration at time m + 1 is given by

\[ y_{pq}(m+1) = y_{pq}(m) - (\mathrm{MF}) \cdot \Delta_{pq}(m) \]

where

\[ \Delta_{pq}(m) = \frac{\partial E(m)}{\partial y_{pq}(m)} \bigg/ \left| \frac{\partial^2 E(m)}{\partial y_{pq}(m)^2} \right| \]

and MF is the "magic factor" which was determined empirically to be MF ≈ 0.3 or 0.4. The partial derivatives are given by

\[ \frac{\partial E}{\partial y_{pq}} = -\frac{2}{c} \sum_{\substack{j=1 \\ j \neq p}}^{N} \left[ \frac{d_{pj}^{*} - d_{pj}}{d_{pj}^{*} \, d_{pj}} \right] (y_{pq} - y_{jq}) \]

and

\[ \frac{\partial^2 E}{\partial y_{pq}^2} = -\frac{2}{c} \sum_{\substack{j=1 \\ j \neq p}}^{N} \frac{1}{d_{pj}^{*} \, d_{pj}} \left[ (d_{pj}^{*} - d_{pj}) - \frac{(y_{pq} - y_{jq})^2}{d_{pj}} \left( 1 + \frac{d_{pj}^{*} - d_{pj}}{d_{pj}} \right) \right]. \]

   In our program we take precautions to prevent any two points in the d-space from becoming identical. This prevents the partials from "blowing up."

APPENDIX II

CLASSIFICATION SPACE FIELDS

    1) Adaptive Systems
    2) Analog Computers
    3) Applied Mathematics
    4) Automata Theory
    5) Computer Components and Circuits
    6) Computer Memories
    7) Computer Software
    8) Display Consoles
    9) Human Factors
   10) Information Retrieval
   11) Information Theory
   12) Input-Output Equipment
   13) Language Translation
   14) Linear Algebra
   15) Multivariate Statistical Analysis
   16) Nonnumeric Data Processing
   17) Numerical Analysis
   18) Pattern Recognition
   19) Probability and Statistics
   20) Programming Languages
   21) Stochastic Processes
   22) System Design and Evaluation
   23) Time-Sharing Systems.

APPENDIX III

REQUESTS

   Request 1: What is known about the statistical distributions of words or concepts in English text? What impact does this knowledge or lack of knowledge have on the effectiveness of standard statistical methods to information retrieval problems? Are nonparametric methods more applicable?
   Request 2: I am interested in techniques for data analysis. In particular, I wish information on "cluster-seeking" techniques as opposed to those of factorial analysis and discriminant analysis. "Cluster-seeking" techniques may be classified as follows: probabilistic techniques,

signal detection, clustering techniques, clumping techniques, eigenvalue-type techniques, and minimal mode-seeking techniques.
   Request 3: I would like any information concerning Bayesian statistics. In particular, I would like to know if one can define or devise multiple-decision procedures from the Bayes approach. Also, how sensitive are Bayes procedures to the prior distribution? Finally, I would like a comparison of the Bayes approach to other classical decision theoretic approaches.
   Request 4: What is the structure and characteristics of paging techniques?
   Request 5: Are there survey documents (information) available which discuss or detail the relative practicality of memories; for example, capacity versus utilization, density, weight, environmental features, failure rates, economics, etc.?

REFERENCES

[1] G. H. Ball, "A comparison of some cluster-seeking techniques," Rome Air Development Center, Rome, N. Y., Tech. Rept. RADC-TR-66-514, November 1966.
[2] G. H. Ball and D. Hall, "Isodata," portion of Stanford Research Institute (SRI) Final Report to RADC Contract AF30(602)-4196, September 1967.
[3] G. H. Ball and D. Hall, "Promenade - an improved interactive graphics man/machine system for pattern recognition," Stanford Research Institute, Menlo Park, Calif., Project 6737, October 1968.
[4] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, pp. 178-188, 1936.
[5] J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, pp. 1-27, March 1964.
[6] J. B. Kruskal, "Nonmetric multidimensional scaling: a numerical method," Psychometrika, vol. 29, pp. 115-129, June 1964.
[7] G. Nagy, "State of the art in pattern recognition," Proc. IEEE, vol. 56, pp. 836-861, May 1968.
[8] P. G. Ossorio, "Classification space analysis," RADC-TDR-64-287, October 1964.
[9] P. G. Ossorio, "Attribute space development and evaluation," RADC-TDR-67-640, January 1968.
[10] J. W. Sammon, "On-line pattern analysis and recognition system (OLPARS)," RADC-TR-68-263, August 1968.
[11] R. N. Shepard, "The analysis of proximities: multidimensional scaling with an unknown distance function," Psychometrika, vol.
                                                                              27, pp. 125-139, 219-246, 1962.
                                                                         [12] G. L. Stanley, G. G. Lendaris, and W. C. Nienow, "Pattern rec-
                     ACKNOWLEDGMENT                                           ognition program," AC Electronics Defense Research Labs.,
                                                                              Santa Barbara, Calif., TR-567-16, November 1967.
  The author expresses his appreciation to D. Elefante                   [13] C. M. Walters, "On line computer based aids for the investiga-
for his efforts in developing an efficient FORTRAN iv pro-                    tion of sensor data compression, transmission and delay prob-
                                                                              lems," 1966 Proc. Natl. Telemetry Conf., Boston, Mass.
gram for the Nonlinear Mapping Algorithm.                                [14] S. S. Wilks, Mathematical Statistics. New York: J. Wiley, 1962.




       Mathematical Analysis of Ferrite Core Memory Arrays

                                                   WILLIAM T. WEEKS

   Abstract-A mathematical model for simulating pulse propagation in ferrite core memory arrays is described. Although specifically developed to analyze 3-dimensional arrays, the model is sufficiently general to give a satisfactory analysis of pulse propagation, waveform deterioration, and noise generation in a wide variety of memory configurations. The model treats the memory as a generalized, mutually coupled, multiconductor transmission line system. Insofar as is possible, the transmission line parameters are calculated from the array geometry, thus leaving only a small number of parameters that must be supplied empirically. Following a discussion of the equations which define the model and the methods by which they are solved, a sample array calculation is given to illustrate the kind of information that can be obtained from the model.

   Index Terms-Arrays, computers, ferrite cores, memories, pulse propagation, transmission line system.

                     INTRODUCTION

DURING the past five years, considerable progress has been made in the development of mathematical models for simulating the electrical properties of ferrite core memory arrays. The purpose of this paper is to describe the techniques available for the analysis of 3-dimensional ferrite core memory arrays. The techniques presented here represent a substantial advance in the state-of-the-art over earlier reported work [1], [2], which dealt mainly with the simulation of 2-dimensional arrays.
   A precise mathematical description of a memory array, backed up by a rigorous and practically realizable method for solving the resulting equations, would be of inestimable value to a memory designer, for it would remove much of the uncertainty from the design process

    Manuscript received October 23, 1968; revised February 10, 1969.
    The author is with IBM Corporation, Components Division, Poughkeepsie, N. Y. 12602.

				