Analysis of Data Mining Visualization Techniques Using ICA and SOM Concepts

									                                                  (IJCSIS) International Journal of Computer Science and Information Security,
                                                  Vol. 9, No. 1, January 2011


                 ANALYSIS OF DATA MINING VISUALIZATION TECHNIQUES
                            USING ICA AND SOM CONCEPTS
                        K.S. RATHNAMALA1, Dr. R.S.D. WAHIDA BANU2
                1 Research Scholar, Mother Teresa Women's University, Kodaikanal
            2 Professor & Head, Dept. of Electronics & Communication Engg., GCE.

This research paper is about data mining (DM) and visualization methods using independent component analysis and the self-organizing map for gaining insight into multidimensional data. A new method is presented for an interactive visualization of cluster structures in a self-organizing map. By using a contraction model, the regular grid of the self-organizing map visualization is smoothly changed toward a presentation that better shows the proximities in the data space. A novel visual data mining method is proposed for investigating the reliability of estimates resulting from a stochastic independent component analysis (ICA) algorithm. Two algorithms that can be used in a general context are presented in this paper. FastICA for independent binary sources is described. The model resembles the ordinary ICA model, but the summation is replaced by the Boolean operator OR and the multiplication by AND. A heuristic method for estimating the binary mixing matrix is also proposed. Furthermore, the differences in the results when using different objective functions in the FastICA estimation algorithm are also discussed.

KEY WORDS: Independent component analysis, self-organizing map, vector quantization, patterns, agglomerative hierarchical methods, time series segmentation, finding patterns by proximity, clustering validity indices, feature selection and weighting, FastICA.

1. INTRODUCTION

The tasks that are encountered within data mining research are predictive modeling, descriptive modeling, discovering rules and patterns, exploratory data analysis, and retrieval by content. Predictive modeling includes many typical tasks of machine learning such as classification and regression. Descriptive modeling is ultimately about modeling all of the data, e.g., estimating its probability distribution. Finding a clustering, a segmentation, or an informative linear representation are common subtasks of descriptive modeling. Particular methods for discovering rules and patterns emphasize finding interesting local characteristics and patterns instead of global models.

Descriptive data mining techniques for data description can be divided roughly into three groups:

Proximity preserving projections for (visual) investigation of the structure of the data.

                                                                                             ISSN 1947-5500

Partitioning the data by clustering and segmentation.

Linear projections for finding interesting linear combinations of the original variables using principal component analysis and independent component analysis.

A clustering is a partition of the set of all data items C = {1, 2, ..., N} into K disjoint clusters, C = ∪_{i=1}^{K} C_i.

2. SELF-ORGANIZING MAP

The basic self-organizing map (SOM) is formed of K map units organized on a regular k x l low-dimensional grid, usually 2D for visualization. Associated to each map unit i, there is

1. a neighborhood kernel h(d_ij, σ(t)), where the distance d_ij is measured from map unit i to the others along the grid (output space), and
2. a codebook vector c_i that quantizes the data space (input space).

The magnitude of the neighborhood kernel decreases monotonically with the distance d_ij. A typical choice is the Gaussian kernel.

Batch algorithm

One possibility to implement a batch SOM algorithm is to add an extra step to the batch K-means procedure:

c_i := ( Σ_{j=1}^{K} |C_j| h(d_ij, σ(t)) c̄_j ) / ( Σ_{j=1}^{K} |C_j| h(d_ij, σ(t)) ),  ∀i,

where c̄_j is the mean of the data vectors currently assigned to unit j. A relatively large neighborhood radius in the beginning gives a global ordering for the map. The kernel width σ(t) is then decreased monotonically along with the iteration steps, which increases the flexibility of the map to provide a lower quantization error in the end. If the radius is run to zero, the batch SOM becomes identical to K-means.

The batch SOM is a computational short-cut version of the basic SOM. Despite the intuitive clarity and elegance of the basic SOM, its mathematical analysis has turned out to be rather complex. This comes from the fact that there exists no cost function that the basic SOM would minimize for a probability distribution.

In general, the number of map codebook vectors governs the computational complexity of one iteration step of the SOM. If the size of the SOM is scaled linearly with the number of data vectors, the load scales to O(MN²). On the other hand, the selection of K can be made following, e.g., K ∝ √N, and the load decreases to O(MN^1.5). It is suggested that the SOM Toolbox applies to small to medium data sets of up to, say, 10 000-100 000 records. A specific problem is that the memory consumption in the SOM Toolbox grows quadratically with the map size K.

In practice, the SOM and its variants have been successful in a considerable number of application fields and individual applications.
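The batch update above can be sketched in pure Python. This is a minimal illustration, not the SOM Toolbox implementation; the helper `batch_som_step`, the 1-D toy data, and the grid coordinates are all our own assumptions.

```python
import math

def batch_som_step(data, codebook, grid, sigma):
    """One batch-SOM iteration: a K-means assignment step followed by a
    Gaussian-kernel smoothing of the unit centroids over the map grid."""
    K, dim = len(codebook), len(data[0])
    sums = [[0.0] * dim for _ in range(K)]   # per-unit sum of assigned vectors
    counts = [0] * K                         # per-unit |C_j|
    for x in data:
        # best-matching unit = nearest codebook vector (nearest neighbor condition)
        bmu = min(range(K),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))
        counts[bmu] += 1
        sums[bmu] = [s + a for s, a in zip(sums[bmu], x)]
    new_codebook = []
    for i in range(K):
        num, den = [0.0] * dim, 0.0
        for j in range(K):
            if counts[j] == 0:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[j]))
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood kernel
            num = [n + h * s for n, s in zip(num, sums[j])]  # h * |C_j| * mean_j
            den += h * counts[j]
        new_codebook.append([n / den for n in num] if den > 0 else codebook[i])
    return new_codebook

# Toy run: two 1-D units on a 1-D grid. With a very small sigma the kernel
# vanishes between units, and the step coincides with batch K-means,
# as noted in the text.
data = [[0.0], [0.5], [9.5], [10.0]]
cb = batch_som_step(data, [[1.0], [9.0]], [(0.0,), (1.0,)], sigma=0.1)
```

Running the toy example moves each unit to (approximately) the centroid of its assigned data, illustrating the "radius run to zero gives K-means" remark.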

In the context of this paper, interesting application areas close to VDM include:

Visualization and UI techniques, especially in information retrieval, and exploratory data analysis in general.

Context-aware computing.

Industrial applications for process monitoring and analysis.

Visualization capabilities, data and noise reduction by topologically restricted vector quantization, and the practical robustness of the SOM are of benefit to data mining. There are also methods for additional speed-ups in the SOM for especially large datasets in data mining and in document retrieval applications.

The SOM framework is not restricted to Euclidean space or real vectors. A variant of the SOM in a non-Euclidean space is presented to enhance modeling and visualization of hierarchically distributed data. This method uses a fisheye distortion in the visualization. Also, self-organizing maps and similar structures for symbolic data exist and have been applied to context-aware computation.

3. AGGLOMERATIVE HIERARCHICAL CLUSTERING

Some clustering methods construct a model of the input data space that inherently would allow classifying a new sample into one of the determined clusters. K-means partitions the input data space in this manner. Some other methods merely provide a partition of the items in the sample: the agglomerative hierarchical methods provide an example of this case.

The family of partitional methods is often opposed to the hierarchical methods. Agglomerative hierarchical methods do not aim at minimizing a global criterion for partitioning, but join data items into bigger clusters in a bottom-up manner. In the beginning, all samples are considered to form their own cluster. After this, at each of N-1 steps the pair of clusters having minimal pairwise dissimilarity δ is joined, which reduces the number of remaining clusters by one. The merging is repeated until all data is in one cluster. This gives a set of nested partitions, and a tree presentation is quite a natural way of representing the result.

Here we list the between-cluster dissimilarities δ of some of the most common agglomeration strategies, the single linkage (SL), complete linkage (CL), and average linkage (AL) criteria:

δ_1 = δ_SL = min { d_ij : i ∈ C_k, j ∈ C_l }

δ_2 = δ_CL = max { d_ij : i ∈ C_k, j ∈ C_l }

δ_3 = δ_AL = ( 1 / (|C_k||C_l|) ) Σ_{i ∈ C_k} Σ_{j ∈ C_l} d_ij

where C_k and C_l (k ≠ l) are any two distinct clusters. SL and CL are invariant for monotone transformations of the dissimilarity.
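The three linkage criteria can be written directly from their definitions. A small sketch follows; the function name `linkage_dissimilarity` and the toy dissimilarity matrix are ours.

```python
def linkage_dissimilarity(d, ck, cl, kind):
    """Between-cluster dissimilarity for single (SL), complete (CL) and
    average (AL) linkage. d is a symmetric pairwise dissimilarity matrix."""
    pair = [d[i][j] for i in ck for j in cl]
    if kind == "SL":                 # single linkage: closest pair
        return min(pair)
    if kind == "CL":                 # complete linkage: farthest pair
        return max(pair)
    return sum(pair) / len(pair)     # average linkage: mean over all pairs

# Toy dissimilarity matrix over items {0,1,2,3}; clusters {0,1} and {2,3}.
d = [[0, 1, 4, 6],
     [1, 0, 2, 5],
     [4, 2, 0, 1],
     [6, 5, 1, 0]]
sl = linkage_dissimilarity(d, [0, 1], [2, 3], "SL")
cl = linkage_dissimilarity(d, [0, 1], [2, 3], "CL")
al = linkage_dissimilarity(d, [0, 1], [2, 3], "AL")
```

On this matrix the three criteria already disagree (2, 6, and 4.25), which is why the merge order of an agglomerative run depends on the chosen linkage.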

SL is reported to be noise sensitive but capable of producing elongated or chained clusters, while CL and AL tend to produce more spherical clusters. If similarities are used instead, the merging occurs for the maximum pairwise cluster similarity.

4. TIME SERIES SEGMENTATION

In addition to the basic cluster analysis tasks, other clustering methods that include auxiliary constraints are also discussed here. One of these is time series segmentation, where the data items have some natural order, e.g., time, which must be taken into account; a segment always consists of a sequence of subsequent samples of the time series.

A K-segmentation divides X into K segments C_i with K-1 segment borders c_1, ..., c_{K-1} so that

C_1 = [x(1), x(2), ..., x(c_1)], ..., C_K = [x(c_{K-1}+1), x(c_{K-1}+2), ..., x(N)].   (Eq. 1)

This is the basic time series segmentation task, where each segment is considered to emerge from a different model. Furthermore, we consider the case where the data to be segmented is readily available.

As in the basic clustering task, we wish to minimize some adequate cost function by selection of the segment borders. We stay with costs which are sums of individual segment costs that are not affected by changes in other segments. An example of such a function is an SSE cost function where c_i is the mean vector of the data vectors in segment C_i. There is, of course, a fundamental difference between time series segmentation with an SSE cost and vector quantization. In vector quantization, the borders of the nearest neighbor regions V_i are defined by the codebook vectors, whereas in segmentation the mean vectors c_i are determined by the segments C_i but cannot directly be used to infer the segment borders.

Minimizing the SSE cost for segmentation aims at describing each segment by its mean value. It may also be seen as splitting the sequence so that the (biased) sample variance, computed by pooling the sample variances of the segments together, is minimal.

Algorithms

The basic segmentation problem can be solved optimally using dynamic programming. The dynamic programming algorithm also finds the optimal 1, 2, ..., K-1 segmentations while searching for an optimal K-segmentation. The computational complexity of dynamic programming is of order O(KN²) if the cost of a segment can be calculated in linear time. This may be too much when there are large amounts of data.

Another class are the merge-split algorithms, of which the local and global iterative replacement algorithms (LIR and GIR) resemble the batch K-means in the sense that at each step they change the descriptors of the partition (segment borders vs. codebook vectors) to match a necessary condition of a local optimum.
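The optimal K-segmentation under the SSE cost can be sketched with dynamic programming as described above. This is a minimal illustration with our own naming (`optimal_segmentation`, toy 1-D sequence); prefix sums give O(1) segment costs, so the whole run is O(KN²).

```python
def optimal_segmentation(x, K):
    """Optimal K-segmentation of sequence x under the SSE cost,
    via O(K*N^2) dynamic programming with O(1) incremental segment costs."""
    N = len(x)
    s = [0.0] * (N + 1)    # prefix sums of x
    s2 = [0.0] * (N + 1)   # prefix sums of x^2
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(a, b):  # SSE of segment x[a:b] around its own mean
        n = b - a
        return s2[b] - s2[a] - (s[b] - s[a]) ** 2 / n

    INF = float("inf")
    # cost[k][n]: best cost of a k-segmentation of the first n samples
    cost = [[INF] * (N + 1) for _ in range(K + 1)]
    back = [[0] * (N + 1) for _ in range(K + 1)]
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for m in range(k - 1, n):   # m = border of the previous segment
                c = cost[k - 1][m] + sse(m, n)
                if c < cost[k][n]:
                    cost[k][n], back[k][n] = c, m
    borders, n = [], N                  # recover the segment borders
    for k in range(K, 0, -1):
        borders.append(n)
        n = back[k][n]
    return cost[K][N], sorted(borders)

# Three flat levels with small jitter; the optimal 3-segmentation should
# recover the level changes.
x = [0.0, 0.1, -0.1, 5.0, 5.2, 4.8, 9.9, 10.1]
best_cost, borders = optimal_segmentation(x, 3)
```

As the text notes, the same tables `cost[k][·]` also contain the optimal 1, ..., K-1 segmentations as a by-product of the search.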

The LIR gets more easily stuck in bad local minima; the GIR was considerably better in this sense, yet still sensitive to the initialization. The GIR and LIR algorithms can be seen as variants of the "Pavlidis algorithm" that changes the borders gradually toward a local optimum.

The test procedures use random initialization for the segments. As in the case of K-means, the initialization matters, and it might be advisable to try an educated guess for the initial positions. One possibility to create a more effective segmentation algorithm is to combine several greedy methods. For example, the basic bottom-up and top-down methods can be fine-tuned by merge-split methods.

Applications

Time series and other similar segmentation problems arise in different applications, e.g., in approximating functions by piecewise linear functions. This might be done for the purpose of simplifying or analyzing contour or boundary lines. Another aim, important in information retrieval, is to compress or index voluminous signal data. Other applications in data analysis span from phoneme segmentation to finding sequences in biological or industrial process data.

5. VECTOR QUANTIZATION

Suggested by the intuitive aim of the basic clustering task, adequate global clustering criteria can be obtained by minimizing/maximizing a function of the within-cluster dispersion (scatter) D_W, the between-cluster dispersion D_B, and their sum, the total dispersion D_T, which is constant and independent of the clustering. For data in a Euclidean space,

D_W = Σ_{i=1}^{K} D_W(i),   D_W(i) = Σ_{j ∈ C_i} ( x(j) - c_i )( x(j) - c_i )^T

D_B = Σ_{i=1}^{K} |C_i| ( c_i - c )( c_i - c )^T

D_T = D_W + D_B = Σ_{j=1}^{N} ( x(j) - c )( x(j) - c )^T

where K is the number of clusters, c_i is the average of the data in cluster C_i, and c is the average of all data. These quantities can also be formulated for a general dissimilarity matrix.

The dispersion matrices can be used as a basis for different cost functions. Two criteria invariant to (non-singular) linear transformations of the data are based on the dispersion matrices: maximizing trace(D_W^{-1} D_B) and minimizing det(D_W). Minimizing det(D_W) gives the maximum likelihood solution for a model where all clusters are assumed to have a Gaussian distribution with the same covariance matrix.

The aforementioned criteria may be difficult to optimize. Therefore a scale-dependent criterion, minimization of trace(D_W), has become popular, presumably because it can be (suboptimally) minimized with the fast and computationally light K-means algorithm that is shortly described in more detail.
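The identity D_T = D_W + D_B can be checked numerically. In the 1-D sketch below each "matrix" reduces to a scalar; the function `dispersions`, the toy data, and the label vector are our own assumptions.

```python
def dispersions(data, labels, K):
    """Within- (D_W), between- (D_B) and total (D_T) dispersion for 1-D data,
    where each dispersion 'matrix' degenerates to a scalar."""
    N = len(data)
    c = sum(data) / N                                  # global mean
    clusters = [[x for x, l in zip(data, labels) if l == i] for i in range(K)]
    means = [sum(cl) / len(cl) for cl in clusters]     # cluster means c_i
    dw = sum(sum((x - means[i]) ** 2 for x in cl)
             for i, cl in enumerate(clusters))
    db = sum(len(cl) * (means[i] - c) ** 2 for i, cl in enumerate(clusters))
    dt = sum((x - c) ** 2 for x in data)
    return dw, db, dt

# Two well-separated 1-D clusters.
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dw, db, dt = dispersions(data, [0, 0, 0, 1, 1, 1], 2)
```

Here D_W = 4, D_B = 150 and D_T = 154, so the decomposition holds; note that D_T stays the same for any relabeling of the points, which is exactly why only D_W (or D_B) matters as a clustering criterion.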

Minimization of trace(D_W) is the same as minimizing the sum of squared errors (SSE) between each data vector x(j) and the nearest cluster centroid c_i:

SSE = Σ_{i=1}^{K} Σ_{x(j) ∈ C_i} || x(j) - c_i ||²

The above equation is encountered in vector quantization, a form of clustering that is particularly intended for compressing data. In vector quantization, the cluster centroids appearing in the above equation are called codebook vectors. The codebook vectors partition the input space into nearest neighbor regions V_i. A region V_i is associated with the nearest cluster centroid by

V_i = { x : || x - c_i || ≤ || x - c_l ||, ∀l }

(the nearest neighbor condition). Cluster C_i in the SSE equation is now the set of input data points that belong to V_i.

K-means refers to a family of algorithms that appear often in the context of vector quantization. K-means algorithms are tremendously popular in clustering and often used for exploratory purposes. As a clustering model the vector quantizer has an obvious limitation: the nearest neighbor regions are convex, which limits the shape of the clusters that can be separated.

We consider only the batch K-means algorithm; different sequential procedures are explained elsewhere. The batch K-means algorithm proceeds by applying alternately, in successive steps, the centroid and nearest neighbor conditions that are necessary for optimal vector quantization.

1. Given a codebook of vectors c_i, i = 1, 2, ..., K, associate the data vectors with the codebook vectors according to the nearest neighbor condition. Now each codebook vector has a set of data vectors C_i associated with it.
2. Update the codebook vectors to the centroids of the sets C_i according to the centroid condition. That is, for all i set c_i := (1/|C_i|) Σ_{j ∈ C_i} x_j.
3. Repeat from step 1 until the codebook vectors c_i do not change any more.

When the iteration stops, a local minimum of the quantity SSE is achieved. K-means typically converges very fast. Furthermore, when K << N, K-means is computationally far less expensive than the hierarchical agglomerative methods, since computing the KN distances between the codebook vectors and the data vectors suffices.

Well known problems with the K-means procedure are that it converges only to a local minimum and is quite sensitive to the initial conditions. A simple initialization is to start the procedure using K randomly picked vectors from the sample. A first-aid solution for trying to avoid bad local minima is to repeat K-means a couple of times from different initial conditions. More advanced solutions include using some form of stochastic relaxation, among other modifications.
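The three-step batch procedure above translates almost line-for-line into code. A minimal sketch, with our own naming (`batch_kmeans`) and toy 2-D data; a production version would add the random restarts discussed in the text.

```python
def batch_kmeans(data, codebook, max_iter=100):
    """Batch K-means: alternate the nearest neighbor and centroid conditions
    until the codebook vectors stop changing (a local SSE minimum)."""
    K, dim = len(codebook), len(data[0])
    for _ in range(max_iter):
        # Step 1: nearest neighbor condition — partition the data
        sums = [[0.0] * dim for _ in range(K)]
        counts = [0] * K
        for x in data:
            i = min(range(K),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])))
            counts[i] += 1
            sums[i] = [s + a for s, a in zip(sums[i], x)]
        # Step 2: centroid condition (empty units are left in place)
        new = [[s / c for s in row] if c else codebook[i][:]
               for i, (row, c) in enumerate(zip(sums, counts))]
        # Step 3: stop when nothing moved
        if new == codebook:
            return new
        codebook = new
    return codebook

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
cb = batch_kmeans(data, [[0.0, 0.0], [10.0, 10.0]])
```

On this toy set the codebook converges in two iterations to the two cluster centroids, and each iteration indeed computes only the K·N distances mentioned in the text.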

                                                                                                                       ⎧                ⎫
                                                                                                                         ⎪ δ (C C ) ⎪
6. CLUSTERING VALIDITY INDICES                                                                          ν AB = min ⎨ A k , l ⎬
                                                                                                               k , k ≠ l ⎪ max Δ B (Cm ) ⎪
The clustering methods in this paper do not directly                                                                   ⎩   m            ⎭

make a decision of the number of clusters but require                                                   where δA is some between-cluster dissimilarity

it as a parameter. This poses a question which                                                          measure and ΔB is some measure of within-cluster

number of clusters fits best to the "natural structure"                                                 dispersion (diameter), e.g.,

of the data. The problem is somewhat vaguely                                                                    Δ1 (Ck ) = max d ij , i, jε Ck
defined since the utility of clusters is not explicitly
stated with any cost function. An approach to solve
                                                                                                                       Δ 2 (Ck ) =
                                                                                                                                       Ck − Ck
                                                                                                                                         2         ∑d
                                                                                                                                                  i , jεC k
                                                                                                                                                              ij .

this is the "add-on" relative clustering validity
                                                                                                        There are literally dozens of relative cluster
criteria. Basically, one clusters first the data with an
                                                                                                        validity indices and as is obvious, the selection of
algorithm with cluster number K = 2,3,... , Kmax.
                                                                                                        the R-index is hardly optimal but a working
Then, the index is computed for the partitions, and
                                                                                                        solution and it is only meant to roughly guide the
(local) minima, maxima, or knee of the index plot
indicate the adequate choice(s) of K.
                                                                                                        6.1. Finding interesting linear projections
Two examples of such indices Davies-Bouldin
                                                                                                        Finding patterns in data can be assisted by
type indices are among the most popular relative
                                                                                                        searching an informative recoding of the original
clustering validity criteria:
                                                                                                        variables by a linear transformation. The linearity
\[
I_{DB} = \frac{1}{K}\sum_{i=1}^{K} R_i, \qquad
R_i = \max_{j,\; j \neq i} \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)},
\]

where Δ(C_i) is some adequate scalar measure of within-cluster dispersion and δ(C_i, C_j) of between-cluster dispersion. A simplified variant of this, the R-index (I_R), is

\[
I_R = \frac{1}{K}\sum_{k=1}^{K} \frac{S_k^{in}}{S_k^{ex}}, \quad \text{where} \quad
S_k^{in} = \frac{1}{|C_k|^2} \sum_{i,j \in C_k} d_{ij} \quad \text{and} \quad
S_k^{ex} = \min_{l,\; l \neq k} \frac{1}{|C_k||C_l|} \sum_{i \in C_k} \sum_{j \in C_l} d_{ij}.
\]

In preliminary experiments, the R-index gave reasonable suggestions for a sensible number of clusters with a given benchmarking data set.

… is at the same time the power and the weakness of these methods. On one hand, a linear model is limited; on the other hand, it is potentially both computationally more tractable and intuitively more understandable than a non-linear method.

6.2. Independent component analysis

In the basic, linear and noise-free ICA model, we have M latent variables s_i, i.e., the unknown independent components (or source signals), that are mixed linearly to form M observed signals, the variables x_i. When X is the observed data, the model becomes
X = AS,        (Eq. 2)

where A is an unknown constant matrix, called the mixing matrix, and S contains the unknown independent components: S = [s(1) s(2) … s(N)], consisting of vectors s(i), s = [s_1 s_2 … s_M]^T. The task is to estimate the mixing matrix A (and the realizations of the independent components s_i) using the observed data X alone. The independent components must have non-Gaussian distributions. However, what is often estimated in practice is the demixing matrix W for S = WX, where W is a (pseudo)inverse of A. This kind of problem setting is pronounced in blind signal separation (BSS) problems, such as the "cocktail party problem", where one has to resolve the utterances of many nearby speakers in the same room. Several algorithms for performing ICA have been proposed; the FastICA algorithm is briefly described in the next section.

6.3. FastICA

The FastICA algorithm is based on finding projections that maximize non-Gaussianity as measured by an objective function. A necessary condition for independence is uncorrelatedness, and a way of making the basic ICA problem somewhat easier is to whiten the original signals X. Thereafter, it suffices to rotate the whitened data Z suitably, i.e., to find an orthogonal demixing matrix W* that produces the estimates for the independent components, S = W*Z. When the whitening is performed, the demixing matrix for the original, centered data is W = W* Λ^{-1/2} E^T.

Here, we present the symmetrical version of the FastICA algorithm, where all independent components are estimated simultaneously:

1. Whiten the data. For simplicity, we denote here the whitened data vectors by x and the demixing matrix for the whitened data by W.
2. Initialize the demixing matrix W = [w_1^T w_2^T … w_M^T]^T, e.g., randomly.
3. Compute new basis vectors using the update rule
\[
w_j := E\{g(w_j^T x)\, x\} - E\{g'(w_j^T x)\}\, w_j,
\]
where g is a non-linearity derived from the objective function J; in the case of kurtosis it becomes g(u) = u^3, and in the case of skewness, g(u) = u^2. Use sample estimates for the expectations.
4. Orthogonalize the new W, e.g., by W := (WW^T)^{-1/2} W.
5. Repeat from step 3 until convergence.

There is also a deflatory version of the FastICA algorithm that finds the independent components one by one. It searches for a new component by using the fixed-point iteration (in step 3 of the procedure above) in the remaining subspace that is orthogonal to the previously found estimates.

Both practical and theoretical reasons make FastICA an appealing algorithm. It has very competitive computational and convergence properties. Furthermore, FastICA is not restricted to resolving either super- or sub-Gaussian sources, as is the case with many algorithms.
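The symmetric FastICA procedure above can be illustrated with a short NumPy sketch (our own minimal implementation for this description, not the authors' code; the function name, the fixed iteration count, and the eigendecomposition-based whitening are our choices, and the data is assumed centered and full-rank):

```python
import numpy as np

def fastica_symmetric(X, n_iter=200, g="cube"):
    """Minimal symmetric FastICA sketch: whiten X (rows = mixed signals,
    columns = samples), then iterate the fixed-point update with
    symmetric orthogonalization W := (W W^T)^{-1/2} W."""
    X = X - X.mean(axis=1, keepdims=True)          # center the data
    d, E = np.linalg.eigh(np.cov(X))               # eigendecomposition of covariance
    Z = E @ np.diag(d ** -0.5) @ E.T @ X           # whitened data
    M = X.shape[0]
    W = np.linalg.qr(np.random.randn(M, M))[0]     # random orthogonal initialization
    for _ in range(n_iter):
        WZ = W @ Z
        if g == "cube":                            # kurtosis-based: g(u) = u^3
            G, dG = WZ ** 3, 3 * WZ ** 2
        else:                                      # skewness-based: g(u) = u^2
            G, dG = WZ ** 2, 2 * WZ
        # Fixed-point update: w_j := E{g(w_j^T z) z} - E{g'(w_j^T z)} w_j
        W_new = G @ Z.T / Z.shape[1] - np.diag(dG.mean(axis=1)) @ W
        # Symmetric orthogonalization via SVD: (W W^T)^{-1/2} W = U V^T
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt
    return W @ Z, W                                # estimated sources, demixing matrix
```

For the original (centered but unwhitened) data, the total demixing matrix is obtained by composing the returned W with the whitening transform, in the same way as the composition W = W* Λ^{-1/2} E^T mentioned earlier.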
However, the FastICA algorithm faces the same problems related to suboptimal local minima and random initialization that appear in many other algorithms, including K-means and GIR. Consequently, a special tool, Icasso, for VDM-style assessment of the results was developed in the course of this paper.

6.4. ICA and Boolean mixture of binary signals

Next, we consider a very specific non-linear mixture of latent variables: the problem of a Boolean mixture of latent binary signals and possibly binary noise. The mixing matrix A_B, the observed data vectors x^B and the independent, latent source vectors s^B now all consist of binary elements in {0, 1}. The basic model in Eq. 2 is replaced by a Boolean expression

\[
x_i^B = \bigvee_{j=1}^{M} \left( a_{ij} \wedge s_j^B \right), \qquad i = 1, 2, \ldots, M,
\]

where ∧ is Boolean AND and ∨ Boolean OR. Instead of using Boolean operators, this could be written x^B = U(A_B s^B), using a step function U as a post-mixture non-linearity. The mixture can be further corrupted by binary noise of the exclusive-OR type.

On one hand, the basic ICA cannot solve the problem in the above equation; the methods for post-non-linear mixtures that assume an invertible non-linearity cannot be directly applied either. On the other hand, it seems possible that the basic ICA could work for data emerging from sources and basis vectors that are "sparse enough". Consequently, we experimented with how far the performance of the basic ICA can be pushed, using reasonable heuristics, without elaborating something completely new. In this paper, the experiment can be seen as a feasibility study for using ICA where the data is close to binary. Furthermore, there are similar problems in other application fields, prominently in text document analysis, where such data is encountered. Since the basic ICA model is not the optimal choice for handling such problems in general, probabilistic models and algorithms have recently been developed for this purpose.

First, the estimated linear mixing matrix Â is normalized by dividing each column by the element whose magnitude is largest in that column. Second, the elements below and equal to 0.5 are rounded to zero and those above 0.5 to one:

\[
\hat{A}_B = U(\hat{A}\Lambda - T),
\]

where the diagonal scaling matrix Λ has elements λ_i = 1 / smax(â_i), with

\[
smax(\hat{a}_i) =
\begin{cases}
\min \hat{a}_i & \text{if } |\min \hat{a}_i| > |\max \hat{a}_i|, \\
\max \hat{a}_i & \text{otherwise,}
\end{cases}
\]

where max â_i means taking the maximum and min â_i the minimum element of the column vector â_i. The matrix T contains thresholds; here we set t_ij = 0.5 for all i, j.

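The Boolean mixing model and the normalize-and-threshold heuristic just described can be sketched as follows (our own illustration with made-up numbers, not the paper's implementation):

```python
import numpy as np

def boolean_mix(A_b, S_b):
    """Boolean mixture x_i = OR_j (a_ij AND s_j) for 0/1 matrices,
    one observed sample per column of S_b."""
    return (np.asarray(A_b) @ np.asarray(S_b) > 0).astype(int)

def binarize_mixing(A_hat, t=0.5):
    """Round an ICA-estimated mixing matrix to a binary one: divide each
    column by its largest-magnitude element (smax, sign kept), then
    threshold: elements <= t become 0, elements > t become 1."""
    A_hat = np.asarray(A_hat, dtype=float)
    idx = np.argmax(np.abs(A_hat), axis=0)            # row of smax in each column
    smax = A_hat[idx, np.arange(A_hat.shape[1])]      # signed largest element
    return (A_hat / smax > t).astype(int)             # U(A_hat @ Lambda - T)

# A hypothetical noisy estimate whose true binary mixing matrix is [[1,0],[1,1]]:
A_hat = np.array([[0.92, -0.07],
                  [1.10,  0.85]])
print(binarize_mixing(A_hat))   # → [[1 0]
                                #    [1 1]]
```

Dividing by the signed largest-magnitude element maps that element to +1, so a column whose dominant weight is negative is flipped before thresholding.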
As supposed, this trick works quite well with sparse data, and skewness (E{y^3}) works better than kurtosis as a basis for the objective function over a wide range of sparsity, except for noisy data.

In a nutshell, new ways have been presented to develop data mining techniques using SOM and ICA as data visualization methods, e.g., to be used in process analysis; an exploratory method of investigating the stability of ICA estimates; enhancements and modifications of algorithms, such as the fast fixed-point algorithm for time series segmentation; and a heuristic solution to the problem of finding a binary mixing matrix and independent binary sources. Both time-series segmentation and PCA revealed meaningful contexts from the features in a visual data exploration.

REFERENCES:
1.   Alhoniemi, E. (2000). Analysis of Pulping Data Using the Self-Organizing Map. Tappi Journal.
2.   Cheung, Y.-M. (2003). k*-Means: A New Generalized k-Means Clustering Algorithm. Pattern Recognition Letters, 24(15):2883-2898.
3.   Grabmeier, J. and Rudolph, A. (2002). Techniques of Cluster Algorithms in Data Mining. Data Mining and Knowledge Discovery, 6(4):303-360.
4.   Hoffman, P.E. and Grinstein, G.G. (2002). A Survey of Visualizations for High-Dimensional Data Mining. In Fayyad et al. (2002), chapter 2, pages 47-82.
5.   Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley Interscience.
6.   Lampinen, J. and Kostiainen, T. (2002). Generative Probability Density Model in the Self-Organizing Map. In Seiffert and Jain (2002), chapter 4, pages 75-.
7.   Ultsch, A. (2003). Maps for the Visualization of High-Dimensional Data Spaces. In WSOM2003 (2003). CD-ROM.
8.   WSOM2003 (2003). Proceedings of the Workshop on Self-Organizing Maps (WSOM2003), Hibino, Kitakyushu, Japan.
9.   Grinstein, G.G. and Ward, M.O. (2002). Introduction to Data Visualization. In Fayyad et al. (2002), chapter 1, pages 21-45.
10.  Kohonen, T. (2001). Self-Organizing Maps. Springer, 3rd edition.
11.  Keim, D.A. and Kriegel, H.-P. (1996). Visualization Techniques for Mining Large Databases: A Comparison. IEEE Transactions on Knowledge and Data Engineering.
12.  Vesanto, J. (2002). Data Exploration Process Based on the Self-Organizing Map.
13.  Yin, H. (2001). Visualization Induced SOM (ViSOM). In Allinson, N., Yin, H., Allinson, L., and Slack, J., editors, Advances in Self-Organizing Maps.
