(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011

ANALYSIS OF DATA MINING VISUALIZATION TECHNIQUES USING ICA AND SOM CONCEPTS

K.S. RATHNAMALA¹, Dr. R.S.D. WAHIDA BANU²
¹ Research Scholar, Mother Teresa Women's University, Kodaikanal
² Professor & Head, Dept. of Electronics & Communication Engg., GCE.

ABSTRACT: This research paper is about data mining (DM) and visualization methods that use independent component analysis (ICA) and the self-organizing map (SOM) for gaining insight into multidimensional data. A new method is presented for the interactive visualization of cluster structures on a self-organizing map. By using a contraction model, the regular grid of the SOM visualization is smoothly changed toward a presentation that better shows the proximities in the data space. A novel visual data mining method is proposed for investigating the reliability of estimates resulting from a stochastic ICA algorithm. Two algorithms that can be used in a general context are presented. FastICA for independent binary sources is described: the model resembles the ordinary ICA model, but the summation is replaced by the Boolean operator OR and the multiplication by AND. A heuristic method for estimating the binary mixing matrix is also proposed. Furthermore, the differences in the results when using different objective functions in the FastICA estimation algorithm are discussed.

KEY WORDS: Independent component analysis, self-organizing map, vector quantization, patterns, agglomerative hierarchical methods, time series segmentation, finding patterns by proximity, clustering validity indices, feature selection and weighting, FastICA.

1. INTRODUCTION
The tasks encountered within data mining research are predictive modeling, descriptive modeling, discovering rules and patterns, exploratory data analysis, and retrieval by content. Predictive modeling includes many typical tasks of machine learning, such as classification and regression. Descriptive modeling is ultimately about modeling all of the data, e.g., estimating its probability distribution. Finding a clustering, a segmentation, or an informative linear representation are common subtasks of descriptive modeling. Particular methods for discovering rules and patterns emphasize finding interesting local characteristics and patterns instead of global models.

Descriptive data mining techniques for data description can be divided roughly into three groups:
1. Proximity-preserving projections for the (visual) investigation of the structure of the data.
2. Partitioning the data by clustering and segmentation.
3. Linear projections for finding interesting linear combinations of the original variables, using principal component analysis and independent component analysis.

A clustering is a partition of the set of all data items C = {1, 2, ..., N} into K disjoint clusters, C = ∪_{i=1}^{K} C_i.

2. SELF-ORGANIZING MAP
The basic self-organizing map is formed of K map units organized on a regular k × l low-dimensional grid, usually two-dimensional for visualization. Associated with each map unit i there are
1. a neighborhood kernel h(d_ij, σ(t)), where the distance d_ij is measured from map unit i to the others along the grid (the output space), and
2. a codebook vector c_i that quantizes the data space (the input space).

The magnitude of the neighborhood kernel decreases monotonically with the distance d_ij; a typical choice is the Gaussian kernel.

Batch algorithm
One possibility for implementing a batch SOM algorithm is to add an extra step to the batch K-means procedure:

c_i := ( Σ_{j=1}^{K} |C_j| h(d_ij, σ(t)) x̄_j ) / ( Σ_{j=1}^{K} |C_j| h(d_ij, σ(t)) ),  for all i,

where x̄_j denotes the mean of the data vectors in the Voronoi set C_j of map unit j. A relatively large neighborhood radius in the beginning gives a global ordering for the map. The kernel width σ(t) is then decreased monotonically along the iteration steps, which increases the flexibility of the map and provides a lower quantization error in the end. If the radius is run to zero, the batch SOM becomes identical to batch K-means. The batch SOM is a computational short-cut version of the basic SOM. Despite the intuitive clarity and elegance of the basic SOM, its mathematical analysis has turned out to be rather complex. This comes from the fact that there exists no cost function that the basic SOM would minimize for a continuous probability distribution.

In general, the number of map codebook vectors governs the computational complexity of one iteration step of the SOM. If the size of the SOM is scaled linearly with the number of data vectors, the load scales as O(MN²). On the other hand, the selection of K can be made proportionally to, e.g., √N, as has been suggested, and the load then decreases to O(MN^1.5). It is suggested that the SOM Toolbox applies to small and medium data sets of up to, say, 10 000-100 000 records. A specific problem is that the memory consumption of the SOM Toolbox grows quadratically with the map size K.

In practice, the SOM and its variants have been successful in a considerable number of application fields and individual applications. In the context of this paper, interesting application areas close to visual data mining (VDM) include visualization and UI techniques, especially in information retrieval and in exploratory data analysis in general; context-aware computing; and industrial applications for process monitoring and analysis.
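The batch update above can be sketched in a few lines. The following is a minimal illustration (in NumPy; all function names are ours, not from any SOM package) of the two ingredients attached to the map units, grid distances d_ij with a Gaussian neighborhood kernel, and of one batch-SOM step, i.e., a nearest-neighbor assignment followed by kernel smoothing of the Voronoi means:

```python
import numpy as np

def grid_distances(k, l):
    """Output-space distances d_ij between the K = k*l map units on a k x l grid."""
    coords = np.array([(r, c) for r in range(k) for c in range(l)], dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def gaussian_kernel(D, sigma):
    """Neighborhood kernel h(d_ij, sigma(t)); decreases monotonically with d_ij."""
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

def batch_som_step(X, C, H):
    """One batch-SOM iteration on data X (N, d) with codebook C (K, d).

    H is the (K, K) kernel matrix h(d_ij, sigma(t)). Implements
    c_i := sum_j |C_j| h_ij xbar_j / sum_j |C_j| h_ij.
    """
    # Nearest-neighbor (Voronoi) assignment, as in batch K-means.
    bmu = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    counts = np.bincount(bmu, minlength=len(C)).astype(float)   # |C_j|
    sums = np.zeros_like(C)
    np.add.at(sums, bmu, X)                                     # |C_j| * xbar_j
    return (H @ sums) / (H @ counts)[:, None]
```

Repeating `batch_som_step` while recomputing H with a shrinking sigma reproduces the annealing schedule described above; with the radius run to zero (H equal to the identity) the step is exactly one batch K-means iteration, as noted in the text.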
Visualization capabilities, data and noise reduction by topologically restricted vector quantization, and the practical robustness of the SOM are of benefit to data mining. There are also methods for additional speed-ups of the SOM for especially large data sets in data mining and in document retrieval applications.

The SOM framework is not restricted to a Euclidean space or to real vectors. A variant of the SOM in a non-Euclidean space has been presented to enhance the modeling and visualization of hierarchically distributed data; this method uses a fisheye distortion in the visualization. Also self-organizing maps and similar structures for symbolic data exist, and they have been applied also to context-aware computation.

3. AGGLOMERATIVE HIERARCHICAL METHODS
Some clustering methods construct a model of the input data space that would inherently allow classifying a new sample into one of the determined clusters; K-means partitions the input data space in this manner. Some other methods merely provide a partition of the items in the sample: the agglomerative hierarchical methods are an example of this case.

The family of partitional methods is often contrasted with the hierarchical methods. Agglomerative hierarchical methods do not aim at minimizing a global criterion for partitioning, but join data items into bigger clusters in a bottom-up manner. In the beginning, every sample is considered to form its own cluster. After this, at each of N − 1 steps the pair of clusters having the minimal pairwise dissimilarity δ is joined, which reduces the number of remaining clusters by one. The merging is repeated until all the data is in one cluster. This gives a set of nested partitions, and a tree is quite a natural way of representing the result.

Here we list the between-cluster dissimilarities δ of some of the most common agglomeration strategies, the single linkage (SL), complete linkage (CL) and average linkage (AL) criteria:

δ_1 = δ_SL = min d_ij,  i ∈ C_k, j ∈ C_l
δ_2 = δ_CL = max d_ij,  i ∈ C_k, j ∈ C_l
δ_3 = δ_AL = (1 / (|C_k| |C_l|)) Σ_{i ∈ C_k} Σ_{j ∈ C_l} d_ij

where C_k and C_l (k ≠ l) are any two distinct clusters.
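The bottom-up merging procedure with the three linkage criteria above can be sketched as follows. This is a naive illustrative implementation (function names are ours); production code would use an optimized routine instead:

```python
from itertools import combinations

def agglomerate(D, linkage="single"):
    """Naive agglomerative clustering on a dissimilarity matrix D (N x N lists).

    Returns the N-1 merges as (cluster_a, cluster_b, delta), where delta is the
    between-cluster dissimilarity: min (SL), max (CL), or the mean over all
    pairs i in C_k, j in C_l (AL).
    """
    reduce_fn = {"single": min, "complete": max,
                 "average": lambda v: sum(v) / len(v)}[linkage]
    clusters = [[i] for i in range(len(D))]      # every sample is its own cluster
    merges = []
    while len(clusters) > 1:
        best, delta = None, None
        for k, l in combinations(range(len(clusters)), 2):
            d = reduce_fn([D[i][j] for i in clusters[k] for j in clusters[l]])
            if delta is None or d < delta:       # minimal pairwise dissimilarity
                best, delta = (k, l), d
        a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), delta))
        clusters[a] = clusters[a] + clusters[b]  # join the pair
        del clusters[b]
    return merges
```

The list of merges is exactly the nested set of partitions described above; plotting the merge dissimilarities as heights gives the usual dendrogram.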
SL and CL are invariant under monotone transformations of the dissimilarity. SL is reported to be noise-sensitive but capable of producing elongated or chained clusters, while CL and AL tend to produce more spherical clusters. If similarities are used instead, the merging occurs at the maximum pairwise cluster similarity.

4. TIME SERIES SEGMENTATION
In addition to the basic cluster analysis tasks, other clustering methods that include auxiliary constraints are also discussed here. In time series segmentation the data items have some natural order, e.g., time, which must be taken into account: a segment always consists of a sequence of subsequent samples of the time series.

A K-segmentation divides X into K segments C_i with K − 1 segment borders b_1, ..., b_{K−1}, so that

C_1 = [x(1), x(2), ..., x(b_1)], ..., C_K = [x(b_{K−1}+1), x(b_{K−1}+2), ..., x(N)],

with an SSE cost of the form

J = Σ_{i=1}^{K} Σ_{x(j) ∈ C_i} ||x(j) − c_i||²   (Eq. 1)

where c_i is the mean vector of the data vectors in segment C_i. This is the basic time series segmentation task, where each segment is considered to emerge from a different model; furthermore, we consider the case where the data to be segmented is readily available.

As in the basic clustering task, we wish to minimize some adequate cost function by the selection of the segment borders. We stay with costs which are sums of individual segment costs that are not affected by changes in the other segments; the SSE cost of Eq. 1 is an example of such a function. There is, of course, a fundamental difference between time series segmentation with an SSE cost and vector quantization. In vector quantization, the borders of the nearest-neighbor regions V_i are defined by the codebook vectors, whereas in segmentation the mean vectors c_i are determined by the segments C_i but cannot directly be used to infer the segment borders.

Minimizing the cost of Eq. 1 for segmentation aims at describing each segment by its mean value. It may also be seen as splitting the sequence so that the (biased) sample variance, computed by pooling the sample variances of the segments together, is minimal.

Algorithms
The basic segmentation problem can be solved optimally using dynamic programming.
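The dynamic-programming solution under the summed SSE cost of Eq. 1 can be sketched as follows (an illustrative implementation, names ours; for clarity the per-segment costs are precomputed in a table rather than evaluated in linear time):

```python
import numpy as np

def seg_cost(x, a, b):
    """SSE of the segment x[a:b] around its mean (one term of Eq. 1)."""
    if b - a < 1:
        return 0.0
    seg = x[a:b]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def optimal_segmentation(x, K):
    """Optimal K-segmentation of x under the summed SSE cost, by dynamic
    programming; the same table also contains the optimal 1..K-1 segmentations."""
    N = len(x)
    cost = [[seg_cost(x, a, b) for b in range(N + 1)] for a in range(N + 1)]
    E = np.full((K + 1, N + 1), np.inf)       # E[k, n]: best k-segment cost of x[:n]
    E[0, 0] = 0.0
    back = np.zeros((K + 1, N + 1), dtype=int)
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for a in range(k - 1, n):         # a = end of the previous segment
                c = E[k - 1, a] + cost[a][n]
                if c < E[k, n]:
                    E[k, n], back[k, n] = c, a
    borders, n = [], N
    for k in range(K, 0, -1):                 # backtrack the optimal borders
        borders.append(n)
        n = back[k, n]
    return sorted(borders)[:-1], float(E[K, N])   # K-1 interior borders, cost
```

The triple loop makes the search itself O(KN²), matching the complexity quoted for dynamic programming in the text when segment costs are available cheaply.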
The dynamic programming algorithm also finds the optimal segmentations into 1, 2, ..., K − 1 segments while searching for the optimal K-segmentation. The computational complexity of dynamic programming is of order O(KN²) if the cost of a segment can be calculated in linear time. This may be too much when there are large amounts of data.

Another class are the merge-split algorithms, of which the local and global iterative replacement algorithms (LIR and GIR) resemble batch K-means in the sense that at each step they change the descriptors of the partition (segment borders vs. codebook vectors) to match a necessary condition of a local optimum. The LIR gets stuck in bad local minima more easily; the GIR was considerably better in this sense, yet still sensitive to the initialization. The GIR and LIR algorithms can be seen as variants of the "Pavlidis algorithm" that changes the borders gradually toward a local optimum.

The test procedures use random initialization for the segments. As in the case of K-means, the initialization matters, and it might be advisable to try an educated guess for the initial positions. One possibility for creating a more effective segmentation algorithm is to combine several greedy methods; for example, the basic bottom-up and top-down methods can be fine-tuned by merge-split methods.

Applications
Time series segmentation and other similar segmentation problems arise in different applications, e.g., in approximating functions by piecewise linear functions. This might be done for the purpose of simplifying or analyzing contour or boundary lines. Another aim, important in information retrieval, is to compress or index voluminous signal data. Other applications in data analysis range from phoneme segmentation to finding sequences in biological or industrial process data.

5. VECTOR QUANTIZATION
Suggested by the intuitive aim of the basic clustering task, adequate global clustering criteria can be obtained by minimizing or maximizing a function of the within-cluster dispersion (scatter) D_W, the between-cluster dispersion D_B, and their sum, the total dispersion D_T, which is constant and independent of the clustering. For data in a Euclidean space,

D_W = Σ_{i=1}^{K} D_W(i),  where D_W(i) = Σ_{j ∈ C_i} (x(j) − c_i)(x(j) − c_i)^T
D_B = Σ_{i=1}^{K} |C_i| (c_i − c)(c_i − c)^T
D_T = D_W + D_B = Σ_{j=1}^{N} (x(j) − c)(x(j) − c)^T

where K is the number of clusters, c_i is the average of the data in cluster C_i, and c is the average of all the data. These quantities can also be formulated for a general dissimilarity matrix.

The dispersion matrices can be used as a basis for different cost functions. Two criteria invariant to (non-singular) linear transformations of the data are based on the dispersion matrices: maximizing trace(D_W⁻¹ D_B), and minimizing det(D_W); the latter gives the maximum likelihood solution for a model where all clusters are assumed to have a Gaussian distribution with the same covariance matrix.

The aforementioned criteria may be difficult to optimize. Therefore a scale-dependent criterion, the minimization of trace(D_W), has become popular, presumably because it can be (suboptimally) minimized with the fast and computationally light K-means algorithm that is shortly described in more detail.
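The dispersion matrices and the identity D_T = D_W + D_B can be checked numerically as follows (a minimal sketch, names ours; the linear-transformation-invariant criterion is taken here in its classical trace(D_W⁻¹ D_B) form, which is our reading of the criterion above):

```python
import numpy as np

def dispersion_matrices(X, labels):
    """Within- (D_W), between- (D_B) and total (D_T) dispersion matrices."""
    c = X.mean(axis=0)                       # average of all data
    d = X.shape[1]
    DW, DB = np.zeros((d, d)), np.zeros((d, d))
    for i in np.unique(labels):
        Xi = X[labels == i]
        ci = Xi.mean(axis=0)                 # average of the data in cluster C_i
        DW += (Xi - ci).T @ (Xi - ci)
        DB += len(Xi) * np.outer(ci - c, ci - c)
    DT = (X - c).T @ (X - c)
    return DW, DB, DT
```

Note that trace(DW) is exactly the SSE of the partition, and the invariant criterion is obtained as `np.trace(np.linalg.inv(DW) @ DB)`.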
The minimization of trace(D_W) is the same as minimizing the sum of squared errors (SSE) between each data vector x(j) and the nearest cluster centroid c_i:

SSE = Σ_{i=1}^{K} Σ_{x(j) ∈ C_i} ||x(j) − c_i||²

This quantity is encountered in vector quantization, a form of clustering that is particularly intended for compressing data. In vector quantization, the cluster centroids appearing above are called codebook vectors. The codebook vectors partition the input space into nearest-neighbor regions V_i, where the region associated with codebook vector c_i is

V_i = { x : ||x − c_i|| ≤ ||x − c_l||, ∀l }

(the nearest neighbor condition). Cluster C_i is now the set of input data points that belong to V_i.

K-means
K-means refers to a family of algorithms that appear often in the context of vector quantization. K-means algorithms are tremendously popular in clustering and are often used for exploratory purposes. As a clustering model the vector quantizer has an obvious limitation: the nearest-neighbor regions are convex, which limits the shape of the clusters that can be separated.

We consider only the batch K-means algorithm; different sequential procedures are explained elsewhere. The batch K-means algorithm proceeds by applying alternately, in successive steps, the centroid and nearest neighbor conditions that are necessary for optimal vector quantization:

1. Given a codebook of vectors c_i, i = 1, 2, ..., K, associate the data vectors with the codebook vectors according to the nearest neighbor condition. Now each codebook vector has a set C_i of data vectors associated with it.
2. Update the codebook vectors to the centroids of the sets C_i according to the centroid condition; that is, for all i set c_i := (1/|C_i|) Σ_{j ∈ C_i} x_j.
3. Repeat from step 1 until the codebook vectors c_i no longer change.

When the iteration stops, a local minimum of the SSE has been reached. K-means typically converges very fast. Furthermore, when K << N, K-means is computationally far less expensive than the hierarchical agglomerative methods, since computing the KN distances between the codebook vectors and the data vectors suffices.

Well-known problems with the K-means procedure are that it converges only to a local minimum and that it is quite sensitive to the initial conditions. A simple initialization is to start the procedure using K randomly picked vectors from the sample. A first-aid solution for trying to avoid bad local minima is to repeat K-means a couple of times from different initial conditions. More advanced solutions include using some form of stochastic relaxation, among other modifications.

6. CLUSTERING VALIDITY INDICES
The clustering methods in this paper do not directly make a decision on the number of clusters but require it as a parameter. This poses the question of which number of clusters best fits the "natural structure" of the data. The problem is somewhat vaguely defined, since the utility of the clusters is not explicitly stated in terms of any cost function. An approach to solving this is given by the "add-on" relative clustering validity criteria: one first clusters the data with an algorithm using cluster numbers K = 2, 3, ..., K_max. Then an index is computed for the partitions, and the (local) minima, maxima, or a knee of the index plot indicates the adequate choice(s) of K.

One family of such indices is of the Dunn type,

ν_AB = min_{k ≠ l} { δ_A(C_k, C_l) / max_m Δ_B(C_m) }

where δ_A is some between-cluster dissimilarity measure and Δ_B is some measure of within-cluster dispersion (diameter), e.g.,

Δ_1(C_k) = max d_ij,  i, j ∈ C_k
Δ_2(C_k) = (1 / (|C_k|² − |C_k|)) Σ_{i,j ∈ C_k} d_ij
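A Dunn-type index of the form above can be computed as follows (a minimal sketch, names ours; here δ_A is taken as the single-linkage separation and Δ_B as the diameter Δ_1):

```python
import numpy as np

def dunn_type_index(D, labels):
    """nu_AB = min_{k != l} delta_A(C_k, C_l) / max_m Delta_B(C_m).

    D is a symmetric dissimilarity matrix; delta_A = single-linkage separation,
    Delta_B = Delta_1, the cluster diameter (max within-cluster dissimilarity).
    """
    ids = np.unique(labels)
    # max_m Delta_1(C_m): largest within-cluster dissimilarity
    diam = max(D[np.ix_(labels == k, labels == k)].max() for k in ids)
    # min over distinct cluster pairs of the smallest between-cluster dissimilarity
    seps = [D[np.ix_(labels == k, labels == l)].min()
            for i, k in enumerate(ids) for l in ids[i + 1:]]
    return min(seps) / diam
```

Large values indicate compact, well-separated clusters, so in the add-on scheme above one would compute the index for K = 2, ..., K_max and look for a maximum.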
Davies-Bouldin type indices are among the most popular relative clustering validity criteria:

I_DB = (1/K) Σ_{i=1}^{K} R_i,  R_i = max_{j ≠ i} ( Δ(C_i) + Δ(C_j) ) / δ(C_i, C_j)

where Δ(C_i) is some adequate scalar measure of the within-cluster dispersion and δ(C_i, C_j) of the between-cluster dispersion. A simplified variant of this, the R-index (I_R), is

I_R = (1/K) Σ_{k=1}^{K} S_k^in / S_k^ex

where

S_k^in = (1/|C_k|²) Σ_{i,j ∈ C_k} d_ij,  and  S_k^ex = min_{l ≠ k} (1 / (|C_k| |C_l|)) Σ_{i ∈ C_k} Σ_{j ∈ C_l} d_ij.

In preliminary experiments, the R-index gave reasonable suggestions for a sensible number of clusters on a given benchmarking data set. There are literally dozens of relative cluster validity indices and, as is obvious, the selection of the R-index is hardly optimal, but it is a working solution that is only meant to roughly guide the exploration.

6.1. Finding interesting linear projections
Finding patterns in data can be assisted by searching for an informative recoding of the original variables by a linear transformation. The linearity is at the same time the power and the weakness of these methods: on one hand, a linear model is limited, but on the other hand, it is potentially both computationally more tractable and intuitively more understandable than a non-linear method.

6.2. Independent component analysis
In the basic, linear and noise-free ICA model, we have M latent variables s_i, i.e., the unknown independent components (or source signals), which are mixed linearly to form the M observed signals, the variables x_i. When X is the observed data, the model becomes

X = AS   (Eq. 2)

where A is an unknown constant matrix, called the mixing matrix, and S contains the unknown independent components, S = [s(1) s(2) ... s(N)], consisting of vectors s(i), s = [s_1 s_2 ... s_M]^T. The task is to estimate the mixing matrix A (and the realizations of the independent components s_i) using the observed data X alone. The independent components must have non-Gaussian distributions. However, what is often estimated in practice is the demixing matrix W for S = WX, where W is a (pseudo)inverse of A. This kind of problem setting is pronounced in blind signal separation (BSS) problems, such as the "cocktail party problem", where one has to resolve the utterances of many nearby speakers in the same room. Several algorithms for performing ICA have been proposed; the FastICA algorithm is briefly described in the next section.

6.3. FastICA
The FastICA algorithm is based on finding projections that maximize non-Gaussianity, as measured by an objective function. A necessary condition for independence is uncorrelatedness, and a way of making the basic ICA problem somewhat easier is to whiten the original signals X. Thereafter it suffices to rotate the whitened data Z suitably, i.e., to find an orthogonal demixing matrix W* that produces the estimates of the independent components, S = W*Z. When the whitening is performed, the demixing matrix for the original, centered data is W = W* Λ^(−1/2) E^T.

Here we present the symmetric version of the FastICA algorithm, where all independent components are estimated simultaneously:

1. Whiten the data. For simplicity, we denote below the whitened data vectors by x and the demixing matrix for the whitened data by W.
2. Initialize the demixing matrix W = [w_1 w_2 ... w_M]^T, e.g., randomly.
3. Compute new basis vectors using the update rule
   w_j := E{ x g(w_j^T x) } − E{ g'(w_j^T x) } w_j,
   where g is a non-linearity derived from the objective function J; in the case of kurtosis it becomes g(u) = u³, and in the case of skewness g(u) = u². Use sample estimates for the expectations.
4. Orthogonalize the new W, e.g., by W := W (W^T W)^(−1/2).
5. Repeat from step 3 until convergence.
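The whitening step and the symmetric FastICA iteration of steps 1-5 can be sketched as follows (NumPy, with the kurtosis non-linearity g(u) = u³; function names are ours). The symmetric orthogonalization is implemented through the eigendecomposition of WWᵀ, which is equivalent to the W (WᵀW)^(−1/2) form given above:

```python
import numpy as np

def whiten(X):
    """Center and whiten: Z = Lambda^(-1/2) E^T (X - mean), so cov(Z) = I."""
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, E = np.linalg.eigh(np.cov(Xc))
    V = E @ np.diag(lam ** -0.5) @ E.T          # whitening matrix
    return V @ Xc, V

def sym_orth(W):
    """Symmetric orthogonalization, W := (W W^T)^(-1/2) W."""
    lam, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(lam ** -0.5) @ E.T @ W

def fastica_symmetric(Z, n_iter=200, seed=0):
    """Symmetric FastICA on whitened data Z (M x N), g(u) = u^3 (kurtosis)."""
    M, N = Z.shape
    W = sym_orth(np.random.default_rng(seed).standard_normal((M, M)))
    for _ in range(n_iter):
        Y = W @ Z
        # w_j := E{x g(w_j^T x)} - E{g'(w_j^T x)} w_j, with sample estimates
        W_new = (Y ** 3) @ Z.T / N - (3 * (Y ** 2).mean(axis=1))[:, None] * W
        W_new = sym_orth(W_new)
        if np.allclose(np.abs(W_new @ W.T), np.eye(M), atol=1e-8):
            return W_new                         # converged (up to signs)
        W = W_new
    return W
```

The estimated components are then W @ Z, determined only up to permutation, sign and scale, as is usual in ICA.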
There is also a deflatory version of the FastICA algorithm that finds the independent components one by one. It searches for a new component by using the fixed point iteration (step 3 of the procedure above) in the remaining subspace that is orthogonal to the previously found estimates.

Both practical and theoretical reasons make FastICA an appealing algorithm. It has very competitive computational and convergence properties. Furthermore, FastICA is not restricted to resolving either super- or sub-Gaussian sources only, as is the case with many algorithms. However, the FastICA algorithm faces the same problems related to suboptimal local minima and random initialization that appear in many other algorithms, including K-means and GIR. Consequently, a special tool, Icasso, for VDM-style assessment of the results was developed in the course of this work.

6.4. ICA and Boolean mixtures of binary signals
Next we consider a very specific non-linear mixture of latent variables: the problem of a Boolean mixture of latent binary signals and possibly binary noise.
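The deflatory variant described above can be sketched by running the same fixed-point update one component at a time, projecting each new vector onto the subspace orthogonal to the estimates already found (a Gram-Schmidt style deflation; names ours, g(u) = u³ as before):

```python
import numpy as np

def fastica_deflation(Z, n_components, n_iter=200, seed=0):
    """Deflatory FastICA on whitened data Z (M x N): components one by one."""
    rng = np.random.default_rng(seed)
    M, N = Z.shape
    W = []
    for _ in range(n_components):
        w = rng.standard_normal(M)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            # Fixed-point update with g(u) = u^3, g'(u) = 3 u^2
            w_new = (Z @ y ** 3) / N - 3 * (y ** 2).mean() * w
            for v in W:                      # deflate: stay orthogonal to found w's
                w_new -= (w_new @ v) * v
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1.0) < 1e-9:
                w = w_new
                break                        # converged (up to sign)
            w = w_new
        W.append(w)
    return np.array(W)
```

Because the true demixing matrix for whitened data is orthogonal, constraining each new vector to the orthogonal complement of the earlier ones steers the iteration toward a new component rather than a duplicate.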
The mixing matrix A_B, the observed data vectors x_B, and the independent latent source vectors s_B now all consist of binary values in {0, 1}. The basic model of Eq. 2 is replaced by the Boolean expression

x_i^B = ∨_{j=1}^{M} ( a_ij^B ∧ s_j^B ),  i = 1, 2, ..., M,

where ∧ is the Boolean AND and ∨ the Boolean OR. Instead of using Boolean operators, this could be written as x^B = U(A^B s^B), using a step function U as a post-mixture non-linearity. The mixture can be further corrupted by binary noise of the exclusive-OR type.

On one hand, the basic ICA cannot solve the problem in the above equation, and the methods for post-non-linear mixtures that assume an invertible non-linearity cannot be directly applied either. On the other hand, it seems possible that the basic ICA could work for data emerging from sources and basis vectors that are "sparse enough". Consequently, we experimented with how far the performance of the basic ICA can be pushed using reasonable heuristics, without elaborating something completely new. The experiment can be seen as a feasibility study for using ICA where the data is close to binary. Furthermore, there are similar problems in other application fields, prominently in text document analysis, where such data is encountered. Since the basic ICA model is not the optimal choice for handling such problems in general, probabilistic models and algorithms have recently been developed for this purpose.

The binary mixing matrix is estimated as follows. First, the estimated linear mixing matrix Â is normalized by dividing each column by the element whose magnitude is largest in that column. Second, the elements less than or equal to 0.5 are rounded to zero and those above 0.5 to one:

Â_B = U( Â Λ − T )

where the diagonal scaling matrix Λ has the elements λ_i = 1 / smax(â_i), with

smax(â_i) = min â_i, if |min â_i| > |max â_i|;  max â_i otherwise,

where max â_i and min â_i denote the maximum and minimum elements of the column vector â_i. The matrix T contains the thresholds; here we set t_ij = 0.5 for all i, j. As supposed, this trick works quite well with sparse data, and skewness, E(y³), works better than kurtosis as a basis for the objective function over a wide range of sparsity levels, except for noisy data.
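The OR-AND mixture and the column-normalization and thresholding heuristic can be illustrated as follows (names ours; `binarize_mixing` follows our reading of the scaling rule smax above):

```python
import numpy as np

def boolean_mix(A, S):
    """x_i = OR_j (a_ij AND s_j): Boolean OR-AND mixture of binary sources.

    A: (M, M) binary mixing matrix, S: (M, N) binary sources; both 0/1 ints.
    """
    # (A[:, :, None] & S[None, :, :]) is ANDed per source j; any() is the OR.
    return (A[:, :, None] & S[None, :, :]).any(axis=1).astype(int)

def binarize_mixing(A_hat):
    """Round an ICA mixing-matrix estimate to a binary matrix.

    Each column is scaled by its largest-magnitude element (the smax rule),
    then thresholded at t = 0.5, i.e., U(A Lambda - T) with t_ij = 0.5.
    """
    B = np.empty_like(A_hat)
    for j in range(A_hat.shape[1]):
        col = A_hat[:, j]
        s = col.min() if abs(col.min()) > abs(col.max()) else col.max()
        B[:, j] = col / s
    return (B > 0.5).astype(int)
```

A noisy continuous estimate such as [[0.9, -0.1], [1.1, 0.8]] is snapped back to the binary matrix [[1, 0], [1, 1]] by this heuristic.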
CONCLUSION
In a nutshell, new ways have been presented to develop data mining techniques using the SOM and ICA as data visualization methods, e.g., to be used in process analysis: an exploratory method for investigating the stability of ICA estimates, enhancements and modifications of algorithms such as the fast fixed-point algorithm and time series segmentation, and a heuristic solution to the problem of finding a binary mixing matrix and independent binary sources. Both time series segmentation and PCA revealed meaningful contexts from the features in visual data exploration.

REFERENCES
1. Alhoniemi, E. (2000). Analysis of Pulping Data Using the Self-Organizing Map. Tappi Journal, 83(7):66.
2. Cheung, Y.-M. (2003). k*-Means: A New Generalized k-Means Clustering Algorithm. Pattern Recognition Letters, 24(15):2883-2898.
3. Grabmeier, J. and Rudolph, A. (2002). Techniques of Cluster Algorithms in Data Mining. Data Mining and Knowledge Discovery, 6(4):303-360.
4. Hoffman, P.E. and Grinstein, G.G. (2002). A Survey of Visualizations for High-Dimensional Data Mining. In Fayyad et al. (2002), chapter 2, pages 47-82.
5. Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley Interscience.
6. Lampinen, J. and Kostiainen, T. (2002). Generative Probability Density Model in the Self-Organizing Map. In Seiffert and Jain (2002), chapter 4, pages 75-92.
7. Ultsch, A. (2003). Maps for the Visualization of High-Dimensional Data Spaces. In WSOM2003 (2003). CD-ROM.
8. WSOM2003 (2003). Proceedings of the Workshop on Self-Organizing Maps (WSOM2003), Hibino, Kitakyushu, Japan.
9. Grinstein, G.G. and Ward, M.O. (2002). Introduction to Data Visualization. In Fayyad et al. (2002), chapter 1, pages 21-45.
10. Kohonen, T. (2001). Self-Organizing Maps. Springer, 3rd edition.
11. Keim, D.A. and Kriegel, H.-P. (1996). Visualization Techniques for Mining Large Databases: A Comparison. IEEE Transactions on Knowledge and Data Engineering.
12. Vesanto, J. (2002). Data Exploration Process Based on the Self-Organizing Map.
13. WSOM2003 (2003). Proceedings of the Workshop on Self-Organizing Maps (WSOM2003), Hibino, Kitakyushu, Japan.
14. Yin, H.
(2001). Visualization Induced SOM (ViSOM). In Allinson, N., Yin, H., Allinson, L., and Slack, J., editors, Advances in Self-Organizing Maps.