Density Based Clustering Algorithm using Sparse Memory Mapped File

					                                  (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                            Vol. 8, No. 5, 2010

Density Based Clustering Algorithm using Sparse Memory Mapped File

J. Hencil Peter
Department of Computer Science
St. Xavier’s College, Palayamkottai, India.
hencilpeter@hotmail.com

A. Antonysamy
Department of Mathematics
St. Xavier’s College, Kathmandu, Nepal.
fr_antonysamy@hotmail.com


Abstract:

The DBSCAN [1] algorithm is popular in the data mining field because it can mine arbitrarily shaped clusters while separating out noise. As the original DBSCAN algorithm uses distance measures to compute the distance between objects, it consumes a great deal of processing time and its computational complexity is O(N^2). In this paper we propose a new algorithm for mining density based clusters using a Sparse Memory Mapped File (Sparse MMF) [3]. All the given objects are initially loaded into their corresponding Sparse Memory Mapped File locations, and during the SparseMemoryRegionQuery operation each object's surrounding cells are visited to find the neighbour objects, instead of computing the distance between each pair of objects in the data set. Using the Sparse MMF approach, it is shown that the DBSCAN algorithm can process a huge number of objects without runtime issues, and the performance analysis shows that the proposed solution is much faster than the existing algorithm.

Keywords: Sparse Memory Mapped File; Sparse MMF; Sparse Memory; Neighbour Cells; Sparse Memory DBSCAN.

I. INTRODUCTION

Data mining is a fast growing field in which clustering plays a very important role. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [2]. Among the many algorithms proposed in the clustering field, DBSCAN is one of the most popular because of the high quality of its noise-free output clusters.

Most density based clustering algorithms require O(N^2) computation time and a huge amount of main memory in practice. Since the seed object list grows during run time, it is very difficult to predict the memory required to process all the objects in the data set. If the memory is insufficient to hold the growing seed objects, the DBSCAN algorithm will crash at run time. To remove this instability problem and improve the performance, a new solution is proposed in this paper.

The rest of the paper is organised as follows. Section 2 gives a brief history of related work in the same area. Section 3 introduces the original DBSCAN algorithm and section 4 explains the proposed solution. After the explanation of the new algorithm, section 5 shows the experimental results and the final section 6 presents the conclusion and future work associated with this algorithm.

II. RELATED WORK

DBSCAN (Density Based Spatial Clustering of Applications with Noise) [1] is the basic clustering algorithm for mining clusters based on object density. In this algorithm, the number of objects present within the neighbour region (Eps) is computed first. If the neighbour object count is below the given threshold value, the object is marked as NOISE. Otherwise a new cluster is formed from the core object by finding the group of density connected objects that is maximal w.r.t. density-reachability.

The OPTICS [4] algorithm adapts the original DBSCAN algorithm to deal with clusters of varying density. This algorithm computes an ordering of the objects based on the reachability distance in order to represent the intrinsic hierarchical clustering structure. The valleys in the plot indicate the clusters. However, the input parameter ξ is critical for identifying the valleys as ξ clusters.

The DENCLUE [5] algorithm uses kernel density estimation. The density function yields the local density maxima, and these local density values are used to form the clusters. If the local density value is very small, the objects of the cluster are discarded as NOISE.

A Fast DBSCAN (FDBSCAN) algorithm [6] has been devised to improve the speed of the original DBSCAN algorithm. The performance improvement is achieved by considering only a few selected representative objects inside a core object's neighbour region as seed objects for further expansion. Hence this algorithm is faster than the basic version of DBSCAN but suffers from a loss of result accuracy.

The MEDBSCAN [7] algorithm has been proposed recently to improve the performance of the DBSCAN algorithm without losing result accuracy. This algorithm uses three queues in total: the first queue stores the neighbours of the core object which belong




                                                                     122                                 http://sites.google.com/site/ijcsis/
                                                                                                         ISSN 1947-5500
inside Eps distance, the second queue stores the neighbours of the core object which belong inside 2*Eps distance, and the third queue is the seeds queue, which stores the unhandled objects for further expansion. This algorithm guarantees a notable performance improvement if the Eps value is not very sensitive.

Though the DBSCAN algorithm's complexity can be reduced to O(N * logN) using spatial trees, constructing and organizing the tree takes extra effort, and the tree requires additional memory to hold the objects. The new algorithm achieves a different complexity, O(N * 2Eps), which is shown to be better than that of the previous versions of the DBSCAN algorithm when the Eps value is small.

III. INTRODUCTION TO DBSCAN ALGORITHM

The working principles of the DBSCAN algorithm are based on the following definitions:

Definition 1: Eps Neighbourhood of an object p

The Eps Neighbourhood of an object p is denoted N_Eps(p) and defined as
N_Eps(p) = {q ∈ D | dist(p, q) <= Eps}.

Definition 2: Core Object Condition

An object p is referred to as a core object if its neighbour object count is >= the given threshold value (MinObjs), i.e.

|N_Eps(p)| >= MinObjs

where MinObjs is the minimum number of neighbour objects needed to satisfy the core object condition. In other words, if the number of neighbours of p within the Eps radius is >= MinObjs, p is a core object.

Definition 3: Directly Density Reachable Object

An object p is directly density reachable from another object q w.r.t. Eps and MinObjs if

p ∈ N_Eps(q) and
|N_Eps(q)| >= MinObjs (Core Object condition)

Definition 4: Density Reachable Object

An object p is density reachable from another object q w.r.t. Eps and MinObjs if there is a chain of objects p_1, ..., p_n with p_1 = q and p_n = p such that p_(i+1) is directly density reachable from p_i.

Definition 5: Density Connected Object

An object p is density connected to another object q if there is an object o such that both p and q are density reachable from o w.r.t. Eps and MinObjs.

Definition 6: Cluster

A cluster C is a non-empty subset of a database D, w.r.t. Eps and MinObjs, which satisfies the following conditions.

For every p and q, if p ∈ C and q is density reachable from p w.r.t. Eps and MinObjs, then q ∈ C.
For every p, q ∈ C, p is density connected to q w.r.t. Eps and MinObjs.

Definition 7: Noise

An object which doesn't belong to any cluster is called noise.

The DBSCAN algorithm finds the Eps Neighbourhood of each object in a database during the clustering process. Before the cluster expansion, any non core object that the algorithm finds is marked as NOISE. With a core object, the algorithm initiates a cluster, and the surrounding objects are added to the queue for further expansion. Each object in the queue is popped, and the Eps neighbour objects of the popped object are found. When the popped object is a core object, all its neighbour objects are assigned the current cluster id and its unprocessed neighbour objects are pushed into the queue for further processing. This process is repeated until there is no object left in the queue.

IV. PROPOSED SOLUTION

A new algorithm is proposed in this paper to improve the performance as well as to process huge amounts of data. This algorithm relies entirely on the Sparse MMF, and the Sparse MMF concept is explained briefly below:

A. Sparse Memory Mapped File (Sparse MMF)

The Sparse MMF [3] is a mechanism derived from the Memory Mapped File. The Memory Mapped File [3] is like virtual memory: it allows reserving a region of address space and committing physical storage to the region. The difference is that the physical storage comes from a file that is already on the disk instead of the system's paging file. The memory mapped file can be used to access a data file on disk (even a very large one), to load and execute executable files and libraries, and to allow multiple processes running on the same machine to share data with each other. The Sparse MMF is similar to the Memory Mapped File, but it occupies only the required storage space in the physical file. If we use a Memory Mapped File to reserve the region of memory, then on committing the changes to the file on disk, the file size will be equal to the size of the created Memory Mapped File. If we instead replace it with a Sparse MMF, the final file's size will be equivalent to the non-zero data which is




stored in the Sparse MMF. So the Sparse MMF gives a better storage result, and hence it has been used in our research.

B. Object’s Structure

As the core of this algorithm is the Sparse MMF, the objects to be processed are organized a bit differently: each object's structure has three additional fields, NextObjectOffset, NextSeedObjectOffset and NextTempObjectOffset.

Figure 1. Sparse Memory Mapped File Object’s Structure

While the objects are loaded into the Sparse MMF, they are chained in sequence, similar to (but not exactly) a linked list. The first object's NextObjectOffset field holds the offset of the next object, the second object holds the offset of its immediate successor, and so on; the final object's NextObjectOffset is set to NULL to indicate that there are no more objects to visit during the clustering process. So the first object's address must always be retained in order to visit all the objects loaded in the Sparse MMF. The other two fields, NextSeedObjectOffset and NextTempObjectOffset, are used by the SparseMemoryRegionQuery function, which is explained in the next section.

C. SparseMemoryRegionQuery function

The proposed algorithm doesn't use any extra buffer or queue to store the seed objects or neighbour objects at run time; instead, each object has a corresponding offset field in which the exact offset of the next seed object is stored. The original DBSCAN algorithm uses the RegionQuery function to retrieve the neighbour objects; the new algorithm introduces the SparseMemoryRegionQuery function in its place. This function visits all the required surrounding cells in memory, chains the objects in the non-empty cells and returns them as seed objects. That is, the function starts from the center cell and visits the neighbour cells one by one. When the first non-empty cell is found, the center object's NextSeedObjectOffset field is assigned the offset of the new object (Address(NewObject) - Address(CenterObject)); when the next object is found, its offset is stored in the previous object's NextSeedObjectOffset field, and so on. Finally, the last object's NextSeedObjectOffset field is assigned NULL. Thus the extra memory and the buffer/queue needed to store the seed objects are eliminated in this solution. The function can update the neighbour objects' offsets in either the NextSeedObjectOffset or the NextTempObjectOffset field. If the function receives the update flag UpdateMasterSeedOffset, the neighbour objects' offsets are stored in the NextSeedObjectOffset field; if the update flag is UpdateTempSeedOffset, the NextTempObjectOffset field is updated with the neighbour objects' offsets.

The DBSCAN algorithm's computational complexity depends on the RegionQuery function, which uses a distance function to compute the neighbours present within a certain radius (Eps). In the new approach, the distance computation during the SparseMemoryRegionQuery call is removed; the function instead visits the required number of neighbour cells around the center cell.

Figure 2. Neighbour Cells Diagram

In this proposed solution, we have selected a two dimensional dataset for the experiment, and the above diagram shows the neighbour cells at different distances. The center cell is painted red, and the distance of the object stored in it is zero; the immediate neighbours, whose distance from the center cell is 1, are painted blue; the yellow cells have a distance greater than 1 and <= 2; and so on. These neighbour cell offsets are pre-computed and stored in an M x 2 array, which is passed to the SparseMemoryRegionQuery function so that it visits only the required number of neighbour cells. Thus the distance computation between objects is not required.
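The pre-computation of neighbour cell offsets can be illustrated with a short sketch. It is not the authors' implementation: the function name `neighbour_cell_offsets`, the use of Euclidean distance measured in whole cells, and the integer Eps values are assumptions made for the illustration. Distances are evaluated once, while the table is built, and never during a region query.

```python
def neighbour_cell_offsets(max_eps):
    """Return (offsets, last_index) for the integer radii 0..max_eps.

    offsets    : list of (dx, dy) cell offsets, nearest cells first
    last_index : last_index[r] is the index of the last entry of `offsets`
                 whose distance from the centre cell is <= r
    Both names are hypothetical; they only mirror the idea in the text.
    """
    cells = [
        (dx, dy)
        for dx in range(-max_eps, max_eps + 1)
        for dy in range(-max_eps, max_eps + 1)
        if dx * dx + dy * dy <= max_eps * max_eps
    ]
    # Nearest cells first, so every radius corresponds to a prefix of the list.
    cells.sort(key=lambda c: (c[0] * c[0] + c[1] * c[1], c))
    last_index = [
        sum(1 for dx, dy in cells if dx * dx + dy * dy <= r * r) - 1
        for r in range(max_eps + 1)
    ]
    return cells, last_index


offsets, last_index = neighbour_cell_offsets(2)
# For a radius of 1 only a prefix of the table has to be visited:
# the centre cell plus its four edge neighbours.
cells_for_eps_1 = offsets[: last_index[1] + 1]
```

Under these assumptions a radius of 1 selects five cells (indices 0 to 4) and a radius of 2 selects thirteen, so a query only needs to walk a prefix of the table.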

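The seed chaining performed by SparseMemoryRegionQuery (section C) can be sketched on a small in-memory grid. This is a simplified stand-in, not the paper's code: the `Cell` class, the 8 x 8 grid, the attribute names and the choice to count the centre object as the first seed are all assumptions, and a plain Python list takes the place of the Sparse MMF.

```python
from dataclasses import dataclass

NULL = 0                 # an offset of 0 marks the end of a chain
WIDTH, HEIGHT = 8, 8     # assumed grid size for the illustration


@dataclass
class Cell:
    occupied: bool = False
    next_seed_offset: int = NULL   # stands in for NextSeedObjectOffset
    next_temp_offset: int = NULL   # stands in for NextTempObjectOffset


def make_grid(points):
    """Place one object per (x, y) cell of a WIDTH x HEIGHT grid."""
    grid = [Cell() for _ in range(WIDTH * HEIGHT)]
    for x, y in points:
        grid[y * WIDTH + x].occupied = True
    return grid


def region_query(grid, center, cell_offsets, field):
    """Chain the occupied neighbour cells of `center` through `field`.

    No distance is computed and no queue is allocated: each object found
    is linked to the previous one by a relative offset.
    Returns (first, last, count) as cell indices.
    """
    cx, cy = center % WIDTH, center // WIDTH
    last = center
    count = 1                      # assumption: the centre counts as a seed
    for dx, dy in cell_offsets:
        x, y = cx + dx, cy + dy
        if (dx, dy) == (0, 0) or not (0 <= x < WIDTH and 0 <= y < HEIGHT):
            continue
        idx = y * WIDTH + x
        if grid[idx].occupied:
            setattr(grid[last], field, idx - last)   # link previous -> new
            last = idx
            count += 1
    setattr(grid[last], field, NULL)                 # terminate the chain
    return center, last, count


def follow(grid, first, field):
    """Walk a chain by adding the stored offsets until NULL is reached."""
    chain, idx = [first], first
    while getattr(grid[idx], field) != NULL:
        idx += getattr(grid[idx], field)
        chain.append(idx)
    return chain


EPS_1_CELLS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
grid = make_grid([(3, 3), (4, 3), (3, 4), (6, 6)])
first, last, count = region_query(grid, 3 * WIDTH + 3, EPS_1_CELLS, "next_seed_offset")
```

Passing `"next_temp_offset"` instead of `"next_seed_offset"` plays the role of the UpdateTempSeedOffset flag: the temporary chain is written to a separate field, so an existing main seed chain is left untouched.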

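The object layout of section B can also be tried out with Python's standard `mmap` and `struct` modules. The sketch below is only an approximation of the idea: the record format, the 256 x 256 grid, the one-object-per-cell rule and the helper names are assumptions, and whether the backing file is actually stored sparsely depends on the operating system and file system (extending a file with `truncate` usually leaves a hole on POSIX systems).

```python
import mmap
import struct
import tempfile

# Hypothetical record: x, y, ClusterID, then the three offset fields named in
# the text (NextObjectOffset, NextSeedObjectOffset, NextTempObjectOffset).
RECORD = struct.Struct("<iiiqqq")
NEXT_OBJECT = struct.Struct("<q")
NEXT_OBJECT_POS = 12     # NextObjectOffset follows three 4-byte integers
GRID_WIDTH = 256         # assumed number of cells per row and per column
NULL = 0                 # an offset of 0 means "no next object"
UNCLASSIFIED = -1


def cell_address(x, y):
    """Byte position of the record that belongs to grid cell (x, y)."""
    return (y * GRID_WIDTH + x) * RECORD.size


def load_points(mm, points):
    """Write each point into its cell and chain the records in load order."""
    first = prev = None
    for x, y in points:
        addr = cell_address(x, y)
        RECORD.pack_into(mm, addr, x, y, UNCLASSIFIED, NULL, NULL, NULL)
        if prev is None:
            first = addr              # must be kept to reach all objects
        else:
            # Store the relative offset of the new record in its predecessor.
            NEXT_OBJECT.pack_into(mm, prev + NEXT_OBJECT_POS, addr - prev)
        prev = addr
    return first


def walk(mm, first):
    """Follow NextObjectOffset from the first record until it is NULL."""
    addr = first
    while addr is not None:
        x, y, _, next_off, _, _ = RECORD.unpack_from(mm, addr)
        yield (x, y)
        addr = addr + next_off if next_off != NULL else None


def demo():
    size = GRID_WIDTH * GRID_WIDTH * RECORD.size
    with tempfile.TemporaryFile() as f:
        f.truncate(size)              # reserve the full address range
        with mmap.mmap(f.fileno(), size) as mm:
            first = load_points(mm, [(5, 7), (2, 3), (200, 12)])
            return list(walk(mm, first))
```

Calling `demo()` returns the points in the order they were loaded, which shows that the whole data set can be reached from the first object's address alone.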

D. Neighbour Cells and Index Offset Array

Figure 3. NCOArray and IOArray

Two additional arrays are used in this algorithm to avoid the distance computation and improve the performance. The first, the Neighbour Cells Offset Array (NCOArray), is an M x 2 array that stores the offset values of the neighbour cells relative to the center object. The second, the Index Offset Array (IOArray), is a K x 1 array that stores the NCOArray's last index value for each Eps value, starting from 0. For example, if the Eps value is 1, then IOArray[1] indicates that NCOArray elements 0 to 4 hold the cell offsets that need to be visited by SparseMemoryRegionQuery during the neighbour objects computation. The value of M is decided based on the maximum possible Eps value supported by the algorithm, and K is determined from it as well. Both arrays are populated with the required values before the actual clustering process.

E. Algorithm

1) Input D, Eps, MinObjs.
2) Create the Sparse Memory Mapped File.
3) Load the pre-computed Neighbour Cells Offset Array "NCOArray" and Index Offset Array "IOArray" values.
4) Initialize the Sparse Memory Mapped File with the dataset D, assign UNCLASSIFIED to the ClusterID field of all objects and preserve the first object's address.
5) ClusterID = NOISE, CurrentObject = FirstObject.
6) WHILE CurrentObject <> NULL
7)   If (CurrentObject.ClusterID == UNCLASSIFIED) Then
8)     Call SparseMemoryRegionQuery with CurrentObject, Eps, UpdateMasterSeedOffset, NCOArray and IOArray; the function returns FirstSeedObject, LastSeedObject and SeedObjectsCount.
9)     If (SeedObjectsCount >= MinObjs) Then // Core Object condition
10)      ClusterID = GetNextID(ClusterID).
11)      Assign the ClusterID to all the seed objects.
12)      Move CurrentSeedObject to point to its next seed object using the OffsetValue and assign NULL to the previous CurrentSeedObject's NextSeedObjectOffset field.
13)      WHILE CurrentSeedObject <> NULL
14)        Call SparseMemoryRegionQuery with CurrentSeedObject, Eps, UpdateTempSeedOffset, NCOArray and IOArray; the function returns TempFirstSeedObject, TempLastSeedObject and TempSeedObjectsCount.
15)        If (TempSeedObjectsCount >= MinObjs) Then
16)          TempCurrentSeedObject = TempFirstSeedObject.
17)          For I = 1 to TempSeedObjectsCount
18)            If TempCurrentSeedObject.ClusterID IN {UNCLASSIFIED, NOISE} Then
19)              If TempCurrentSeedObject.ClusterID == UNCLASSIFIED Then
20)                Append the TempCurrentSeedObject to the LastSeedObject.
21)              End If
22)              TempCurrentSeedObject.ClusterID = ClusterID.
23)            End If
24)            Move TempCurrentSeedObject to point to its next seed object using the OffsetValue and assign NULL to the previous TempCurrentSeedObject's NextTempObjectOffset field.
25)          End For
26)        End If
27)        If (CurrentSeedObject.NextSeedObjectOffset == 0)



           Then
28)          CurrentSeedObject = NULL.
29)        Else
30)          Move CurrentSeedObject to point to its next seed object using the OffsetValue and assign NULL to the previous CurrentSeedObject's NextSeedObjectOffset field.
31)        End If
32)      END WHILE
33)    Else // Non Core Object
34)      CurrentObject.ClusterID = NOISE.
35)      Assign NULL to the NextSeedObjectOffset field of all the seed objects.
36)    End If
37)  End If
38)  If (CurrentObject.NextObjectOffset == 0) Then
39)    CurrentObject = NULL.
40)  Else
41)    Move CurrentObject to point to its next object using the OffsetValue.
42)  End If
43) END WHILE

This algorithm starts by creating the Sparse MMF with the required size and loading the Neighbour Cells Offset and Index Offset array values. The objects of the dataset D are read one by one, and each object is placed in its corresponding memory location. As mentioned in section 4(B), while the Sparse MMF is initialized with objects, each successive object's memory offset is stored in the previous object's NextObjectOffset field, and the last object's NextObjectOffset field is assigned NULL. Thus it is essential to preserve the FirstObject's address in order to visit all the remaining objects.

The algorithm starts the traversal from the first object and visits the following objects one by one using the next object's offset stored in the current object itself. When it finds an object whose cluster ID is UNCLASSIFIED, the SparseMemoryRegionQuery function is called with the required parameters. As the new cluster is not yet formed, SparseMemoryRegionQuery needs to be called with the UpdateMasterSeedOffset flag to update the seed objects' NextSeedObjectOffset field. The output of SparseMemoryRegionQuery gives FirstSeedObject, LastSeedObject and SeedObjectsCount. If the current object is a non core object, it is marked as NOISE and the NextSeedObjectOffset field of all its seed objects is set to NULL. Otherwise the current object is a core object, and the cluster expansion starts by creating a new cluster ID. The new cluster ID is assigned to all the seed objects that are chained starting from FirstSeedObject. The remaining objects in the seed chain (all except FirstSeedObject) are then processed one by one, and for each of them SparseMemoryRegionQuery is called with the UpdateTempSeedOffset flag to update the NextTempObjectOffset field. This avoids overwriting the seed objects which already exist in the main seed chain. So if the object is a core object, the neighbour objects are visited using NextTempObjectOffset instead of NextSeedObjectOffset, the UNCLASSIFIED objects present in the temporary seed chain are appended to the LastSeedObject (main seed chain) for further processing, and all the UNCLASSIFIED and NOISE objects present in the temporary seed chain are assigned the current cluster ID. The LastSeedObject member always points to the last object in the seed chain. All the objects present in the main seed chain are processed one by one, and the cluster expansion stops when the traversal reaches the LastSeedObject and there are no more seed objects to process. The complete clustering process stops once the initial loop has processed all the objects present in the data set.

Figure 4. Result of Dataset 1

F. Advantages

The proposed algorithm is very stable. The main drawback of the original DBSCAN algorithm is instability. Though all the objects present in the data set can be loaded by the DBSCAN algorithm, if there is not sufficient main memory to hold the growing seed objects, the DBSCAN algorithm will crash during run time. The new algorithm doesn't rely on a growing seed list, so it guarantees to process all the objects as long as it is able to load them. The second advantage of the new algorithm is that it is capable of processing a huge number of objects. Since this algorithm is based on the Sparse MMF, it can support a few GBs of data on a 32 bit operating system, where the traditional approach supports only a few MBs of data in practice. This algorithm can also be customised to process very large data sets (e.g. > 10 GB) using the Sparse MMF. The beauty of the Sparse MMF is that, though we pre-allocate more memory in the beginning, the memory actually occupied is based on consumption. Finally, the performance is very fast, as the algorithm works directly on the memory.

G. Limitations

As this algorithm uses the Sparse MMF and only very few languages support this feature, the scope for implementing this algorithm is limited. The second limitation is memory



                                                                   126                                 http://sites.google.com/site/ijcsis/
                                                                                                       ISSN 1947-5500
                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                Vol. 8, No. 5, 2010
customization. If we are planning to apply this algorithm to                                                                                     different sizes of 2 dimensional synthetic datasets were used
support multidimensional dataset, memory needs to be                                                                                             and running time results are given below:
customized accordingly and the computation complexity may
vary. Also if the minimum distance between one object and                                                                                        TABLE 2. RUNNING TIME OF DBSCAN AND DBSCANSMMF IN SECONDS
the immediate nearest object is greater than one unit or less
than one unit, offset array values will change and it should be
recomputed. Moreover creating and populating values in




                                                                                                                                                                   1.DBSCANSMMF




                                                                                                                                                                                                         2.DBSCANSMMF
Offset arrays are an extra task. Last drawback of this
algorithm is this doesn’t support duplicate objects. As the




                                                                                                                                                                                           1.DBSCAN




                                                                                                                                                                                                                                 2.DBSCAN
                                                                                                                                                   Number of
object loaded in the corresponding memory location, it is not




                                                                                                                                                   Objects
possible to overwrite another object in the same location.
These are the notable limitations of this algorithm.
                                                                                                                                                  1500         0.0007             0.3892              0.0005            0.2176
                                                                                                                                                  3000         0.0043             0.5395              0.0051            0.5684
H. Computation Complexity                                                                                                                         6000         0.0081             1.8030              0.0094            1.8920
                                                                                                                                                  10000        0.0137             4.9124              0.0166            5.1122
          The DBSCAN algorithm’s complexity has been                                                                                              20000        0.0261             20.4426             0.0255            18.2351
calculated based on the number of RegionQuery function                                                                                            30000        0.0377             43.3875             0.0269            41.1765
call. In which each RegionQuery function call need N
                                                                                                                                                  40000        0.0545             77.6204             0.0587            79.6543
distance computation and hence the computation complexity
becomes O (N2) for processing all the N objects present in                                                                                        60000        0.0799             195.8284            0.0676            181.8745
the dataset. As the new algorithm’s SparseRegionQuery
process the neighbour cells, the complexity varies based on
the Eps value and each SparseRegionQuery requires not
more than 2(Eps+1) cells traversal. Eventually for processing
all the N objects, our algorithm requires O (N * 2(Eps+1) )
time. The constant 1 can be removed as it is very small and
the final complexity comes as O (N * 2Eps). This complexity
is really a reduction when the Eps value is reasonable (e.g
1~10) and N value is very large. At the same time, if we have
very less number of objects and the Eps value is too big, this
new complexity won’t be an attractive one. However the real
processing time will be very faster than the traditional
RegionQuery function call as the SparseRegionQuery
traverse the memory directly.
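
To make the comparison in this section concrete, the following Python sketch contrasts a RegionQuery that computes one distance per object with a cell-based query that only checks a precomputed list of neighbour cells. It assumes integer 2-D coordinates on a unit grid. The function names, the circular neighbourhood and the use of a Python set in place of direct memory reads are illustrative assumptions, so the cell count it produces is not the bound stated in the paper.

```python
import math


def neighbour_offsets(eps):
    """Relative (dx, dy) cells whose centre lies within eps of the origin cell.

    Plays the role of a precomputed neighbour-cell offset table (assumed form).
    """
    r = int(math.floor(eps))
    return [(dx, dy)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dx * dx + dy * dy <= eps * eps]


def brute_force_region_query(points, centre, eps):
    """Original-style query: one distance computation per object in the dataset."""
    cx, cy = centre
    hits = [p for p in points
            if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= eps * eps]
    return hits, len(points)      # work done = N distance computations


def grid_region_query(occupied, centre, offsets):
    """Cell-based query: one occupancy check per precomputed neighbour cell."""
    cx, cy = centre
    hits = [(cx + dx, cy + dy) for dx, dy in offsets
            if (cx + dx, cy + dy) in occupied]
    return hits, len(offsets)     # work done depends on eps only, not on N
```

For a fully occupied 20 x 20 grid (400 objects) and Eps = 2, both queries return the same 13 neighbours of the centre (10, 10), but the first performs 400 distance computations while the second checks 13 cells. The second figure stays fixed as N grows, which is the effect the complexity argument relies on.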

         TABLE 1. COMPARISON OF ALGORITHMS
                                                                                                                                                               Fig 5. Scalability of Algorithm with different size of dataset
                                                                                                 Better Performance




                                                                                                                                                           The above table and graph figures show that new
                                                                            Supports Duplicate




                                                                                                                      Doesn’t depend on
                                                                                                                      distance function.
                                                       Ability to process
                  growing Seed )?




                                                                                                                                                 algorithm gives better performance when the algorithm’s
                                    Doesn’t Require




                                                       huge dataset?




                                                                                                                                                 input data set size grows. This is the expected obvious result
                                    extra Buffer
                  (because of




  Algorithm
                                                                                                                                                 as the new algorithm visits only the required neighbour cells
                                                                            Objects?
                  Stability




                                                                                                                                                 during the SparseMemoryRegionQuery function call instead
                                                                                                                                                 of the computing distance between center and the entire
                                                                                                                                                 objects in the data set. Another reason is directly accessing
   DBSCAN           No                 No                No                   Yes                         No            No                       the memory is much faster than using the buffers to process
                                                                                                                                                 the data that are usually used to implement the algorithms.
DBSCANSMMF         Yes                Yes              Yes                     No                      Yes            Yes
                                                                                                                                                               VI. CONCLUSION AND FUTURE ENHANCEMENT

Above table show the comparison of some key features and                                                                                                   In this paper we have proposed DBSCANSMMF
DBSCANSMMF is superior in most of the features.                                                                                                  algorithm to improve the performance as well as to process
                                                                                                                                                 the huge amount of data using Sparse MMF. This new
         V. EXPERIMENTAL RESULTS                                                                                                                 algorithm doesn’t uses any growing seed list which causes
                                                                                                                                                 the crash during the run time when there is no sufficient
                                                                                                                                                 memory to store the seed objects. Instead the new algorithm
        The newly proposed algorithm and the original                                                                                            just maintains the seed list using the offset values and these
DBSCAN algorithm have been implemented in Visual C++                                                                                             values are stored in each objects corresponding offset field
(2008) on Windows Vista OS and ran on PC with a 2.0 GHZ                                                                                          internally. So there is no need of creating duplicate objects
processor and 4 GB RAM to observe the performance. The



                                                                                                                                           127                                                         http://sites.google.com/site/ijcsis/
                                                                                                                                                                                                       ISSN 1947-5500
                                        (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                  Vol. 8, No. 5, 2010
for processing the objects. Also this new algorithm takes O                         Tirunelveli. His interested research area is algorithms inventions in
(N * 2Eps)      computation complexity and this is better                           data mining.
complexity as long as Eps value is reasonable.
                                                                                    Email: hencilpeter@hotmail.com
         Future work will be to customize this algorithm to
support duplicate objects. This can be achieved using the
internal counter which will give the number of similar
objects and the SparseMemoryRegionQuery also needs to be
customized accordingly to support correct output. The next
expansion will be customizing this algorithm to process super
big data set (e.g. 50 GB). One of the real uses of Memory
Mapped File is mapping the required portion of the file into
memory to process and, un map the current mapped region
and remap the next consecutive file region to process later.
Like this we can process any big file and this algorithm needs
to be customized to support this feature.
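
The map, process, unmap and remap cycle outlined in this future-work paragraph can be sketched with Python's standard mmap module. This only illustrates the windowing idea and is not the proposed extension itself: process_in_windows and handle are hypothetical names, and the window size is rounded to mmap.ALLOCATIONGRANULARITY because mapping offsets must be multiples of it.

```python
import mmap
import os


def process_in_windows(path, window_size, handle):
    """Map a file one region at a time and call handle(view, file_offset) on each.

    Only one region is mapped at any moment, so the file may be far larger
    than the address space available to the process.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    # Round the window down to a multiple of the granularity (at least one unit).
    window_size = max(granularity, (window_size // granularity) * granularity)
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < total:
            length = min(window_size, total - offset)
            # Map only the current region of the file, read-only.
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as view:
                handle(view, offset)
            # Leaving the with-block unmaps the region before the next is mapped.
            offset += length
```

A handler receives each mapped view together with its starting offset in the file, so per-window results can be combined afterwards. Objects whose neighbourhoods cross a window boundary would need extra care, which this sketch does not attempt.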


                              REFERENCES
[1] Ester M., Kriegel H.-P., Sander J., and Xu X. (1996) “A Density-Based           Dr.A. Antonysamy is Principal of St. Xavier’s College,
Algorithm for Discovering Clusters in Large Spatial Databases with Noise”           Kathmandu, Nepal. He completed his Ph.D in Mathematics for the
In Proceedings of the 2nd International Conference on Knowledge
Discovery and Data Mining (KDD’96), Portland: Oregon, pp. 226-231
                                                                                    research on “An algorithmic study of some classes of intersection
                                                                                    graphs”. He has guided and guiding many research students in
[2] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan               Computer Science and Mathematics. He has published many
Kaufman, 2006.                                                                      research papers in national and international journals. He has
                                                                                    organized Seminars and Conferences in state and national level.
[3] Jeffrey Richter and Christophe Nasarre, WINDOWS VIA C/C++,
Microsoft Press, 2008.                                                              Email: fr_antonysamy@hotmail.com.
[4]M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS:
Ordering Objects to Identify the Clustering Structure, Proc. ACM
SIGMOD,” in International Conference on Management of Data, 1999, pp.
49–60.

[5] A. Hinneburg and D. Keim, “An efficient approach to clustering in large
multimedia data sets with noise,” in 4th International Conference on
Knowledge Discovery and Data Mining, 1998, pp. 58–65.

[6]SHOU Shui-geng, ZHOU Ao-ying JIN Wen, FAN Ye and QIAN Wei-
ning.(2000)
"A Fast DBSCAN Algorithm" Journal of Software: 735-744

[7] Li Jian; Yu Wei; Yan Bao-Ping; , "Memory effect in DBSCAN
algorithm," Computer Science & Education, 2009. ICCSE '09. 4th
International Conference on , vol., no., pp.31-36, 25-28 July 2009.

                          AUTHOR PROFILES




J. Hencil Peter is Research Scholar, St. Xavier’s College
(Autonomous), Palayamkottai, Tirunelveli, India. He earned his
MCA (Master of Computer Applications) degree from
Manonmaniam Sundaranar University, Tirunelveli. Now he is
doing Ph.D in Computer Applications and Mathematics
(Interdisciplinary) at Manonmaniam Sundranar University,



