National Conference on Role of Cloud Computing Environment in Green Communication 2012


MINING PATTERNS WITH SEQUENCE AND SEGMENT PERIODICITY IN TIME SERIES
T. SARANYA, Reg no. 97310405015
The Rajaas Engineering College, Vadakkankulam.
E-mail: tsaraniasbe@gmail.com
Abstract

This project studies periodic pattern mining, also called periodicity detection, a tool that helps in predicting the behavior of time series data. It has a number of applications, such as prediction, forecasting, and detection of unusual activities. The method can also detect periodicity within a subsection of a time series. The main goals of this project are to cope with noise and to investigate the different periodicity types (namely symbol, sequence, and segment). The proposed work presents an algorithm that can detect symbol, sequence (partial), and segment (full-cycle) periodicity in time series. The algorithm uses a suffix tree as its data structure and is noise resilient: it has been successfully demonstrated to work with replacement, insertion, deletion, or a mixture of these types of noise. The proposed algorithm has been tested on both synthetic and real data from different domains, including protein sequences. It is generally more time-efficient and noise-resilient than existing algorithms.

*********************

Introduction

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining is the process of discovering new patterns from large data sets, involving methods from statistics and artificial intelligence but also database management. In contrast to machine learning, the emphasis lies on the discovery of previously unknown patterns as opposed to generalizing known patterns to new data. The term covers large-scale data or information processing such as collection, extraction, warehousing, analysis, and statistics, but it is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence.

The key term is discovery, commonly defined as detecting something new. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases.

A time series is a collection of data values gathered generally at uniform intervals of time to reflect certain behavior of an entity. Real life has several examples of time series, such as weather conditions of a particular location, spending patterns, stock growth, transactions in a superstore, network delays, power consumption, computer network fault analysis and security breach detection, earthquake prediction, and gene expression data analysis [1]. A time series is mostly characterized by being composed of repeating cycles. For instance, there is a traffic jam twice a day when the schools are open; the number of transactions in a superstore is high at certain periods during the day, on certain days during the week, and so on. In other words, periodicity detection is a process for finding temporal regularities within the time series, and the goal of analyzing a time series is to find whether and how frequently a periodic pattern (full or partial) is repeated within the series [21].

In general, three types of periodic patterns can be detected in a time series: 1) symbol periodicity, 2) sequence periodicity or partial periodic patterns, and 3) segment or full-cycle periodicity. A time series is said to have symbol periodicity if at least one symbol is repeated periodically. For example, in time series T = abd acb aba abc, symbol a is periodic with periodicity p = 3, starting at position zero (stPos = 0). Similarly, a pattern consisting of more than one symbol may be periodic in a time series, and this leads to partial periodic patterns. For instance, in time series T = bbaa abbd abca abbc abcd, the sequence ab is periodic with periodicity p = 4, starting at position 4 (stPos = 4); and the partial periodic pattern ab** exists in T, where * denotes any symbol (a don't-care mark). Finally, if the whole time series can be mostly represented as a repetition of a pattern or segment, then this type of periodicity is called segment or full-cycle periodicity. For instance, the time series T = abcab abcab abcab has segment periodicity of 5 (p = 5) starting at the first position (stPos = 0), i.e., T consists of only three occurrences of the segment abcab.

Developing a noise-resilient periodicity detection algorithm involves three issues: 1) identifying all periodic patterns, 2) handling asynchronous periodicity by locating all periodic patterns, and 3) investigating the whole time series [13], [14], [15]. The algorithm proposed in this paper can detect periodic patterns found in subsections of the time series, prune redundant periods, and follow various optimization strategies. It is also applicable to biological data sets and has been analyzed for time performance and space consumption. Contributions of our work can be summarized as follows:

   1. The development of a comprehensive suffix-tree-based algorithm that can simultaneously detect symbol, sequence, and segment periodicity.
   2. Finding periodicity within a subsection of the series.
   3. Reporting nonredundant periods by applying pruning techniques to eliminate redundant periods.
   4. An analysis of the algorithm's time performance and space consumption, considering the worst case, the average case, and the best case.
   5. A number of optimization strategies are presented.
   6. The proposed algorithm is shown to be applicable to biological data sets such as DNA and protein sequences, and the results are compared with those produced by other existing algorithms like SMCA [9].
   7. Various experiments have been conducted to demonstrate the time efficiency, robustness, scalability, and accuracy of the reported results by comparing the proposed algorithm with existing state-of-the-art algorithms like CONV [4], WARP [5], and partial periodic patterns (ParPer) [12].

2 RELATED WORKS

Existing literature on time series analysis roughly covers two types of algorithms: 1) the first category includes algorithms that require the user to specify the period and then look only for patterns occurring with that period; 2) the second category includes algorithms that look for all possible periods in the time series. Another way to classify existing algorithms is based on the type of periodicity


Department of CSE, Sun College of Engineering and Technology


they detect: some detect only symbol periodicity, some detect only sequence or partial periodicity, while others detect only full-cycle or segment periodicity.

Earlier algorithms, e.g., [11], [3], require the user to provide the expected period value, which is used to check the time series for corresponding periodic patterns. For example, in a power consumption time series, a user may test for weekly, biweekly, or monthly periods. However, it is usually difficult to provide expected period values, and this approach prohibits finding unexpected periods, which might be more useful than the expected ones.

Recently, a few algorithms, e.g., [4], [7], [10], have looked for all possible periods by considering the range (2 ≤ p ≤ n/2). Elfeky et al. [4] proposed two separate algorithms to detect symbol and segment periodicity in time series. Their first algorithm (CONV) is based on the convolution technique, with a reported complexity of O(n log n). Although their algorithm works well with data sets having perfect periodicity, it fails to perform well when the time series contains insertion and deletion noise. Realizing the need to work in the presence of noise, Elfeky et al. later presented an O(n^2) algorithm (WARP) [5], which performs well in the presence of insertion and deletion noise. WARP uses the time warping technique to accommodate insertion or deletion noise in the data. But, besides having O(n^2) complexity, WARP can only detect segment periodicity; it cannot find symbol or sequence periodicity. Also, both CONV and WARP can only detect periodicity that lasts until the very end of the time series, i.e., they cannot detect patterns which are periodic only in a subsection of the time series.

Sheng et al. [17] developed an algorithm based on Han's [8] ParPer algorithm to detect periodic patterns in a section of the time series; their algorithm utilizes optimization steps to find dense periodic areas in the time series. But their algorithm, being based on ParPer, requires the user to provide the expected period value.

Recently, Huang and Chang [9] presented their algorithm for finding asynchronous periodic patterns, where the periodic occurrences can be shifted in an allowable range along the time axis. This is somewhat similar to how our algorithm deals with noisy data by utilizing the time tolerance window for periodic occurrences.

3 PROPOSED WORKS

To overcome the problems in existing schemes, the proposed method performs periodic pattern mining (periodicity detection) for different periodicity types: 1) symbol periodicity, 2) sequence periodicity, and 3) segment periodicity.

The proposed algorithm can detect symbol, sequence (partial), and segment (full-cycle) periodicity in time series. It uses a suffix tree as its data structure. The algorithm is noise-resilient and works with replacement, insertion, deletion, or any mixture of these types of noise. It checks for all possible periods starting from all possible positions within a prespecified range, whether the whole time series or a subsection of it, and applies various redundant period pruning techniques to output a small number of useful periods by removing most of the redundant periods. The algorithm has also been analyzed for time performance [1].

4 METHODS

The method involves three phases. In the first phase, we build the suffix tree for the time series; in the second phase, we use the suffix tree to calculate the periodicity of various patterns in the time series. The third phase is redundant period pruning, i.e., ignoring a redundant period ahead of time. As an immediate benefit of redundant period pruning, the algorithm does not waste time investigating a period which has already been identified as redundant. This saves considerable time and also results in reporting fewer but more useful periods. This is the primary reason why our algorithm intentionally reports significantly fewer periods without missing any existing periods during the pruning process [1].

4.1 First Phase: Suffix-Tree-Based Representation

The suffix tree is a famous data structure [1] that has been proven to be very useful in string processing [1], [7], [19]. It can be efficiently used to find a substring in the original string, to find the frequent substrings, and for other string matching problems. A suffix tree for a string represents all its suffixes; for each suffix of the string there is a distinguished path from the root to a corresponding leaf node in the suffix tree. Given that a time series is encoded as a string, the most important aspect of the suffix tree, related to our work, is its capability to very efficiently capture and highlight the repetitions of substrings within a string. Fig. 1 shows a suffix tree for the string abcabbabb$, where $ denotes the end marker for the string; it is a unique symbol that does not appear anywhere else in the string.

Fig. 1. The suffix tree for the string abcabbabb$.

The path from the root to any leaf represents a suffix of the string. Since a string of length n has exactly n suffixes, the suffix tree for a string also contains exactly n leaves. Each edge is labeled by the string that it represents. Each leaf node holds a number that represents the starting position of the suffix obtained when traversing from the root to that leaf. Each intermediate node holds a number which is the length of the substring read when traversing from the root to that intermediate node. Each intermediate edge reads a string which is repeated at least twice in the original string.

A suffix tree can be generated in linear time using Ukkonen's famous algorithm described in [19]. Ukkonen's algorithm works online, i.e., a suffix tree can be extended as new symbols are added to the string. A suffix tree for a string of length n can have at most 2n nodes, and the average depth of the suffix tree is of order log(n) [6], [16]. It is not necessary to always keep a suffix tree in memory; there are algorithms for handling disk-based suffix trees, making the suffix tree a preferred choice for processing very large strings such as time series and DNA sequences, which grow into the billions [20].

4.2 Occurrence Vector and Difference Vector

The second phase constructs the occurrence vector and the difference vector. We traverse the tree in bottom-up order to construct the occurrence vector for each edge connecting an internal node to its parent. First, we start with nodes having only leaf nodes as children. Each such node passes the values of its children (leaf nodes) to the edge connecting it to its parent node. The values are used by the latter edge to create its occurrence vector (denoted occur_vec in the algorithm). The occurrence vector of edge e contains the index positions at which the substring from the root to edge e exists in the original string. Second, we consider each node v having a mixture of leaf and nonleaf nodes as children [1].

The occurrence vector of the edge connecting v to its parent node is constructed by combining the occurrence vector(s) of the edge(s) connecting v to its nonleaf child node(s) and the value(s) coming from its leaf child node(s). Finally, until we reach all direct children of the root, we recursively consider each node u having only nonleaf children. The occurrence vector of the edge connecting u to its parent node is constructed by combining the occurrence vector(s) of the edge(s) connecting u to its child node(s). Applying this bottom-up traversal process on the suffix tree shown in Fig. 1 will produce the occurrence vectors reported in Fig. 2.
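The bottom-up construction of occurrence vectors can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: it builds a naive, uncompressed suffix trie in quadratic time rather than using Ukkonen's linear-time construction, and it uses recursion for brevity. The names Node, build_suffix_trie, and occurrence_vectors are hypothetical and do not come from the paper. Branching trie nodes are used as a stand-in for the internal nodes of the compressed suffix tree.

```python
class Node:
    """Trie node: children keyed by symbol; leaf holds a suffix start position."""

    def __init__(self):
        self.children = {}   # symbol -> Node
        self.leaf = None     # set only on leaves (end of a suffix)


def build_suffix_trie(text):
    """Insert every suffix of text (which must end with a unique marker such as '$')."""
    root = Node()
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.children.setdefault(symbol, Node())
        node.leaf = start    # the leaf remembers where its suffix starts
    return root


def occurrence_vectors(text):
    """Map each repeated substring ending at a branching node to its sorted start positions."""
    root = build_suffix_trie(text)
    vectors = {}

    def collect(node, prefix):
        # Bottom-up step: a node's occurrence vector combines the values of its
        # leaf children and the occurrence vectors of its non-leaf children.
        positions = [] if node.leaf is None else [node.leaf]
        for symbol, child in node.children.items():
            positions.extend(collect(child, prefix + symbol))
        # Branching nodes play the role of internal suffix-tree nodes here.
        if prefix and len(node.children) >= 2:
            vectors[prefix] = sorted(positions)
        return positions

    collect(root, "")
    return vectors


if __name__ == "__main__":
    for pattern, occur_vec in sorted(occurrence_vectors("abcabbabb$").items()):
        print(pattern, occur_vec)
```

For the string abcabbabb$, the sketch reports occurrence vectors for the repeated substrings ab, abb, b, and bb; for instance, ab maps to [0, 3, 6]. A practical implementation would likely replace the trie with a linear-time suffix tree and the recursion with an explicit stack.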



Fig. 2. Suffix tree for string abcabbabb$ after bottom-up traversal.

The periodicity detection algorithm uses the occurrence vector of each intermediate edge to check whether the string represented by the edge is periodic. The tree traversal process is implemented using the nonrecursive explicit stack-based algorithm presented in [2], which prevents the program from throwing a stack overflow exception.

4.2.1 Periodicity Detection Algorithm

The periodicity detection algorithm is applied at each intermediate edge using its occurrence vector. Our algorithm is linear-distance-based: we take the difference between any two successive occurrence vector elements, leading to another vector called the difference vector. It is important to understand that we do not actually keep any such vector in memory; it is introduced only for the sake of explanation. Table 1 presents example occurrence and difference vectors.

Each value in the difference vector is a candidate period starting from the corresponding occurrence vector value (the value of occur_vec in the same row). Recall that each period can be represented by a 5-tuple (X, p, stPos, endPos, conf), denoting the pattern, period value, starting position, ending position, and confidence, respectively. For each candidate period (p = diff_vec[j]), the algorithm scans the occurrence vector starting from its corresponding value (stPos = occur_vec[j]), and increases the frequency count of the period, freq(p), if and only if the occurrence vector value is periodic with regard to stPos and p.

TABLE 1
An Example Occurrence Vector and Its Corresponding Difference Vector
   occur_vec    diff_vec
       0            3
       3            9
      12            4
      16            5
      21            3
      24            3
      27           11
      38            7
      45            3

4.2.2 Periodicity Detection in Presence of Noise

Three types of noise generally considered in time series data are replacement, insertion, and deletion noise. In replacement noise, some symbols in the discretized time series are replaced at random with other symbols. In the case of insertion and deletion noise, some symbols are inserted or deleted, respectively, randomly at different positions (or time values). Noise can also be a mixture of these three types. For instance, RI type noise means the uniform mixture of replacement (R) and insertion (I) noise. Insertion and deletion noise expand or contract the time axis, leading to a shift of the original time series values. For example, time series T = abcabcabc after inserting symbol b at positions 2 and 6 would be T' = abbcabcabbc. The occurrence vector of symbol a in T is (0, 3, 6), while it is (0, 4, 7) in T'. It is very clear that when the time series is distorted by insertion and/or deletion noise, linear-distance-based algorithms do not perform well. To handle this, periodic occurrences are allowed to drift within a specified limit called the time tolerance (denoted as tt in the algorithm). This means that if the periodic occurrences are expected to be found at positions x, x + p, x + 2p, ..., then with time tolerance, occurrences at x, x + p ± tt, x + 2p ± tt, ... would also be considered valid.

4.2.3 Periodicity Detection in a Subsection of Time Series

The periodicity detection algorithm calculates all patterns which are periodic starting from any position less than half the length of the time series (stPos < n/2) and continuing till the end of the time series or till the last occurrence of the pattern. But in real-life situations, a pattern may be periodic only within a section and not in the entire time series. For example, the traffic pattern near schools during the semester break is different from that during the semester, the movie rental pattern in winter is different from that in summer, and the hourly number of transactions at a superstore may show a different periodicity during the holiday season (15 November to 15 January) than its regular periodicity. Consider the time series T = bcdabababababccdabcadad: pattern ab is perfectly periodic only in the range [3, 12]. This type of periodicity might be very interesting in DNA sequences in particular and in regular time series.

In order to detect such periodicity, we employ the concept introduced in [1], where two parameters dmax and minLength are utilized. The first parameter, dmax, denotes the maximum distance between any two occurrences of a pattern for them to be part of the same periodic section. Consequently, a distance of more than dmax between two occurrences may potentially mark the end of one periodic section and/or the start of a new periodic section for the same pattern.

4.5 Redundant Period Pruning Techniques

Periodicity detection algorithms generally do not prune or prohibit the calculation of redundant periods. The immediate drawback is reporting a huge number of periods, which makes it more challenging to find the few useful and meaningful periodic patterns within the large pool of reported periods.

TABLE 2
An Example Result of a Periodicity Detection Algorithm
   Pattern (X)    Period (p)    Starting Position (stPos)
       ab              5                   0
       ab             10                   0
       ab             25                   0
       a               5                   0
       a              15                   0
       b               5                   1
       b              10                   1
       a               5                  10
       ab              5                  20

Assume the periods reported by an algorithm are as presented in Table 2. Looking carefully at this result, it can easily be seen that all these periods can be replaced by just one period, namely (X = ab, p = 5, stPos = 0), and all other periods may be considered mere repetitions of that period.

Empowered by redundant period pruning, our algorithm not only saves the time of the users observing the produced results, but also saves the time needed by the mining algorithm itself to compute the periodicity. We implemented the redundant period pruning techniques as prohibitive steps which prevent the algorithm from handling redundant periods.

5 EXPERIMENTAL RESULTS

In this section, we present the results of several experiments that have been conducted using both synthetic and real data; we also report the results of testing various characteristics of our algorithm against other existing algorithms. Hereafter, our algorithm is denoted as STNR (Suffix-Tree-based Noise-Resilient algorithm).

5.1 Accuracy

The first set of tests is dedicated to demonstrating the completeness of STNR, in the sense that it should be able to





find a period once it exists in the time series. STNR satisfies this on both synthetic and real data.

5.1.1 Synthetic Data

The parameters controlled during data generation are data distribution (uniform or normal), alphabet size, size of the data, period size, and the type and amount of noise in the data. The data may contain replacement, insertion, and deletion noise or any mixture of these types of noise [4].

The algorithm can find all periodic patterns with 100 percent confidence regardless of the data distribution, alphabet size, period size, and data size. This is an immediate benefit of using the suffix tree, which guarantees identifying all repeating patterns. Since the algorithm checks the periodicity for all repeating patterns, it can detect all existing periods in noise-free (inerrant) data. Results with noisy data are presented later.

5.1.2 Real Data

For real data, we used a data set (denoted PACKET) that contains packet Round Trip Time (RTT) delays [17]. These data have been selected to demonstrate the adaptability of STNR for handling cases where periods are contained only within a subsection of the time series. The data have been discretized uniformly into 26 symbols.

TABLE 3
Periodic Patterns in Packet Data
   Pattern    Period    StPos     EndPos    Confidence
   aa           2       146698    155081    0.42
   aa           2       180136    186061    0.42
   aa           2       297362    304371    0.47
   aaa          3       146772    155064    0.44
   aaa          3       180897    186065    0.43
   aaa          3       297525    304367    0.47
   aaa*a        5       147030    155044    0.44
   aaaa*        5       182390    186064    0.36
   aaa**        5       297480    304324    0.44
   ****aa*      7       147203    155042    0.46
   **aa**a      7       148841    155042    0.46
   aa*****      7       149394    155070    0.46
   a**aaaa      7       182091    186059    0.4
   a*a*aa*      7       299362    304324    0.49

The Packet data have been used in [17], where the authors found the dense periodic patterns, i.e., the areas where the time series is mostly periodic. The three regions which were found periodic for symbol a and its patterns are [148187-155084], [181413-186097], and [300522-304372]. Accordingly, the target of this set of experiments is to confirm that STNR can also find the periodicity within a subsection of the time series.

5.2 Time Performance

The time performance of STNR is compared to that of ParPer, CONV, and WARP. We test the time behavior of the compared algorithms from three perspectives: varying data size, period size, and noise ratio.

5.2.1 Varying Data Size

First, we compare the performance of STNR against ParPer [8], which finds the periodic patterns for a specific period. The synthetic data used in the testing have been generated following a uniform distribution with an alphabet size of 10 and an embedded period value of 32. The algorithms have been run on these data by varying the time series size from 100,000 to 1,000,000. If ParPer and other similar algorithms are extended to find patterns with all period values, their complexity would jump to O(n^2). Even with O(n^2) complexity, ParPer only finds partial periodic patterns, while STNR can find all three types of periodic patterns in the data, namely symbol, sequence, and segment periodicity. Further, compared to STNR, ParPer does not detect periodicity within a subsection of a time series and it is not resilient to noise (especially insertion/deletion). Due to the differences highlighted above, the quality of the results produced by STNR could be classified as better than that produced by the ParPer algorithm [8].

STNR can even achieve this result when the series contains insertion and deletion noise and when the periodicity is only found in a subsection of the time series and not in the entire time series. ParPer is not capable of achieving this [8]. Similarly, ParPer cannot detect the periodicity of singular events (termed symbol periodicity), which might be prevalent in the time series.

5.2.2 Varying Period Size

This set of experiments is intended to show the behavior of STNR when varying the embedded period size. For this experiment, we fixed the time series length and the alphabet size of the series and varied the embedded period size from 5 to 100. The time taken by STNR for both uniform and normal distributions has been recorded and the corresponding




                                            Fig. 3. Time behavior with varying period size (alphabet size= 10).


curves are plotted in Fig. 3. From Fig. 3, we can easily observe
the effect of changing the period value on the running time of
ParPer [8] and CONV [4]. The results show that the time
performance of both STNR and CONV does not change significantly
and remains more or less the same. ParPer, however, is affected
as the period size increases: when the period value is large, the
partial periodic pattern is large as well, and so is the
max-subpattern tree [8]. The time performance of CONV remains the
same because it checks all possible periods irrespective of the
data set.

5.2.3 Varying Noise Ratio
The next set of experiments measures the impact of noise ratio on
the time performance of STNR. For this experiment, we fixed the
time series length, period size, alphabet size, and data
distribution and measured




Department of CSE, Sun College of Engineering and Technology
the impact of varying noise ratio on the time performance of the
algorithm. We tested two sets of data: one contains replacement
noise and the other contains insertion noise. The results are
plotted in Fig. 6. Again, these experiments were conducted using
the three algorithms: ParPer, STNR, and CONV. STNR does take more
time when the noise ratio is small (10-15 percent), but as the
noise ratio increases further, its running time stays roughly the
same. Because we ignore the intermediate edges (or periodic
patterns) that carry less than 1 percent of the leaf nodes (i.e.,
patterns that appear rarely), most of the noise is caught in these
infrequent patterns and does not significantly affect the time
performance of STNR. As the results show, the noise ratio does not
affect the time performance of CONV and ParPer.
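
The two noise types used in this experiment can be sketched as follows. This is only an illustration of replacement and insertion noise on a symbol string; the function names, the fixed random seed, and the way the noise ratio is turned into a number of edits are assumptions, not details taken from the experiments.

```python
import random


def add_replacement_noise(series, ratio, alphabet, rng):
    """Overwrite round(ratio * len(series)) random positions with a different symbol."""
    out = list(series)
    count = round(ratio * len(out))
    for i in rng.sample(range(len(out)), count):
        # Pick any symbol other than the current one, so the position really changes.
        out[i] = rng.choice([s for s in alphabet if s != out[i]])
    return "".join(out)


def add_insertion_noise(series, ratio, alphabet, rng):
    """Insert round(ratio * len(series)) random symbols at random positions."""
    out = list(series)
    count = round(ratio * len(series))
    for _ in range(count):
        out.insert(rng.randrange(len(out) + 1), rng.choice(alphabet))
    return "".join(out)


rng = random.Random(0)          # fixed seed, chosen only for repeatability
clean = "abcde" * 20            # perfectly periodic series, period 5
replaced = add_replacement_noise(clean, 0.10, "abcde", rng)
inserted = add_insertion_noise(clean, 0.10, "abcde", rng)
print(len(clean), len(replaced), len(inserted))
```

Replacement noise keeps the series length unchanged, whereas insertion noise shifts every later symbol, which is why it is the harder case for algorithms that expect occurrences at exact positions.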
Fig. 6. Execution time with and without checking first level
occurrence vectors.

5.2.4 Effect of Data Distribution
            Finally, we tested the time performance of STNR,
ParPer [8], WARP [5], and CONV [4] for combinations of data
distribution and period size. We tested two series: one has been
generated with a uniform data distribution while the other follows
a normal distribution. Periods of size 25 and 32 are embedded in
the time series, and the time performance is measured for series
lengths between 100,000 and 1,000,000. The results are presented
in Fig. 7. For STNR, the uniform distribution seems to take less
time than the normal distribution. Since the shape of the suffix
tree depends on the data distribution, it is understandable that
different data distributions take different amounts of time. An
important observation, however, is that the time behavior is
similar for both uniform and normal distributions and for
different period size combinations.
             For CONV and WARP, the data distribution does not
affect the time performance either, because they check the periods
exhaustively. The time performance of ParPer is not affected by
the data distribution, but it seems to take more time when the
period size is increased (as pointed out earlier).
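
A synthetic series of this kind can be produced along the following lines. The sketch is an assumption about how such data might be generated (one base segment of the chosen period, repeated to the target length, with symbols drawn from either distribution); the function name, the spread of the normal distribution, and the seed are all hypothetical.

```python
import random
import string


def make_periodic_series(length, period, alphabet_size, distribution, seed=0):
    """Build a string of `length` symbols that repeats a random segment of size `period`.

    Assumes alphabet_size <= 26, since symbols are lowercase letters.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:alphabet_size]

    def draw():
        if distribution == "uniform":
            return rng.choice(alphabet)
        # "normal": centre on the middle symbol and clamp to the alphabet range.
        index = round(rng.gauss((alphabet_size - 1) / 2, alphabet_size / 6))
        return alphabet[min(max(index, 0), alphabet_size - 1)]

    segment = "".join(draw() for _ in range(period))
    repeats = -(-length // period)      # ceiling division
    return (segment * repeats)[:length]


uniform_series = make_periodic_series(1000, 25, 10, "uniform")
normal_series = make_periodic_series(1000, 32, 10, "normal")
print(uniform_series[:50])
print(normal_series[:64])
```

Under the normal setting the middle symbols of the alphabet dominate, so many suffixes share long prefixes; this is one plausible reason why the shape of the suffix tree, and hence the running time, differs between the two distributions.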
5.3 Optimization Strategies
            This section presents optimization strategies used to
improve the time performance of the algorithm in practice. The
experiments mainly test: 1) sorted versus unsorted occurrence
vectors of edges in terms of the number of values they carry;
2) calculating the periodicity and executing STNR with and without
the edges at the first level of the suffix tree; and 3) ignoring
or considering the edges whose occurrence vectors carry few
values.

5.3.1 Tree Traversal Guided by Sorted Edges
            The tree traversal is guided by the number of leaves a
node (or edge) carries. The edge which leads to more leaves is
traversed first, and the edge which leads to fewer leaves is
traversed later. This results in fewer changes in the occurrence
vectors maintained by STNR. Fig. 5 presents the time taken by STNR
to process the same data set with and without sorting the
occurrence vectors. As expected, STNR saves time when using the
sorted occurrence vectors.

Fig. 7. Number of periods detected with and without checking
first level occurrence vectors.

5.3.2 Periodicity Detection at First Level
            Periodicity detection can be avoided for occurrence
vectors at the first level of the suffix tree for many data sets
because the patterns at the first level are usually subsets of
larger patterns. But this depends heavily on the data set and the
periodic pattern size (or length). In these experiments, we
observed the runtime and the number of nonredundant periods
detected by STNR with and without calculating the periodicity
based on occurrence vectors at the first level of the tree. The
time performance of the algorithm improves significantly when
occurrence vectors at the first level are ignored. This
improvement is expected because occurrence vectors at the first
level always contain more elements (leaves) than those at other
levels. When the first level is not considered, there is hardly
any difference in the number of periods detected in the Wal-Mart
data sets, because almost all the patterns in these data sets
consist of more than one symbol (the daily pattern) and hence are
caught at levels deeper than the first. For PACKET, however, the
algorithm could not find a single periodic pattern when the first
level was ignored. This is because all periodic patterns in PACKET
consist of a single symbol and do not have supersets at the deeper
levels of the tree. Therefore, we may conclude that although
ignoring the occurrence vectors at the first level considerably
improves the time performance, it may miss some very useful
patterns if the data set contains very small periodic patterns
(usually symbol periodicity only). Hence, this strategy should be
applied very carefully.
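
The two strategies discussed so far, visiting edges with more leaves first and optionally skipping the first level, can be illustrated on a toy structure. The sketch below uses a naive uncompressed suffix trie built from nested dictionaries as a stand-in for the suffix tree; it is not the STNR implementation, and the names `build_suffix_trie`, `occurrences`, and `visit` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts; the key '$' stores a suffix start position.

    Assumes '$' does not occur in `text`. Quadratic in size, for illustration only.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node["$"] = start
    return root


def occurrences(node):
    """Occurrence vector of a node: start positions of all suffixes below it."""
    found = []
    for key, child in node.items():
        if key == "$":
            found.append(child)
        else:
            found.extend(occurrences(child))
    return sorted(found)


def visit(node, prefix="", skip_first_level=False, out=None):
    """Collect (pattern, occurrence vector) pairs, children with more leaves first."""
    out = [] if out is None else out
    children = [(k, c) for k, c in node.items() if k != "$"]
    children.sort(key=lambda kc: len(occurrences(kc[1])), reverse=True)
    for key, child in children:
        pattern = prefix + key
        if not (skip_first_level and len(pattern) == 1):
            out.append((pattern, occurrences(child)))
        visit(child, pattern, skip_first_level, out)
    return out


trie = build_suffix_trie("abcabcab")
all_edges = visit(trie)
deeper_only = visit(trie, skip_first_level=True)
print(all_edges[0])      # ('a', [0, 3, 6])
print(deeper_only[0])    # ('ab', [0, 3, 6])
```

With `skip_first_level=True` the single-symbol patterns never reach the periodicity check, which mirrors the PACKET observation: if every periodic pattern is a single symbol, nothing is left to report.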

Fig. 5. Time performance with and without sorted edges.






Fig. 8. Execution time with and without fewer leaves check.

Fig. 9. Number of periods detected with and without fewer leaves
check.

5.3.3 Ignoring Edges Carrying Fewer Leaves
            We may ignore edges (or nodes) that carry a very small
number of leaves, say 1 or 2 percent of the leaves. Such edges
usually do not lead to any valid periodic patterns. Since the
synthetic data sets may be biased, we decided to run this set of
experiments again on the Wal-Mart and PACKET data sets. Here, we
ignored all edges which carry less than 3 percent of the leaves.
The results are presented in Figs. 8 and 9. They show that
ignoring such minority edges hardly affects the number of
nonredundant periods detected in the considered data sets. There
is still a possibility that some (although a very small number of)
periods might be missed because of this strategy, but it does
reduce the algorithm's running time. It is important to note that
these strategies might be very useful when the data set is very
large and a disk-based implementation is used, where only a small
portion of the suffix tree is kept in memory.

5.4 Noise Resilience
            The noise-resilient features of the algorithm have
already been demonstrated in [14], which compared the resilience
to noise of STNR and the other algorithms for each of the five
combinations of noise, i.e., replacement, insertion, deletion,
insertion-deletion, and replacement-insertion-deletion. The
results show that the algorithm compares well with WARP [5] and
performs better than AWSOM [12], CONV [4], and STB [15]. The
latter three algorithms do not perform well because they only
consider synchronous periodic occurrences, while STNR and WARP
perform better because they take into account asynchronous
periodic occurrences which drift from the expected position within
an allowable limit.

5.5 Experiments with Biological Data
            Biological data, e.g., DNA or protein sequences, also
exhibit periodicity for certain patterns. DNA sequences are
constructed from a four-letter alphabet (A, T, C, and G), while
protein sequences are based on a 20-letter alphabet (from A to T).
These sequences have their own specific properties: the periodic
patterns are found only in a subsection of the sequence and do not
span the entire series; the pattern occurrence may drift from the
expected position; and there is a concept of alternative
substrings, where a substring may replace another substring
without any change in the semantics. For example, in the data set
considered in [13], TA may replace TT. In this section, we present
two experiments in which we applied the algorithm to detect
periodicity in protein sequences. The two protein sequences,
namely P09593 and P14593, can be retrieved from the Expert Protein
Analysis System (ExPASy) database server (www.expasy.org). The
protein sequence P09593 is the S-antigen protein in the Plasmodium
falciparum v1 strain. S antigens are soluble heat-stable proteins
present in the sera of some infected individuals.

                               TABLE 4
                        Periodic Pattern Found in P09593
Per    StPos     EndPos      Confidence      Pattern          Repetitions
11     104       338         0.9             ggpgse           19
11     105       313         0.95            gpgsegpkgtg      18

            For the second experiment, we applied STNR to the
protein sequence P14593, which is the accession code for the
circumsporozoite protein. It is the immunodominant surface antigen
on the sporozoite. The sequence of this protein has interesting
repeats which represent the surface antigen of the organism.

                               TABLE 5
                        Periodic Pattern Found in P14593
Per    StPos     StPosMod    EndPos      Confidence      Pattern      Repetitions
4      106       2           162         0.33            d            5
4      108       0           369         1               gn           66
4      115       3           369         1               agn          64
4      122       2           369         0.94            aagn         58

6 CONCLUSIONS AND FUTURE WORK
            Periodicity detection is an essential process in
periodicity mining for discovering potential periodicity rates.
The key idea is that a single algorithm can find symbol, sequence
(partial periodic), and segment (full-cycle) periodicity in the
time series. It can also find periodicity within a subsection of
the time series. The experiments show the time behavior, accuracy,
and noise-resilience characteristics of the method on both real
and synthetic data.
            In phase I, the suffix-tree-based representation was
completed. In phase II, I will complete the occurrence vector, the
difference vector, and the periodicity detection in the time
series, that is, handling the presence of noise and periodicity
within a subsection of the time series.
            Phase I uses the suffix tree. A suffix tree can be
used efficiently to find a substring in the original string, to
find frequent substrings, and to solve other string matching
problems. A suffix tree for a string represents all its suffixes.
For each suffix of the string there is a distinguished path from
the root to a corresponding leaf node in the suffix tree. Given
that a time series is encoded as a string, the most important
aspect of the suffix tree for this work is its capability to
capture and highlight the repetitions of substrings within a
string very efficiently.
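
To relate repetition counts to the confidence values reported in Tables 4 and 5, the following sketch divides the observed repetitions by the number of period-aligned slots in which the pattern could fit between StPos and EndPos. This slot-counting definition is an assumption made for illustration (the section does not state the formula); it happens to reproduce the confidence column of both tables after rounding. The function names are hypothetical.

```python
def perfect_count(st_pos, end_pos, period, pattern_len):
    """Number of starts st_pos, st_pos + period, ... at which a pattern of
    length pattern_len still ends at or before end_pos (assumed definition)."""
    last_start = end_pos - pattern_len + 1
    if last_start < st_pos:
        return 0
    return (last_start - st_pos) // period + 1


def confidence(repetitions, st_pos, end_pos, period, pattern_len):
    """Observed repetitions divided by the assumed perfect count."""
    return repetitions / perfect_count(st_pos, end_pos, period, pattern_len)


# (period, StPos, EndPos, pattern, repetitions, reported confidence)
rows = [
    (11, 104, 338, "ggpgse", 19, 0.90),       # Table 4
    (11, 105, 313, "gpgsegpkgtg", 18, 0.95),  # Table 4
    (4, 106, 162, "d", 5, 0.33),              # Table 5
    (4, 108, 369, "gn", 66, 1.00),            # Table 5
    (4, 115, 369, "agn", 64, 1.00),           # Table 5
    (4, 122, 369, "aagn", 58, 0.94),          # Table 5
]

computed = [
    round(confidence(reps, st, end, period, len(pattern)), 2)
    for period, st, end, pattern, reps, _ in rows
]
for (_, _, _, pattern, _, reported), value in zip(rows, computed):
    print(f"{pattern:12s} computed={value:.2f} reported={reported:.2f}")
```

For example, the pattern aagn has 62 possible slots between positions 122 and 369 with period 4, and 58 observed repetitions, giving 58/62, which rounds to the reported 0.94.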






REFERENCES

   1.    F. Rasheed, M. Alshalalfa, and R. Alhajj, “Efficient
         Periodicity Mining in Time Series Databases Using
         Suffix Trees,” vol. 23, no. 1, Jan. 2011.
   2.    A. Al-Rawi, A. Lansari, and F. Bouslama, “A New
         Non-Recursive Algorithm for Binary Search Tree
         Traversal,” vol. 2, pp. 770-773, Dec. 2003.
   3.    C. Berberidis, W. Aref, M. Atallah, I. Vlahavas, and A.
         Elmagarmid, “Multiple and Partial Periodicity Mining
         in Time Series Databases,” July 2002.
   4.    M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid,
         “Periodicity Detection in Time Series Databases,” vol.
         17, no. 7, pp. 875-887, July 2005.
   5.    M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid,
         “WARP: Time Warping for Periodicity Detection,”
         Nov. 2005.
   6.    J. Fayolle and M.D. Ward, “Analysis of the Average
         Depth in a Suffix Tree under a Markov Model,” pp.
         95-104, 2005.
   7.    R. Grossi and G.F. Italiano, “Suffix Trees and Their
         Applications in String Algorithms,” pp. 57-76, Sept.
         1993.
   8.    J. Han, Y. Yin, and G. Dong, “Efficient Mining of
         Partial Periodic Patterns in Time Series Database,” p.
         106, 1999.
   9.    K.-Y. Huang and C.-H. Chang, “SMCA: A General
         Model for Mining Asynchronous Periodic Patterns in
         Temporal Databases,” vol. 17, no. 6, pp. 774-785, June
         2005.
   10.   P. Indyk, N. Koudas, and S. Muthukrishnan,
         “Identifying Representative Trends in Massive Time
         Series Data Sets Using Sketches,” Sept. 2000.
   11.   S. Ma and J. Hellerstein, “Mining Partially Periodic
         Event Patterns with Unknown Periods,” Apr. 2001.
   12.   S. Papadimitriou, A. Brockwell, and C. Faloutsos,
         “Adaptive, Hands-Off Stream Mining,” pp. 560-571,
         2003.
   13.   F. Rasheed, M. Alshalalfa, and R. Alhajj, “Adapting
         Machine Learning Technique for Periodicity Detection
         in Nucleosomal Locations in Sequences,” pp. 870-879,
         Dec. 2007.
   14.   F. Rasheed and R. Alhajj, “STNR: A Suffix Tree
         Based Noise Resilient Algorithm for Periodicity
         Detection in Time Series Databases,” vol. 32, no. 3,
         pp. 267-278, 2010.
   15.   F. Rasheed and R. Alhajj, “Using Suffix Trees for
         Periodicity Detection in Time Series Databases,” Sept.
         2008.
   16.   Y.A. Reznik, “On Tries, Suffix Trees, and Universal
         Variable-Length-to-Block Codes,” p. 123, 2002.
   17.   C. Sheng, W. Hsu, and M.-L. Lee, “Mining Dense
         Periodic Patterns in Time Series Data,” p. 115, 2005.
   18.   Y. Tian, S. Tata, R.A. Hankins, and J.M. Patel,
         “Practical Methods for Constructing Suffix Trees,” vol.
         14, no. 3, pp. 281-299, Sept. 2005.
   19.   E. Ukkonen, “Online Construction of Suffix Trees,”
         vol. 14, no. 3, pp. 249-260, 1995.
   20.   N. Välimäki, W. Gerlach, K. Dixit, and V. Mäkinen,
         “Compressed Suffix Tree—A Basis for Genome-Scale
         Sequence Analysis,” vol. 23, pp. 629-630, 2007.
   21.   A. Weigend and N. Gershenfeld, Time Series
         Prediction: Forecasting the Future and Understanding
         the Past. Addison-Wesley, 1994.
   22.   J. Yang, W. Wang, and P. Yu, “InfoMiner+: Mining
         Partial Periodic Patterns with Gap Penalties,” Dec.
         2002.



