National Conference on Role of Cloud Computing Environment in Green Communication 2012 534
MINING PATTERNS WITH SEQUENCE AND SEGMENT PERIODICITY IN TIME
T.SARANYA, Reg no. 97310405015
The Rajaas Engineering College, Vadakkankulam.
Abstract

Periodic pattern mining, or periodicity detection, is a tool that helps in predicting the behavior of time series data. It has a number of applications, such as prediction, forecasting, and detection of unusual activities. This project detects periodicity within a subsection of a time series, reduces the effect of noise, and investigates different periodicity types (namely symbol, sequence, and segment). The proposed work is an algorithm which can detect symbol, sequence (partial), and segment (full-cycle) periodicity in time series. The algorithm uses a suffix tree as the data structure. The algorithm is noise-resilient: it has been successfully demonstrated to work with replacement, insertion, deletion, or a mixture of these types of noise. The proposed algorithm has been tested on both synthetic and real data from different domains, including protein sequences. It is generally more time-efficient and noise-resilient than existing algorithms.

*********************

Introduction

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. It is the process of discovering new patterns from large data sets, involving methods from statistics, artificial intelligence, and database management. In contrast to machine learning, the emphasis lies on the discovery of previously unknown patterns as opposed to generalizing known patterns to new data. The term also covers large-scale data or information processing such as collection, extraction, warehousing, analysis, and statistics, and has been generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence.

The key term is discovery, commonly defined as detecting something new. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources and can be integrated with new products and systems as they are brought online. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases.

A time series is a collection of data values gathered generally at uniform intervals of time to reflect certain behavior of an entity. Real life has several examples of time series, such as weather conditions of a particular location, spending patterns, stock growth, transactions in a superstore, network delays, power consumption, computer network fault analysis and security breach detection, earthquake prediction, and gene expression data analysis. A time series is mostly characterized by being composed of repeating cycles. For instance, there is a traffic jam twice a day when the schools are open; the number of transactions in a superstore is high at certain periods during the day, certain days during the week, and so on. In other words, periodicity detection is a process for finding temporal regularities within the time series, and the goal of analyzing a time series is to find whether and how frequently a periodic pattern (full or partial) is repeated within the series.

In general, three types of periodic patterns can be detected in a time series: 1) symbol periodicity, 2) sequence periodicity or partial periodic patterns, and 3) segment or full-cycle periodicity. A time series is said to have symbol periodicity if at least one symbol is repeated periodically. For example, in time series T = abd acb aba abc, symbol a is periodic with periodicity p = 3, starting at position zero (stPos = 0). Similarly, a pattern consisting of more than one symbol may be periodic in a time series, and this leads to partial periodic patterns. For instance, in time series T = bbaa abbd abca abbc abcd, the sequence ab is periodic with periodicity p = 4, starting at position 4 (stPos = 4); and the partial periodic pattern ab** exists in T, where * denotes any symbol or a don't-care mark. Finally, if the whole time series can be mostly represented as a repetition of a pattern or segment, then this type of periodicity is called segment or full-cycle periodicity. For instance, the time series T = abcab abcab abcab has segment periodicity of 5 (p = 5) starting at the first position (stPos = 0), i.e., T consists of only three occurrences of the segment abcab.

Developing a noise-resilient periodicity detection algorithm involves three issues: 1) identifying all periodic patterns, 2) handling asynchronous periodicity by locating all periodic patterns, and 3) investigating the whole time series. The algorithm proposed in this paper can detect periodic patterns found in subsections of the time series, prune redundant periods, and follow various optimization strategies. It is also applicable to biological data sets and has been analyzed for time performance and space consumption. Contributions of our work can be summarized as follows:
1. The development of a comprehensive suffix-tree-based algorithm that can simultaneously detect symbol, sequence, and segment periodicity.
2. Finding periodicity within a subsection of the series.
3. Reporting nonredundant periods by applying pruning techniques to eliminate redundant periods.
4. Analysis of the algorithm for time performance and space consumption by considering three cases: the worst case, the average case, and the best case.
5. A number of optimization strategies are presented.
6. The proposed algorithm is shown to be applicable to biological data sets such as DNA and protein sequences, and the results are compared with those produced by other existing algorithms like SMCA.
7. Various experiments have been conducted to demonstrate the time efficiency, robustness, scalability, and accuracy of the reported results by comparing the proposed algorithm with existing state-of-the-art algorithms like CONV, WARP, and partial periodic patterns (ParPer).

2 RELATED WORKS

Existing literature on time series analysis roughly covers two types of algorithms: 1) the first category includes algorithms that require the user to specify the period and then look only for patterns occurring with that period; 2) the second category includes algorithms that look for all possible periods in the time series. Another way to classify existing algorithms is based on the type of periodicity
Department of CSE, Sun College of Engineering and Technology
they detect: some detect only symbol periodicity, some detect only sequence or partial periodicity, while others detect only full-cycle or segment periodicity.

Earlier algorithms require the user to provide the expected period value, which is used to check the time series for corresponding periodic patterns. For example, in power consumption time series, a user may test for weekly, biweekly, or monthly periods. However, it is usually difficult to provide expected period values, and this approach prohibits finding unexpected periods, which might be more useful than the expected ones.

Recently, a few algorithms have appeared which look for all possible periods by considering the range 2 ≤ p ≤ n/2. Elfeky et al. proposed two separate algorithms to detect symbol and segment periodicity in time series. Their first algorithm (CONV) is based on the convolution technique with reported complexity of O(n log n). Although the algorithm works well with data sets having perfect periodicity, it fails to perform well when the time series contains insertion and deletion noise. Realizing the need to work in the presence of noise, Elfeky et al. later presented an O(n^2) algorithm (WARP), which performs well in the presence of insertion and deletion noise. WARP uses the time warping technique to accommodate insertion or deletion noise in the data. But, besides having O(n^2) complexity, WARP can only detect segment periodicity; it cannot find symbol or sequence periodicity. Also, both CONV and WARP can only detect periodicity which lasts till the very end of the time series, i.e., they cannot detect patterns which are periodic only in a subsection of the time series.

Sheng et al. developed an algorithm based on Han's ParPer algorithm to detect periodic patterns in a section of the time series; their algorithm utilizes optimization steps to find dense periodic areas in the time series. But their algorithm, being based on ParPer, requires the user to provide the expected period value.

Recently, Huang and Chang presented an algorithm for finding asynchronous periodic patterns, where the periodic occurrences can be shifted within an allowable range along the time axis. This is somewhat similar to how our algorithm deals with noisy data by utilizing a time tolerance window for periodic occurrences.

3 PROPOSED WORKS

To overcome the problems in existing schemes, the proposed method performs periodic pattern (periodicity) detection for different periodicity types: 1) symbol periodicity, 2) sequence periodicity, and 3) segment periodicity. The proposed algorithm can detect symbol, sequence (partial), and segment (full-cycle) periodicity in time series. It uses a suffix tree as the data structure. The algorithm is noise-resilient and works with replacement, insertion, deletion, or any mixture of these types of noise. It checks for all possible periods starting from all possible positions within a prespecified range, whether the whole time series or a subsection of the time series, and applies various redundant period pruning techniques to output a small number of useful periods by removing most of the redundant periods. It has also been analyzed for time performance.

4 METHODS

The method involves three phases. In the first phase, we build the suffix tree for the time series; in the second phase, we use the suffix tree to calculate the periodicity of various patterns in the time series. The third phase is redundant period pruning, i.e., ignoring a redundant period ahead of time. As an immediate benefit of redundant period pruning, the algorithm does not waste time investigating a period which has already been identified as redundant. This saves considerable time and also results in reporting fewer but more useful periods. This is the primary reason why our algorithm, intentionally, reports significantly fewer periods without missing any existing periods during the pruning process.

4.1 First Phase: Suffix-Tree-Based Representation

The suffix tree is a well-known data structure that has proven very useful in string processing. It can be used efficiently to find a substring in the original string, to find frequent substrings, and to solve other string matching problems. A suffix tree for a string represents all its suffixes; for each suffix of the string there is a distinguished path from the root to a corresponding leaf node in the suffix tree. Given that a time series is encoded as a string, the most important aspect of the suffix tree, related to our work, is its capability to very efficiently capture and highlight the repetitions of substrings within a string. Fig. 1 shows a suffix tree for the string abcabbabb$, where $ denotes the end marker for the string; it is a unique symbol that does not appear anywhere else in the string.

Fig. 1. The suffix tree for the string abcabbabb$.

The path from the root to any leaf represents a suffix of the string. Since a string of length n has exactly n suffixes, the suffix tree for a string also contains exactly n leaves. Each edge is labeled by the string that it represents. Each leaf node holds a number that represents the starting position of the suffix yielded when traversing from the root to that leaf. Each intermediate node holds a number which is the length of the substring read when traversing from the root to that intermediate node. Each intermediate edge reads a string which is repeated at least twice in the original string.

A suffix tree can be generated in linear time using Ukkonen's algorithm. Ukkonen's algorithm works online, i.e., a suffix tree can be extended as new symbols are added to the string. A suffix tree for a string of length n can have at most 2n nodes, and the average depth of the suffix tree is of order log(n). It is not necessary to always keep a suffix tree in memory; there are algorithms for handling disk-based suffix trees, making the suffix tree a preferred choice for processing very large strings such as time series and DNA sequences, which grow into the billions.

4.2 Occurrence Vector and Difference Vector

The second phase works with occurrence vectors and difference vectors. We traverse the tree in bottom-up order to construct the occurrence vector for each edge connecting an internal node to its parent. First, we start with nodes having only leaf nodes as children. Each such node passes the values of its children (leaf nodes) to the edge connecting it to its parent node. The values are used by the latter edge to create its occurrence vector (denoted occur_vec in the algorithm). The occurrence vector of edge e contains the index positions at which the substring from the root to edge e exists in the original string. Second, we consider each node v having a mixture of leaf and nonleaf nodes as children. The occurrence vector of the edge connecting v to its parent node is constructed by combining the occurrence vector(s) of the edge(s) connecting v to its nonleaf child node(s) and the value(s) coming from its leaf child node(s). Finally, until we reach all direct children of the root, we recursively consider each node u having only nonleaf children. The occurrence vector of the edge connecting u to its parent node is constructed by combining the occurrence vector(s) of the edge(s) connecting u to its child node(s). Applying this bottom-up traversal process on the suffix tree shown in Fig. 1 will produce the occurrence vectors reported in Fig. 2.
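As a rough illustration of what the occurrence vectors contain, the following Python sketch collects, for every repeated substring of a short string, the positions at which it starts, and then takes the differences between successive positions. It is not the suffix-tree traversal described above: it assumes a naive enumeration of all substrings, which is only practical for toy inputs, and the function names occurrence_vectors and difference_vector are hypothetical.

```python
from collections import defaultdict


def occurrence_vectors(series, min_count=2):
    """Map each repeated substring of `series` to its sorted start positions.

    Assumption: a naive enumeration of all substrings stands in for the
    bottom-up suffix-tree traversal; it is meant for short strings only.
    """
    positions = defaultdict(list)
    n = len(series)
    for start in range(n):
        for end in range(start + 1, n + 1):
            # start grows monotonically, so each list stays sorted
            positions[series[start:end]].append(start)
    return {sub: occ for sub, occ in positions.items() if len(occ) >= min_count}


def difference_vector(occur_vec):
    """Differences between successive occurrence positions."""
    return [b - a for a, b in zip(occur_vec, occur_vec[1:])]


if __name__ == "__main__":
    # The string of Fig. 1, without the $ end marker.
    vecs = occurrence_vectors("abcabbabb")
    for sub in ("ab", "abb", "b"):
        print(sub, vecs[sub], difference_vector(vecs[sub]))
```

Substrings that occur only once (such as c in this string) are dropped, which mirrors the remark that every intermediate edge reads a string repeated at least twice.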
Fig. 2. Suffix tree for string abcabbabb$ after bottom-up traversal.

The periodicity detection algorithm uses the occurrence vector of each intermediate edge to check whether the string represented by the edge is periodic. The tree traversal process is implemented using a nonrecursive explicit stack-based algorithm, which prevents the program from throwing a stack-overflow exception.

4.2.1 Periodicity Detection Algorithm

The periodicity detection algorithm is applied at each intermediate edge using its occurrence vector. Our algorithm is linear-distance-based: we take the difference between any two successive occurrence vector elements, leading to another vector called the difference vector. It is important to understand that we do not actually keep any such vector in memory; it is considered only for the sake of explanation. Table 1 presents example occurrence and difference vectors.

Each value in the difference vector is a candidate period starting from the corresponding occurrence vector value (the value of occur_vec in the same row). Recall that each period can be represented by a 5-tuple (X, p, stPos, endPos, conf), denoting the pattern, period value, starting position, ending position, and confidence, respectively. For each candidate period (p = diff_vec[j]), the algorithm scans the occurrence vector starting from its corresponding value (stPos = occur_vec[j]) and increases the frequency count of the period, freq(p), if and only if the occurrence vector value is periodic with regard to stPos and p.

TABLE 1
An Example Occurrence Vector and Its Corresponding Difference Vector
Occur_vec   Diff_vec
0           3
12          4
16          5
21          3
24          3
27          11
38          7
45          3

4.2.2 Periodicity Detection in Presence of Noise

Three types of noise generally considered in time series data are replacement, insertion, and deletion noise. In replacement noise, some symbols in the discretized time series are replaced at random with other symbols. In case of insertion and deletion noise, some symbols are inserted or deleted, respectively, randomly at different positions (or time values). Noise can also be a mixture of these three types. For instance, RI type noise means the uniform mixture of replacement (R) and insertion (I) noise. Insertion and deletion noise expand or contract the time axis, leading to a shift of the original time series values. For example, time series T = abcabcabc after inserting symbol b at positions 2 and 6 would be T' = abbcabcabbc. The occurrence vector of symbol a in T is (0, 3, 6), while it is (0, 4, 7) in T'. It is very clear that when the time series is distorted by insertion and/or deletion noise, linear-distance-based algorithms do not perform well. To handle this, we use a specified limit called time tolerance (denoted as tt in the algorithm). This means that if the periodic occurrence is expected to be found at positions x, x + p, x + 2p, ..., then with time tolerance, occurrences at x, x + p ± tt, x + 2p ± tt, ... would also be considered valid.

4.2.3 Periodicity Detection in a Subsection of Time Series

The periodicity detection algorithm calculates all patterns which are periodic starting from any position less than half the length of the time series (stPos < n/2) and continuing till the end of the time series or till the last occurrence of the pattern. But, in real-life situations, a pattern may be periodic only within a section and not in the entire time series. For example, the traffic pattern near schools during semester break is different from that during the semester, the movie rental pattern in winter is different from that in summer, and the hourly number of transactions at a superstore may show different periodicity during the holiday season (15 November to 15 January) than its regular periodicity. Consider time series T = bcdabababababccdabcadad: pattern ab is perfectly periodic only in the range [3, 12]. Such periodicity might be very interesting in DNA sequences in particular and in regular time series in general.

In order to detect such periodicity, we employ a previously introduced concept in which two parameters, dmax and minLength, are utilized. The first parameter, dmax, denotes the maximum distance between any two occurrences of a pattern for them to be part of the same periodic section. Consequently, a distance between two occurrences of more than dmax may potentially mark the end of one periodic section and/or the start of a new periodic section for the same pattern.

4.5 Redundant Period Pruning Techniques

Periodicity detection algorithms generally do not prune or prohibit the calculation of redundant periods. The immediate drawback is reporting a huge number of periods, which makes it more challenging to find the few useful and meaningful periodic patterns within the large pool of reported periods.

TABLE 2
An Example Result of a Periodicity Detection Algorithm
Pattern (X)   Period (p)   Starting Position (stPos)
ab            5            0
ab            10           0
ab            25           0
a             5            0
a             15           0
b             5            1
b             10           1
a             5            10
ab            5            20

Assume the periods reported by an algorithm are as presented in Table 2. Looking carefully at this result, it can easily be seen that all these periods can be replaced by just one period, namely (X = ab, p = 5, stPos = 0), and all other periods may be considered mere repetitions of that period.

Empowered by redundant period pruning, our algorithm not only saves the time of the users observing the produced results, but also saves the time spent computing the periodicity by the mining algorithm itself. We implemented the redundant period pruning techniques as prohibitive steps which prevent the algorithm from handling redundant periods.

5 EXPERIMENTAL RESULTS

In this section, we present the results of several experiments that have been conducted using both synthetic and real data; we also report the results of testing various characteristics of our algorithm against other existing algorithms. Hereafter, our algorithm is denoted as STNR (Suffix-Tree-based Noise-Resilient algorithm).

5.1 Accuracy

The first set of tests is dedicated to demonstrating the completeness of STNR in the sense that it should be able to
find a period once it exists in the time series. STNR satisfies this on both synthetic and real data.

5.1.1 Synthetic Data

The parameters controlled during data generation are data distribution (uniform or normal), alphabet size, size of the data, period size, and the type and amount of noise in the data. The data may contain replacement, insertion, and deletion noise or any mixture of these types of noise.

The algorithm can find all periodic patterns with 100 percent confidence regardless of the data distribution, alphabet size, period size, and data size. This is an immediate benefit of using the suffix tree, which guarantees identifying all repeating patterns. Since the algorithm checks the periodicity for all repeating patterns, it can detect all existing periods in the inerrant data. Results with noisy data are presented later.

5.1.2 Real Data

In this paper, we used a data set (denoted PACKET) which contains the packet Round Trip Time (RTT) delay. These data have been selected to demonstrate the adaptability of STNR for handling cases where periods are contained only within a subsection of the time series. The data have been discretized uniformly into 26 symbols.

TABLE 3
Periodic Patterns in Packet Data
Pattern   Period  StPos   EndPos  Confidence
aa        2       146698  155081  0.42
aa        2       180136  186061  0.42
aa        2       297362  304371  0.47
aaa       3       146772  155064  0.44
aaa       3       180897  186065  0.43
aaa       3       297525  304367  0.47
aaa*a     5       147030  155044  0.44
aaaa*     5       182390  186064  0.36
aaa**     5       297480  304324  0.44
****aa*   7       147203  155042  0.46
**aa**a   7       148841  155042  0.46
aa*****   7       149394  155070  0.46
a**aaaa   7       182091  186059  0.4
a*a*aa*   7       299362  304324  0.49

The Packet data have been used in earlier work, where the authors found the dense periodic patterns, i.e., the areas where the time series is mostly periodic. The three regions which were found periodic for symbol a and its patterns are [148187-155084], [181413-186097], and [300522-304372]. Accordingly, the target of this set of experiments is to confirm that STNR can also find the periodicity within a subsection of the time series.

5.2 Time Performance

This section reports the time performance of STNR compared to ParPer, CONV, and WARP. We test the time behavior of the compared algorithms from three perspectives: varying data size, period size, and noise ratio.

5.2.1 Varying Data Size

First, we compare the performance of STNR against ParPer, which finds the periodic patterns for a specific period. The synthetic data used in the testing have been generated by following uniform distribution with alphabet size of 10 and embedded period value of 32. The algorithms have been run on these data by varying the time series size from 100,000 to 1,000,000. In case ParPer and the other similar algorithms are extended to find patterns with all period values, their complexity would jump up to O(n^2). Even with O(n^2) complexity, ParPer only finds partial periodic patterns, while STNR can find all three types of periodic patterns in the data, namely symbol, sequence, and segment periodicity. Further, compared to STNR, ParPer does not detect periodicity within a subsection of a time series and it is not resilient to noise (especially insertion/deletion). Due to the differences highlighted above, the quality of the results produced by STNR could be classified as better than that produced by the ParPer algorithm.

STNR can even achieve this result when the series contains insertion and deletion noise and when the periodicity is only found in a subsection of the time series and not in the entire time series. ParPer is not capable of achieving this. Similarly, ParPer cannot detect the periodicity of singular events (termed symbol periodicity), which might be prevalent in the time series.

5.2.2 Varying Period Size

This set of experiments is intended to show the behavior of STNR under a varying embedded period size. For this experiment, we fixed the time series length and the alphabet size of the series and varied the embedded period size from 5 to 100. The time taken by STNR for both uniform and normal distribution has been recorded and the corresponding curves are plotted in Fig. 5. From Fig. 5, we can easily observe the effect of changing the period value on time for ParPer and CONV. The results show that the time performance of both STNR and CONV does not change significantly and remains more or less the same. But ParPer does get affected as the period size increases; this is because when the period value is large, the partial periodic pattern is large as well, and so is the maxsubpattern tree. Time performance of CONV remains the same as it checks for all possible periods irrespective of the data set.

Fig. 3. Time behavior with varying period size (alphabet size = 10).

5.2.3 Varying Noise Ratio

The next set of experiments measures the impact of noise ratio on the time performance of STNR. For this experiment, we fixed the time series length, period size, alphabet size, and data distribution and measured
the impact of varying noise ratio on the time performance of the algorithm. We tested two sets of data; one contains replacement noise and the other contains insertion noise. The results are plotted in Fig. 6. Again, these experiments have been conducted using the three algorithms: ParPer, STNR, and CONV. It is true that STNR takes more time when the noise ratio is small (10-15 percent); but when the noise ratio is increased, STNR tends to take a similar amount of time. As we ignore the intermediate edges (or the periodic patterns) which carry less than 1 percent of the leaf nodes (or patterns which appear rarely), noise is mostly caught in these infrequent patterns and does not affect the time performance of STNR by a significant margin. As the results show, the noise ratio does not affect the time performance of CONV and ParPer.
5.2.4 Effect of Data Distribution

Finally, we tested the time performance of STNR, ParPer, WARP, and CONV for combinations of data distribution and period size. We tested two series; one has been generated with uniform data distribution while the other exhibits normal distribution. Period sizes of 25 and 32 are embedded in the time series, and the time performance is measured for series lengths between 100,000 and 1,000,000. The results are presented in Fig. 7. For STNR, the uniform distribution seems to take less time than the normal distribution. Since the shape of the suffix tree depends on the data distribution, it is understandable that different data distributions take different amounts of time. But an important observation is that the time behavior, or the time pattern, is similar for both uniform and normal distributions and for different period size combinations.

For CONV and WARP, the data distribution does not affect the time performance either, because they check the periods exhaustively. The time performance of ParPer is not affected by the data distribution, but it seems to take more time when the period size is increased (as pointed out earlier).

5.3 Optimization Strategies

Optimization strategies are used to improve the time performance of the algorithm in practice. The optimization techniques mainly test for: 1) sorted and unsorted occurrence vectors of edges in terms of the number of values they carry, 2) calculating the periodicity and executing STNR by including/excluding the edges at the first level of the suffix tree, and 3) ignoring or considering the edges whose occurrence vectors carry fewer values.

5.3.1 Tree Traversal Guided by Sorted Edges

The tree traversal is guided by the number of leaves a node (or edge) carries. The edge which leads to more leaves is traversed first and the edge which leads to fewer leaves is traversed later. This results in fewer changes in the occurrence vectors maintained by STNR. Fig. 8 presents the time taken by STNR to process the same data set with and without employing sorting on occurrence vectors. As expected, STNR conserves time when using the sorted occurrence vectors.

Fig. 5. Time performance with and without sorted edges.

5.3.2 Periodicity Detection at First Level

Periodicity detection can be avoided for occurrence vectors at the first level of the suffix tree for many data sets, because usually the patterns at the first level are subsets of the larger patterns. But this depends heavily on the data set and the periodic pattern size (or length). In these experiments, we observed the runtime and the number of nonredundant periods detected by STNR with and without calculating the periodicity based on occurrence vectors at the first level of the tree. The time performance of the algorithm improves significantly when occurrence vectors at the first level are ignored. The time improvement is expected because occurrence vectors at the first level always contain more elements (leaves) than those at other levels. By not considering the first level, there is hardly any difference in the number of periods detected in the Wal-Mart data sets, because almost all the patterns in these data sets consist of more than one symbol (the daily pattern) and hence are caught at levels deeper than the first level. But as far as PACKET is concerned, the algorithm could not find a single periodic pattern when the first level is ignored. This is because all periodic patterns in PACKET consist of a single symbol and do not have their supersets at the deeper levels of the tree. Therefore, we may conclude that although ignoring the occurrence vectors at the first level considerably improves the time performance, it may miss some very useful patterns if the data set contains very small periodic patterns (usually symbol periodicity only). Hence, this strategy should be followed very carefully.

Fig. 6. Execution time with and without checking first level occurrence vectors.

Fig. 7. Number of periods detected with and without checking first level occurrence vectors.
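The traversal choices discussed in Sections 5.3.1 and 5.3.2, together with the third strategy listed in Section 5.3, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the STNR implementation: the Node class, the visit_order function, its three parameters, and the hand-built toy tree are all hypothetical, and leaf counts are recomputed on demand for brevity rather than cached.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str = ""                                       # label of the edge entering this node
    children: List["Node"] = field(default_factory=list)  # an empty list marks a leaf


def leaf_count(node):
    """Number of leaves below `node`, counted with an explicit stack."""
    count, stack = 0, [node]
    while stack:
        cur = stack.pop()
        if cur.children:
            stack.extend(cur.children)
        else:
            count += 1
    return count


def visit_order(root, sort_edges=True, skip_first_level=False, min_leaves=0):
    """Return the path strings of internal edges in the order they are examined."""
    order = []
    stack = [(root, "", 0)]          # explicit stack instead of recursion
    while stack:
        node, path, depth = stack.pop()
        if not node.children:
            continue                 # leaves have no occurrence vector of their own
        if depth > 0:
            if leaf_count(node) < min_leaves:
                continue             # ignore edges carrying too few leaves (and their subtrees)
            if not (skip_first_level and depth == 1):
                order.append(path)
        if sort_edges:
            # ascending sort: the child with most leaves is pushed last and popped first
            kids = sorted(node.children, key=leaf_count)
        else:
            kids = list(reversed(node.children))  # keep the stored child order
        for child in kids:
            stack.append((child, path + child.label, depth + 1))
    return order


def toy_tree():
    """Hand-built toy tree (hypothetical, not a real suffix tree)."""
    return Node("", [
        Node("a", [Node("x"), Node("y")]),
        Node("b", [Node("b", [Node("x"), Node("y")]), Node("x"), Node("y"), Node("z")]),
        Node("c"),
    ])


if __name__ == "__main__":
    tree = toy_tree()
    print(visit_order(tree))                         # edges with more leaves first
    print(visit_order(tree, sort_edges=False))       # stored order
    print(visit_order(tree, skip_first_level=True))  # first-level edges not examined
    print(visit_order(tree, min_leaves=3))           # sparse edges ignored
```

In this toy tree the edge b carries the most leaves, so it is examined before a when sorting is enabled; with skip_first_level set, only the deeper edge bb is examined, which illustrates why single-symbol patterns can be missed when the first level is ignored.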
Fig. 8. Execution time with and without fewer leaves check.
Fig. 9. Number of periods detected with and without fewer leaves
check.
5.3.3 Ignoring Edges Carrying Fewer Leaves
We may ignore edges (or nodes) that carry a very small
number of leaves, say 1 or 2 percent of the leaves. Such edges
usually do not lead to any valid periodic patterns. Since the
synthetic data sets may be biased, we decided to run this set of
experiments again on the Wal-Mart and PACKET data sets. Here, we
ignored all edges which carry less than 3 percent of the leaves.
The results are presented in Figs. 8 and 9. The results show that
ignoring such minority edges hardly affects the number of
nonredundant periods detected in the considered data sets. There
is still a possibility that some (although a very small number
of) periods might be missed because of this strategy, but again it
does conserve the algorithm's running time. Here, it is important
to note that these strategies might be very useful when the data
set is very large and a disk-based implementation is to be used
where only a small portion of the suffix tree is kept in memory.
5.4 Noise Resilience
We have already demonstrated in  the noise-resilient features
of the algorithm, where we compared the resilience to noise of
STNR and the other algorithms based on each of the five
combinations of noise, i.e., replacement, insertion, deletion,
insertion-deletion, and replacement-insertion-deletion. The
results show that our algorithm compares well with WARP and
performs better than AWSOM, CONV, and STB. The latter three
algorithms do not perform well because they only consider
synchronous periodic occurrences, while STNR and WARP perform
better because we take into account asynchronous periodic
occurrences which drift from the expected position within an
allowable limit.
5.5 Experiments with Biological Data
Biological data, e.g., DNA or protein sequences, also
exhibit periodicity for certain patterns. DNA sequences are
constructed from a four-letter alphabet (A, T, C, and G), while
protein sequences are based on a 20-letter alphabet (from A to T).
These sequences have their own specific properties: the periodic
patterns are only found in a subsection of the sequence and do not
span the entire series; the pattern occurrence may drift from the
expected position; and there is a concept of alternative
substrings, where a substring may replace another substring
without any change in the semantics. For example, in the data set
considered in , TA may replace TT. In this section, we present two
experiments where we applied the algorithm for detecting the
periodicity in protein sequences. The two protein sequences,
namely P09593 and P14593, can be retrieved from the Expert
Protein Analysis System (ExPASy) database server
(www.expasy.org). The protein sequence P09593 is the S antigen
protein in the Plasmodium falciparum v1 strain. S antigens are
soluble heat-stable proteins present in the sera of some infected
Periodic Pattern Found in P09593
Per  StPos  EndPos  Confidence  Pattern      Repetitions
11   104    338     0.9         ggpgse       19
11   105    313     0.95        gpgsegpkgtg  18
For the second experiment, we applied STNR to the
protein sequence P14593, which codes for the circumsporozoite
protein. It is the immunodominant surface antigen on the
sporozoite. The sequence of this protein has interesting repeats
which represent the surface antigen of the organism.
Periodic Pattern Found in P14593
Per  StPos  StPosMod  EndPos  Confidence  Pattern  Repetitions
4    106    2         162     0.33        d        5
4    108    0         369     1           gn       66
4    115    3         369     1           agn      64
4    122    2         369     0.94        aagn     58
6 CONCLUSIONS AND FUTURE WORK
The periodicity detection method is an essential process in
periodicity mining for discovering potential periodicity rates.
The key idea is that a single algorithm can find symbol, sequence
(partial periodic), and segment (full-cycle) periodicity in the
time series. It can also find the periodicity within a subsection
of the time series. The method was used to show the time
behavior, accuracy, and noise-resilience characteristics of the
data. The algorithm was run on both real and synthetic data.
In phase I, the first phase, the suffix-tree-based
representation was completed. In phase II, I will complete the
occurrence vector, the difference vector, and periodicity
detection in the time series; that is, detect the presence of
noise and calculate the subsection of the time series.
In phase I, a suffix tree is used. It can be used
efficiently to find a substring in the original string, to find
frequent substrings, and to solve other string-matching problems.
A suffix tree for a string represents all its suffixes. For each
suffix of the string there is a distinguished path from the root
to a corresponding leaf node in the suffix tree. Given that a time
series is encoded as a string, the most important aspect of the
suffix tree for this work is its capability to capture and
highlight the repetitions of substrings within a string very
efficiently.
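As a rough illustration of the occurrence vector and difference
vector idea, the following minimal sketch works on a time series
encoded as a string. It is not the suffix-tree implementation:
str.find stands in for collecting the leaves below a tree node, and
no drift tolerance for noise is modeled. The function names and the
confidence formula (observed hits divided by the number of expected
positions from the starting position onward) are assumptions made
for this example.

```python
def occurrence_vector(series, pattern):
    """Start positions of every (possibly overlapping) occurrence of pattern."""
    positions = []
    start = series.find(pattern)
    while start != -1:
        positions.append(start)
        start = series.find(pattern, start + 1)
    return positions


def difference_vector(positions):
    """Gaps between consecutive occurrences; each gap is a candidate period."""
    return [b - a for a, b in zip(positions, positions[1:])]


def confidence(series, pattern, period, st_pos):
    """Fraction of positions st_pos, st_pos + period, ... where pattern occurs."""
    expected = range(st_pos, len(series) - len(pattern) + 1, period)
    if len(expected) == 0:
        return 0.0
    hits = sum(1 for i in expected if series.startswith(pattern, i))
    return hits / len(expected)


def candidate_periods(series, pattern):
    """Best confidence found for each candidate period of pattern."""
    positions = occurrence_vector(series, pattern)
    best = {}
    for st_pos, period in zip(positions, difference_vector(positions)):
        conf = confidence(series, pattern, period, st_pos)
        best[period] = max(conf, best.get(period, 0.0))
    return best


# Example series T = abd acb aba abc, written without spaces.
T = "abdacbabaabc"
print(occurrence_vector(T, "a"))
print(difference_vector(occurrence_vector(T, "a")))
print(candidate_periods(T, "a"))
```

For the symbol a in T, the occurrence vector is [0, 3, 6, 8, 9] and
the difference vector is [3, 3, 2, 1]; period 3 starting at position
0 is matched at every expected position, so its confidence is 1.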
REFERENCES
1. F. Rasheed, M. Alshalalfa, and R. Alhajj,
“Efficient Periodicity Mining in Time Series
Databases Using Suffix Trees,” vol. 23, no. 1,
2. A. Al-Rawi, A. Lansari, and F. Bouslama, “A New
Non-Recursive Algorithm for Binary Search Tree
Traversal,” vol. 2, pp. 770-773, Dec. 2003.
3. C. Berberidis, W. Aref, M. Atallah, I. Vlahavas, and A.
Elmagarmid, “Multiple and Partial Periodicity Mining
in Time Series Databases,” July 2002.
4. M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid,
“Periodicity Detection in Time Series Databases,” vol.
17, no. 7, pp. 875-887, July 2005.
5. M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid,
“WARP: Time Warping for Periodicity Detection,”
6. J. Fayolle and M.D. Ward, “Analysis of the Average
Depth in a Suffix Tree under a Markov Model,” pp.
7. R. Grossi and G.F. Italiano, “Suffix Trees and Their
Applications in String Algorithms,” pp. 57-76, Sept.
8. J. Han, Y. Yin, and G. Dong, “Efficient Mining of
Partial Periodic Patterns in Time Series Database,” p.
9. K.-Y. Huang and C.-H. Chang, “SMCA: A General
Model for Mining Asynchronous Periodic Patterns in
Temporal Databases,” vol. 17, no. 6, pp. 774-785, June
10. P. Indyk, N. Koudas, and S. Muthukrishnan,
“Identifying Representative Trends in Massive Time
Series Data Sets Using Sketches,” Sept. 2000.
11. S. Ma and J. Hellerstein, “Mining Partially Periodic
Event Patterns with Unknown Periods,” Apr. 2001.
12. S. Papadimitriou, A. Brockwell, and C. Faloutsos,
“Adaptive, Hands-Off Stream Mining,” pp. 560-571,
13. F. Rasheed, M. Alshalalfa, and R. Alhajj, “Adapting
Machine Learning Technique for Periodicity Detection
in Nucleosomal Locations in Sequences,” pp. 870-879,
14. F. Rasheed and R. Alhajj, “STNR: A Suffix Tree
Based Noise Resilient Algorithm for Periodicity
Detection in Time Series Databases,” vol. 32, no. 3,
pp. 267-278, 2010.
15. F. Rasheed and R. Alhajj, “Using Suffix Trees for
Periodicity Detection in Time Series Databases,” Sept.
16. Y.A. Reznik, “On Tries, Suffix Trees, and Universal
Variable-Length-to-Block Codes,” p. 123, 2002.
17. C. Sheng, W. Hsu, and M.-L. Lee, “Mining Dense
Periodic Patterns in Time Series Data,” p. 115, 2005.
18. Y. Tian, S. Tata, R.A. Hankins, and J.M. Patel,
“Practical Methods for Constructing Suffix Trees,” vol.
14, no. 3, pp. 281-299, Sept. 2005.
19. E. Ukkonen, “Online Construction of Suffix Trees,”
vol. 14, no. 3, pp. 249-260, 1995.
20. N. Välimäki, W. Gerlach, K. Dixit, and V. Mäkinen,
“Compressed Suffix Tree—A Basis for Genome-Scale
Sequence Analysis,” vol. 23, pp. 629-630, 2007.
21. A. Weigend and N. Gershenfeld, Time Series
Prediction: Forecasting the Future and Understanding
the Past. Addison-Wesley, 1994.
22. J. Yang, W. Wang, and P. Yu, “InfoMiner+: Mining
Partial Periodic Patterns with Gap Penalties,” Dec.