A Wavelet-based index scheme for a plant genomics database

Reviews
A WAVELET-BASED INDEX SCHEME FOR A PLANT GENOMICS DATABASE ALEXANDRA MARTINEZ Department of Computer Science and Engineering, University of Florida, Gainesville, FL 32611, USA NICHOLAS PELFORT Department of Computer Science and Engineering, University of Florida, Gainesville, FL 32611, USA We present an approach to efficient searching in plant genomics databases using an adaptation of the indexing technique proposed in [1]. Our indexing scheme extracts the frequency vector and the wavelet coefficients vector of each nucleotide sequence in the database, and then uses this information to efficiently retrieve database sequences that are similar to a query string. Because we are mapping string sequences to vectors of fixed-length, our search space becomes a fixed-size multidimensional space, thus reducing the complexity of our problem. Querying the database is therefore done by computing some distance function between the vector of the query string and those from the database sequences. The experimental results show that our technique provides high selectivity (filtering) while still retrieving relevant data. The results of our wavelet-based index were compared to those of a k-gram index over the same database. 1. Introduction During recent years, the scientific community has devoted enormous efforts to do research in the area of Bioinformatics. Numerous projects have emerged in universities, governmental institutions, and the industry. Computer scientists, biologists, statisticians and mathematicians are collaborating to solve challenging problems in this area. Modern biologists rely on the existence of large biological repositories (databases) and analysis tools to accomplish their work. With the emergence of this e-science era, there is no doubt that better and faster tools for searching the existing biological data sources are needed. One of the ways in which this issue has been addressed is by means of efficient indexing schemes that allow fast retrieval of similar sequences from a database. Indexing solutions for efficient similarity searching on biological databases need to tackle a fundamental issue: what does “similarity” mean in the context of nucleotide and protein sequences. The notion of similarity chosen in a particular indexing scheme then defines what kind of feature extraction, clustering and search are used. The classical similarity measure for strings is called Edit Distance, and it is based on the use of three edit operations, namely insert, delete 1 2 and replace. The difference between two strings s1 and s2 is then defined as the minimum number of edit operations needed to transform s1 into s2. The value of this difference is a measure of the degree of similarity between the two strings. When the difference is small, the two strings are considered “similar”. The most efficient implementation of edit distance is a dynamic programming solution, which runs in O(mn) time, where m is the length of s1 and n is the length of s2. Kahveci and Singh [1] propose a technique that indexes the substrings of a database using a wavelet-based approach and special multi-resolution index structure. For a given query string, it then identifies regions that are able to produce a strong global alignment. Balasubramanian [4] proposes a technique that uses k-gram vectors to cluster and index biological sequences. A base vector is constructed from the |Σ|k possible k-tuples, where |Σ| is the size of the alphabet and k is the size of the tuples under consideration. The input sequence s is scanned from one direction and the observed k-tuple count is incremented in the k-gram. After all k-tuples in the sequence have been observed, the k-gram vector is normalized by the total number of observations, |s|–k+1, where |s| is the sequence length. The author suggests the use of linear projections of k-gram vectors in order to cluster sequences in large databases, which significantly reduces the space storage requirements. BLAST [2] is a popular tool for similarity search on biological sequences that evaluates local sequence alignments (cannot be used to find similar substrings to the entire query string). BLAST first transforms the input query into a set of fixed-size words that are matched against the database. These preliminary matches are then extended to the left and to the right in order to decide if they meet a score criteria specified by the user. Although this heuristic filter has been optimized to reduce the chances of missing matches, it is not completely accurate. As a result, some matches (records that meet the specified score criteria) may not be returned as part of the result set. Another problem is that the space requirement of BLAST is more than the size of the database [1]. Furthermore, BLAST is known to have some inherent performance problems [3], hence the need for indexing techniques that accelerate the search speed in large genomic databases. This paper presents an approach to efficient searching in plant genomics databases. We implemented a variation of the indexing technique proposed by Kahveci and Singh [1]. Our indexing scheme first preprocesses the data by extracting the frequency vector and the wavelet coefficients vector of each nucleotide sequence in the database. Then, it uses these feature vectors to efficiently retrieve database sequences that are similar to a query string. The advantage of having a vector search space rather than a string search space is 3 that distance (representing similarity) between two sequences can be performed much faster if the sequences are mapped to a fixed-dimension vector space. Similarity search techniques in the string domain are known to be computational intensive and therefore, expensive in time. Once strings are transformed to a vector space of fixed-size, the complexity of similarity search algorithms is greatly reduced. The remainder of this paper is organized as follows. Section 2 describes the index construction algorithm and feature vectors extraction. Section 3 describes the search technique and vector distance functions. Section 4 describes the implementation details and presents the experimental results. Finally, Section 5 contains our conclusions and suggestions for future work. 2. Index Construction Our index is constructed by associating two feature vectors (frequency vector and wavelet coefficients vector) to each nucleotide sequence in the database. Such vectors are extracted from the sequences themselves during a preprocessing stage. When a query is issued, feature vectors of the query sequence are computed dynamically and compared to the pre-computed feature vectors in the database. The first feature vector corresponds to the frequency vector of the sequence, whose dimension is equal to the size of the DNA alphabet. The DNA alphabet consists of four symbols {A,C,G,T}. The second feature vector is composed of the wavelet coefficients of the sequence, and its dimension is twice the size of the DNA alphabet. We name this vector as the wavelet vector. 2.1. Frequency vector extraction The frequency vector of a string s from an alphabet Σ = {α1, α2, … , ασ} is defined as: f(s) = [n1, n2, …, nσ] ( 1) where ni is the number of occurrences of the symbol αi in s for 1≤i≤σ. Frequency vectors present three key properties:    The dimension of f(s) is σ, independent of the length of s. The sum over the components of f(s) is equal to the length of s, i.e. |s|. None of the components of f(s) is negative. 4 2.2. Wavelet vector extraction The wavelet vector of a string s = s1s2…sn from an alphabet Σ = {α1, α2, … , ασ} is defined as: ψ(s) = [A, B] ( 2) where A and B are the first and second wavelet coefficients, respectively. Both A and B are vectors of dimension σ, so the dimension of ψ(s) is 2σ [1]. The recursive formulas for computing the coefficients of the kth-level wavelet transformation of s, ψk(s), are given below. The kth-level wavelet transformation is defined for 0 ≤ k ≤ log2n, where n = |s|. Ak,i = A k-1,2i + Ak-1,2i+1 0 f (si) k=0 0 < k < log2n (3) Bk,i = k=0 0 < k < log2n Ak-1,2i - Ak-1,2i+1 where 0 ≤ i < (n/2k), and f(si) is the frequency vector of the ith symbol of s. Using the formulas in (3), expression (2) can then be rewritten as:  log n ( s ) = [ Alog 2 2 n,0 , Blog 2 n , 0 ] 4) ( The above formulas work under the assumption that the sequence length, n, is a power of 2, which ensures that recursive splitting into halves is possible. However, this property about the sequence size does not hold for most of the DNA sequences in our database. Therefore, a technique to expand the sequences to the required length is needed. We explored two techniques for this purpose. The first technique, named circular expansion, uses the idea of a circular list, in which the end element of the list is linked to the start element. It works by appending a prefix of the sequence s to s itself, where the size of the prefix is defined by the number of symbols that s needs in order to reach the next power of 2. The second 5 technique, named dummy padding, uses the common idea of padding the sequence s with dummy symbols until the desired length is reached. In the context of DNA sequences, a dummy symbol can be any symbol that does not belong to the alphabet (e.g. $). The former technique was chosen as the „default‟ expansion method, and it was used in the experimental results presented in this paper. The reason for choosing this technique over the latter one was that it avoids introducing non-biologically meaningful symbols. 3. Search Technique In this section, we examine the steps involved in our query processing mechanism. Given a query sequence q and an error threshold e, our query processing method performs as follows: Step 1: Extract the frequency and wavelet vectors of the query sequence. Step 2: Compute the frequency distance between the query‟s frequency vector and every frequency vector from the database. Step 3: Compute the wavelet distance between the query‟s wavelet vector and every wavelet vector from the database. Step 4: Compute the maximum frequency distance between the query and each sequence in the database, based on the results from steps 2 and 3. Step 5: If the distance from step 4 is below e  q , then the corresponding database sequence is added to the result set. Otherwise, it is discarded. 3.1. Frequency distance If u and v are integer points in a σ-dimensional space representing the frequency vectors of two strings s1 and s2, the frequency distance FD1(u, v) between u and v is defined in terms of the following equations: posDist  i:ui  vi  (u  v ) i i negDist  i:ui  vi  (v  u ) i i (5) FD1(u, v) = max { posDist, negDist } 6 The frequency distance between the frequency vectors of two strings is a lower bound on the edit distance between those strings [1]. 3.2. Wavelet distance If ψ(s1) = [A1, B1] and ψ(s2) = [A2, B2] are points in 2σ-dimensional space that represent the wavelet vectors of two strings s1 and s2, the wavelet distance FD2(ψ(s1), ψ(s2)) between ψ(s1) and ψ(s2) is defined by the following equations: pos  neg  i:a1,i  a2 ,i  (a 1,i  a2,i )   a1,i )  i:b1,i b2 ,i  (b 1,i  b2,i )  b1,i ) (6) i:a1,i  a2 ,i  (a 2 ,i i:b1,i b2 ,i  (b 2 ,i where A1 = [a1,1, a1,2 , … ,a1,σ], B1 = [b1,1, b1,2, … , b1,σ], A2 = [a2,1, a2,2,…, a2,σ], and B2 = [b2,1, b2,2, … ,b2,σ]. The wavelet distance between the wavelet vectors of two strings is also a lower bound to the edit distance between the strings [1]. 3.3. Maximum frequency distance If s1 and s2 are strings from an alphabet Σ = {α1, α2, …, ασ}, the maximum frequency distance FD (s1, s2) between s1 and s2 is defined as: FD(s1,s2) = max { FD1(f (s1), f (s2)), FD2(ψ(s1),ψ(s2)) } (7) where FD1 and FD2 are the frequency and wavelet distance (as defined in preceding sections), respectively. Since the maximum frequency distance is the maximum of the frequency and the wavelet distances, and each of these is a lower bound to the edit distance, 7 then the maximum frequency distance is also a lower bound to the edit distance [1]. We consider the maximum because it is a tighter lower bound. 4. Experimental Results In this section we explain some implementation details, describe our data set and queries, and report the results obtained. 4.1. Implementation Details Our wavelet-based indexing schema was tested on a database of cellulose synthase genes (plants) [5]. This database contains over 700 DNA sequences from over 100 species. It also contains protein sequences. The index was built over the nucleotide sequences. We decided to test our index over nucleotide data rather than protein data due to space concerns. Indexing protein data involves storing vectors of dimension 20 (frequency vector) and 40 (wavelet vector) for each protein sequence in the database since the protein alphabet consist of 20 amino acid symbols; and that would be a significant space overhead. We developed the index construction procedure (initial extraction of feature vectors) and search algorithm (computation of different vector distance measures) in Java. Our program interacts to the database through JDBC. Two input parameters are needed by the search program: a query sequence and an error threshold. The query sequence should be a valid DNA sequence, and the error threshold is the maximum number of mismatches allowed between the query and a database sequence, relative to the length of the query. 4.2. Selectivity of Search Here we examine the query selectivity of our search technique. We are interested in knowing whether the search algorithm filters a significant percentage of the database or not. Query selectivity is changed by varying the error threshold value, e, and observing the size of the result set (number of matching sequences retrieved from the database). For this experiment, the e value was varied from 0.025 (2.5%) to 1.0 (100%). Figure 1 plots the results of the experiment. The two lines plotted correspond to results obtained for two different query sequences. Both queries were taken from the species Populus tremuloides. We can observe from the figure that higher e values are associated to larger result sets, which corresponds to the expected behavior that a higher error threshold implies a lower selectivity (filtering). On the other hand, less relaxed 8 matches have lower e values and are therefore associated to smaller result sets, as expected. Selectivity 800 700 600 Number of Matches 500 400 300 200 100 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Error Threshold Figure 1. The selectivity of our search technique: effect of the error threshold on the size of the result set (number of matches). The two lines correspond to results for two different query sequences. 4.3. Comparison of distance functions We present here a comparison of the behavior of different distance functions. The distance functions analyzed in this experiment are: frequency distance, wavelet distance, maximum frequency distance, and edit distance (see Section 4). Figures 2 and 3 illustrate this comparison. 9 Figure 2. A comparison of different distance functions used in the search process with respect to the Edit Distance. As we can observe from Figure 2, the maximum frequency distance and the frequency distance are linear with respect to the edit distance. On the other hand, the wavelet distance is a staircase whose step‟s size increase exponentially. This behavior is caused by the fact that in order to apply our wavelet transform we expand the size of the sequences to the next power of 2. In Figure 3 we can verify that the maximum frequency distance is a lower bound to the edit distance for most cases. It can be noted that for few sequence lengths, the value obtained for the maximum frequency distance is indeed slightly greater than the edit distance. The reason for this behavior is that the maximum frequency distance could correspond to the wavelet distance (if this was larger than the frequency distance); and the wavelet distance is computed using the expanded version of the sequences rather than the original sequences, as the edit distance does. Hence, the wavelet distance is not a lower bound on the edit distance of original sequences but on the edit distance of the expanded sequences. Therefore, it could higher than the edit distance of the original sequences. Figure 3. Maximum frequency distance and edit distance as functions of the sequence length. 10 4.4. Comparison with K-gram We compared the performance of our indexing scheme to the K-gram index proposed by Balasubramanian [4]. Our test set for this benchmark consisted of 117 queries, where each query sequence was taken from different species in the database. To implement the similarity search algorithm based on k-grams, a collection of projections P = {P1, P2, …, Pn} was included into the database. These projections are used to cluster sequences that fall within -windows with respect to P ( is a window size parameter). The search procedure first projects every sequence in the database by each Pi and records its -window, and then it applies a scoring mechanism. Two scoring techniques can be used: the p-score or c-score. The p-scoring technique takes an input query and calculates the clusters for all projections in P. Only those sequences matching in all projections are returned as matching sequences. On the other hand, the c-score method chooses c random projections in P and returns those sequences matching in all selected projections. We show now some sample results obtained from our wavelet search and from the k-gram search (see Tables 1 and 2). Each match in the result sets has species name, sequence (gene) ID, and a score. The score of a sequence within the K-gram index result set is defined as: (# of projections) - (c-score), and it represents a measure of how close the result sequence is to the query string. Smaller scores indicate higher similarity. The score of a sequence within the Wavelet index result set is defined as the maximum frequency distance between that sequence and the query string. The smaller the distance is, the more similar the strings are. Table 1. Results of a query from species malus x domestica. Each match has species name, sequence (gene) ID, and score. Wavelet Index malus_x_domestica, ces_a1, 0 K-gram Index malus_x_domestica, ces_a1, 0 suaeda_maritima_salsa, ces_a1, 2 suaeda_maritima_salsa, ces_a2, 2 Table 2. Results of a query from species apium graveolens. Wavelet Index apium_graveolens, ces_a1, 0.0 pinus_radiata, ces_a4, 21.5 glycine_clandestina, ces_a1, 22.5 K-gram Index apium_graveolens, ces_a1, 0 citrus_sinensis, ces_a3, 2 prunus_persica, ces_a14, 2 11 mesembryanthemum_crystallinum, ces_a8, 30.0 allium_cepa, ces_a36, 30.0 allium_cepa, ces_a31, 33.0 As we can observe from Tables 1 and 2, the two result sets have only one sequence in common, which actually corresponds to the same sequence that was given as a query for the search engine. We wanted to know if this was the behavior for the result sets on all the queries. Since it was difficult to evaluate the similarity of the result sets by just looking at the data coming from the 117 queries, we had to define a measure of similarity between both result sets and evaluate our technique based on that. For this purpose, we first define the agreement between two records from the result sets. We say that two records from each result set agree if they have the same species name and gene ID. Then, we define the degree of agreement D between two result sets A and B as: 1| A B | | B  A| D    2 | A| |B|    (8) Note that D lies between 0 and 1. When the sets are disjoint D is zero, and when the sets are equal D is one. Figure 4 shows the degree of agreement between the result sets from the Wavelet and K-gram indexes. 1.2 1 0.8 0.6 0.4 0.2 0 1 9 17 25 33 41 49 57 65 Experiments 73 81 89 97 105 113 Figure 4. Degree of agreement between the Wavelet and K-gram result sets for each of the experiments. Two records from each result set agree if their species name and gene ID are equal. Agreement of Results 12 Figure 5 also shows the degree of agreement between the result sets from the Wavelet and K-gram indexes, but using a different measure of agreement between two records of the result sets. Here, two records from the result sets are said to agree if their species name are equal regardless of their gene ID value. The rationale behind this is that from a biological perspective, matching the exact same sequence (species name and gene ID) may not be as relevant as matching sequences from the same species. 1.2 1 0.8 0.6 0.4 0.2 0 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 Experiments Figure 5. Degree of agreement between the Wavelet and K-gram result sets for each of the experiments. Here, two records from the result sets agree if their species name are equal. We can observe from Figures 4 and 5 that the average degree of agreement between the Wavelet and K-gram result sets is about 0.5, ranging from 0.1 to 1. This means that for each sequence in one of the result sets, the probability of it appearing in the other result set is about 0.5. Since the Wavelet and K-gram indexing schemes significantly differ in the extracted features and search algorithm, it was expected to have an average degree of agreement far from 1. Another interesting observation from Figures 4 and 5 is that the degree of agreement between the Wavelet and K-gram result sets was 1 for very few experiments. We also noted that there are no intermediate values between 0.8 and 1 for the degree of agreement. This made us think that the cases for which we obtained a perfect agreement corresponded to experiments having small result sets. The reason for this is that the degree of agreement when we have small result sets has high granularity, which implies that only very few values are possible. For example, if our sets are of size no bigger than five, the minimum distance between possible degrees of agreement is 0.2, and therefore, values between 0.8 and 1 are not feasible. Agreement of Results 13 In order to verify the above hypothesis, we plotted the degree of agreement of the result sets with respect to their size in Figure 6. This figure shows that as the size of the Wavelet index result set approaches 1, its average degree of agreement with the K-gram index result set increases. Note that although the average degree of agreement for small result sets was high, it was not exactly 1, which implies that still some small result sets exhibit a non-perfect degree of agreement. 0.9 0.8 Average Agreement 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 8 10 11 13 16 17 20 21 22 Wavelet Result Set Size Figure 6. Average degree of agreement between the Wavelet and K-gram result sets and its standard deviation for different sizes of the result set. Figure 7 shows a comparison between the size of the Wavelet and K-gram result sets for each of the queries. The size of the K-gram result set is stacked over the size of the wavelet result set in this figure. 250 K-gram 200 Wavelet Result Set Size 150 100 50 0 1 11 21 31 41 51 61 71 81 91 101 111 Experiments 14 Figure 7. Comparison between the size of the Wavelet and K-gram result sets for each of the experiments. The size of the K-gram result set is stacked over the size of the wavelet result set. It can be observed from Figure 7 that the Wavelet index produces smaller result sets in most cases. Actually, all of the result sets were of size no larger than 25 when using the Wavelet index. This is a good indicative of the high selectivity level of our search technique. Figure 7 also shows the low correlation that exists between the size of the result sets of both techniques. This confirms that the Wavelet and K-gram indexes utilize very different feature extraction and search methods. 5. Conclusions In this paper, we considered the problem of similarity searching in plant genomics databases by using a wavelet-based indexing technique. This indexing scheme extracts the frequency and wavelet vectors of each nucleotide sequence in the database, and then uses this information to compute some distance functions to efficiently retrieve database sequences that are similar to a query string. The results of our Wavelet-based index were compared to those of a K-gram index over the same database. The experimental results show that our technique exhibits a higher selectivity level while still retrieving relevant data. Also, we found that the two indexing techniques are very uncorrelated. Both index schemes should be used as a filter to quickly prune non relevant data, and identify a small set of candidate matches which can then be further examined by a more accurate technique such as the edit distance. The reason for this is that the current results may contain false positives, i.e. sequences that should not match the query. By applying a more refined filter over the candidate sequences, we could obtain a more accurate result set. As part of future work, we will investigate the use of our index technique on protein databases, for which data compression mechanisms will need to be applied in order to reduce the amount of storage required for the vectors. Further experimentation is recommended to find the underlying biological meaning of the results obtained with each of the indexing techniques analyzed. References 1. T. Kahveci and A. K. Singh. An Efficient Index Structure for String Databases. In VLDB, pages 351–360, 2001. 15 2. S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215(3):403– 410, 1990. E. Hunt, M. P. Atkinson, and R.W. Irving. A Database Index to Large Biological Sequences. In VLDB, pages 139–148, 2001. K. Balasubramanian. An Euclidean Metric for Genetic Sequence Comparison (Thesis). Princeton University. CS-TR-332-91, (1981). Cellulose Synthase Genes Database. (http://gator.lite.cise.ufl.edu:8080/test2/new_data/test.asp ) 3. 4. 5.

Related docs
premium docs
Other docs by Shame Ona