UBICC Journal
Ubiquitous Computing and Communication Journal 2008 Volume 3 . 2008-01-15 . ISSN 1992-8424
Special Issue on the 2007 International Conference on Information and Knowledge Engineering
UBICC Publishers © 2008 Ubiquitous Computing and Communication Journal
Edited by Usman Tariq.
Special Co-Editor Dr. Hamid R. Arabnia
Ubiquitous Computing and Communication Journal
Book: 2007 Volume 3 Publishing Date: 2008-01-15 Proceedings ISSN 1992-8424
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the copyright law 1965, in its current version, and permission for use must always be obtained from UBICC Publishers. Violations are liable to prosecution under the copyright law.
UBICC Journal is a part of UBICC Publishers www.ubicc.org
© UBICC Journal Printed in South Korea
Typesetting: Camera-ready by author, data conversion by UBICC Publishing Services, South Korea
UBICC Publishers
Table of Contents
Papers
47 Clustering time series online Hamid R Arabnia, Junfeng Qu, Yinglei Song, Khaled Rasheed, Byron Jeff . . . . . . . . . . . . . . . . . . . . . 1
48 Performance analysis for skewed data Shafiq Ahmad, Mali Abdollahian, Panlop Zeephongsekul, Babak Abbasi . . . . . . . . . . . . . . . . . . . . . 8
49 Bringing information retrieval back to database management systems Khaled Nagi . . . . . . . . . . . . . . . . . . . . . 16
50 Web-based decision support systems as knowledge repositories for knowledge management systems Yuri Boreisha, Oksana Myronovych .................................................. 22
51 Bringing information retrieval back to database Khaled Nagi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
52 A set-theoretic data model for evolving database environments E. J. Yannakoudakis, P. K. Andrikopoulos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
53 Knowledge processing, codification and reuse model for communities Farzad Khosrowshahi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
54 Signal denoising by wavelet packet transmission on FPGA technology Fatma Hanafy Elfoly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
CLUSTERING TIME SERIES ONLINE IN A TRANSFORMED SPACE
Hamid R. Arabnia, Junfeng Qu, Yinglei Song, Khaled Rasheed, Byron Jeff
United States of America
{hra, khaled}@cs.uga.edu, {jqu, bjeff}@clayton.edu, ysong@umes.edu
ABSTRACT
Similarity-based retrieval has attracted an increasing amount of attention in recent years. Although there are many different approaches, most are based on a common premise of dimensionality reduction and spatial access methods. The relative change of time series data carries more meaning and gives more insight into the problem domain. This paper presents our efforts to take the relative changes of time series into account during the matching process. We propose a similarity distance measure based on the transformed difference space of a series of critical points. Experiments with financial time series data indicate that our distance measure works as well as the Euclidean distance measure on normalized data, without any shifting or scaling, and as well as the PAA approach. The proposed distance measure is a general distance metric and is suitable for online similarity matching because it does not maintain stream statistics over data streams.
Keywords: data mining, time series, clustering, similarity matching, Euclidean distance.
1 INTRODUCTION
Humans are good at judging the similarity between time series just by looking at their plots. Such knowledge must be encoded in the computer if we want to automate the detection of similarity among time series. In general, given any pair of time series, their similarity is usually measured by their correlation or distance. If we treat a time series of length n as a point in n-dimensional space, the Euclidean distance appears to be a natural choice for the distance between time series. Given two time series $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_m)$ with n = m, their Euclidean distance is defined as:
An earlier version of the manuscript was published in the proceedings of the 2007 International Conference on Information and Knowledge Engineering (IKE'07: June 2007). Hamid R. Arabnia is with the Department of Computer Science, the University of Georgia, Athens, GA 30602, USA (e-mail: hra@cs.uga.edu). Junfeng Qu, corresponding author, is with the Department of Information Technology, Clayton State University, Morrow, GA 30260, USA (phone: 678-4664406; e-mail: jqu@clayton.edu). Yinglei Song is with the Department of Mathematics and Computer Science, University of Maryland Eastern Shore, Princess Anne, MD 21853, USA (e-mail: ysong@umes.edu). Khaled Rasheed is with the Department of Computer Science, the University of Georgia, Athens, GA 30602 (e-mail: khaled@cs.uga.edu). Byron Jeff is with the Department of Information Technology, Clayton State University, Morrow, GA 30260 (e-mail: byronjeff@clayton.edu).
$$ D(X, Y) \equiv \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$
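As an illustration of equation (1), the following minimal Python sketch computes the Euclidean distance between two equal-length series. The function name `euclidean_distance` and the sample values are hypothetical and chosen only for this example.

```python
import math


def euclidean_distance(x, y):
    """Equation (1): Euclidean distance between two equal-length series."""
    if len(x) != len(y):
        # The distance is undefined when the lengths differ (n must equal m).
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 4.0, 5.0]  # same shape as x, shifted up by 1
print(euclidean_distance(x, y))  # 2.0
```

Note that the two sample series have identical shapes, yet their distance is nonzero purely because of the baseline shift.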
We say that two sequences X and Y are in ε-match if D(X, Y) is less than or equal to ε. We define an n-dimensional distance computation as the operation that computes the distance between two sequences of length n. There are essentially two ways the data might be organized [1].
• Whole sequence matching: All the time series to be compared are assumed to have the same length n, and the query time series q is of length n too. The Euclidean distance between the query time series and any candidate time series can be computed in linear time. Given a query threshold ε, the answer to a whole sequence similarity query for q is all the time series in the data set whose Euclidean distance to q is less than ε.
• Subsequence matching: Here the time series in the data set can have different lengths, which are usually larger than the length of the query time series. The answer to a subsequence query is any subsequence of any candidate time series whose distance to q is less than ε.
Shasha and Zhu [2] pointed out that the Euclidean distance measure is not adequate as a
UbiCC Journal, Volume 3, January 2008
flexible similarity measure between time series because:
• Two time series can be very similar even though they have different baselines or amplitude scales.
• The Euclidean distance between two time series of different lengths is undefined even though the time series are similar to each other.
• Two time series could be very similar even though they are not perfectly synchronized.
The Euclidean distance, which sums up the differences between each pair of corresponding data points of two time series, is too rigid and will amplify the difference between time series. In a given time series, the relative change between two adjacent data points is often thought of as where the information resides. Especially in financial market data analysis, the amplitude difference is more important than the time difference. We therefore transform the time series into the space of differences, because similarity matching there is more meaningful and gives more insight into the problem domain, especially in financial data analysis. In this paper, we base our similarity measure on a transformation of the original time series into a new series of critical change-points, which contains the difference information of the original series and on which the similarity clustering is based. The rest of the paper is organized as follows. Section 2 discusses related work on time series similarity matching. Section 3 describes our distance measure on the transformed space. Section 4 presents our experiments with the similarity distance measure on financial time series data. In Section 5, we conclude our research and point out future research directions.
2 RELATED WORKS
There are two basic strategies to cope with high-dimensional problems. The first is simply to use a subset of relevant variables to construct the model. That is, to find a subset of p′ variables where p′<
0, x ≠ y (non-negative definiteness); δ(x, y) = δ(y, x) (symmetry); δ(x, y) ≤ δ(x, z) + δ(z, y) (triangle inequality). Euclidean distance satisfies these properties. For new sequences $E_x$ and $E_y$ (with the same length), the distance function $d(E_x, E_y)$ is a general metric. To prove that $d(E_x, E_y)$ is a general distance metric, we need to prove that it is non-negative, symmetric and reflexive, and that it satisfies the triangle inequality. Obviously, $d(E_x, E_y) \geq 0$ and $d(E_x, E_y) = d(E_y, E_x)$ from our definition; also $d(E_x, E_x) = 0$, so $d(E_x, E_y)$ is non-negative, symmetric and reflexive. Now we need to prove that $d(E_x, E_y)$ satisfies the triangle inequality, i.e.
$$ d(E_x, E_y) \leq d(E_x, E_z) + d(E_z, E_y). $$
Given the sequences of events
$$ E_x = \{(e_1, t_1), (e_2, t_2), \ldots, (e_m, t_m)\} $$
and
$$ E_y = \{(e'_1, t'_1), (e'_2, t'_2), \ldots, (e'_m, t'_m)\}, $$
we transform them into the relative difference space of the sequence of events, $E_x$ into $E$ and $E_y$ into $E'$, as:
$$ E = \{(\Delta e_1, \Delta t_1), (\Delta e_2, \Delta t_2), \ldots, (\Delta e_{m-1}, \Delta t_{m-1})\} $$
and
$$ E' = \{(\Delta e'_1, \Delta t'_1), (\Delta e'_2, \Delta t'_2), \ldots, (\Delta e'_{m-1}, \Delta t'_{m-1})\} $$
where $\Delta e_i = e_{i+1} - e_i$ and $\Delta e'_i = e'_{i+1} - e'_i$. Now it is obvious that the triangle inequality is satisfied, based on the Pythagorean theorem and Euclidean space.
After transforming the time series into a series of critical change-points, a critical question is how many of these change-points are necessary to represent the time series while retaining its structure and shape. The compression ratio is defined to compare our approach with the PAA approach, which is different from ours and well studied [11].
4 EXPERIMENTS
We have proved that our newly developed distance measure for time series similarity matching is a metric function. Shasha et al. [2] showed that Euclidean distance alone does not give an intuitive measure of similarity when the time series compared have different baselines and scales. Therefore, shifting or scaling transforms are often performed before measuring the Euclidean distance. Here the shifting transform produces a new time series by adding some real number to each item of the old time series. The scaling transform produces a new time series by multiplying each item of the old time series by some real number. A simple way to make a similarity measure invariant to shifting and scaling is to normalize the time series. The normal form Norm(X) of a time series X is obtained from X by shifting the time series by its mean and then scaling by its standard deviation:
$$ \mathrm{Norm}(X) = \frac{X - \mathrm{avg}(X)}{\mathrm{std}(X)} $$
It is trivial that the normalized time series have the properties avg(Norm(X)) = 0 and std(Norm(X)) = 1. The Euclidean distance between the normal forms of two time series is a similarity measure that is invariant to shifting and scaling because the normal forms have the same baseline and scale [2].
We used the time series of seven stocks (EOG, SM, HAL, CDIS, NOVL, SCOX, and WMT) from April 2005 to October 2005 to study our clustering measure. For every possible pairing of the seven data sets from these stocks, we use group-average hierarchical clustering. The corresponding dendrograms of the clusterings based on the different distance measures and transform techniques are shown. We compared three distance measures:
1. Euclidean: The Euclidean distance measure as presented in the introduction is tested to facilitate comparison with the large body of literature that uses this distance measure.
2. PEuclidean: The PAA representation of the time series, with the same compression ratio as our approach, is also compared.
3. DSDistance: The critical change-points representation of the time series, clustered based on our defined transformed space.
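The normal form used in Section 4 and the difference-space transform defined in Section 3 can both be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is assumed for std(X), and $\Delta t_i = t_{i+1} - t_i$ is assumed by analogy with the stated definition of $\Delta e_i$. The sketch checks that normalization removes a shift and a positive scale, and that a baseline shift cancels in the difference space.

```python
import math


def normalize(series):
    """Norm(X) = (X - avg(X)) / std(X), with the population standard deviation (assumption)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(v - mean) / std for v in series]


def to_difference_space(events):
    """Map [(e_1, t_1), ..., (e_m, t_m)] to the m-1 pairs (e_{i+1} - e_i, t_{i+1} - t_i)."""
    return [(e2 - e1, t2 - t1) for (e1, t1), (e2, t2) in zip(events, events[1:])]


x = [10.0, 12.0, 11.0, 15.0]
y = [3.0 * v + 100.0 for v in x]  # x scaled by 3 and shifted by 100

# The normal forms of x and y coincide up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(normalize(x), normalize(y)))
print(max_gap < 1e-9)

events = [(10.0, 0), (12.0, 3), (11.0, 5), (15.0, 9)]
shifted = [(e + 100.0, t) for e, t in events]  # same events on a higher baseline
print(to_difference_space(events))  # [(2.0, 3), (-1.0, 2), (4.0, 4)]
print(to_difference_space(events) == to_difference_space(shifted))
```

The difference transform needs only adjacent pairs of points, so it can be updated as new points arrive without recomputing a mean or standard deviation over the whole series.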
Figure 2. Euclidean distance cluster on normalized time series data
Figure 3. Euclidean distance cluster on raw time series data
Our critical change-point representation has a compression ratio of around 25. Therefore, the PAA representation is compared at the same compression ratio. The clustering results are shown in Figs. 4 and 5.
Figure 4. Euclidean distance cluster on normalized time series data with PAA representation (compression ratio = 25)
Figure 5. Euclidean distance cluster on raw time series data with PAA representation (compression ratio = 25)
Figures 6 and 7 show our critical change-point representation with a compression ratio of 25. The distance measure on the transformed space is used.
Figure 6. DSDistance clustering results on raw time series data (first cluster)
The clustering algorithm automatically divides these seven time series into three clusters: one cluster has only four critical change-points (EOG, CDIS and SCOX), one has five critical change-points (SM, HAL and NOVL), and one has no change-point at all (WMT).
Figure 7. DSDistance clustering results on raw time series data (second cluster)
With the same compression ratio, the PAA representation obtained the same clustering results as shown in fig. 7 and 8, with different threshold values, clustering EOG, CDIS and SCOX in one group and SM, HAL and NOVL in another group. Figs. 2 and 3 show that the Euclidean distance is sensitive to the different baselines and scales of the time series; similar results were also obtained by Shasha [2]. Based on the Euclidean distance measure, the PAA representation gives the same clustering results as the raw time series data, whereas a different clustering is obtained on normalized time series with the PAA representation. The compression ratio of the PAA representation is the same as that of our critical change-points representation. Therefore, at a compression ratio comparable to that of our data mapping method, the PAA representation does not lose the structure of the time series on raw data, but it does on normalized data. Our critical change-points representation achieves the same clustering results as the PAA mapping method under the same compression ratio. The approach we propose adapts to the structure of the time series automatically. The authors of the PAA approach also proposed an adaptive piecewise constant approximation [12]. The comparison of similarity based on raw data shows that scaling and shifting have no effect on the comparison of price movements under our distance measure, which considers the relative position of corresponding change-points in the time series, because PEuclidean shows the same results on normalized time series.
5 CONCLUSIONS
In this paper, we introduced a new distance measure for clustering that considers the relative position of corresponding change-points in the time series. The proposed distance measure is suitable for online similarity matching, such as data stream similarity matching, where traditional matching methods for time series are inefficient and dimensionality reduction methods are very costly to apply repeatedly each time new data arrive. With our method, it is also not necessary to keep statistics over the whole clustering process when new data come in, which makes the measure well suited to online time series data streams. The distance measure is also not sensitive to the shifting and scaling of the time series data. We also proved that our distance measure is a general distance metric. It works as well as the Euclidean distance measure on normalized data and the well-defined PAA mapping approach, and its results are close to human perceptual judgment as well. Future research can proceed in several directions. One is to apply our distance measure only to the landmarks [22] of the time series to reduce computation time and dimensionality. Another is to incorporate indexing techniques into the search algorithm, since we have proved that our distance function is a metric.
6 REFERENCES
[1] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases," Knowledge and Information Systems, vol. 3, pp. 263-286, 2001.
[2] D. E. Shasha and Y. Zhu, High Performance Discovery in Time Series: Techniques and Case Studies. New York: Springer, 2004.
[3] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," in Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, New York: Springer, 1993.
[4] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," in Proc. ACM SIGMOD Conf., Minneapolis, 1994.
[5] Y.-S. Moon, K.-Y. Whang, and W.-S. Han, "General Match: A Subsequence Matching Method in Time-Series Databases Based on Generalized Windows," SIGMOD, pp. 382-393, 2002.
[6] K.-p. Chan and A. W.-c. Fu, "Efficient Time Series Matching by Wavelets," Proceedings of International Conference on Data Engineering (ICDE '99), Sydney, p. 126, 1999.
[7] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss, "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries," in Proceedings of the 27th VLDB Conference, Roma, Italy, 2001, pp. 79-88.
[8] Y. Huhtala, J. Kärkkäinen, and H. Toivonen, "Mining for similarities in aligned time series using wavelets," Data Mining and Knowledge Discovery: Theory, Tools, and Technology, SPIE Proceedings Series, Orlando, FL, vol. 3695, pp. 150-160, Apr. 1999.
[9] D. Wu, D. Agrawal, and A. E. Abbadi, "Efficient Retrieval for Browsing Large Image Databases," in Proc. CIKM, Rockville, MD, 1996, pp. 11-18.
[10] F. Korn, H. V. Jagadish, and C. Faloutsos, "Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences," in SIGMOD, 1997, pp. 289-300.
[11] E. Keogh and M. Pazzani, "Scaling up Dynamic Time Warping for Datamining Applications," in KDD, Boston, MA, 2000, pp. 285-289.
[12] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani, "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," in Proc. SIGMOD, Santa Barbara, California, 2001, pp. 151-162.
[13] B.-K. Yi, H. V. Jagadish, and C. Faloutsos, "Efficient Retrieval of Similar Time Sequences under Time Warping," ICDE, pp. 201-208, 1998.
[14] G. Das, D. Gunopulos, and H. Mannila, "Finding Similar Time Series," in PKDD, 1997, pp. 88-100.
[15] H. Wu, B. Salzberg, and G. C. Sharp, "Subsequence Matching on Structured Time Series Data," in SIGMOD, Baltimore, Maryland, USA, 2005.
[16] M. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining Stream Statistics over Sliding Windows," SIAM Journal on Computing, vol. 31, pp. 1794-1813, 2002.
[17] Y. Zhu and D. Shasha, "Efficient Elastic Burst Detection in Data Streams," in SIGKDD, Washington, DC, USA: ACM, 2003.
[18] R. Jin and G. Agrawal, "Efficient Decision Tree Construction on Streaming Data," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., 2003, pp. 571-576.
[19] S. Muthukrishnan, R. Shah, and J. S. Vitter, "Mining Deviants in Time Series Data Streams," in Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), Santorini Island, Greece, 2004.
[20] J. Gehrke, F. Korn, and D. Srivastava, "On Computing Correlated Aggregates Over Continual Data Streams," SIGMOD, pp. 126-133, 2001.
[21] J. Qu and H. R. Arabnia, "Mining Structural Changes in Financial Time Series with Gray System," in DMIN 2005, 2005.
[22] E. Keogh, "A Fast and Robust Method for Pattern Matching in Time Series Databases," in Proceedings of 9th International Conference on Tools with Artificial Intelligence (ICTAI), 1997, pp. 578-584.
PERFORMANCE ANALYSIS FOR SKEWED DATA
1S. Ahmad, 2M. Abdollahian, 3P. Zeephongsekul, 4B. Abbasi
1,2,3 Department of Statistics and Operations Research, RMIT University, Melbourne, Australia; 1 shafiq.ahmad@rmit.edu.au, 2 mali.abdollahian@rmit.edu.au, 3 panlop.zeephongsekul@rmit.edu.au
4 Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran; 4 b.abbasi@gmail.com
ABSTRACT
Information technology and the media have changed the face of business practices today. Customers now play the key role in the success of any business, and noncompliance with their specifications will simply lead to failure of the business. Researchers across many disciplines have worked out several modifications of the traditional process capability measures to obtain better estimates of a product's capability to meet market specifications. However, these conventional capability measures depend heavily on the assumption of normality. In this paper, we compare and contrast the Cumulative Distribution Function (CDF) method with recently proposed process capability evaluation methods, namely the Burr percentile method and the commonly used Clements percentile method, when the underlying distribution is non-normal. A simulation study using Gamma, Weibull and Beta distributions is conducted and a comparison of the results is presented. Finally, a case study is presented using actual data from a manufacturing process.
Keywords: Process Capability Index (PCI), Proportion of nonconforming in non-normal process, CDF method, Quantile based capability indices.
1 INTRODUCTION
Process capability indices (PCIs), process yield and process expected loss are three basic means that have been widely used in measuring process performance. Of the three, the PCI is the least complex to understand and to deploy to any process. A larger PCI value implies a higher process yield and also indicates a lower process expected loss. Therefore, the PCI can be viewed as an effective means of measuring product quality and process performance [1]. The conventional process capability index $C_p$ is defined as:
$$ C_p = \frac{usl - lsl}{6\sigma} \tag{1} $$
where usl and lsl (upper and lower specification limits) are the design tolerance limits, also called customer specifications. The process capability ratio for an off-center process, $C_{pk}$, is defined as:
$$ C_{pk} = \min\{C_{pu}, C_{pl}\} \tag{2} $$
where $C_{pu}$ and $C_{pl}$ are:
$$ C_{pu} = \frac{usl - \mu}{3\sigma} \tag{3} $$
$$ C_{pl} = \frac{\mu - lsl}{3\sigma} \tag{4} $$
Here $C_{pu}$ and $C_{pl}$ are referred to as the upper and lower one-sided capability indices, and µ and σ are the process mean and standard deviation respectively. The process capability index $C_p$ defined here rests on several assumptions: the data are collected from an in-control process, are independent and identically distributed, and follow a normal distribution. However, most processes in the real world produce non-normal data, and quality practitioners need to check these basic assumptions before deploying any conventional process capability index. The calculation of the conventional PCI measure requires the values of three points within the process distribution: the upper tail, the point of central tendency and the lower tail [2]. For a normal distribution, in terms of quantiles, $X_{0.99865} = \mu + 3\sigma$ is the upper tail and $X_{0.5} = \mu$, in
general, is the median, and $X_{0.00135} = \mu - 3\sigma$ corresponds to the lower tail. For normal data, it is easy to estimate these quantile points; for non-normal data, it is not. To deal with non-normality, one approach is to transform the non-normal data to approximately normal data using mathematical functions. Johnson [3] proposed a system of distributions based on the moment method, called the Johnson transformation system. Box and Cox [4] also used a transformation method for non-normal data by presenting a family of power transformations. Somerville and Montgomery [5] proposed using a square-root transformation to transform a skewed distribution into a normal one. The main objective of all these transformations is that one can apply conventional PCIs once the data are transformed to normal data. Clements [6] proposed a percentile method to calculate the $C_p$ and $C_{pk}$ indices for non-normal data using the Pearson family of curves. Liu and Chen [7] proposed a modified Clements PCI percentile method using the Burr XII distribution. Ahmad et al. [8] compared Liu and Chen's method with the commonly used Box-Cox method and concluded that the Burr method provides slightly better estimates of PCI for non-normal data. In this paper, we review and compare the CDF, Clements and Burr methods, which are commonly used to evaluate PCIs for non-normal data. This paper is organized as follows. The PCI methods for the comparison study are discussed in Section 2. For illustration, a simulation study using Weibull, Gamma and Beta distributions is presented in Sections 3 and 4, an application example with real-world data is presented in Section 5, and the conclusion is given in Section 6.
2. PCI FOR NON-NORMAL DATA
In this section a brief review of the three methods used in this paper is presented.
2.1 Clements Percentile PCI Method
Clements method is popular among quality practitioners in industry.
Clements [6] proposed that 6σ in equation (1) be replaced by the length of the interval between the upper and lower 0.135 percentage points of the distribution of X. Therefore, the denominator in equation (1) can be replaced by $(U_p - L_p)$, i.e.
$$ C_p = \frac{usl - lsl}{U_p - L_p} \tag{5} $$
where $U_p$ is the upper percentile, i.e. the 99.865 percentile of observations, and $L_p$ is the lower percentile, i.e. the 0.135 percentile of observations. Since the median M is the preferred central value for a skewed distribution, he defined $C_{pu}$ and $C_{pl}$ as follows:
$$ C_{pu} = \frac{usl - M}{U_p - M} \tag{6} $$
$$ C_{pl} = \frac{M - lsl}{M - L_p} \tag{7} $$
and
$$ C_{pk} = \min\{C_{pu}, C_{pl}\} \tag{8} $$
Clements approach uses the standard estimators of skewness and kurtosis, which are based on the 3rd and 4th moments respectively and may not be reliable for very small sample sizes [7]. Wu et al. [9] conducted a research study indicating that the Clements method cannot accurately measure the capability indices, especially when the underlying data distribution is skewed.
2.2 Burr Percentile PCI Method
Burr [11] proposed a distribution called the Burr XII distribution, whose probability density function is defined by:
$$ f(x) = \begin{cases} \dfrac{c k x^{c-1}}{(1 + x^c)^{k+1}} & \text{if } x \geq 0;\ c, k \geq 1 \\ 0 & \text{if } x < 0 \end{cases} \tag{9} $$
The cumulative distribution function is defined by:
$$ F(x) = 1 - (1 + x^c)^{-k} \quad \text{if } x \geq 0;\ c, k \geq 1 \tag{10} $$
where c and k represent the skewness and kurtosis coefficients of the Burr distribution respectively. Liu and Chen [7] introduced a modification of the Clements method whereby, instead of using Pearson curve percentiles, they used percentiles from an appropriate Burr distribution. The proposed modified method is as follows:
• Estimate the sample mean, sample standard deviation, skewness and kurtosis of the original sample data.
• Calculate the standardized moments of skewness ($\alpha_3$) and kurtosis ($\alpha_4$) for the given sample size n (see Appendix I for details).
• Use the values of $\alpha_3$ and $\alpha_4$ to select the appropriate Burr parameters c and k, Burr IW [11]. Then use the standardized tails of the Burr XII distribution to determine the standardized 0.135, 50 and 99.865 percentiles ($X_{0.00135}$, $X_{0.50}$, $X_{0.99865}$).
• Calculate the estimated lower, median and upper percentiles using the Burr table as follows, where $\bar{x}$ and s are the sample mean and sample standard deviation:
$$ L_p = \bar{x} + s X_{0.00135} \tag{11} $$
$$ U_p = \bar{x} + s X_{0.99865} \tag{12} $$
$$ M = \bar{x} + s X_{0.50} \tag{13} $$
• Calculate the process capability indices using equations (5)-(8).
2.3 CDF PCI Method
Wierda [12] introduced a new approach to evaluate process capability for non-normal data using the Cumulative Distribution Function (CDF). Castagliola [13] used the CDF approach to compute the proportion of non-conforming items and then estimate the capability index from this proportion. Castagliola showed the relationship between process capability and the proportion of non-conforming items and used the CDF method to evaluate the PCI for non-normal data by fitting a Burr distribution to the process data. He used a polynomial approximation to replace the empirical function in the Burr distribution, and then used the proposed method given by equation (14). A short proof of this well-known result for calculating $C_p$ is given in Appendix II.
Using the CDF method, $C_p$ and $C_{pk}$ are defined by:
$$ C_p = \frac{1}{3}\,\Phi^{-1}\!\left(0.5 + 0.5 \int_{lsl}^{usl} f(x)\,dx\right) \tag{14} $$
$$ C_{pk} = \min(C_{pu}, C_{pl}) \tag{15} $$
where
$$ C_{pl} = \frac{1}{3}\,\Phi^{-1}\!\left(0.5 + \int_{lsl}^{T} f(x)\,dx\right) \tag{16} $$
$$ C_{pu} = \frac{1}{3}\,\Phi^{-1}\!\left(0.5 + \int_{T}^{usl} f(x)\,dx\right) \tag{17} $$
where f(x) represents the probability density function of the process and T represents the process mean for normal data and the process median for non-normal data. In this paper f(x) in Equation (14) is replaced by Equation (9), i.e. the Burr density function (see details in Appendix III).
3. SIMULATION STUDY
Three non-normal distributions, Gamma, Weibull and Beta, have been used to generate random data in this simulation. These distributions are used to investigate the effects of non-normal data on the process capability index. They are known to have parameter values that can represent mild to severe departures from normality. The parameters are selected so that we can compare our simulation results with existing results in the literature that use the same parameters. The probability density function of the Gamma distribution, with parameters α and β, is given by
$$ f(x) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}, \quad \alpha, \beta > 0,\ x \geq 0 \tag{18} $$
The parameters used in this simulation are shape = 4.0 and scale = 0.5.
Figure 1: pdf of Gamma distribution with parameters (shape = 4.0, scale = 0.5)
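The three percentile points that percentile-based indices rely on (the 0.135, 50 and 99.865 percentiles) can be computed directly for the Gamma distribution of equation (18). The sketch below is a minimal illustration, not the authors' procedure: the function names and the bisection bracket are hypothetical, and the closed-form CDF is valid only because the shape parameter used here is an integer.

```python
import math

SHAPE, SCALE = 4, 0.5  # Gamma parameters used in the simulation study


def gamma_pdf(x, shape=SHAPE, scale=SCALE):
    """Equation (18): Gamma probability density function."""
    if x < 0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def gamma_cdf(x, shape=SHAPE, scale=SCALE):
    """Closed-form Gamma CDF, valid only for integer shape (Erlang case)."""
    if x <= 0:
        return 0.0
    z = x / scale
    tail = math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(shape))
    return 1.0 - tail


def gamma_quantile(p, lo=0.0, hi=50.0, iterations=200):
    """Invert the CDF by bisection; the bracket [lo, hi] is an assumption."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


lower = gamma_quantile(0.00135)   # 0.135 percentile
median = gamma_quantile(0.5)      # 50 percentile
upper = gamma_quantile(0.99865)   # 99.865 percentile
print(lower, median, upper)
```

Under these assumptions the upper point comes out close to the USL of 6.3405 listed for Gamma(4, 0.5) in Table (1).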
The probability density function of the Weibull distribution with shape ($\alpha$) and scale ($\beta$) is given by

$$f(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}}, \quad \alpha, \beta > 0,\ x \ge 0 \qquad (19)$$

The parameters used in this simulation are: $\alpha = 1.0$ and $\beta = 1.2$.

Figure 2: pdf of Weibull distribution with parameters ($\alpha = 1.0$, $\beta = 1.2$)

The probability density function of the Beta distribution with shape 1 ($\alpha$) and shape 2 ($\beta$) is given by

$$f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1 \qquad (20)$$

The parameters used in this simulation are: $\alpha = 4.4$ and $\beta = 13.3$.

Figure 3: pdf of Beta distribution with parameters ($\alpha = 4.4$, $\beta = 13.3$)

3.1 Comparison Criteria
The criterion for comparison in this simulation study is based on the proportion of non-conformances (PNC). The proportion of non-conforming units for a normal distribution can be determined by [2]

$$PNC = \Phi(-3C_{pu}) \qquad (21)$$

The $C_{pu}$ values in Table (1) are computed using equation (17), where $f(x)$ is replaced by the corresponding distributions (i.e. Gamma, Weibull and Beta). The probability of non-conforming items (PNC) is calculated using equation (21), as suggested by Castagliola [13], for all three methods (e.g. for the Gamma distribution with a $C_{pu}$ value of 0.8698, the corresponding PNC value from equation (21) is 0.0045351). Figure 4 presents a flowchart for estimating the PNC and the PCIs using the different methods and different non-normal distributions. The exact PNC value (p) in this flowchart is obtained using the following equation:

$$PNC = 1 - \int_{0}^{usl} f(x)\,dx \qquad (22)$$

where $f(x)$ represents the corresponding density function of the Gamma, Weibull and Beta distributions.

• Generate sample data using a non-normal distribution (e.g. Gamma, Weibull, Beta etc.)
• Compute Cpu using the CDF method (Equation (14)) and compute the PNC for the corresponding Cpu (Equation (21)); call it p1
• Compute Cpu using the Burr method and compute the PNC for the corresponding Cpu; call it p2
• Compute Cpu using the Clements method and compute the PNC for the corresponding Cpu; call it p3
• Assess the efficacy of the different methods by comparing p1, p2, p3 and the exact p (Equation (22))

Figure 4: Simulation methodology flowchart

3.2 Simulation Results
The Cpu* values in Table (1) are used to assess the efficacy of the three methods in estimating the process capability index for non-normal data.
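Equations (21) and (22) can be checked numerically for the Gamma case. The sketch below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the closed-form tail used for equation (22) is valid only because the Gamma shape parameter used here (4) is an integer.

```python
import math


def pnc_from_cpu(cpu):
    """Equation (21): PNC = Phi(-3 * Cpu), with Phi the standard normal CDF."""
    return 0.5 * math.erfc(3.0 * cpu / math.sqrt(2.0))


def gamma_exact_pnc(usl, shape, scale):
    """Equation (22) for a Gamma density with integer shape:
    1 - F(usl) = exp(-t) * sum_{k < shape} t**k / k!, where t = usl / scale."""
    t = usl / scale
    return math.exp(-t) * sum(t ** k / math.factorial(k) for k in range(shape))


# Gamma(4, 0.5) with USL = 6.3405, as used in the simulation.
exact_p = gamma_exact_pnc(6.3405, 4, 0.5)

# PNC implied by the Cpu estimates reported for the Gamma distribution.
p_clements = pnc_from_cpu(0.8698)
p_cdf = pnc_from_cpu(1.0000)
```

Under these assumptions, `p_clements` agrees with the value 0.0045351 quoted in the text, while `p_cdf` and `exact_p` both come out close to 0.00135.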
Table (1) shows the results of this comparison.

Table (1)
Distribution      USL      Cpu*    Cpu Clements   Cpu Burr   Cpu CDF
Gamma(4,0.5)      6.3405   1.000   0.8698         0.9069     1.0000
Weibull(1,1.2)    5.0      1.043   0.9694         0.9738     1.0292
Beta(4.4,13.3)    0.5954   1.002   0.7434         0.7965     1.0028
*Computed from Equation (21) – percentile and exact distribution

The simulation results given in Table (1) show that the $C_{pu}$ values obtained using the Clements method are worse than those obtained using the Burr and CDF methods. The $C_{pu}$ values obtained using the CDF method are the closest to the $C_{pu}$ values obtained using direct distribution percentiles in the conventional approach, thus leading to better estimates of the PCIs compared with the Burr method. Our comparison criterion is that the method which yields an expected proportion of non-conformities closest to that obtained using the exact distribution is the superior method.

Table (2) – Comparison of expected proportion of nonconformance (PNC) with exact PNC
Distribution   Clements (p3)   Burr (p2)   CDF (p1)   Exact (p)
Gamma          0.00454         0.00326     0.00135    0.0013
Weibull        0.00182         0.00170     0.00101    0.0010
Beta           0.01287         0.00844     0.00131    0.0013

The results in Table (2) show that the PNC values obtained using the Clements method are worse than those of the other two methods. In this table, the PNC values obtained using the CDF method are close to the PNC values obtained using the exact distribution. Thus the CDF method gives better estimates of non-conformance than the commonly used Clements and Burr methods.

4. DISCUSSION
The simulation study shows that both the Burr and CDF methods estimate $C_{pu}$ values more accurately than the commonly used Clements method. Looking at the results in Tables 1 and 2, we conclude that:
• The CDF method is superior to both percentile methods (Burr and Clements).
• The Burr method still performs better than the commonly used Clements method.
• The CDF method is the one for which the estimated $C_{pu}$ value deviates least from the target $C_{pu}$ value.
• For the given sample size, the PNC value obtained using the CDF method is comparable with the targeted PNC value obtained from the exact distribution.
During simulation, we observed that data with a moderate departure from normality provide better estimates of capability indices than data with severe departures from normality.

5. REAL DATA EXAMPLE
A case study using data from a manufacturing industry was conducted. All three methods were deployed to estimate the non-normal process capability for the experimental data. The data were collected from an in-control manufacturing process and are measurements of the bonding area between two surfaces, with an upper specification limit of USL = 24. The summary statistics of the process data are: $\mu$ (mean) = 23.4809, $\sigma$ = 0.5650, $\tilde{\mu}$ (median) = 23.3963, $\mu_3$ (skewness) = 1.1098, $\mu_4$ (kurtosis) = 4.9740.

Figure 5: Histogram of the real data (frequency versus data value)

We selected 30 samples of size 50 from these data points. For each sample, we computed the process capability index $C_{pu}$ and the proportion of non-conforming items (PNC) using the Clements, Burr and CDF methods. The mean and standard deviation of the estimated $C_{pu}$ values are given in Table (3).
Table (3) – Results of the real example based on 30 samples of size n = 50
Cpu                          CDF        Burr       Clements
Mean                         0.313277   0.347917   0.360691
Standard deviation           0.023811   0.065859   0.080264
Expected PNC using Eq (21)   0.17365    0.14830    0.13961
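The last row of Table (3) follows from the mean Cpu values through equation (21). The short sketch below recomputes that row and compares it with the exact PNC of 0.168 reported for the experimental data. Variable names are hypothetical, and the standard library's `NormalDist` is used here for the normal CDF; the paper does not say which tool the authors used.

```python
from statistics import NormalDist

# Mean Cpu per method, copied from Table (3).
mean_cpu = {"CDF": 0.313277, "Burr": 0.347917, "Clements": 0.360691}

# Equation (21): PNC = Phi(-3 * Cpu).
standard_normal = NormalDist()
expected_pnc = {m: standard_normal.cdf(-3.0 * c) for m, c in mean_cpu.items()}

# Exact PNC reported for the experimental data (share of values above USL = 24).
exact_pnc = 0.168
closest = min(expected_pnc, key=lambda m: abs(expected_pnc[m] - exact_pnc))

for method, p in expected_pnc.items():
    print(f"{method:9s} expected PNC = {p:.5f}")
print("closest to exact PNC:", closest)
```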
Figure 6: Comparison of three methods

For the CDF method, we replaced the corresponding $f(x)$ by the Burr distribution. The Burr parameters for each sample were estimated using maximum likelihood estimation. The exact PNC for the experimental data is 0.168. This value is obtained by using the upper specification limit (USL = 24) and calculating the proportion of data that falls outside the specification limit. The results presented in Table 3 indicate that the expected PNC based on 30 samples of size 50 using the CDF method is the closest estimate to the exact PNC. Table 3 also indicates that the CDF method has the least variability of the three methods.

6. CONCLUSIONS
In this paper, a comparison between three methods of estimating the process capability and the proportion of non-conformance in the manufacturing industry is presented. The CDF method is not sensitive to the distribution of the process data and can therefore be applied to any real data set as long as a suitable distribution can be fitted to it. However, to apply the CDF method, one must identify the corresponding distribution. One of the significant characteristics of the Burr XII distribution is that, once the mean, variance, skewness and kurtosis of the process data are obtained, a suitable Burr distribution can be fitted using the Burr tables (Liu and Chen [7]). We therefore conclude that replacing the probability density function $f(x)$ in the CDF method with the appropriate Burr density function leads to a better estimate of the PCI and PNC of non-normal data. Simulation studies for different non-normal distributions show that the CDF method using the Burr distribution produces better estimates of the PCI. This paper strongly recommends further research to extend the CDF method to non-normal multivariate PCI studies.

APPENDIX I:
Standardized moments of skewness ($\alpha_3$) and kurtosis ($\alpha_4$) for the given sample size n can be computed as follows:

$$\alpha_3 = \frac{n-2}{\sqrt{n(n-1)}} \times Skewness \qquad (1)$$

where

$$Skewness = \frac{n}{(n-1)(n-2)} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s} \right)^{3} \qquad (2)$$

where $\bar{x}$ is the mean of the observations and $s$ is the standard deviation.

$$\alpha_4 = \frac{(n-2)(n-3)}{(n+1)(n-1)} \times Kurtosis + 3\,\frac{n-1}{n+1} \qquad (3)$$

where

$$Kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{j=1}^{n} \left( \frac{x_j - \bar{x}}{s} \right)^{4} - \frac{3(n-1)^2}{(n-2)(n-3)} \qquad (4)$$

APPENDIX II:
Conventionally, the capability index $C_p$ is defined as:

$$C_p = \frac{usl - lsl}{6\sigma} \qquad (1)$$

If the process X is normally distributed with mean µ and standard deviation σ, i.e. $X \sim N(\mu, \sigma^2)$, then (2). At face value, it is not obvious that (1) and (2) are equal. Here is the proof:
• We first note that (3).
• (Draw a normal graph and you will see this!) Since , we must also have that
which is equivalent to: (4)
1. Because the of is symmetric about the origin, (5)
2. By equation (3).
Finally, (**)
where we have used (3) and (5), which concludes the proof.

APPENDIX III:
In this paper we fit the Burr distribution function $f(x)$ to the process data and then evaluate the PCI using the CDF method. To fit the data with a Burr distribution, we need to estimate the parameters c and k. The likelihood function of the univariate Burr distribution is:

$$L(c, k; x_1, \ldots, x_n) = \frac{c^{n} k^{n} \prod_{i=1}^{n} x_i^{\,c-1}}{\prod_{i=1}^{n} \left(1 + x_i^{\,c}\right)^{k+1}} \qquad (1)$$

The univariate Burr distribution has two parameters, c and k. To estimate these parameters, the log-likelihood function with sample size n is:

$$\log L = n \log(c) + n \log(k) - (1+k) \sum_{i=1}^{n} \log\left(1 + x_i^{\,c}\right) + (c-1) \sum_{i=1}^{n} \log x_i \qquad (2)$$

The differential equations with respect to the parameters c and k are:

$$\frac{\partial l}{\partial c} = \frac{n}{c} + \sum_{i=1}^{n} \log x_i - (k+1) \sum_{i=1}^{n} \frac{x_i^{\,c} \log x_i}{1 + x_i^{\,c}} \qquad (3)$$

$$\frac{\partial l}{\partial k} = \frac{n}{k} - \sum_{i=1}^{n} \log\left(1 + x_i^{\,c}\right) \qquad (4)$$

In this paper, the unknown Burr parameters c and k have been determined by maximizing equation (2) using a systematic random search algorithm named “Simulated Annealing”.

REFERENCES
[1] M. Deleryd, K. Vannman (1999) Process capability plots: a quality improvement tool. Qual. Reliab. Engng. Int. 15:213–227
[2] L. C. Tang, S. E. Than (1999) Computing process capability indices for non-normal data: a review and comparative study. Qual. Reliab. Engng. Int. 15:339–353
[3] Johnson NL (1949) System of frequency curves generated by methods of translation. Biometrika 36:149–176
[4] Box GEP, Cox DR (1964) An analysis of transformation. J Roy Stat Soc B 26:211–243
[5] Somerville S, Montgomery D (1996) Process capability indices and non-normal distributions. Quality Engineering 19(2):305–316
[6] Clements JA (1989) Process capability calculations for non-normal distributions. Quality Progress 22:95–100
[7] Pei-Hsi Liu, Fei-Long Chen (2006) Process capability analysis of non-normal process data using the Burr XII distribution. Int J Adv Manuf Technol 27:975–984
[8] S. Ahmad, M. Abdollahian, P. Zeephongsekul (2007) Process capability analysis for nonquality characteristics using Gamma distribution. 4th international conference on information technology – new generations, USA, April 02-04: 425–430
[9] Wu HH, Wang JS, Liu TL (1998) Discussions of the Clements-based process capability indices. In: Proceedings of the 1998 CIIE National Conference, pp 561–566
[10] Burr IW (1942) Cumulative frequency distribution. Ann Math Stat 13:215–232
[11] Burr IW (1973) Parameters for a general system of distributions to match a grid of α3 and α4. Commun Stat 2:1–21
[12] Wierda SJ (1993) A multivariate process capability index. ASQC Quality Congress Transactions, Boston, MA, American Society for Quality Control: Milwaukee, WI, pp 342–348
[13] Castagliola P (1996) Evaluation of non-normal process capability indices using Burr’s distributions. Qual Eng 8(4):587–593
[14] Rodriguez RN (1977) A guide to the Burr type XII distributions. Biometrika 64:129–134
[15] Chou CY, Cheng PH (1997) Ranges control chart for non-normal data. J Chinese Inst Ind Eng 14(4):401–409
[16] Hatke MA (1949) A certain cumulative probability function. Ann Math Stat 20(3):461–463
[17] C.H. Yeh, F.C. Li, P.K. Wang (2006) Economic design of control charts with Burr distribution for non-normally data under Weibull shock models. 12th international conference on Reliability and Quality in Design, pp 323–327
[18] V.E. Kane (1986) Process capability indices. J. Qual. Technol. 18:41–52
[19] Montgomery D. Introduction to Statistical Quality Control, 5th edition. Wiley, New York
[20] Zimmer WJ, Burr IW (1963) Variables sampling plans based on non normal populations. Ind Qual. Control July:18–36
BRINGING INFORMATION RETRIEVAL BACK TO DATABASE MANAGEMENT SYSTEMS
Khaled Nagi Dept. of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt. khaled.nagi@eng.alex.edu.eg
ABSTRACT
Information retrieval emerged as an independent research area from traditional database management systems more than a decade ago. This was driven by the increasing functional requirements that modern full text search engines have to meet. Current database management systems (DBMS) are not capable of supporting such flexibility. However, with the increase of data to be indexed and retrieved and the increasingly heavy workloads, modern search engines suffer from scalability, reliability, distribution and performance problems. DBMS have a long tradition in coping with these challenges. Instead of reinventing the wheel, we propose using current DBMS as a backend to existing full text search engines. This way, we bring both worlds back together. We present a new and simple way of integration and compare the performance of our system to current implementations, which store the full text index directly on the file system.
Keywords: Full text search engines, DBMS, Lucene, performance evaluation, scalability.
1 INTRODUCTION

Most commercial database management systems offer basic phonetic full text search functionality. For example, Oracle has a module called Oracle Text [1]. Yet, seeking to add more functionality and intelligence to their search capabilities, many commercial applications use third party specialized full text search engines instead. There are several commercial products on the market, but Lucene [2] is certainly the most popular open-source product at the moment. It provides searching capabilities for the Eclipse IDE [3], the Encyclopedia Britannica CD-ROM/DVD, FedEx, New Scientist magazine, Epiphany, MIT’s OpenCourseWare [4] and so on.

All search engines build an index of the data to be retrieved in user queries. The index is always stored in the file system on disk and can be loaded into memory at startup (optional in Lucene) for faster querying. However, this is not feasible for large indices due to memory size limitations, so the standard storage usually remains the file system on disk. With the increase of data to be indexed and retrieved under heavy workloads of user queries, search engines suffer from scalability problems both in providing adequate response times for their users and in keeping good overall system throughput. To cope with these problems, search engines should provide more intelligent techniques for accessing the disk. Reliability also becomes a problem. The possibility of corrupting the whole index during a system crash is much higher than that of losing the data in a database after a similar crash. Restoring a corrupted index might also take several hours, thus complicating the situation even further. The search engine must manage its read and write locks by itself as well. Distributing the index among several sites and providing efficient mirroring techniques is becoming an important issue for large scale search engine projects such as Nutch [5].

Database management systems have a long tradition in coping with these challenges. Instead of reinventing the wheel, we try to bring both worlds together again in a new way. We propose using current DBMS as a backend to existing full text search engines, as opposed to either re-implementing full text search engine functionality in DBMS or re-implementing core DBMS features in search engines. As a case study, we use the open-source Lucene and MySQL without loss of generality. We use real world data extracted from an electronic marketplace and simulate real world workload traces in order to demonstrate that the overall system throughput and query response time do not suffer from the introduction of a DBMS as a backend with its inherent overhead. In some cases, some performance indices are even improved, which paves the way to using the whole spectrum of basic infrastructural facilities offered by DBMS, such as recovery, automatic replication, distribution, and segmentation. The rest of the paper is organized as follows.
Section 2 provides a background on full text search engines. Our proposed system integration is presented in Section 3. Section 4 contains the results of our performance evaluation and Section 5 concludes the paper.

2 BACKGROUND ON FULL TEXT SEARCH ENGINES

2.1 Typical Features
Full text search engines do not care about the source of the data or its format as long as it is converted to plain text. Text is logically grouped into a set of documents. The user application constructs the user query, which is submitted to the search engine. The result of the query execution is a list of document IDs that satisfy the predicate described in the query. The results are usually sorted according to an internal scoring mechanism using fuzzy query processing techniques [6]. The score is an indication of the relevance of the document, which can be affected by many factors. The phonetic difference between the search term and the hit is one of the most important factors. Some fields are boosted so that hits within these fields are more relevant to the search result than hits in other fields. Also, the distance between query terms found in a document can play a role in determining its relevance. E.g., when searching for “John Smith”, a document containing “John Smith” has a higher score than a document containing “John” at its beginning and “Smith” at its end. Furthermore, search terms can easily be augmented by searches with synonyms. E.g., searching for “car” retrieves documents with the term “vehicle” or “automobile” as well. This opens the door for ontological searches and other semantically richer similarity searches.

2.2 Architecture
As illustrated in Fig. 1, at the heart of a search engine resides an index. An index is a highly efficient cross-reference lookup data structure. In most search engines, a variation of the well-known inverted index structure is used [7]. An inverted index is an inside-out arrangement of documents such that terms take center stage. Each term refers to a set of documents. Usually, a B+-tree is used to speed up traversing the index structure.

The indexing process begins with collecting the available set of documents by the data gatherer. The parser converts them to a stream of plain text. For each document format, a parser has to be implemented. In the analysis phase, the stream of data is tokenized according to predefined delimiters and a number of operations are performed on the tokens. For example, the tokens could be lowercased before indexing. It is also desirable to remove all stop words. Additionally, it is common to reduce them to their roots to enable phonetic and grammatical similarity searches.

The search process begins with parsing the user query. The tokens and the Boolean operators are extracted. The tokens have to be analyzed by the same analyzer used for indexing. Then, the index is traversed for possible matches in order to return an ordered collection of hits. The fuzzy query processor is responsible for defining the match criteria during the traversal and the score of the hit.
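The indexing and search steps described above can be illustrated with a toy inverted index. The sketch below is not Lucene code; it is a minimal Python illustration in which the stop word list, the function names and the three sample documents are assumptions made for the example. It covers tokenizing, lowercasing, stop word removal and a Boolean AND query, and leaves out stemming, scoring and the B+-tree.

```python
import re
from collections import defaultdict

# Illustrative stop word list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "the", "of", "at", "its"}


def analyze(text):
    """Tokenize on non-alphanumeric delimiters, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(documents):
    """Inverted index: each term refers to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Boolean AND query; the query goes through the same analyzer as the documents."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "John Smith sells the car",
    2: "Smith repairs an automobile",
    3: "John drives a vehicle",
}
index = build_index(docs)
hits = search_all(index, "John Smith")
```

Running the query through `analyze` before the lookup is what makes “John” in the query match the lowercased term stored in the index.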
Figure 1: Architecture of a full text search engine

2.3 Typical Operations

2.3.1 Complete index creation
This operation usually occurs once. The whole set of documents is parsed and analyzed in order to create the index from scratch. This operation can take several hours to complete.

2.3.2 Full text search
This operation includes processing the query and returning page hits as a list of document IDs sorted according to their relevance.

2.3.3 Index update
This operation is also called incremental indexing. It is not supported by all search engines. Typically, a worker thread of the application monitors the actual inventory of documents. In case of document insertion, update, or deletion, the index is changed on the spot and its content is immediately made searchable. Lucene supports this operation.

3 PROPOSED SYSTEM INTEGRATION

3.1 Architecture
Lucene divides its index into several segments. The data in each segment is spread across several files. Each index file carries a certain type of information. The exact number of files that constitute a Lucene index and the exact number of segments vary from one index to another and depend on the number of fields the index contains. The internal structure of the index file is public and platform independent [8]. This ensures its portability. We take the index file as our basic building block and store it in the MySQL database as illustrated in Fig. 2. The set of files, i.e. the logical directory, is mapped to one database relation. Due to the huge variation in file sizes, we divide each file into multiple chunks of fixed length. Each chunk is stored in a separate tuple in the relation. This leads to better performance than storing the whole file as a CLOB in the database. The primary key of the tuple is the filename and the chunk id. Other normal file attributes, such as its size and the timestamp of the last change, are stored in the tuple next to the content. We provide standard random file access operations based on the above mentioned mapping. Using this simple mapping, we do not violate the public index file format and present a simple yet elegant way of choosing between the different file storage media (file system, RAM files, or database).

Figure 2: Integrating Lucene index in MySQL database

3.2 System Design
Fig. 3 illustrates the UML class diagram of the store package of Lucene. We only include the relevant classes. The newly introduced classes are grayed. Directory is an abstract class that acts as a container for the index files. Lucene comes with two implementations: a file system directory (FSDirectory) and an in-RAM index (RAMDirectory). Directory provides the declaration of all basic file operations, such as listing all file names, checking the existence of a file, returning its length, changing its timestamp, etc. It is also responsible for opening files by returning an InputStream object and for creating a new file by returning a reference to a new instance of the OutputStream class. We provide a database specific implementation, DBDirectory, which maps these operations to SQL operations on the database.

Both InputStream and OutputStream are abstract classes that mimic the functionality of their java.io counterparts. Basically, they implement the transformation of the file contents into a stream of basic data types, such as integer, long, byte, etc., according to the standardized internal file format [8]. Actual reading from and writing to the file buffer remain abstract methods in order to decouple the classes from their physical storing mechanism. Similar to FSInputStream and RAMInputStream, we provide the database dependent implementation of the readInternal and seekInternal methods. Moreover, the DBOutputStream provides the database specific flushing of the file buffer after the different write operations. Other buffer management operations are also implemented.

Both DBInputStream and DBOutputStream use the central class DBFile. A DBFile object provides access to the correct file chunk stored in a separate tuple in the database. It also provides a caching mechanism for keeping recently used file chunks in memory. The size of the cache is dynamically adjusted to make use of the available free memory of the system. The class is responsible for guaranteeing the coherency of the cache.

Figure 3: UML class diagram of the store package after modification
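The file-to-chunk mapping of Section 3.1 can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the paper uses Java with MySQL over JDBC, whereas the sketch uses Python with an in-memory SQLite database as a stand-in. The table name `index_file`, the function names and the tiny chunk size are hypothetical, and the file attributes (size, timestamp) and the DBFile cache are left out.

```python
import sqlite3

CHUNK_SIZE = 4  # deliberately tiny so that the example file spans several chunks


def open_store():
    """Create the relation: one tuple per chunk, keyed by (file name, chunk id)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE index_file ("
        " name TEXT NOT NULL, chunk_id INTEGER NOT NULL, content BLOB NOT NULL,"
        " PRIMARY KEY (name, chunk_id))"
    )
    return db


def write_file(db, name, data):
    """Split the file content into fixed-length chunks and store each in its own tuple."""
    db.execute("DELETE FROM index_file WHERE name = ?", (name,))
    for chunk_id, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        db.execute(
            "INSERT INTO index_file (name, chunk_id, content) VALUES (?, ?, ?)",
            (name, chunk_id, data[start:start + CHUNK_SIZE]),
        )
    db.commit()


def read_bytes(db, name, offset, length):
    """Random access: fetch only the chunks that overlap [offset, offset + length)."""
    if length <= 0:
        return b""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    rows = db.execute(
        "SELECT content FROM index_file"
        " WHERE name = ? AND chunk_id BETWEEN ? AND ? ORDER BY chunk_id",
        (name, first, last),
    ).fetchall()
    joined = b"".join(bytes(row[0]) for row in rows)
    skip = offset - first * CHUNK_SIZE
    return joined[skip:skip + length]


db = open_store()
write_file(db, "segments", b"0123456789abcdef")
middle = read_bytes(db, "segments", 5, 6)  # touches chunks 1 and 2 only
```

Because a read only touches the chunks that overlap the requested byte range, a seek into a large file does not require loading the whole file, which is one plausible reason for the advantage over a single CLOB that the text reports.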
4 PERFORMANCE EVALUATION

In order to evaluate the performance of our proposed system, we build a full text search engine on the data of a neutralized version of a real electronic marketplace. The index is built over the textual description of more than one million products. Each product contains approximately 25 attributes varying from a few characters to more than 1300 characters each. We develop a performance evaluation toolkit around the search engine as illustrated in Fig. 4. The workload generator composes queries of single terms, which are randomly extracted from the product description. It submits them in parallel to the application. The product update simulator mimics product changes and submits the new content to the application in order to update the Lucene index. The application consists of the modified Lucene kernel supporting both file system and database storage options for the full text index. The application under test manages two pools of worker threads. The first pool consists of searcher threads that process the search queries coming from the workload generator. The second pool consists of index updater threads that process the updated content coming from the product update simulator. The performance of the system is monitored using the performance monitor unit.

Figure 4: Components of the performance evaluation toolkit.

4.1 Input Parameters and Performance Metrics
We choose the maximum number of fetched hits to be 20 documents. This is a reasonable assumption taking into consideration that no more than 20 hits are usually displayed on a web page. The number of search threads is varied from 1 to 25, enabling the concurrent processing of 25 search queries. Due to locking restrictions inherent in Lucene, we restrict our experiments to a maximum of one index update thread. We also introduce a think time varying from 20 to 100 milliseconds between successive index update requests to simulate the format specific parsing of the updated products. In all our experiments, we monitor the overall system throughput in terms of conducted:
• searches per second, and
• index updates per second.
We also monitor the response time of:
• the searches, and
• the index updates
from the moment of submitting the request until receiving the result.

4.2 System Configuration
In our experiments we use a dual core Intel Pentium 3.4 GHz processor, 2 GB RAM 667 MHz and one hard disk having 7200 RPM, an access time of 13.2 ms, a seek time of 8.9 ms and a latency of 4 ms. The operating system is Windows XP. We use JDK 1.4.2, MySQL version 5.0, JDBC mysqlconnector version 3.1.12, and Lucene version 1.4.3.

4.3 Experiment Results
The performance evaluation considers the main operations: complete index creation, simultaneous full text search over single terms under various workloads, and - in parallel - performing index updates as product data change. The experiments are conducted for the file system index and the database index. We drop the RAM directory from our consideration, since the index under investigation is too large to fit into the 1.5 GB heap size provided by Java under Windows.

4.3.1 Complete index creation
Building the complete index from scratch on the file system takes about 28 minutes. We find that the best way to create the complete index for the database is to first create a working copy on the file system and then to migrate the index from the file system to the database using a small utility that we developed to migrate the index from one storage to the other. This migration takes 3 minutes 19 seconds to complete. Thus, the overhead in this one time operation is less than 12%.

4.3.2 Full text search
In this set of experiments, we vary the number of search threads from 1 to 25 concurrent worker threads and compare the system throughput, illustrated in Fig. 5, and the query response time, illustrated in Fig. 6, for both index storage techniques. We find that the performance indices are enhanced by a factor > 2. The search throughput jumps from around 1,250,000 searches per hour to almost 3,000,000 searches per hour in our proposed system. The query response time is lowered by 40%, decreasing from 0.8 second to 0.6 second on average. This is a very important result because it means that we increase the performance and gain the robustness and scalability advantages of database management systems on top in our proposed system.
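The ratios quoted in Sections 4.3.1 and 4.3.2 can be recomputed from the rounded figures given in the text. The sketch below is plain arithmetic; the variable names are hypothetical, and the inputs are the rounded values from the text rather than the raw measurements.

```python
# Rounded figures quoted in Sections 4.3.1 and 4.3.2.
fs_build_seconds = 28 * 60       # complete index creation on the file system
migration_seconds = 3 * 60 + 19  # migrating the index into the database

fs_searches_per_hour = 1_250_000
db_searches_per_hour = 3_000_000

fs_response_seconds = 0.8
db_response_seconds = 0.6

creation_overhead = migration_seconds / fs_build_seconds
throughput_factor = db_searches_per_hour / fs_searches_per_hour
response_reduction = (fs_response_seconds - db_response_seconds) / fs_response_seconds

print(f"index creation overhead: {creation_overhead:.1%}")
print(f"search throughput factor: {throughput_factor:.2f}")
print(f"response time reduction: {response_reduction:.0%}")
```

The first two results agree with the statements "less than 12%" and "factor > 2". The rounded response times of 0.8 s and 0.6 s correspond to a 25% reduction, so the 40% stated in the text cannot be reproduced from these rounded values alone.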
Figure 5: Search throughput in an update free environment

Figure 6: Search response time in an update free environment

4.3.3 Index update
In this set of experiments, we enable the incremental indexing option and repeat the experiments of Section 4.3.2 for different settings of the think time between successive updates. In order to highlight the effect of incremental indexing, we choose very high index update rates by varying the think time from 20 to 100 milliseconds. For readability purposes, we only plot the results of the experiments having a think time of 40 and 80 milliseconds. In real life, we do not expect this exaggerated index update frequency.

Fig. 7 demonstrates that the throughput of the index update thread in our proposed system is slightly better than that of the file system based implementation. However, Fig. 8 shows that the response time of the index update operation in our system is worse than the original one. We attribute this to an inherent problem in Lucene. During an index update, the whole index is exclusively locked by the index updater thread. This is too restrictive. In our implementation, we keep this exclusive lock, although the database management system also keeps its own locks at the level of tuples. Tuple-level locking is less restrictive and would allow for more than one index update thread and certainly for more concurrent searches. The extra overhead of holding both locks leads to the increase in the system response time. The good news is that the response time always remains under the absolute level of 25 seconds, which is acceptable for most applications taking into consideration the high update rate.

Figure 7: Index update throughput

Figure 8: Index update response time

The search performance of our proposed system becomes very comparable to the original file system based implementation in an environment suffering from a high rate of index updates. Fig. 9 shows that the search throughput of the proposed system is slightly better than that of the file system based implementation, whereas Fig. 10 shows that our database index suffers from a slightly higher response time than the original system. Again, the effect of the exclusive lock over the whole index during index update is noticeable when comparing the performance indices of Fig. 5 and Fig. 6 to those of Fig. 9 and Fig. 10, respectively. The search throughput drops from 3,000,000 to around 1,100,000 searches per hour and the response time increases from 0.6 seconds to around 3 seconds.
Figure 9: Search throughput in an environment with high update rate.

Figure 10: Search response time in an environment with high update rate.

5 CONCLUSION AND FUTURE WORK

In this paper, we attempt to bring information retrieval back to database management systems. We propose using commercial DBMS as a backend to existing full text search engines. By doing so, today’s search engines directly gain the robustness, scalability, distribution and replication features provided by DBMS. In our case study, we provide a simple system integration of Lucene and MySQL without loss of generality. We build a performance evaluation toolkit and conduct several experiments on real data of an electronic marketplace. The results show that, for typical full text search engine operations, we reach system throughput and response times comparable to those of the current implementation, which stores the index directly in the file system on disk. In several cases, we even reach much better results, which means that we gain the robustness and scalability of DBMS on top. Yet, this is only the beginning. We plan on mapping the whole internal index structure into a logical database schema instead of just taking the file chunk as the smallest building block. This will solve the restrictive locking problem inherent in Lucene and will definitely boost overall performance. We also plan on extending our performance evaluation toolkit to work on several sites of a distributed database.

REFERENCES
[1] Oracle Text. An Oracle Technical White Paper, http://www.oracle.com/technology/products/text/pdf/10gR2text_twp_f.pdf (2005).
[2] Apache Lucene, http://lucene.apache.org/java/docs/index.html.
[3] B. Hermann, C. Müller, T. Schäfer, and M. Mezini: Search Browser: An efficient index based search feature for the Eclipse IDE, Eclipse Technology eXchange workshop (eTX) at ECOOP (2006).
[4] MIT OpenCourseWare, MIT Reports to the President (2003–2004).
[5] Nutch home page, http://lucene.apache.org/nutch/
[6] D. Cutting, J. Pedersen: Space Optimizations for Total Ranking, Proceedings of RIAO (1997).
[7] D. Cutting, J. Pedersen: Optimizations for Dynamic Inverted Index Maintenance, Proceedings of SIGIR (1990).
[8] Apache Lucene - Index File Formats, http://lucene.apache.org/java/docs/fileformats.html.
UbiCC Journal, Volume 3, January 2008
WEB-BASED DECISION SUPPORT SYSTEMS AS KNOWLEDGE REPOSITORIES FOR KNOWLEDGE MANAGEMENT SYSTEMS
Yuri Boreisha Minnesota State University Moorhead, USA Boreisha@mnstate.edu Oksana Myronovych North Dakota State University, USA Oksana.Myronovych@ndsu.nodak.edu
ABSTRACT Problem solving and learning processes conducted on the basis of contemporary Web-based DSS provide for the development and enhancement of knowledge management systems. Knowledge objects form the foundation of a conceptual approach to knowledge management based on contemporary Internet technologies and the knowledge accumulated in DSS. Keywords: knowledge management systems, decision support systems.
1 INTRODUCTION
Knowledge management (KM) has become an important theme as managers realize that much of their firm’s value depends on its ability to create and manage knowledge. To transform information into knowledge, a firm must use additional resources to discover patterns, rules, and the context where the knowledge works [1-3]. Knowledge that is not shared and applied to practical problems does not add business value. Today people can share their knowledge in three primary ways. Organizational information systems (IS) that store, manage, and deliver documents are called content management systems (CMS). With the arrival of modern communications technology, people can share their knowledge via collaborative knowledge management systems (KMS). In addition to content management and collaboration, knowledge can be shared via expert systems. A comprehensive discussion of the important dimensions of knowledge, the knowledge management value chain, and the types of KMS can be found in [2, 3].
Web 2.0 companies use the Web as a platform to create collaborative, community-based sites (e.g., social networking sites, blogs, wikis, etc.). The Web has now become an application, development, delivery, and execution platform [4]. Software as a Service (SaaS), application software that runs on a Web server rather than being installed on the client computer, has gained popularity, particularly with businesses. Collaborating on projects with co-workers across the world is easier, since information is stored on a Web server instead of on a single desktop. Rich Internet Applications (RIAs) are Web applications that offer responsiveness, “rich” features and functionality approaching that of desktop applications. RIAs are the result of today’s more advanced technologies (such as Ajax) that allow greater responsiveness and advanced GUIs. Web services have emerged and, in the process, have inspired the creation of many Web 2.0 businesses. Web services allow you to incorporate functionality from existing applications and Web sites into your own applications quickly and easily.
Web 2.0 companies use “data mining” to extract as much meaning as they can from XHTML-encoded pages. XHTML-encoded content does not explicitly convey meaning, but XML-encoded content does. So if we can encode in XML (and derivative technologies) much or all of the content on the Web, we will take a great leap forward towards realizing the Semantic Web. Many people consider the Semantic Web to be the next generation in Web development, one that helps to realize the full potential of the Web, the “Web of meaning”. Though Web 2.0 applications are finding meaning in the content, the Semantic Web (heavily dependent on XML and XML-based technologies) will attempt to make that meaning clear to computers as well as humans [5].
These trends in Web Science, the new science of decentralized information systems, provide new opportunities in KM. In this paper we consider contemporary Decision Support Systems (DSS) as knowledge repositories that can be expanded to KMS using Web 2.0 software development technologies and tools. This paper is based on a series of the authors’ previous publications [6-11].

2 KNOWLEDGE MANAGEMENT AND DECISION SUPPORT SYSTEMS
The AI representation principle states that once a problem is described using an appropriate representation, the problem is almost solved. Well-known knowledge representation techniques include rule-based systems, semantic nets and frame systems [12]. KM refers to the set of business processes developed in an organization to create, store, transfer and apply knowledge. KM increases the ability of the organization to learn from its environment and to incorporate knowledge into business processes. There are three major categories of KMS: enterprise-wide KMS, knowledge work systems (KWS), and intelligent techniques [2, 3]. Enterprise-wide KMS are general purpose, integrated, firm-wide efforts to collect, store, disseminate, and use digital content and knowledge. Such systems provide databases and tools for organizing and storing structured and unstructured documents and other knowledge objects, directories and tools for locating employees with experience in a particular area, and, increasingly, Web-based tools for collaboration and communication. KWS (such as computer-aided design, visualization, and virtual reality systems) are specialized systems built for engineers, scientists, and other knowledge workers charged with discovering and creating new knowledge for a company. A diverse group of intelligent techniques (such as data mining, neural networks, expert systems, case-based reasoning, fuzzy logic, genetic algorithms, and intelligent agents) have different objectives, from a focus on discovering knowledge (data mining and neural networks), to distilling knowledge in the form of rules for a computer program (expert systems and fuzzy logic), to discovering optimal solutions for problems (genetic algorithms). It is said that effective KM is 80% managerial and organizational, and 20% technology.
One of the first challenges that firms face when building knowledge repositories of any kind is the problem of identifying the correct categories to use when classifying documents. Firms are increasingly using a combination of internally developed taxonomies and search engine techniques. Organizations acquire knowledge in a number of ways, depending on the type of knowledge they seek. Once the corresponding documents, patterns, and expert rules are discovered, they must be stored so they can be retrieved and used. Knowledge storage generally involves databases, document management systems, expert systems, etc. To provide a return on investment, knowledge should become a systematic part of the organizational problem solving process. Ultimately, new knowledge should be built into a firm’s business processes and key application systems. KMS and related knowledge repositories should facilitate the problem solving process (Figure 1).
During the process of solving problems, managers engage in decision making, the act of selecting from alternative problem solutions. The different levels in an organization (strategic, management, and operational) have different decision-making requirements. Decisions can be structured, semi-structured or unstructured. Structured decisions are clustered at the operational level of the organization, and unstructured decisions at the strategic level. Management information systems (MIS) provide information on firm performance to help managers monitor and control the business, often in the form of fixed, regularly scheduled reports based on data summarized from the firm’s transaction processing systems (TPS). MIS support structured decisions and some semi-structured decisions. DSS combine data, sophisticated analytical models and tools, and user-friendly software into a single powerful system that can support semi-structured and unstructured decision making [3, 13, 14].
The main components of the DSS are the DSS database, the user interface, and the DSS software system (Figure 2). The DSS database is a collection of current data from a number of applications and groups. Alternatively, the DSS database may be a data warehouse that integrates the enterprise data sources and maintains historical data. The DSS user interface permits easy interaction between users of the system and the DSS software tools. Many DSS today have Web interfaces to take advantage of graphics displays, interactivity, and ease of use. The DSS software system contains the software tools that are used for data analysis. It may contain various OLAP tools, data mining tools, or a collection of mathematical and analytical models that can easily be made accessible to the DSS users.
Figure 1: Elements of the problem solving process. [Diagram elements: Problem; Alternative solutions (DSS); Constraints; Standards (desired state); Information (current state); Problem solver (manager); Solution.]
The dialog manager is also in charge of information visualization. Finally, access to the Internet, networks, and other computer-based systems permits the DSS to tie into other powerful systems, including the TPS or function-specific subsystems.
There are many kinds of DSS. The first generic type of DSS is a Data-Driven DSS. These systems include file drawer and management reporting systems, data warehousing and analysis systems, Executive Information Systems and Spatial DSS. Data-Driven DSS emphasize access to and manipulation of large databases of structured data, especially time series of internal company data and sometimes external data. Relational databases accessed by query and retrieval tools provide an elementary level of functionality. Data warehouse systems that allow the manipulation of data by computerized tools tailored to a specific task and setting, or by more general tools and operations, provide additional functionality. Data-Driven DSS with Online Analytical Processing (OLAP) provide the highest level of functionality and decision support that is linked to analysis of large collections of historical data.
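The step from plain query-and-retrieval access to an OLAP-style roll-up can be illustrated with a small sketch. It uses Python's standard sqlite3 module; the table name `sales`, its columns, and all figures are invented for illustration and do not come from any system discussed here.

```python
import sqlite3

# Hypothetical time series of internal company data (all values invented).
ROWS = [
    ("North", "2007-Q1", 120.0),
    ("North", "2007-Q2", 135.0),
    ("South", "2007-Q1", 80.0),
    ("South", "2007-Q2", 95.0),
]

def build_db():
    """Create an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
    return conn

def retrieve(conn, region):
    """Elementary level: plain query and retrieval of stored rows."""
    cur = conn.execute(
        "SELECT quarter, revenue FROM sales WHERE region = ? ORDER BY quarter",
        (region,),
    )
    return cur.fetchall()

def roll_up(conn, dimension):
    """OLAP-style roll-up: total revenue aggregated along one dimension."""
    if dimension not in ("region", "quarter"):  # whitelist: name is spliced into SQL
        raise ValueError(dimension)
    cur = conn.execute(
        f"SELECT {dimension}, SUM(revenue) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(cur.fetchall())
```

Calling `roll_up(conn, "region")` and then `roll_up(conn, "quarter")` on the same data gives two different summaries of one fact table, which is the kind of manipulation a Data-Driven DSS exposes to the decision-maker.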
Figure 2: Main components of the DSS. [Diagram elements: Internal Data; External Data; DSS Database/Data Warehouse; DSS Software System (Models, OLAP Tools, Data Mining Tools); User Interface (Dialog Manager); Users.]
A second category, Model-Driven DSS, includes systems that use accounting and financial models, representational models, and optimization models. Model-Driven DSS emphasize access to and manipulation of a model. Simple statistical and analytical tools provide an elementary level of functionality. Some OLAP systems that allow complex analysis of data may be classified as hybrid DSS providing modeling, data retrieval, and data summarization functionality. Model-Driven DSS use data and parameters provided by decision-makers to aid them in analyzing a situation, but they are not usually data intensive. Very large databases are usually not needed for Model-Driven DSS.
Knowledge-Driven DSS or Expert Systems can suggest or recommend actions to managers. These DSS are human-computer systems with specialized problem-solving expertise. The expertise consists of knowledge about a particular domain, understanding of problems within that domain, and skills at solving some of these problems (AI algorithms and solutions can be used). A related concept is data mining. It refers to a class of analytical applications that search for hidden patterns in a database. Data mining is the process of sifting through large amounts of data to produce data content relationships. Tools used for building Knowledge-Driven DSS are sometimes called Intelligent Decision Support methods.
Document-Driven DSS are evolving to help managers retrieve and manage unstructured documents and Web pages. A Document-Driven DSS integrates a variety of storage and processing technologies to provide complete document retrieval and analysis. The WWW provides access to large document databases, including databases of hypertext documents, images, sounds and video. Examples of documents that would be accessed by a Document-Driven DSS are policies and procedures, product specifications, catalogs, and corporate historical documents, including minutes of meetings, corporate records, and important correspondence. Search engines are powerful decision-aiding tools associated with Document-Driven DSS.
Group DSS (GDSS) came first, but now a broader category of Communications-Driven DSS or groupware can be identified. These DSS include communication, collaboration and related decision support technologies. They are hybrid DSS that emphasize the use of both communications and decision models to facilitate the solution of problems by decision-makers working together as a group. Groupware supports electronic communication, scheduling, document sharing, and other group productivity and decision support enhancing activities. A DSS model that incorporates Group Decision Support, OLAP, and AI is shown in Figure 3.
Figure 3: A DSS model that incorporates GDS, OLAP, and AI. [Diagram elements: Relational Database; Knowledge Database; Multidimensional Database; Relational DBMS; Inference Engine; Multidimensional DBMS; Report Writing Software; Mathematical Models; Groupware; Periodic and special reports; Outputs from mathematical models; Outputs from groupware; Solutions and explanations; Outputs from OLAP.]
DSS facilitate decision making. Decision making is an integral part of the overall problem solving process. KMS should facilitate the problem solving process. In the next section we discuss how Web-enabled DSS can be integrated into contemporary KMS.

3 WEB-ENABLED DECISION SUPPORT SYSTEMS
All types of DSS can be deployed using Web technologies and can become Web-based DSS. Managers increasingly have Web access to data warehouses and analytical tools. To discuss the recent trends in this area, the latest achievements in three-layer design, Rich Internet Applications (RIA), and Web services should be taken into account.
Three-layer design is an effective approach to developing robust and easily maintainable systems. The corresponding architecture is appropriate for systems that need to support multiple user interfaces. Contemporary Web applications are three-layer applications. The most common set of layers includes the following:
Data layer that manages stored data, usually in one or more databases.
Business logic (domain) layer that implements the rules and procedures of the business processing.
View layer that accepts input and formats and displays processing results.
RIA have two key attributes: performance and a rich GUI. RIA performance comes from Ajax (Asynchronous JavaScript and XML), which uses client-side scripting to make Web applications more responsive by separating client-side user interaction and server communication, and running them in parallel. Various ways to develop Ajax applications are discussed in [5].
Web services promote software portability and reusability in applications that operate over the Internet. Web services represent a transition to service-oriented, component-based, distributed applications. Web services are applications implemented as Web-based components with well-defined interfaces, which offer certain functionality to clients via the Internet. Once deployed, Web services can be discovered and used or reused by consumers (clients, other services or applications) as building blocks via open industry-standard protocols. Web service architecture is built on open standards and vendor-neutral specifications. Services can be implemented in any programming language, deployed and then executed on any operating system or software platform.
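The three-layer split described above can be made concrete with a minimal sketch. It is an illustration only: the class names, the `orders` table and the discount rule are all hypothetical, and a real Web-enabled DSS would place a Web service between the view and the business logic.

```python
import sqlite3

class DataLayer:
    """Data layer: manages stored data (here an in-memory SQLite table)."""
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

    def add_order(self, customer, amount):
        self.conn.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))

    def amounts_for(self, customer):
        cur = self.conn.execute(
            "SELECT amount FROM orders WHERE customer = ?", (customer,)
        )
        return [row[0] for row in cur]

class BusinessLogicLayer:
    """Business logic layer: applies a made-up rule on top of the data layer."""
    DISCOUNT_THRESHOLD = 1000.0  # hypothetical business rule

    def __init__(self, data):
        self.data = data

    def customer_summary(self, customer):
        total = sum(self.data.amounts_for(customer))
        return {
            "customer": customer,
            "total": total,
            "discount": total >= self.DISCOUNT_THRESHOLD,
        }

class TextView:
    """View layer: formats results; an HTML or JSON view could reuse the same logic."""
    def __init__(self, logic):
        self.logic = logic

    def render(self, customer):
        s = self.logic.customer_summary(customer)
        flag = "yes" if s["discount"] else "no"
        return f"{s['customer']}: total={s['total']:.2f}, discount={flag}"
```

Because the view only talks to the business logic layer, a second user interface can be added without touching the data layer, which is the property that makes this design suitable for systems with multiple user interfaces.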
Figure 4: Web-enabled DSS. [Diagram elements: Internal Data; External Data; DSS Database/Data Warehouse; Web Services provide access to DSS Software System; Ajax-Enabled Applications implement Dialog Manager; Internet; Users.]
The service-oriented architecture (SOA) provides the theoretical model for all Web services. The model behind Web services is a loosely coupled architecture, consisting of different software components working together. Consuming Web services is based on open standards managed by broad consortia (e.g., World
Wide Web Consortium, Organization for the Advancement of Structured Information Standards, Web Services Interoperability Organization). What makes Web services different from ordinary Web sites is the type of interaction that they can provide. Most of the enthusiasm surrounding Web services is based on the promise of interoperability. Every software application in the world can potentially talk to every other software application. This communication can take place across the old boundaries of location, operating system, language, protocol, and so on.
Three-layer architecture maps well onto the structure of the main components of the DSS (see Figure 2). RIA provide for efficient implementation of the Dialog Manager GUI for DSS. Web services allow functionality from existing applications to be incorporated and thereby provide access to the DSS Software System through the SOA. The components of the Web-enabled DSS are shown in Figure 4.
We call the following group of related components a knowledge object (Figure 5). The techniques discussed allow the creation of new Web services (based on the existing ones and on contemporary DSS software systems) and of Ajax-enabled applications that interact with these Web services. So we can talk about the creation and modification of knowledge objects.
Figure 5: Structure of a knowledge object. [Diagram elements: DSS Database/Data Warehouse; Web Service; Ajax-Enabled Application.]
Web-enabled DSS provide for expandable collections of knowledge objects that constitute the knowledge repository of the corresponding KMS. From this point of view, knowledge objects can be considered a knowledge representation technique.

4 PROBLEM SOLVING AND LEARNING
AI distinguishes two general kinds of learning. The first kind is based on coupling new information to previously acquired knowledge. Typical examples include learning by analyzing differences, by managing multiple models, by explaining experience, and by correcting mistakes. The second kind is based on digging useful regularity out of data, a practice often referred to as data mining. Typical examples include learning by recording cases, by building identification trees, by training neural nets, by training perceptrons, by training approximation nets, and by simulating evolution (e.g., genetic algorithms).
Expert systems primarily capture the tacit knowledge of individual experts, but organizations also have collective knowledge and expertise that they have built up over the years. This organizational knowledge can be captured and stored using case-based reasoning (CBR). In CBR, descriptions of the past experiences of human specialists, represented as cases, are stored in a database for later retrieval when the user encounters a new case with similar parameters. The system searches for stored cases with problem characteristics similar to the new one, finds the closest fit, and applies the solution of the old case to the new case. Successful solutions are tagged to the new case and both are stored together with the other cases in the knowledge base. Unsuccessful solutions are also appended to the case database along with explanations as to why the solutions did not work.
Problem-based learning (PBL) is (along with active learning and cooperative/collaborative learning) one of the most important developments in contemporary higher education. PBL is based on the assumption that human beings evolved as individuals who are motivated to solve problems, and that problem solvers will seek and learn whatever knowledge is needed for successful problem solving. PBL is a typical example of an application of the first type of learning in higher education [11].
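The CBR cycle described above (search for similar stored cases, take the closest fit, store the new case with its outcome) can be sketched as follows. This is an illustrative sketch only: the text does not specify a similarity measure, so the simple attribute-overlap score used here is an assumption, as are all class and field names.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    features: dict           # problem characteristics, e.g. {"domain": "sales"}
    solution: str
    successful: bool = True
    note: str = ""           # explanation, mainly for unsuccessful solutions

def similarity(a, b):
    """Assumed measure: fraction of feature keys (union) whose values agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)

@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def closest(self, features):
        """Return the best-matching successful case, or None if none is stored."""
        candidates = [c for c in self.cases if c.successful]
        if not candidates:
            return None
        return max(candidates, key=lambda c: similarity(c.features, features))

    def record(self, features, solution, successful, note=""):
        """Store the new case with its outcome, whether it worked or not."""
        self.cases.append(Case(dict(features), solution, successful, note))
```

Unsuccessful cases are kept in the case base together with their explanatory note but are skipped during retrieval; a fuller system could instead use them to warn the user against a solution that has already failed.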
Combining the main ideas of CBR and PBL, the problem solving and learning process can be depicted as shown in Figure 6.
Figure 6: Problem solving and learning with knowledge objects. [Flowchart boxes: User describes the problem; User learns about the knowledge objects that facilitate the problem solving; System searches Repository of knowledge objects for the suitable ones; Repository of knowledge objects (based on a Web-enabled DSS); System asks user additional questions to narrow search; System finds the closest fit and provides access to knowledge objects; System stores the problem description and the knowledge object in the repository; New knowledge object is created to better fit the problem.]

5 CONCLUSIONS
Knowledge is a complex phenomenon, and there are many aspects to the process of managing knowledge. Knowledge-based core competencies of firms are key organizational assets. Knowing how to do things effectively and efficiently in ways that other organizations cannot duplicate is a primary source of profit and competitive advantage that cannot be purchased easily by competitors in the marketplace. This paper discusses Web-enabled DSS, related knowledge repositories, and KMS that facilitate problem solving and learning. The knowledge objects approach to knowledge representation allows contemporary DSS to be considered as integrated parts of the corresponding KMS.

6 REFERENCES
[1] V. Supyuenyong, N. Islam: Knowledge Management Architecture: Building Blocks and Their Relationships, Technology Management for the Global Future, Vol. 3, pp. 1210-1219 (2006).
[2] K.C. Laudon, J.P. Laudon: Management Information Systems. Managing the Digital Firm, Prentice Hall, pp. 428-508 (2006).
[3] R. McLeod, G. Schell: Management Information Systems, 10th Edition, Prentice Hall, pp. 250-274 (2006).
[4] P.J. Deitel, H.M. Deitel: Internet and World Wide Web. How to Program, 4th Edition, Prentice Hall, pp. 50-117 (2008).
[5] T. Berners-Lee, et al: A Framework for Web Science, Foundations and Trends in Web Science, Vol. 1, No 1, pp. 1-130 (2006).
[6] Y. Boreisha, O. Myronovych: Web-Based Decision Support Systems in Knowledge Management and Education, Proceedings of the 2007 International Conference on Information and Knowledge Engineering, IKE’07, June 25-28, Las Vegas, USA, pp. 11-17 (2007).
[7] Y. Boreisha, O. Myronovych: Web Services-Based Virtual Data Warehouse as an Integration and ETL Tool, Proceedings of the 2005 International Symposium on Web Services and Applications, ISWS’05, June 27-30, Las Vegas, USA, pp. 52-58 (2005).
[8] Y. Boreisha, O. Myronovych: Data-Driven Web Sites, WSEAS Transactions on Computers, Vol. 2, No 1, pp. 79-83 (2003).
[9] Y. Boreisha: Database Integration Over the Web, Proceedings of the International Conference on Internet Computing, IC’02, June 24-27, Las Vegas, USA, pp. 1088-1093 (2002).
[10] Y. Boreisha: Internet-Based Data Warehousing, Proceedings of SPIE Internet-Based Enterprise Integration and Management, Vol. 4566, pp. 102-108 (2001).
[11] Y. Boreisha, O. Myronovych: Knowledge Navigation and Evolutionary Prototyping in E-Learning Systems, Proceedings of the E-Learn 2005 World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, October 24-28, Vancouver, Canada, pp. 552-559 (2005).
[12] P.H. Winston: Artificial Intelligence, Addison-Wesley, pp. 15-228 (1992).
[13] S. French, M. Turoff: Decision Support Systems, Communications of the ACM, Vol. 50, No 3, pp. 39-40 (2007).
[14] Chien-Chih Yu: A Web-Based Consumer-Oriented Intelligent Decision Support System for Personalized E-Services, ACM International Conference Proceeding Series, Vol. 60, pp. 429-437 (2004).
BRINGING INFORMATION RETRIEVAL BACK TO DATABASE MANAGEMENT SYSTEMS
Khaled Nagi Dept. of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt. khaled.nagi@eng.alex.edu.eg
ABSTRACT Information retrieval emerged as an independent research area, separate from traditional database management systems, more than a decade ago. This was driven by the increasing functional requirements that modern full text search engines have to meet. Current database management systems (DBMS) are not capable of supporting such flexibility. However, with the increase of data to be indexed and retrieved and the increasingly heavy workloads, modern search engines suffer from scalability, reliability, distribution and performance problems. DBMS have a long tradition in coping with these challenges. Instead of reinventing the wheel, we propose using current DBMS as a backend to existing full text search engines. This way, we bring both worlds back together. We present a new and simple way of integration and compare the performance of our system to the current implementations based on storing the full text index directly on the file system. Keywords: Full text search engines, DBMS, Lucene, performance evaluation, scalability.
1 INTRODUCTION
Most commercial database management systems offer basic phonetic full text search functionality. For example, Oracle has a module called Oracle Text [1]. Yet, seeking to add more functionality and intelligence to their search capabilities, many commercial applications use third party specialized full text search engines instead. There are several commercial products on the market, but Lucene [2] is certainly the most popular open-source product at the moment. It provides searching capabilities for the Eclipse IDE [3], the Encyclopedia Britannica CD-ROM/DVD, FedEx, New Scientist magazine, Epiphany, MIT’s OpenCourseWare [4] and so on.
All search engines build an index of the data to be retrieved in user queries. The index is always stored in the file system on disk and can be loaded into memory at startup (optional in Lucene) for faster querying. However, this is not feasible for large indices due to memory size limitations. So, the standard storage usually remains the file system of the disk. However, with the increase of data to be indexed and retrieved under heavy workloads of user queries, search engines suffer from scalability problems both in providing adequate response times for their users and in keeping good overall system throughput. To cope with these problems, search engines should provide more intelligent techniques for accessing the disk. Reliability also becomes a problem. The possibility of corrupting the whole index during a system crash is much higher than that of losing the data in a database after a similar crash. Restoring a corrupted index might also take several hours, thus complicating the situation even further. The search engine must manage its read and write locks by itself as well. Distributing the index among several sites and providing efficient mirroring techniques is becoming an important issue for large scale search engine projects such as Nutch [5].
Database management systems have a long tradition in coping with these challenges. Instead of reinventing the wheel, we try to bring both worlds together again in a new way. We propose using current DBMS as a backend to existing full text search engines, as opposed to either re-implementing full text search engine functionality in DBMS or re-implementing core DBMS features in search engines. As a case study, we use the open-source Lucene and MySQL without loss of generality. We use real world data extracted from an electronic marketplace and simulate real world workload traces in order to demonstrate that the overall system throughput and query response time do not suffer from the introduction of DBMS as a backend with their inherent overhead. In some cases, some performance indices even improve, which paves the way to using the whole spectrum of basic infrastructural facilities offered by DBMS such as recovery, automatic replication, distribution, and segmentation. The rest of the paper is organized as follows.
Section 2 provides background on full text search engines. Our proposed system integration is presented in Section 3. Section 4 contains the results of our performance evaluation and Section 5 concludes the paper.

2 BACKGROUND ON FULL TEXT SEARCH ENGINES
2.1 Typical Features
Full text search engines do not care about the source of the data or its format as long as it is converted to plain text. Text is logically grouped into a set of documents. The user application constructs the user query, which is submitted to the search engine. The result of the query execution is a list of document IDs which satisfy the predicate described in the query. The results are usually sorted according to an internal scoring mechanism using fuzzy query processing techniques [6]. The score is an indication of the relevance of the document, which can be affected by many factors. The phonetic difference between the search term and the hit is one of the most important factors. Some fields are boosted so that hits within these fields are more relevant to the search result than hits in other fields. Also, the distance between query terms found in a document can play a role in determining its relevance. E.g., when searching for “John Smith”, a document containing “John Smith” has a higher score than a document containing “John” at its beginning and “Smith” at its end. Furthermore, search terms can be easily augmented by searches with synonyms. E.g., searching for “car” retrieves documents with the term “vehicle” or “automobile” as well. This opens the door for ontological searches and other semantically richer similarity searches.

2.2 Architecture
As illustrated in Fig. 1, at the heart of a search engine resides an index. An index is a highly efficient cross-reference lookup data structure. In most search engines, a variation of the well-known inverted index structure is used [7]. An inverted index is an inside-out arrangement of documents such that terms take center stage. Each term refers to a set of documents. Usually, a B+-tree is used to speed up traversing the index structure.
The indexing process begins with collecting the available set of documents by the data gatherer. The parser converts them to a stream of plain text. For each document format, a parser has to be implemented. In the analysis phase, the stream of data is tokenized according to predefined delimiters and a number of operations are performed on the tokens. For example, the tokens could be lowercased before indexing. It is also desirable to remove all stop words. Additionally, it is common to reduce them to their roots to enable phonetic and grammatical similarity searches.
The search process begins with parsing the user query. The tokens and the Boolean operators are extracted. The tokens have to be analyzed by the same analyzer used for indexing. Then, the index is traversed for possible matches in order to return an ordered collection of hits. The fuzzy query processor is responsible for defining the match criteria during the traversal and the score of the hit.
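As a rough illustration of the architecture just described, the following sketch builds a tiny in-memory inverted index and answers a Boolean AND query. It is not Lucene code and does not use Lucene's API; the stop word list and all names are invented, and stemming, scoring and the B+-tree are deliberately left out.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}  # illustrative list only

def analyze(text):
    """Analysis phase: tokenize on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

class InvertedIndex:
    """Terms take center stage: each term maps to the set of documents containing it."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document IDs

    def add(self, doc_id, text):
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search_all(self, query):
        """Boolean AND: IDs of documents that contain every query term."""
        terms = analyze(query)  # same analyzer as at indexing time
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

The query is passed through the same `analyze` function as the documents, which mirrors the requirement that query tokens be processed by the analyzer used for indexing; without it, a query for "Quick" would miss documents indexed under "quick".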
Figure 1: Architecture of a full text search engine

2.3 Typical Operations
2.3.1 Complete index creation
This operation usually occurs once. The whole set of documents is parsed and analyzed in order to create the index from scratch. This operation can take several hours to complete.
2.3.2 Full text search
This operation includes processing the query and returning page hits as a list of document IDs sorted according to their relevance.
2.3.3 Index update
This operation is also called incremental indexing. It is not supported by all search engines. Typically, a worker thread of the application monitors the actual inventory of documents. In case of document insertion, update, or deletion, the index is changed on the spot and its content is immediately made searchable. Lucene supports this operation.

3 PROPOSED SYSTEM INTEGRATION
3.1 Architecture
Lucene divides its index into several segments. The data in each segment is spread across several files. Each index file carries a certain type of information. The exact number of files that constitute a Lucene index and the exact number of segments vary from one index to another and depend on the
number of fields the index contains. The internal structure of the index file is public and is platform independent [8]. This ensures its portability. We take the index file as our basic building block and store it in the MySQL database as illustrated in Fig. 2. The set of files, i.e. the logical directory, is mapped to one database relation. Due to the huge variation in file sizes, we divide each file into multiple chunks of fixed length. Each chunk is stored in a separate tuple in the relation. This leads to better performance than storing the whole file as CLOB in the database. The primary key of the tuple is the filename and the chunk id. Other normal file attributes such as its size and timestamp of last change are stored in the tuple next to the content. We provide standard random file access operations based on the above mentioned mapping. Using this simple mapping, we do not violate the public index file format and present a simple yet elegant way of choosing between the different file storage media (file system, RAM files, or database).
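The file-to-relation mapping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses Python's built-in sqlite3 module in place of MySQL, and the table name, column names, chunk size, file name, and helper functions are all assumptions.

```python
import sqlite3

CHUNK_SIZE = 4  # bytes per chunk; deliberately tiny for illustration (assumed value)

def open_store():
    """Create an in-memory relation holding one tuple per file chunk."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE index_files (
               filename TEXT NOT NULL,
               chunk_id INTEGER NOT NULL,
               file_size INTEGER NOT NULL,
               last_modified REAL NOT NULL,
               content BLOB NOT NULL,
               PRIMARY KEY (filename, chunk_id))"""
    )
    return conn

def write_file(conn, filename, data, timestamp):
    """Split data into fixed-length chunks and store each in its own tuple."""
    conn.execute("DELETE FROM index_files WHERE filename = ?", (filename,))
    for chunk_id, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        conn.execute(
            "INSERT INTO index_files VALUES (?, ?, ?, ?, ?)",
            (filename, chunk_id, len(data), timestamp, data[start:start + CHUNK_SIZE]),
        )
    conn.commit()

def read_at(conn, filename, position, length):
    """Random access: read up to `length` bytes starting at byte offset `position`."""
    first = position // CHUNK_SIZE
    last = (position + length - 1) // CHUNK_SIZE
    rows = conn.execute(
        "SELECT content FROM index_files "
        "WHERE filename = ? AND chunk_id BETWEEN ? AND ? ORDER BY chunk_id",
        (filename, first, last),
    ).fetchall()
    joined = b"".join(bytes(row[0]) for row in rows)
    offset = position - first * CHUNK_SIZE
    return joined[offset:offset + length]

conn = open_store()
write_file(conn, "segment1.dat", b"0123456789", timestamp=0.0)
```

A read that spans a chunk boundary touches only the tuples that overlap the requested byte range, which is the point of chunking instead of storing the whole file in one large object.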
Figure 2: Integrating Lucene index in MySQL database

3.2 System Design

Fig. 3 illustrates the UML class diagram of the store package of Lucene. We only include the relevant classes. The newly introduced classes are grayed. Directory is an abstract class that acts as a container for the index files. Lucene comes with two implementations: a file system directory (FSDirectory) and an in-RAM index (RAMDirectory). Directory provides the declaration of all basic file operations such as listing all file names, checking the existence of a file, returning its length, changing its timestamp, etc. It is also responsible for opening files by returning an InputStream object and for creating a new file by returning a reference to a new instance of the OutputStream class. We provide a database-specific implementation, DBDirectory, which maps these operations to SQL operations on the database.

Both InputStream and OutputStream are abstract classes that mimic the functionality of their java.io counterparts. Basically, they implement the transformation of the file contents into a stream of basic data types, such as integer, long, byte, etc., according to the standardized internal file format [8]. Actual reading from and writing to the file buffer remain abstract methods in order to decouple the classes from their physical storage mechanism. Similar to FSInputStream and RAMInputStream, we provide the database-dependent implementation of the readInternal and seekInternal methods. Moreover, DBOutputStream provides the database-specific flushing of the file buffer after the different write operations. Other buffer management operations are also implemented.

Both DBInputStream and DBOutputStream use the central class DBFile. A DBFile object provides access to the correct file chunk stored in a separate tuple in the database. It also provides a caching mechanism that keeps recently used file chunks in memory. The size of the cache is dynamically adjusted to make use of the available free memory of the system. The class is responsible for guaranteeing the coherency of the cache.
Figure 3: UML class diagram of the store package after modification
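The chunk caching role attributed to DBFile can be illustrated with a small least-recently-used cache. This is a sketch under stated assumptions, not the authors' code: the paper does not name the replacement policy, so LRU is assumed here, and the class name ChunkCache, the fixed capacity, and the loader callback (standing in for a database fetch) are hypothetical.

```python
from collections import OrderedDict

class ChunkCache:
    """Keep the most recently used file chunks in memory (LRU policy assumed)."""

    def __init__(self, loader, capacity):
        self._loader = loader        # callable (filename, chunk_id) -> bytes
        self._capacity = capacity    # fixed here; the paper adjusts it dynamically
        self._chunks = OrderedDict()
        self.misses = 0

    def get(self, filename, chunk_id):
        key = (filename, chunk_id)
        if key in self._chunks:
            self._chunks.move_to_end(key)        # mark as most recently used
            return self._chunks[key]
        self.misses += 1
        data = self._loader(filename, chunk_id)  # e.g. a SELECT on the chunk tuple
        self._chunks[key] = data
        if len(self._chunks) > self._capacity:
            self._chunks.popitem(last=False)     # evict the least recently used
        return data

    def invalidate(self, filename, chunk_id):
        """Drop a chunk after it is rewritten, so readers never see stale data."""
        self._chunks.pop((filename, chunk_id), None)

# Stand-in for the database: chunk content is derived from its key.
cache = ChunkCache(lambda name, cid: f"{name}:{cid}".encode(), capacity=2)
```

Calling invalidate from the write path is one simple way to keep the cache coherent with the stored tuples; other coherency schemes are equally possible.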
4 PERFORMANCE EVALUATION
In order to evaluate the performance of our proposed system, we build a full text search engine on the data of a neutralized version of a real electronic marketplace. The index is built over the textual description of more than one million products. Each product contains approximately 25 attributes, varying from a few characters to more than 1300 characters each. We develop a performance evaluation toolkit around the search engine, as illustrated in Fig. 4. The workload generator composes queries of single terms, which are randomly extracted from the product description, and submits them in parallel to the application. The product update simulator mimics product changes and submits the new content to the application in order to update the Lucene index. The application consists of the modified Lucene kernel supporting both file system and database storage options of the full text index. The application under test manages two pools of worker threads. The first pool consists of searcher threads that process the search queries coming from the workload generator. The second pool consists of index updater threads that process the updated content coming from the product update simulator. The performance of the system is monitored using the performance monitor unit.
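The interplay of workload generator, searcher pool, and performance monitor can be sketched as follows. This is a hypothetical harness, not the authors' toolkit: the search function is a stand-in that does no real work, and the names run_workload, timed_search, and fake_search are assumptions.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fake_search(term):
    """Stand-in for the engine under test; returns a list of document IDs."""
    return [len(term)]

def timed_search(search, term):
    """Run one query and return its response time in seconds."""
    start = time.perf_counter()
    search(term)
    return time.perf_counter() - start

def run_workload(search, vocabulary, num_queries, num_threads, seed=0):
    """Submit single-term queries in parallel; report throughput and mean latency."""
    rng = random.Random(seed)
    terms = [rng.choice(vocabulary) for _ in range(num_queries)]
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = list(pool.map(lambda t: timed_search(search, t), terms))
    elapsed = time.perf_counter() - started
    return {
        "queries": len(latencies),
        "searches_per_second": len(latencies) / elapsed if elapsed > 0 else float("inf"),
        "mean_response_time": sum(latencies) / len(latencies),
    }

stats = run_workload(fake_search, ["car", "vehicle", "automobile"],
                     num_queries=50, num_threads=5)
```

Varying num_threads corresponds to varying the size of the searcher pool; a second pool for index updates could be driven in the same way.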
Figure 4: Components of the performance evaluation toolkit.

4.1 Input Parameters and Performance Metrics

We choose the maximum number of fetched hits to be 20 documents. This is a reasonable assumption, taking into consideration that no more than 20 hits are usually displayed on a web page. The number of search threads is varied from 1 to 25, enabling the concurrent processing of up to 25 search queries. Due to locking restrictions inherent in Lucene, we restrict our experiments to at most one index update thread. We also introduce a think time varying from 20 to 100 milliseconds between successive index update requests to simulate the format-specific parsing of the updated products. In all our experiments, we monitor the overall system throughput in terms of conducted:
• searches per second, and
• index updates per second.
We also monitor the response time of:
• the searches, and
• the index updates
from the moment of submitting the request till receiving the result.

4.2 System Configuration

In our experiments we use a dual core Intel Pentium 3.4 GHz processor, 2 GB of 667 MHz RAM, and one hard disk with 7200 RPM, an access time of 13.2 ms, a seek time of 8.9 ms, and a latency of 4 ms. The operating system is Windows XP. We use JDK 1.4.2, MySQL version 5.0, JDBC mysqlconnector version 3.1.12, and Lucene version 1.4.3.

4.3 Experiment Results

The performance evaluation considers the main operations: complete index creation, simultaneous full text search over single terms under various workloads, and, in parallel, index updates as product data change. The experiments are conducted for the file system index and the database index. We drop the RAM directory from our consideration, since the index under investigation is too large to fit into the 1.5 GB heap size provided by Java under Windows.

4.3.1 Complete index creation
Building the complete index from scratch on the file system takes about 28 minutes. We find that the best way to create the complete index for the database is to first create a working copy on the file system and then to migrate it to the database using a small utility that we developed to migrate the index from one storage to the other. This migration takes 3 minutes 19 seconds to complete. Thus, the overhead in this one-time operation is less than 12%.

4.3.2 Full text search
In this set of experiments, we vary the number of search threads from 1 to 25 concurrent worker threads and compare the system throughput, illustrated in Fig. 5, and the query response time, illustrated in Fig. 6, for both index storage techniques. We find that the performance indices are enhanced by a factor > 2. The search throughput jumps from around 1,250,000 searches per hour to almost 3,000,000 searches per hour in our proposed system. The query response time is lowered by 40%, decreasing from 0.8 second to 0.6 second on average. This is a very important result because it means that we increase the performance and gain the robustness and scalability advantages of database management systems on top in our proposed system.
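The overhead and throughput figures reported in Sections 4.3.1 and 4.3.2 follow directly from the measured values. A quick check, using only the numbers quoted there (the variable names are illustrative):

```python
# Index creation (Section 4.3.1): file-system build time and migration time.
build_seconds = 28 * 60              # about 28 minutes
migration_seconds = 3 * 60 + 19      # 3 minutes 19 seconds
overhead = migration_seconds / build_seconds

# Search throughput (Section 4.3.2), in searches per hour.
file_system_throughput = 1_250_000
database_throughput = 3_000_000
throughput_factor = database_throughput / file_system_throughput

print(f"migration overhead: {overhead:.1%}")          # 11.8%
print(f"throughput factor:  {throughput_factor:.1f}")  # 2.4
```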
Figure 5: Search throughput in an update free environment

Figure 6: Search response time in an update free environment

4.3.3 Index update
In this set of experiments, we enable the incremental indexing option and repeat the above-mentioned experiments of Section 4.3.2 for different settings of think time between successive updates. In order to highlight the effect of incremental indexing, we choose very high index update rates by varying the think time from 20 to 100 milliseconds. For readability purposes, we only plot the results of the experiments having a think time of 40 and 80 milliseconds. In real life, we do not expect this exaggerated index update frequency. Fig. 7 demonstrates that the throughput of the index update thread in our proposed system is slightly better than that of the file system based implementation. However, Fig. 8 shows that the response time of the index update operation in our system is worse than that of the original one. We attribute this to an inherent problem in Lucene. During index update, the whole index is exclusively locked by the index updater thread. This is too restrictive. In our implementation, we keep this exclusive lock, although the database management system also maintains its own locks at the level of tuples, which is less restrictive and would allow for more than one index update thread and certainly more concurrent searches. The extra overhead of holding both locks leads to the increase in the system response time. The good news is that the response time always remains under the absolute level of 25 seconds, which is acceptable for most applications taking into consideration the high update rate.

Figure 7: Index update throughput
Figure 8: Index update response time

The search performance of our proposed system becomes comparable to the original file system based implementation in an environment suffering from a high rate of index updates. Fig. 9 shows that the search throughput of the proposed system is slightly better than that of the file system based implementation, whereas Fig. 10 shows that our database index suffers from a slightly higher response time than the original system. Again, the effect of the exclusive lock over the whole index during index update is evident when comparing the performance indices of Fig. 5 and Fig. 6 to those of Fig. 9 and Fig. 10, respectively. The search throughput drops from 3,000,000 to around 1,100,000 searches per hour and the response time increases from 0.6 seconds to around 3 seconds.
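The difference between one exclusive lock over the whole index and finer-grained locks of the kind a DBMS keeps per tuple can be illustrated with a toy example. This is only a sketch of lock granularity using Python's threading module; it does not reproduce Lucene's or MySQL's locking, and the class names and keys are hypothetical.

```python
import threading

class GlobalLockIndex:
    """One exclusive lock guards every chunk (coarse granularity)."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_lock(self, key):
        return self._lock.acquire(blocking=False)

    def unlock(self, key):
        self._lock.release()

class PerKeyLockIndex:
    """One lock per key, e.g. per (filename, chunk_id) tuple (fine granularity)."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()   # protects the lock table itself

    def try_lock(self, key):
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        return lock.acquire(blocking=False)

    def unlock(self, key):
        self._locks[key].release()

def second_writer_admitted(index):
    """Lock one chunk, then check whether a different chunk can be locked too."""
    assert index.try_lock(("seg1", 0))
    admitted = index.try_lock(("seg2", 7))
    if admitted:
        index.unlock(("seg2", 7))
    index.unlock(("seg1", 0))
    return admitted
```

With the global lock, a second request for an unrelated chunk is refused while the first is held; with per-key locks it is admitted, which is why finer granularity permits more concurrent updates and searches.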
Figure 9: Search throughput in an environment with high update rate.

Figure 10: Search response time in an environment with high update rate.

5 CONCLUSION AND FUTURE WORK

In this paper, we attempt to bring information retrieval back to database management systems. We propose using a commercial DBMS as a backend to existing full text search engines. In this way, today’s search engines directly gain the robustness, scalability, distribution, and replication features provided by a DBMS. In our case study, we provide a simple system integration of Lucene and MySQL without loss of generality. We build a performance evaluation toolkit and conduct several experiments on real data of an electronic marketplace. The results show that we reach system throughput and response times for typical full text search engine operations that are comparable to those of the current implementation, which stores the index directly in the file system on the disk. In several cases, we even reach much better results, which means that we gain the robustness and scalability of a DBMS on top. Yet, this is only the beginning. We plan on mapping the whole internal index structure into a database logical schema instead of just taking the file chunk as the smallest building block. This will solve the restrictive locking problem inherent in Lucene and will boost overall performance. We also plan on extending our performance evaluation toolkit to work on several sites of a distributed database.

REFERENCES
[1] Oracle Text. An Oracle Technical White Paper, http://www.oracle.com/technology/products/text/pdf/10gR2text_twp_f.pdf (2005).
[2] Apache Lucene, http://lucene.apache.org/java/docs/index.html.
[3] B. Hermann, C. Müller, T. Schäfer, and M. Mezini: Search Browser: An efficient index based search feature for the Eclipse IDE, Eclipse Technology eXchange workshop (eTX) at ECOOP (2006).
[4] MIT OpenCourseWare, MIT Reports to the President (2003–2004).
[5] Nutch home page, http://lucene.apache.org/nutch/
[6] D. Cutting, J. Pedersen: Space Optimizations for Total Ranking, Proceedings of RIAO (1997).
[7] D. Cutting, J. Pedersen: Optimizations for Dynamic Inverted Index Maintenance, Proceedings of SIGIR (1990).
[8] Apache Lucene - Index File Formats, http://lucene.apache.org/java/docs/fileformats.html.
A SET-THEORETIC DATA MODEL FOR EVOLVING DATABASE ENVIRONMENTS
E. J. Yannakoudakis, P. K. Andrikopoulos Athens Univ. of Economics and Business, Department of Informatics Athens 10434, Greece {eyan, padrik}@aueb.gr
ABSTRACT
The paper presents an integrated set-theoretic data model that offers a framework for defining a unified schema for any database environment. We utilise the concepts “entity” in its classical meaning, “tag” as a set of properties (attributes) which can describe an entity, “subtag” as a set of simple atomic attributes which cannot be decomposed further, “domain” as a set of well-defined values that can be derived from pre-specified data types, and “language”, “vocabulary” and “message” as strings or coded values that represent human languages and corresponding messages. The model described can efficiently manage changes that occur at the logical level and supports operations and functions that offer solutions to well-known problems faced by database designers and programmers alike. Typical problems solved include the retention of multiple schema versions, the maintenance of authority files, the support of repeatable attributes, and the processing of multilingual databases (at both data and interface levels).

Keywords: data models, databases, schema evolution
1 INTRODUCTION
The evolution of database schemata is becoming a very serious problem, especially with the advent of large distributed databases. Continuous modifications of a database schema are necessary due to (a) the fact that the applications are continuously changing, (b) the perspective experience from using a system that induces changes to desired functionality and (c) the scale of many tasks that require incremental design. Sjoberg addressed the problem in practice by using a method for quantifying schema changes and proposed a tool that automates modification procedures [1]. Nowadays, the problems of schema evolution and schema versioning still form common issues of research in the context of database applications which are destined to have a long lifetime. According to a widely accepted terminology, schema evolution is accommodated when a database system facilitates the modification of the database schema without loss of existing data, whereas schema versioning is accommodated when a database system allows the accessing of all data, both retrospectively and prospectively, through user definable version interfaces [2]. Schema evolution implies that extant data will be converted from the old to the new
schema. Thus, all valid instances of the old database schema will become valid instances of the new database schema and new versions of the software can access old data [3]. Despite supporting backward compatibility schema evolution fails to form an effective technique because data conversion is feasible when the scale is small and when the changes are simple. As the scale of the application increases the problem becomes more complicated and confusing. In schema versioning the old schema and its corresponding data are preserved, but a new version of the schema is created, which incorporates all the desired changes [4]. The different versions of the schema can be identified and selected by a suitable labelling system, such as symbolic naming, or time stamping (transaction time of schema changes). Despite attaining continued support of legacy applications, the accommodation of schema versioning presents various open problems related to the update of data through historical schemata [3]. Organizing all schema versions that arise from previous evolutions in order to cope with how applications can access the data in a versioned schema environment, besides the considerable time and effort required for implementation [5], gives rise to high computational costs. In relational database systems the main problem
addressed is that of change propagation. Change propagation is commonly accomplished by populating the new schema version with the results of queries involving extant data connected to previous schema versions [6]. This procedure encounters significant performance problems when transforming existing data to meet the evolved specifications and requires much memory space as the database evolves. In the object-oriented field, where schema versioning is usually regarded as the versioning of the classes that constitute a schema [7] [8], the problems are even more serious. Although there are some techniques to handle the derivation of classes and the navigation through versions [9], there are still open issues concerning polymorphism and inheritance. For example, the ability to distinguish between two kinds of inheritance (inheritance between classes and inheritance between versions) is still not fully explored [10].

Experience shows that supporting schema evolution by creating new schema versions and using data transformation mechanisms is a hard task. A more sophisticated solution to this problem could be the implementation of a unified schema with the ability to handle the frequent changes that occur in evolving database environments without the need for corresponding modifications at either the logical or the physical level. In this direction, Yannakoudakis proposed a Framework for a DataBase (FDB) for defining a unified schema with the ability for self-propagation of the stored data [11][12]. In FDB, the main emphasis of the research has been the establishment of a structure offering dynamically evolving database environments which encapsulate frequently encountered changes by utilizing metadata. A similar method has been used by Grandi [13] for mapping attributes and tables by adopting a multi-pool schema versioning approach.
In this paper we propose an integrated set-theoretic data model that offers a framework for defining a structure (unified schema) that completely eliminates the need for reorganization at the logical level. The basis for the creation of the unified schema is the definition and manipulation of the metadata that compose the whole structure. The scope of this paper is to describe explicitly the features of the unified schema and to define the metadata and the operations on the metadata. The remainder of this paper is organized in four sections. Section 2 describes the structure of the proposed model by defining the unified schema. Section 3 presents some of the basic operations for manipulating the metadata that compose the unified schema, Section 4 illustrates the model with an example, and Section 5 concludes the paper.

2 DEFINITIONS
The definition of the basic elements of our model is based on the mathematical theory of (unordered) sets. A set is a collection of objects; order is not taken into account [14]. The elements making up a set are assumed to be distinct. Given a description of a set X and an element x, we can determine whether or not x belongs to X. If x is in the set X, we write x ∈ X, and if x is not in X, we write x ∉ X. Every set is a subset of the set U, called a universal set or universe, which must be explicitly given or inferred from the context. In the definitions that follow we attempt to define every universal set, so that it can be used as a frame of reference for every set of our model.

2.1 Languages, Vocabulary Items and Messages

Let us define U_L as the universal set of all spoken languages supported by the Unicode standard and U_V as the universal set of all words and phrases supported by any Unicode writing system. Certainly there is a meaning behind every element of the set U_V. That helps us define U_M as the universal set of the semantics of the set U_V. This consideration is the first step towards the implementation of multilingual operations, while adopting the following definitions:
• Set of Languages L ⊆ U_L is the unordered set of all registered languages. That is, all the languages which are currently supported.
• Set of Vocabulary Items V ⊆ U_V is the unordered set of registered words and phrases, which can be used for naming data elements in a specific language l ∈ L.
• Set of Messages M ⊆ U_M is the unordered set of the semantics of the elements of V and can be used to assign conceptual meanings to data elements. That is, to name data elements in a language of the universe of discourse.
According to the above definitions a language l is currently supported if l ∈ L or is not supported if l ∈ {x ∈ U_L | ¬(x ∈ L)}. Similarly, a message m ∈ M is a registered message that can be used and v ∈ V is a registered vocabulary item that can also be used. Matching vocabulary items with a message is performed by the Smsg function. Functions are used extensively in discrete mathematics to assign to each member of a set X exactly one member of a set Y. The binary function Smsg from D_Smsg to V is a relation in which each element of D_Smsg ⊆ M × L is associated with a unique element of V. The Smsg function can be used to translate any message into any supported language.

2.2 Domains

We define U_D as the universal set of every data domain that can be defined for storing information. Based on the above:
• Set of Domains D ⊆ U_D is the unordered set of registered domains, which determine all possible values of every data element in our model.
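The sets L, V, M and the function Smsg of Section 2.1 map naturally onto ordinary sets and a dictionary keyed by (message, language) pairs. The following sketch is one possible encoding, not part of the model's definition; the concrete language codes, message identifiers, and the helper names is_well_formed and translate are illustrative assumptions.

```python
# Registered languages L, vocabulary items V, and messages M (sample values assumed).
L = {"en", "el"}
V = {"customer", "πελάτης", "order"}
M = {"MSG_CUSTOMER", "MSG_ORDER"}

# Smsg: a function from D_Smsg ⊆ M × L to V, stored as a dictionary.
# The dictionary keys form D_Smsg; each key maps to exactly one element of V.
Smsg = {
    ("MSG_CUSTOMER", "en"): "customer",
    ("MSG_CUSTOMER", "el"): "πελάτης",
    ("MSG_ORDER", "en"): "order",
}

def is_well_formed(smsg, messages, languages, vocabulary):
    """Check that the domain of Smsg lies in M × L and its range in V."""
    return all(
        m in messages and l in languages and v in vocabulary
        for (m, l), v in smsg.items()
    )

def translate(message, language):
    """Return the vocabulary item for a message in a language, or None if
    the pair (message, language) is outside D_Smsg."""
    return Smsg.get((message, language))
```

Because a dictionary holds at most one value per key, the requirement that each element of D_Smsg is associated with a unique element of V is satisfied by construction.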
If a domain d is defined, then d ∈ D holds and d can be assigned to a data element. A domain d ∈ D specifies a primitive datatype value and a length value. These properties are defined by the functions Dtype and Dlen. The function Dtype from D_Dtype = D to Q = {int, char, real, bool, blob} is a relation whereby each element of D is associated with a unique element of Q. The function Dlen from D_Dlen = D to N is a relation whereby each element of D is associated with a natural number which determines the data length in bytes.

2.3 Entities

We define U_E as the universal set of distinct and autonomous objects, forming what we generally call “entity”. An entity need not have a material existence. In particular, abstractions and legal functions are usually regarded as entities. Based on this we define:
• Set of Entities E ⊆ U_E is the unordered set of registered entities that participate in the logical schema of our model.
For every entity e that has been created and is current in the logical schema the expression e ∈ E holds. Also, every entity in the logical schema has a single unique name. The function Enam from D_Enam = E to M is a relation whereby each element of E is associated with a unique element of M, naming in effect each entity. Note that Enam is a one-to-one function, so it has an inverse function Enam^{-1}.

2.4 Tags

We define U_T as the universal set of all particular properties (attributes) that can characterize any entity of the model. Based on this:
• Set of Tags T ⊆ U_T is the unordered set of properties that exist in our logical schema and characterize its entities.
Tags can be simple or complex. For every tag t that has been created and currently exists in the logical schema the expression t ∈ T holds. Every tag resides in (at most) one entity. Entity-tag associations are performed by the Tent function. This function from D_Tent = T to E is a relation whereby each element of T is associated with a unique element of E. Every tag has a name assigned through the Tnam function.
The function Tnam from D_Tnam = T to M is a relation whereby each element of T is associated with a unique element of M. Note that the Tnam function is not one-to-one, because two different tags may have the same name. However, a tag’s name has to be different from any other tag’s name in the same entity. Every simple tag t ∈ T holds data of the same kind, which are singularly associated with every instance of the entity e = Tent(t) where tag t resides. Although data storage and manipulation are not discussed in this model description, which emphasizes introducing the logical structure by defining metadata, it is important to note that tags can only receive values from a specific domain. The function Tdom from D_Tdom ⊆ T to D is a relation whereby each element of D_Tdom is associated with a unique element of D. A complex tag has no domain, because its domain is specified by the domains of its subtags.

Every simple tag has an occurrence status. It can either be optional or mandatory. A mandatory tag must always contain a value that is not null. This feature of a tag is defined by the Tocc function. The function Tocc from D_Tocc ⊆ T to O = {0, 1 | 0: Optional, 1: Mandatory} is a relation whereby each element of D_Tocc is associated with a unique element of O. A complex tag has no occurrence status, as it has no domain.

Every tag has a repetition status which can be either single-valued or multi-valued. A single-valued tag must always contain a single value for a given instance of an entity e. However, multi-valued tags can receive multiple values for the same instance of an entity. The repetition status of a tag is defined by the Trep function. The function Trep from D_Trep = T to R = {0, 1 | 0: Single-valued, 1: Multi-valued} is a relation whereby each element of T is associated with a unique element of R.

Tags from different entities may reference one another, provided they receive values from the same domain. Every tag has an authority status that determines its participation in a relation. The authority status of a tag is defined by the Taust function. The function Taust from D_Taust = T to A = {0, 1, 2 | 0: No-participation, 1: Authority-Tag, 2: Selected-Tag} is a relation whereby each element of T is associated with a unique element of A. References between tags are performed by the Auth function from D_Auth ⊂ T to T̄ = {t | Taust(t) = 1, t ∈ T}. For example, let us assume that tag t_A1 of a given entity A references tag t_B1 of entity B and tags t_B2, t_B3 from B are selected for presentation.
That is, Auth(t_A1) = t_B1, where Taust(t_B1) = 1 and Taust(t_B2) = Taust(t_B3) = 2.

2.5 Subtags

We define U_S as the universal set of all possible simple (atomic) attributes that can constitute any complex tag.
• Set of Subtags S ⊆ U_S is the unordered set of registered attributes that exist in our logical schema, which constitute existing complex tags.
For every subtag s that has been created and currently resides in the logical schema the expression s ∈ S holds. Every subtag is part of one and only one tag. This association is performed by the Stag function. The function Stag from D_Stag = S to T is a relation whereby each element of S is associated with a unique element of T. Every subtag necessarily has a name. The naming of subtags is performed by the Snam function.
The function Snam from D_Snam = S to M is a relation whereby each element of S is associated with a unique element of M. The Snam function is not one-to-one, because two different subtags may have the same name. However, a subtag’s name has to be different from any other subtag’s name in the same tag. Subtags can only receive values from a specific domain, similarly to tags. The function Sdom from D_Sdom = S to D is a relation whereby each element of S is associated with a unique element of D. Every subtag also has an occurrence status. It can either be optional or mandatory, and is defined by the Socc function. This function from D_Socc = S to O = {0, 1 | 0: Optional, 1: Mandatory} is a relation whereby each element of S is associated with a unique element of O. Every subtag also has a repetition status. It can either be single-valued or multi-valued, and is defined by the Srep function. This function from D_Srep = S to R = {0, 1 | 0: Single-valued, 1: Multi-valued} is a relation whereby each element of S is associated with a unique element of R.

3 OPERATIONS
The creation and management of the structure defined in the previous section is based on operations that perform metadata manipulation. These operations affect the basic sets and the functions, as well as domains and values. Generally, the types of operations are add, delete, rename, update and select. The add operation adds a new set element. The delete operation deletes an element from a set. The rename operation changes the alphanumeric code or the abbreviation used for an element notation. The update operation is used in more complex sets (e.g. Tags) for updating any of their features. Finally, the select operation enables us to select a set element. In what follows we present the algorithms of some typical operations.

3.1 Languages, Vocabulary Items and Messages

3.1.1 Add language
This function inserts a new element l in Languages set L. If addition succeeds, the function returns the modified L; otherwise, it returns the empty set.

AddLanguage(l)
    x := ∅
    if {l} ∩ L = ∅ then
        L := L ∪ {l}
        x := L
    else
        display(“language l exists”)
    end if
    return(x)

3.1.2 Select vocabulary item
This function enables the user to select an element v from Vocabulary Items set V. If a selection is made, the function returns the element v; otherwise, it returns the empty set.

SelectVocItem()
    x := ∅
    V̄ := V
    do ∀v ∈ V̄
        display(v)
        read(choice)
        if choice = “select” then
            x := {v}
            V̄ := ∅
        end if
    end do
    return(x)

3.1.3 Update message
This function updates message properties. It enables the manipulation of the vocabulary items of a message in every language. If an update is effected, the function returns the updated message m; otherwise, it returns the empty set.

UpdateMessage(m)
    x := ∅
    do ∀l ∈ L
        read(choice)
        case choice of
            “edit”:
                if D_Smsg ∩ {(m, l)} = ∅ then
                    D_Smsg := {(m, l)} ∪ D_Smsg
                end if
                v := SelectVocItem()
                if v = ∅ then
                    z := ∅
                    do while z = ∅
                        read(v)
                        z := AddVocItem(v)
                    end do
                end if
                Smsg(m, l) := v
                x := m
            “delete”:
                D_Smsg := ¬{(m, l)} ∩ D_Smsg
                x := m
        end case
    end do
    return(x)

3.2 Domains

3.2.1 Select domain
This function enables the user to select an element d from Domains set D. If a selection is made, the function returns the element d; otherwise, it returns the empty set.

SelectDomain()
    x := ∅
    D̄ := D
    do ∀d ∈ D̄
        display(d, Dtype(d), Dlen(d))
        read(choice)
        if choice = “select” then
            x := {d}
            D̄ := ∅
        end if
    end do
    return(x)

3.2.2 Delete domain
This function deletes an element d from Domains set D. If deletion succeeds, the function returns the modified D; otherwise, it returns the empty set. Deletion fails if d is assigned to a data element.

DelDomain(d)
    x := ∅
    if {d} ∩ D = {d} then
        if {t | Tdom(t) = d} ∪ {s | Sdom(s) = d} = ∅ then
            D := ¬{d} ∩ D
            D_Dtype := ¬{d} ∩ D_Dtype
            D_Dlen := ¬{d} ∩ D_Dlen
            x := D
        else
            display(“domain d can not be deleted”)
        end if
    else
        display(“domain d not found”)
    end if
    return(x)

3.3 Entities

3.3.1 Add entity
This function inserts a new element e in Entities set E and assigns a unique message m to it. If addition succeeds, the function returns the modified E; otherwise, it returns the empty set.

AddEntity(e)
    x := ∅
    if {e} ∩ E = ∅ then
        E := {e} ∪ E
        m := SelectMessage()
        if m = ∅ or m ∈ D_{Enam^{-1}} then
            z := ∅
            do while z = ∅
                read(m)
                z := AddMessage(m)
            end do
        end if
        D_Enam := E
        Enam(e) := m
        x := E
    else
        display(“entity e exists”)
    end if
    return(x)

3.3.2 Delete entity
This function deletes an element e from Entities set E. If deletion succeeds, the function returns the modified E; otherwise, it returns the empty set. Deletion fails if there are tags of e that participate in authority links.

DelEntity(e)
    x := ∅
    if {e} ∩ E = {e} then
        T̄ := {y | Tent(y) = e}
        if (D_Auth ∩ T̄) ∪ {t | Taust(t) ≠ 0, t ∈ T̄} = ∅ then
            do ∀t ∈ T̄
                DelTag(t)
            end do
            E := ¬{e} ∩ E
            D_Enam := E
            x := E
        else
            display(“entity e can not be deleted”)
        end if
    else
        display(“entity e not found”)
    end if
    return(x)

3.4 Tags and Subtags

3.4.1 Add tag
This function inserts a new tag t in Tags set T. Tag t is assigned to an entity e and is given a distinct name within the tags of e. A domain is also selected, but authority status, occurrence status and repetition status are simply initialized. If the insertion succeeds, the function returns the modified T; otherwise, it returns the empty set.

AddTag(t, e)
    x := ∅
    if {t} ∩ T = ∅ then
        if {e} ∩ E = {e} then
            m := SelectMessage()
            M̄ := {Tnam(y) | y ∈ T, Tent(y) = e}
            if m = ∅ or m ∈ M̄ then
                z := ∅
                do while z = ∅
                    read(m)
                    z := AddMessage(m)
                end do
            end if
            T := {t} ∪ T
            D_Tent := T
            Tent(t) := e
            D_Tnam := {t} ∪ D_Tnam
            Tnam(t) := m
            d := SelectDomain()
            D_Tdom := {t} ∪ D_Tdom
            Tdom(t) := d
            D_Tocc := {t} ∪ D_Tocc
            Tocc(t) := 0
            D_Trep := {t} ∪ D_Trep
            Trep(t) := 0
            D_Taust := {t} ∪ D_Taust
            Taust(t) := 0
            x := T
        else
            display(“entity e not found”)
        end if
    else
        display(“tag t exists”)
    end if
    return(x)

3.4.2 Delete subtag
This function deletes an element s from Subtags set S. If deletion succeeds, the function returns the modified S; otherwise, it returns the empty set. If s is the only subtag of the parent tag t, t is converted to a simple tag.

DelSubTag(s)
    x := ∅
    if {s} ∩ S ≠ ∅ then
        if {s} ∪ {z | Stag(z) = Stag(s)} = {s} then
            D_Tdom := {Stag(s)} ∪ D_Tdom
            d := SelectDomain()
            Tdom(Stag(s)) := d
            D_Tocc := {Stag(s)} ∪ D_Tocc
            Tocc(Stag(s)) := 0
        end if
        S := ¬{s} ∩ S
        D_Snam := ¬{s} ∩ D_Snam
        D_Stag := ¬{s} ∩ D_Stag
        D_Sdom := ¬{s} ∩ D_Sdom
        D_Socc := ¬{s} ∩ D_Socc
        D_Srep := ¬{s} ∩ D_Srep
        x := S
    else
        display(“subtag s not found”)
    end if
    return(x)

3.4.3 Update tag
This function updates a tag’s properties. It allows the modification of a tag’s message or a tag’s domain and the conversion of occurrence, repetition and authority properties to a different status. It also enables the manipulation of subtags and authority links. If an update is effected, the function returns the updated tag t; otherwise, it returns the empty set. The algorithm presented in this section is the part of the UpdateTag algorithm that illustrates the manipulation of authority links.

UpdateTag(t)
    x := ∅
    read(choice)
    case choice of
        “change authority status”:
            if {q | Auth(q) = t} = ∅ then
                read(aust)
                do while aust ∉ {0, 1, 2}
                    read(aust)
                end do
                Taust(t) := aust
            else
                display(“authority status can not be changed”)
            end if
            x := t
        “create authority link”:
            if {t} ∩ D_Auth = ∅ then
                t_x := SelectTag()
                do while t_x = ∅ or Taust(t_x) ≠ 1 or Tdom(t_x) ≠ Tdom(t)
                    t_x := SelectTag()
                end do
                D_Auth := {t} ∪ D_Auth
                Auth(t) := t_x
            else
                display(“an authority link already exists”)
            end if
            x := t
        “delete authority link”:
            if {t} ∩ D_Auth = {t} then
                D_Auth := ¬{t} ∩ D_Auth
            else
                display(“no authority link”)
            end if
            x := t
        . . .
    end case
    return(x)

4 A PRACTICAL EXAMPLE
To illustrate the functionality of the proposed data model, we present an example of a database that manages book and publisher data. The entity set E contains elements e_book and e_publ (Fig. 1a). In this example, a book is described by ISBN, title, author, publisher and year of publication. A publisher is described by a publisher ID, corporate name, address and phone. The address is a complex tag (Fig. 1b).
Figure 1: a. Function Tent b. Function Stag
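As a rough illustration of how the metadata of this book and publisher example might be held in memory, the following Python sketch stores each set as a Python set and each function (Tent, Tdom, Taust, Auth) as a dictionary. It is only a sketch under stated assumptions: the domain names (d_pbid, d_text), the tag t_titl, the class and method names, and the use of None in place of the empty-set return value are hypothetical and not part of the model as published. Subtags, messages and languages are omitted. The authority-link check follows the rule used in UpdateTag (Section 3.4.3).

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Sets D, E, T plus the functions Tent, Tdom, Taust and Auth as dicts."""
    D: set = field(default_factory=set)        # domains
    E: set = field(default_factory=set)        # entities
    T: set = field(default_factory=set)        # tags
    Tent: dict = field(default_factory=dict)   # tag -> entity
    Tdom: dict = field(default_factory=dict)   # tag -> domain
    Taust: dict = field(default_factory=dict)  # tag -> authority status
    Auth: dict = field(default_factory=dict)   # tag -> authority tag

    def add_tag(self, t, e, d, aust=0):
        """Simplified AddTag: no message handling, domain passed directly."""
        if t in self.T or e not in self.E or d not in self.D:
            return None  # stands in for the empty-set result
        self.T.add(t)
        self.Tent[t], self.Tdom[t], self.Taust[t] = e, d, aust
        return self.T

    def del_domain(self, d):
        """DelDomain: refuse to delete a domain that a tag still uses."""
        if d not in self.D or d in self.Tdom.values():
            return None
        self.D.remove(d)
        return self.D

    def create_authority_link(self, t, tx):
        """Link t to tx only if t has no link yet, tx has authority
        status 1 and both tags are defined on the same domain."""
        if t in self.Auth:
            return None
        if self.Taust.get(tx) != 1 or self.Tdom.get(tx) != self.Tdom.get(t):
            return None
        self.Auth[t] = tx
        return t


def build_example():
    """Book/publisher metadata; domain names are invented for the sketch."""
    m = Metadata(D={"d_pbid", "d_text"}, E={"e_book", "e_publ"})
    m.add_tag("t_pbid", "e_publ", "d_pbid", aust=1)  # publisher ID
    m.add_tag("t_publ", "e_book", "d_pbid")          # book's publisher
    m.add_tag("t_titl", "e_book", "d_text")          # book title
    return m


m = build_example()
print(m.create_authority_link("t_publ", "t_pbid"))  # link accepted
print(m.create_authority_link("t_titl", "t_pbid"))  # rejected: domains differ
print(m.del_domain("d_pbid"))                       # rejected: still in use
```

Keeping each function's domain implicit in the dictionary keys is a design shortcut; the paper tracks the domains (DTdom, DAuth and so on) as explicit sets.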
A distinct message is assigned to every entity, tag and subtag using the Enam, Tnam and Snam functions (Fig. 2). Different tags (subtags) can share the same message as long as they do not reside in the same entity (tag). Every message describes a conceptual object that can be translated into any of the supported languages by the Smsg function. In our example, the supported languages are English and French (Fig. 3).

Figure 2: Functions Enam, Tnam and Snam

Figure 3: Function Smsg

Simple tags and subtags hold data which is derived from specific domains. Matching is performed by the functions Tdom and Sdom (Fig. 4). The complex tag t_addr has no domain assigned to it because its subtags are based on specified domains.

Figure 4: Functions Tdom and Sdom

Every simple tag and every subtag has an occurrence status, which is mandatory (value 1) or optional (value 0). Assignment is performed by the functions Tocc and Socc. The complex tag t_addr has no occurrence status as it has no domain (Fig. 5a). On the other hand, repetition status is required for all tags and subtags. Value 1 is given to repeating tags. In our example, the repeating tags are t_phon and t_auth (Fig. 5b).

Figure 5: a. Functions Tocc, Socc b. Functions Trep, Srep

Authority links between tags rely on their domain values and their authority status. Linked tags must be defined on the same domain. Moreover, an authority link must point to a tag whose authority status has value 1. Value 2 is assigned to tags with contents which can be revealed. This assignment is performed by the function Taust (Fig. 6a). Function Auth is responsible for creating authority links. In our example, there is an authority link between t_publ and t_pbid (Fig. 6b).

Figure 6: a. Function Taust b. Function Auth

5 CONCLUSIONS AND FURTHER RESEARCH
The design of data models that evolve with time is still a major problem today. While user requirements change quite frequently, databases continue to show little flexibility in supporting these changes in their structures and data organization. Research on supporting schema evolution and schema versioning has proved inefficient in the long term. In this paper we proposed an integrated set-theoretic model for database systems that forms a framework for defining a structure (unified schema) that completely eliminates the need for reorganization at the logical level. We presented its structure and features, and demonstrated some of its operations with algorithms that can be applied at the logical level (metadata). The next step is to investigate efficient data storage, data retrieval and manipulation operations, as well as data integrity rules and performance issues.
6 REFERENCES
[1] D. Sjoberg: Quantifying Schema Evolution, Information and Software Technology, 35(1), pp. 35-44 (1993).
[2] J. F. Roddick: A Survey of Schema Versioning Issues for Database Systems, Information and Software Technology, 37(7), pp. 383-393 (1995).
[3] J. F. Roddick: A model for schema versioning in temporal database systems, Australian Computer Science Communications, 18(1), pp. 446-452 (1996).
[4] C. De Castro, F. Grandi, M. R. Scalas: Schema versioning for multitemporal relational databases, Information Systems, 22(5), pp. 249-290 (1997).
[5] F. Grandi, F. Mandreoli, M. R. Scalas: A Generalized Modeling Framework for Schema Versioning Support, In Proc. ADC 2000, pp. 33-40 (2000).
[6] E. Franconi, F. Grandi, F. Mandreoli: Schema Evolution and Versioning: a Logical and Computational Characterisation, In Proc. DEMM 2000, pp. 67-81 (2000).
[7] W. Kim, H. T. Chou: Versions of Schema for Object-Oriented Databases, In Proc. of 14th VLDB Conf., pp. 148-159 (1988).
[8] S. Monk, I. Sommerville: Schema Evolution in OODBs using Class Versioning, SIGMOD Rec., 22(3), pp. 16-22 (1993).
[9] S. E. Lautemann: Schema Versions in Object-Oriented Database Systems, In Proc. DASFAA, pp. 323-332 (1997).
[10] X. Li, Z. Tari: Class Versioning for the Schema Evolution, Australian Computer Science Communications, 20(2), pp. 117-128 (1998).
[11] E. J. Yannakoudakis, C. X. Tsionos, C. A. Kapetis: A new Framework for dynamically evolving Database Environments, Journal of Documentation, 55(2), pp. 144-158 (1999).
[12] E. J. Yannakoudakis, I. K. Diamantis: Further improvements of the Framework for Dynamically Evolving Database Environments, In Proc. HERCMA 2001, pp. 213-218 (2001).
[13] F. Grandi: A Relational Multi-Schema Data Model and Query Language for full Support of Schema Versioning, In Proc. SEBD 2002, pp. 323-336 (2002).
[14] R. Johnsonbaugh: Discrete mathematics, 6th edition, Pearson Education, London (2005).
Knowledge processing, codification and reuse model for communities of practice¹
Sas Mihindu, Terrence Fernando, Farzad Khosrowshahi
Research Institute for the Built and Human Environment
University of Salford, M5 4WT, United Kingdom
s.mihindu@pgr.salford.ac.uk
ABSTRACT
Knowledge Management (KM) has been a widely received area of research and development for many decades. However, combining the full Knowledge Management Cycle (knowledge capture, development, sharing and utilisation) within the same infrastructure has not been successfully researched, and a suitable all-in-one application for virtual communities or Communities of Practice (CoP) has not been established. This article proposes models for knowledge analysis, knowledge processing, codification and reuse, and then elaborates on research prototypes that are being used in virtual community settings (i-FAB Community*, INTUITION Network**) to gather the knowledge management requirements of such environments. Further research on the application of KM infrastructures and Virtual Machine Architectures within future workspaces is also discussed.

Keywords: Knowledge Management Infrastructure, KMLC, Knowledge Production Life Cycle, CoP, Virtual Machine Architecture
* i-FAB Global Community: i-FAB is an international collaborative activity initiated to impact on the related knowledge creation, dissemination, utilisation and industrial harmonisation of the foot and ankle biomechanics community. i-FAB has members from every community related to foot and ankle biomechanics, from academics, physicians, surgeons and health professionals to members of the footwear, insole, surgery and related industries. i-FAB has an open philosophy, and one of its key objectives is to connect people (through ‘Collaborative Workspace’ and ‘organised activities’) across traditional disciplinary boundaries.

** INTUITION EU Network: The INTUITION project is a Network of Excellence focused on virtual reality (VR) and virtual environments (VE) applications for future workspaces (FWS). The INTUITION Network consists of groups of communities (Communities of Practice) who are interested in and actively working within the VR and FWS related areas.
¹ An earlier version of this paper has been published in the proceedings of the 2007 International Conference on Information and Knowledge Engineering (IKE'07: CSREA Press, ISBN 1-60132-0507, 06/2007)

1 INTRODUCTION
Knowledge sharing has become one of the most valuable interfaces between Communities of Practice (CoP), and it is seen as a critical factor for the long-term existence of each community. How knowledge is managed within the community, and what is shared with other communities, are governed by the community structure and the virtual culture that the community has established over time. Recognition of the ‘people perspective of knowledge and its management’ has made the realisation of knowledge-related processes less complex than it would otherwise be [1][2][3], although some argue that, given its multifaceted nature, knowledge sharing will typically be a complex process even under the best of circumstances [4][5]. Knowledge-related processes such as knowledge creation, identification, storage, valuation, sharing, transfer, acquisition, community learning, distribution and dissemination are all highly interdependent Knowledge Management (KM) procedures that each community member or group acts on and reacts to on a daily basis. Of these processes, however, knowledge creation and transfer can be considered the primary factors, which encapsulate the rest. Appropriate facilitation of knowledge storage can amplify all the other processes while protecting individual or group ownership of the content and the depth of sharing. Knowledge exists at multiple levels within CoP, starting from the individual level, then groups within
communities, and finally extending into large communities that consist of many groups. While individuals are the most important of all [6], the interaction between individuals within different groups (including the peer group) facilitates knowledge creation. The above authors also argued that if individuals or groups decide not to share a piece of knowledge, that knowledge will have a limited impact on the effectiveness of their community. On the other hand, depending on the realised or expected value of that piece of knowledge, it may for good reasons not be suitable for sharing. This is explained further in the knowledge processing model detailed later. Three types of knowledge (know-how, know-what and dispositional) have been identified at the individual level, and all are important in value creation [7] within the community. Individuals who connect with the community by offering the knowledge that resides within them convert that knowledge into economic and competitive value for the community [4][8]. Individuals with diverse experiences and knowledge within a CoP have the ability to innovate and create competitive advantage if the community can support a framework in which they exchange, evaluate and integrate their knowledge, either by working towards a common theme or by being brought into a close working environment [9]. Such a framework is considered a community asset that will not only manage the knowledge of the CoP but also harness knowledge creation and sharing as a key capability of the community. This paper explores the knowledge creation, aggregation and reuse processes that support the facilitation of knowledge sharing within communities of practice or virtual communities. As a primary issue, it analyses the knowledge creation process within an individual, within virtual groups and within the community as a whole. This has led to the proposal of KM-related models and definitions that are applicable to CoP.
A system prototype is being created and used for further, ongoing analysis of community requirements within large scientific community projects that consist of industrial partners, scientists and academic members.

2 RELATED WORK
IT systems developers have created many tools for supporting KM operations that cover the four basic steps of the KM life cycle to some extent, but those tools do not completely facilitate the capture, development, sharing and utilisation of knowledge [10]. This has forced communities to adopt a set of KM tools to cover the full KM life cycle once the community has recognised this behaviour as highly valued. The KM architecture model defined by [11] describes the requirement of using multiple tools to support the full KM life cycle. Knowledge development and utilisation have not been successfully addressed within many KM tools.

2.1 ICT Infrastructures To Support The Complete Cycle of KM

System developers and researchers must work together to design more innovative mechanisms and interfaces for knowledge development and utilisation in future KM system infrastructures. While these two steps (out of the four basic steps) require much greater attention, the maturity of current and emerging technologies gives confidence that knowledge capture and sharing can be achieved with a greater degree of success. Providing a single KM infrastructure (KMI) with intelligent interfaces that facilitate the complete KM life cycle within a virtual community is an important goal. Several advances have been made by researchers and developers. A digital library infrastructure called the Collaborative Knowledge Evolution Support System (CKESS) targets the management of community knowledge; it addresses many issues related to developing a community knowledge repository of an evolving nature [12]. The theory-based Online Community Framework (OCF) elaborates on capturing community requirements as a whole so that systems designers can support these needs through ICT systems [13]. How well designers can support sociability and usability is the main focus of OCF. Applications such as Arthur Andersen’s world-class knowledge base of Global Best Practices (GBP) [14], now over a decade old, and the more recent Oracle JD Edwards EnterpriseOne confirm the growing importance of designing integrated information and KM infrastructures, and show the path this work has taken to date. Numerous other systems and design methodologies have surfaced with a focus on developing KMIs, as ‘organisational KM’ has become a major global economic driver for many businesses.

While these designs, methods and tools can be considered to have contributed successfully to this cause, many gaps still exist in the development and implementation of KMIs that could support the full KMLC. Some interesting aspects of this nature have been addressed by Firestone and McElroy, who describe the Distributed Organisational Knowledge Base (DOKB) and associated concepts in their textbook [15]. Web Ontology Language (OWL) based development platforms (e.g. Protégé-OWL), Semantic Web Rule Language (SWRL) models, etc. have also contributed to the design of rule-based KM systems that focus on business applications [16]. Technical and infrastructure challenges have made the development of a KMS from scratch a difficult process. However developers today can rely on
integrating enterprise-grade technologies such as JEE, EJB, Java Architecture for XML Binding (JAXB) and JBoss/Tomcat servers [17], operating under a Linux AS environment, as a superior starting point. The above researchers have demonstrated a prototype KMS developed on this platform. Their proposed automatic component generation for Knowledge Object Management (KOM) has made this KMS easy to re-engineer (with improved reusability) for different domains. The research prototype used in this work also employs a collaborative development environment based on a similar architecture. The innovative tools of the future that are integrated with KMIs should provide features that facilitate knowledge availability, together with analysis tools that ascertain the value of knowledge for those who seek specific content at the right time. While the systems and tools discussed above can be considered highly successful in their time, many gaps still exist in the development and implementation of KMIs that could support the full KM life cycle.

2.2 System Prototyping For Gathering Requirements Of Full KM Life Cycle

In order to address many of the above issues and establish a suitable KMI for scientific and professional communities, a prototype version of a Knowledgebase Infrastructure [18] has been developed and used within large scientific community project environments. This prototype is used as a requirements-gathering engine to capture, discuss and finalise the user needs of such communities and, where time permits, to implement these findings in future prototypes through an iterative prototyping process.

3 THEORETICAL BASE FOR KNOWLEDGE CREATION AND CODIFYING FRAMEWORK

Knowledge in communities is classified into two types, tacit and explicit, according to its nature.
Critical differences between these two types are seen in three major areas: codifiability and mechanisms for transfer, methods for acquisition and accumulation, and potential to be collected and distributed [19]. The EgoChat system investigates ways of sharing tacit knowledge [20]. Tacit knowledge is closely attached to the individual who owns it and is difficult to codify, while explicit knowledge can be codified, stored and shared independently of its owner's involvement. This does not imply that explicit knowledge is readily available for sharing as needed [8], but with appropriate knowledge sharing mechanisms in place, such sharing can be facilitated within the community. The ability to articulate knowledge within or between
other communities also depends on whether the knowledge is rationalised or embedded [21] and whether any value is attributed to its content. When knowledge becomes a valuable commodity, what knowledge to share, when to share it and with whom to share it become very important decisions [22]. Certain types of knowledge are valued highly by the community and by individuals but reside in a relatively limited number of individuals [23]. Even such knowledge must be shared among the authorised individuals and used promptly, with appropriate action taken, so that a higher impact can be achieved. A suitable KM framework can control the availability and depth of accessibility of such value-attributed knowledge so that only the appropriate content is shared with, or becomes available to, the intended recipients.

3.1 Shared Community Knowledge Pool And Aggregation Of Knowledge Within Individuals

Knowledge creation activities inherently require absorbing related knowledge from different sources; that content then interacts with the individuals' own knowledge (one or many working together) of the specific subject area(s). So the knowledge (K) of an individual (Px = Person x) at a specific time, PxK, can be given as

PxK = Σ (held knowledge of a specific subject area),

where Σ is used to consolidate the knowledge of the many subject areas that the individual (Px) has retained. It is assumed that the n members of the Shared community Knowledge pool (Skn) are able to sustain their own knowledge creation and management activities by absorbing knowledge from the Skn and, after creation, by supplying the newly created knowledge back into the Skn. The following figure describes knowledge sharing, knowledge absorption and the working knowledge of an individual engaged in a specific knowledge creation activity.
As shown in figure 1 and described above, at a given time the knowledge held by an individual (Px) is PxK, and the amount of shared knowledge, or post-creation knowledge contribution, made to the Skn by that person is SkPx. From a different point of view this can be expressed as

Skn = Σ (SkP1, SkP2, SkP3 … SkPn),

where Σ is used to combine the knowledge contributions of the community of n knowledge-sharing members. The knowledge that an individual needs to absorb from the K pool (Skn) to perform a specific task (knowledge creation activity) is called A1. The knowledge that needs to be absorbed (A) may be directly associated with the extent of the knowledge the person lacks and/or the type of knowledge creation work in which the person is involved at the time, individually or as part of a group. The individual's involvement in the knowledge creation and management process described here is named the Knowledge Production Life Cycle (KPLC). Knowledge creation activity within a group setting
will be discussed later.

Figure 1: Knowledge creation activities related knowledge aggregation (of an individual)
P1 = Person one
Sk = Shared K pool of the organisation
P1K = Person no. 1's pre-knowledge
Knowledge creation activities = Any knowledge creation activities in which an individual is involved (personally or within a group environment)
A1 = Received (or absorbed) knowledge from the K pool for completing K creation activities
SkP1 = New knowledge to be shared during post knowledge creation activity

Knowledge creation and sharing as discussed above can be seen as a two-faceted movement: on one side it helps the community move forward in accordance with its goals and visions, and on the other side it serves as a knowledge pool from which members of the community can constantly gain knowledge for their own advancement. Knowledge sharing at its best can therefore be seen as a controlled knowledge pool that accumulates knowledge from individuals (or groups), where individuals can receive, aggregate and create knowledge and then supply the new knowledge back to the pool. Given these characteristics, the sharing of knowledge can be viewed mathematically, as discussed later in this article. Individuals working together (dynamic working units engaged in knowledge creation activities) constantly share knowledge, but not every piece of knowledge reaches the shared knowledge pool. The group or individual must take the opportunity to codify that knowledge and make it available to others authorised to receive that content (or those objects). However, virtual groups create many artefacts together (predefined deliverables or voluntary group work) that capture their knowledge of a particular subject matter, so these objects can be included in the knowledge pool in a timely manner. Much of the knowledge created within a virtual community is based on virtual teams (or groups).
While some individuals may be involved in knowledge creation activities by themselves, many others are engaged in different activities associated with different groups. Figure 2 shows an integrated version that places the individual knowledge creation activities, or KPLC, described in figure 1 within group environments spread among virtual teams. As an example, a virtual team of two people (P1 and P2) can engage in knowledge creation activities in order to deliver new knowledge for the community or for their own benefit. After completing this activity they may share a codified version (SkP1 + SkP2) containing the created knowledge with the virtual community by incorporating this object in the Skn. It is assumed that only the appropriate users are able to access this information, in a timely manner.

3.2 Knowledge Aggregation Algorithm For n People Within The Community

The following points (a) and (b) recapture the KPLC for the individual and its implication for the K pool, as two important points of interest where knowledge aggregates, for further analysis.

(a) KPLC within an individual (Px)
An individual is likely to receive (Ax: absorb) knowledge from the knowledge pool in order to complete the activities in hand. For simplicity, knowledge gained directly from another individual is not pictured above; it is assumed that this type of knowledge also reaches the individual via the knowledge pool. The individual's prior knowledge (at time t = 0: PxK) is likely to react with the absorbed knowledge to create new knowledge, concepts and ideas. The newly created knowledge is then shared by moving it back to the knowledge pool, fully or partially, if that seems appropriate at the time.

(b) KPLC and aggregated group contribution to shared knowledge pool (Σ Skn(t))
Individuals within working groups are likely to share their knowledge of a particular interest by making it available in the shared knowledge pool through some means of codifying the knowledge content.
In a given case this may take the form of an artefact on which a few individuals have worked together (virtually connected, co-located or a combination of the two modes) and which they have decided to share by making that
artefact available in the knowledge pool. At a given time, many individuals or virtual groups who work together on various knowledge creation activities also share similar content accordingly. This concept automatically facilitates knowledge aggregation within the knowledge pool as well as within individuals (as discussed in (a)), who may need to receive the newly available knowledge on an as-needed basis. Assuming there is only limited resistance to knowledge sharing (based on the value of the created knowledge), individuals frequently create new knowledge and make it available in the knowledge pool for community access. Over time the community will be equipped with a valuable aggregated knowledge pool, and this will become a key capability of the community for achieving its goals and objectives.

Figure 2: Group knowledge creation activities and knowledge aggregation

Relating knowledge creation activities to individual and group human processors, with a view to quantifying the Knowledge Creation Process (KCP) of virtual communities, is a novel idea. The transfer of tacit and explicit knowledge among individuals in a group and the different modes of knowledge creation have been explored by researchers [24]. They have described a ‘web’ of knowledge management activities in organisational settings and proposed a conceptual foundation for a KM framework. The KCP within an individual, how this process interacts with the community setting, and knowledge aggregation within the individual and the community knowledge pool have not been explored, and the literature shows many gaps in this area. In order to fulfil this requirement, figure 3 analyses the KCP within an individual and provides a knowledge creation state diagram.

Figure 3: Knowledge creation state diagram of an individual

This analysis and the descriptions based on figure 2 lead to a model that quantifies the knowledge creation and aggregation process mathematically, focusing on virtual and/or knowledge communities. The state diagram, with seven knowledge states, shows two exit points: one may share the created knowledge after some means of codifying it, or decide to exit the process, regardless of which knowledge state the person is in when that decision is made. Two knowledge cycles are highlighted in figure 3: first, knowledge absorption, integration with one's own knowledge, knowledge creation, integration of the created knowledge and finalising the created knowledge (KS1, KS2, KS3); second, the time delay that may be added to the codifying and sharing process if sharing the created knowledge is not appropriate at that moment (KS5). There may be circumstances in which the knowledge is codified fully or partially and a delay for sharing is then added; this is not pictured, for simplicity. The KPLC simplifies the knowledge creation, aggregation and sharing aspects of KM within a community knowledge pool and within individuals. The mathematics that follows (figure 4) shows that, in an ideal situation and with a few assumptions, knowledge aggregation over a period of time can be quantified. Some terms in the equations (e.g. sub-terms in (4) and (5)) may evaluate to null, as not everyone absorbs, creates or shares knowledge at every moment over the considered period. While this quantification has no direct influence on the storage space required for the codified knowledge, it recognises the patterns emerging in the aggregation of knowledge in both cases. This will allow further analysis, understanding and quantification of knowledge creation and management activities in knowledge communities.
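The KPLC steps described in this section (absorb from the pool, aggregate with prior knowledge, create, and share back) can be illustrated with a small Python sketch. It rests on a strong simplifying assumption that is not made in the paper: knowledge is represented as a set of discrete, labelled items, and aggregation is modelled as set union. The function names and item labels are hypothetical.

```python
def absorb(person_k, pool, needed):
    """Aggregation step: prior knowledge plus what the pool can supply."""
    absorbed = pool & needed      # A_x: requested items the pool holds
    return person_k | absorbed


def create(person_k, created):
    """Creation step: add the newly created knowledge Ck."""
    return person_k | created


def share(pool, contribution):
    """Sharing step: Sk_n grows by one person's codified contribution."""
    return pool | contribution


pool = {"k1", "k2", "k3"}                      # Sk_n at time t
p1 = {"k0"}                                    # P1K at time t
p1 = absorb(p1, pool, needed={"k2", "k9"})     # "k9" is not in the pool
p1 = create(p1, created={"c1"})                # Ck produced by person 1
pool = share(pool, contribution={"c1"})        # SkP1 returned to the pool
print(sorted(p1))    # ['c1', 'k0', 'k2']
print(sorted(pool))  # ['c1', 'k1', 'k2', 'k3']
```

Using sets makes repeated absorption of the same item idempotent, which is one possible reading of aggregation; a quantity-based reading would instead sum numeric amounts.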
4 OPPORTUNITY FOR KNOWLEDGE SHARING WITHIN A COMMUNITY

4.1 Formal vs Informal

There are both formal and informal activities, events and tools through which individuals can share knowledge. Communities and groups can organise training programmes, workshops, etc. and also offer methods, systems and tools that facilitate knowledge sharing through formal interactions [25]. Communities can organise formal settings for very large audiences, where knowledge sharing and dissemination can be faster and more effective. On the other hand, interactions of a personal nature, social networks, etc. facilitate the same through informal interactions. While formal settings play an important role in knowledge sharing, research has shown that most sharing
takes place in informal settings [8]. The ability to build trust through face-to-face interactions influenced sharing in some cases. Codifying (for sharing) the knowledge associated with formal activities is somewhat easier because of their structured approach throughout the process. Hence recognising the type of activity at a very early stage can improve the support and facilitation of the KM associated with it. Detailed analysis of the attributes of formal and informal activities (and/or events) shows that some activities do not clearly fit either type. These activities have some characteristics of an informal nature and some of a formal nature, making it difficult to isolate their type. Most activities performed by CoP are inherently informal in nature. Similarly, Knowledge Networks (KN) perform activities whose characteristics are more formal in nature, although some of these activities also prove to have informal attributes. Hence categorising these activities as ‘formal’ is not appropriate when considering their codifiability. It has been difficult to codify knowledge, and therefore to facilitate its reuse, when the associated activity is more informal in nature. This is also partially due to the lack of suitable, standard tools for capturing and analysing such knowledge content. For further analysis of the knowledge sharing associated with this mixed type of activity, a third type called ‘nominal’ is defined. In other words, a nominal activity shows informal as well as formal characteristics. This recognition has assisted us in improving the KM process associated with the type of activity or event held by the community. As a result, three types of knowledge creation activities are considered in the KM framework defined.
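The three activity types can be written down as a small classification, shown here as a Python sketch. The enum, the function name and the two boolean inputs are hypothetical; the paper does not specify how the formal or informal characteristics of an activity are detected, so the sketch simply takes them as given.

```python
from enum import Enum


class ActivityType(Enum):
    FORMAL = "formal"
    INFORMAL = "informal"
    NOMINAL = "nominal"   # shows both formal and informal characteristics


def classify_activity(has_formal_traits, has_informal_traits):
    """Map observed characteristics to one of the three activity types."""
    if has_formal_traits and has_informal_traits:
        return ActivityType.NOMINAL
    if has_formal_traits:
        return ActivityType.FORMAL
    if has_informal_traits:
        return ActivityType.INFORMAL
    raise ValueError("activity shows neither formal nor informal traits")


# A workshop that also relies on personal, social interaction.
print(classify_activity(True, True).value)
```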
4.2 Knowledge Processing, Codification And Reuse Model
Four interconnected factors have been identified as influencing knowledge sharing: the nature of knowledge, the motivation to share, the opportunities to share and the culture within the work environment [8]. However, within a large virtual community environment culture becomes a very complex matter to discuss, and it may vary considerably even between community groups. On the other hand, factors such as codifying, storage, access control, estimating the appropriate time for sharing and the means of sharing are also valued within a complete knowledge creation and management cycle. An individual may bring new knowledge into the community through external activities, through working together with others, through their own work, or by other means. Regardless of the origin, the individuals or knowledge creation groups
UbiCC Journal, Volume 3, January 2008
Total number of people using the shared pool = $n$; Person $x$ = $P_x$.
Knowledge shared by Person $x$ to the pool = $Sk_{P_x}$; knowledge shared by $n$ persons to the pool = $Sk_n$.
At a given time $t$, the shared knowledge of $n$ persons can be written as:

$$Sk_n(t) = Sk_{P_1}(t) + Sk_{P_2}(t) + \dots + Sk_{P_x}(t) + \dots + Sk_{P_n}(t) \tag{1}$$

Let the original knowledge status of Person $x$ at time $t$ be $P_xK(t)$, and let the knowledge absorption from the pool ($Sk_n$) by Person $x$ at time $t$ be $A_xSk_n(t)$.
$\delta$ = time between two knowledge states (to reach KS 1: fig 3).
Let the aggregated knowledge of Person $x$ reached at KS 1 (fig 3) be $P_xK(t+\delta)$. At time $t+\delta$, the aggregated new knowledge of Person $x$ can be written as:

$$P_xK(t+\delta) = P_xK(t) + A_xSk_n(t) \tag{2}$$

Just after time $t+\delta$, let the new knowledge created due to $P_xK(t+\delta)$ be $Ck(t+\delta)$. If the person's knowledge has moved to the next state when $\delta \to \Delta$ (e.g. to reach KS 2: fig 3), then at time $t+\Delta$ Person $x$'s new knowledge status is $P_xK(t+\Delta)$:

$$P_xK(t+\Delta) = P_xK(t+\delta) + Ck(t+\delta) \tag{3}$$

Applying (2) in (3) and using the notation $\sum$, the combined aggregated knowledge of an individual over a period can be stated as below. Let person $x$'s aggregated knowledge over a period (time 0 to $t$) be $P_xk_t$:

$$P_xk_t = \sum_{t=0}^{t} P_xK(t+\Delta) = \sum_{t=0}^{t} P_xK(t) + \sum_{t=0}^{t} A_xSk_n(t) + \sum_{t=0}^{t} Ck(t+\delta) \tag{4}$$

Also, from (1) and using the notation $\sum$, the combined aggregated knowledge in the pool over a period (time 0 to $t$) is:

$$Sk_n = \sum_{t=0}^{t} Sk_n(t) = \sum_{t=0}^{t} Sk_{P_1}(t) + \sum_{t=0}^{t} Sk_{P_2}(t) + \dots + \sum_{t=0}^{t} Sk_{P_n}(t)$$

or

$$Sk_n = \sum_{t=0}^{t} \left( \sum_{x=1}^{n} Sk_{P_x}(t) \right) \tag{5}$$
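The aggregation relations (1) to (5) can be tried out numerically. The following is only a minimal sketch: it assumes, purely for illustration, that knowledge can be represented as a non-negative number per time step, and the function names (`pool_at`, `pool_over_period`, `next_state`) are hypothetical and not part of the prototype described in this paper.

```python
from typing import Dict, List


def pool_at(shared: Dict[str, List[float]], t: int) -> float:
    """Eq. (1): knowledge shared to the pool by all persons at time t."""
    return sum(series[t] for series in shared.values())


def pool_over_period(shared: Dict[str, List[float]], t_end: int) -> float:
    """Eq. (5): pool contributions summed over time steps 0..t_end."""
    return sum(pool_at(shared, t) for t in range(t_end + 1))


def next_state(own: float, absorbed: float, created: float) -> float:
    """Eqs. (2) and (3): knowledge status after absorbing and creating."""
    after_absorption = own + absorbed   # Eq. (2): state at t + delta
    return after_absorption + created   # Eq. (3): state at t + Delta


# Illustrative (made-up) contributions of two persons over two time steps.
shared = {"P1": [1.0, 2.0], "P2": [0.5, 0.5]}
total_in_pool = pool_over_period(shared, 1)
# Assumption: person x absorbs half of what is in the pool at t = 0.
new_status = next_state(10.0, 0.5 * pool_at(shared, 0), 1.0)
```

How much a person absorbs from the pool is not specified by the equations, so the absorbed amount is passed in explicitly rather than derived.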
Figure 4: Knowledge aggregation algorithm for n people
facilitating a community K pool must realise the nature of the knowledge, including its suitability (content, time, etc.) for sharing, before deciding on codification and storage. Once the knowledge is stored, the storage mechanism deals with the availability of the knowledge and the individuals’ access rights to that piece of knowledge. A role- or process-based access control system within the KMI is to provide the required trust and confidence for knowledge owners and for the community in general. While the community in practice must bring forward opportunities to share knowledge, it must also create a culture in which individuals and groups become motivated for such work. It has been identified that certain aspects of community culture influence knowledge sharing [26], and vice versa. The community should organise various events on a regular basis that promote such a motivating environment. Virtual communities should invest in and maintain a suitable KMI so that dispersed knowledge creation groups can attain the required benefits of such a system. The ICT tools provided by the KMI should support the community in the easy management of codified knowledge objects, opportunities for knowledge creation and integration, access control and ownership, timing of sharing, and discarding. A knowledge processing, codification and reuse model (figure 5) has been derived that takes most of the above factors into consideration. Further details of this model, the embedded KMI, and systems usability are to be published in the future.
5 FURTHER DEVELOPMENTS
The established prototype, with its enhanced collaborative development environment, facilitates KCP within virtual groups. This prototype was created for gathering the user needs and requirements of the targeted communities while providing many
Figure 5: Knowledge processing, codification and reuse model
Knowledge Operation (KO) and manipulation tools [27] for daily use. Future developments that embed newly identified user needs suitable for integration within this system are to be integrated in a manner somewhat similar to the ‘prototyping’ systems development life cycle. Each consecutive prototype version is to be used live within communities for the user needs analysis and consensus gathering process, which feeds into the next iteration.
5.1 Integration of Virtual Machine Architecture With KMI
These system prototypes are currently in operation within a virtual machine environment [28]. This provides hardware and operating system independence, giving the possibility of advanced replication of the repositories [29] and KM operations. To capitalise on the best practices of virtual infrastructure development, the Virtual Machine Architecture (VMA) services are based on current enterprise-grade technologies. Therefore some of our ongoing work utilises JBoss
JEMS, VMware Infrastructure 3, Citrix XenServer v4, Internet2, etc., services associated with the e-science core programme, and the latest research (e.g. EU and international developments), which are to be combined to provide an enhanced and consolidated infrastructure for identifying the outlook for future standardisation of community workspace environments. A conceptual diagram of the VMA established for one of the prototypes is shown in figure 6. The current engagement of virtual communities through various projects provides the community base and the user requirements for these research activities, and has become the initial test bed with direct access to many organisations. As the KMI considered provides an essential solution to the existing as well as future needs and requirements of these networks of communities, the establishment of future research and its sustainability look very promising.
Figure 6: Conceptual VM architecture
The integration of the final KMI and the VMA within Future Workspaces [30] is also to be discussed in other publications. These future solutions will eliminate the complicated physical networking of many computers, which facilitates the above ubiquitous environment in a manner somewhat similar to figure 6. This technology is to provide a very effective and improved system response to users, the lack of which could otherwise cause an adverse effect. Such infrastructure within Future Workspaces will provide the advanced KM features required for successful collaborative knowledge development work among dispersed scientific and professional community groups.
6 CONCLUSION
This paper explores the concept that, whilst knowledge sharing is a complex matter to analyse within communities of practice, by adhering to a suitable knowledge management infrastructure and associated tools some of the complexities recognised can be avoided and appropriate influencing factors of knowledge sharing can be facilitated. The authors have taken the lead in demonstrating this stance by implementing a suitable infrastructure within appropriate settings that captures the user needs and constraints of the targeted communities (e.g. [18], [31]). The system has been used by individuals within virtual communities who create, share and reuse knowledge on a daily basis for various activities focused on achieving the goals and objectives of their community. The iterative results, findings and further development work are to be published in the future.
7 REFERENCES
[1] M. Earl: Knowledge management strategies: Towards a taxonomy, Journal of Management Information Systems, Vol. 18, No. 1, pp. 215-242 (2001).
[2] J. C. Spender, and R. M. Grant: Knowledge and the firm: Overview, Strategic Management Journal, Vol. 17, Special Issue: Knowledge and the Firm, pp. 5-9 (1996).
[3] D. Stenmark: Leveraging tacit organizational knowledge, Journal of Management Information Systems, Vol. 17, No. 3, pp. 9-24 (2001).
[4] P. Hendriks: Why share knowledge? The influence of ICT on the motivation for knowledge sharing, Knowledge and Process Management, Vol. 6, No. 2, pp. 91-100 (1999).
[5] D. R. Lessard, and S. Zaheer: Breaking the silos: Distributed knowledge and strategic responses to volatile exchange rates, Strategic Management Journal, Vol. 17, No. 7, pp. 513-533 (1996).
[6] I. Nonaka, and H. Takeuchi: The knowledge creating company: How Japanese companies create the dynamics of innovation, Oxford University Press, New York (1995).
[7] B. Lowendahl, O. Revang, and S. N. Fosstenlokken: Knowledge and value creation in professional service firms, Human Relations, Vol. 54, No. 7, pp. 911-931 (2001).
[8] M. Ipe: Knowledge Sharing in Organizations: A Conceptual Framework, Theory and Conceptual Article, Human Resource Development Review, Sage Publications, Vol. 2, No. 4, pp. 337-359 (2003).
[9] R. J. J. Boland, and R. V. Tenkasi: Perspective making and perspective taking in communities of knowing, Organization Science, Vol. 6, No. 4, pp. 350-372 (1995).
[10] S. M. Lee, and S. Hong: An enterprise-wide knowledge management system infrastructure, Industrial Management & Data Systems, Vol. 102, No. 1, pp. 17-25 (2002).
[11] M. Lindvall, I. Rus, and S. S. Sinha: Software systems support for knowledge management, Journal of Knowledge Management, Vol. 7, No. 5, pp. 137-150 (2003).
[12] M. Bieber, D. Engelbart, R. Furuta, S. R. Hiltz, J. Noll, J. Preece, E. A. Stohr, M. Turoff, and B. Van de Walle: Toward virtual community knowledge evolution, Journal of Management Information Systems, Vol. 18, Issue 4, pp. 11-35 (2002).
[13] C. S. De Souza, and J. Preece: A framework for analyzing and understanding online communities, Interacting with Computers, The Interdisciplinary Journal of Human-Computer Interaction, Vol. 16, Issue 3, pp. 579-610 (2004).
[14] W. Bukowitz: At the Core of a Knowledge Base, Journal of Knowledge Management, Vol. 1, Issue 3, pp. 215-224 (1997).
[15] J. M. Firestone, and M. McElroy: Key issues in the new knowledge management, KMCI Press / Butterworth-Heinemann (2003).
[16] M. O’Connor, H. Knublauch, S. Tu, B. Grosof, D. Dean, W. Grosso, and M. Musen: Supporting Rule System Interoperability on the Semantic Web with SWRL, Fourth International Semantic Web Conference: ISWC2005, SMI2005-1080, Galway, Ireland (2005).
[17] S. Li, H. Hsieh, S. Chen, and M. Shyu: Facilitating KMS Reusability by XML Binding Model, IRI2003, IEEE Publication, pp. 478-484 (2003).
[18] S. Mihindu, and T. Fernando: Intuition Knowledgebase, Intuition General Assembly presentation at Basel, Switzerland, 03/09/2006 (2006).
[19] A. Lam: Tacit knowledge, organizational learning and societal institutions: An integrated framework, Organization Studies, Vol. 21, No. 3, pp. 487-513 (2000).
[20] H. Kubota, T. Nishida, and T. Koda: Exchanging Tacit Community Knowledge by Talking-virtualized-egos, Fourth International Conference on Autonomous Agents, Barcelona, Spain, June 3-June 7, pp. 285-292 (2000).
[21] L. Weiss: Collection and connection: The anatomy of knowledge sharing in professional service, Organization Development Journal, Vol. 17, No. 4, pp. 61-77 (1999).
[22] K. M. Andrews, and B. L. Delahaye: Influences on knowledge processes in organizational learning: The psychological filter, Journal of Management Studies, Vol. 37, No. 6, pp. 797-810 (2000).
[23] F. M. R. Armbrecht, R. B. Chapas, C. C. Chappelow, G. F. Farris, P. N. Friga, C. A. Hartz, M. E. McIlvaine, S. R. Postle, and G. E. Whitwell: Knowledge management in research and development, Research-Technology Management, Vol. 44, No. 4, pp. 28-48 (2001).
[24] M. Alavi, and D. E. Leidner: Review: Knowledge Management and Knowledge Management Systems, Conceptual Foundations and Research Issues, MIS Quarterly, Vol. 25, No. 1, pp. 107-136 (2001).
[25] K. M. Bartol, and A. Srivastava: Encouraging knowledge sharing: The role of organizational reward systems, Journal of Leadership & Organizational Studies, Vol. 9, No. 1, pp. 64-76 (2002).
[26] D. W. DeLong, and L. Fahey: Diagnosing cultural barriers to knowledge management, The Academy of Management Executive, Vol. 14, No. 4, pp. 113-127 (2000).
[27] S. Mihindu, and S. De Alwis: User Guide of the Knowledgebase Prototype V1.0, IST-NMP-1-507248-2, Oct 2006, 21pgs (2006).
[28] S. Mihindu: Intuition Knowledge Base Software Tool Development: Prototype V1.0 documentation, IST-NMP-1-507248-2, Aug 2006, 7pgs (2006).
[29] S. Mihindu, and F. Khosrowshahi: Virtualisation of disaster recovery centres, CIB W89: International Conference in Building Education and Research, Kandalama, Sri Lanka, 10-15th February 2008, BEAR 2008: accepted for publication (2008).
[30] S. Mihindu: Virtualised e-infrastructure that facilitates collaborative work among distributed communities, 9th SPARC’07, Salford Crescent, UK, 10-11 May 2007, 12pgs, in print (2007).
[31] S. Mihindu: INTUITION Knowledgebase: Prototype V1.0, Future Workspaces Research Centre, University of Salford, EU Project Deliverable: D1.C_4, IST-NMP-1-507248-2, Feb 2007, 67pgs (2007).
Signal Denoising by Wavelet Packet Transform on FPGA Technology
Mohamed I. Mahmoud, Moawad I. M. Dessouky, Salah Deyab, and Fatma H. Elfouly
M. I. Mahmoud, M. I. M. Dessouky, S. Deyab are with Faculty of Electronic Engineering, Menouf, Egypt. F. H. Elfouly is with HIE, Alshorouk academy, Cairo, Egypt.
Abstract— A denoising method based on wavelet packet shrinkage was developed in this research. The principle of wavelet packet shrinkage for denoising and the selection of thresholds and threshold functions were analyzed. The design of a low-cost, field programmable gate array (FPGA) based digital hardware platform that implements wavelet packet transform algorithms for real-time signal denoising is presented.
Keywords— FPGA, wavelet packet transform, denoising.
I. INTRODUCTION
All signals obtained as the instrumental response of analytical apparatus are affected by noise. The noise degrades the accuracy and precision of an analysis, and it also reduces the detection limit of the instrumental technique. Signal denoising is therefore highly desirable in analytical response optimization. For the applications of interest, noise is primarily high frequency, while the signal of interest is primarily low frequency. Because the wavelet transform decomposes the signal neatly into approximation (low frequency) and detail (high frequency) coefficients, the detail coefficients will contain much of the noise. This suggests a method for denoising the signal: simply reduce the size of the detail coefficients before using them to reconstruct the signal. This approach is called thresholding, or shrinkage, of the detail coefficients. Of course, we cannot throw away the detail coefficients entirely; they still contain some important features of the original signal. Various kinds of thresholding have been proposed, and which kind is best depends on the application. Two approaches are usually applied for denoising: hard thresholding and soft thresholding. The hard thresholding method consists in setting all the wavelet coefficients below a given threshold value equal to zero, while in soft thresholding the wavelet coefficients are reduced by a quantity equal to the threshold value [5]. A generalization of the discrete wavelet transform is the discrete wavelet packet transform (DWPT), which keeps splitting both the lowpass and highpass subbands at all scales in the filter bank implementation; the wavelet packet transform thus provides a flexible and detailed analysis. We therefore used the wavelet packet transform for de-noising. Signal de-noising using the wavelet packet transform consists of the following three steps:
1. Wavelet packet transform of the observed signal.
2. Shrinkage of the empirical wavelet coefficients.
3. Inverse wavelet packet transform of the modified coefficients.
The denoising procedure requires the estimation of the noise level. In this work Stein's Unbiased Estimate of Risk (SURE) [6] has been chosen as a principle for selecting a threshold to be used for denoising. Previous research on signal de-noising using wavelets has been offline in nature: the signal is sampled in real time, but then captured in memory or on hard disk and de-noised after the fact on a traditional personal computer or workstation using a software tool such as Matlab. However, many applications require real-time processing, in which the signal must be processed as it is received. These real-time applications require that the signal be processed at the same rate at which it is produced; in other words, the throughput in samples per second of data coming out of the de-noising system must be equal to the throughput of data going into the system. A small amount of latency, or lag from input to output, is acceptable (and necessary, since computations cannot be done instantaneously). The goal of this research is to demonstrate that signal de-noising can be done in real time, efficiently and inexpensively, by using a field programmable gate array as the computational platform.
The rest of the paper is organized as follows. Section II describes the wavelet packet algorithm. Section III explains why FPGAs are an appealing choice for implementation of the de-noising part of the system. Section IV describes the denoising principle. Section V details our FPGA implementation of signal denoising. Section VI gives the simulation results. Section VII gives the synthesis results. Section VIII draws conclusions.
II. WAVELET PACKET ALGORITHM
The Wavelet Packet Transform (WPT) is now becoming an efficient tool for signal analysis. Compared with normal wavelet analysis, it has the special ability to achieve higher discrimination by analyzing the higher frequency domains of a signal. The frequency domains divided by the wavelet packet can be easily selected and classified according to the characteristics of the analyzed signal. The wavelet packet is therefore more suitable than the wavelet in signal analysis and has much wider applications, such as signal and image compression, denoising and speech coding [7]. The wavelet packet transform uses a pair of lowpass and highpass filters to split a space, which corresponds to splitting the frequency content of a signal into roughly a low-frequency and a high-frequency component. In wavelet decomposition we leave the high-frequency part alone and keep splitting the low-frequency part. In wavelet packet decomposition, we can choose to split the high-frequency part also into a low-frequency part and a high-frequency part. So in general, wavelet packet decomposition divides the frequency space into various parts and allows better frequency localization of signals [7].
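The difference between the two decompositions can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the two-tap Haar filter pair for brevity rather than the Daubechies filters chosen for this implementation, the input length is assumed to be divisible by 2 to the power of `levels`, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_split(x):
    """One analysis step: lowpass/highpass filtering, keeping every second sample."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high


def wavelet_decompose(x, levels):
    """Wavelet decomposition: only the low-frequency branch is split again."""
    bands, approx = [], list(x)
    for _ in range(levels):
        approx, detail = haar_split(approx)
        bands.append(detail)
    bands.append(approx)
    return bands  # levels + 1 bands of decreasing length


def wavelet_packet_decompose(x, levels):
    """Wavelet packet decomposition: every band, low and high, is split again."""
    nodes = [list(x)]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_split(node)]
    return nodes  # 2**levels bands of equal length


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
dwt_bands = wavelet_decompose(signal, 2)          # band lengths 4, 2, 2
wpt_bands = wavelet_packet_decompose(signal, 2)   # four bands of length 2
```

With two levels the wavelet decomposition yields three bands of unequal width, while the wavelet packet decomposition yields four bands of equal width, which is the finer high-frequency resolution referred to above.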
Fig. 1 Wavelet packet tree (input X(z); each node is split by the filters H0(z) and H1(z), each followed by downsampling by 2)
As shown in Fig. 1, the wavelet packet transform can be viewed as a tree. The root of the tree is the original data set. The next level of the tree is the result of one step of the wavelet transform. Subsequent levels in the tree are constructed by recursively applying the wavelet transform step to the lowpass and highpass filter results from the previous wavelet transform step [7]. Similarly, the inverse wavelet packet transform can reconstruct the original signal from the wavelet packet decomposition spectrum. The inverse wavelet packet transform starts from the coarsest decomposition level, where the WPT coefficients are upsampled before passing through a pair of reconstruction filters. Note that the wavelet used as a base for decomposition cannot be changed if we want to reconstruct the original signal. The Daubechies 18-tap wavelet has been chosen for this implementation. The filter coefficients corresponding to this wavelet type are shown in Table 1.
III. ADVANTAGES OF FPGA-BASED IMPLEMENTATION
Several computer hardware platforms can be considered for processing of signals from optical imaging systems; traditional choices for implementing such a system are a microprocessor, a digital signal processor, or an application-specific integrated circuit (ASIC). Microprocessors and digital signal processors offer the advantage of being inexpensive, off-the-shelf devices, easily programmed to perform a variety of tasks. On the other hand, an ASIC, while expensive to design and fabricate and inherently inflexible once the design is complete, offers an advantage in terms of processing speed [8]. Recent advances in FPGA technology have made FPGAs extremely attractive for implementation of all types of computational systems. FPGAs represent a new middle ground between microprocessors and ASICs in terms of computational performance and cost. Like microprocessors, FPGAs are inexpensive, off-the-shelf, and easily reprogrammed for new applications [8]. Like ASICs, FPGAs offer a high degree of control over the underlying computer hardware, and therefore allow the system designer to specify a hardware architecture tailored to the application at hand, thus providing additional processing speed. Once relegated to small “glue logic” applications, FPGAs are now capable of implementing complex computational systems. In the last few years, systems have been built or proposed for a variety of applications dominated by mathematical computations, including a cross-correlator for radio astronomy, a sonar beam former, one- and two-dimensional convolvers [8], a decimation filter, and a fast Fourier transform. This prior research shows that FPGA-based implementations are typically at least one order of magnitude faster than processor-based implementations, without incurring the high cost of fabrication and development required for application-specific integrated circuits.
IV. DENOISING PRINCIPLE
A. Model of Noise-containing Signals and Principles of Denoising Based on Wavelet Packet Shrinkage
In engineering, a one-dimensional model of signals with additive noise can be written as follows:
$$y(n) = x(n) + \sigma e(n), \quad n = 1, 2, \ldots, N \tag{1}$$
where y(n) denotes the noise-containing signal, x(n) denotes the real signal, e(n) is white Gaussian noise with the normal distribution N(0,1), and σ denotes the deviation of the noise. In engineering, the useful real signals usually take the form of low-frequency or relatively stable signals, while noise usually takes the form of high-frequency signals. The signal x(n) can be represented by the coefficients of its wavelet packet decomposition, with larger wavelet packet coefficients carrying more signal energy and smaller ones carrying less [8,9]. The basic idea of denoising with wavelet packet shrinkage relies on the fact that the wavelet packet coefficients of noise and of signals behave differently at different scales (namely, in different bands): the wavelet components produced by noise at the different scales, especially at noise-dominated scales, are eliminated, so that the preserved wavelet packet coefficients are those of the original signal; the original signal is then reconstructed via the wavelet packet reconstruction algorithm. The key to denoising based on wavelet packet shrinkage is therefore how to filter out the wavelet packet decomposition coefficients produced by noise. In engineering, appropriate thresholds are chosen to quantify the wavelet packet decomposition coefficients: coefficients lower than or equal to the threshold are set to zero, and only coefficients above the threshold are used to reconstruct the signal x(n). In this way most of the noise is eliminated, while the singularity points and characteristics of the original signal are preserved [9,10]. Obviously, the choice of threshold directly influences the effectiveness of the denoising algorithm. Too high a threshold results in too many wavelet packet decomposition coefficients being set to zero, thus destroying too many details of the signal, while with too low a threshold the expected denoising effect cannot be achieved. The process of denoising based on wavelet packet shrinkage is divided into three steps:
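The three steps named in the Introduction (wavelet packet transform, shrinkage of the coefficients, inverse wavelet packet transform) can be sketched in software as follows. This is a minimal illustration, not the FPGA design: it assumes the two-tap Haar filter pair instead of the Daubechies filters used in this work, an input length divisible by 2 to the power of `levels`, and a threshold `t` supplied by the caller. All function names, and the `keep_lowest_band` option that leaves the lowest-frequency band unshrunk, are hypothetical choices made for the example.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_split(x):
    """Analysis step: lowpass/highpass filtering with downsampling by 2."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high


def haar_merge(low, high):
    """Synthesis step: exact inverse of haar_split."""
    x = []
    for a, d in zip(low, high):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x


def wp_forward(x, levels):
    """Full wavelet packet tree: every band is split at every level."""
    nodes = [list(x)]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_split(node)]
    return nodes


def wp_inverse(nodes):
    """Merge sibling bands pairwise until the original length is restored."""
    nodes = list(nodes)
    while len(nodes) > 1:
        nodes = [haar_merge(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]


def hard_threshold(c, t):
    """Zero coefficients with magnitude at or below t; keep the rest unchanged."""
    return c if abs(c) > t else 0.0


def soft_threshold(c, t):
    """Zero small coefficients and reduce the magnitude of the others by t."""
    return math.copysign(max(abs(c) - t, 0.0), c)


def denoise(y, levels, t, shrink=soft_threshold, keep_lowest_band=True):
    nodes = wp_forward(y, levels)                 # step 1: wavelet packet transform
    start = 1 if keep_lowest_band else 0          # assumption: band 0 may be kept as is
    for k in range(start, len(nodes)):            # step 2: shrinkage
        nodes[k] = [shrink(c, t) for c in nodes[k]]
    return wp_inverse(nodes)                      # step 3: inverse transform
```

A call such as `denoise(samples, 3, 0.8, shrink=hard_threshold)` switches to hard thresholding; the value 0.8 is arbitrary and would in practice come from the threshold selection rule.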
threshold to be used for de-noising. Stein's Unbiased Risk Estimate (SURE) is an adaptive, data-driven threshold selection rule. The aim of the estimate is to minimize the risk. Because the coefficients of the true signal are unknown, the true risk is also unknown. An unbiased estimate of the true risk is derived for generalized threshold functions; the SURE threshold value is then the one that minimizes this unbiased risk estimate [6]. This technique calls for setting the threshold T to
$$T = \sqrt{2 \log_e\left(n \log_2(n)\right)} \tag{5}$$
where n is the length of the signal.
C. Selection of Threshold Function
For any threshold, two kinds of threshold function can be used: the hard-threshold function and the soft-threshold function. Their mathematical expressions are as follows [9]: Hard-threshold function:
⎧y ⎪ D H ( y, t ) = ⎨ ⎪0 ⎩
y ≥t y