ACTA AUTOMATICA SINICA    Vol. 34, No. 12    December, 2008

Incorporating Prior Knowledge into Kernel Based Regression
SUN Zhe¹    ZHANG Zeng-Ke¹    WANG Huan-Gang¹
Abstract  In some sample based regression tasks, the observed samples are few or not informative enough. As a result, a conflict between the number of samples and the model complexity emerges, and the regression method faces the dilemma of whether or not to choose a complex model. Incorporating prior knowledge is a potential solution to this dilemma. In this paper, one sort of prior knowledge is investigated and a novel method to incorporate it into the kernel based regression scheme is proposed. The proposed prior knowledge based kernel regression (PKBKR) method consists of two subproblems: representing the prior knowledge in the function space, and combining this representation with the training samples to obtain the regression function. A greedy algorithm for the representation step and a weighted loss function for the incorporation step are proposed. Finally, experiments are performed to validate the proposed PKBKR method; the results show that it achieves relatively high regression performance with appropriate model complexity, especially when the number of samples is small or the observation noise is large.
Key words  Machine learning, prior knowledge, kernel based regression, iterative greedy algorithm, weighted loss function

Received September 5, 2007; in revised form November 27, 2007. Supported by the National Key Technologies Research and Development Program of China during the 11th Five-Year Plan (2006AA060206) and the Basic Research Foundation of Tsinghua University (JC2007024). 1. Department of Automation, Tsinghua University, Beijing 100084, P. R. China. DOI: 10.3724/SP.J.1004.2008.01515

Kernel based regression is a theoretical framework that contains many popular regression methods, such as support vector regression (SVR)[1] and Gaussian process regression (GPR)[2]. Moreover, several other regression methods can be represented equivalently in the kernel based framework. Many regression problems have been solved successfully by these methods with relatively high precision and good generalization performance, and kernel based regression is currently a hot topic in the field of machine learning. In general, a kernel based regression method approximates the unknown function f(x) in the following form

$$\hat{f}(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) \tag{1}$$

where K(x, x_i) is a predefined kernel function, {x_i, y_i}_{i=1}^{l} is the observed sample set, and α_i, i = 1, ..., l are coefficients determined in the training process. In the SVR method, an additional bias term b is introduced into (1).

For a given kernel based method, the regression result depends only on the training samples. If the number of samples is relatively large and the samples reflect the essential properties of the unknown function, then all well-developed kernel based regression methods make excellent approximations. However, if the observed samples are few or not informative enough, a conflict between the sample number and the complexity of the learning machine (namely, the number of free coefficients in (1)) emerges, and the regression method confronts the dilemma of whether or not to choose a complex regression model. This dilemma is discussed thoroughly in Section 1.

Prior knowledge might offer a solution to this dilemma. In most practical machine learning problems, especially in industrial applications, some prior knowledge about the problem is available while the observed samples are few or congregate in a small region. The prior knowledge might be derived from mechanism analysis or summarized from experimental data. In these cases, incorporating the prior knowledge might offset the weakness of the sample based learning method in the following ways:
1) The redundant information in the samples can be removed and missing information can be supplemented; 2) A proper range of the model complexity can be determined; 3) Other performance improvements might be achieved; for instance, the resulting classifier or regression function can satisfy some known constraints, the training process can be sped up, and the generalization performance can be improved[3].

Some achievements based on this idea have been reported. Here we only discuss the "internal" incorporation methods, in which the prior knowledge is used to modify the structure, the constraints, or the training process of a learning machine. First, Joerding introduced analytical properties of the function, such as monotonicity and convexity, into neural network regression methods[3]. Furthermore, several methods to incorporate prior knowledge into neural networks were summarized[4] and simulations were made to compare these methods. Based on these theoretical achievements, an industrial application was investigated[5].

As for incorporating prior knowledge into kernel based learning methods, only a few works have been published and most of them concentrate on particular problems. For instance, Schölkopf proposed two methods to incorporate prior knowledge into the construction of the kernels of support vector machine (SVM) classifiers[6], and in an image retrieval application the knowledge about the distribution of positive instances was incorporated into the SVM structure via an additional penalty term[7]. Some researchers have attempted to incorporate more general prior knowledge into machine learning algorithms. For example, Wu proposed an algorithm called "weighted margin support vector machines"[8], whose key point is that different samples have different effects on the classification hyperplane. Fung discussed polyhedral constraints in the sample space for classification problems[9], and a result from optimization theory was used to translate this high-level statement of the constraint into the infinity-norm linear SVM. That paper proposed a theoretically new way to incorporate prior knowledge into a learning machine and its main idea is quite valuable, though its conclusion suits only a special learning method. Recently, a few further achievements in this realm have been published: Le introduced a new optimization method to treat a class of prior knowledge[10].


If the prior model structure is known to be linear with respect to the unknown parameters, then the prior model parameters can be estimated together with the regression model parameters to find a global minimum of the loss function[11−13]. To the best of our knowledge, no other general method to incorporate prior knowledge into kernel based regression has been reported.

This paper is limited to a sort of prior knowledge that satisfies the following conditions: 1) the prior knowledge is stated as determinate functions {p_k(x)}_{k=1}^{h} without unknown parameters; 2) the union of the domains of these functions (denoted as D_k) covers the region of the samples (denoted as Ω ⊂ R^n). A novel method to incorporate this prior knowledge into kernel based regression is proposed, and two key points are discussed thoroughly: how to represent the prior knowledge in the function space, and how to combine the prior knowledge and the training samples in the training procedure. The main idea and technical details are discussed in the following sections. In addition, two experiments are performed to validate the proposed method and to investigate how various factors affect the regression performance. The first experiment is based on the blackbody radiation problem with an artificial data set, and the second is based on a real industrial problem with real operational data.

1  Main idea and basic settings

In many industrial applications, most samples are obtained in the stationary state of the production line and congregate in a small region; only a few samples lie outside this region. In most cases, the information about the non-stationary state is of great importance, so most samples in the stationary region contribute little to the regression task and can be regarded as "non-informative samples". There are only a few informative samples, and they are submerged in a mass of non-informative samples. In this sense, such problems should be regarded as "problems with few samples", even though the total sample number is not small at all.

From an implementation viewpoint, for a kernel based regression method, closely distributed samples usually result in highly dependent equations for the regression coefficients; these redundant equations bring few benefits for determining the coefficients but might cause serious numerical problems. Commonly, the coefficients α_i in (1) are determined through several (say d) linear equations or a corresponding optimization problem. Denote by l the number of informative samples. If d < l, the coefficient equation system is overdetermined and should be replaced by an optimization problem, and the coefficients can be calculated with high numerical stability. In this case, the complexity of the regression model is limited by the number of informative samples, but the problem might be so complex that it cannot be solved by a simple model. On the contrary, if d > l, the coefficient equation system is underdetermined and its solution is not unique. In this case, other assumptions or constraints have to be introduced. However, these assumptions or constraints might be unreasonable, so that the solution is not the desired optimal one.

For example, suppose a two-dimensional coefficient vector is determined by a single linear equation, and the optimal solution is the solid circle in Fig. 1. The straight line is the solution trajectory of the linear equation; it does not pass through the optimal solution because of the observation noise. Since this equation has no unique solution, additional constraints must be included. For instance, if the norm of the solution is required to be as small as possible, the solution is the hollow square in Fig. 1, which is also far from the optimal solution. If instead a solution is guessed from the prior knowledge, say the hollow circle, then the solution can be determined as the solid square, which is much better than the least-norm solution (the hollow square).

Fig. 1  Main idea of the proposed method
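To make the geometry of Fig. 1 concrete, the following minimal sketch (not part of the original paper; all numbers are illustrative) compares the least-norm solution of a single underdetermined equation a^T α = y with the solution closest to a prior guess β, obtained by projecting β onto the solution line.

```python
import numpy as np

# One linear equation a^T alpha = y with two unknowns (underdetermined).
a = np.array([1.0, 2.0])      # illustrative coefficients
y = 3.0                        # noisy observation
beta = np.array([2.0, 1.0])    # prior guess of the solution (hollow circle in Fig. 1)

# Least-norm solution: alpha = a * y / (a^T a)   (hollow square in Fig. 1)
alpha_min_norm = a * y / (a @ a)

# Solution closest to the prior guess: project beta onto the line a^T alpha = y
# (solid square in Fig. 1)
alpha_prior = beta + a * (y - a @ beta) / (a @ a)

print("least-norm solution  :", alpha_min_norm)
print("prior-guided solution:", alpha_prior)
print("both satisfy a^T alpha = y:",
      np.isclose(a @ alpha_min_norm, y), np.isclose(a @ alpha_prior, y))
```

Both points satisfy the equation, but the prior-guided one is generally much closer to the true coefficient when the prior guess is reasonable.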

Obtaining the solution based on prior knowledge is not a simple task. It can be divided into two steps:
Step 1. Represent the prior knowledge function in the function space. This step includes two subproblems: choosing the function space, namely, constructing the coordinate system in Fig. 1, and representing the prior function in this space, namely, determining the hollow circle.
Step 2. Incorporate the prior knowledge and the training samples, namely, compute the solid square from the straight line and the hollow circle.
The following sections discuss these steps thoroughly. In this paper, all analysis is performed in a function space, a prior function is regarded as a point or a set in that space, and the reproducing kernel Hilbert space (RKHS)[14] is used as the basic function space. The regression is performed in a finite dimensional subspace S = span{K(t_k, x), k = 1, ..., d}, and the prior knowledge function is translated into this subspace as

$$p(x) \approx \sum_{k=1}^{d} \beta_k K(t_k, x) \tag{2}$$

The points t_k and the coefficients β_k are determined from the prior knowledge using an iterative greedy algorithm. Once the representation (2) is obtained, the regression is performed in the subspace S. The corresponding loss function has two weighted terms: one reflects the performance on the training samples, while the other reflects the distance between the regression function and the prior knowledge. Various distances for these two terms are discussed in this paper. The weights in the loss function reflect the tradeoff between the training samples and the prior knowledge.

2  Greedy algorithm to represent the prior function

In this section, a greedy algorithm is proposed to represent the prior knowledge function in the function space. Consider the simplest case: there is only a single determinate prior knowledge function p(x) with domain D ⊃ Ω. Our aim is to find its finite dimensional RKHS representation, namely

$$p(x) \approx \sum_{k=1}^{d} \beta_k K(t_k, x; \omega) \tag{3}$$

where {t_k}_{k=1}^{d} are called "basis vectors" and ω is the parameter of the kernel function. The representation of the prior knowledge involves two subproblems: determining the subspace S to be used, namely, the kernel function parameter ω and the basis vectors {t_k}_{k=1}^{d}; and determining the coordinates of the prior knowledge, namely, the coefficients β_k. Suppose p(x) belongs to the RKHS associated with the positive definite kernel K(x, y). Then there must be a series of the form (3) that converges to p as d → ∞.

An iterative algorithm is used to find the approximation (3). In [15], a greedy algorithm was proposed to find the low-rank approximation of the kernel Gram matrix. Computing approximation (3) can be regarded as its generalization to infinite dimension with a specific error criterion. Suppose an approximation in the m-th step has been obtained, and denote the residue function as

$$r_m(x) = p(x) - \sum_{k=1}^{m} \beta_k^{(m)} K(t_k, x) \tag{4}$$

where β_k^{(m)} denotes the k-th coefficient evaluated in the m-th step. In order to achieve the largest improvement, the next basis vector t_{m+1} should ensure that the next basis function K(t_{m+1}, x) is as close as possible to the residue function r_m. Various measures can be adopted to evaluate this closeness. In this paper, the angle between the two functions is used because of its convenience in computation. This angle can be calculated as

$$\cos\angle(r_m, K(t_{m+1}, \cdot)) = \frac{r_m(t_{m+1})}{\|r_m\|_H \, \|K(t_{m+1}, \cdot)\|_H} \propto \frac{r_m(t_{m+1})}{\sqrt{K(t_{m+1}, t_{m+1})}} \tag{5}$$

where the norm ||·||_H denotes the RKHS norm, the reproducing property of the RKHS is used, and the term independent of t_{m+1} is ignored in the last expression. Therefore, the next basis vector can be determined as

$$t_{m+1} = \arg\max\left\{ \frac{|r_m(t)|}{\sqrt{K(t, t)}},\; t \in \Omega \right\} \tag{6}$$

Once the basis vector t_{m+1} is obtained, the coefficients β_k^{(m+1)} should be calculated. In this paper, Gaussian process regression[2] is adopted to determine these coefficients, which gives the following formula

$$\beta^{(m+1)} = (K_{m+1} + \delta I)^{-1} p_{m+1} \tag{7}$$

where β^{(m+1)} = [β_1^{(m+1)}, ..., β_{m+1}^{(m+1)}]^T is the coefficient vector, K_{m+1} = [K(t_i, t_j)]_{i,j=1}^{m+1} is the kernel Gram matrix, p_{m+1} = [p(t_1), ..., p(t_{m+1})]^T is the vector of prior function values, and δ is a predefined small positive number. This iterative procedure is performed until the desired precision is achieved: ||r_d(x)|| ≤ ε_f. Here, ||·|| can be any function norm.

The initial basis vector set {t_i}_{i=1}^{m_0} can be extracted from the training sample set. In most cases, the training sample set contains many highly dependent samples; these samples are redundant and can be removed from the basis vector set. The dependency between samples x_i and x_j can be evaluated by the angle between the corresponding kernel functions: θ_ij = arccos√(K(x_i, x_j)² / [K(x_i, x_i) K(x_j, x_j)]).

The kernel parameter can also be determined by the iterative procedure. If the prior knowledge representations p(x) ≈ Σ_{k=1}^{d_i} β_k(ω_i) K(t_k, x; ω_i), i = 1, ..., N_ω have already been obtained, the optimal parameter ω_opt can be chosen with some criterion, for example, the least error criterion ω_opt = argmin_ω ||p(x) − Σ_{k=1}^{d_i} β_k(ω) K(t_k, x; ω)||, the least order criterion ω_opt = argmin_ω {d_i}, or the least complexity criterion ω_opt = argmin_ω {β(ω)^T K β(ω)}. These criteria have different meanings and should be chosen according to the problem. The iterative algorithm in this section is summarized as follows.

Algorithm 1. Greedy algorithm to represent the prior knowledge
Input: tolerance ε_f, kernel parameters {ω_k}_{k=1}^{N_ω}
Find: optimal d_opt, {t_i}_{i=1}^{d_opt}, {β_i}_{i=1}^{d_opt}, ω_opt
For j = 1 : N_ω
    K(x, y) = K(x, y; ω_j)
    Remove the training samples with high dependency to obtain the initial basis vector set {t_{j,i}}_{i=1}^{m_j}
    Compute β_j^{(m_j)} by (7)
    Compute ε_f^{(m_j)} = ||p(x) − Σ_{k=1}^{m_j} β_k^{(m_j)} K(t_{j,k}, x)||
    While ε_f^{(m_j)} > ε_f
        m_j = m_j + 1
        Choose t_{j,m_j} by (6)
        Compute β_j^{(m_j)} by (7)
        Compute ε_f^{(m_j)}
    end
    d_j = m_j, β_j = β_j^{(d_j)}
end
Find the optimal index j_opt using some criterion
Output: d_opt = d_{j_opt}, {t_i}_{i=1}^{d_opt} = {t_{j_opt,i}}_{i=1}^{d_opt}, β_opt = β_{j_opt}, ω_opt = ω_{j_opt}
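The greedy representation step can be sketched in a few lines of code. The following Python fragment is only a minimal illustration of Algorithm 1 for a single kernel parameter (the outer loop over ω and the dependency-based pruning of the initial basis set are omitted); the Gaussian kernel and all function and variable names are our assumptions, not part of the original paper.

```python
import numpy as np

def gaussian_kernel(a, b, omega=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 * omega^2)); the kernel choice is an assumption."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * omega ** 2))

def greedy_prior_representation(p, candidates, eps_f=1e-3, delta=1e-8,
                                omega=1.0, max_basis=50):
    """Greedy sketch of Algorithm 1 for a fixed kernel parameter omega.

    p          : callable, the prior knowledge function p(x)
    candidates : (N, n) array of points discretizing the sample region Omega
    Returns the basis vectors T (d, n) and coefficients beta (d,) of (3).
    """
    p_cand = np.array([p(x) for x in candidates])   # p evaluated on the candidate grid
    T = np.empty((0, candidates.shape[1]))
    beta = np.empty(0)
    residue = p_cand.copy()                          # r_0 = p
    for _ in range(max_basis):
        if np.max(np.abs(residue)) <= eps_f:         # sup-norm over the grid as ||r||
            break
        # (6): pick the candidate maximizing |r(t)| / sqrt(K(t, t))
        diag = np.diag(gaussian_kernel(candidates, candidates, omega))
        idx = np.argmax(np.abs(residue) / np.sqrt(diag))
        T = np.vstack([T, candidates[idx]])
        # (7): beta = (K + delta I)^{-1} p  on the current basis vectors
        K = gaussian_kernel(T, T, omega)
        beta = np.linalg.solve(K + delta * np.eye(len(T)),
                               np.array([p(t) for t in T]))
        # (4): residue of the current approximation on the grid
        residue = p_cand - gaussian_kernel(candidates, T, omega) @ beta
    return T, beta
```

In practice, the candidate set could simply be the training inputs together with a coarse grid over Ω.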

3  Prior knowledge based kernel regression (PKBKR)

In this section, the proposed prior knowledge based kernel regression (PKBKR) method is established using the results of the last section. Suppose the representation of the prior function has been obtained as in (3). Then, the prior knowledge can be incorporated into the kernel regression scheme by introducing another penalty term into the target function

$$J(\alpha) = c_0\, \|K_{ld}\,\alpha - y\|_{R^l}^{(2)} + c_1\, \|(\alpha - \beta)^T k(x)\|_{F}^{(2)} \tag{8}$$

where K_ld = [K(x_i, t_j)], i = 1, ..., l; j = 1, ..., d is the generalized Gram matrix between the sample set {x_i}_{i=1}^{l} and the basis vector set {t_j}_{j=1}^{d}, k(x) = [K(t_1, x), ..., K(t_d, x)]^T is the vectorized kernel function, and y = [y_1, ..., y_l]^T is the output vector of the training samples. The following norms and parameters should be determined beforehand: ||·||_{R^l} is a (semi)norm in the l-dimensional real space R^l, ||·||_F is a (semi)norm in the function space F, the superscript (2) indicates that sometimes the square of the norm is used instead of the norm itself, and the weights c_0 and c_1, with c_0 + c_1 = 1, c_0 ≥ 0, c_1 ≥ 0, determine the tradeoff between the training samples and the prior knowledge. The regression function is

$$\hat{f}(x) = \sum_{i=1}^{d} \alpha_i K(t_i, x) = \alpha^T k(x) \tag{9}$$


Obviously, K_ld α is the vector of values of the regression function f̂ at the training samples x_1, ..., x_l. Various (semi)norms can be applied in (8), and the corresponding solutions possess various geometrical meanings in the function space. Here are some examples.

1) The 2-norm for the regression error and the Mahalanobis norm for the prior knowledge. In this case, the norm in R^l is ||u||_{R^l}^2 = u^T u, and the function norm is ||µ^T k(x)||_F^2 = µ^T K_d^T K_d µ with the Gram matrix K_d. It is easy to prove that this case is equivalent to training a (d + l)-dimensional least square kernel machine with training samples {x_i, y_i}_{i=1}^{l} ∪ {t_k, p(t_k)}_{k=1}^{d}. In other words, the prior knowledge works as a trainer. This case is not of our interest.

2) The 2-norm for the regression error and the RKHS norm for the prior knowledge. In this case, the function norm is ||µ^T k(x)||_F^2 = µ^T K_d µ with the Gram matrix K_d. The geometrical meaning of minimizing the target function in this case is minimizing the weighted distance from the solution to the prior knowledge function and to the solution set determined by the training samples, namely, the weighted distance from the solid square to the hollow circle and to the straight line in Fig. 1. In this case, the optimal solution can be calculated analytically

$$\alpha^* = (c_0 K_{ld}^T K_{ld} + c_1 K_d)^{-1} (c_0 K_{ld}^T y + c_1 K_d \beta) \tag{10}$$
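For case 2), the whole PKBKR pipeline reduces to a few matrix operations. The sketch below is an illustration, not the authors' code: the Gaussian kernel and all names are assumptions, and it simply computes α* from (10) for basis vectors T and prior coefficients beta produced by the greedy step, then predicts with (9).

```python
import numpy as np

def kernel(A, B, omega=1.0):
    """Gaussian kernel matrix between row-wise point sets; the kernel choice is an assumption."""
    d2 = ((np.atleast_2d(A)[:, None, :] - np.atleast_2d(B)[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * omega ** 2))

def pkbkr_fit(X, y, T, beta, c0=0.7, omega=1.0, jitter=1e-10):
    """Case 2) of (8): 2-norm for the regression error, RKHS norm for the prior term.

    X (l, n), y (l,)    : training samples and targets
    T (d, n), beta (d,) : basis vectors and prior coefficients from the greedy step
    c0                  : weight of the sample term; the prior term gets c1 = 1 - c0
    """
    c1 = 1.0 - c0
    K_ld = kernel(X, T, omega)          # generalized Gram matrix, (l, d)
    K_d = kernel(T, T, omega)           # Gram matrix of the basis vectors, (d, d)
    # Small jitter added for numerical stability only; it is not part of (10).
    A = c0 * K_ld.T @ K_ld + c1 * K_d + jitter * np.eye(len(T))
    b = c0 * K_ld.T @ y + c1 * K_d @ beta
    return np.linalg.solve(A, b)        # alpha* of (10)

def pkbkr_predict(X_new, T, alpha, omega=1.0):
    """Regression function (9): f_hat(x) = alpha^T k(x)."""
    return kernel(X_new, T, omega) @ alpha
```

Choosing c0 close to 1 trusts the samples, while c0 close to 0 returns essentially the prior representation β.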

3) The ε-insensitive distance for the regression error and the RKHS norm for the prior knowledge. In this case, similar to the popular SVR, the ε-insensitive distance is used to evaluate the regression error. This distance has the form

$$\|u\| = \sum_{i=1}^{l} (\xi_i + \xi_i^*), \qquad \xi_i = \max(u_i - \varepsilon_R,\, 0), \quad \xi_i^* = \max(-u_i - \varepsilon_R,\, 0) \tag{11}$$

where ε_R is a predefined tolerance of the regression error. In this case, the straight line in Fig. 1 becomes a zonal region. Minimizing the target function implies minimizing the weighted distance from the solid square to this zonal region and to the hollow circle. The optimal solution is obtained from a constrained quadratic programming problem

$$\alpha^* = -\frac{1}{2} K_d^{-1} K_{ld}^T (v - v^*)$$

$$(v, v^*) = \arg\min_{v,\, v^*} \; \frac{1}{4} \begin{bmatrix} v \\ v^* \end{bmatrix}^T \begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix} \begin{bmatrix} v \\ v^* \end{bmatrix} + \begin{bmatrix} \varepsilon_R e + y - K_{ld}\beta \\ \varepsilon_R e - y + K_{ld}\beta \end{bmatrix}^T \begin{bmatrix} v \\ v^* \end{bmatrix}$$

$$\text{s.t.} \quad 0 \le v,\, v^* \le \frac{c_0}{c_1} e, \qquad Q = K_{ld} K_d^{-1} K_{ld}^T, \qquad e = [1, \cdots, 1]^T \tag{12}$$
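As a small illustration (ours, not from the paper; the names are hypothetical), the ε-insensitive distance in (11) can be written directly:

```python
import numpy as np

def eps_insensitive_distance(u, eps_r):
    """Sum of slacks in (11): residuals inside the [-eps_r, eps_r] tube cost nothing."""
    u = np.asarray(u, dtype=float)
    xi = np.maximum(u - eps_r, 0.0)        # upper slack
    xi_star = np.maximum(-u - eps_r, 0.0)  # lower slack
    return float(np.sum(xi + xi_star))

# Example: residuals within the tube contribute zero
print(eps_insensitive_distance([0.05, -0.3, 0.2], eps_r=0.1))  # 0.0 + 0.2 + 0.1 = 0.3
```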

4) The 2-norm for the regression error and the ε-insensitive distance for the prior knowledge. If the prior knowledge function or its representation is not accurate, incorporating it into the learning machine might bring bad effects. In such cases, the ε-insensitive distance can be applied to evaluate the distance between the solution and the prior knowledge. The ε-insensitive distance for functions on the finite dimensional subspace has the same form as for real numbers, with the only difference being the replacement of the tolerance ε_R by a function distance tolerance ε_F. In this case, the hollow circle in Fig. 1 becomes a square region. Once again, minimizing the target function implies minimizing the weighted distance from the solid square to this square region and to the straight line. Similar to the last case, the optimal solution is obtained from a constrained quadratic programming problem

$$\alpha^* = D^{-1} K_{ld}^T y - \frac{1}{2}(v - v^*)$$

$$(v, v^*) = \arg\min_{v,\, v^*} \; \frac{1}{4} \begin{bmatrix} v \\ v^* \end{bmatrix}^T \begin{bmatrix} D & -D \\ -D & D \end{bmatrix} \begin{bmatrix} v \\ v^* \end{bmatrix} + \begin{bmatrix} \varepsilon_F D e + D\beta - K_{ld}^T y \\ \varepsilon_F D e - D\beta + K_{ld}^T y \end{bmatrix}^T \begin{bmatrix} v \\ v^* \end{bmatrix}$$

$$\text{s.t.} \quad 0 \le Dv,\, Dv^* \le \frac{c_0}{c_1} e, \qquad D = K_{ld}^T K_{ld} \tag{13}$$

Note that for the last two cases, the solution regions derived from the training samples and from the prior knowledge might intersect. For instance, in case 3) the zonal region might cover the hollow circle, and in case 4) the straight line might pass through the square region. If such an intersection occurs, there is no unique solution of (8), because all points in the intersection drive the target function J(α) to zero. In these cases, other constraints such as the least norm should be introduced. This paper does not discuss these cases. There are many other variations of the general scheme (8) when other norms are used; this paper concentrates only on the aforementioned cases.

Furthermore, suppose there are h prior knowledge functions {p_k(x)}_{k=1}^{h} and all their domains cover the sample region: ∀k, D_k ⊃ Ω. Then the target function in (8) can be modified to

$$J(\alpha) = c_0 \|K_{ld}\,\alpha - y\|_{R^l}^{(2)} + \sum_{k=1}^{h} c_k \|(\alpha - \beta_k)^T k(x)\|_{F}^{(2)} \tag{14}$$

where β_k is the representation coefficient vector of the k-th prior knowledge function, and Σ_{k=0}^{h} c_k = 1, c_k ≥ 0. However, if the function norm ||·||_F obeys the homogeneity property ||a u||_F = |a| ||u||_F, the target function in (14) is equivalent to (8) with β = (1/(1 − c_0)) Σ_{k=1}^{h} c_k β_k, neglecting the terms independent of α. This implies that the result will not change if the prior knowledge functions are replaced with their combination p = (1/(1 − c_0)) Σ_{k=1}^{h} c_k p_k.

This idea can be applied to a more complex case: if the domains of some prior knowledge functions do not cover the sample region, confidence functions are necessary to combine the prior functions. The confidence functions v_k(x), k = 1, ..., h form a partition of unity on Ω, namely, they must satisfy the following properties: 1) 0 ≤ v_k(x) ≤ 1, k = 1, ..., h; 2) Σ_{k=1}^{h} v_k(x) = 1; 3) supp{v_k(x)} = D_k, k = 1, ..., h. Using these confidence functions, the prior knowledge functions can be combined into a single function

$$p(x) = \sum_{k=1}^{h} v_k(x)\, p_k(x), \qquad x \in \Omega \tag{15}$$
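A combination such as (15) is straightforward to code. The sketch below is ours, with illustrative prior and confidence functions (not taken from the paper); it blends two locally valid priors with a logistic partition of unity, in the same spirit as the confidence functions used later for the blackbody radiation problem.

```python
import numpy as np

def combine_priors(priors, confidences, x):
    """Eq. (15): p(x) = sum_k v_k(x) * p_k(x) for scalar or vectorized x."""
    x = np.asarray(x, dtype=float)
    return sum(v(x) * p(x) for p, v in zip(priors, confidences))

# Two illustrative prior functions with a smooth logistic hand-over around x = 1.0
p1 = lambda x: 2.0 * x                                    # assumed valid for small x
p2 = lambda x: x ** 2 + 1.0                               # assumed valid for large x
v1 = lambda x: 1.0 / (1.0 + np.exp(10.0 * (x - 1.0)))     # close to 1 for x << 1
v2 = lambda x: 1.0 - v1(x)                                # partition of unity: v1 + v2 = 1

xs = np.linspace(0.0, 2.0, 5)
print(combine_priors([p1, p2], [v1, v2], xs))
```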


Then, the method proposed in this section can be applied directly. With the above discussions, the proposed PKBKR method is completely established.

4  Experiments

4.1  The blackbody radiation problem

In this section, two regression problems are introduced to validate the proposed PKBKR method.

The blackbody radiation (BR) problem[16−17] is a classical problem of theoretical physics and an important original motivation of quantum theory. The key point of the BR problem is to regress the functional relationship between the radiation power E_ν and the frequency ν. Based on different theoretical frameworks, W. Wien and J. W. Rayleigh respectively proposed two formulae

$$E_\nu = \kappa_1 \nu^3 \exp\left(-\frac{\kappa_2 \nu}{T}\right) \tag{16}$$

$$E_\nu = \frac{8\pi}{c^3} \kappa T \nu^2 \tag{17}$$

where T is the temperature of the blackbody in the stationary state and κ_1, κ_2, κ are coefficients. Wien's formula fits the experimental data very well at high frequencies, whereas Rayleigh's formula works well at low frequencies, but both are valid only in specific frequency ranges. Finally, M. Planck "guessed" another formula

$$E_\nu = \frac{\kappa_1 \nu^3}{\exp\left(\dfrac{\kappa_2 \nu}{T}\right) - 1} \tag{18}$$

This formula fits the experimental data very well at all frequencies. Planck and other physicists revealed the physical essence behind Planck's formula in the following decades.

In this paper, the BR problem is regarded as a regression problem with prior knowledge functions. Wien's and Rayleigh's formulae are chosen as the prior functions, and the training samples are generated by Planck's formula. The BR problem can then be stated as follows: fit the training samples {ν_i, E_{ν,i}}_{i=1}^{l} with the prior knowledge functions

$$p_1(\nu) = \kappa_1 \nu^3 \exp\left(-\frac{\kappa_2 \nu}{T}\right), \quad D_1 \supset \Omega; \qquad p_2(\nu) = \frac{8\pi}{c^3} \kappa T \nu^2, \quad D_2 \supset \Omega; \qquad \Omega \subset (D_1 \cup D_2) \tag{19}$$

The confidence functions are chosen as

$$v_1(\nu) = \frac{1}{\exp(a\nu + b) + 1}, \qquad v_2(\nu) = 1 - v_1(\nu) \tag{20}$$

where a and b are two predefined constants.
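To illustrate, a minimal sketch of the prior functions and confidence functions (16)-(20) follows. The coefficient values κ_1, κ_2, κ, T, a, b and the pairing of v_1 with Wien's formula are placeholders we chose for illustration; the paper does not report the values it actually used.

```python
import numpy as np

# Illustrative constants; the paper does not report the values it actually used.
KAPPA1, KAPPA2, KAPPA, T, C = 1.0, 1.0, 1.0, 1.0, 1.0
A, B = -5.0, 5.0          # chosen so that v1 -> 1 where Wien's formula is accurate

def p1(nu):               # Wien's formula (16), accurate at high frequencies
    return KAPPA1 * nu**3 * np.exp(-KAPPA2 * nu / T)

def p2(nu):               # Rayleigh's formula (17), accurate at low frequencies
    return 8.0 * np.pi / C**3 * KAPPA * T * nu**2

def v1(nu):               # logistic confidence function (20)
    return 1.0 / (np.exp(A * nu + B) + 1.0)

def v2(nu):               # partition of unity on Omega
    return 1.0 - v1(nu)

def planck(nu):           # Planck's formula (18); generates the training samples
    return KAPPA1 * nu**3 / (np.exp(KAPPA2 * nu / T) - 1.0)

def combined_prior(nu):   # eq. (15): p = v1 * p1 + v2 * p2
    return v1(nu) * p1(nu) + v2(nu) * p2(nu)

nu = np.linspace(0.1, 10.0, 50)                     # frequencies strictly above zero
samples = planck(nu) + 0.05 * np.random.randn(50)   # noisy training targets
print(np.max(np.abs(combined_prior(nu) - planck(nu))))
```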

4.2  The radio frequency impedance sensor problem

The radio frequency impedance sensor (RFIS) problem is a practical industrial problem. In practice, online measurement of the load impedance of a radio frequency source is a difficult task. The RFIS scheme proposed in [18] is discussed in this section. According to the principle analysis, the magnitude of the impedance should obey the following function

$$|Z| = p(x) = K_{|Z|} \frac{x_1}{x_2} \tag{21}$$

where x is the output of the impedance sensor and K_{|Z|} is a parameter that can be calculated theoretically. This relationship is relatively accurate when the load impedance is close to the matching point, but becomes poor when the load impedance is far from the matching point. The main task of this problem is to estimate the load impedance from the output of the sensor.

The samples were obtained from a real RFIS, and the corresponding impedances were measured by a high-precision sensor. The sample set includes 977 samples. The RMS regression error of the prior knowledge function (21) over these 977 samples was 5.8299.
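The quality check quoted above (the RMS error of the prior over the 977 samples) is easy to reproduce in outline. The sketch below is ours; K_|Z| and the data arrays are placeholders, not the actual values or data of the paper.

```python
import numpy as np

K_ABS_Z = 1.0   # placeholder for the theoretically calculated parameter K_|Z|

def rfis_prior(x):
    """Eq. (21): |Z| = K_|Z| * x1 / x2 for sensor outputs x = (x1, x2)."""
    x = np.atleast_2d(x)
    return K_ABS_Z * x[:, 0] / x[:, 1]

def rms_error(prior, X, z_measured):
    """Root-mean-square error of a prior function over a sample set."""
    return float(np.sqrt(np.mean((prior(X) - np.asarray(z_measured)) ** 2)))

# With X of shape (977, 2) holding the sensor outputs and z the measured impedances,
# rms_error(rfis_prior, X, z) would reproduce a figure like the 5.8299 quoted above.
```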
4.3  Experiment results

Several experiments were performed to demonstrate the regression performance of the proposed PKBKR. All experiments were repeated 100 times (BR problem) or 50 times (RFIS problem) and the results were averaged. In every run, the training samples were randomly generated (BR problem) or randomly selected from the data set (RFIS problem). For comparison, the SVM method was adopted to regress the same data set for both problems. The kernel parameter and the other free parameters of the SVM method were set equal to those in the proposed PKBKR method to make the results comparable; however, this does not imply that they are optimal for the SVM method. Therefore, parameter choices following the criterion proposed by Cherkassky and Ma[19] were also investigated. The SVM methods with these two parameter settings are denoted as SVM-PK and SVM-CM, respectively.

For both problems, the regression performance was evaluated by the regression error and the number of support vectors (SVs) or basis vectors. Since these two notions have the same meaning in the context of this paper, the term SV is used in the experiment results. Furthermore, because the parameter C in the SVM scheme and the ratio c_0/c_1 in the PKBKR scheme have similar meanings, the notation C is used in the following tables. The experiment results are illustrated in the following figures.

4.3.1  Results for the BR problem

Based on the BR problem, the regression error and the number of SVs of the various methods vs. the number of training samples under different noise levels were investigated, and the results are illustrated in Figs. 2 and 3. The parameters are shown in Table 1, where "C-M" indicates that the parameters were determined online with [19].

Fig. 2  Regression error vs. number of training samples under various noise levels, BR problem

Fig. 3  Number of SVs vs. number of training samples under various noise levels, BR problem

Table 1  Parameters for BR problem
Method    C      ε
SVM-PK    1      0.01
SVM-CM    C-M    C-M
PKBKR     1      −

4.3.2  Results for the RFIS problem

The results based on the RFIS problem are illustrated in Figs. 4 and 5. The parameters are shown in Table 2.

Fig. 4  Regression error vs. number of training samples, RFIS problem

Fig. 5  Number of SVs vs. number of training samples, RFIS problem

Table 2  Parameters for RFIS problem
Method    C      ε
SVM-PK    1000   0.01
SVM-CM    C-M    C-M
PKBKR     1000   −

5  Discussions

Based on the results illustrated in Figs. 2-5, the following conclusions can be drawn.

1) For both problems, the PKBKR method achieves higher generalization performance than the SVM regression, especially when the training samples are few or the observation noise is large. As the sample number increases or the noise level decreases, the regression precision of the SVM regression becomes comparable to that of PKBKR.

2) The SVs (basis vectors) in the PKBKR method are determined by the greedy algorithm and are almost independent of the training samples and the noise level; therefore, the number of SVs in the PKBKR method hardly changes when the number of training samples or the noise level varies.

3) The BR problem is a simple problem, namely, it can be regressed by a function with a simple structure; the corresponding curve is quite smooth, which implies that it can be easily regressed by Gaussian kernels. On the contrary, the RFIS problem is essentially complex, so that the relationship between the input x and the target |Z| can hardly be represented by a linear combination of Gaussian kernels: the number of SVs is always about 500 in the PKBKR method and almost equals the sample number in the SVM method. The reason is partly revealed by the prior knowledge: according to (21), the relationship to be regressed must be quite steep in the second component x_2 of the input, whereas applying the Gaussian kernel already implies the assumption that the target function is smooth; therefore, a complex structure is necessary to represent this steep relationship. The PKBKR method regresses the BR problem with a small number of SVs (about 30) and the RFIS problem with many SVs (about 500), see Figs. 3 and 5. In this sense, the PKBKR method regresses the training samples with a more appropriate complexity, and this complexity is hardly affected by the training samples.

As shown by the two practical problems, the proposed PKBKR method is especially valuable for a class of problems in which some theoretical prior knowledge is available but not precise enough, and there are only a few informative samples. For these problems, neither prior knowledge based nor sample based methods can make reliable estimations, so the PKBKR method is a reasonable solution. In industrial applications, it is not difficult to find problems of this kind, because of the necessary simplifications in modeling and the difficulty of obtaining samples of abnormal states of the production line.

The proposed PKBKR method is applicable only if the applicable domain of the prior knowledge covers the input space, or at least there are some prior functions that can be combined into a "total" prior function by (15). If the prior functions are only "locally" applicable, PKBKR cannot be used directly. Thus, developing a further method to incorporate "local" prior knowledge is an important subsequent work of this paper. Furthermore, unknown parameters in the prior functions also hinder the application of PKBKR. Though this limitation can be overcome by estimating the unknown parameters with the least square method, it seems more reasonable to develop a new method that performs the parameter estimation and incorporates the prior knowledge in one step in the future.

6  Conclusions

In this paper, a dilemma in regression problems is investigated and the role of prior knowledge in solving this dilemma is discussed. A novel method to incorporate prior knowledge into kernel based regression is proposed. The proposed PKBKR method involves two subproblems, representing the prior knowledge and incorporating it into the regression, both of which are discussed thoroughly: an iterative greedy method is proposed to represent the prior knowledge, and the prior knowledge is incorporated into the regression with a weighted loss function. Distances of various kinds in the loss function are discussed and solutions are derived for several cases. Experiments based on artificial and real data sets are performed to validate the proposed method. The experimental results show that the proposed method exceeds both the prior knowledge itself and the SVM method in regression precision, even when the number of samples is small and the noise level is high. Furthermore, the regression made by PKBKR possesses a more appropriate, sample-independent model complexity. Based on the theoretical discussions and experiments, the proposed PKBKR method is a potential solution for many practical problems.

References
1 Smola A J, Schölkopf B. A tutorial on support vector regression. Statistics and Computing, 2004, 14(3): 199−222
2 Williams C K I, Rasmussen C E. Gaussian Processes for Regression. Cambridge, USA: MIT Press, 1996. 514−520
3 Joerding W H, Meador J L. Encoding a priori information in feedforward networks. Neural Networks, 1991, 4(6): 847−856
4 Chen C W, Chen D Z. Prior-knowledge-based feedforward network simulation of true boiling point curve of crude oil. Computers and Chemistry, 2001, 25(6): 541−550
5 Milanič S, Strmčnik S, Šel D, Hvala N, Karba R. Incorporating prior knowledge into artificial neural networks — an industrial case study. Neurocomputing, 2004, 62: 131−151
6 Schölkopf B, Simard P, Smola A J, Vapnik V. Prior knowledge in support vector kernels. In: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems. Denver, USA: MIT Press, 1997. 640−646
7 Wang L, Xue P, Chan K L. Incorporating prior knowledge into SVM for image retrieval. In: Proceedings of the 17th International Conference on Pattern Recognition. Washington D. C., USA: IEEE, 2004. 981−984
8 Wu X Y, Srihari R. Incorporating prior knowledge with weighted margin support vector machines. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, USA: ACM, 2004. 326−333
9 Fung G M, Mangasarian O L, Shavlik J W. Knowledge-based support vector machine classifiers. In: Proceedings of the 2002 Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2002. 537−544
10 Le Q V, Smola A J, Gärtner T. Simpler knowledge-based support vector machines. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, USA: ACM, 2006. 521−528
11 Smola A J, Frieß T, Schölkopf B. Semiparametric support vector and linear programming machines. In: Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 1998. 585−591
12 Li W Y, Lee K H, Leung K S. Generalized regularized least-squares learning with predefined features in a Hilbert space. In: Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2006. 881−888
13 Li W Y, Leung K S, Lee K H. Generalizing the bias term of support vector machines. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India: University Trier, 2007. 919−924
14 Schölkopf B, Smola A J. Learning with Kernels. Cambridge, USA: The MIT Press, 2001
15 Smola A J, Schölkopf B. Sparse greedy matrix approximation for machine learning. In: Proceedings of the 17th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann Publishers, 2000. 911−918
16 McMahon D. Quantum Mechanics Demystified. New York, USA: McGraw-Hill Professional Publishing, 2005
17 Zeng Jin-Yan. Quantum Mechanics. Beijing: Science Press, 1984 (in Chinese)
18 Keane A R A, Hauer S E. Automatic Impedance Matching Apparatus and Method, U. S. Patent 5195045, March 1993
19 Cherkassky V, Ma Y Q. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 2004, 17(1): 113−126

SUN Zhe  Ph. D. candidate in the Department of Automation, Tsinghua University. He received his bachelor degree from Harbin Institute of Technology, P. R. China in 2004. He is currently studying control science and engineering at Tsinghua University. His main research interest is machine learning theory. E-mail: sun04@mails.thu.edu.cn

ZHANG Zeng-Ke Professor in the Department of Automation, Tsinghua University. His research interest covers intelligent control (fuzzy and neural network), soft sensing, motion control and system integration, and machine learning. E-mail: zzk@tsinghua.edu.cn

WANG Huan-Gang Associate professor in the Department of Automation, Tsinghua University. His research interest covers nonlinear control, optimization of manufacturing systems, machine learning and intelligent control systems. Corresponding author of this paper. E-mail: hgwang@tsinghua.edu.cn


				