Neurofuzzy Modelling
Jan Jantzen jj@iau.dtu.dk:1
$EVWUDFW
A neural network can approximate a function, but it is impossible to interpret the result in terms of natural language. The fusion of neural networks and fuzzy logic in neurofuzzy models provide learning as well as readability. Control engineers find this useful, because the models can be interpreted and supplemented by process operators.
&RQWHQWV
,QWURGXFWLRQ 7ULDO DQG HUURU &OXVWHULQJ 3.1 Feature determination 3.2 Hard clusters (HCM algorithm). 3.3 Fuzzy clusters (FCM algorithm) 3.4 Subtractive clustering 1HXURIX]]\ IXQFWLRQ DSSUR[LPDWLRQ 4.1 Adaptive Neurofuzzy Inference System (ANFIS) 4.2 ANFIS architecture 4.3 The ANFIS learning algorithm 4.4 Genetic algorithms 4.5 Computing membership functions 7HVW UHVXOWV DQG GLVFXVVLRQ &RQFOXVLRQV 5 8 10 12 16 17 19 20 24
Technical University of Denmark, Department of Automation, Bldg 326, DK-2800 Lyngby, DENMARK. Tech. report no 98-H-874 (nfmod), 30 Oct 1998.
1
Controller design
Estimate model
Reference Controller Process
Figure 1: Indirect adaptive control: The controller parameters are updated indirectly via a process model.
,QWURGXFWLRQ
A neural network can model a dynamic plant by means of a nonlinear regression in the discrete time domain. The result is a network, with adjusted weights, which approximates the plant. It is a problem, though, that the knowledge is stored in an RSDTXH fashion; the learning results in a (large) set of parameter values, almost impossible to interpret in words. Conversely, a fuzzy rule base consists of readable if-then statements that are almost natural language, but it cannot learn the rules itself. The two are combined in QHXURIX]]\ V\VWHPV in order to achieve readability and learning ability at the same time. The obtained rules may reveal insight into the data that generated the model, and for control purposes, they can be integrated with rules formulated by control experts (operators). Assume the problem is to model a process such as in the LQGLUHFW DGDSWLYH FRQWUROOHU in Fig. 1. A mechanism is supposed to extract a model of the nonlinear process, depending on the current operating region. Given a model, a controller for that operating region is to be designed using, say, a pole placement design method. One approach is to build a twolayer perceptron network that models the plant, linearise it around the operating points, and adjust the model depending on the current state (Nørgaard, 1996). The problem seems well suited for the so-called 7 DNDJL6XJHQR type of neurofuzzy model, because it is based on piecewise linearisation. Extracting rules from data is a form of modelling activity within SDWWHUQ UHFRJQLWLRQ, GDWD DQDO\VLV or GDWD PLQLQJ also referred to as WKH VHDUFK IRU VWUXFWXUH LQ GDWD (Bezdek and Pal, 1992). The goal is to reduce the complexity in a problem, or to reduce the amount of data associated with a problem. The field of data analysis comprises a great variety of methods; the objective of this note is to present a feasible way of combining fuzzy and neural networks. The neural network research started in the 1940s, and the fuzzy logic research in the 1960s, but the neurofuzzy research area is relatively new. The first book was probably by Kosko (1992). His ideas were implemented slightly earlier in the commercial tool TILGen (Hill, Horstkotte and Teichrow, 1990), and in 1995 came the Fuzzy Logic
2
Toolbox for Matlab (Jang and Gulley, 1995), which includes a neurofuzzy method. Many other commercial neurofuzzy tools are now available (see MIT, 1995). Basically, we intend to try and describe a set of data collected from a process by means of rules. The description would be able to reproduce the training data, not necessarily with a zero error, but in an interpolative way. The approach here is to investigate a plain trial and error method, refine it with clustering methods, and refine it further with learning methods.
7ULDO DQG HUURU
The classical way of modelling a plant, or a controller, is to acquire rules from experts based on their experience. The most common approach is to question experts or operators using a carefully organised questionnaire. The LQSXW VSDFH, that is, the coordinate system formed by the input variables (position, velocity, error, change in error) is partitioned into a number of regions. Each input variable is associated with a family of fuzzy term sets, say, ’negative’, ’zero’, and ’positive’. The membership functions must then be defined by the expert. For each valid combination of inputs, the expert is supposed to give typical values for the outputs. For many systems, however, knowledge about the partitioning of the input space is unavailable. In that case, the input space can be partitioned by a number of equally spaced and shaped membership functions. The rule base can be established as all the possible combinations of the input terms. The task for the expert is then to estimate the outputs. The design procedure would be 1. Select relevant input and output variables, 2. determine the number of membership functions associated with each input and output, and 3. design a collection of fuzzy rules. To accomplish this the expert relies on common sense, physical laws, or just trial and error. The result is a rule base which more or less describes the behaviour of the plant. ([DPSOH H[SHUW PRGHOOLQJ *LYHQ D GDWD VHW SORWWHG ZLWK D GDVKHG OLQH LQ )LJ 2 WRS ZH VKDOO DVVXPH WKH H[SHUW¶ UROH DQG WU\ WR GHVFULEH LW ZLWK UXOHV /HW XV GLYLGH WKH V LQSXW VSDFH LQWR WKUHH SDUWLWLRQV ¶ PDOO¶ ¶PHGLXP¶ DQG ¶ODUJH¶ $ VHW RI UXOHV FRXOG EH V ,I LQSXW LV VPDOO WKHQ RXWSXW LV ,I LQSXW LV PHGLXP WKHQ RXWSXW LV ,I LQSXW LV ODUJH WKHQ RXWSXW LV &OHDUO\ ZLWK WULDQJXODU PHPEHUVKLS IXQFWLRQV HTXDOO\ VSDFHG WKH UHVXOW ZRXOG EH D VWUDLJKW OLQH 7KHUHIRUH ZH XVH WKH VKDSHV LQ )LJ 2 ERWWRP DIWHU VRPH WULDO DQG HUURU 7KH LQSXW RXWSXW UHODWLRQVKLS RI WKH PRGHO LV VKRZQ LQ WKH XSSHU SORW RI WKH ILJXUH 2EYLRXVO\ WKH ILW LV VRPHZKDW SRRU EXW IRU D ILUVW DWWHPSW LW PLJKW EH DFFHSWDEOH 7KH PRGHO XVHV WKH ZHLJKWHG DYHUDJH RI WKH RXWSXW VLQJOHWRQV IRU GHIX]]LILFDWLRQ 3
1
Output
0 Small 1 0
20 Medium
40
60 Large
80
100
Membership
0 0 20 40 Input 60 80 100
Figure 2: A fuzzy model approximation (solid line, top) of a data set (dashed line, top). The input space is divided into three fuzzy regions (bottom).
&OXVWHULQJ
The example above shows that it can be rather difficult to fit the fuzzy model to the target data using trial and error, although it is quite easy to express linguistically the relation between input and output. A better approach is to approximate the target function with a piece-wise linear function and interpolate, in some way, between the linear regions. In the 7 DNDJL6XJHQR model (Takagi & Sugeno, 1985) the idea is that each rule in a rule base defines a region for a model, which can be linear. The left hand side of each rule defines a fuzzy validity region for the linear model on the right hand side. The inference mechanism interpolates smoothly between each local model to provide a global model. The general Takagi-Sugeno rule structure is If i +h4 is $4 > h5 is $5 > = = = > hn is $n , then | @ j+h4 > h4 = = =, Here i is a logical function that connects the sentences forming the condition, | is the output, and j is a function of the inputs hl . An example is If error is positive and change in error is positive then x @ Ns +huuru . Wg fkdqjh lq huuru, where x is a controller’s output, and the constants Ns and Wg are the familiar tuning constants for a proportional-derivative (PD) controller. Another rule could specify a PD controller with different tuning settings, for another operating region. The inference mechanism is then able to interpolate between the two controllers in regions of overlap. 4
150 100 1 2
output
50 0 0 1 50 (a)
100
membership
0.5
0 0 50 (b) 100
Figure 3: Interpolation between two lines (top) in the overlap of input sets (bottom). ([DPSOH 6XJHQR LQWHUSRODWLRQ 6XSSRVH ZH KDYH WZR UXOHV ,I HUURU LV /DUJH WKHQ RXWSXW LV /LQH ,I HUURU LV 6PDOO WKHQ RXWSXW LV /LQH /LQH LV GHILQHG DV 3=5 huuru . <3 DQG OLQH LV GHILQHG DV 3=9 huuru . 53 7KH UXOHV LQWHUSRODWH EHWZHHQ WKH WZR OLQHV LQ WKH UHJLRQ ZKHUH WKH PHPEHUVKLS IXQFWLRQV RYHUODS )LJ 3 2XWVLGH RI WKDW UHJLRQ WKH RXWSXW LV D OLQHDU IXQFWLRQ RI WKH huuru In order set up a Takagi-Sugeno model of a process, it would be helpful to have an automatic method that finds FOXVWHUV in the data that could be candidates for the linear regions. The objective of such a cluster analysis is to partition the data set into a number of natural and homogeneous subsets, where the elements of each subset are as similar to each other as possible, and at the same time as different from those of the other sets as possible. A FOXVWHU is such a grouping of objects, that is, it consists of all points close, in some sense, to the cluster centre.
)HDWXUH GHWHUPLQDWLRQ
In general, data analysis (Zimmermann, 1993) concerns REMHFWV which are described by IHDWXUHV. Consider the example in Table 1 which contains data for some vehicles. Each row of such a table describes an object, and each column describes a feature. A feature can be regarded as a pool of values from which the actual values appearing in a given column are drawn. The features form axes of an abstract IHDWXUH VSDFH in which each object is represented by a point. For instance, in the four-dimensional coordinate system spanned 5
3500
3000
Lorries
2500
Weight [kg]
2000
Sports cars
1500
Medium market cars
1000
500 100
150
200 Top speed [km/h]
250
300
Figure 4: Clusters of vehicles in a feature space. by the axes WRS VSHHG, DLU UHVLVWDQFH, FRORXU, and ZHLJKW, vehicle Y 4 is represented by the point +{4 > {5 > {6 > {7 , @ +553> uhg> 3=63> 4633,. The colour axis is different from the other axes because its values are drawn from a discrete domain rather than the domain of real numbers. In order to maintain the idea of a feature space, the discrete domain must therefore be ordered. Features can be selected directly, or they can be generated by a suitable combination of features. The latter option is necessary in case a large number of features needs to be reduced to a smaller number of features. Objects are fuzzy when one or more features are described in fuzzy terms. An example is a vehicle with a ’very fast’ car engine, rather than top speed equal to some crisp number. Clusters are fuzzy when each object is associated with a degree of membership rather than a crisp membership. An example is the cluster of ’sports cars’; a particular car would be a member to a degree depending on its top speed, air resistance, and weight. ([DPSOH FDUV )LJXUH 4 LV D SORW RI WRS VSHHG DJDLQVW DLU UHVLVWDQFH RI WKH GDWD IURP 7 DEOH 1 7KH SORW FDQ EH UHJDUGHG DV D IHDWXUH VSDFH EXW DV ZH DUH RQO\ SORWWLQJ WZR RI WKH IRXU IHDWXUHV LW LV PRUH SUHFLVH WR UHJDUG LW DV D SURMHFWLRQ RI WKH WRWDO IHDWXUH VSDFH RQWR WKH SODQH VSDQQHG E\ WRS VSHHG DQG ZHLJKW ,W ORRNV DV LI WKHUH DUH WKUHH FOXVWHUV SUHVHQW LQ WKH GDWD VD\ vsruwv fduv 9 9 9 phglxp pdunhw fduv 9 9 9 DQG oruulhv 9 9 9 ,Q WKLV FDVH ZH KDYH VHOHFWHG WZR IHDWXUHV RXW RI IRXU WR PDNH DQ DVVLJQPHQW RI YHKLFOHV WR WKH WKUHH FOXVWHUV In practice data conditioning (preprocessing) is necessary. It can be an advantage if each 6
V ehicle V1 V2 V3 V4 V5 V6 V7 V8 V9
Top speed [km/h] 553 563 593 473 488 463 433 438 443
Colour red black red grey blue white black red grey
Air resistance 3=63 3=65 3=5< 3=68 3=66 3=73 3=83 3=93 3=88
Weight [kg] 4633 4733 4833 ;33 <83 933 6333 5833 6833
Table 1: Sample data (from MIT, 1997) feature value x is QRUPDOLVHG to its normalised value x on a standard universe, say ^3> 4` > according to x xplq x@ (1) xpd{ xplq Here xplq is the smallest value in a series of measurements and xpd{ is the largest. 6WDQ GDUGLVDWLRQ transforms the mean of the set of feature values to zero, and the standard deviation to one. If the data are distributed according to the normal distribution with a mean value of p and standard deviation , the standardised values are found as follows: x p (2) xÆ @ It may also be necessary to VFDOH the values onto a particular range using an affine mapping. Scaling to a range ^x4 > x5 ` is a linear transformation according to x x4 +x5 x4 , . x4 (3) xQ @ x5 x4 After preprocessing, relevant features can be selected by an expert. For a series of data X @ +x4 > x5 > = = = > xN , some characteristic quantities, which can be used in forming the features, are SN 4 mean value p @ N l@4 xl S variance y @ N4 4 N +xl p,5 l@4 (4) sà standard deviation @ y range v @ xpd{ xplq Graphical representations such as plots of frequency spectra and histograms can support the selection. Features may be correlated, that is, a change in one feature [ is associated with a similar change in another feature \ . A FRUUHODWLRQ FRHIILFLHQW can be calculated as SN +{l p4 ,+|l p5 , (5) u @ tS l@4 S N +{l p4 ,5 N +|l p5 ,5 l@4 l@4 Here p4 is the mean value of all the values {l of feature [, and p5 is the mean of all |l values of the \ feature. The correlation coefficient assumes values in the interval ^ 4> 4`. 7
1
x2
0 0 x1 1
Figure 5: Example with two clusters (Jang & Gulley, 1995). Cluster centres are marked with solid circles. When u @ 4 there is a strong negative correlation between [ and \ , when u @ 4 there is a strong positive correlation, and when u @ 3 there is no correlation at all. Strongly correlated features may indicate a linear dependency. If two features are linearly dependent one of them is UHGXQGDQW (unnecessary); it is sufficient to select just one of them as a feature.
+DUG FOXVWHUV +&0 DOJRULWKP
The hard c-means (HCM) algorithm tries to locate clusters in the multi-dimensional feature space. The goal is to assign each point in the feature space to a particular cluster. The basic approach is as follows (see for example Lewis, 1990). 1. Manually seed the algorithm with f cluster centres, one for each cluster we are seeking. This requires prior information from the outside world of the number of different clusters into which the points are to be divided; thus the algorithm belongs to the class of VXSHUYLVHG algorithms. 2. Each point is assigned to the cluster centre nearest to it. 3. A new cluster centre is computed for each class by taking the mean values of the coordinates of the points assigned to it. 4. If not finished according to some stopping criterion (see later), go to step 2. Some additional rules can be added to remove the necessity of knowing precisely how many clusters there are. The rules allow nearby clusters to merge and clusters which have large standard deviations in coordinate to split. ([DPSOH WZR FOXVWHUV &RQVLGHU WKH GDWD SRLQWV SORWWHG LQ )LJ 5 7KH FOXVWHU DOJR 8
ULWKP ILQGV RQH FHQWUH LQ WKH ORZHU OHIW KDQG FRUQHU DQG DQRWKHU LQ WKH XSSHU ULJKW KDQG FRUQHU 7KH LQLWLDO FHQWUHV DUH PRUH RU OHVV LQ WKH PLGGOH RI WKH SORW DQG GXULQJ WKH LWHUD WLRQV WKH\ PRYH WRZDUGV WKHLU ILQDO SRVLWLRQV (DFK SRLQW EHORQJV WR RQH RU WKH RWKHU FODVV VR WKH FOXVWHUV DUH FULVS ,Q JHQHUDO WKH IHDWXUH GDWD KDYH WR EH QRUPDOLVHG LQ RUGHU IRU WKH GLVWDQFH PHDVXUH WR ZRUN SURSHUO\ Generally speaking the hard c-means algorithm is based on a FSDUWLWLRQ of the data space X into a family of clusters iFl j > l @ 4> 5> = = = > f, where the following set-theoretic equations apply, f ^ Fl @ X (6) Fl _ Fm @ >> all l 9@ m (7) (8) > Fl X> all l The set X @ iX4 > X5 > = = = > XN j is a finite set of points in a space spanned by the feature axes, and f is the number of clusters. We note, 5fN (9) because f @ N clusters just places each data sample into its own cluster, and f @ 4 places all data samples into the same cluster. Equations (6)-(8) express, respectively, that the set of clusters exhausts the whole universe, that none of the clusters overlap, and that a cluster can neither be empty nor contain all data samples. Formally, the c-means algorithm finds a centre in each cluster, minimising an objective function of a distance measure. The objective function depends on the distances between vectors Xn and cluster centres Fl > and when the Euclidean distance is chosen as a distance function, the expression for the objective function is, 3 4 f f [ [ [ C Ml @ nXn Fl n5 D (10) M@ l@4 l@4 n>Xn SFl where Ml is the objective function within cluster l. The partitioned clusters are typically defined by a f N binary characteristic matrix 0, called the PHPEHUVKLS PDWUL[, where each element pln is 1 if the nth data point Xn belongs to cluster l, and 0 otherwise. Since a data point can only belong to one cluster, the membership matrix 0 has the properties: the sum of each column is one, and the sum of all elements is N. If the cluster centres Fl are fixed, the pln that minimise Ml can be derived as pln @ i 1 if nXn Fl n nXn 0 otherwise
5
l@4
(11)
Fm n > for each m 9@ l
5
(12)
That is, Xn belongs to cluster l if Fl is the closest centre among all centres. If, on the other hand, pln is fixed, then the optimal centre Fl that minimises (10) is the mean of all vectors 9
in cluster l, Fl @
where mFl m is the number of objects in Fl (its FDUGLQDOLW\ ), and the summation is an elementby-element summation of vectors. $OJRULWKP (Jang, Sun and Mizutani, 1997) The hard c-means algorithm has five steps. 1. Initialise the cluster centres Fl +l @ 4> 5> = = = > f,= This is typically achieved by randomly selecting f points from the data points. 2. Determine the membership matrix 0 by (12). 3. Compute the objective function (10). Stop if either it is below a certain threshold value, or its improvement over the previous iteration is below a certain tolerance. 4. Update the cluster centres according to (13). 5. Go to step 2. The algorithm is iterative, and there is no guarantee that it will converge to an optimum solution. The performance depends on the initial positions of the cluster centres, and it is advisable to employ some method to find good initial cluster centres. It is also possible to initialise a random membership matrix 0 first and then follow the iterative procedure. ([DPSOH UXOHV 7KH SORW RI WKH FOXVWHUV LQ )LJ 5 VXJJHVWV D UHODWLRQ EHWZHHQ WKH YDUL DEOH {4 RQ WKH KRUL]RQWDO D[LV DQG {5 RQ WKH YHUWLFDO D[LV )RU H[DPSOH WKH FOXVWHU LQ WKH XSSHU ULJKW KDQG FRUQHU RI WKH SORW LQGLFDWHV LQ YHU\ ORRVH WHUPV WKDW ZKHQHYHU {4 LV ¶KLJK¶ GHILQHG DV QHDU WKH ULJKW HQG RI WKH KRUL]RQWDO D[LV WKHQ {5 LV DOVR ¶KLJK¶ GHILQHG DV QHDU WKH WRS HQG RI WKH YHUWLFDO D[LV 7KH UHODWLRQ FDQ EH GHVFULEHG E\ WKH UXOH LI {4 LV KLJK WKHQ {5 LV KLJK (14) ,QWXLWLYHO\ DW OHDVW LW VHHPV SRVVLEOH WR PDNH VRPH VHQVLEOH GHILQLWLRQV RI WKH WZR LQVWDQFHV RI WKH ZRUG ¶KLJK¶ LQ WKH UXOH EDVHG RQ WKH ORFDWLRQ RI WKH FOXVWHU FHQWUH 7KH FOXVWHU LQ WKH ORZHU OHIW SDUW RI WKH ILJXUH FRXOG SHUKDSV EH GHVFULEHG DV LI {4 LV ORZ WKHQ {5 LV ORZ $JDLQ WKH DFFXUDF\ RI WKLV UXOH GHSHQGV RQ WKH WZR GHILQLWLRQV RI ¶ORZ¶ (15)
[ 4 Xn mFl m n>X SF
n l
(13)
)X]]\ FOXVWHUV )&0 DOJRULWKP
It is reasonable to assume that points in the middle region of Fig. 5, between the two cluster centres, have a gradual membership of ERWK clusters. Naturally this is accommodated by fuzzifying the definitions of ’low’ and ’high’ in (14) and (15). The fuzzified c-means algorithm (Bezdek in Jang et al., 1997) allows each data point to belong to a cluster to a degree specified by a membership grade, and thus each point may belong to several clusters. The IX]]\ FPHDQV (FCM) algorithm partitions a collection of N data points specified by p-dimensional vectors Xn +n @ 4> 5> = = = > N, into f fuzzy clusters, and finds a cluster centre 10
in each, minimising an objective function. Fuzzy c-means is different from hard c-means, mainly because it employs IX]]\ SDUWLWLRQLQJ where a point can belong to several clusters with degrees of membership. To accommodate the fuzzy partitioning, the membership matrix 0 is allowed to have elements in the range ^3> 4`. A point’s total membership of all clusters, however, must always be equal to unity to maintain the properties (11) of the 0 matrix. The objective function is a generalisation of (10), M +0> F4 > F5 > = = = > Ff , @
f [ l@4
Ml @
N f [[ l@4 n@4
pt g5 > ln ln
(16)
where pln is a membership between 0 and 1, Fl is the centre of fuzzy cluster l, gln @ nXn Fl n is the Euclidean distance between the lth cluster centre and nth data point, and t 5 ^4> 4, is a weighting exponent. There are two necessary conditions for M to reach a minimum, SN pt Xn n (17) Fl @ S@4 lnt > N pln n@4 and 4 (18) pln @ Sf gln 5@+tÃ4,
m @4 gmn
The algorithm is simply an iteration through the preceding two conditions. $OJRULWKP (Jang et al., 1997) In a batch mode operation, the fuzzy c-means algorithm determines the cluster centres Fl and the membership matrix 0 using the following steps: 1. Initialise the membership matrix 0 with random values between 0 and 1 within the constraints of (11). 2. Calculate f cluster centres Fl +l @ 4> 5> = = = > f, using (17). 3. Compute the objective function according to (16). Stop if either it is below a certain threshold level or its improvement over the previous iteration is below a certain tolerance. 4. Compute a new 0 using (18). 5. Go to step 2. The cluster centres can alternatively be initialised first, before carrying out the iterative procedure. The algorithm may not converge to an optimum solution and the performance depends on the initial cluster centres, just as in the case of the hard c-means algorithm. ([DPSOH )&0 *HWWLQJ EDFN WR WKH SUREOHP RI PRGHOOLQJ WKH WHVW GDWD LQ ([DPSOH WKH GDWD ZHUH VXEPLWWHG WR WKH )&0 IXQFWLRQ LQ WKH 0DWODE )X]]\ /RJLF 7RROER[ $VNLQJ IRU WKUHH FOXVWHUV DOO RWKHU VHWWLQJV GHIDXOW LW ILQGV WKUHH FOXVWHU FHQWUHV )LJ 6 7KHVH FDQ EH XVHG DV LQGLFDWLRQV RI ZKHUH WR SODFH WKH SHDNV RI WKUHH IX]]\ PHPEHUVKLS IXQFWLRQV RQ WKH LQSXW D[LV
11
1
0
0
20
40
60
80
100
Figure 6: The FCM algorithm finds the cluster centres indicated by stars when asked to find three clusters.
6XEWUDFWLYH FOXVWHULQJ
Fuzzy c-means is a supervised algorithm, because it is necessary to tell it how many clusters f to look for. If f is not known beforehand, it is necessary to apply an unsupervised algorithm. 6XEWUDFWLYH FOXVWHULQJ is based on a measure of the density of data points in the feature space (Chiu in Jang et al., 1997). The idea is to find regions in the feature space with high densities of data points. The point with the highest number of neighbours is selected as centre for a cluster. The data points within a prespecified, fuzzy radius are then removed (subtracted), and the algorithm looks for a new point with the highest number of neighbours. This continues until all data points are examined. Consider a collection of N data points specified by p-dimensional vectors Xn > n @ 4> 5> = = = > N. Without loss of generality, the data points are assumed normalised. Since each data point is a candidate for a cluster centre, a GHQVLW\ PHDVXUH at data point Xn is defined as # $ N [ nXn Xm n h{s > (19) Gn @ +ud @5,5 m @4 where ud is a positive constant. Hence, a data point will have a high density value if it has many neighbouring data points. Only the fuzzy neighbourhood within the radius ud contributes to the density measure. After calculating the density measure for each data point, the point with the highest density is selected as the first cluster centre. Let Xf4 be the point selected and Gf4 its density measure. Next, the density measure for each data point Xn is revised by the formula # $ Q @ Gn GF h{s nXn XF4 n > Gn (20) 4 +ue @5,5 where ue is a positive constant. Therefore, the data points near the first cluster centre Xf4 will have significantly reduced density measures, thereby making the points unlikely to be selected as the next cluster centre. The constant ue defines a neighbourhood to be reduced in density measure. It is normally larger than ud to prevent closely spaced cluster centres; typically ue @ 4=8 ud = After the density measure for each point is revised, the next cluster centre Xf5 is selected and all the density measures are revised again. The process is repeated until a sufficient
12
Output
1
0 0
Membership
20
40
60
80
100
1 Low Medium High
0 0 20 40 Input 60 80 100
Figure 7: Result of applying subclustering to test data. The upper plot shows the target function (dashed) and the estimate (solid) as well as the clustere centres (x). The lower plot shows the three membership functions corresponding to the clusters. number of cluster centres are generated. When applying subtractive clustering to a set of input-output data, each of the cluster centres represents a rule. To generate rules, the cluster centres are used as the centres for the premise sets in a singleton type of rule base (or the radial basis functions in a radial basis function neural network, see later). ([DPSOH VXEFOXVW 7KH IXQFWLRQ JHQILV LQ WKH )X]]\ /RJLF 7RROER[ XVHV VXEFOXV WHULQJ WR JHQHUDWH UXOHV WKDW DSSUR[LPDWH D IXQFWLRQ $SSO\LQJ LW WR WKH GDWD IURP WKH SUHYL RXV H[DPSOH SURGXFHV FOXVWHU FHQWUHV WKDW DUH RQ WKH FXUYH DQG VOLJKWO\ VKLIWHG WR WKH ULJKW UHODWLYH WR )&0 FS )LJV 6 DQG 7 7KH UXOHV DUH ,I lqsxw LV /RZ WKHQ rxwsxw @ 3=365; lqsxw 3=<76 ,I lqsxw LV 0HGLXP WKHQ rxwsxw @ 3=368: lqsxw . 3=56: ,I lqsxw LV +LJK WKHQ rxwsxw @ 3=3539 lqsxw 4=44 (21) (22) (23)
7KH UXOH H[WUDFWLRQ PHWKRG ILUVW XVHV VXEFOXVWHULQJ WR GHWHUPLQH WKH QXPEHU RI UXOHV DQG LQSXW PHPEHUVKLS IXQFWLRQV WKHQ OLQHDU OHDVW VTXDUHV HVWLPDWLRQ WR GHWHUPLQH HDFK UXOH¶ V RXWSXW HTXDWLRQV (DFK IX]]\ FOXVWHU LV PDSSHG LQWR D JHQHUDOLVHG EHOOVKDSHG PHPEHUVKLS IXQFWLRQ GHILQHG DV 4 ehoo +{> d> e> f, @ {Ãf 5e 4. d 7KH SDUDPHWHU f LV WDNHQ WR EH WKH FHQWUH RI WKH FOXVWHU DQG d LV WKH FOXVWHU UDGLXV GHILQHG DV 13
100 Pos e Zero W1 0 W2 -100 Neg W3 + + + Sum1 + + + Sum2 u / Div
Input layer
Hidden layer
Output layer
Figure 8: Three rules perceived as a network. WKH ORQJHVW GLVWDQFH IURP WKH FHQWUH WR D SRLQW s ZLWK QRQ]HUR PHPEHUVKLS 7KH SDUDPHWHU e LV D VORSH GHWHUPLQHG DV D OLQHDU IXQFWLRQ RI WKH PHPEHUVKLS RI WKH ERXQGDU\ SRLQW s
1HXURIX]]\ IXQFWLRQ DSSUR[LPDWLRQ
It immediately comes to mind, when looking at a neural network, that the activation functions look like fuzzy membership functions. Indeed, an early paper from 1975 treats the extension of the McCulloch-Pitts neuron to a fuzzy neuron (Lee & Lee, 1975; see also Keller & Hunt, 1985). Consider a standard rule base for a fuzzy proportional controller with the error h as input and a control signal x with singleton membership functions as the output, If h is Pos then x is -100 If h is Zero then x is 0 If h is Neg then x is -100 (24) (25) (26)
The inference mechanism can be drawn in a block diagram somewhat like a neural network (Fig. 8). The network has an input layer, one hidden layer, and one output layer. The input node connects to the neurons in the hidden layer, this corresponds to the if-part of the rules. Each neuron only consists of an activation function, there is no summation, because each neuron has only one input. The singleton control signals appear as weights on the outputs from the neurons. The one neuron in the output layer, with a rather odd appearance, calculates the weighted average corresponding to the FHQWUH RI JUDYLW\ GHIX]]LILFDWLRQ in the 14
rule base. The network can be generalised to multi-input-multi-output control, but then the diagram becomes very busy. Backpropagation applies to this network since all layers are differentiable. Two possibilities for learning are apparent. One is to adjust the weights in the output layer, l=h=, all the singletons zl until the error is minimized. The other is to adjust the shape of the membership functions, provided they are parametric. The network can be described as a feedforward network with an input layer, a single hidden layer, and an output layer consisting of a single unit. The network performs a nonlinear mapping from the input layer to the hidden layer, followed by a linear mapping from the hidden layer to the output layer. Exactly such a topology occurs in UDGLDO EDVLV IXQF WLRQ networks; the hidden units provide a ’basis’ for the input patterns and their functions ’radially’ surround a particular data point. Radial basis function networks are used for curve-fitting in a multi-dimensional space. This is also called IXQFWLRQ DSSUR[LPDWLRQ and learning is equivalent to finding a function that best fits the training data. In its VWULFW sense the function is constrained to pass through all the training data points. The radial basis functions technique consists of choosing a function I> (27) 6 +nX X4 n, i 9 i +nX X5 n, : 9 : z4 z5 = = = zn 7 @ (28) 8 === i +nX XN n, p p Here X 5 U is a vector of inputs, Xn 5 U +n @ 4> 5> = = = > N, are vectors of training data, Z 5 UN is the vector of weights, i +nX Xn n, is a set of (nonlinear) radial basis functions, and n n is a norm, usually the Euclidean. The known data points Xn are taken to be the FHQWUHV of the radial basis functions. The activation level of a function i +X> Xn , is PD[LPXP when the input X is at the centre Xn of the function, as demonstrated by the next example. 5 ([DPSOH UDGLDO EDVLV IXQFWLRQ 7KH *DXVVLDQ LV DQ H[DPSOH RI D UDGLDO EDVLV IXQF WLRQ 5 (29) i +X, @ h{s+ nX X4 n , $ WZRGLPHQVLRQDO VDPSOH ZLWK WZR LQSXWV X @ +x4 > x5 , LV SORWWHG LQ )LJ 9 ZLWK LWV FHQWUH DW +x4 > x5 ,4 @ +8> 8, :KHQ X LV QHDU WKH FHQWUH WKH H[SRQHQW LV QHDU ]HUR DQG i +X, LV QHDUO\ RQH ZKLFK LV LWV PD[LPXP 7KH LQGH[ LQ UHIHUV WR WKH ILUVW WUDLQLQJ H[DPSOH $ VHFRQG WUDLQLQJ H[DPSOH ZRXOG EH DVVRFLDWHG ZLWK D VLPLODU *DXVVLDQ IXQFWLRQ EXW SODFHG RYHU DQRWKHU FHQWUH Each known data point Xn must satisfy the equation, ZW i +Xn , @ gn (30) where gn is the desired response corresponding to Xn , therefore the unknown weights Z 15
W I+X, @ Z i +nX
Xn n,
1
0.5
0 10 10 5 0 0 5
Figure 9: A Gaussian radial basis function. must satisfy the following set of equations 5 i45 i44 9 i54 i55 9 z4 z5 = = = zN 7 === === iN 4 iN 5 In matrix notation ZW @ GW where the vector G represents the GHVLUHG UHVSRQVH vector, and fmn @ i +nXm Xn n, > m> n @ 4> 5> = = = > N (32) (33) 6 i4N i5N : :@ === 8 iNN
=== === === ===
g4
g5
===
gN (31)
The matrix @ iimn j is called the LQWHUSRODWLRQ PDWUL[. For a class of radial basis functions, Gaussian functions for instance, the interpolation matrix is invertible (it is positive definite). Provided the data points are all distinct, then we can solve for the weights directly, obtaining (34) ZW @ GW Ã4 Although in theory this means we can solve the VWULFW interpolation problem where the function passes through all training points Xn , in practice we cannot, if the interpolation matrix is close to singular. The performance of the network depends only little on the type of function i + ,, according to theoretical investigations and practical experiences (Powell in Haykin, 1994). The performance may be improved by adjustments of the centre and the shape of the activation functions. Generally the radial basis function networks enjoy faster convergence than back-propagation networks.
$GDSWLYH 1HXURIX]]\ ,QIHUHQFH 6\VWHP $1),6
ANFIS (Adaptive Neuro Fuzzy Inference System) is an architecture which is functionally 16
equivalent to a Sugeno type fuzzy rule base (Jang, Sun & Mizutani, 1997; Jang & Sun, 1995). Under certain minor constraints the ANFIS architecture is also equivalent to a radial basis function network. Loosely speaking ANFIS is a method for tuning an existing rule base with a learning algorithm based on a collection of training data. This allows the rule base to adapt. The network in Fig. 8 may be extended by assigning a linear function to the output weight of each neuron, zn @ DW X . en > n
p
n @ 4> 5> = = = > N
(35)
where Dn 5 U is a parameter vector and en is a scalar parameter. The network is then equivalent to a first order Sugeno type fuzzy rule base (Takagi and Sugeno, 1985). The requirements for the radial basis function network to be equivalent to a fuzzy rule base is summarised in the following (Jang et al., 1997). Both must use the same aggregation method (weighted average or weighted sum) to derive their overall outputs. The number of activation functions must be equal to the number of fuzzy if-then rules. When there are several inputs in the rule base, each activation function must be equal to a composite input membership function. One way to achieve this is to employ Gaussian membership functions with the same variance in the rule base, and apply product for the DQG operation. The multiplication of the Gaussian membership functions becomes a multi-dimensional Gaussian radial basis function. Corresponding activation functions and fuzzy rules should have the same functions on the output side of the neurons and rules respectively. If the training data are contained in a small region of the input space, the centres of the neurons in the hidden layer can be concentrated within the region and sparsely cover the remaining area. Thus only a local model will be formed and if the test data lie outside the region, the performance of the network will be poor. On the other hand, if one distributes the basis function centres evenly throughout the input space, the number of neurons depends exponentially on the dimension of the input space.
$1),6 DUFKLWHFWXUH
Without loss of generality we assume two inputs, x4 and x5 > and one output, |. Assume for now a first order Sugeno type of rule base with the following two rules If x4 is $ 4 and x5 is % 4 then |4 If x4 is $ 5 and x5 is % 5 then |5 @ f44 x4 . f45 x5 . f43 @ f54 x4 . f55 x5 . f53 (36) (37)
Incidentally, this fuzzy controller could interpolate between two linear controllers depending on the current state. If the firing strengths of the rules are 4 and 5 respectively, for two particular values of the inputs x4 and x5 > then the output is computed as a weighted average 4 |4 . 5 |5 |@ @4 |4 . 5 |5 (38) 4 . 5
17
A1
u1 AND N
u1,u2
A2
+ y
B1
AND u2
N u1,u2
B2 Layer 1 2 3 4 5
Figure 10: Structure of the ANFIS network. The corresponding ANFIS network is shown in Fig. 10. A description of the layers in the network follows. 1. Each neuron l in layer 1 is adaptive with a parametric activation function. Its output is the grade of membership to which the given input satisfies the membership function, i.e., $4 (x4 ), %4 (x5 ), $5 (x4 ), or %5 (x5 ). An example of a membership function is the generalised EHOO IXQFWLRQ 4 (39) +{, @ {Ãf 5e 4.
d
where id> e> fj is the parameter set. As the values of the parameters change, the shape of the bell-shaped function varies. Parameters in that layer are called SUHPLVH SDUDPHWHUV. 2. Every node in layer 2 is a fixed node, whose output is the product of all incoming signals. In general, any other fuzzy AND operation can be used. Each node output represents the firing strength l of the lth rule. 3. Every node in layer 3 is a fixed node which calculates the ratio of the lth rule’s firing strength relative to the sum of all rule’s firing strengths, l l @ > l @ 4> 5 (40) 4 . 5 The result is a QRUPDOLVHG ILULQJ VWUHQJWK.
18
300
200
100
0
-100
-200
-300 0 5 10 15 20
Figure 11: Approximation of data points (o) by an ANFIS network (solid). Two rules interpolate between two lines (dotted). 4. Every node in layer 4 is an adaptive node with a node output l |l @l +fl4 x4 . fl5 x5 . fl3 , > l @ 4> 5 (41) where l is the normalised firing strength from layer 3 and ifl4 > fl5 > fl3 j is the parameter set of this node. Parameters in this layer are called FRQVHTXHQW SDUDPHWHUV. 5. Every node in layer 5 is a fixed node which sums all incoming signals. It is straight forward to generalise the ANFIS architecture in Fig. 10 to a rule base with more than two rules.
7KH $1),6 OHDUQLQJ DOJRULWKP
When the premise parameters are fixed, the overall output is a linear combination of the consequent parameters. In symbols, the output | can be written as 5 4 |4 . |5 (42) | @ 4 . 5 4 . 5 (43) @ 4 +f44 x4 . f45 x5 . f43 , . 5 +f54 x4 . f55 x5 . f53 , @ +4 x4 , f44 . +4 x5 , f45 . 4 f43 . +5 x5 , f54 . +5 x5 , f55 . 5 f53(44) which is linear in the consequent parameters flm +l @ 4> 5> m @ 3> 4> 5, = A hybrid algorithm adjusts the consequent parameters flm in a forward pass and the premise parameters idl > el > fl j in a backward pass (Jang et al., 1997). In the forward pass the network inputs propagate forward until layer 4, where the consequent parameters are identified by the least-squares method. In the backward pass, the error signals propagate backwards and the premise parameters are updated by gradient descent. Because the update rules for the premise and consequent parameters are decoupled in 19
Initial MFs 1 0.8 0.6 0.4 0.2 0 0 1 0.9 0.8 0.7 0.6 0.5 0.4 5 10 15 20 0.3 0 5
Final MFs
10
15
20
Figure 12: Membership functions before (left) and after (right) learning. the hybrid learning rule, a computational speedup may be possible by using variants of the gradient method or other optimisation techniques on the premise parameters. Since ANFIS and radial basis function networks (RBFNs) are functionally equivalent under some minor conditions (p 17), a variety of learning methods can be used for both of them. ([DPSOH $1),6 7R VHH KRZ DQ $1),6 QHWZRUN FDQ DSSUR[LPDWH D IXQFWLRQ )LJ 11 VKRZV D SORW RI D VHW RI GDWDSRLQWV DQG WKH UHVXOWLQJ LQWHUSRODWLQJ FXUYH (OHYHQ GDWD SRLQWV FLUFOHG LQ WKH ILJXUH ZHUH SUHVHQWHG WR WKH $1),6 QHWZRUN ,QLWLDOO\ WZR *DXVVLDQ LQSXW PHPEHUVKLS IXQFWLRQV ZHUH FKRVHQ )LJ 12 OHIW 7KH\ FRYHU WKH ZKROH LQSXW UDQJH ZLWK SHUFHQW RYHUODS $QRWKHU LQLWLDO GHVLJQ FKRLFH ZDV WKH QXPEHU RI UXOHV LH WZR 7KH UHVXOW RI WKH OHDUQLQJ LV D UXOH EDVH ZLWK WZR UXOHV ,I { LV D4 WKHQ |4 ,I { LV D5 WKHQ |5 @ 4;=:8{ . 57<=4 @ 56=88{ 5:9=: (45) (46)
7KH ULJKW KDQG VLGH RI WKH UXOHV DUH WZR VWUDLJKW OLQHV DOVR GUDZQ RQ WKH ILJXUH RQH ZLWK D QHJDWLYH VORSH DQG WKH RWKHU ZLWK D SRVLWLYH VORSH 7KH LQWHUSRODWLQJ FXUYH LV WKH UHVXOW RI D QRQOLQHDU EOHQG RI WKH WZR VWUDLJKW OLQHV 7KH ZHLJKWLQJ RI WKH OLQHV LQ HDFK SRLQW RI WKH LQWHUSRODWLQJ FXUYH LV GHWHUPLQHG E\ WKH LQSXW PHPEHUVKLS IXQFWLRQV D4 DQG D5 )LJ 12 ULJKW 7KH HOHYHQ GDWD SRLQWV ZHUH SUHVHQWHG WR WKH QHWZRUN WLPHV GXULQJ ZKLFK WKH $1),6 DOJRULWKP XSGDWHG WKH SUHPLVH SDUDPHWHUV WKDW GHWHUPLQH WKH VKDSH DQG SRVLWLRQ RI WKH WZR PHPEHUVKLS IXQFWLRQV DQG WKH FRQVHTXHQW SDUDPHWHUV WKDW GHWHUPLQH WKH VORSH DQG WKH FRQVWDQW RI WKH WZR OLQHV RQ WKH RXWSXW VLGH WKH\ ZHUH LQLWLDOO\ DOO ]HUR
*HQHWLF DOJRULWKPV
A problem with backpropagation and least squares optimisation is that they can be trapped in a local minimum of a nonlinear objective function, because they are derivative based. *HQHWLF DOJRULWKPV survival of the fittest! are derivative-free, stochastic optimisa20
tion methods, and therefore less likely to get trapped. They can be used to optimise both structure and parameters in neural networks. A special application for them is to determine fuzzy membership functions. A genetic algorithm mimics the evolution of populations. First, different possible solutions to a problem are generated. They are tested for their performance, that is, how good a solution they provide. A fraction of the good solutions is selected, and the others are eliminated (survival of the fittest). Then the selected solutions undergo the processes of UHSURGXFWLRQ, FURVVRYHU, and PXWDWLRQ to create a new JHQHUDWLRQ of possible solutions, which is expected to perform better than the previous generation. Finally, production and HYDOXDWLRQ of new generations is repeated until convergence. Such an algorithm searches for a solution from a broad spectrum of possible solutions, rather than where the results would normally be expected. The penalty is computational intensity. The elements of a genetic algorithm are explained next (Jang et al., 1997). 1. (QFRGLQJ. The parameter set of the problem is encoded into a bit string representation. For instance, a point +{> |, @ +44> 9, can be represented as a FKURPRVRPH, which is a concatenated bit string 1 0 1 1 0 1 1 0 (47) Each coordinate value is a JHQH of four bits. Other encoding schemes can be used, and arrangements can be made for encoding negative and floating point numbers. 2. )LWQHVV HYDOXDWLRQ. After creating a SRSXODWLRQ the fitness value of each member is calculated. For a maximisation problem, the fitness value of the lth member is the value of the objective function at point l. Usually strictly positive objective functions are employed. Another possible fitness measure is to use a ranking of the members of the population, then the objective function can be inaccurate as long as it provides the correct ranking. 3. 6HOHFWLRQ. The algorithm selects which SDUHQWV should participate in producing offsprings for the next generation. Usually the probability of selection for a member is proportional to its fitness value. The idea is to let members with above-average fitness reproduce and replace members with below-average fitness. 4. &URVVRYHU. Crossover operators generate new chromosomes that hopefully retain good features from the previous generation. Crossover is usually applied to selected pairs of parents with a probability equal to a given FURVVRYHU UDWH. In one-point crossover a crossover point on the genetic code is selected at random and two parent chromosomes interchange their bit strings to the right of this point. Take for example two chromosomes 1 0 1 1 0 (48) 1 0 0 1 0 (49) If the crossover point is between the fifth bit and the sixth, the digits written in italics will swap places vertically. The two new chromosomes will be 1 1 0 0 1 0 1 1 0 0 (50) (51)
21
In two-point crossover, two crossover points are selected and the part of the chromosome string between these two points is swapped to generate two children; and so on. In effect, parents pass segments of their own chromosomes on to their children, and some children will be able to outperform their parents if they get good genes from both parents. 5. 0XWDWLRQ. A mutation operator can spontaneously create new chromosomes. The most common way is to flip a bit with a probability equal to a very low, given PXWDWLRQ UDWH. The mutation prevents the population from converging towards a local minimum. The mutation rate is low in order to preserve good chromosomes. Note that the above is only a general description of the basics of a genetic algorithm; detailed implementations vary considerably. For a textbook on the subject of genetic algorithms, see for example Goldberg (1989). $OJRULWKP (Jang et al., 1997) An example of a simple genetic algorithm for a maximisation problem is the following. 1. Initialise the population with randomly generated individuals and evaluate the fitness of each individual. (a) Select two members from the population with probabilities proportional to their fitness values. (b) Apply crossover with a probability equal to the crossover rate. (c) Apply mutation with a probability equal to the mutation rate. (d) Repeat (a) to (d) until enough members are generated to form the next generation. 3. Repeat steps 2 and 3 until a stopping criterion is met. If the mutation rate is high (above 0.1), the performance of the algorithm will be as bad as a primitive random search. ([DPSOH OLQH ILW 5RVV 7KLV LV DQ H[DPSOH RI KRZ D OLQH PD\ EH ILW WR D JLYHQ GDWD VHW XVLQJ D JHQHWLF DOJRULWKP &RQVLGHU WKH GDWD VHW [ 7KH OLQH WR EH ILWWHG LV \
(52)
| @ d{ . e (53) 7KH SDUDPHWHUV +d> e, PXVW EH HQFRGHG LQ WKH IRUP RI ELW VWULQJV ZLWK D UDQGRP DVVLJQPHQW RI 3V DQG 4V DW GLIIHUHQW ELW ORFDWLRQV IRU H[DPSOH (54) (DFK FKURPRVRPH LV 45 ELWV ORQJ WKH ILUVW JHQH HQFRGH d DQG WKH VHFRQG JHQH HQFRGH e : H VWDUW ZLWK DQ LQLWLDO SRSXODWLRQ RI IRXU FKURPRVRPHV 7KH ELQDU\ YDOXHV RI WKH JHQHV PXVW EH FRQYHUWHG DQG PDSSHG WR GHFLPDO QXPEHUV WKDW PDNH VHQVH LQ $ VXLWDEOH DIILQH 22
PDSSLQJ LV
f @ fplq .
felq +fpd{ 59 4 f +8 . 5, @ 5. 9 5 4 @ f@< 5
fplq ,
(55) (56) (57)
ZKHUH felq LV WKH GHFLPDO YDOXH RI WKH ELQDU\ JHQH DQG fpd{ @ 8 DQG fplq @ 5 DUH DVVXPHG XSSHU DQG ORZHU OLPLWV RI f 7KXV LQ WKH ILUVW LWHUDWLRQ WKH GHFLPDO YDOXH RI JHQH d LV :> ZKLFK LV PDSSHG LQWR d @ :@< 5 @ 4=55 DQG WKH GHFLPDO YDOXH RI JHQH e LV 53 ZKLFK LV PDSSHG LQWR e @ 3=55 7KH LQLWLDO SRSXODWLRQ LV IRXU DQG WKH IROORZLQJ WDEOH VKRZV WKH ILUVW LWHUDWLRQ 0HPEHU &KURPRVRPHV d 4=55 3=33 3=66 5=33 e 3=55 3=9: 5=9: 4=33 H+\, 47: 665 6<4 68; H@ phdq+H, 3=7; 4=3; 4=5: 4=4: &RSLHV 3 4 5 4
(58)
7KH REMHFWLYH IXQFWLRQ LV
[ 5 (59) H+\, @ 733 +g \, 7KH VXP RI WKH VTXDUHG HUURUV LV VXEWUDFWHG IURP D ODUJH QXPEHU 733 WR FRQYHUW WKH SUREOHP LQWR D PD[LPLVDWLRQ SUREOHP 7KH YDULDEOH g LV WKH GHVLUHG |YDOXH IURP DQG WKH YHFWRU \ LV WKH UHVXOW RI LQVHUWLQJ DOO {YDOXHV IURP LQWR WKH HTXDWLRQ IRU WKH OLQH 7KH YDOXH RI WKH REMHFWLYH IXQFWLRQ LV WKH YDOXH RI WKH ILWQHVV RI HDFK VWULQJ (DFK ILWQHVV YDOXH LV GLYLGHG E\ WKH DYHUDJH RI WKH ILWQHVV YDOXHV IRU WKH ZKROH SRSXODWLRQ WR JLYH DQ HVWLPDWH RI WKH UHODWLYH ILWQHVV RI HDFK VWULQJ $OO VWULQJV ZLWK D UHODWLYH ILWQHVV OHVV WKDQ 3=; LV HOLPLQDWHG DQG WKH UHVW RI WKH ILWQHVVHV DUH VFDOHG DQG URXQGHG VXFK WKDW WKHLU VXP LV IRXU WKH QHZ QXPEHUV DUH WKH QXPEHU RI FRSLHV WR XVH RI HDFK LQGLYLGXDO IRU WKH QH[W JHQHUDWLRQ 7KH QH[W WDEOH FRQFHUQV WKH VHFRQG JHQHUDWLRQ 7KH ILUVW FROXPQ VKRZV WKH LQGLYLGXDOV VHOHFWHG DQG WKH\ DUH DOLJQHG IRU FURVVRYHU DW WKH ORFDWLRQV VKRZQ E\ D YHUWLFDO EDU m &RSLHV 4 5 4 3 (60) 1RW VKRZQ LQ WKH WZR WDEOHV LV WKH DYHUDJH ILWQHVV RI WKH ILUVW JHQHUDWLRQ 63: LQFUHDVLQJ WR 659 LQ WKH VHFRQG JHQHUDWLRQ 7KH LWHUDWLRQV VKRXOG EH FRQWLQXHG XQWLO FRQYHUJHQFH WR D VROXWLRQ ZLWKLQ D JHQHUDWLRQ 6HOHFWHG m m m m 1HZ FKURPV d 3=77 3=44 3=66 5=33 e 3=9: 5=9: 5=89 3=;< H+\, 6:9 6;4 5<5 589 H@ phdq+H, 4=48 4=4: 3=<3 3=:;
23
1
1 S L 0.8 0.6 0.4 0.2 0 0 1 2 x 3 4 5 1 S L 0.8 0.6 0.4 0.2 0 0 1 2 x 3 4 5 0 5 10 y 15 20 25 S VL 0 5 10 y 15 20 25 S VL
1st generation 2nd generation
0.8 0.6 0.4 0.2 0 1 0.8 0.6 0.4 0.2 0
Figure 13: Best individuals in the first generation (top) and the second generation (bottom).
&RPSXWLQJ PHPEHUVKLS IXQFWLRQV
Genetic algorithms can be used to find membership functions (Karr & Gentry in Ross, 1995). Initially some membership functions and their shapes are assumed for the various fuzzy variables in the system. The membership functions are encoded as bit strings and concatenated. A fitness function is used to evaluate the fitness of each set of parameters that define the membership functions. The process is illustrated for a simple problem in the next example. ([DPSOH PHPEHUVKLS IXQFWLRQV 5RVV *LYHQ D VLQJOHLQSXW VLQJOHRXWSXW UXOH EDVH ,I { LV 6 WKHQ | LV 6 ,I { LV / WKHQ | LV 9/ (61) (62)
7KH LQSXW { XVHV WZR IX]]\ WHUPV 6 VPDOO DQG / ODUJH DQG WKH RXWSXW | XVHV 6 VPDOO DQG 9/ YHU\ ODUJH 7KH SUREOHP LV WR WU\ DQG ILW WKH UXOH EDVH E\ DGMXVWLQJ WKH PHPEHUVKLS IXQFWLRQV WR WKH IROORZLQJ PDSSLQJ { 4 5 6 7 8 @ (63) | 4 7 < 49 58 : DVVXPH WULDQJXODU PHPEHUVKLS IXQFWLRQV ZLWK WKHLU ULJKW DQJOH YHUWLFHV IL[HG DW WKH H H[WUHPHV RI WKH UHVSHFWLYH XQLYHUVHV RI GLVFRXUVH 7KH LQSXW XQLYHUVH LV DVVXPHG WR EH ^3> 8` DQG WKH RXWSXW XQLYHUVH ^3> 58` = )RU HDFK PHPEHUVKLS IXQFWLRQ WKH RQO\ DGMXVWDEOH 24
SDUDPHWHU LV WKH OHQJWK RI WKH EDVH WKH OHJ RI WKH WULDQJOH RQ WKH KRUL]RQWDO D[LV VHH )LJ 13 :H XVH VL[ELW JHQHV WR GHILQH WKH EDVH RI HDFK PHPEHUVKLS IXQFWLRQ 7KH JHQHV DUH FRQFDWHQDWHG WR JLYH D 9 ELWVYDULDEOH 7 YDULDEOHV @ 57 ELW FKURPRVRPH :H VWDUW ZLWK D SRSXODWLRQ RI IRXU DQG FRPSXWH WKH EDVH OHQJWKV XVLQJ WKH FRQYHUVLRQ VFKHPH IURP WKH SUHYLRXV H[DPSOH %DVH 9/ 53=5 48=4 48=< 46=< (64) 7KH FDOFXODWLRQV SURFHHG DV LQ ([DPSOH GHWDLOV FDQ EH IRXQG LQ 5RVV $IWHU WZR LWHUDWLRQV WKHUH LV D VOLJKW FKDQJH LQ WKH PHPEHUVKLS IXQFWLRQV VHH )LJXUH 13 0HPEHU &KURPRVRPHV %DVH 6 3=89 4=76 4=9: 5=;9 %DVH / 4=8< 3=<8 6=66 3=:4 %DVH 6 ;=:6 4:=8 8=49 4:=8
7HVW UHVXOWV DQG GLVFXVVLRQ
For the sake of comparison, consider a very simple, nonlinear, static single-input-singleoutput plant is considered, | @ wdqk+x, . vlq+x, . h (65) The plant output | is governed by the input x and some Gaussian white noise h with zero mean and standard deviation 0.1. The noise is included to show the ability to extract parameters from noisy training data. The reason for choosing this simple example is to compare results with Sørensen (1994); he used it with a multi-layer perceptron to compare four different learning algorithms. From the plant a training set of Q @ 96 samples of matching inputs and outputs ix+n,> |+n,j > +n @ 4> 5> > Q , is collected. The training set is equally spaced in the interval x 5 ^ > ` = Unlike dynamic plants, n denotes the sample number rather than time, implying that the order in which the training examples are presented to the model is unimportant. The initial model has two membership functions (Neg and Pos), and ANFIS is asked to train for 500 epochs. ANFIS returns an adjusted rule base, If input is 1HJ then output is If input is 3RV then output is 4=<54 lqsxw ;=398 5=;4< lqsxw . 45=74 (66)
The training data and the ANFIS approximation is shown in Fig. 14, top. The adjusted membership functions, which we have called 1HJ and 3RV, are in the same figure (bottom). A qualitative comparison with Sørensen (1994) is possible by means of Fig. 15, which shows how the approximation error develops over the training epochs. The result is that ANFIS converges faster than Sørensen’s two simple algorithms, backpropagation and steepest descent. It is, however, a little bit slower than his Gauss Newton method, and it has about the same convergence rate as his parallel Gauss Newton method. Also shown in Fig 15 is 25
2
0
y
-2
-3
-2
-1
0
1
2
3
1
Membership
Neg
Pos
0
-3
-2
-1
0 u
1
2
3
Figure 14: Top: ANFIS approximation (solid) of training data (dotted). Bottom: Two membership functions that cover the input range ^ > ` = a dashed line, showing the optimal error, i.e., vwg +wdqk+x, vlq+x, |, @ 3=446 (67) This is the actual standard deviation of the training data from the ideal function. ANFIS dives below the line, more than Sørensen’s methods, indicating that it is trying to learn the noise (overfitting). It is interesting to notice, that a run with ANFIS’ default settings, results in a similar model with two input membership functions, but it stops after only 10 iterations. At this point the standard deviation on the error is almost optimal +3=44<, and the error curve, cp. Fig. 15, is a straight line. The example shows some strengths and shortcomings of neurofuzzy modelling for control purposes: We do not need to know the plant dynamics in advance; identification of them is embedded in the training of ANFIS. The example was based on off-line learning only. It is possible to turn on on-line learning to cope with time-varying plant dynamics. In general, the best approach is to use off-line learning to find a working controller for a nominal plant, and then use on-line learning to fine-tune the controller if the plant is time-varying. The distribution of the training data could pose a problem. Ideally, we would like to see the training data distributed across the input space of the plant in a somewhat uniform manner. This may not be possible due either to the scarcity of the data (especially when there are many inputs) or to limits imposed by the physical plant.
26
0.13
u
RMS error
0.12 0.11 0.1 0 100 200 epoch 300 400 500
Figure 15: The learning curve for ANFIS is rather steep, and it drops below the optimal error indicating overtraining. In traditional linear system identification it is well known that the frequency spectrum of the input signal must be sufficiently wide. With neurofuzzy methods this is required too, but also the amplitude spectrum of the input signal must be sufficiently wide, because the networks are nonlinear. It is thus necessary to train the network with a number of different amplitudes in order to learn the nonlinearities of the plant. It can be difficult, however, to determine beforehand how to manipulate the inputs in such a way that the internal variables contain sufficiently varied amplitudes. ANFIS uses least squares optimisation which may favour local linear models which are QRW local linear approximations. Figure 11 shows the two lines between which the rule base interpolates, they are absolutely not local approximations to the function. Such a model can hardly be interpreted in terms of the individual rules. This can be achieved by having membership functions with less overlap and wider peaks (larger FRUHV ), because the linear models are then forced to represent local behaviour; other rules are not interfering. Babuška (1998) has studied this carefully, and his SURGXFWVSDFH FOXVWHULQJ method is based on the philosophy that the local models should be local linearisations. He also makes the point, that the fuzzy c-means algorithm, as it is presented here, prefers clusters that are spherical; elongated clusters can be difficult to detect, especially if they are not perpendicular to the feature axes. He proposes to use a cluster algorithm which adapts to the shape of the cluster (Gustafson & Kessel in Babuska, 1998, p 60).
&RQFOXVLRQV
Compared to neural networks, the neurofuzzy methods provide models which can be interpreted by human beings. The models are in the form of the familiar if-then rules, implying easy integration with operators’ (expert) rules. The ANFIS model shows good performance, but sometimes produces spurious rules, that make little sense. For automatic control purposes our interest lies in the application of these methods to system identification and adaptive control systems.
27
5HIHUHQFHV
Babuška, R. (1998). )X]]\ 0RGHOLQJ )RU &RQWURO, Kluwer Academic Publishers. Bezdek, J. and Pal, S. K. (1992). )X]]\ PRGHOV IRU SDWWHUQ UHFRJQLWLRQ, IEEE Press, New Y ork. (Selected reprints). Goldberg, D. E. (1989). *HQHWLF $OJRULWKPV LQ 6HDUFK 2SWLPLVDWLRQ DQG 0DFKLQH /HDUQLQJ, Addison-Wesley, Reading, MA, USA. Haykin, S. (1994). 1HXUDO 1HWZRUNV $ &RPSUHKHQVLYH )RXQGDWLRQ, Macmillan College Publishing Company, Inc., 866 Third Ave, New Y ork, NY 10022. Hill, G., Horstkotte, E. and Teichrow, J. (1990). )X]]\& GHYHORSPHQW V\VWHP ± XVHU PDQXDO, V Togai Infralogic, 30 Corporate Park, Irvine, CA 92714, USA. Jang, J.-S. R. and Gulley, N. (1995). )X]]\ /RJLF 7RROER[, The MathWorks Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Jang, J.-S. R. and Sun, C.-T. (1995). Neuro-fuzzy modeling and control, 3URFHHGLQJV RI WKH ,((( (3): 378–406. Jang, J.-S. R., Sun, C.-T. and Mizutani, E. (1997). 1HXUR)X]]\ DQG 6RIW &RPSXWLQJ, number isbn 0-13-261066-3 in 0DWODE &XUULFXOXP 6HULHV, Prentice Hall, Upper Saddle River, NJ, USA. Keller, J. M. and Hunt, D. J. (1985). Incorporating fuzzy membership functions into the perceptron algorithm, in )X]]\ PRGHOV IRU SDWWHUQ UHFRJQLWLRQ (Bezdek and Pal, 1992), pp. 468– 474. (Selected reprints). Kosko, B. (1992). 1HXUDO 1HWZRUNV DQG )X]]\ 6\VWHPV $ '\QDPLFDO 6\VWHPV $SSURDFK WR 0DFKLQH ,QWHOOLJHQFH, Prentice–Hall, Englewood Cliffs. Lee, S. C. and Lee, E. T. (1975). Fuzzy neural networks, in )X]]\ PRGHOV IRU SDWWHUQ UHFRJQLWLRQ (Bezdek and Pal, 1992), pp. 448–467. (Selected reprints). Lewis, R. (1990). 3UDFWLFDO 'LJLWDO ,PDJH 3URFHVVLQJ, Ellis Horwood Series in Digital and Signal Processing, Ellis Horwood Ltd, New Y ork, etc. MIT (1995). &,7( /LWHUDWXUH DQG 3URGXFWV 'DWDEDVH, MIT GmbH / ELITE, Promenade 9, D52076 Aachen, Germany. Nørgaard, P M. (1996). 6\VWHP ,GHQWLILFDWLRQ DQG &RQWURO ZLWK 1HXUDO 1HWZRUNV, PhD thesis, . Technical University of Denmark, Dept. of Automation, Denmark. Ross, T. (1995). )X]]\ /RJLF ZLWK (QJLQHHULQJ $SSOLFDWLRQV, McGraw-Hill, New Y ork, N.Y . Takagi, T. and Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control, ,((( 7UDQV 6\VWHPV 0DQ &\EHUQHWLFV (1): 116–132. Zimmermann, H.-J. (1993). )X]]\ VHW WKHRU\ DQG LWV DSSOLFDWLRQV, second edn, Kluwer, Boston. (1. ed. 1991).
28