IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER 1999

A Dynamic Channel Assignment Policy Through Q-Learning

Junhong Nie and Simon Haykin, Fellow, IEEE

Abstract—One of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited, efficient utilization of these communication channels through efficient channel assignment strategies is not only desirable but also imperative. This paper presents a novel approach to solving the dynamic channel assignment (DCA) problem by using a form of real-time reinforcement learning known as Q-learning, in conjunction with a neural-network representation. Instead of relying on a known teacher, the system is designed to learn an optimal channel assignment policy by directly interacting with the mobile communication environment. The performance of the Q-learning-based DCA was examined by extensive simulation studies on a 49-cell mobile communication system under various conditions. Comparative studies with the fixed channel assignment (FCA) scheme and one of the best dynamic channel assignment strategies, MAXAVAIL, have revealed that the proposed approach is able to perform better than the FCA in various situations and is capable of achieving a performance similar to that achieved by MAXAVAIL, but with a significantly reduced computational complexity.

I. INTRODUCTION

ONE of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited and the population of mobile users is increasing dramatically, efficient utilization of the available communication resources through efficient channel assignment strategies is not only desirable but also imperative. In a cellular mobile communication system, the service area is divided into a number of subareas called cells, with each cell being served by a base station which handles all calls made by mobile users within the cell. An essential feature of the cellular concept is channel reuse [16], [18]; that is, a single radio channel may be used simultaneously in a number of physically separated cells, provided that a cochannel interference constraint is satisfied. The channel assignment problem involves efficiently assigning channels to each radio cell (or call) in a cellular mobile system in such a way that the probability that incoming calls are blocked and the probability that the carrier-to-interference ratio falls below a prespecified value are both sufficiently low. In other words, the problem revolves around how the limited resource (channels) can be utilized with maximum efficiency.

The existing channel assignment methods may be roughly classified into fixed and dynamic schemes [18], [21]. In the fixed channel assignment (FCA) scheme, a set of channels is allocated to each cell permanently by a frequency planning process. In contrast, in dynamic channel assignment (DCA) schemes all the channels are available in all the cells, and channels are assigned to cells only when they are required; there are no fixed relationships between cells and channels. In other words, channel assignment is carried out on a call-by-call basis in a dynamic manner. A number of FCA approaches exist, ranging from simple heuristic ones to more mathematically involved ones in which various conventional or nonconventional optimization schemes are applied, including neural networks, genetic algorithms, and simulated annealing [10], [15], [17]. Likewise, a number of DCA schemes have been proposed [4]-[5], [8]-[9], [11], [20], [25]-[26]. It has been concluded that DCA performs better than FCA in terms of blocking probability in the case of nonuniform traffic and light to moderate traffic load. However, the implementation complexity of previously known DCA schemes is generally higher than that of FCA.

This paper proposes an alternative approach to solving the dynamic channel assignment problem. The optimal dynamic assignment policy is obtained through a form of real-time reinforcement learning [2], [3] known as Q-learning [23]. The scheme is based on the judgment that DCA can be regarded as a large-scale constrained dynamic optimization problem in a stochastic environment, and that learning is one of the effective ways to find a solution to this problem. Instead of relying on a known teacher providing a correct output in response to an input, the system is designed to learn an optimal policy by directly interacting with the environment in which it works, a mobile communication environment in our case. Learning is accomplished progressively by appropriately utilizing the past experience obtained during real-time operation. The performance of the Q-learning-based DCA was examined by extensive simulation studies on a 49-cell mobile communication system under various conditions, including homogeneous and inhomogeneous traffic distributions, time-varying traffic patterns, and channel failures. Also, we carried out comparative studies with the FCA scheme and one of the best DCA strategies, MAXAVAIL [20].1

Manuscript received December 30, 1996; revised February 8, 1999. This work was supported by Motorola, Vancouver, BC, under an ARRC/McMaster Research Grant. The authors are with the Communications Research Laboratory, McMaster University, Hamilton, Ont., Canada. Publisher Item Identifier S 1045-9227(99)09448-5.

1 It is noteworthy that reinforcement learning has already been applied to other large-scale problems such as backgammon game playing [22] and elevator dispatching [7].
This paper is organized as follows. In Section II, the problem of channel assignment is stated and some existing assignment schemes are briefly reviewed. The proposed approach to dynamic channel assignment is described in Section III, together with implementation details. Section IV is devoted to reporting simulation results of applying the proposed scheme to a 49-cell mobile communication system; various communication environmental conditions were considered, including comparative studies with the FCA and MAXAVAIL algorithms. The paper ends with some concluding remarks summarizing our findings and outlining future work.

II. CHANNEL ASSIGNMENT PROBLEM

A. Problem Description

A core concept in cellular communication systems is frequency reuse; that is, radio channels on the same carrier frequency can be used repeatedly by mobile users in different cells, provided that the cells using the same channel are separated by a sufficient distance. Such cells are referred to as cochannel cells, and the interference created by using the same channel simultaneously is known as cochannel interference, which is considered to be the major constraint in the channel-assignment task. The cochannel interference is a function of a parameter known as the cochannel reuse ratio, defined by [16]

    sigma = D / R    (1)

where D, known as the frequency reuse distance, is the distance between the centers of the nearest neighboring cochannel cells, and R is the cell radius, as shown in Fig. 1, where a regular hexagonal cellular layout is assumed.

Fig. 1. Cell i and its interference cells.

Assuming that the minimum acceptable carrier-to-interference ratio (C/I) is 18 dB, the minimum reuse distance has been found to be D = sqrt(21) R, or approximately 4.6R [16]. This means that cochannel cells in Fig. 1 should be at least three cells apart.

The channel assignment problem may be simply stated as follows. Assume that there are N cells in a mobile communication system, each of which is served by a base station located at the center of the cell. Also, we are given a set of M noninterfering radio channels, implying that adjacent-channel interference is neglected. Then the channel-assignment task concerns how to assign the M channels to the N cells and to individual calls, subject to the cochannel interference constraint. It is evident that frequency reuse and cochannel interference are the two major issues involved in solving the problem.

B. Fixed Channel Assignment Scheme

In FCA, a subset of the channels available to the radio system is permanently allocated to each cell, so there is a definite relationship between a channel and a cell. A channel can be associated with many cells as long as the cochannel interference constraint is satisfied or, equivalently, the cells involved are located at least the cochannel reuse distance D apart. In other words, two cells at distance D or more may be allocated the same subset of channels. The number of channels m allocated to each cell can be determined by

    m = M / K,    K = (1/3)(D/R)^2    (2)

where K is the cluster size. For example, for a reuse distance D = sqrt(21) R, K = 7, and the number of channels allocated to each cell is m = M/7.

To associate channels with cells, a frequency planning process is used in the FCA. Once this is done, the relationship of the channels to the cells is fixed. This means that a call attempt in cell i can only be served by one of the channels allocated to cell i. Consequently, if all of those channels are in use, a new call attempt in cell i will be blocked, even though there may be unoccupied channels in adjacent cells.

Although FCA is relatively simple to operate, there are some potential drawbacks resulting from its use. For example, it is not able to handle unpredicted time-varying traffic patterns, such as those caused by traffic jams and car accidents, because the capacity it can provide is fixed. Also, frequency planning may become more difficult and tedious in a microcellular environment, since more accurate knowledge of traffic and interference conditions is required. Dynamic channel assignment is one of the solutions to the problems encountered in FCA.

C. Dynamic Channel Assignment

The main feature of DCA is that all the channels are available in all the cells, and channel assignment is carried out on a call-by-call basis in a dynamic manner. Traffic variability can therefore be adapted to automatically, which can potentially lead to improved performance, particularly if the spatial traffic profile is unknown, poorly known, or varies with time.

The problem a DCA scheme tries to deal with may be described as follows. Assume that there are N cells and M channels in a mobile system. Referring to Fig. 1, let i denote a cell, I(i) the set of cells interfering with i, i.e., those neighboring cells that lie at a distance less than the reuse distance D, and A(i, t) the set of all channels available at time t in cell i. A channel is said to be available if it is being used neither in cell i nor in any cell of I(i). Now the problem is: when a new call arrives at cell i, how do we choose a channel from A(i, t) for the call? Obviously, if A(i, t) is empty and no rearrangement (intracell handoff) of ongoing calls is permitted, the new call will be blocked. On the other hand, if more than one channel is available, a selection strategy should be used.

A number of DCA algorithms have been proposed; a critical review of DCA may be found in our previous report [14]. Here only two types of strategies, namely exhaustive-searching DCA and neural-network-based DCA, are briefly described, because they are relevant to our approach.

The strategies in the exhaustive-searching DCA group share the following common features. Each available channel in cell i, say a in A(i, t), has a cost (reward) associated with it. When a new call is attempted, the system searches exhaustively for the channel with minimum cost (maximum reward)

    a* = arg min_{a in A(i,t)} c(i, a)    (3)

Then, channel a* is assigned to the new call. Several criteria, including maximum availability, maximum interferers, and minimum damage, have been used. The maximum availability strategy, known as MAXAVAIL [20], has been claimed to produce the best performance in the case where no intracell handovers are involved. The idea is to select the channel a* from A(i, t) which maximizes the total number of channels available in the entire system:

    a* = arg max_{a in A(i,t)} sum_{j in U} |A(j, t; a -> i)|    (4)

where A(j, t; a -> i) denotes the set of channels available in cell j given that channel a is assigned to cell i, and U is the set of all cells in the system. Notice that the computational load for calculating (4) can be high, because the number of available channels resulting from the assignment of a must be calculated for every channel and every base station.

The DCA problem may also be solved by neural-network-based approaches. For example, a Hopfield neural network was used in [8]. An energy function associated with a particular cell is formulated by incorporating factors such as interference constraints, traffic requirements, and packing conditions. Corresponding to this energy function, a Hopfield neural network is constructed. When a new call arrives in cell i, an equilibrium point of the network is found by iteratively solving the corresponding dynamic equation; the stable states (zero or one) of the neurons represent the desired solution. Another possibility is to use a multilayer feedforward neural network (MFNN) [4]. By providing training data, an MFNN is trained to behave as a specific DCA scheme does. After the neural network is trained using the backpropagation algorithm, it is used in real time to give a desired channel number in response to a new call request.

III. SOLVING THE DCA PROBLEM THROUGH Q-LEARNING

Conventional DCA strategies, as described in the last section, completely ignore the experience or knowledge that could be gained during real-time operation of the system. Although the neural-network-based approach does have a training stage, it is crucial to have a good teacher (a known DCA algorithm) to guide the training. On the other hand, exhaustive-searching approaches are generally time-consuming in finding a solution and are thus inefficient. Here, we propose an alternative approach to solving the channel assignment problem. By regarding the DCA as a large-scale constrained dynamic optimization problem embedded in a stochastic environment, we may obtain an optimal assignment policy through an effective learning scheme in which learning is accomplished progressively by appropriately utilizing the past experience gained during real-time operation.

Learning without a teacher is difficult. The particular learning paradigm we have adopted is known as reinforcement learning (RL) [2]. In RL, a learner aims at learning an optimal control policy by repeatedly interacting with the controlled environment in such a way that its performance, evaluated by a scalar reward (cost) obtained from the environment, is maximized (minimized). The RL algorithms developed so far are closely related to the well-known dynamic programming (DP) procedure developed some decades ago by Bellman [3]. There exists a variety of RL algorithms. A particular algorithm that appears to be suitable for the DCA task is called Q-learning [23]. In what follows, we first describe the algorithm briefly and then present the details of how the DCA problem can be solved by means of Q-learning.

A. Q-Learning Algorithm

Assume that the environment with which a learner interacts is a finite-state discrete-time stochastic dynamical system, as shown in Fig. 2. Let X = {x_1, x_2, ..., x_n} be the set of possible states and A = {a_1, a_2, ..., a_m} a set of possible actions. The interaction between the learner and the environment at each time instant consists of the following sequence.

Fig. 2. An illustration of learner-environment interaction.

• The learner senses the state x_t in X.
• Based on x_t, the learner chooses an action a_t in A to perform.
• As a result, the environment makes a transition to the new state y = x_{t+1} in X according to the transition probability P_xy(a_t), and thereby generates a return (cost) c_t.
• The return c_t is passed back to the learner, and the process is repeated.
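The learner-environment loop just described can be sketched in code. The following is a minimal tabular Q-learning example that runs the sense-act-transition-cost cycle on a toy two-state, two-action cost-minimizing system; the toy environment and all numeric values are illustrative assumptions, not the mobile system studied in this paper. It uses the standard update Q(x,a) <- Q(x,a) + eta [c + gamma min_b Q(y,b) - Q(x,a)], with a visit-count-based learning rate of the kind used later in the simulation studies:

```python
import random

# Toy cost-minimizing system (illustrative assumption, not from the paper):
# two states, two actions; the action matching the state index is free,
# the other action incurs unit cost; the next state is uniformly random.
N_STATES, N_ACTIONS = 2, 2

def step(x, a):
    """Environment: return (next_state, immediate_cost)."""
    cost = 0.0 if a == x else 1.0
    y = random.randrange(N_STATES)
    return y, cost

def q_learn(steps=20000, gamma=0.5, epsilon=0.1, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    visits = [[0] * N_ACTIONS for _ in range(N_STATES)]
    x = 0
    for _ in range(steps):
        # epsilon-greedy exploration: mostly take the current min-cost action
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = min(range(N_ACTIONS), key=lambda b: Q[x][b])
        y, c = step(x, a)
        visits[x][a] += 1
        eta = 1.0 / visits[x][a]          # learning rate ~ 1 / visit count
        target = c + gamma * min(Q[y])    # c + gamma * V(y)
        Q[x][a] += eta * (target - Q[x][a])
        x = y
    return Q

Q = q_learn()
# Greedy (minimum-cost) policy extracted from the learned Q-values.
policy = [min(range(N_ACTIONS), key=lambda a: Q[x][a]) for x in range(N_STATES)]
```

With persistent exploration and the decreasing 1/n(x,a) learning rate, the greedy policy settles on the cheap action in each state, illustrating the convergence conditions cited from [24] below.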
The objective of the learner is then to find an optimal policy pi*(x) for each x which minimizes some cumulative measure of the costs received over time. A particular measure, referred to as the total expected discounted return (cost) over an infinite time horizon, is given by

    V^pi(x) = E[ sum_{t=0}^inf gamma^t c(x_t, pi(x_t)) | x_0 = x ]    (5)

where E stands for the expectation operator and gamma (0 <= gamma < 1) is a discount factor. V^pi(x) is often called the value function of the state x. Equation (5) can be rewritten as [23]

    V^pi(x) = cbar(x, pi(x)) + gamma sum_y P_xy(pi(x)) V^pi(y)

where cbar(x, pi(x)) is the mean value of c(x, pi(x)). The optimal policy pi* satisfies Bellman's optimality criterion

    V*(x) = min_{a in A} [ cbar(x, a) + gamma sum_y P_xy(a) V*(y) ]    (6)

The task of Q-learning is to determine a pi* without knowing cbar and P_xy, which makes it well suited for the DCA problem. This is achieved by reformulating (6). For a policy pi, define a Q-value (or state-action value) as

    Q^pi(x, a) = cbar(x, a) + gamma sum_y P_xy(a) V^pi(y)

which is the expected discounted cost of executing action a at state x and then following policy pi thereafter. Let Q*(x, a) = Q^{pi*}(x, a). We then get

    V*(x) = min_{a in A} Q*(x, a)

Thus the optimal value function that satisfies Bellman's optimality criterion can be obtained from Q*(x, a), and pi* in turn may be expressed as

    pi*(x) = arg min_{a in A} Q*(x, a)

The Q-learning process tries to find Q*(x, a) in a recursive manner using the available information (x, a, y, c), where x and y are the states at time t and t+1, respectively, and a and c are the action taken at time t and the immediate cost due to a at x, respectively. The Q-learning rule is

    Q_{t+1}(x, a) = Q_t(x, a) + eta_t [ c + gamma V_t(y) - Q_t(x, a) ]    if x = x_t and a = a_t
    Q_{t+1}(x, a) = Q_t(x, a)                                             otherwise    (7)

where eta_t is the learning rate and V_t(y) = min_{b in A} Q_t(y, b).

It has been shown [24] that if the Q-value of each admissible (x, a) pair is visited infinitely often, and if the learning rate is decreased to zero in a suitable way, then Q_t(x, a) converges to Q*(x, a) with probability 1 as t -> inf. Q-learning is a method of asynchronous dynamic programming. However, unlike traditional dynamic programming, the Q-learning algorithm is model-free in the sense that its operation does not require knowledge of the state transition probabilities of the system, and it can be used in an on-line manner. In addition, Q-learning is computationally efficient: it does not maintain two memory structures, the evaluation function and the policy; rather, it maintains only one, namely, the estimated Q-value of taking action a at state x.

B. DCA-Q-Learning Formulation

The mobile communication system can be considered as a discrete-time event system. As shown in Fig. 3, without considering handovers, the major events which may occur in a cell are new call arrivals and call departures due to the completion of calls. These events are modeled as stochastic variables with appropriate probability distributions. In particular, new call arrivals in a cell are independent of all other arrivals and obey a Poisson distribution with a mean arrival rate lambda, as shown by

    P(n arrivals occur in time t) = (lambda t)^n e^{-lambda t} / n!    (8)

Fig. 3. Mobile communication system with a channel assignment scheme.

The interarrival time tau then has an exponential density, defined by

    f(tau) = lambda e^{-lambda tau}    (9)

The call-holding time is assumed to be exponentially distributed with a mean call duration 1/mu. The density function is given by

    f(t) = mu e^{-mu t}    (10)

To utilize the Q-learning scheme, it is necessary to formulate the DCA as a dynamic programming problem or, equivalently, to identify the system state x, the action a, the associated cost c, and the next state y.

1) State: Recall that it is assumed that there are N cells and M channels available in the mobile communication system. We define the state x_t at time t as

    x_t = (i, A(i, t))

where i is the cell index specifying that there is an event, either a call arrival or a departure, occurring in cell i, and A(i, t) is the number of available channels in cell i at time t, which depends on the channel usage conditions in cell i and in its interfering cells I(i).

To obtain A(i, t) at time t, we define the channel status for cell i as an M-dimensional vector

    u_i(t) = (u_i^1(t), u_i^2(t), ..., u_i^M(t))

where u_i^m(t) = 1 if channel m is in use in cell i and u_i^m(t) = 0 otherwise, for i = 1, ..., N and m = 1, ..., M. Furthermore, an availability vector v_i(t) = (v_i^1(t), ..., v_i^M(t)) is formed, with each component defined as v_i^m(t) = 1 if channel m is available for use in cell i and v_i^m(t) = 0 otherwise. Once the channel statuses in cell i and in its interfering cells are known, the availability vector can be formed easily, with the corresponding components obtained from

    v_i^m(t) = NOT( u_i^m(t) OR u_{j_1}^m(t) OR ... OR u_{j_L}^m(t) ),    j_1, ..., j_L in I(i)    (11)

where L denotes the number of interference cells of cell i, OR denotes the logical OR operation, and NOT denotes logical negation. A(i, t) is then given by the number of nonzero components of v_i(t).

2) Actions: Applying an action a_t means assigning a channel from the available channels to the current call request in cell i. Here, a_t is defined as a channel number a_t in {1, ..., M} for which v_i^{a_t}(t) = 1.

3) Costs: The cost c(x_t, a_t) assesses the immediate cost incurred due to the assignment of a_t at state x_t; more specifically, it is the cost of choosing channel a_t to serve the currently concerned call attempt in cell i. There are many possibilities for defining c(x_t, a_t). Here, we assess the cost of applying action a_t by evaluating the usage conditions of the chosen channel in the cochannel cells associated with cell i. The basic idea is to assign higher costs to those usages in which the cochannel cells are located farther away from cell i; thus, lower costs are associated with those usages in which the cochannel cells have the minimum compact distance. More specifically, c(x_t, a_t) is calculated by the following weighted sum:

    c(x_t, a_t) = w_1 n_1 + w_2 n_2 + w_3 n_3    (12)

In the above equation, n_1 is the number of compact cells, with reference to cell i, in which channel a_t is being used. Compact cells are the cells with the minimum average distance between cochannel cells [26]; in the case of the regular hexagonal layout shown in Fig. 1, the compact cells are located on the third tier, three cells apart. n_2 is the number of cochannel cells which are located on the third tier but are not compact cells and in which channel a_t is being used; n_3 is the number of other cochannel cells currently using channel a_t; and w_1, w_2, and w_3 are constant subcosts associated with the conditions related to n_1, n_2, and n_3, respectively. The ordering relation between w_1, w_2, and w_3 should be kept such that more compact usages incur lower cost, i.e., w_1 < w_2 < w_3, as was done in the simulation studies reported in the next section.

4) Next State: According to the definition of the state described above, the state transition from x_t to x_{t+1} is determined by two stochastic events, call arrivals and call departures; therefore, the next state can be obtained whenever one of these events occurs. However, in this paper only call arrivals are treated explicitly as sources that trigger state transitions in which actions, i.e., channel assignments, are required to be taken. Although call departures do alter the number of available channels, we will not carry out any actions for them (no intracell handover is considered here) except to release the channel on which a call has just been completed.

C. Algorithm Implementation

Having specified the state, action, cost, and next state, we are ready to describe a detailed implementation of the Q-learning algorithm for solving the DCA problem. Fig. 4 illustrates the structure of the Q-learning-based DCA system.

Fig. 4. Structure of Q-learning-based DCA.

As pointed out in Section III-A, Q-learning is an on-line learning scheme. In our case, this means that the task of learning a good assignment policy and that of assigning a channel to a call attempt can be performed simultaneously. The system using Q-learning, however, may work in a fashion consisting of two successive procedures, learning and assigning. The Q-values are first learned on-line over a sufficiently long time period,2 with the learned Q-values being stored in a representation mechanism. Then the task of on-line assignment is carried out by using the learned Q-values. Here, an important issue arises as to how to store the Q-values.

There exists a variety of approaches to representing the Q-values [2]. A lookup table is the most straightforward method. It has the advantage of being both computationally efficient and completely consistent with the structural assumption made in proving the convergence of the Q-learning scheme. However, when the input space consisting of state-action pairs is large or the input variables are continuous, using a lookup table can be prohibitive, because the memory requirement may be huge. In this case, a function approximator such as a neural network may be used in an efficient manner. As expected, a second learning (or training) procedure will then be involved, in which the network parameters such as weights are determined. In this report, both the lookup table and the neural network are considered as the representation mechanism.

The steps concerning learning and assigning, corresponding to Fig. 4, are given as follows.

Step 1: State-action construction. Construct the current state x_t by identifying the current cell number i and using the channel usage information associated with i and its interfering cells. Also, find the list of available channels, denoted by the set A(x_t); here we write A(x_t), instead of A(i, t), to signify explicitly the available channels corresponding to state x_t.

Step 2: Q-value retrieval. Form a set of augmented inputs (x_t, a), a in A(x_t), and feed them into the Q-value representation mechanism, thereby deriving a set of Q-values.

Step 3: Channel assignment. According to the definition of the Q-values, the optimal action, i.e., the optimal channel a*, is the one with the minimum Q-value

    a* = arg min_{a in A(x_t)} Q(x_t, a)    (13)

as indicated in Fig. 4.

Step 4: Q-value update. Update the Q-values once the next state y and the instant cost c become available. The target Q-value, denoted Q_target, according to (7) is

    Q_target = c + gamma min_{b in A(y)} Q(y, b)

where A(y) are the available channels at state y. The Q-value is updated according to the difference Q_target - Q(x_t, a*) and the chosen learning rate eta.

Step 5: Network parameter update. If the Q-values are stored in a neural network or any other type of function approximator, the second learning procedure (training) is necessary to learn the weight parameters associated with the network. In this case, Q_target - Q(x_t, a*) serves as an error signal which is backpropagated.

It can be seen that if the Q-values are learned and represented faithfully, the task of assignment with learning stopped can be very efficient, since in this case only the first four steps are involved.

IV. SIMULATION RESULTS

A. Issues Related to the Simulation

1) Simulated Model: The performance of the proposed DCA algorithm was evaluated by simulating a mobile communication system consisting of 49 hexagonal cells, as shown in Fig. 1. With the reuse distance D = sqrt(21) R, it turns out that if a channel is allocated to cell i in Fig. 1, it cannot be reused in the two surrounding tiers of adjacent cells because of unacceptable cochannel interference levels. Thus there are at most 18 interfering cells for a specified reference cell.

The assumptions and the parameters used in the simulation include the following.
• New call arrivals obey Poisson distributions with uniform and nonuniform mean interarrival times among the cells. The mean arrival rate can range from 20 calls/h to 250 calls/h in each cell.
• The call-holding time obeys an exponential distribution with a mean call duration 1/mu. Throughout this report, 1/mu = 180 s was used for all calls.
• The offered traffic in cell i is given by rho_i = lambda_i / mu.
• There are M channels available in the system, although the number of channels can vary.
• Blocked new and handover calls are dropped and cleared (Erlang B).

2 Here, on-line learning means that the learner interacts with the operating environment in a real-time fashion. However, the environment can be either a real system or a simulated network.
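To make the state construction and cost evaluation of Section III-B concrete, the sketch below computes the availability vector of (11) by OR-ing channel-status vectors over a cell's interference set, and the weighted-sum cost of (12) over precomputed cochannel tier sets. The flat cell layout, the tier sets, and the subcost values (1, 2, 3) are placeholder assumptions chosen only to respect the ordering w_1 < w_2 < w_3; the hexagonal geometry and the actual simulation parameters are not reproduced here.

```python
# Sketch of the availability vector (11) and the weighted-sum cost (12).
# Interference sets, tier sets, and subcosts are illustrative assumptions.

M = 8  # number of channels (assumed for illustration)

def availability(usage, cell, interferers):
    """usage[i][m] == 1 iff channel m is in use in cell i.
    Component m of the result is 1 iff channel m is in use neither in
    `cell` nor in any of its interfering cells, per (11)."""
    return [
        0 if (usage[cell][m] or any(usage[j][m] for j in interferers[cell])) else 1
        for m in range(M)
    ]

def cost(channel, usage, compact, third_tier, other, w=(1.0, 2.0, 3.0)):
    """Weighted sum c = w1*n1 + w2*n2 + w3*n3, where n1 counts compact
    cells, n2 non-compact third-tier cells, and n3 the remaining
    cochannel cells currently using `channel`."""
    n1 = sum(usage[j][channel] for j in compact)
    n2 = sum(usage[j][channel] for j in third_tier)
    n3 = sum(usage[j][channel] for j in other)
    return w[0] * n1 + w[1] * n2 + w[2] * n3

# Example: four cells, where cell 1 interferes with cell 0.
interferers = {0: [1], 1: [0], 2: [], 3: []}
usage = [[0] * M for _ in range(4)]
usage[0][0] = 1   # cell 0 uses channel 0
usage[1][1] = 1   # cell 1 (an interferer of cell 0) uses channel 1
avail = availability(usage, 0, interferers)  # channels 0 and 1 unavailable
```

In the full algorithm, the availability vector supplies both the action set A(x_t) and the channel count A(i, t) of the state, while the cost feeds the Q-value update of Step 4.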
2) Performance Evaluation: The performance of a chan- nel assignment algorithm at a particular trafﬁc loading was assessed by measuring the new call-blocking probability , given by number of blocked calls in a cell (14) number of new call arrivals at that cell Because Erlang B is assumed, the performance of the DCA Fig. 6. Nonuniform trafﬁc distribution: Case 1. can be readily compared with that of the FCA. The blocking probability in cell in the case of FCA is given by imminent future event, which can be a call arrival or a call departure. To this end, it is necessary to maintain dynamically a list of future events. If the event occurring is a call arrival, a (15) set of steps described in Section III-C is performed, resulting in either the call being blocked or served by a channel. If necessary, learning is carried out. On the other hand, if the event occurring is a call departure, the occupied channel is where and are the offered trafﬁc and the number released. After the event is processed accordingly, the channel (ﬁxed) of available channels in cell However, notice that the usage information in each cell is updated and the time clock is blocking probabilities of using the FCA in various conditions advanced. To calculate the system performance, the number of described in the next subsection were calculated by operating new call arrivals and the number of blocked calls are recorded. the simulated system instead of using the above formula. 3) Simulation Procedures: To simulate the mobile com- munication system as a discrete-event dynamic system, a B. Results simulation clock is maintained. It gives the current value of A set of simulations were carried out, including the cases of simulated time of the whole system. The simulation clock is homogeneous and inhomogeneous trafﬁc distributions, time- advanced according to the time of occurrence of the most varying trafﬁc patterns, and channel failures. For the purpose 1450 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 
10, NO. 6, NOVEMBER 1999 Fig. 7. Performance comparison with nonuniform trafﬁc distribution (case 1): FCA by 3, MAXAVAIL by , Q-learning with table by +, Q-learning 2 with neural network by : of comparison, the results due to the FCA and the maximum on-line for 30 simulated hours by using the back-propagation availability based-DCA algorithm, MAXAVAIL [20], were algorithm in conjunction with the Q-learning. The learning rate included. The reason for selecting the MAXAVAIL is that it and momentum gains for network training were 0.3 and 0.9, has been claimed to be one of the best DCA algorithms in the respectively. The trained network was then used to select a sense that its performance is close to the best achievable in desired channel in response to a call attempt. this class of channel assignment algorithms where no intracell Fig. 5 shows the blocking probabilities of using the Q- handovers are involved. learning with the table structure (marked by “ ”), and with 1) Uniform Distribution: In this case, the trafﬁc load was the neural network structure (marked by “ ”). The results due assumed to be the same among all 49 cells. Six different to FCA (marked by “ ”), and MAXAVAIL (marked by “ ”) ’s in Erlangs were used, being 5, 6, 7, 8, 9, and 10 which are also shown. For the FCA scheme, each cell was assigned are equivalent, respectively, to call arrival rates of 100, 120, channels because a seven-cell cluster pattern was 140, 160, 180, and 200 calls/h. Two Q-value representation assumed. The testing time for all the algorithms was ﬁve mechanisms were considered. In the ﬁrst place, a three- simulated hours. dimensional lookup table was used. The Q-values were learned It can be seen from Fig. 5 that the Q-learning-based DCA by running the simulated mobile communication system for 30 performs better than the FCA although the improvement simulated hours with a constant arrival rate being 120 calls/h. 
degree gained by the DCA decreases slightly with the increase The discount factor was chosen to be 0.5 and the learning in trafﬁc load. For the interesting range of blocking probability rate was designed to be state-action pair varying with time. 2% to 6%, an increase in carried trafﬁc of 20% can be obtained. More speciﬁcally, each state-action was associated with Compared with the MAXAVAIL scheme, we conclude that a learning rate which was inversely proportional to the Q-learning-based DCA strategies are able to achieve a the frequency of the being visited up to the performance similar to that achieved by the MAXAVAIL. present time. That is, with However, the computational complexities are quite different. (if is visited) and The This issue will be discussed in some details in Section IV-C. parameters in cost evaluation of (12) were 2) Nonuniform Distribution: Fig. 6 shows a case [25] in and The learned table was then used to assign the which the trafﬁc densities in terms of calls/h are inhomoge- desired channel in the same communication system but with neously distributed among 49 cells. The average arrival call six different trafﬁc load conditions. rate is 91.83 calls/h. Fig. 7 shows the blocking probabilities of The same procedures were applied to the situation where a using the four methods described in the uniform case against multilayer neural network [13] was used to represent the Q- the arrival rates which were increased by 0, 20, 40, 60, 80, values. The network with three inputs representing state-action 100 percent over the base rates given in Fig. 6. Fig. 7 indicates values, eight nonlinear hidden units with sigmoid functions, some signiﬁcant improvements of the DCA algorithm over the and one linear output unit representing the Q-value was trained FCA scheme, namely about 50% increase in the trafﬁc load NIE AND HAYKIN: DYNAMIC CHANNEL ASSIGNMENT POLICY 1451 Fig. 8. 
Fig. 8. Performance comparison with nonuniform traffic distribution (case 2): FCA, MAXAVAIL, Q-learning with table, and Q-learning with neural network.

This is somewhat expected, because the DCA scheme operates on a call-by-call basis and is thus able to adapt to spatially nonuniform situations, whereas for the FCA to perform well the traffic in the system should be as homogeneous as possible. We notice that the Q-learning-based DCA, whether using the table or the neural network, again performed as well as the MAXAVAIL did. It is interesting to observe that neither the table nor the neural network was relearned or retrained; the Q-values learned in the uniform case were reused.

Fig. 8 gives another example, in which the base traffic loads are given in Fig. 9 [25], with an average arrival rate of 106.53 calls/h. As expected, the DCA schemes in this case did not perform as well as in the case of Fig. 7 in terms of the degree of improvement over the FCA approach. This is partly because the traffic loads were higher than those of Fig. 7.

Fig. 9. Nonuniform traffic distribution: Case 2.

3) Time-Varying Traffic Load: The traffic load in telephony systems is typically time-varying. Fig. 10 shows a pattern of call arrivals during a typical business day, from 0:00 to 23:00 hours [12]. It can be seen that the peak hours occur around 11:00 and 16:00. Fig. 11 gives the simulation results under the assumption that the traffic load was spatially uniformly distributed among the 49 cells (maximum 165 calls/h) but followed the time-varying pattern given in Fig. 10. The blocking probabilities were calculated on an hour-by-hour basis. The result obtained using the Q-learning with the table structure is shown in Fig. 11(a), whereas that due to the FCA approach is shown in Fig. 11(b). The improvement of the Q-learning-based DCA over the FCA is apparent. For example, the number of hours at which the blocking probabilities were over 4% is two in Fig. 11(a), whereas that number is four in Fig. 11(b).

Fig. 10. A traffic pattern of a typical business day.
Fig. 11. Performance with temporally varying and spatially uniform traffic: (a) Q-learning; (b) FCA.

We also examined the case in which the traffic loads were both spatially nonuniformly distributed and temporally varying. Fig. 12 gives the results due to the Q-learning with the table structure [Fig. 12(a)] and the FCA [Fig. 12(b)]. The spatial distribution was in accordance with that given in Fig. 9, and the temporal distribution was consistent with that given in Fig. 10. As expected, a more significant improvement in terms of blocking probability was seen in this case than in the uniform distribution case. In particular, if a 4% blocking probability is again set as a threshold, the number of hours exceeding that threshold is four in Fig. 12(a) and ten in Fig. 12(b).

Fig. 12. Performance with temporally varying and spatially nonuniform traffic: (a) Q-learning; (b) FCA.

4) Equipment Failure and On-Line Behavior: In a mobile communication system, equipment failure may occur during normal operating hours. To simulate this situation, we assumed that various equipment failures result in some frequency channels being temporarily unavailable. Fig. 13 gives an example in which the effect of channel failure on the system blocking probability was demonstrated under the Q-learning-based scheme with the table representation structure. The call arrival rate was 180 calls/h in all the cells. There were 70 channels available initially, and between 10:00 and 15:00 either zero (solid line), three (dotted line), five (dashed line), or seven (dash-dotted line) channels were temporarily shut down and thus not available for use. Comparing the results, it appears that the channel assignment algorithm possesses a certain robustness to channel failure, particularly when the number of failed channels is small, e.g., three to five.

Fig. 13. Robustness to channel failure: zero channels (solid line); three channels (dotted line); five channels (dashed line); seven channels (dash-dotted line).
Fig. 14. On-line behavior of the Q-learning: (a) blocking probability curve; (b) average arrival rate with nonuniform distribution of case 1.

Finally, we examined the on-line behavior of the Q-learning-based DCA, in the sense that both the learning and the assigning operations were carried out simultaneously. Fig. 14(a) shows one of the results, where the blocking probability was computed cumulatively over two days (48 h). The call arrival rates were nonuniformly distributed as shown in Fig. 9, with the average varying according to Fig. 14(b). Some improvement due to on-line learning can be seen in Fig. 14(a), in the sense that the accumulated blocking probabilities during the second day were generally lower than those during the first day. A similar behavior was observed in another case, shown in Fig. 15(a), where the call arrival rates were nonuniformly distributed as shown in Fig. 9, with the average varying according to Fig. 15(b).
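The bookkeeping behind these experiments, hour-by-hour blocking statistics and a window during which failed channels are withdrawn, can be illustrated with the small sketch below. The function names, the call traces, and the boundary convention for the shutdown window are assumptions for illustration only; the actual simulator models 49 cells with interference constraints.

```python
# Toy sketch of the blocking-probability bookkeeping and the
# channel-failure scenario. Not the paper's simulator.

def hourly_blocking(attempts, blocked):
    """Blocking probability per hour: blocked calls / attempted calls."""
    return [b / a if a else 0.0 for a, b in zip(attempts, blocked)]

def available_channels(all_channels, failed, hour):
    """Channels usable at a given hour; the `failed` ones are assumed
    shut down between 10:00 and 15:00, as in the robustness experiment."""
    if 10 <= hour < 15:
        return [c for c in all_channels if c not in failed]
    return list(all_channels)
```

For example, with 70 channels and three failed ones, 67 channels remain usable at noon, while all 70 are usable outside the shutdown window.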
Fig. 15. On-line behavior of the Q-learning: (a) blocking probability curve; (b) average arrival rate with nonuniform distribution of case 2.

C. Computational Issues

The results given in Figs. 5, 7, and 8 suggest that the Q-learning-based DCA strategies are able to achieve a performance similar to that achieved by the MAXAVAIL. However, the computational complexities are quite different. In the process of assigning a channel, the complexity of using a table or a neural network depends primarily on the number of channels or, more precisely, on the number of available channels, since comparisons among the corresponding Q-values are needed to make a decision. To obtain the individual Q-values in the case of the table representation, it is a matter of index addressing, which can be very fast. In the case of the neural-network representation, it depends on the size of the network; in our case, approximately 64 operations (multiplications or additions) were required per Q-value. (It should be pointed out that this approximate count does not include the eight sigmoid nonlinear operations on the eight hidden units.) Notice that the network size is independent of the number of channels and of the number of cells. Therefore, the total number of operations needed to assign a channel grows linearly with the number of available channels for both the table representation and the neural-network case, as shown in Table I. As an example, 19 and 649 operations (comparisons, additions, or multiplications) will be needed in the table and neural-network cases, respectively, if we assume that ten channels are available.

TABLE I. NUMBER OF OPERATIONS REQUIRED FOR THREE DCA SCHEMES.

The complexity of the MAXAVAIL scheme depends on the number of channels, the number of cells, and the number of interfering cells. Besides the comparisons, for each available channel the availability of that channel in each of the cells is checked. For each cell, the interfering cells (in our case as many as 18) have to be visited to determine the channel status in that cell, requiring roughly a fixed number of logical-OR and addition operations per visit. Thus, the total number of operations needed to assign a channel is as given in Table I. If we assume again that ten channels are available, the number of operations using the MAXAVAIL scheme is considerably larger than for either Q-learning representation.

In terms of storage requirements, however, the MAXAVAIL method requires the lowest number of memory units, since it does not need to memorize much knowledge. The table-based Q-learning requires the largest number of memory units, in our case one for each state-action pair, whereas only enough memory to store the network weights is needed in the case of the neural-network-based Q-learning approach. It should be mentioned that it is quite possible to reduce the storage requirement of the table-based Q-learning by using a localized network such as a CMAC, CPN, or RBF network.
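The linear operation counts discussed above can be reproduced with the sketch below. The formulas are reconstructed to be consistent with the quoted example (19 and 649 operations for ten available channels), under the assumption that one table lookup costs a single indexing operation and that one forward pass of the 3-8-1 network costs about 64 multiplications and additions, sigmoid evaluations excluded; they are not taken verbatim from Table I.

```python
# Reconstructed per-assignment operation counts for the two Q-value
# representations, with m the number of available channels.

def ops_table(m):
    """m table lookups plus m - 1 comparisons to pick the best channel."""
    return m + (m - 1)

def ops_network(m, inputs=3, hidden=8):
    """Per channel: the multiplications and additions of one forward pass
    through an inputs-hidden-1 network (biases included, sigmoids
    excluded), followed by m - 1 comparisons overall."""
    per_q = 2 * (inputs * hidden + hidden)  # 64 for the 3-8-1 network
    return m * per_q + (m - 1)
```

With ten available channels these give 19 and 649 operations, matching the example in the text, and both counts scale linearly in the number of available channels while being independent of the number of cells.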
V. CONCLUSION

We have described a novel approach to the problem of dynamic channel assignment. The optimal assignment policy is obtained by using a self-learning scheme based on Q-learning. The real-time simulation studies carried out in a 49-cell mobile communication system have demonstrated that the proposed approach is a practical alternative to existing schemes. In particular, comparative studies with the FCA and the MAXAVAIL-based DCA algorithm have suggested that the Q-learning-based DCA is able to perform better than the FCA in a variety of situations, including traffic loads that are spatially uniformly distributed, spatially nonuniformly distributed, and time varying. Also, the new approach is capable of achieving a performance similar to that achieved by one of the best known DCA algorithms, MAXAVAIL, while its on-line computational efficiency is far better than that of the MAXAVAIL. This is a definite advantage of our approach, since time efficiency can be a critical issue in real-time implementations.

While the current results seem encouraging, there certainly exist some issues worth pursuing further. First, some practical matters must be considered if the approach is to be implemented in a real system; they include the problem of scaling to larger systems with large numbers of cells and channels, and distributed implementation in each base station. Second, to use the DCA algorithm more efficiently, some limited number of intracell handovers may be considered so as to create more favorable conditions for future assignments. The third point that warrants investigation is how to introduce fuzzy concepts and algorithms [19] into the learning or computing procedures. For example, the interference conditions may be expressed in fuzzy terms, leading to soft constraints. This makes sense because the coverage of a cell in reality is not clearly defined; cell boundaries are fuzzy and overlap one another to some degree. Finally, it may be worthwhile to explore the possibility of keeping the table structure for representing the Q-values while reducing its storage requirement by means of a localized neural-network configuration.

ACKNOWLEDGMENT

The authors wish to thank the anonymous reviewers for their valuable comments and suggestions, which have helped us to improve the quality of this paper.

REFERENCES

[1] A. Baiocchi, F. D. Priscoli, F. Grilli, and F. Sestini, "The geometric dynamic channel allocation as a practical strategy in mobile networks with bursty user mobility," IEEE Trans. Veh. Technol., vol. 44, pp. 14–23, 1995.
[2] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intell., vol. 72, pp. 81–138, 1995.
[3] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[4] P. T. Chen, M. Palaniswami, and D. Everitt, "Neural network-based dynamic channel assignment for cellular mobile communication systems," IEEE Trans. Veh. Technol., vol. 43, pp. 279–288, 1994.
[5] J. Chuang, "Performance issues and algorithms for dynamic channel assignments," IEEE J. Select. Areas Commun., vol. 11, pp. 955–963, 1993.
[6] D. C. Cox and D. O. Reudink, "Dynamic channel assignment in two-dimensional large mobile radio systems," Bell Syst. Tech. J., vol. 51, pp. 1611–1627, 1972.
[7] R. H. Crites and A. G. Barto, "Improving elevator performance using reinforcement learning," in Advances in Neural Information Processing Systems 8, 1996.
[8] E. Del Re, R. Fantacci, and L. Ronga, "A dynamic channel allocation technique based on Hopfield neural networks," IEEE Trans. Veh. Technol., vol. 45, pp. 26–32, 1996.
[9] D. D. Dimitrijevic and J. Vucetic, "Design and performance analysis of the algorithms for channel allocation in cellular networks," IEEE Trans. Veh. Technol., vol. 42, pp. 526–534, 1993.
[10] M. Duque-Anton, D. Kunz, and B. Ruber, "Channel assignment for cellular radio using simulated annealing," IEEE Trans. Veh. Technol., vol. 42, pp. 14–21, 1993.
[11] D. Everitt, "Traffic engineering of the radio interface for cellular mobile networks," Proc. IEEE, vol. 82, pp. 1371–1382, 1994.
[12] R. L. Freeman, Telecommunication System Engineering, 3rd ed. New York: Wiley, 1996.
[13] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[14] S. Haykin and J. Nie, "A preliminary investigation on channel assignment problem in mobile communication systems," Commun. Res. Lab., McMaster Univ., Hamilton, Ont., Canada, Tech. Rep., 1996.
[15] D. Kunz, "Channel assignment for cellular radio using neural networks," IEEE Trans. Veh. Technol., vol. 40, pp. 188–193, 1991.
[16] W. C. Y. Lee, Mobile Cellular Telecommunications. New York: McGraw-Hill, 1995.
[17] W. K. Lai and G. G. Coghill, "Channel assignment through evolutionary optimization," IEEE Trans. Veh. Technol., vol. 45, pp. 91–96, 1996.
[18] V. H. MacDonald, "The cellular concept," Bell Syst. Tech. J., vol. 58, pp. 15–41, 1979.
[19] J. Nie and D. A. Linkens, Fuzzy-Neural Control: Principles, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[20] K. N. Sivarajan, R. J. McEliece, and J. W. Ketchum, "Dynamic channel assignment in cellular radio," in Proc. 40th Veh. Technol. Conf., 1990, pp. 631–637.
[21] S. Tekinay and B. Jabbari, "Handover and channel assignment in mobile cellular networks," IEEE Commun. Mag., pp. 42–46, Nov. 1991.
[22] G. Tesauro, "Practical issues in temporal difference learning," Machine Learning, vol. 8, pp. 257–277, 1992.
[23] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 1989.
[24] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[25] M. Zhang and T. S. Yum, "Comparisons of channel assignment strategies in cellular mobile systems," IEEE Trans. Veh. Technol., vol. 38, pp. 211–215, 1989.
[26] M. Zhang and T. S. Yum, "The nonuniform compact pattern allocation algorithm for cellular mobile systems," IEEE Trans. Veh. Technol., vol. 40, pp. 387–391, 1991.
Junhong Nie received the B.S. and M.S. degrees from Northwest Telecommunications Engineering Institute (NTEI), Xi'an, China, and the Ph.D. degree from the University of Sheffield, U.K., all in electrical engineering.
He served as a Lecturer in the Electrical Engineering Department of NTEI from 1985 to 1989. He was a Research Scientist in the Department of Electrical Engineering, National University of Singapore, from 1993 to 1995, and a Senior Research Engineer in the Communications Research Laboratory of McMaster University, Canada, from 1996 to 1997. In 1998 he joined Nortel Networks, where he works on advanced wireless communication systems. He has published one book and more than 50 articles, 30 of which appeared in internationally refereed journals and book chapters.

Simon Haykin (F'82) received the B.Sc. degree with First-Class Honors in 1953, the Ph.D. degree in 1956, and the D.Sc. degree in 1967, all in electrical engineering from the University of Birmingham, U.K.
He is the founding Director of the Communications Research Laboratory at McMaster University, Hamilton, Ontario. In 1996 he was awarded the title "University Professor." His research interests include nonlinear dynamics, neural networks, adaptive filters, and their applications in radar and communication systems.
In 1980, Dr. Haykin was elected Fellow of the Royal Society of Canada. He was awarded the McNaughton Gold Medal, IEEE (Region 7), in 1986. He is a recipient of the Canadian Telecommunications Award from Queen's University. He is the Editor for Adaptive and Learning Systems for Signal Processing, Communications and Control, a new series of books for Wiley-Interscience.