

									IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 6, NOVEMBER 1999                                                                            1443

                          A Dynamic Channel Assignment
                            Policy Through Q-Learning
                                          Junhong Nie and Simon Haykin, Fellow, IEEE

   Abstract—One of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited, efficient utilization of these communication channels through efficient channel assignment strategies is not only desirable but also imperative. This paper presents a novel approach to solving the dynamic channel assignment (DCA) problem by using a form of real-time reinforcement learning known as Q-learning in conjunction with a neural network representation. Instead of relying on a known teacher, the system is designed to learn an optimal channel assignment policy by directly interacting with the mobile communication environment. The performance of the Q-learning-based DCA was examined by extensive simulation studies on a 49-cell mobile communication system under various conditions. Comparative studies with the fixed channel assignment (FCA) scheme and one of the best dynamic channel assignment strategies, MAXAVAIL, have revealed that the proposed approach is able to perform better than FCA in various situations and is capable of achieving a performance similar to that of MAXAVAIL, but with a significantly reduced computational cost.

                          I. INTRODUCTION

   ONE of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited and the population of mobile users is increasing dramatically, efficient utilization of the available communication resources through efficient channel assignment strategies is not only desirable but also imperative. In a cellular mobile communication system, the service area is divided into a number of subareas called cells, with each cell being served by a base station which handles all calls made by mobile users within the cell. An essential feature of the cellular concept is channel reuse [16], [18]; that is, a single radio channel may be used simultaneously in a number of physically separated cells, provided that a cochannel interference constraint is satisfied. The channel assignment problem involves efficiently assigning channels to each radio cell (or call) in a cellular mobile system in such a way that the probability that incoming calls are blocked and the probability that the carrier-to-interference ratio falls below a prespecified value are both sufficiently low. In other words, the problem revolves around how the limited resource (channels) can be utilized with maximum efficiency.

   The existing channel assignment methods may be roughly classified into fixed and dynamic schemes [18], [21]. In the fixed channel assignment (FCA) scheme, a set of channels is allocated to each cell permanently by a frequency planning process. In contrast to FCA, in dynamic channel assignment (DCA) schemes all the channels are available in all the cells, and channels are assigned to cells only when they are required; there are no fixed relationships between cells and channels. In other words, channel assignment is carried out on a call-by-call basis in a dynamic manner. A number of FCA approaches exist, ranging from simple heuristic ones to more mathematically involved ones in which various conventional or nonconventional optimization schemes are applied, including neural networks, genetic algorithms, and simulated annealing [10], [15], [17]. Likewise, a number of DCA schemes have been proposed [4]–[5], [8]–[9], [11], [20], [25]–[26]. It has been concluded that DCA performs better than FCA in terms of blocking probability in the case of nonuniform traffic and light to moderate traffic load. However, the implementation complexity of previously known DCA schemes is generally higher than that of FCA.

   This paper proposes an alternative approach to solving the dynamic channel assignment problem. The optimal dynamic assignment policy is obtained through a form of real-time reinforcement learning [2], [3] known as Q-learning [23]. The scheme is based on the judgment that DCA can be regarded as a large-scale constrained dynamic optimization problem in a stochastic environment, and that learning is one of the effective ways to find a solution to this problem. Instead of relying on a known teacher providing a correct output in response to an input, the system is designed to learn an optimal policy by directly interacting with the environment in which it works, a mobile communication environment in our case. Learning is accomplished progressively by appropriately utilizing the past experience obtained during real-time operation. The performance of the Q-learning-based DCA was examined by extensive simulation studies on a 49-cell mobile communication system under various conditions, including homogeneous and inhomogeneous traffic distributions, time-varying traffic patterns, and channel failures. We also carried out comparative studies with the FCA scheme and one of the best DCA strategies, MAXAVAIL [20].1

   Manuscript received December 30, 1996; revised February 8, 1999. This work was supported by Motorola, Vancouver, BC, under an ARRC/McMaster Research Grant.
   The authors are with the Communications Research Laboratory, McMaster University, Hamilton, Ont., Canada.
   Publisher Item Identifier S 1045-9227(99)09448-5.
   1 It is noteworthy that reinforcement learning has already been applied to other large-scale problems such as backgammon game-playing [22] and elevator dispatching [7].
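For orientation, the standard hexagonal-layout relation behind fixed channel assignment (detailed in Section II) can be sketched numerically. This is an illustration, not code from the paper: the relation $K = \sigma^2/3$ between the cochannel reuse ratio $\sigma = D/R$ and the cluster size $K$ is the textbook result the paper draws on, and the 70-channel system size is only an example value.

```python
# Standard hexagonal-layout relations (illustrative sketch, not from the paper):
#   sigma = D / R          cochannel reuse ratio
#   K     = sigma**2 / 3   cluster size (cells per reuse cluster)
#   m     = M // K         channels each cell owns under FCA

def cluster_size(sigma: float) -> int:
    """Cluster size for reuse ratio sigma, rounded to the nearest integer."""
    return round(sigma ** 2 / 3)

def fca_channels_per_cell(total_channels: int, sigma: float) -> int:
    """Channels permanently allocated to each cell under FCA."""
    return total_channels // cluster_size(sigma)

# A C/I floor of 18 dB corresponds to sigma of about 4.6, i.e. the familiar
# 7-cell cluster; with M = 70 system channels, each cell then owns 10.
print(cluster_size(4.6), fca_channels_per_cell(70, 4.6))
```

Under FCA this per-cell allotment is a hard ceiling regardless of where traffic actually concentrates, which is the limitation DCA targets.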

                                                       1045–9227/99$10.00 © 1999 IEEE

Fig. 1. Cell i and its interference cells.

   This paper is organized as follows. In Section II, the problem of channel assignment is stated and some existing assignment schemes are briefly reviewed. The proposed approach for dynamic channel assignment is described in Section III, together with implementation details. Section IV is devoted to reporting simulation results of applying the proposed scheme to a 49-cell mobile communication system. Various communication environmental conditions were considered, including comparative studies with the FCA and MAXAVAIL algorithms. The paper ends with some concluding remarks summarizing our findings and outlining future work.

         II. CHANNEL ASSIGNMENT PROBLEM AND EXISTING SCHEMES

A. Problem Description

   A core concept in a cellular communication system is frequency reuse; that is, the radio channels on the same carrier frequency can be used repeatedly by mobile users in different cells, provided that the cells using the same channel are separated by a sufficient distance. Such cells are referred to as cochannel cells, and the interference created by using the same channel simultaneously is known as cochannel interference, which is considered to be a major constraint in the channel-assignment task. The cochannel interference is a function of a parameter known as the cochannel reuse ratio, defined by [16]

        $\sigma = D/R$                                                (1)

where $D$, known as the frequency reuse distance, is the distance between the centers of the nearest neighboring cochannel cells, and $R$ is the cell radius, as shown in Fig. 1, where a regular hexagonal cellular layout is assumed.
   Assuming that the minimum acceptable carrier-to-interference ratio (C/I) is 18 dB, the minimum reuse distance has been found to be $D = 4.6R$ [16]. This means that cochannel cells in Fig. 1 should be at least three cells apart.
   The channel assignment problem may be simply stated as follows. Assume that there are $N$ cells in a mobile communication system, each of which is served by a base station located at the center of the cell. Also, we are given a set of $M$ noninterfering radio channels, implying that adjacent-channel interference is neglected. Then the channel-assignment task concerns how to assign the $M$ channels to the $N$ cells and to individual calls, subject to the cochannel interference constraint. It is evident that frequency reuse and cochannel interference are the two major issues involved in solving the problem.

B. Fixed Channel Assignment Scheme

   In FCA, a subset of the channels available to the radio system is permanently allocated to each cell, so there is a definite relationship between a channel and a cell. A channel can be associated with many cells as long as the cochannel interference constraint is satisfied or, equivalently, any two such cells are located at least a cochannel reuse distance $D$ apart. In other words, two cells at distance $D$ or more may be allocated the same subset of $m$ channels. The number of channels $m$ can be determined by

        $m = M/K$                                                (2)

where $K$ is the cluster size, which for a hexagonal layout is given by

        $K = \sigma^2/3 = (1/3)(D/R)^2.$

For example, for a reuse distance $D = 4.6R$, $K = 7$, and the number of channels allocated to each cell is $m = M/7$.
   To associate channels with cells, a frequency planning process is used in FCA. Once this is done, the relationship of channels to cells is fixed. This means that a call attempt in a cell can only be served by one of the channels in the subset allocated to that cell. Consequently, if all the channels in that subset are in use, a new call attempt in the cell will be blocked even though there may be unoccupied channels in adjacent cells.
   Although FCA is relatively simple to operate, it has some potential drawbacks. For example, it cannot handle unpredicted time-varying traffic patterns, such as those caused by traffic jams and car accidents, because the capacity it can provide is fixed. Also, frequency planning may become more difficult and tedious in a microcellular environment, since more accurate knowledge of traffic and interference conditions is required. Dynamic channel assignment is one of the solutions to the problems encountered in FCA.

C. Dynamic Channel Assignment

   The main feature of DCA is that all the channels are available in all the cells, and channel assignment is carried out on a call-by-call basis in a dynamic manner. Therefore, traffic variability can be adapted to automatically. This can potentially lead to improved performance, particularly if the spatial traffic profile is unknown, poorly known, or time-varying.
   The problem a DCA scheme tries to deal with may be described as follows. Assume that there are $N$ cells and $M$ channels in a mobile system. Referring to Fig. 1, let $i$ denote a cell and $I(i)$ the set of cells interfering with $i$, i.e., those

neighborhood cells that lie at a distance less than the reuse distance $D$. Let $A(i,t)$ be the set of all channels available at time $t$ in cell $i$; a channel $j$ is said to be available if it is being used neither in cell $i$ nor in any cell of $I(i)$. Now the problem is: when a new call arrives at cell $i$, how do we choose a channel from $A(i,t)$ for the call? Obviously, if $A(i,t) = \emptyset$ and no rearrangement (intracell hand-off) with respect to ongoing calls is permitted, the new call will be blocked. On the other hand, if more than one channel is available, a selection strategy should be used.
   A number of DCA algorithms have been proposed. A critical review of DCA may be found in our previous report [14]. Here, only two types of strategies, namely exhaustive-searching DCA and neural-network-based DCA, are briefly described, because they are relevant to our approach.
   The strategies in the exhaustive-searching DCA group share the following common features. Each available channel in cell $i$, say $j \in A(i,t)$, has a cost (reward) $C(i,j)$ associated with it. When a new call is attempted, the system searches exhaustively for the channel $j^*$ with minimum cost (maximum reward)

        $j^* = \arg\min_{j \in A(i,t)} C(i,j).$                                                (3)

Then, channel $j^*$ is assigned to the new call. Several criteria, including maximum availability, maximum interferers, and minimum damage, have been used. The maximum availability strategy, known as MAXAVAIL [20], has been claimed to produce the best performance in the case where no intracell handovers are involved. The idea is to select the channel $j$ from $A(i,t)$ which maximizes the total number of channels available in the entire system, $F(j)$, defined by

        $F(j) = \sum_{k \in U} |A(k,t)|$, given that channel $j$ is assigned to cell $i$                                                (4)

where $U$ is the set of cells in the system. Notice that the computational load for calculating $F(j)$ can be high, because the number of available channels resulting from the assignment of $j$ must be calculated for every channel and every base station.
   The DCA problem may also be solved by neural-network-based approaches. For example, a Hopfield neural network was used in [8]. An energy function associated with a particular cell is formulated by incorporating factors such as interference constraints, traffic requirements, and packing conditions. Corresponding to this energy function, a Hopfield neural network is constructed. When a new call arrives in cell $i$, an equilibrium point of the network is found by solving the corresponding dynamic equation iteratively. The stable states (zero or one) of the neurons represent the desired solution. Another possibility is to use a multilayer feedforward neural network (MFNN) [4]. By providing training data, an MFNN is trained to behave as a specific DCA scheme does. After the neural network is trained using the backpropagation algorithm, it is used in real time to give a desired channel number in response to a new call request.

Fig. 2. An illustration of learner-environment interaction.

        III. SOLVING THE DCA PROBLEM THROUGH Q-LEARNING

   Conventional DCA strategies, as described in the last section, completely ignore the experience or knowledge that could be gained during real-time operation of the system. Although the neural-network-based approach does have a training stage, it is crucial to have a good teacher (a known DCA algorithm) to guide the training. On the other hand, exhaustive-searching approaches are generally time-consuming and thus inefficient. Here, we propose an alternative approach to solving the channel assignment problem. By regarding DCA as a large-scale constrained dynamic optimization problem embedded in a stochastic environment, we may obtain an optimal assignment policy through an effective learning scheme in which learning is accomplished progressively by appropriately utilizing the past experience gained during real-time operation.
   Learning without a teacher is difficult. The particular learning paradigm we have adopted is known as reinforcement learning (RL) [2]. In RL, a learner aims at learning an optimal control policy by repeatedly interacting with the controlled environment in such a way that its performance, evaluated by a scalar reward (cost) obtained from the environment, is maximized (minimized). The RL algorithms developed so far are closely related to the well-known dynamic programming (DP) procedure developed some decades ago by Bellman [3]. There exists a variety of RL algorithms. A particular algorithm that appears to be suitable for the DCA task is called Q-learning [23]. In what follows, we first describe the algorithm briefly and then present the details of how the DCA problem can be solved by means of Q-learning.

A. Q-Learning Algorithm

   Assume that the environment with which a learner interacts is a finite-state, discrete-time stochastic dynamical system, as shown in Fig. 2. Let $X = \{x_1, x_2, \ldots\}$ be the set of possible states and $A = \{a_1, a_2, \ldots\}$ be the set of possible actions. The interaction between the learner and the environment at each time instant consists of the following sequence.
   • The learner senses the state $x_t \in X$.
   • Based on $x_t$, the learner chooses an action $a_t \in A$ to perform.
   • As a result, the environment makes a transition to the new state $y = x_{t+1} \in X$ according to the transition probability $P_{xy}(a_t)$, and thereby generates a return (cost) $r_t$.
   • The return $r_t$ is passed back to the learner, and the process is repeated.
The objective of the learner is then to find an optimal policy $\pi^*(x)$ for each $x$, which minimizes some cumulative measure of the costs $r(x_t, a_t)$ received over time. A particular measure, referred to as the total expected discounted return (cost) over an infinite time horizon, is given by

        $J^\pi(x_0) = E\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, \pi(x_t))\big]$                                                (5)

where $E$ stands for the expectation operator and $\gamma$ ($0 \le \gamma < 1$) is a discount factor. $J^\pi(x)$ is often called the value function of the state $x$.
   Equation (5) can be rewritten as [23]

        $J^\pi(x) = \bar r(x, \pi(x)) + \gamma \sum_{y \in X} P_{xy}(\pi(x)) J^\pi(y)$

where $\bar r(x, \pi(x)) = E[r(x, \pi(x))]$ is the mean value of $r(x, \pi(x))$. The optimal policy $\pi^*$ satisfies Bellman's optimality criterion

        $J^*(x) = J^{\pi^*}(x) = \min_{a \in A}\big[\bar r(x,a) + \gamma \sum_{y \in X} P_{xy}(a) J^*(y)\big].$                                                (6)

The task of Q-learning is to determine a $\pi^*$ without knowing $\bar r(x,a)$ and $P_{xy}(a)$, which makes it well suited for the DCA problem. This is achieved by reformulating (6). For a policy $\pi$, define a Q-value (or state-action value) as

        $Q^\pi(x,a) = \bar r(x,a) + \gamma \sum_{y \in X} P_{xy}(a) J^\pi(y)$

which is the expected discounted cost for executing action $a$ at state $x$ and then following policy $\pi$ thereafter.
   Let

        $Q^*(x,a) = Q^{\pi^*}(x,a).$

We then get

        $J^*(x) = \min_{a \in A} Q^*(x,a).$

Thus the optimal value function $J^*$ that satisfies Bellman's optimality criterion can be obtained from $Q^*(x,a)$, and in turn $\pi^*(x)$ may be expressed as

        $\pi^*(x) = \arg\min_{a \in A} Q^*(x,a).$

   The Q-learning process tries to find $Q^*(x,a)$ in a recursive manner using the available information $(x_t, a_t, r_t, x_{t+1})$, where $x_t$ and $x_{t+1}$ are the states at time $t$ and $t+1$, respectively, and $a_t$ and $r_t$ are the action taken at time $t$ and the immediate cost incurred by $a_t$ at $x_t$, respectively. The Q-learning rule is

        $Q_{t+1}(x,a) = \begin{cases} (1-\eta) Q_t(x,a) + \eta\, \Delta Q_t & \text{if } x = x_t \text{ and } a = a_t \\ Q_t(x,a) & \text{otherwise} \end{cases}$                                                (7)

where $\eta$ is the learning rate and

        $\Delta Q_t = r_t + \gamma \min_{b \in A} Q_t(x_{t+1}, b).$

It has been shown [24] that if the Q-value of each admissible $(x,a)$ pair is visited infinitely often, and if the learning rate is decreased to zero in a suitable way, then as $t \to \infty$, $Q_t(x,a)$ converges to $Q^*(x,a)$ with probability 1.
   Q-learning is a method of asynchronous dynamic programming. However, unlike traditional dynamic programming, the Q-learning algorithm is model-free in the sense that its operation does not require knowledge of the state transition probabilities of the system, and it can be used in an on-line manner. In addition, Q-learning is computationally efficient: it does not maintain two memory structures, the evaluation function and the policy; rather, it maintains only one memory structure, namely the estimated Q-value of taking action $a$ at state $x$.

B. DCA-Q-Learning Formulation

   The mobile communication system can be considered as a discrete-time event system. As shown in Fig. 3, if handovers are not considered, the major events which may occur in a cell are new call arrivals and call departures due to the completion of calls. These events are modeled as stochastic variables with appropriate probability distributions. In particular, new call arrivals in cell $i$ are independent of all other arrivals and obey a Poisson distribution with a mean arrival rate $\lambda$, as shown by

        $P(k \text{ arrivals occur in } [t, t+\tau]) = \dfrac{(\lambda\tau)^k e^{-\lambda\tau}}{k!}.$                                                (8)

The interarrival time $\tau_a$ has an exponential density, defined by

        $f(\tau_a) = \lambda e^{-\lambda \tau_a}, \quad \tau_a \ge 0.$

The call holding time $T_h$ is assumed to be exponentially distributed with a mean call duration $1/\mu$. The density function is given by

        $f(T_h) = \mu e^{-\mu T_h}, \quad T_h \ge 0.$

   To utilize the Q-learning scheme, it is necessary to formulate the DCA as a dynamic programming problem or, equivalently, to identify the system state $x$, the action $a$, the associated cost $r$, and the next state $y$.

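The call-traffic model above, Poisson arrivals as in (8) together with exponential interarrival and holding-time densities, can be simulated in a few lines. This is a minimal sketch; the rates λ and μ below are illustrative stand-ins, not the paper's simulation settings.

```python
import random

def interarrival_time(lam: float, rng: random.Random) -> float:
    """Sample an exponential interarrival time with mean 1/lam (Poisson arrivals)."""
    return rng.expovariate(lam)

def holding_time(mu: float, rng: random.Random) -> float:
    """Sample an exponential call holding time with mean 1/mu."""
    return rng.expovariate(mu)

rng = random.Random(0)
lam = 1 / 36.0    # one arrival every 36 s on average (illustrative value)
mu = 1 / 180.0    # 3-minute mean call duration (illustrative value)

gaps = [interarrival_time(lam, rng) for _ in range(10_000)]
mean_gap = sum(gaps) / len(gaps)
# The empirical mean interarrival time should sit near 1/lam = 36 s.
```

Drawing successive interarrival gaps and pairing each arrival with a sampled holding time is enough to drive the event loop of a discrete-event cell simulator of the kind used in Section IV.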
                                                                            2) Actions: Applying an action is to assign a channel
                                                                         from       available channels to the current call request in cell
                                                                            Here, is defined as

                                                                            3) Costs: The cost           assesses the immediate cost in-
                                                                         curred due to the assignment of at state More specifically,
                                                                         it is a cost of choosing channel        to serve the currently
                                                                         concerned call attempt in cell There are many possibilities to
                                                                         define Here, we assess the cost of applying action           by
                                                                         evaluating usage conditions in cochannel cells associated with
                                                                         cell The basic idea is to assign higher costs to those usages
                                                                         in which cochannel cells are located further away from cell
                                                                         And thus, the lower costs are associated with those usages in
                                                                         which cochannel cells have minimum compact distance. More
                                                                         specifically,          is calculated by the following weighted
Fig. 3. Mobile communication system with a channel assignment scheme.
 1) State: Recall that it is assumed that there are cells and            In the above equation,         is the number of compact cells in
  channels available in the mobile communication system.
                                                                         reference to cell in which channel is being used. Compact
We define state     at time as
                                                                         cells are the cells with minimum average distance between
                                                                         cochannel cells [26]. In the case of a regular hexagonal layout
                                                                         shown in Fig. 1, compact cells are located on the third tier
where                       is the cell index specifying there is
                                                                         with three cells apart;        is the number of cochannel cells
an event, either call arrival or departure, occurring in cell
                                                                         which are located on the third tier but not compact cells in
                        is the number of available channels in
                                                                         which channel is being used;              is the number of other
cell at time , which depends on the channel usage conditions
                                                                         cochannel cells currently using channel ; and , , and
in cell and in its interfering cells
                                                                            are constant subcosts associated with the above-mentioned
   To obtain       at time , we define the channel status for
                                                                         conditions related to        ,        , and         , respectively.
cell                     as a -dimensional vector:
                                                                         The ordering relation between , , and              should be kept
Let the channel usage status be represented by

   u_i(j) = 1,  if channel j is in use in cell i
   u_i(j) = 0,  otherwise

where i = 1, …, N indexes the cells and j = 1, …, M indexes the channels. Furthermore, an availability vector A(i) = [a_i(1), …, a_i(M)] is formed, with each component a_i(j) being defined as

   a_i(j) = 1,  if channel j is available for use in cell i
   a_i(j) = 0,  otherwise.

Once the channel status in cell i and in its interfering cells I(i) is known, the availability vector A(i) can be formed easily, with the corresponding components being obtained from

   a_i(j) = ¬( u_i(j) ∨ u_{k_1}(j) ∨ ··· ∨ u_{k_L}(j) )                (11)

where k_1, …, k_L ∈ I(i), L denotes the number of interference cells of cell i, ∨ denotes the logical Or operation, and ¬ denotes the logical negation of its argument.
   … in such a way that … . For example, the values … were used in the simulation studies reported in the next section.
   4) Next State: According to the definition of state described before, the state transition from s_t to s_{t+1} is determined by two stochastic events, call arrivals and call departures. Therefore, the next state s_{t+1} can be obtained whenever one of these events occurs. However, in this paper only call arrivals are treated explicitly as sources to trigger the state transitions in which actions, i.e., channel assignments, are required to be taken. Although call departures do alter the number of available channels, we will not carry out any actions for them (no intracell handover is considered here) except to release the channel on which a call has just been completed.

C. Algorithm Implementation

   Having specified the state, action, cost, and next state, we are ready to describe a detailed implementation of the Q-learning algorithm for solving the DCA problem. Fig. 4 illustrates the structure of the Q-learning-based DCA system. As pointed out in Section III-A, Q-learning is an on-line learning scheme. In our case, this means that the task of learning a good assignment policy and the task of assigning a channel to a call attempt can be performed simultaneously. The system using Q-learning, however, may work in a fashion consisting of two successive procedures: learning and assigning.
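As a concrete illustration of (11), the sketch below forms the availability vector from per-cell usage indicators. The dictionary layout and function name are assumptions made for this example, not the paper's notation.

```python
# Illustrative sketch of (11): a channel is available in a cell iff it is
# in use neither in that cell nor in any of its interfering cells.

def availability_vector(usage, cell, interferers):
    """usage[c][j] is 1 iff channel j is in use in cell c."""
    n_channels = len(usage[cell])
    avail = []
    for j in range(n_channels):
        # Logical Or of the usage bits over the cell and its interferers ...
        busy = usage[cell][j] or any(usage[k][j] for k in interferers)
        # ... followed by logical negation, as in (11).
        avail.append(0 if busy else 1)
    return avail

usage = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 0]}
print(availability_vector(usage, 0, [1, 2]))  # -> [0, 0, 1]
```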
Fig. 4. Structure of Q-learning-based DCA.
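In outline, the learning-and-assigning cycle of Fig. 4 might be sketched as follows. This is a simplified stand-in for Steps 1-5 described in the text: the tabular Q-value store, cost function, next-state function, and fixed learning rate are placeholder assumptions, not the paper's exact definitions.

```python
# Sketch of one learning/assigning cycle (table representation).
# GAMMA matches the discount factor 0.5 used in the simulations; the rest
# (state encoding, cost, learning rate) is a simplified assumption.
from collections import defaultdict

GAMMA = 0.5
Q = defaultdict(float)  # lookup-table representation of Q(state, channel)

def assign_and_update(state, avail, cost_fn, next_fn, lr=0.1):
    if not avail:                        # no available channel: block the call
        return None
    # Steps 1-3: retrieve Q(state, a) for every available channel and
    # assign the channel with the minimum Q-value (costs are minimized).
    a = min(avail, key=lambda ch: Q[(state, ch)])
    cost = cost_fn(state, a)             # immediate cost of this assignment
    nstate, navail = next_fn(state, a)   # next state at the following arrival
    # Step 4: one-step Q-learning target and tabular update.
    target = cost + GAMMA * min((Q[(nstate, ch)] for ch in navail), default=0.0)
    Q[(state, a)] += lr * (target - Q[(state, a)])
    # Step 5 (backpropagating the difference) applies only when the Q-values
    # are held in a neural network rather than in this table.
    return a

chosen = assign_and_update("s0", [1, 2],
                           cost_fn=lambda s, a: 1.0,
                           next_fn=lambda s, a: ("s1", []))
print(chosen, round(Q[("s0", chosen)], 3))  # -> 1 0.1
```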

The Q-value is first learned on-line over a sufficiently long time period,² with the learned Q-values being stored in a representation mechanism. Then the task of on-line assignment is carried out by using the learned Q-values. Here, an important issue arises as to how to store the Q-values.
   There exists a variety of approaches to representing the Q-values [2]. A lookup table is the most straightforward method. It has the advantage of being both computationally efficient and completely consistent with the structural assumption made in proving the convergence of the Q-learning scheme. However, when the input space consisting of state-action pairs is large, or when the input variables are continuous, a lookup table can be prohibitive because the memory requirement may be huge. In this case, a function approximator such as a neural network may be used in an efficient manner. As expected, a second learning (or training) procedure is then involved, in which the network parameters such as weights are determined. In this paper, both the lookup table and the neural network are considered as representation mechanisms.
   Now the steps concerning learning and assigning corresponding to Fig. 4 are given as follows.
   Step 1: State-action construction. Construct the current state s_t by identifying the current cell number i and using the channel usage information associated with i and its interfering cells. Also, find the list of available channels, denoted by the set A(s_t) = {a_1, …, a_{M'}}. Here, we use M', instead of M, to signify explicitly the number of available channels corresponding to state s_t.
   Step 2: Q-value retrieval. Form the set of M' augmented inputs (s_t, a_1), …, (s_t, a_{M'}) and feed them into the Q-value representation mechanism, thereby deriving a set of M' Q-values.
   Step 3: Channel assignment. According to the definition of the Q-values, the optimal action, i.e., the optimal channel a*, is the one with minimum Q-value,

   a* = arg min_{a ∈ A(s_t)} Q(s_t, a)                (13)

as indicated in Fig. 4.
   Step 4: Q-value update. Update the Q-values once the next state s_{t+1} and the immediate cost c_t become available. The target Q-value, denoted by Q^t, according to (7) is

   Q^t = c_t + γ min_{a' ∈ A(s_{t+1})} Q(s_{t+1}, a')

where A(s_{t+1}) is the set of available channels at state s_{t+1}. The Q-value Q(s_t, a_t) is updated according to the difference Q^t − Q(s_t, a_t) and the chosen learning rate.
   Step 5: Network parameter update. If the Q-values are stored in a neural network or any other type of approximator, the second learning procedure (training) is necessary to learn the weight parameters associated with the network. In this case, the difference Q^t − Q(s_t, a_t) serves as the error signal which is backpropagated.
   It can be seen that if the Q-values are learned and represented faithfully, the task of assignment with learning stopped can be very efficient, since in this case only the first four steps are involved.

² Here, on-line learning means that the learner interacts with the operating environment in a real-time fashion. However, the environment can be either a real system or a simulated network.

                    IV. SIMULATION RESULTS

A. Issues Related to the Simulation

   1) Simulated Model: The performance of the proposed DCA algorithm was evaluated by simulating a mobile communication system consisting of 49 hexagonal cells as shown in Fig. 1. With the chosen reuse distance, it turns out that if a channel is allocated to a cell in Fig. 1, it cannot be reused in the two tiers of adjacent cells because of unacceptable cochannel interference levels. Thus there are at most 18 interfering cells for a specified reference cell.
   The assumptions and the parameters used in the simulation are as follows.
   • New call arrivals obey Poisson distributions with uniform and nonuniform mean interarrival times among the cells. The mean arrival rate ranges from 20 calls/h to 250 calls/h in each cell.
   • The call-holding time obeys an exponential distribution with a mean call duration of … s; this value was used for all calls throughout this paper.
Fig. 5. Performance comparison with uniform traffic: FCA, MAXAVAIL, Q-learning with table, and Q-learning with neural network.
   • The offered traffic ρ_i in cell i is given by ρ_i = λ_i T, where λ_i is the mean call arrival rate in cell i and T is the mean call-holding time.
   • There are 70 channels available in the system, although the number of channels can vary.
   • Blocked new and handover calls are dropped and cleared (Erlang B).
   2) Performance Evaluation: The performance of a channel assignment algorithm at a particular traffic loading was assessed by measuring the new-call blocking probability P_b, given by

   P_b = (number of blocked calls in a cell) / (number of new call arrivals at that cell).
Because Erlang B is assumed, the performance of the DCA can be readily compared with that of the FCA. The blocking probability in cell i in the case of FCA is given by

   P_i = (ρ_i^{m_i} / m_i!) / Σ_{k=0}^{m_i} (ρ_i^k / k!)                (15)

where ρ_i and m_i are the offered traffic and the (fixed) number of available channels in cell i. However, notice that the blocking probabilities of the FCA under the various conditions described in the next subsection were calculated by operating the simulated system instead of using the above formula.

Fig. 6. Nonuniform traffic distribution: Case 1.

   3) Simulation Procedures: To simulate the mobile communication system as a discrete-event dynamic system, a simulation clock is maintained. It gives the current value of the simulated time of the whole system. The simulation clock is advanced according to the time of occurrence of the most imminent future event, which can be a call arrival or a call departure. To this end, it is necessary to maintain dynamically a list of future events. If the event occurring is a call arrival, the set of steps described in Section III-C is performed, resulting in the call being either blocked or served by a channel. If necessary, learning is carried out. On the other hand, if the event occurring is a call departure, the occupied channel is released. After the event is processed accordingly, the channel usage information in each cell is updated and the time clock is advanced. To calculate the system performance, the number of new call arrivals and the number of blocked calls are recorded.

B. Results

   A set of simulations was carried out, including the cases of homogeneous and inhomogeneous traffic distributions, time-varying traffic patterns, and channel failures.
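For reference, the Erlang B blocking probability of (15) can be evaluated with the standard recursion, which avoids the large factorials of the direct form. The function below is a sketch, not code from the paper.

```python
# Erlang B blocking probability, computed with the usual recursion
# B(0) = 1, B(m) = A*B(m-1) / (m + A*B(m-1)) for offered load A Erlangs.

def erlang_b(offered_load, channels):
    b = 1.0
    for m in range(1, channels + 1):
        b = offered_load * b / (m + offered_load * b)
    return b

# A cell offered 5 Erlangs with ten fixed channels (the FCA case):
print(round(erlang_b(5.0, 10), 4))  # -> 0.0184
```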
Fig. 7. Performance comparison with nonuniform traffic distribution (case 1): FCA, MAXAVAIL, Q-learning with table, and Q-learning with neural network.
For the purpose of comparison, results due to the FCA and the maximum-availability-based DCA algorithm, MAXAVAIL [20], are included. The reason for selecting MAXAVAIL is that it has been claimed to be one of the best DCA algorithms, in the sense that its performance is close to the best achievable in this class of channel assignment algorithms, where no intracell handovers are involved.
   1) Uniform Distribution: In this case, the traffic load was assumed to be the same among all 49 cells. Six different loads in Erlangs were used, namely 5, 6, 7, 8, 9, and 10, which are equivalent, respectively, to call arrival rates of 100, 120, 140, 160, 180, and 200 calls/h. Two Q-value representation mechanisms were considered. In the first place, a three-dimensional lookup table was used. The Q-values were learned by running the simulated mobile communication system for 30 simulated hours with a constant arrival rate of 120 calls/h. The discount factor was chosen to be 0.5, and the learning rate was designed to vary with the state-action pair and with time. More specifically, each state-action pair (s, a) was associated with a learning rate η(s, a) which was inversely proportional to the frequency n(s, a) with which the pair (s, a) had been visited up to the present time. That is, η(s, a) = 1/n(s, a), with n(s, a) incremented by one each time (s, a) is visited. The parameters in the cost evaluation of (12) were … . The learned table was then used to assign the desired channel in the same communication system but under six different traffic load conditions.
   The same procedures were applied to the situation where a multilayer neural network [13] was used to represent the Q-values. The network, with three inputs representing state-action values, eight nonlinear hidden units with sigmoid functions, and one linear output unit representing the Q-value, was trained on-line for 30 simulated hours by using the backpropagation algorithm in conjunction with Q-learning. The learning rate and momentum gains for network training were 0.3 and 0.9, respectively. The trained network was then used to select a desired channel in response to a call attempt.
   Fig. 5 shows the blocking probabilities obtained using Q-learning with the table structure and with the neural network structure, together with the results due to FCA and MAXAVAIL. For the FCA scheme, each cell was assigned ten channels because a seven-cell cluster pattern was assumed. The testing time for all the algorithms was five simulated hours.
   It can be seen from Fig. 5 that the Q-learning-based DCA performs better than the FCA, although the degree of improvement gained by the DCA decreases slightly with increasing traffic load. For the interesting range of blocking probability, 2% to 6%, an increase in carried traffic of 20% can be obtained. Compared with the MAXAVAIL scheme, we conclude that the Q-learning-based DCA strategies are able to achieve a performance similar to that achieved by MAXAVAIL. However, the computational complexities are quite different. This issue will be discussed in some detail in Section IV-C.
   2) Nonuniform Distribution: Fig. 6 shows a case [25] in which the traffic densities in terms of calls/h are inhomogeneously distributed among the 49 cells. The average call arrival rate is 91.83 calls/h. Fig. 7 shows the blocking probabilities of the four methods described in the uniform case against arrival rates increased by 0, 20, 40, 60, 80, and 100 percent over the base rates given in Fig. 6. Fig. 7 indicates some significant improvements of the DCA algorithm over the FCA scheme, namely about a 50% increase in the traffic load carried at the same blocking probabilities.
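The visit-count-based learning rate described above, η(s, a) = 1/n(s, a), can be sketched as follows; the counter container is an implementation assumption.

```python
# Sketch of the state-action-dependent learning rate eta(s, a) = 1/n(s, a),
# where n(s, a) counts how often the pair (s, a) has been visited so far.
from collections import defaultdict

visit_count = defaultdict(int)

def learning_rate(state, action):
    visit_count[(state, action)] += 1        # n(s, a) <- n(s, a) + 1 on a visit
    return 1.0 / visit_count[(state, action)]

rates = [round(learning_rate("s0", 3), 4) for _ in range(4)]
print(rates)  # -> [1.0, 0.5, 0.3333, 0.25]
```

The harmonic decay gives frequently visited pairs a small, stable step size while rarely visited pairs still learn quickly.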
Fig. 8. Performance comparison with nonuniform traffic distribution (case 2): FCA, MAXAVAIL, Q-learning with table, and Q-learning with neural network.
This is somewhat expected, because the DCA scheme operates on a call-by-call basis and is thus able to adapt to spatially nonuniform situations; for the FCA to perform well, the traffic in the system should be as homogeneous as possible.
   We notice that the Q-learning-based DCA, whether using the table or the neural network, again performed as well as MAXAVAIL did. It is interesting to observe that neither the table nor the neural network was relearned or retrained: the Q-values learned in the uniform case were used.
   Fig. 8 gives another example, where the base traffic loads are given in Fig. 9 [25], with an average arrival rate of 106.53 calls/h. As expected, the DCA schemes in this case did not perform as well, in terms of the degree of improvement over the FCA approach, as in the case of Fig. 7. This is partly because the traffic loads were higher than those of Fig. 7.

Fig. 9. Nonuniform traffic distribution: Case 2.

   3) Time-Varying Traffic Load: The traffic load in telephony systems is typically time-varying. Fig. 10 shows a pattern of call arrivals during a typical business day, from 0:00 to 23:00 hours [12]. It can be seen that the peak hours occur around 11 h and 16 h. Fig. 11 gives the simulation results under the assumption that the traffic load was spatially uniformly distributed among the 49 cells (maximum 165 calls/h) but followed the time-varying pattern given in Fig. 10. The blocking probabilities were calculated on an hour-by-hour basis. The result obtained using Q-learning with the table structure is shown in Fig. 11(a), whereas that due to the FCA approach is shown in Fig. 11(b). The improvement of the Q-learning-based DCA over the FCA is apparent. For example, the number of hours at which the blocking probability was over 4% is two in Fig. 11(a), whereas that number is four in Fig. 11(b).
   We also examined the case in which the traffic loads were both spatially nonuniformly distributed and temporally varying. Fig. 12 gives the results due to Q-learning with the table structure [Fig. 12(a)] and the FCA [Fig. 12(b)]. The spatial distribution was in accordance with that given in Fig. 9, and the temporal distribution was consistent with that given in Fig. 10. As expected, a more significant improvement in terms of blocking probability was seen in this case than in the uniform distribution case. In particular, if a 4% blocking probability is again set as the threshold, the number of hours exceeding that threshold is four in Fig. 12(a) and ten in Fig. 12(b).
   4) Equipment Failure and On-Line Behavior: In a mobile communication system, equipment failure during normal operating hours may occur. To simulate this situation, we assumed that the various equipment failure cases will result
Fig. 10.   A traffic pattern of a typical business day.

Fig. 11. Performance with temporally varying and spatially uniform traffic: (a) Q-learning; (b) FCA.

Fig. 12. Performance with temporally varying and spatially nonuniform traffic: (a) Q-learning; (b) FCA.

in some frequency channels being temporarily unavailable. Fig. 13 gives an example in which the effect of channel failure on the system blocking probability is demonstrated under the Q-learning-based scheme with the table representation structure. The call arrival rate was 180 calls/h in all the cells. There were 70 channels available initially and, from 10 to 15 o'clock, zero (solid line), three (dotted line), five (dashed line), or seven (dash-dotted line) channels were temporarily shut down and thus not available for use. Comparing the results, it seems that the channel assignment algorithm possesses a certain robustness to channel failure, particularly when the number of failed channels is small, e.g., three to five.
Fig. 13. Robustness to channel failure: zero channels (solid line); three channels (dotted line); five channels (dashed line); seven channels (dash-dotted line).
Fig. 14. On-line behavior of the Q-learning: (a) blocking probability curve; (b) average arrival rate with nonuniform distribution of case 1.
   Finally, we examined the on-line behavior of the Q-learning-based DCA, in the sense that the learning and assigning operations were carried out simultaneously. Fig. 14(a) shows one of the results, where the blocking probability was computed cumulatively over two days (48 h). The call arrival rates were nonuniformly distributed as shown in Fig. 9, with the average varying according to Fig. 14(b). Some improvement due to on-line learning can be seen in Fig. 14(a), in the sense that the accumulated blocking probabilities during the second day were generally lower than those during the first day. Similar behavior was observed in another case, shown in Fig. 15(a), where the call arrival rates were nonuniformly distributed as shown in Fig. 9, with the average varying according to Fig. 15(b).

C. Computational Issues

   The results given in Figs. 5, 7, and 8 suggest that the Q-learning-based DCA strategies are able to achieve a performance similar to that achieved by MAXAVAIL. However, the computational complexities are quite different. In the process of assigning a channel, the complexity of using a table or a neural network depends primarily on the number of channels or, more precisely, on the number M' of available channels: M' − 1 comparisons over the M' Q-values are needed to make a decision. To obtain the individual Q-values, in the case of the table representation, it is a matter of index addressing, which can be very fast. In the case of the neural network representation, it depends on the size of the network; in our case, approximately 64 operations (multiplications or additions) were required per Q-value evaluation.³ Notice that the network size is independent of the number of channels M and the number of cells N. Therefore, the total number of operations needed to assign a channel is roughly 2M' − 1 for the table representation.

³ It should be pointed out that the approximate number of operations given does not include the eight sigmoid nonlinear operations on the eight hidden units.
Fig. 15. On-line behavior of the Q-learning: (a) blocking probability curve; (b) average arrival rate with nonuniform distribution of case 2.
TABLE I
NUMBER OF OPERATIONS REQUIRED FOR THREE DCA SCHEMES
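The per-assignment operation counts in Table I can be cross-checked with a few lines of arithmetic. The closed forms 2M' − 1 (table) and 65M' − 1 (neural network) used below are reconstructions consistent with the example figures of 19 and 649 operations quoted in the text for M' = 10 available channels, not formulas taken verbatim from the paper.

```python
# Cross-check of the per-assignment operation counts (table vs. network).
# The closed forms are reconstructions consistent with the text's example
# of 19 and 649 operations for M' = 10 available channels.

def table_ops(m_avail):
    # M' index lookups plus M' - 1 comparisons.
    return m_avail + (m_avail - 1)

def network_ops(m_avail, per_eval=64):
    # M' forward passes (about 64 multiply/adds each for the 3-8-1 net,
    # sigmoids excluded per footnote 3) plus M' - 1 comparisons.
    return m_avail * per_eval + (m_avail - 1)

print(table_ops(10), network_ops(10))  # -> 19 649
```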

For the neural network case, the corresponding total is roughly 65M' − 1 operations, as shown in Table I. As an example, 19 and 649 operations (comparisons, additions, or multiplications) will be needed in the table and neural network cases, respectively, if we assume that ten channels are available.
   The complexity of the MAXAVAIL scheme depends on the number of channels, the number of cells, and the number of interfering cells. Besides the M' − 1 comparisons, for each available channel the availability of that channel in each of the N cells is checked. For each cell, L interfering cells (in our case L can be 18) have to be visited to determine the channel status in that cell, requiring roughly L Or operations and L addition operations for each visit. Thus, the total number of operations needed to assign a channel is … , as given in Table I. If we again assume that ten channels are available, the number of operations using the MAXAVAIL scheme would be … .
   In terms of storage requirement, however, the MAXAVAIL method requires the smallest number of memory units, since it does not need to memorize much knowledge. The table-based Q-learning requires a larger number of memory units, the maximum of which in our case is … , whereas … memory units are needed to store the weights in the case of the neural-network-based Q-learning approach. It should be mentioned that it is quite possible to reduce the storage requirement of the table-based Q-learning by using a localized network such as a CMAC, CPN, or RBF network.

                         V. CONCLUSION

   We have described a novel approach to the problem of dynamic channel assignment. The optimal assignment policy is obtained by using a self-learning scheme based on Q-learning. The real-time simulation studies carried out in a 49-cell mobile communication system have demonstrated that the proposed approach is a practical alternative to existing schemes. In particular, comparative studies with the FCA and the MAXAVAIL-based DCA algorithm have suggested that the Q-learning-based DCA is able to perform better than the FCA in different situations, including traffic loads that are spatially uniformly and nonuniformly distributed, as well as time varying. Also, the new approach is capable of achieving a performance similar to that achieved by one of the best-known DCA algorithms, MAXAVAIL. However, the on-line computational efficiency of the proposed approach is far better than that of MAXAVAIL. This is a definite advantage of our approach, since time efficiency can be a critical issue in real-time implementations.
   While the current results seem encouraging, there certainly exist some issues worth pursuing further. First, some practical matters must be considered if the approach is to be implemented in a real system. They include the problem of scaling to larger systems with large numbers of cells and channels, and distributed implementation in each base station. Second, to use the DCA algorithm more efficiently, some limited number of intracell handovers may be considered so as to create more favorable conditions for future use. A third point that warrants investigation is how to introduce fuzzy concepts and algorithms [19] into the learning or computing procedures. For example, the interference conditions may be expressed in fuzzy terms, leading to soft constraints. This makes sense, since the coverage of the cells in reality is not clearly defined but has fuzzy boundaries (cells overlapping each other to some degree). Finally, it may be worthwhile to explore the possibility of keeping the table structure to represent the Q-values but with a reduced storage requirement, by using localized neural network configurations.
                        ACKNOWLEDGMENT

   The authors wish to thank the anonymous reviewers for their valuable comments and suggestions, which have helped us to improve the quality of this paper.

                          REFERENCES

 [1] A. Baiocchi, F. D. Priscoli, F. Grilli, and F. Sestini, "The geometric dynamic channel allocation as a practical strategy in mobile networks with bursty user mobility," IEEE Trans. Veh. Technol., vol. 44, pp. 14–23, 1995.
 [2] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intell., vol. 72, pp. 81–138, 1995.
 [3] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
 [4] P. T. Chen, M. Palaniswami, and D. Everitt, "Neural network-based dynamic channel assignment for cellular mobile communication systems," IEEE Trans. Veh. Technol., vol. 43, pp. 279–288, 1994.
 [5] J. Chuang, "Performance issues and algorithms for dynamic channel assignments," IEEE J. Select. Areas Commun., vol. 11, pp. 955–963, 1993.
 [6] D. C. Cox and D. O. Reudink, "Dynamic channel assignment in two-dimensional large mobile radio systems," Bell Syst. Tech. J., vol. 51, pp. 1611–1627, 1972.
 [7] R. H. Crites and A. G. Barto, "Improving elevator performance using reinforcement learning," in Advances in Neural Information Processing Systems 8, 1996.
 [8] E. Del Re, R. Fantacci, and L. Ronga, "A dynamic channel allocation technique based on Hopfield neural networks," IEEE Trans. Veh. Technol., vol. 45, pp. 26–32, 1996.
 [9] D. D. Dimitrijevic and J. Vucetic, "Design and performance analysis of the algorithms for channel allocation in cellular networks," IEEE Trans. Veh. Technol., vol. 42, pp. 526–534, 1993.
[10] M. Duque-Anton, D. Kunz, and B. Ruber, "Channel assignment for cellular radio using simulated annealing," IEEE Trans. Veh. Technol., vol. 42, pp. 14–21, 1993.
[11] D. Everitt, "Traffic engineering of the radio interface for cellular mobile networks," Proc. IEEE, vol. 82, pp. 1371–1382, 1994.
[20] K. N. Sivarajan, R. J. McEliece, and J. W. Ketchum, "Dynamic channel assignment in cellular radio," in Proc. 40th Veh. Technol. Conf., 1990, pp. 631–637.
[21] S. Tekinay and B. Jabbari, "Handover and channel assignment in mobile cellular networks," IEEE Commun. Mag., pp. 42–46, Nov. 1991.
[22] G. Tesauro, "Practical issues in temporal difference learning," Machine Learning, vol. 8, pp. 257–277, 1992.
[23] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Cambridge Univ., 1989.
[24] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[25] M. Zhang and T. S. Yum, "Comparisons of channel assignment strategies in cellular mobile systems," IEEE Trans. Veh. Technol., vol. 38, pp. 211–215, 1989.
[26] ——, "The nonuniform compact pattern allocation algorithm for cellular mobile systems," IEEE Trans. Veh. Technol., vol. 40, pp. 387–391, 1991.

   Junhong Nie received the B.S. and M.S. degrees from Northwest Telecommunications Engineering Institute (NTEI), Xi'an, China, and the Ph.D. degree from the University of Sheffield, U.K., all in electrical engineering.
   He served as a Lecturer in the Electrical Engineering Department of NTEI from 1985 to 1989. He was a Research Scientist in the Department of Electrical Engineering, National University of Singapore, from 1993 to 1995, and a Senior Research Engineer in the Communications Research Laboratory of McMaster University, Canada, from 1996 to 1997. Since 1998, he has been with Nortel Networks, working on advanced wireless communication systems. He has published one book and more than 50 articles, 30 of which appeared in internationally refereed journals and book chapters.

   Simon Haykin (F'82) received the B.Sc. degree with First-Class Honors in 1953, the Ph.D. degree in
[12] R. L. Freeman, Telecommunication System Engineering, 3rd ed. New
                                                                                                           1956, and the D.Sc. degree in 1967, all in electrical
     York: Wiley, 1996.
[13] S. Haykin, Neural Networks: A Comprehensive Foundation. New                                           engineering from the University of Birmingham,
     York: Macmillan, 1994.                                                                                U.K.
[14] S. Haykin and Junhong Nie, “A preliminary investigation on channel                                       He is the founding Director of the Communica-
     assignment problem in mobile communication systems,” Commun. Res.                                     tions Research Laboratory at McMaster Unviersity,
     Lab., McMaster Univ., Hamilton, Ont., Canada, Tech. Rep., 1996.                                       Hamilton, Ontario. In 1996 he was awarded the title
[15] D. Kunz, “Channel assignment for cellular radio using neural networks,”                               “University Professor.” His research interests in-
     IEEE Trans. Veh. Technol., vol. 40, pp. 188–193, 1991.                                                clude nonlinear dynamics, neural networks, adaptive
[16] W. C. Y.Lee, Mobile Cellular Telecommunications. New York:                                            filters, and their applications in radar and commu-
     McGraw-Hill, 1995.                                                           nication systems.
[17] W. K. Lai and G. G. Coghill, “Channel assignment through evolutionary           In 1980, Dr. Haykin was elected Fellow of the Royal Society of Canada.
     optimization,” IEEE Trans. Veh. Technol., vol. 45, pp. 91–96, 1996.          He was awarded the McNaughton Gold Medal, IEEE (Region 7), in 1986.
[18] V. H. Macdonald, “The cellular concept,” Bell Syst. Tech. J., vol. 58,       He is a recipient of the Canadian Telecommunications Award from Queen’s
     pp. 15–41, 1979.                                                             University. He is the Editor for Adaptive and Learning Systems for Signal
[19] J. Nie and D. A. Linkens, Fuzzy-Neural Control: Principles, Algorithms,      Processing, Communications and Control, a new series of books for Wiley-
     and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1995.                 Interscience.