					J. R. Statist. Soc. B (1999)
61, Part 4, pp. 793–815

      Multivariate bandwidth selection for local
      linear regression

      Lijian Yang
      Michigan State University, East Lansing, USA

      and Rolf Tschernig
      Humboldt-Universität zu Berlin, Germany

      [Received December 1997. Final revision January 1999]

      Summary. The existence and properties of optimal bandwidths for multivariate local linear
      regression are established, using either a scalar bandwidth for all regressors or a diagonal band-
      width vector that has a different bandwidth for each regressor. Both involve functionals of the
      derivatives of the unknown multivariate regression function. Estimating these functionals is difficult
      primarily because they contain multivariate derivatives. In this paper, an estimator of the multivariate
      second derivative is obtained via local cubic regression with most cross-terms left out. This estimator
      has the optimal rate of convergence but is simpler and uses much less computing time than the full
      local estimator. Using this as a pilot estimator, we obtain plug-in formulae for the optimal bandwidth,
      both scalar and diagonal, for multivariate local linear regression. As a simpler alternative, we also
      provide rule-of-thumb bandwidth selectors. All these bandwidths have satisfactory performance in
      our simulation study.

      Keywords: Asymptotic optimality; Blocked quartic fit; Functional estimation; Partial local
      regression; Plug-in bandwidth; Rule-of-thumb bandwidth; Second derivatives

1.   Introduction
Nonparametric estimation in general requires little a priori knowledge of the functions to be
estimated. The estimation results, however, depend crucially on the choice of bandwidth.
Whereas choosing too large a bandwidth may introduce a large bias, selecting too small a
bandwidth may cause large estimation variance. An asymptotically optimal bandwidth usually
exists and can be obtained through a bias–variance trade-off. Such an optimal bandwidth in
general involves functionals of the unknown underlying functions, and the selection of the
optimal bandwidth has always been a challenge. Bandwidth selection methods with good
theoretical properties and practical performance exist for univariate density estimation, e.g.
Jones et al. (1996a, b), Cheng (1997) and Grund and Polzehl (1998), and univariate local least
squares regression, e.g. Fan and Gijbels (1995) and Ruppert et al. (1995). Wand and Jones
(1993) provided an excellent analysis of bandwidth selection for bivariate density estimation,
Sain et al. (1994) looked at multivariate cross-validation for density estimation bandwidth
selection and Wand and Jones (1994) developed plug-in (PI) bandwidths for multivariate
kernel density estimation. Herrmann et al. (1995) studied bandwidth selection for bivariate
regression with fixed regular design, but it is unclear whether their method works for random
designs and higher dimensions. Ruppert (1997) proposed the empirical bias bandwidth
selector (EBBS) which could be applied in a multivariate setting. However, the theoretical
properties of the EBBS are unknown and its practical performance has only been studied in
the univariate setting. As pointed out by Ruppert (1997), the difficulty in obtaining a reliable
multivariate data-driven bandwidth is essentially due to the complexity of estimating higher
order multivariate derivatives. To address this difficulty, we estimate multivariate second-
order derivatives by a local cubic fit, where all unnecessary terms are left out. This both solves
the theoretical problem and makes a practical implementation feasible. We are not aware of
any other work that makes use of the desirable properties of a partial local polynomial fit.

   Address for correspondence: Lijian Yang, Department of Statistics and Probability, Michigan State University,
East Lansing, MI 48824, USA.

© 1999 Royal Statistical Society                                                              1369–7412/99/61793
   Another new feature is the use of a different bandwidth for each of the regressors, in
contrast with the EBBS method which uses the same bandwidth for all regressors. We have
established some general assumptions that ensure the existence of asymptotic optimal band-
widths. This is not trivial because, when using a bandwidth vector, there is no simple closed
formula for the solution of the bias–variance trade-off, and the existence of the solution needs
to be proven implicitly. This use of a `diagonal bandwidth' is a good compromise between
flexibility and optimality on the one hand (which need more smoothing parameters, such as a
bandwidth matrix) and interpretation and simplicity on the other hand (fewer parameters,
such as one bandwidth for all explanatory variables). We have also studied the theoretical
properties and practical performances of the latter option, which we call the `scalar optimal
bandwidth'.
   The computation of our PI bandwidths requires the estimation of functionals that include
derivatives up to the fourth order. The idea is to estimate unknown functionals by blocked
polynomial fits, as in Härdle and Marron (1995) and Ruppert et al. (1995), but we had to
refine the idea further. A quartic fit of a function of four variables could require up to
70 parameters in each block, which not only makes it impossible to have a sufficient
number of blocks to adapt to local features of the function but also introduces much noise into
the estimation within each block. Since the computation of one functional does not require
all the derivatives included in a complete fourth-order Taylor expansion, we propose to use a
partial quartic fit in the blocks, with only 23 parameters for the same function. This advan-
tage grows drastically as the number of variables increases further.
   The PI bandwidths are necessarily computationally intensive, so we also investigated rule-of-
thumb (ROT) bandwidths as simpler alternatives.
   The paper is organized as follows. In the next section, we show the existence of an asymp-
totically optimal bandwidth vector. Section 3 defines an ROT and a PI bandwidth selector
for local linear regression, and asymptotic optimality is established for the PI bandwidth. In
Section 4, we describe in detail how functionals of second derivatives are estimated via partial
local cubic regression, while parallel results using a scalar bandwidth selector are summarized
in Appendix A. Implementation of the methods proposed is discussed in Section 5. In Section
6 we describe the results of a rather comprehensive Monte Carlo study. Section 7 concludes,
while all assumptions and proofs are contained in Appendix B.
   The programs that are used in the analysis can be obtained from

2.    Background
To formulate the problem, consider a multivariate regression model

$$Y = f(X) + g(X)\,\varepsilon$$

where Y is a scalar dependent variable, X = (X₁, X₂, ..., X_d) is a vector of explanatory
variables, ε is independent of X, E(ε) = 0 and var(ε) = 1. Let (X_i, Y_i), i = 1, 2, ..., n, be an
independent and identically distributed sample. Then the local linear estimator of f at a given
point x = (x₁, x₂, ..., x_d) is obtained by doing a first-order Taylor expansion of the function
f at point x for all the data points X_i and solving a least squares problem locally weighted by
kernels: i.e. the estimator f̂(x) is the first element c of the vector {c, (c_α)_{1≤α≤d}} that minimizes

$$\sum_{i=1}^{n} \Big\{ Y_i - c - \sum_{\alpha=1}^{d} c_\alpha (X_{i\alpha} - x_\alpha) \Big\}^2 K_h(X_i - x)$$

where K is a symmetric, compactly supported, univariate probability kernel (so that K is non-
negative and ∫K(u) du = 1) and

$$K_h(X_i - x) = \prod_{\alpha=1}^{d} \frac{1}{h_\alpha}\, K\Big(\frac{X_{i\alpha} - x_\alpha}{h_\alpha}\Big).$$
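The locally weighted least squares problem above can be solved directly via the normal equations. The following sketch is our illustration, not the authors' code; the quartic kernel and all function names are assumptions made for the example.

```python
import numpy as np

def quartic_kernel(u):
    """Symmetric, compactly supported probability kernel: K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def local_linear(x, X, Y, h):
    """Local linear estimate of f at x: the intercept of a weighted least squares fit
    of Y on [1, X_i - x] with product-kernel weights K_h(X_i - x)."""
    n, d = X.shape
    D = X - x                                      # differences X_i - x
    w = np.prod(quartic_kernel(D / h), axis=1)     # product kernel (the 1/h factors cancel in WLS)
    Z = np.hstack([np.ones((n, 1)), D])            # design matrix [1, X_i - x]
    ZW = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ ZW, ZW.T @ Y)     # weighted normal equations
    return beta[0]                                 # first element c = f_hat(x)
```

The remaining d elements of beta estimate the gradient of f at x; only the intercept is used as f̂(x).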
   Denoting by p(x) the density of X and using a weight function w(x), the weighted mean
integrated squared error (MISE)

$$E \int \{\hat f(x) - f(x)\}^2\, w(x)\, p(x)\, dx$$

is a function of the bandwidth vector h = (h₁, ..., h_d)^T, and the h that minimizes this error is
called the optimal bandwidth vector. An asymptotic formula of the MISE in this setting is
given by

$$\mathrm{AMISE}(h) = \frac{\sigma_K^4}{4} \sum_{\alpha,\beta=1}^{d} C_{\alpha\beta}(f)\, h_\alpha^2 h_\beta^2 + \|K\|_2^{2d}\, B(g)\, \frac{1}{n h_1 \cdots h_d}$$

in which σ_K² = ∫u² K(u) du, ‖K‖₂² = ∫K²(u) du and

$$B(g) = \int g^2(x)\, w(x)\, dx, \qquad C_{\alpha\beta}(f) = \int f^{(\alpha\alpha)}(x)\, f^{(\beta\beta)}(x)\, p(x)\, w(x)\, dx, \qquad \alpha, \beta = 1, \ldots, d,$$

where f^{(αα)}(x) denotes the second derivative of f(x) with respect to the αth variable x_α. For the
general formula of the AMISE using a bandwidth matrix, see p. 140 of Wand and Jones (1995).
We denote by C(f) the matrix whose (α, β)th entry is C_{αβ}(f), which is always non-negative
definite. Also, we always have B(g) > 0 by assumption 2 in Appendix B.
   We introduce a function Q: ℝ₊^d × M₊(d) × ℝ₊ → ℝ₊,

$$Q(v, M, a) = \frac{1}{4} \sum_{\alpha,\beta=1}^{d} M_{\alpha\beta}\, v_\alpha v_\beta + \frac{a}{\surd(v_1 \cdots v_d)} = \frac{1}{4} v^T M v + \frac{a}{\surd(v_1 \cdots v_d)} \qquad (2.1)$$

where M₊(d) is the set of non-negative definite d × d matrices whose definition is given by
expression (B.1) in Appendix B, ℝ₊ = (0, ∞) and M_{αβ} is the (α, β)th entry of a matrix M.
With h² = (h₁², ..., h_d²)^T, we then have
$$\mathrm{AMISE}(h) = Q\bigg\{ h^2,\ \sigma_K^4\, C(f),\ \frac{\|K\|_2^{2d}\, B(g)}{n} \bigg\}.$$
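Both Q and the rescaling identity behind equation (2.2) are easy to check numerically. The sketch below is our illustration; the matrix M and the constants a and c are arbitrary choices made for the example.

```python
import numpy as np

def Q(v, M, a):
    """Equation (2.1): Q(v, M, a) = (1/4) v'Mv + a / sqrt(v_1 ... v_d)."""
    v = np.asarray(v, dtype=float)
    return 0.25 * v @ M @ v + a / np.sqrt(np.prod(v))

# Check the identity Q(v, M, a) = a^{4/(d+4)} c^{d/(d+4)} Q(a^{-2/(d+4)} c^{2/(d+4)} v, M/c, 1)
rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((d, d))
M = A @ A.T                              # a non-negative definite matrix
v = rng.uniform(0.5, 2.0, d)             # a positive "squared bandwidth" vector
a, c = 2.0, 1.7
lhs = Q(v, M, a)
rhs = a ** (4 / (d + 4)) * c ** (d / (d + 4)) * Q(
    a ** (-2 / (d + 4)) * c ** (2 / (d + 4)) * v, M / c, 1.0)
```

Minimizing both sides of the identity over v yields the scaling relation (2.2) for the minimizer v(M, a).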
   Simple algebra gives, for any c > 0,

$$Q(v, M, a) = a^{4/(d+4)}\, c^{d/(d+4)}\, Q\{a^{-2/(d+4)} c^{2/(d+4)} v,\ c^{-1} M,\ 1\}$$

and therefore

$$v(M, a) = a^{2/(d+4)}\, c^{-2/(d+4)}\, v(c^{-1} M, 1) \qquad (2.2)$$

where v(M, a) denotes the vector v that minimizes Q(v, M, a). By theorem 6, v(M, a) is a
well-defined and infinitely smooth function of M and a. Using equation (2.2), we have the
following theorem.
   Theorem 1. Under assumptions (1)–(4), AMISE(h) is minimized by a unique vector which
depends on B(g) and C(f) smoothly via

$$h_{\mathrm{opt}} = \bigg\{ \frac{\|K\|_2^{2d}\, B(g)}{n \sigma_K^4} \bigg\}^{1/(d+4)} v^{1/2}\{C(f), 1\}. \qquad (2.3)$$
   The optimal bandwidth hopt given in formula (2.3) contains the unknown quantities B(g) and
C(f) and hence cannot be directly used in the estimation of f(x). In Section 4, appropriate
estimators of f^{(αα)}(x) and C_{αβ}(f) are given, based on a partial local cubic fit with many cross-
terms left out. These estimators have very simple bias formulae, and their biases and variances
are of the same order as those of the estimator that keeps the cross-terms in.

3.    Bandwidth selection
A quick substitute for hopt is the simple ROT bandwidth defined as

$$h_{\mathrm{ROT}} = \bigg\{ \frac{\|K\|_2^{2d}\, B_{\mathrm{ROT}}(g)}{n \sigma_K^4} \bigg\}^{1/(d+4)} v^{1/2}\{C_{\mathrm{ROT}}(f), 1\} \qquad (3.1)$$

in which C_ROT(f) and B_ROT(g) are based on piecewise quartic estimation of f(x), and whose
formulae are given by expressions (5.1) and (5.2).
   The bandwidth vector hROT is of the same order as hopt but does not estimate hopt efficiently
because f may not always be represented by a piecewise fourth-order polynomial. The idea of
such an ROT bandwidth exists in the univariate case; see Fan and Gijbels (1995) and Ruppert
et al. (1995). The latter additionally used a blocking idea that we have also adopted.
   To improve on the ROT idea, we propose a PI bandwidth, so called as it approximates hopt
by plugging in consistent estimates of the functionals B(g) and C(f). Specifically, we define

$$h_{\mathrm{PI}} = \bigg\{ \frac{\|K\|_2^{2d}\, \hat B(g)}{n \sigma_K^4} \bigg\}^{1/(d+4)} v^{1/2}\{\hat C(f), 1\} \qquad (3.2)$$

in which Ĉ(f) = {Ĉ_{αβ}(f)} = {Ĉ_{αβ}(f, h)}, where for each α, β = 1, ..., d

$$\hat C_{\alpha\beta}(f) = \frac{1}{n} \sum_{i=1}^{n} \hat f^{(\alpha\alpha)}(X_i)\, \hat f^{(\beta\beta)}(X_i)\, w(X_i) = \int \hat f^{(\alpha\alpha)}(x)\, \hat f^{(\beta\beta)}(x)\, w(x)\, p(x)\, dx + O_p(n^{-1/2}) \qquad (3.3)$$

and f̂^{(αα)}(x) is the partial local cubic estimator of f^{(αα)}(x) given in equation (4.2). The
estimator B̂(g) is defined in equation (5.3).

   Using standard results such as those contained in Wand and Jones (1995) or Fan and Gijbels
(1996), we have B̂(g)/B(g) = 1 + O_p(n^{-2/(d+4)}) as n → ∞. On the basis of this, of the
asymptotics of Ĉ(f) as given in corollary 2 and of the smooth dependence of hopt on C(f) as in
theorem 1, we have the following theorem.

   Theorem 2. Under assumptions (1)–(4), for n → ∞, the bandwidth defined in equation
(3.2) is asymptotically optimal. In particular,

$$h_{\mathrm{PI}} = h_{\mathrm{opt}}\{I_d + O_p(n^{-2/(d+6)})\}.$$

   Since hPI is a consistent substitute for the optimal bandwidth hopt, it can be shown that hPI
performs asymptotically similarly to hopt in terms of the MISE. hPI will on average work better
than hROT since the latter is an inconsistent bandwidth estimator. The simulation results in
Section 6 strongly support these expectations.
   In Appendix A, we discuss the use of a scalar bandwidth h̄ instead of h, i.e. h = (h̄, ..., h̄).
In that case, the corresponding ROT bandwidth h̄ROT and PI bandwidth h̄PI are given by
equations (A.2) and (A.3).

4.   Partial local estimation
In this section, we formulate an estimator f̂^{(αα)}(x) of f^{(αα)}(x). In what follows, denote for any
compactly supported function L

$$\mu_r(L) = \int u^r L(u)\, du;$$

hence in particular μ₂(K) = σ_K². We write μ₄(K) = κσ_K⁴, where κ denotes the kurtosis of the
kernel K, which is always greater than 1. For the derivative functions, we denote

$$f^{(\alpha\beta\gamma)}(x) = \frac{\partial^3 f(x)}{\partial x_\alpha\, \partial x_\beta\, \partial x_\gamma},$$

and analogously for derivatives of other orders. For a given α, let

$$Z = \Big[1,\ \{(X_{i\beta} - x_\beta)^2\}_{1 \le \beta \le d},\ \{(X_{i\alpha} - x_\alpha)(X_{i\beta} - x_\beta)\}_{\beta \ne \alpha},\ \{(X_{i\beta} - x_\beta)\}_{1 \le \beta \le d},\ \{(X_{i\alpha} - x_\alpha)(X_{i\beta} - x_\beta)^2\}_{\beta \ne \alpha},\ \{(X_{i\alpha} - x_\alpha)^2 (X_{i\beta} - x_\beta)\}_{\beta \ne \alpha},\ (X_{i\alpha} - x_\alpha)^3\Big]_{i=1}^{n}, \qquad (4.1)$$

and then define

$$\hat f^{(\alpha\alpha)}(x) = 2\, e_\alpha^T (Z^T W Z)^{-1} Z^T W Y \qquad (4.2)$$

where e_α is a (5d − 1)-vector of 0s whose (α + 1)th element is 1, Y = (Y_i)_{n×1} and, with a scalar
bandwidth h = (h, ..., h),

$$W = \mathrm{diag}\Big\{\frac{1}{n} K_h(X_i - x)\Big\}_{i=1}^{n};$$

see Ruppert and Wand (1994) for a general definition of a local polynomial estimator. We
then obtain the following theorem.
   Theorem 3. Under assumptions (1)–(3), for α = 1, ..., d, as h → 0 and nh^{d+4} → ∞, we have

$$\surd(n h^{d+4})\, \{\hat f^{(\alpha\alpha)}(x) - f^{(\alpha\alpha)}(x) - b_\alpha(x)\, h^2\} \overset{D}{\to} N\{0, \sigma_\alpha^2(x)\},$$

where

$$b_\alpha(x) = \frac{\mu_6(K) - \kappa \sigma_K^6}{12\, (\kappa - 1)\, \sigma_K^4}\, f^{(\alpha\alpha\alpha\alpha)}(x) + \sum_{\beta \ne \alpha} \frac{\sigma_K^2}{2}\, f^{(\alpha\alpha\beta\beta)}(x), \qquad (4.3)$$

$$\sigma_\alpha^2(x) = \frac{4\, \{\mu_4(K^2) - 2\sigma_K^2\, \mu_2(K^2) + \|K\|_2^2\, \sigma_K^4\}\, \|K\|_2^{2(d-1)}}{\sigma_K^8\, (\kappa - 1)^2}\, \frac{g^2(x)}{p(x)}. \qquad (4.4)$$
                                                        K … À 1† p…x†
   The formation of the matrix Z is much simpler than the corresponding matrix of a full
local cubic expansion. This makes the explicit mathematical derivation of (Z^T W Z)^{-1} easier
and computing expression (4.2) less costly.
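A minimal sketch of the estimator (4.2) follows (our illustration, not the authors' implementation): it builds a (5d − 1)-column design matrix for a given α, so that the (α + 1)th coefficient multiplies the squared term in the αth variable; the exact column ordering and the quartic kernel are assumptions made for the example.

```python
import numpy as np

def quartic_kernel(u):
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def partial_cubic_second_deriv(x, X, Y, h, alpha):
    """Estimate f^{(alpha,alpha)}(x) by the partial local cubic fit: build the
    (5d - 1)-column design Z for the given alpha (0-based), solve the kernel-weighted
    least squares problem and return twice the coefficient of (X_{i,alpha} - x_alpha)^2."""
    n, d = X.shape
    D = X - x
    others = [b for b in range(d) if b != alpha]
    cols = [np.ones(n)]
    cols += [D[:, b] ** 2 for b in range(d)]              # squared terms (positions 2, ..., d+1)
    cols += [D[:, alpha] * D[:, b] for b in others]       # cross terms with variable alpha
    cols += [D[:, b] for b in range(d)]                   # linear terms
    cols += [D[:, alpha] * D[:, b] ** 2 for b in others]  # partial cubic terms
    cols += [D[:, alpha] ** 2 * D[:, b] for b in others]
    cols += [D[:, alpha] ** 3]
    Z = np.column_stack(cols)                             # n x (5d - 1)
    w = np.prod(quartic_kernel(D / h), axis=1)            # scalar bandwidth h in every direction
    ZW = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ ZW, ZW.T @ Y)
    return 2.0 * beta[1 + alpha]                          # 2 e_alpha'(Z'WZ)^{-1} Z'WY
```

Because the design omits most cross-terms, it has 5d − 1 columns instead of the O(d³) columns of a full local cubic fit.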
   On the basis of equation (4.2), the asymptotic properties of Ĉ_{αβ}(f, h) are presented in the
next theorem. In the following, we denote K*(u) = K(u)(u² − σ_K²), by K ∗ K* the convolution
between K and K*,

$$(K \ast K^*)(t) = \int K(t - u)\, K^*(u)\, du,$$

which is also ∫K(t + u) K*(u) du by symmetry, and K^{(2)} = K ∗ K, the self-convolution of K.
Also, we denote by δ_{αβ} the Kronecker index, which equals 1 or 0 depending on whether α = β
or not.

   Theorem 4. Under assumptions (1)–(4), as h → 0 and nh^{d+4} → ∞, we have

$$\hat C_{\alpha\beta}(f, h) = C_{\alpha\beta}(f) + \{B_{\alpha\beta}(f) + B_{\beta\alpha}(f)\}\, h^2 + \frac{B(g)\, V_{\alpha\beta}(K)}{n h^{d+4}} + \Delta_{\alpha\beta} + O_p(h^4) + o_p\Big\{\frac{1}{n \surd(h^{d+8})}\Big\}$$

in which

$$B_{\alpha\beta}(f) = \int f^{(\alpha\alpha)}(x)\, b_\beta(x)\, w(x)\, p(x)\, dx,$$

$$V_{\alpha\beta}(K) = \frac{4\, \|K\|_2^{2(d-2)}\, \{\delta_{\alpha\beta}\, \mu_4(K^2)\, \|K\|_2^2 + (1 - \delta_{\alpha\beta})\, \mu_2(K^2)^2 - 2\sigma_K^2\, \mu_2(K^2)\, \|K\|_2^2 + \sigma_K^4\, \|K\|_2^4\}}{\sigma_K^8\, (\kappa - 1)^2},$$

$$n \surd(h^{d+8})\, \Delta_{\alpha\beta} \overset{D}{\to} N\{0, \sigma_{\alpha\beta}^2(g, K)\},$$

$$\sigma_{\alpha\beta}^2(g, K) = \frac{32 \int g^4(x)\, w^2(x)\, dx}{\sigma_K^{16}\, (\kappa - 1)^4} \int F_{\alpha\beta}(t)\, F_{\beta\alpha}(t)\, dt$$

and where we define, for any α, β = 1, ..., d,

$$F_{\alpha\beta}(t) = (1 - \delta_{\alpha\beta})\, (K \ast K^*)(t_\alpha)\, (K \ast K^*)(t_\beta) \prod_{\gamma \ne \alpha, \beta} K^{(2)}(t_\gamma) + \delta_{\alpha\beta}\, (K^* \ast K^*)(t_\alpha) \prod_{\gamma \ne \alpha} K^{(2)}(t_\gamma).$$
   Note here that ∫g⁴(x) w²(x) dx > 0 by assumption (2), and that therefore Ĉ_{αβ}(f, h) has a
standard deviation of order {n√(h^{d+8})}^{-1}, which is smaller than the order (nh^{d+4})^{-1} of one of
the bias terms. The two bias terms point to a trade-off if both are positive, or to cancellation if
the h²-term happens to be negative. Assuming that the h²-term is never 0, we obtain the
following results for estimating C_{αβ}(f) optimally.
   Corollary 1. Under the same assumptions as in theorem 4, the optimal bandwidth to
estimate C_{αβ}(f) by Ĉ_{αβ}(f, h) is

$$h_{C_{\alpha\beta}(f),\mathrm{opt}} = \begin{cases} \bigg[\dfrac{-B(g)\, V_{\alpha\beta}(K)}{\{B_{\alpha\beta}(f) + B_{\beta\alpha}(f)\}\, n}\bigg]^{1/(d+6)} & \text{if } B_{\alpha\beta}(f) + B_{\beta\alpha}(f) < 0, \\[2ex] \bigg[\dfrac{(d+4)\, B(g)\, V_{\alpha\beta}(K)}{2\, \{B_{\alpha\beta}(f) + B_{\beta\alpha}(f)\}\, n}\bigg]^{1/(d+6)} & \text{if } B_{\alpha\beta}(f) + B_{\beta\alpha}(f) > 0, \end{cases} \qquad (4.5)$$

which gives a standard error of order {n√(h_{C_{αβ}(f),opt}^{d+8})}^{-1} = O(n^{-(d+4)/2(d+6)}) and a bias of
order h⁴_{C_{αβ}(f),opt} = O(n^{-4/(d+6)}) when B_{αβ}(f) + B_{βα}(f) < 0 and of order h²_{C_{αβ}(f),opt} = O(n^{-2/(d+6)})
when B_{αβ}(f) + B_{βα}(f) > 0.
   We can now complete the definition of Ĉ_{αβ}(f), and therefore of Ĉ(f). Estimate all the
derivatives f^{(αααα)}(x) etc. by blocked quartic fits and obtain the estimates of B_{αβ}(f) as described
in Section 5. Then plug them into formula (4.5) together with B̂(g) in place of B(g). Use the
resulting bandwidth in the estimator Ĉ_{αβ}(f). Definition (3.3) and theorem 4 yield the following
corollary.

   Corollary 2. Under assumptions (1)–(4), for n → ∞, the Ĉ(f) defined by equation (3.3) has
the consistency property

$$\hat C(f) = C(f)\{I_d + O_p(n^{-2/(d+6)})\}.$$

5.   Implementation
In this section we describe how to estimate C(f) and B(g) to obtain the ROT and PI
bandwidths (3.1) and (A.2), and (3.2) and (A.3), respectively.
   For the ROT estimation of f(X_i) and f^{(αα)}(X_i), Fan and Gijbels (1995) suggested the use of a
quartic Taylor expansion. Ruppert et al. (1995) proposed to separate the data into equal-
sized blocks and to use a quartic Taylor expansion on each block in case the function is very
wiggly. Since the number of parameters per block increases dramatically with the increase in
dimension and since the wiggliness of the function depends on each variable differently, our
flexible blocking scheme sorts the data one variable at a time. Let N_α denote the number of
blocks in the αth direction and collect them in N = (N₁, ..., N_d). Then N = Π_{α=1}^d N_α is the
total number of blocks given the choices of N_α, α = 1, 2, ..., d. To choose the optimal
blocks we follow Ruppert et al. (1995) and use Mallows's C_p,

$$C_p(\mathbf{N}) = \frac{\mathrm{RSS}(\mathbf{N})\, \{n - k(d)\, [\,n / \{4 k(d)\}\,]\}}{\min_{\mathbf{N}} \{\mathrm{RSS}(\mathbf{N})\}} - \{n - 2\, k(d)\, N\}$$

where RSS(N) denotes the residual sum of squares based on the quartic fit with blocking N,

$$k(d) = 1 + \sum_{i=1}^{4} \binom{d + i - 1}{i}$$

is the maximum number of parameters in one block and [a] denotes the integer part of a.
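The block-count formulae can be sketched as follows (our illustration; the function names are ours):

```python
from math import comb

def k_full(d):
    """Maximum number of parameters of a full quartic fit in d variables:
    k(d) = 1 + sum_{i=1}^{4} C(d + i - 1, i)."""
    return 1 + sum(comb(d + i - 1, i) for i in range(1, 5))

def mallows_cp(rss, rss_min, n, d, N_total):
    """Mallows's C_p for a blocking with N_total blocks and residual sum of squares rss;
    rss_min is the minimum RSS over all candidate blockings, // the integer part."""
    k = k_full(d)
    return rss * (n - k * (n // (4 * k))) / rss_min - (n - 2 * k * N_total)
```

For d = 4 this gives k(4) = 70, the parameter count quoted in Section 1.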

Denoting the block vector chosen with C_p(N) by N*, and by f̂_{ROT,N*}, f̂^{(αα)}_{ROT,N*} the corres-
ponding function and second-derivative estimators, we then estimate

$$B_{\mathrm{ROT}}(g) = \frac{1}{n - k(d)\, N^*} \sum_{i=1}^{n} \frac{\{Y_i - \hat f_{\mathrm{ROT},N^*}(X_i)\}^2\, w(X_i)}{p_{\mathrm{ROT}}(X_i)}, \qquad (5.1)$$

$$C_{\mathrm{ROT},\alpha\beta}(f) = \frac{1}{n} \sum_{i=1}^{n} \hat f^{(\alpha\alpha)}_{\mathrm{ROT},N^*}(X_i)\, \hat f^{(\beta\beta)}_{\mathrm{ROT},N^*}(X_i)\, w(X_i) \qquad (5.2)$$

where p_ROT(x) is the uniform density on the range of the data. These estimates are necessarily
inconsistent unless the true f is a piecewise quartic function.
   This problem is avoided by the PI estimation described in Section 4 and Appendix A. Now we
estimate B(g) by

$$\hat B(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{w(X_i)\, \{Y_i - \hat f_{h_{\mathrm{ROT}}}(X_i)\}^2}{\tilde p_{h_S}(X_i)} \qquad (5.3)$$

where hROT is defined by equations (3.1), (5.1) and (5.2), h_S is the d-dimensional normal
reference bandwidth (Silverman (1986), pages 86–87) and p̃_{h_S}(x) is a kernel density estimator.
   To estimate C(f) we use partial local cubic regression as described in Section 4. This is better
done with a bandwidth that is slightly larger than the optimal pilot bandwidth h_{C_{αβ},opt}. The
reason is that the bandwidth h_{C_{αβ},opt} only minimizes the absolute value of the bias, whereas it
completely ignores the variance in the C(f) estimation. Asymptotically this is justified since the
standard deviation of Ĉ(f, h) is of higher order than the bias. To use a bandwidth that is not
equal to h_{C_{αβ},opt} would produce in the Ĉ(f) estimate a larger bias than the minimum bias, but
not significantly so if the bandwidth is larger. To see this effect we take the bandwidth λh_{C_{αβ},opt}
and plot in Fig. 1 the ratio of the bias to the minimum bias (achieved at λ = 1) as a function
r(λ) of λ ∈ [0.6, 2], for d = 4. Here, we assume that B_{αβ} + B_{βα} > 0, and the ratio then is

$$r(\lambda) = \frac{\{(d+4)/2\}^{2/(d+6)}\, \lambda^2 + \{(d+4)/2\}^{-(d+4)/(d+6)}\, \lambda^{-(d+4)}}{\{(d+4)/2\}^{2/(d+6)} + \{(d+4)/2\}^{-(d+4)/(d+6)}}.$$

The broken line has the height of r(2). We can see that r(λ) is about 3 or less for λ > 1,
whereas there is a dramatic increase for λ < 1. Also it is easy to verify that r(2) ≤ 4 for any
d. Meanwhile, for λ = 2, the ratio of decrease in the standard deviation is λ^{−(d+8)/2}, which is
1/32 if d = 2 and more for larger d. This decrease in standard deviation is drastic, whereas the
increase in bias is at most a factor of 4. We therefore use 2h_{C_{αβ},opt}, which should have very
good finite sample performance.

Fig. 1. Ratio r(λ) of the increase in bias when estimating C_{αβ}(f) with λh_{C_{αβ}(f),opt}, for d = 4
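The ratio r(λ) shown in Fig. 1 is easy to tabulate; a small sketch (our code, assuming the case B_{αβ}(f) + B_{βα}(f) > 0 as in the display above):

```python
def bias_ratio(lam, d):
    """r(lambda): bias at bandwidth lambda * h_{C,opt} relative to the minimum bias
    at lambda = 1, for the case B_ab(f) + B_ba(f) > 0."""
    g = (d + 4) / 2.0
    num = g ** (2 / (d + 6)) * lam ** 2 + g ** (-(d + 4) / (d + 6)) * lam ** (-(d + 4))
    den = g ** (2 / (d + 6)) + g ** (-(d + 4) / (d + 6))
    return num / den
```

For d = 4, r(2) ≈ 3.2 while r(0.6) exceeds 10, which illustrates why oversmoothing the pilot bandwidth is much safer than undersmoothing it.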

   The estimation of the unknown functional B_{αβ} is based on blocked quartic fits of f if d ≤ 3.
For larger d, the number of parameters can be so large that for medium sample sizes blocking
or even global estimation becomes impossible. We therefore use a partial quartic fit for the
estimation of each B_{αβ}, consisting of only those fourth-order derivatives f^{(ααββ)}, β = 1, ..., d,
in equation (4.3) plus all lower order derivatives that form a subset of those. As an example,
such a partial quartic polynomial centred at zero reads

$$b_0 + \sum_{\beta=1}^{d} b_\beta X_\beta + \sum_{\beta=1}^{d} b_{\alpha\beta} X_\alpha X_\beta + \sum_{\beta=1,\, \beta \ne \alpha}^{d} b_{\beta\beta} X_\beta^2 + \sum_{\beta=1}^{d} b_{\alpha\beta\beta} X_\alpha X_\beta^2 + \sum_{\beta=1,\, \beta \ne \alpha}^{d} b_{\alpha\alpha\beta} X_\alpha^2 X_\beta + \sum_{\beta=1}^{d} b_{\alpha\alpha\beta\beta} X_\alpha^2 X_\beta^2.$$

For this partial expansion the number of parameters, k(d) = 6d − 1, grows only linearly in d. It
also includes all the terms needed to estimate f^{(ααββ)}. For d = 4, the number of parameters
drops to about a third of that of the complete expansion. The partial expansion not only allows
more blocks but also allows the blocks to focus more on the variable directions that have more
curvature. This is a new feature that is not an issue in the univariate case.
   For efficient partial quartic fits we propose to use at least 4k(d) observations in each block.
This amounts to requiring n ≥ 4k(d) = 4(6d − 1). For example, if d = 4, partial blocking
needs n ≥ 92 observations. If n < 4(6d − 1), we cannot recommend any of our methods with
confidence as there are just not enough data to estimate f accurately.
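In code (our illustration; the function names are ours), the partial parameter count and the implied minimum sample size read:

```python
def k_partial(d):
    """Number of parameters of the partial quartic expansion: 6d - 1."""
    return 6 * d - 1

def min_sample_size(d):
    """Minimum n suggested for partial blocking: at least 4 observations per parameter."""
    return 4 * k_partial(d)
```

For d = 4 this gives 23 parameters and n ≥ 92, as stated above.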
   To use standard software for the numerical optimization problem in equations (3.1), (3.2)
or (2.3), it may be convenient to reparameterize v as v = exp(u), i.e. v₁ = exp(u₁), ..., v_d =
exp(u_d), where u = (u₁, ..., u_d)^T ∈ ℝ^d, so that there is no constraint on u. The gradient and
Hessian matrices are then

$$\frac{\partial Q\{\exp(u), M, a\}}{\partial u} = \frac{1}{2} \begin{pmatrix} \exp(u_1) \sum_{\beta=1}^{d} M_{1\beta} \exp(u_\beta) \\ \vdots \\ \exp(u_d) \sum_{\beta=1}^{d} M_{d\beta} \exp(u_\beta) \end{pmatrix} - \frac{a}{2 \surd\{\exp(u_1) \cdots \exp(u_d)\}} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix},$$

$$\begin{aligned} \frac{\partial^2 Q\{\exp(u), M, a\}}{\partial u\, \partial u^T} = {} & \frac{1}{2}\, \mathrm{diag}\{\exp(u_1), \ldots, \exp(u_d)\}\, M\, \mathrm{diag}\{\exp(u_1), \ldots, \exp(u_d)\} \\ & + \frac{1}{2}\, \mathrm{diag}\Big\{ \exp(u_1) \sum_{\beta=1}^{d} M_{1\beta} \exp(u_\beta), \ldots, \exp(u_d) \sum_{\beta=1}^{d} M_{d\beta} \exp(u_\beta) \Big\} \\ & + \frac{a}{4 \surd\{\exp(u_1) \cdots \exp(u_d)\}}\, \mathbf{1}_{d \times d}, \end{aligned}$$

which make it easy to find u(M, a) = ln{v(M, a)}, the u-value that minimizes Q{exp(u), M, a}.
We then obtain the following substitute for equation (2.3):

$$h_{\mathrm{opt}} = \bigg\{ \frac{\|K\|_2^{2d}\, B(g)}{n \sigma_K^4} \bigg\}^{1/(d+4)} \exp\bigg[ \frac{u\{C(f), 1\}}{2} \bigg].$$

Similar substitutes are used to compute equations (3.1) and (3.2).
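With the gradient and Hessian available in closed form, an unconstrained Newton iteration is one simple way to minimize Q{exp(u), M, a}; the sketch below is our illustration (a damped or line-search variant may be needed for badly conditioned M).

```python
import numpy as np

def grad_hess(u, M, a):
    """Gradient and Hessian of Q{exp(u), M, a} with respect to u, in closed form."""
    v = np.exp(u)
    s = a / np.sqrt(np.prod(v))
    g = 0.5 * v * (M @ v) - 0.5 * s * np.ones_like(v)
    H = (0.5 * np.outer(v, v) * M                 # (1/2) diag(v) M diag(v)
         + 0.5 * np.diag(v * (M @ v))             # (1/2) diag{v_a * (Mv)_a}
         + 0.25 * s * np.ones((v.size, v.size)))  # rank-one penalty part
    return g, H

def minimize_Q(M, a, d, tol=1e-12, max_iter=100):
    """Newton iteration for u(M, a) = ln v(M, a); returns the minimizer v(M, a)."""
    u = np.zeros(d)
    for _ in range(max_iter):
        g, H = grad_hess(u, M, a)
        step = np.linalg.solve(H, g)
        u = u - step
        if np.max(np.abs(step)) < tol:
            break
    return np.exp(u)
```

For M = 2I₂ and a = 1 the first-order condition gives v_α = (1/2)^{1/3} in each coordinate, which the iteration reproduces.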

6.     Simulation results
In this section we investigate the finite sample performance of the ROT bandwidths hROT
           Table 1. Model 1: mean of the ISE for various bandwidth selectors

             Bandwidth            Kind                Means for the following sample sizes:
                                                         100           250          500

             Diagonal       Asymptotic optimal          0.0326       0.0165       0.00893
                            PI                          0.0409       0.0183       0.00931
                            ROT                         0.0656       0.0233       0.0112
             Scalar         Asymptotic optimal          0.0338       0.0174       0.00934
                            PI                          0.0401       0.0180       0.00928
                            ROT                         0.0734       0.0244       0.0113

           Table 2. Model 2: mean of the ISE for various bandwidth selectors

             Bandwidth            Kind                Means for the following sample sizes:
                                                         100           250          500

             Diagonal       Asymptotic optimal          0.0505       0.0246        0.0140
                            PI                          0.0523       0.0248        0.0139
                            ROT                         0.0752       0.0301        0.0168
             Scalar         Asymptotic optimal          0.0505       0.0246        0.0140
                            PI                          0.0503       0.0240        0.0136
                            ROT                         0.0824       0.0314        0.0175

           Table 3. Model 3: mean of the ISE for various bandwidth selectors

             Bandwidth            Kind                Means for the following sample sizes:
                                                         100           250          500

             Diagonal       Asymptotic optimal          0.0704       0.0327        0.0186
                            PI                          0.0711       0.0334        0.0185
                            ROT                         0.114        0.0469        0.0242
             Scalar         Asymptotic optimal          0.0760       0.0361        0.0208
                            PI                          0.0747       0.0354        0.0202
                            ROT                         0.137        0.0560        0.0272

           Table 4. Model 4: mean of the ISE for various bandwidth selectors

             Bandwidth            Kind                Means for the following sample sizes:
                                                         100           250          500

             Diagonal       Asymptotic optimal          0.0765       0.0344        0.0198
                            PI                          0.0873       0.0378        0.0219
                            ROT                         0.230        0.0525        0.0268
             Scalar         Asymptotic optimal          0.147        0.0604        0.0349
                            PI                          0.105        0.0536        0.0337
                            ROT                         0.282        0.0823        0.0431

given by equation (3.1) and hROT given by equation (A.2), the PI bandwidths hPI given by
equation (3.2) and hPI given by equation (A.3) for the local linear estimation of f(x). These
results are compared with those obtained by using the asymptotic optimal bandwidths hopt as
in equation (2.3) and hopt as in equation (A.1), which we can compute since f(x), g(x) and p(x)
are explicitly given. For each pseudodata set generated, we compute the ISE
           Table 5. Model 5: mean of the ISE for various bandwidth selectors

               Bandwidth           Kind               Means for the following sample sizes:
                                                          100            250         500

               Diagonal     Asymptotic optimal          0.0662          0.0324      0.0193
                            PI                          0.142           0.0573      0.0297
                            ROT                         0.306           0.100       0.0466
               Scalar       Asymptotic optimal          0.150           0.0716      0.0425
                            PI                          0.145           0.0692      0.0410
                            ROT                         0.338           0.116       0.0606

           Table 6. Model 6: mean of the ISE for various bandwidth selectors

               Bandwidth           Kind                 Means for the following sample sizes:
                                                                250                500

               Diagonal    Asymptotic optimal                   0.459             0.291
                           PI                                   0.363             0.225
                           ROT                                  0.641             0.502
               Scalar      Asymptotic optimal                   0.791             0.502
                           PI                                   0.447             0.308
                           ROT                                  0.793             0.747

(1/n) Σ_{i=1}^n {f̂_h(U_i) − f(U_i)}^2
for the diagonal bandwidths h = hopt, hPI, hROT and scalar bandwidths h = hopt, hPI, hROT,
where the U_i are taken from an equally spaced grid of roughly 2500 points. In total we consider
six models of varying complexity and in all cases 100 replications are conducted. For all the
models the random design matrix X is an independent sample from the uniform distribution
on [0, 1]^d with sample size n, and for the estimation of C(f) we screen off observations X_i that
are outside [a, 1 − a]^d where a = (1 − 0.9^{1/d})/2. The sample size n is 100, 250 or 500 except
for model 6 where n = 250 and n = 500. The models are

Y = f(X) + ε,    ε ~ N(0, 0.25),
where the regression function is

f(x) = x_1^2 + x_2^4

for model 1,
f(x) = sin{π(x_1 + x_2)}
for model 2,
f(x) = sin(πx_1) sin(2πx_2)
for model 3,
f(x) = {sin(πx_1) + sin(4πx_2)}/2
for model 4,

Fig. 2. Density of the ISE for (a) model 1, (b) model 2, (c) model 3, (d) model 4, (e) model 5 and (f) model 6
(———, hopt; - - - - -, hPI; – – –, hROT; n = 500)

f(x) = {(x_1 − 0.5)^2 + x_2} sin(2πx_3)

for model 5 and
f(x) = sin{π(x_1 + 0.5x_2 + 2x_3 + 0.5x_4)}
for model 6.
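To make the simulation design concrete, the following Python sketch (an independent reimplementation, not the authors' code) draws a sample from model 1, computes a multivariate local linear fit with a product Gaussian kernel, and evaluates the ISE over a grid of 2500 points; the scalar bandwidth h = 0.2 and the random seed are arbitrary illustrative choices.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate of f(x0): weighted least squares on [1, X - x0]
    with product Gaussian kernel weights and scalar bandwidth h."""
    U = (X - x0) / h
    w = np.exp(-0.5 * (U ** 2).sum(axis=1))      # product kernel weights
    Z = np.hstack([np.ones((len(X), 1)), X - x0])
    ZtW = Z.T * w
    beta = np.linalg.solve(ZtW @ Z, ZtW @ Y)
    return beta[0]                               # intercept estimates f(x0)

rng = np.random.default_rng(0)
n, d, h = 250, 2, 0.2                            # h = 0.2 is an arbitrary bandwidth
X = rng.uniform(0.0, 1.0, size=(n, d))           # uniform design on [0, 1]^d
f = lambda x: x[:, 0] ** 2 + x[:, 1] ** 4        # model 1
Y = f(X) + rng.normal(0.0, 0.5, size=n)          # N(0, 0.25), reading 0.25 as variance

g = np.linspace(0.05, 0.95, 50)
U = np.array([(u1, u2) for u1 in g for u2 in g]) # grid of 2500 points
fhat = np.array([local_linear(u, X, Y, h) for u in U])
ise = np.mean((fhat - f(U)) ** 2)
print(ise)
```

For a diagonal bandwidth one would instead divide each coordinate of X − x0 by its own component of the bandwidth vector before forming the weights.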
  Table 1 reports in its first three rows the mean of the ISE based on hopt, hPI, hROT for each
sample size of model 1. The last three rows present the corresponding results for the scalar
bandwidths hopt, hPI and hROT. Similarly, Tables 2–6 display the results for models 2–6. In
addition, Fig. 2 presents plots of the densities of the ISE based on hopt (full curve), hPI (short
broken curve) and hROT (long broken curve) for all models, where the sample size is 500.
Inspecting Tables 1–6 and Fig. 2, we find that the PI bandwidth is always superior to the
ROT bandwidth as it delivers function estimates that have smaller ISEs. Furthermore, the
mean of the ISE declines with sample size as expected.
  If we use C(f) to measure complexity, the two-dimensional models 1–4 are of increasing
complexity with C(f) = 48.8, 194.8, 608.8, 3129.3 respectively. Since the ROT bandwidth is
based on piecewise quartic fits, we might expect the ROT function estimates to perform worse

Fig. 3. Function f(.) of model 3: (a) true function; (b) estimate with hPI, n = 100; (c) estimate with hROT, n = 100;
(d) estimate with hopt, n = 500; (e) estimate with hPI, n = 500; (f) estimate with hROT, n = 500

Fig. 4. Densities of bandwidth ratios for model 3 (- - - - -, PI; – – –, ROT): (a) first elements of hPI/hopt and
hROT/hopt, n = 100; (b) second elements of hPI/hopt and hROT/hopt, n = 100; (c) scalar ratios hPI/hopt and
hROT/hopt, n = 100; (d) first elements of hPI/hopt and hROT/hopt, n = 500; (e) second elements of hPI/hopt and
hROT/hopt, n = 500; (f) scalar ratios hPI/hopt and hROT/hopt, n = 500

than the PI estimates for complex models that are far from being piecewise quartic. This is
confirmed by Tables 1–4 and the ISE densities in Figs 2(a)–2(d). Fig. 3 shows the true
function of model 3 and its estimates for one replication based on hPI and hROT, n = 100, or
hopt, hPI and hROT, n = 500.
  For higher dimensions, the advantage of PI over ROT estimates becomes more significant.
For the three-dimensional model 5 and the four-dimensional model 6, Tables 5 and 6 show a
reduction in MISE of up to 60%. The ISE densities are plotted in Figs 2(e) and 2(f).
  For models 1 and 2, there is practically no difference between the diagonal and scalar
bandwidths, whereas for model 3 the diagonal has only a slight advantage. This is due to the
form of f, which does not require different amounts of smoothing in each direction. For
models 4–6, the improvement from the diagonal bandwidth becomes significant as the function
f requires very different amounts of smoothing in the various directions. Overall, we clearly
prefer the diagonal to the scalar bandwidth.
  Although the function estimate is the ultimate goal, it is also informative to analyse how well
the PI and ROT bandwidths approximate the asymptotically optimal bandwidth hopt. When
comparing the densities of ratios, we find that hPI/hopt is always to the right of hROT/hopt. The

ratio for the PI bandwidth is always closer to 1 except for a few cases where the ROT ratio is
also close to 1. The PI ratio hPI/hopt also approaches 1 faster than the ROT ratio hROT/hopt as
the sample size n increases. All these conclusions can be drawn from inspecting the density
plots. Fig. 4 presents the density plots for the complicated model 3 based on 100 (Figs 4(a)–
4(c)) and 500 (Figs 4(d)–4(f)) observations. The densities of the first and second elements of
hPI/hopt and hROT/hopt are depicted in Figs 4(a), 4(b), 4(d) and 4(e). Figs 4(c) and 4(f) show
the densities of the scalar ratios hPI/hopt and hROT/hopt. Increasing complexity or dimension
makes the bandwidth estimation more difficult.

7.   Conclusion
We have shown the existence of both a diagonal and a scalar optimal bandwidth for
multivariate regression and have proposed both an ROT and a PI bandwidth selector. The
ROT method is simple to use but may perform poorly because of its inconsistency. The PI
method is achieved by using partial local cubic regression to estimate the unknown func-
tionals of second derivatives. The partial local cubic expansion does not need many terms and
has simple bias and variance formulae. The pilot bandwidths for the functional estimation
are determined by blocked partial quartic fits. In Monte Carlo simulations we investigated
the performance of these automatic bandwidth selectors for models of up to four dimensions
and various sample sizes. The diagonal PI method is found to be the best in terms of the ISE,
although the ROT method can be useful if the model is not too complicated. Furthermore, a
bandwidth vector is always preferable to a scalar bandwidth. These conclusions are observed
with a sample size as small as 100.
   The results of this paper can be generalized to autoregressive time series under conditions
that lead to geometric mixing; see, for instance, Härdle et al. (1998) for such conditions. The
scalar plug-in bandwidth was used to obtain a data-driven asymptotic final prediction error
for the non-linear time series lag selection developed by Tschernig and Yang (1999).

Acknowledgements
The authors thank Rong Chen and Michael Neumann for their many helpful discussions. The
very insightful comments from the referees and the Joint Editor are also gratefully acknow-
ledged. These comments stimulated a substantial extension of the method. This work was
supported by Sonderforschungsbereich 373 `Quantifikation und Simulation Ökonomischer
Prozesse' of the Deutsche Forschungsgemeinschaft, at Humboldt-Universität zu Berlin.

Appendix A
We demonstrate that a scalar bandwidth h can be used instead of a bandwidth vector h = (h_1, . . ., h_d)
for estimating f(x). Estimation based on a single bandwidth h achieves the same convergence rate as
multiple bandwidths but may be less efficient when a different amount of smoothing is appropriate in
each variable direction. However, a single bandwidth is more easily computed.
  When using a single bandwidth h to estimate the function f(x), the AMISE is the following function
of h:

AMISE(h) = (σ_K^4/4) C(f) h^4 + ‖K‖_2^{2d} B(g)/(n h^d)

where C(f) = Σ_{α,β=1}^d C_{αβ}(f). The scalar bandwidth h that minimizes AMISE(h) is

hopt = {d ‖K‖_2^{2d} B(g)/(n σ_K^4 C(f))}^{1/(d+4)}.    (A.1)
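As a quick numerical sanity check, the closed-form minimizer of AMISE(h) = (σ_K^4/4)C(f)h^4 + ‖K‖_2^{2d}B(g)/(nh^d) can be compared with a grid minimization; the constants below are arbitrary illustrative values, not quantities from the paper.

```python
import numpy as np

# Illustrative constants (assumptions, for checking the algebra only)
d, n = 2, 250
sK4 = 1.0            # sigma_K^4
Kn = 0.2821 ** d     # ||K||_2^{2d}; 0.2821 ~ value of R(K) for a Gaussian kernel
B, C = 0.9, 48.8     # stand-ins for B(g) and C(f)

def amise(h):
    return sK4 / 4.0 * C * h ** 4 + Kn * B / (n * h ** d)

# closed-form scalar optimal bandwidth
h_opt = (d * Kn * B / (n * sK4 * C)) ** (1.0 / (d + 4))

# numerical minimization on a fine grid
grid = np.linspace(0.01, 1.0, 100000)
h_num = grid[np.argmin(amise(grid))]
print(h_opt, h_num)
```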

  Denote Ĉ(f, h) = Σ_{α,β=1}^d Ĉ_{αβ}(f, h); then we have the following theorem.

  Theorem 5. Under assumptions (1)–(4) in Appendix B, as h → 0 and n h^{d+4} → ∞, we have

Ĉ(f, h) = C(f) + 2h^2 Σ_{α,β=1}^d B_{αβ}(f) + B(g) Σ_{α=1}^d V_α(K)/(n h^{d+4}) + ξ + O_p(h^4) + o_p{1/(n √(h^{d+8}))}

in which

n √(h^{d+8}) ξ →_D N{0, σ^2(g, K)},

σ^2(g, K) = [32 ∫ g^4(x) w^2(x) dx/{σ_K^{16} (μ_K − 1)^4}] ∫ {Σ_{α=1}^d F_α(t)}^2 dt.

  The optimal bandwidth to estimate C(f) by Ĉ(f, h) is

hC(f),opt = [−B(g) Σ_{α=1}^d V_α(K)/{2n Σ_{α,β=1}^d B_{αβ}(f)}]^{1/(d+6)}    if Σ_{α,β=1}^d B_{αβ}(f) < 0,

hC(f),opt = [(d + 4) B(g) Σ_{α=1}^d V_α(K)/{4n Σ_{α,β=1}^d B_{αβ}(f)}]^{1/(d+6)}    if Σ_{α,β=1}^d B_{αβ}(f) > 0.
  By using the formulae in theorem 5, we can obtain both the ROT and the PI bandwidth

hROT = [d ‖K‖_2^{2d} BROT(g)/{n σ_K^4 Σ_{α,β=1}^d CROT,αβ(f)}]^{1/(d+4)},    (A.2)

hPI = [d ‖K‖_2^{2d} B(g)/{n σ_K^4 Ĉ(f)}]^{1/(d+4)}    (A.3)

in which BROT(g) and CROT,αβ(f) are the same as used in equation (3.1). It is easy to verify that
hPI = hopt{1 + O_p(n^{−2/(d+6)})}.

Appendix B
We denote by S the support of w(x), and all integrals in variable x are over S. The following assump-
tions are needed to prove theorems 1, 3 and 4.
  (a) The density p(x) of X exists, is continuously differentiable up to order 2 and is bounded below
      away from 0 on S (assumption 1).
  (b) The function f(x) is continuously differentiable on S up to order 4 whereas g(x) is continuous on
      S and positive on an open subset of S (assumption 2).
  (c) The bandwidth h = h_n is a positive number depending on n such that h → 0 and n h^{d+4} → ∞
      (assumption 3).
  Denote by M_+(d) the set of all non-negative definite matrices M that are positive definite for non-
negative vectors, i.e.

M_+(d) = {M: M non-negative definite and inf_{h ∈ S_+^{d−1}} (h^T M h) = c(M) > 0}    (B.1)

where S_+^{d−1} = S^{d−1} ∩ R_+^d denotes the portion of the (d − 1)-dimensional unit sphere S^{d−1} with non-
negative co-ordinates. For the diagonal bandwidth to exist, we assume that the matrix C(f) ∈ M_+(d)
(assumption 4).
  Assumption 4 is not trivial. Take the function f(x_1, x_2) = exp(x_1) sin(πx_2); together with p(x) equal to
the uniform density on [0, 1]^2 and w(x) = p(x), we have

C(f) = ∫_{[0,1]^2} exp(2x_1) sin^2(πx_2) dx [1, −π^2; −π^2, π^4] = ¼{exp(2) − 1} [1, −π^2; −π^2, π^4]

and therefore (π^2, 1) C(f) (π^2, 1)^T = 0 even though (π^2, 1)^T ∈ R_+^2.
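This singularity is easy to verify numerically (assuming, as the displayed form of C(f) suggests, that C_{αβ}(f) is the weighted integral of f_αα f_ββ):

```python
import numpy as np
from numpy import pi, exp

def trap(y, dx):   # simple trapezoidal rule
    return ((y[0] + y[-1]) / 2.0 + y[1:-1].sum()) * dx

q = np.linspace(0.0, 1.0, 2001)
dx = q[1] - q[0]
I1 = trap(exp(2 * q), dx)            # integral of exp(2*x1) over [0, 1]
I2 = trap(np.sin(pi * q) ** 2, dx)   # integral of sin^2(pi*x2) over [0, 1]
c = I1 * I2                          # ~ {exp(2) - 1}/4
C = c * np.array([[1.0, -pi ** 2], [-pi ** 2, pi ** 4]])
v = np.array([pi ** 2, 1.0])
quad_form = v @ C @ v                # vanishes although v has positive entries
print(c, quad_form)
```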
  For the proof of theorem 1, we also need lemmas 1 and 2 and theorem 6.
  Lemma 1.

∂Q(v, M, a)/∂v = (1/2)(M_1 v, . . ., M_d v)^T − [a/{2√(v_1 . . . v_d)}](v_1^{−1}, . . ., v_d^{−1})^T,    (B.2)

∂^2 Q(v, M, a)/∂v ∂v^T = (1/2)M + [a/{4√(v_1 . . . v_d)}] ×

[3v_1^{−2},        v_1^{−1}v_2^{−1}, . . ., v_1^{−1}v_d^{−1};
 v_2^{−1}v_1^{−1}, 3v_2^{−2},        . . ., v_2^{−1}v_d^{−1};
 . . .
 v_d^{−1}v_1^{−1}, v_d^{−1}v_2^{−1}, . . ., 3v_d^{−2}].    (B.3)
  Proof. Direct matrix calculation plus definition (2.1) yield the results.
  Lemma 2. For any fixed M ∈ M_+(d) and a > 0, Q(v, M, a) is a strictly convex function of v ∈ R_+^d and
therefore can be minimized by at most one vector v in R_+^d.

  Proof. Both terms in equation (B.3) are non-negative definite, whereas the second term is positive
definite, so the Hessian of Q(v, M, a) with respect to v is positive definite.
  Theorem 6. For any fixed M ∈ M_+(d) and a > 0, Q(v, M, a) has a unique minimum vector vopt =
v(M, a), which is infinitely continuously differentiable in both M and a.
  Proof. M ∈ M_+(d) implies that

Q(v, M, a) ≥ (1/4) Σ_{α,β=1}^d M_{αβ} v_α v_β ≥ (1/4) c(M) Σ_{α=1}^d v_α^2 = (1/4) c(M) ‖v‖^2

for all v ∈ R_+^d and Euclidean norm ‖v‖ = (Σ_{α=1}^d v_α^2)^{1/2}, so

lim_{‖v‖→∞} Q(v, M, a) ≥ (1/4) lim_{‖v‖→∞} c(M)‖v‖^2 = ∞,

while it is also clear that we always have

lim_{‖v‖→0} Q(v, M, a) ≥ lim_{‖v‖→0} a/√(v_1 . . . v_d) = ∞.

Thus the infimum of Q(v, M, a) is reached at a v whose norm is bounded away from ∞ and 0, and by
the previous lemma this v is unique. Equation (B.2) and the implicit function theorem show that v(M, a)
is an infinitely continuously differentiable function of M and a.
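The uniqueness statement can be illustrated numerically: under the same assumed form Q(v, M, a) = v^T M v/4 + a/√(v_1 ⋯ v_d), gradient descent started from two very different positive vectors reaches the same minimizer.

```python
import numpy as np

def grad(v, M, a):
    return 0.5 * M @ v - a / (2.0 * np.sqrt(np.prod(v))) / v

def minimize(v0, M, a, lr=0.01, steps=20000):
    v = v0.astype(float).copy()
    for _ in range(steps):
        v = np.clip(v - lr * grad(v, M, a), 1e-3, None)  # keep v positive
    return v

M = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite example
a = 0.5
v1 = minimize(np.array([0.2, 0.2]), M, a)
v2 = minimize(np.array([3.0, 0.5]), M, a)
print(v1, v2)
```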
   To prove theorem 3, we need the dispersion matrix of the partial local cubic estimator.
   Lemma 3. Let Z be as in equation (4.1) and

H = diag(1, h^{−2} I_{2d−1}, h^{−1} I_d, h^{−3} I_{2d−2}, h^{−3}).

As n → ∞,

H^T Z^T W Z H = diag[{1, σ_K^2 1_{1×d}; σ_K^2 1_{d×1}, σ_K^4((μ_K − 1)I_d + 1_{d×d})}, σ_K^4 I_{d−1}, D]{p(x) I_{5d−1} + o_p(1)},    (B.4)

(H^T Z^T W Z H)^{−1} = diag[{(μ_K − 1 + d)(μ_K − 1)^{−1}, −σ_K^{−2}(μ_K − 1)^{−1} 1_{1×d}; −σ_K^{−2}(μ_K − 1)^{−1} 1_{d×1}, σ_K^{−4}(μ_K − 1)^{−1} I_d}, σ_K^{−4} I_{d−1}, D^{−1}]{I_{5d−1}/p(x) + o_p(1)}    (B.5)

uniformly in a compact neighbourhood of x. Here

D = [σ_K^2 I_d, A, B, C;
     A^T, σ_K^6 I_{d−1}, 0_{(d−1)×(d−1)}, 0_{(d−1)×1};
     B^T, 0_{(d−1)×(d−1)}, σ_K^6{(μ_K − 1)I_{d−1} + 1_{(d−1)×(d−1)}}, σ_K^6 1_{(d−1)×1};
     C^T, 0_{1×(d−1)}, σ_K^6 1_{1×(d−1)}, μ_6(K)]

in which

A = σ_K^4 [I_{α−1}, 0_{(α−1)×(d−α)}; 0_{1×(α−1)}, 0_{1×(d−α)}; 0_{(d−α)×(α−1)}, I_{d−α}],

B = σ_K^4 [0_{(α−1)×(d−1)}; 1_{1×(d−1)}; 0_{(d−α)×(d−1)}],

C = σ_K^4 [0_{(α−1)×1}; 1; 0_{(d−α)×1}].

  Proof. Equation (B.4) follows by the usual array-type central limit theorem, K being a symmetric
compact product kernel; see, for example, Wand and Jones (1995). Equation (B.5) follows by direct
verification. The exact inverse of D can be obtained by using the tools in Lütkepohl (1996).
  Proof of theorem 3. Given any α = 1, . . ., d and denoting Z_α by Z, we first note that

e^T H(H^T Z^T W Z H)^{−1} H^T Z^T W Z e_0 f(x) = e^T (Z^T W Z)^{−1} Z^T W Z e_0 f(x) = 0,

e^T H(H^T Z^T W Z H)^{−1} H^T Z^T W Z e f_αα(x) = f_αα(x),

whereas for any η = 1, . . ., 5d − 1 with e_η ≠ e,

e^T H(H^T Z^T W Z H)^{−1} H^T Z^T W Z e_η = e^T (Z^T W Z)^{−1} Z^T W Z e_η = 0.

Combining these with equation (B.5) of lemma 3, we have

f̂_αα(x) − f_αα(x) = 2e^T H(H^T Z^T W Z H)^{−1} H^T Z^T W Y − f_αα(x) = T_1 + T_2 + T_3

where

T_1 = [{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) n h^2}] Σ_{i=1}^n K_h(X_i − x){(X_iα − x_α)^2/h^2 − σ_K^2}[f(X_i) − f(x) − (X_i − x)^T ∇f(x) − (1/2)(X_i − x)^T ∇^2 f(x)(X_i − x) − (1/3!) f_ααα(x)(X_iα − x_α)^3 − Σ_{ν≠α} (3/3!) f_ααν(x)(X_iα − x_α)^2 (X_iν − x_ν) − Σ_{ν≠α} (3/3!) f_ανν(x)(X_iα − x_α)(X_iν − x_ν)^2],
T_2 = −[{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) n h^2}] Σ_{i=1}^n K_h(X_i − x){(X_iα − x_α)^2/h^2 − σ_K^2} Σ_{μ<ν, μ≠α, ν≠α} (X_iμ − x_μ)(X_iν − x_ν) f_μν(x),

T_3 = [{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) n h^2}] Σ_{i=1}^n K_h(X_i − x){(X_iα − x_α)^2/h^2 − σ_K^2} g(X_i) ε_i.

   The asymptotic variance of T_3 is calculated as

[{4 + o(1)}/{σ_K^8 (μ_K − 1)^2 p(x)^2 n h^4}] ∫ K_h(u − x)^2 {(u_α − x_α)^2/h^2 − σ_K^2}^2 g^2(u) p(u) du

which (using u = x + hv)

= [{4 + o(1)}/{σ_K^8 (μ_K − 1)^2 p(x)^2 n h^{d+4}}] ∫ K(v)^2 (v_α^2 − σ_K^2)^2 g^2(x + hv) p(x + hv) dv

= [4{μ_4(K^2) − 2σ_K^2 μ_2(K^2) + ‖K‖_2^2 σ_K^4} ‖K‖_2^{2(d−1)} g^2(x)/{σ_K^8 (μ_K − 1)^2 p(x) n h^{d+4}}]{1 + o(1)} = σ^2(x) + o(h^2)    (B.6)

with σ^2(x) as given in equation (4.4).
  A similar procedure applied to T_1 yields

T_1 = [{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) h^2}] ∫ K_h(u − x){(u_α − x_α)^2/h^2 − σ_K^2}[f(u) − f(x) − (u − x)^T ∇f(x) − (1/2)(u − x)^T ∇^2 f(x)(u − x) − (1/3!) f_ααα(x)(u_α − x_α)^3 − Σ_{ν≠α} (3/3!) f_ανν(x)(u_α − x_α)(u_ν − x_ν)^2 − Σ_{ν≠α} (3/3!) f_ααν(x)(u_α − x_α)^2(u_ν − x_ν)] p(u) du

= [{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) h^2}] ∫ K(v)(v_α^2 − σ_K^2)[f(x + hv) − f(x) − h v^T ∇f(x) − (h^2/2) v^T ∇^2 f(x) v − (h^3/3!) f_ααα(x) v_α^3 − Σ_{ν≠α} (h^3/2) f_ααν(x) v_α^2 v_ν − Σ_{ν≠α} (h^3/2) f_ανν(x) v_α v_ν^2] p(x + hv) dv

= [h^2 ∫ K(v)(v_α^2 − σ_K^2) v_α^2 v_ν^2 dv/{σ_K^4 (μ_K − 1) p(x)}][Σ_{μ<ν, μ≠α≠ν} {f_··(x) p_·(x) + f_··(x) p_·(x)} + Σ f_··(x) p(x)/2]

+ [h^2 ∫ K(v)(v_α^2 − σ_K^2) v_α^4 dv/{σ_K^4 (μ_K − 1) p(x)}][Σ_{ν≠α} f_··(x) p_·(x)/3 + Σ_{ν=1}^d f_··(x) p(x)/12] + o(h^2).
We note then that

∫ K(v)(v_α^2 − σ_K^2) v_μ^2 v_ν^2 dv = 0 for μ < ν, μ ≠ α, ν ≠ α;  = (μ_K − 1)σ_K^6 for ν ≠ μ = α,    (B.7)

∫ K(v)(v_α^2 − σ_K^2) v_ν^4 dv = 0 for ν ≠ α;  = μ_6(K) − μ_K σ_K^6 for ν = α.
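These kernel moment identities can be checked for the standard Gaussian product kernel, for which σ_K^2 = 1, μ_K = μ_4(K)/σ_K^4 = 3 and μ_6(K) = 15; the check uses E[Z^{2k}] = (2k − 1)!! for Z ~ N(0, 1).

```python
# Moment check for the standard Gaussian product kernel: the coordinates are
# independent, so each mixed moment factorizes into one-dimensional moments.
def m(k):  # E[Z^(2k)] for Z ~ N(0, 1): the double factorial (2k - 1)!!
    out = 1
    for j in range(1, 2 * k, 2):
        out *= j
    return out

lhs_mixed = (m(2) - m(1)) * m(1)     # E{(Z_a^2 - 1) Z_a^2} E{Z_n^2}, n != a
lhs_zero = (m(1) - 1) * m(1) * m(1)  # E{Z_m^2 - 1} E{Z_a^2} E{Z_n^2}, all distinct
lhs_quartic = m(3) - m(2)            # E{(Z_a^2 - 1) Z_a^4}
print(lhs_mixed, lhs_zero, lhs_quartic)
```

The three values should equal (μ_K − 1)σ_K^6 = 2, 0 and μ_6(K) − μ_K σ_K^6 = 12 respectively.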


Hence

T_1 = [h^2 (μ_K − 1) σ_K^6/{σ_K^4 (μ_K − 1) p(x)}] Σ_{ν≠α} f_··(x) p(x)/2 + [h^2 {μ_6(K) − μ_K σ_K^6}/{σ_K^4 (μ_K − 1) p(x)}] f_··(x) p(x)/12 + o(h^2)

= b_α(x) h^2 + o(h^2),    (B.8)

where b_α(x) is as given in equation (4.3).
  The term T_2 is treated as

T_2 = −[{2 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x) n}] Σ_{i=1}^n K_h(X_i − x){(X_iα − x_α)^2/h^2 − σ_K^2} Σ_{μ<ν, μ≠α≠ν} {(X_iμ − x_μ)/h}{(X_iν − x_ν)/h} f_μν(x)

= −Σ_{μ<ν, μ≠α≠ν} [2 f_μν(x){1 + o_p(1)}/{σ_K^4 (μ_K − 1) p(x)}] ∫ K_h(u − x){(u_α − x_α)^2/h^2 − σ_K^2}{(u_μ − x_μ)/h}{(u_ν − x_ν)/h} p(u) du

= −Σ_{μ<ν, μ≠α≠ν} [2 f_μν(x)/{σ_K^4 (μ_K − 1) p(x)}] ∫ K(v)(v_α^2 − σ_K^2) v_μ v_ν p(x + hv) dv {1 + o_p(1)}

= −2h^2 Σ_{μ<ν, μ≠α≠ν} [∫ K(v)(v_α^2 − σ_K^2) v_μ^2 v_ν^2 dv f_μν(x) p_·(x)/{σ_K^4 (μ_K − 1) p(x)}]{1 + o_p(1)} = o(h^2)    (B.9)

since ∫ K(v)(v_α^2 − σ_K^2) v_μ^2 v_ν^2 dv = 0 for μ < ν, μ ≠ α ≠ ν, by equation (B.7). Now equations (B.8), (B.9)
and (B.6) together prove theorem 3.

     Proof of theorem 4. Since
\[
\hat f_{\alpha\alpha}(x)=f_{\alpha\alpha}(x)+\{b_\alpha(x)h^2+T_{3,\alpha}\}\Big[1+o_p\Big\{h^2+\frac{1}{\surd(nh^{d+4})}\Big\}\Big],
\]
where
\[
T_{3,\alpha}=\frac{2}{\sigma_K^4(\mu_K-1)\,p(x)\,nh^2}\sum_{i=1}^n K_h(X_i-x)\Big\{\frac{(X_{i\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\,g(X_i)\,\varepsilon_i,
\]
we have
\[
\begin{aligned}
C_{\alpha\beta}(f,h)&=\int \hat f_{\alpha\alpha}(x)\,\hat f_{\beta\beta}(x)\,w(x)\,p(x)\,dx+O_p(n^{-1/2})\\
&=C_{\alpha\beta}(f)+h^2\{B_\alpha(f)+B_\beta(f)\}+\int T_{3,\alpha}\,T_{3,\beta}\,w(x)\,p(x)\,dx+o_p\Big\{h^4+\frac{1}{\surd(nh^{d+4})}\Big\}. \qquad (B.10)
\end{aligned}
\]
Thus it remains to compute the term $\int T_{3,\alpha}\,T_{3,\beta}\,w(x)\,p(x)\,dx$, which has a simple decomposition as
\[
\sum_{1\le i<j\le n}2H_{\alpha\beta}(X_i,X_j)\,\varepsilon_i\varepsilon_j+\sum_{1\le i\le n}H_{\alpha\beta}(X_i,X_i)\,\varepsilon_i^2,
\]
\[
H_{\alpha\beta}(X_i,X_j)=\int\frac{4\,K_h(X_i-x)\,K_h(X_j-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,p(x)\,n^2h^4}
\Big\{\frac{(X_{i\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(X_{j\beta}-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g(X_i)\,g(X_j)\,dx.
\]
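This decomposition is the generic split of a quadratic form in the errors into its off-diagonal (U-statistic) part and its diagonal part. The following toy check uses an arbitrary symmetric matrix standing in for the kernel $H_{\alpha\beta}$:

```python
import numpy as np

# Generic identity behind the split above: for symmetric H,
#   sum_{i,j} H(X_i, X_j) e_i e_j
#     = sum_{i<j} 2 H(X_i, X_j) e_i e_j + sum_i H(X_i, X_i) e_i^2.
# Toy check with a random symmetric matrix in place of the kernel H.
rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
H = (A + A.T) / 2             # symmetric stand-in for H(X_i, X_j)
e = rng.standard_normal(n)    # stand-in for the errors eps_i

full = e @ H @ e
off_diag = 2 * sum(H[i, j] * e[i] * e[j]
                   for i in range(n) for j in range(i + 1, n))
diag = np.sum(np.diag(H) * e**2)
print(np.isclose(full, off_diag + diag))  # True
```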

Note first that
\[
\begin{aligned}
E\{H_{\alpha\beta}(X_1,X_1)\varepsilon_1^2\}
&=E\Big[\int\frac{4K_h^2(X_1-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,p(x)\,n^2h^4}
\Big\{\frac{(X_{1\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(X_{1\beta}-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g^2(X_1)\,dx\Big]\\
&=\int\!\!\int\frac{4K_h^2(y-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,p(x)\,n^2h^4}
\Big\{\frac{(y_\alpha-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(y_\beta-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g^2(y)\,p(y)\,dy\,dx\\
&\overset{y-x=hu}{=}\int\!\!\int\frac{4w(x)}{\sigma_K^8(\mu_K-1)^2\,p(x)\,n^2h^{d+4}}
\,K^2(u)\,(u_\alpha^2-\sigma_K^2)(u_\beta^2-\sigma_K^2)\,g^2(x+hu)\,p(x+hu)\,dx\,du\\
&=\frac{4\int g^2(x)\,w(x)\,dx\int K^2(u)\,(u_\alpha^2u_\beta^2-\sigma_K^2u_\alpha^2-\sigma_K^2u_\beta^2+\sigma_K^4)\,du}
{\sigma_K^8(\mu_K-1)^2\,n^2h^{d+4}}\{1+o(1)\}\\
&=\frac{B(g)\,V_{\alpha\beta}(K)}{n^2h^{d+4}}\{1+o(1)\}
\end{aligned}
\]
and similarly that
\[
E\{H_{\alpha\beta}(X_1,X_1)^2\varepsilon_1^4\}=O\Big(\frac{1}{n^4h^{8+3d}}\Big).
\]
Hence by an array-type central limit theorem
\[
\sum_{1\le i\le n}H_{\alpha\beta}(X_i,X_i)\,\varepsilon_i^2
=\frac{B(g)\,V_{\alpha\beta}(K)}{nh^{d+4}}+O_p\Big\{\frac{1}{\surd(n^3h^{8+3d})}\Big\}. \qquad (B.11)
\]

  Meanwhile, using a central limit theorem for non-degenerate U-statistics as contained in Hall (1984),
one can verify that
\[
\sum_{1\le i<j\le n}2H_{\alpha\beta}(X_i,X_j)\,\varepsilon_i\varepsilon_j
\]
is asymptotically normal with mean 0 and asymptotic variance
\[
\begin{aligned}
\tfrac{1}{2}n^2\,&E\{4H_{\alpha\beta}(X_1,X_2)^2\varepsilon_1^2\varepsilon_2^2\}
=2n^2E\{H_{\alpha\beta}(X_1,X_2)^2\varepsilon_1^2\varepsilon_2^2\}\\
&=2n^2E\Big[\Big\{\int\frac{4K_h(X_1-x)\,K_h(X_2-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,p(x)\,n^2h^4}
\Big\{\frac{(X_{1\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(X_{2\beta}-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g(X_1)\,g(X_2)\,dx\Big\}^2\Big],
\end{aligned}
\]
which is
\[
\begin{aligned}
2n^2&E\Big[\Big\{\int\frac{4K_h(X_1-x)\,K_h(X_2-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,n^2h^4\,p(x)}
\Big\{\frac{(X_{1\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(X_{2\beta}-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g(X_1)\,g(X_2)\,dx\Big\}^2\Big]\\
&=2n^2\int\!\!\int\Big\{\int\frac{4K_h(y_1-x)\,K_h(y_2-x)\,w(x)}{\sigma_K^8(\mu_K-1)^2\,n^2h^4\,p(x)}
\Big\{\frac{(y_{1\alpha}-x_\alpha)^2}{h^2}-\sigma_K^2\Big\}\Big\{\frac{(y_{2\beta}-x_\beta)^2}{h^2}-\sigma_K^2\Big\}\,g(y_1)\,g(y_2)\,dx\Big\}^2\\
&\qquad\times p(y_1)\,p(y_2)\,dy_1\,dy_2\\
&=\int\cdots\int\frac{32\,K_h(y_1-x_1)\,K_h(y_2-x_1)\,K_h(y_1-x_2)\,K_h(y_2-x_2)\,w(x_1)\,w(x_2)}
{\sigma_K^{16}(\mu_K-1)^4\,n^2h^8\,p(x_1)\,p(x_2)}
\Big\{\frac{(y_{1\alpha}-x_{1\alpha})^2}{h^2}-\sigma_K^2\Big\}\\
&\qquad\times\Big\{\frac{(y_{2\beta}-x_{1\beta})^2}{h^2}-\sigma_K^2\Big\}
\Big\{\frac{(y_{1\alpha}-x_{2\alpha})^2}{h^2}-\sigma_K^2\Big\}
\Big\{\frac{(y_{2\beta}-x_{2\beta})^2}{h^2}-\sigma_K^2\Big\}\\
&\qquad\times g^2(y_1)\,g^2(y_2)\,p(y_1)\,p(y_2)\,dx_1\,dx_2\,dy_1\,dy_2.
\end{aligned}
\]
This, after the change of variables $y_1-x_1=hu_1$ and $y_2-x_2=hu_2$, becomes
\[
\begin{aligned}
\frac{32}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^8}\int\cdots\int&
K(u_1)\,K_h(x_2-x_1+hu_2)\,K_h(x_1-x_2+hu_1)\,K(u_2)\,\frac{w(x_1)\,w(x_2)}{p(x_1)\,p(x_2)}\,(u_{1\alpha}^2-\sigma_K^2)\\
&\times\Big\{\frac{(x_{2\beta}-x_{1\beta}+hu_{2\beta})^2}{h^2}-\sigma_K^2\Big\}
\Big\{\frac{(x_{1\alpha}-x_{2\alpha}+hu_{1\alpha})^2}{h^2}-\sigma_K^2\Big\}(u_{2\beta}^2-\sigma_K^2)\\
&\times g^2(x_1+hu_1)\,g^2(x_2+hu_2)\,p(x_1+hu_1)\,p(x_2+hu_2)\,dx_1\,dx_2\,du_1\,du_2.
\end{aligned}
\]

With another change of variables x2 À x1 ˆ ht1 , this becomes
\[
\begin{aligned}
\frac{32}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^{d+8}}\int\cdots\int&
K(u_1)\,K(t_1+u_2)\,K(-t_1+u_1)\,K(u_2)\,\frac{w(x_1)\,w(x_1+ht_1)}{p(x_1)\,p(x_1+ht_1)}\,(u_{1\alpha}^2-\sigma_K^2)\\
&\times\{(t_{1\beta}+u_{2\beta})^2-\sigma_K^2\}\{(-t_{1\alpha}+u_{1\alpha})^2-\sigma_K^2\}(u_{2\beta}^2-\sigma_K^2)\\
&\times g^2(x_1+hu_1)\,g^2(x_1+ht_1+hu_2)\,p(x_1+hu_1)\,p(x_1+ht_1+hu_2)\,dx_1\,dt_1\,du_1\,du_2\\
=\frac{32\int g^4(x_1)\,w^2(x_1)\,dx_1}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^{d+8}}&
\int\!\!\int\!\!\int K(u_1)\,(u_{1\alpha}^2-\sigma_K^2)\,K(-t_1+u_1)\{(-t_{1\alpha}+u_{1\alpha})^2-\sigma_K^2\}\,K(u_2)\,(u_{2\beta}^2-\sigma_K^2)\\
&\times K(t_1+u_2)\{(t_{1\beta}+u_{2\beta})^2-\sigma_K^2\}\,dt_1\,du_1\,du_2\,\{1+o(1)\}\\
=\frac{32\int g^4(x_1)\,w^2(x_1)\,dx_1}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^{d+8}}&
\int dt_1\Big\{\int K(u_1)\,K(u_1-t_1)\,(u_{1\alpha}^2-\sigma_K^2)\{(u_{1\alpha}-t_{1\alpha})^2-\sigma_K^2\}\,du_1\Big\}\\
&\times\Big\{\int K(u_2)\,K(u_2+t_1)\,(u_{2\beta}^2-\sigma_K^2)\{(u_{2\beta}+t_{1\beta})^2-\sigma_K^2\}\,du_2\Big\}\{1+o(1)\}\\
=\frac{32\int g^4(x)\,w^2(x)\,dx}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^{d+8}}&
\int F_\alpha(t)\,F_\beta(t)\,dt\,\{1+o(1)\},
\end{aligned}
\]
where $F_\alpha(t)$ is as given in theorem 4. Hence
\[
\sum_{1\le i<j\le n}2H_{\alpha\beta}(X_i,X_j)\,\varepsilon_i\varepsilon_j
\]
has mean 0 and asymptotic variance
\[
\frac{32\int g^4(x)\,w^2(x)\,dx}{\sigma_K^{16}(\mu_K-1)^4\,n^2h^{d+8}}\int F_\alpha(t)\,F_\beta(t)\,dt.
\]
This, plus equations (B.10) and (B.11), completes the proof of theorem 4.
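The factor $\int F_\alpha(t)\,F_\beta(t)\,dt$ can be evaluated numerically. As a one-dimensional sketch with the Gaussian kernel ($\sigma_K^2=1$), taking $F(t)=\int K(u)\,K(u-t)\,(u^2-\sigma_K^2)\{(u-t)^2-\sigma_K^2\}\,du$ as read off the derivation above (the exact multivariate form is the one stated in theorem 4):

```python
import numpy as np

# 1-D numerical sketch of F_alpha(t) for the standard Gaussian kernel K,
# assuming F(t) = int K(u) K(u - t) (u^2 - 1) {(u - t)^2 - 1} du.
u, du = np.linspace(-12, 12, 48001, retstep=True)
K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def F(t):
    Kt = np.exp(-(u - t)**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(K * Kt * (u**2 - 1) * ((u - t)**2 - 1)) * du

# F is even in t, and F(0) = int K(u)^2 (u^2 - 1)^2 du = 3 / (8 sqrt(pi))
print(F(0.0), 3 / (8 * np.sqrt(np.pi)))
print(F(1.3), F(-1.3))
```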

References

Cheng, M. Y. (1997) A bandwidth selector for local linear density estimators. Ann. Statist., 25, 1001–1013.
Fan, J. and Gijbels, I. (1995) Adaptive order polynomial fitting: bandwidth robustification and bias reduction. J. Comput. Graph. Statist., 4, 213–227.
— (1996) Local Polynomial Modelling and Its Applications. London: Chapman and Hall.
Grund, B. and Polzehl, J. (1998) Bias corrected bootstrap bandwidth selection. J. Nonparam. Statist., 8, 97–126.
Hall, P. (1984) Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multiv. Anal., 14, 1–16.
Härdle, W. and Marron, J. S. (1995) Fast and simple smoothing parameter selection. Comput. Statist. Data Anal., 20,
Härdle, W., Tsybakov, A. B. and Yang, L. (1998) Nonparametric vector autoregression. J. Statist. Planng Inf., 68,
Herrmann, E., Wand, M. P., Engel, J. and Gasser, Th. (1995) A bandwidth selector for bivariate kernel regression. J. R. Statist. Soc. B, 57, 171–180.
Jones, M. C., Marron, J. S. and Sheather, S. J. (1996a) A brief survey of bandwidth selection for density estimation. J. Am. Statist. Ass., 91, 401–407.
— (1996b) Progress in data-based bandwidth selection for kernel density estimation. Comput. Statist., 11, 337–381.
Lütkepohl, H. (1996) Handbook of Matrices. Chichester: Wiley.
Ruppert, D. (1997) Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Am. Statist. Ass., 92, 1049–1062.
Ruppert, D., Sheather, S. J. and Wand, M. P. (1995) An effective bandwidth selector for local least squares regression. J. Am. Statist. Ass., 90, 1257–1270.
Ruppert, D. and Wand, M. P. (1994) Multivariate locally weighted least squares regression. Ann. Statist., 22, 1346–1370.
Sain, S. R., Baggerly, K. A. and Scott, D. W. (1994) Cross-validation of multivariate densities. J. Am. Statist. Ass., 89, 807–817.
Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Tschernig, R. and Yang, L. (1999) Nonparametric lag selection for time series. J. Time Ser. Anal., to be published.
Wand, M. P. and Jones, M. C. (1993) Comparison of smoothing parametrizations in bivariate kernel density estimation. J. Am. Statist. Ass., 88, 520–528.
— (1994) Multivariate plug-in bandwidth selection. Comput. Statist., 9, 97–117.
— (1995) Kernel Smoothing. London: Chapman and Hall.
