Discrete local scoring rules

   Matthew Parry
   Joint work with Philip Dawid and Steffen Lauritzen

   Statistical Laboratory, University of Cambridge
   Department of Statistics, University of Oxford

Question

   Can we extend the idea of locality to discrete outcome spaces?

   Yes.

   And the normalization remains irrelevant!


Outline
   1. Discrete scoring rules
      Example: Log score and Bregman score
      Applications: University exams and non-parametric tests
   2. Some geometrical considerations
      Metrics
      Open questions
   3. Discrete local scoring rules
      Nearest neighbour scoring rules
      General pairwise scoring rules
      The metric tensor
      Examples: Bernoulli trials and Poisson distribution
   4. Applications and current work
      Ising model
      Pseudolikelihood
      Sequential prediction

1. Discrete scoring rules
Discrete scoring rules

   Let $X$ be a random variable taking values in a discrete outcome space $\mathcal{X}$.

       A scoring rule $S_x(Q)$ is a loss function associated with choosing the distribution $Q$ to represent our uncertainty about $X$.

       If $P$ is the true distribution, the expected score is
           $$S(P, Q) = E_P[S_x(Q)] = \sum_x p_x S_x(Q),$$
       where $p_x$ is the probability that event $x$ is observed.

       Note that only the outcome space is discrete; the $q_x$ will typically depend on continuous parameters.

Two key requirements

   The requirements for a discrete scoring rule are the same as for any scoring rule:

       $S(P, Q)$ is affine in $P$. (True by construction here.)

       $S(P, Q)$ is proper: for all $P$, it is minimised in $Q$ at $Q = P$. A scoring rule is said to be strictly proper if
           $$S(P, Q) > S(P, P)$$
       for $Q \neq P$.

       In other words, it does not pay to be dishonest. If you know $P$, you ought to quote it for $Q$.
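The propriety requirement above can be checked numerically for the log score $S_x(Q) = -\ln q_x$. The following is a minimal sketch, not part of the original slides; the distribution `p` and the alternative quotes are arbitrary illustrative values.

```python
import math

def log_score(x, q):
    """Log score S_x(Q) = -ln q_x for a distribution q given as a list."""
    return -math.log(q[x])

def expected_score(p, q, score):
    """Expected score S(P, Q) = sum_x p_x * S_x(Q)."""
    return sum(px * score(x, q) for x, px in enumerate(p))

# Assumed "true" distribution P (illustrative values only).
p = [0.2, 0.5, 0.3]
honest = expected_score(p, p, log_score)

# A few arbitrary quotes Q different from P.
alternatives = [[0.3, 0.4, 0.3], [0.1, 0.6, 0.3], [1 / 3, 1 / 3, 1 / 3]]
gaps = [expected_score(p, q, log_score) - honest for q in alternatives]

# Each gap is S(P, Q) - S(P, P); strict propriety says it is positive.
print(gaps)
```

Every entry of `gaps` should come out positive, which is the statement that quoting anything other than $P$ costs expected score.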


Inference

   In the case of a one-parameter model, i.e. $Q = Q_\theta$, the estimator is simply
       $$\hat\theta = \arg\min_\theta S_x(Q_\theta).$$

   It is a nice property of scoring rules that the estimating equation
       $$e(x, \theta) \equiv \frac{\partial S_x(Q_\theta)}{\partial \theta} = 0$$
   is unbiased.

The logarithmic scoring rule

   The simplest and best example of a discrete scoring rule is the logarithmic scoring rule or log score:
       $$S_x(Q) = -\ln q_x.$$

       It is the only scoring rule that depends on $Q$ solely at the observed outcome $x$.

       It is extensive: if $q_x = q^{(1)}_{x_1} q^{(2)}_{x_2}$ then $S_x(Q) = S_{x_1}(Q^{(1)}) + S_{x_2}(Q^{(2)})$.

       Leads to maximum likelihood estimation.
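To illustrate how minimising the empirical log score leads to maximum likelihood, here is a small sketch. The geometric model $q_x(\theta) = (1-\theta)\theta^x$, the data and the ternary-search minimiser are all assumptions made for this example; none of them comes from the slides.

```python
import math

def empirical_log_score(theta, data):
    """Mean of -ln q_x(theta) for the assumed model q_x = (1 - theta) * theta**x."""
    return -sum(math.log(1 - theta) + x * math.log(theta) for x in data) / len(data)

def argmin_ternary(f, lo, hi, iters=200):
    """Minimise a convex function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Made-up observations on {0, 1, 2, ...}.
data = [0, 2, 1, 0, 3, 1, 0, 0, 4, 1]

theta_hat = argmin_ternary(lambda t: empirical_log_score(t, data), 1e-9, 1 - 1e-9)

# Closed-form maximum likelihood estimate for the geometric model.
xbar = sum(data) / len(data)
theta_ml = xbar / (1 + xbar)

print(theta_hat, theta_ml)
```

The two printed values should agree to numerical precision, since the empirical log score is the negative log-likelihood divided by the sample size.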
Separable Bregman scoring rules

   In fact all separable Bregman scoring rules are also discrete scoring rules:
       $$S_x(Q) = -\psi_x'(q_x) - \sum_y \left[ \psi_y(q_y) - q_y \psi_y'(q_y) \right],$$
   where $\psi_x(\cdot)$ is convex for all $x$.

       Not extensive.

Example of the log score in testing

   The student is given a statement and asked to assign a degree of confidence $\gamma \in [0, 1]$ in the statement being true. In other words $\mathcal{X} = \{T, F\} = \{1, 0\}$ and
       $$q_x = \begin{cases} 1 - \gamma, & x = 0 \\ \gamma, & x = 1 \end{cases}.$$

   The student's scoring rule for the question is $+\ln q_x$. The student's mark on the exam is the empirical score.
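The separable Bregman form can be made concrete with a short sketch. It reads the formula as $S_x(Q) = -\psi'(q_x) - \sum_y [\psi(q_y) - q_y\psi'(q_y)]$ and, as a simplifying assumption, uses the same $\psi$ for every outcome. The two choices of $\psi$ below are illustrative: $\psi(t) = t\ln t$ should reproduce the log score, and $\psi(t) = t^2$ gives a quadratic score.

```python
import math

def bregman_score(x, q, psi, dpsi):
    """Separable Bregman score with one convex psi shared by all outcomes (assumption)."""
    return -dpsi(q[x]) - sum(psi(qy) - qy * dpsi(qy) for qy in q)

def xlogx(t):
    return t * math.log(t)

def d_xlogx(t):
    return math.log(t) + 1.0

def square(t):
    return t * t

def d_square(t):
    return 2.0 * t

q = [0.2, 0.5, 0.3]  # illustrative quoted distribution

log_like = bregman_score(1, q, xlogx, d_xlogx)    # psi(t) = t ln t
quadratic = bregman_score(1, q, square, d_square)  # psi(t) = t^2

print(log_like, -math.log(q[1]))
print(quadratic, -2 * q[1] + sum(qy * qy for qy in q))
```

With $\psi(t) = t\ln t$ the sum over $y$ collapses to $-\sum_y q_y = -1$, which cancels the constant in $\psi'$, so the score reduces to $-\ln q_x$ whenever $Q$ is normalised.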


Example of log score in testing (2)

       This is called the Certainty-based Marking scheme for testing knowledge.

       In practice, the student assigns "low", "medium" or "high" confidence to what they think is the correct answer. An approximate log score is

       answer \ confidence    low    medium    high    no response
       correct                  1         2       3              0
       incorrect                0        -2      -6              0

       Levels of confidence correspond to less than 67%, between 67% and 80%, and greater than 80%.

The divergence and non-parametric tests

   The divergence or discrepancy is
       $$d(P, Q) = S(P, Q) - S(P, P) \geq 0.$$

       Let $p_x$ be the observed frequency and $q_x$ the expected frequency of $x \in \mathcal{X} = \{1, \ldots, K\}$ in $n$ independent trials. Then
           $$\lambda^2 = 2n\, d(P, Q) = 2n \sum_x p_x \left( S_x(Q) - S_x(P) \right)$$
       is a non-parametric test statistic for the hypothesis that $P = Q$.
The divergence and non-parametric tests (2)

   Analysis of the probability distribution for $\lambda^2$ typically requires expanding $d(P, Q)$ in $P$ about $P = Q$. Asymptotic results in $n$ follow...

       The log score gives the G-test, which is very close to the Pearson chi-squared test:
           $$\lambda^2 \sim \chi^2_{K-1}.$$

       The other separable scores are too difficult to analyze!

But not the $\chi^2$-test

       The Pearson chi-squared test statistic is
           $$\chi^2 = n \sum_x \frac{(p_x - q_x)^2}{q_x}.$$

       Since there are terms in $q_x$ not linear in $p_x$, it cannot be obtained from a scoring rule.
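The relationship between the log-score statistic $\lambda^2 = 2n\,d(P,Q)$ and the Pearson statistic can be seen on a small example. This is a sketch only; the counts and the uniform null hypothesis are made up, and the function names are hypothetical.

```python
import math

def lambda2_log_score(counts, q):
    """2n * d(P, Q) for the log score, with p_x = counts[x] / n (the G statistic)."""
    n = sum(counts)
    return 2 * sum(c * math.log((c / n) / qx) for c, qx in zip(counts, q) if c > 0)

def pearson_chi2(counts, q):
    """n * sum_x (p_x - q_x)^2 / q_x, written in terms of counts."""
    n = sum(counts)
    return sum((c - n * qx) ** 2 / (n * qx) for c, qx in zip(counts, q))

counts = [18, 22, 20, 25, 15]   # made-up observed counts, n = 100
q = [0.2] * 5                   # assumed null hypothesis: uniform on K = 5 outcomes

g_stat = lambda2_log_score(counts, q)
chi2_stat = pearson_chi2(counts, q)
print(g_stat, chi2_stat)
```

The two statistics differ only beyond second order in $p_x - q_x$, which is why they are close when the observed frequencies are near the expected ones.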


2. Geometry

Some geometrical considerations

   For discrete outcome spaces, the distributions $P$ and $P + \delta P$ are separated by the infinitesimal squared distance
       $$(ds)^2 \equiv ds^2 = 2\, d(P, P + \delta P)$$
       $$= \sum_x \psi_x''(p_x)\, (\delta p_x)^2 \qquad \text{(Bregman score)}$$
       $$= \sum_x \frac{(\delta p_x)^2}{p_x} \qquad \text{(Log score)}$$
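For the log score the divergence $d(P, P+\delta P)$ is the Kullback-Leibler divergence, so the quadratic form $\sum_x (\delta p_x)^2 / p_x$ can be compared with $2\,d(P, P+\delta P)$ directly. The sketch below uses an arbitrary distribution and a small perturbation that sums to zero; both are illustrative assumptions.

```python
import math

def kl(p, q):
    """d(P, Q) for the log score: sum_x p_x ln(p_x / q_x)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

p = [0.2, 0.5, 0.3]            # illustrative distribution
dp = [1e-3, -2e-3, 1e-3]       # small perturbation, sums to zero
p_shift = [a + b for a, b in zip(p, dp)]

two_d = 2 * kl(p, p_shift)                          # 2 d(P, P + dP)
ds2 = sum(d * d / px for d, px in zip(dp, p))       # sum_x (dp_x)^2 / p_x

print(two_d, ds2)
```

The two numbers agree up to terms of third order in the perturbation, so shrinking `dp` further tightens the match.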
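Metrics

   Typically, probability distributions depend on model parameters $\{\theta_i\}$. Then $\delta p_x(\theta) = \frac{\partial p_x(\theta)}{\partial \theta_i}\, d\theta_i$ and we obtain the metric tensor on the manifold of models:
       $$g_{ij} = \sum_x \psi_x''(p_x)\, \frac{\partial p_x(\theta)}{\partial \theta_i} \frac{\partial p_x(\theta)}{\partial \theta_j} \qquad \text{(Bregman score)}$$
       $$= \sum_x p_x\, \frac{\partial \ln p_x(\theta)}{\partial \theta_i} \frac{\partial \ln p_x(\theta)}{\partial \theta_j} \qquad \text{(Log score} \to \text{Fisher metric)}$$

Open questions about metrics

       What is the role of non-Fisherian metrics?
           Generalized Cramér-Rao inequality?
           Specialized Chentsov's theorem?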
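As a sanity check on the log-score (Fisher) metric, the sum $\sum_x p_x (\partial_\theta \ln p_x)^2$ can be evaluated by finite differences for a one-parameter model. The binomial model with two trials is an assumption chosen for this sketch because its Fisher information, $2/(\theta(1-\theta))$, is known in closed form.

```python
import math

def fisher_metric(model, theta, h=1e-5):
    """g = sum_x p_x (d ln p_x / d theta)^2, using central differences."""
    p = model(theta)
    up = model(theta + h)
    down = model(theta - h)
    return sum(
        px * ((math.log(a) - math.log(b)) / (2 * h)) ** 2
        for px, a, b in zip(p, up, down)
    )

def binomial2(theta):
    """Assumed example model: number of successes in two Bernoulli(theta) trials."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

theta = 0.3
g_numeric = fisher_metric(binomial2, theta)
g_exact = 2 / (theta * (1 - theta))
print(g_numeric, g_exact)
```

The same function works for any one-parameter model that returns strictly positive probabilities; replacing the squared log-derivative by $\psi_x''(p_x)(\partial_\theta p_x)^2$ would give the Bregman metric instead.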


Efficiency

   The Fisher metric plays two statistical roles. The first is in the Cramér-Rao inequality. The mean squared error of an unbiased estimator $\hat\theta_i$ is bounded below by the inverse metric tensor:
       $$g_F^{ij} \leq E\left[ (\hat\theta_i - \theta_i)(\hat\theta_j - \theta_j) \right].$$

       $\hat\theta_i$ is efficient when it saturates this bound.

       Less efficient $\iff$ different metric bound?

Sufficiency

   The second role of the Fisher metric is in Chentsov's theorem [Chentsov (1972)].

   Theorem
   Up to a constant multiplicative factor, the only metric that is invariant with respect to sufficient reduction of the data on all statistical manifolds is the Fisher metric $g^F_{ij}$.

       This justifies calling the Fisher metric the information; sufficient statistics retain the necessary information of the data.

       On a particular manifold, can there be more invariant metrics?
3. Discrete local scoring rules

Discrete local scoring rules

   To start with, we take $\mathcal{X}$ to be an ordered, nowhere dense set; without loss of generality, a subset of the integers. Consider the discrete version of an $m = 2$ local scoring rule:
       $$S_x(Q) = s_x(q_{x-1}, q_x, q_{x+1}).$$
   This is a nearest neighbour scoring rule.

   Figure: The score $S_x(Q)$ at the point $x$, which involves the neighbouring points $x-1$ and $x+1$.


Nearest neighbour rule

   We need to ensure
       $$S(P, Q) = \sum_x p_x\, s_x(q_{x-1}, q_x, q_{x+1})$$
   is minimised in $Q$ at $Q = P$ for all $P$.

       The necessary form of the scoring rule is
           $$s_x = -\lambda \ln q_x + L_{x,x+1}\!\left( \frac{q_{x+1}}{q_x} \right) + L_{x,x-1}\!\left( \frac{q_{x-1}}{q_x} \right),$$
       with the constraint $L_{x,x+1}'(\xi) = \xi^{-1} L_{x+1,x}'(\xi^{-1})$.

       With $\lambda = 0$, the scoring rule depends only on probability ratios, i.e. it is independent of the normalization.

       There can be no "one-sided" rules; boundary terms are an issue.

Nearest neighbour rule (2)

       The constraint is satisfied if we use a generating function $\phi_{x,x+1}(\xi)$:
           $$L_{x,x+1}(\xi) = -\phi_{x,x+1}(\xi) + \xi\, \phi_{x,x+1}'(\xi), \qquad L_{x+1,x}(\xi^{-1}) = -\phi_{x,x+1}'(\xi).$$

       The resulting scoring rule is (strictly) proper if and only if $\phi_{x,x+1}(\xi)$ is (strictly) convex in $\xi$ for all $x$.
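The role of convexity can be seen by isolating a single link. The short derivation below is a sketch that reads the link functions as $L_{x,x+1}(\xi) = -\phi(\xi) + \xi\phi'(\xi)$ and $L_{x+1,x}(\xi^{-1}) = -\phi'(\xi)$, with $\phi = \phi_{x,x+1}$; the shorthand $\eta = q_{x+1}/q_x$ and $\zeta = p_{x+1}/p_x$ is introduced here for convenience.

```latex
% Contribution of the link (x, x+1) to the expected score S(P, Q),
% with \eta = q_{x+1}/q_x and \zeta = p_{x+1}/p_x:
p_x L_{x,x+1}(\eta) + p_{x+1} L_{x+1,x}(\eta^{-1})
  = p_x \left[ -\phi(\eta) + \eta\,\phi'(\eta) \right] - p_{x+1}\,\phi'(\eta)
  = p_x \left[ -\phi(\eta) + (\eta - \zeta)\,\phi'(\eta) \right].

% Derivative with respect to the quoted ratio \eta:
\frac{\partial}{\partial \eta}\; p_x \left[ -\phi(\eta) + (\eta - \zeta)\,\phi'(\eta) \right]
  = p_x\,(\eta - \zeta)\,\phi''(\eta).
```

The derivative vanishes at $\eta = \zeta$, i.e. when the quoted ratio equals the true ratio, and this stationary point is a minimum when $\phi'' > 0$. Summing over links gives propriety of the whole rule.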
Continuum limit

   Let
       $$u_x = \ln \frac{q_{x+1}}{q_x} = \ln q_{x+1} - \ln q_x \approx \frac{d}{dx} \ln q(x).$$

   And define the functions $F_x(\cdot) = \phi_{x,x+1}(\exp(\cdot))$. Then
       $$S_x(Q) = \underbrace{F_x'(u_x) - F_{x-1}'(u_{x-1})}_{\frac{d}{dx}\frac{\partial F}{\partial u}} + \underbrace{(1 - e^{-u_{x-1}})\, F_{x-1}'(u_{x-1})}_{u\, \frac{\partial F}{\partial u}} - \underbrace{F_x(u_x)}_{F(x, u(x))}.$$

       Same general form as the continuous $m = 2$ local scoring rule.

Scoring rule on an undirected graph

   The generalization to undirected graphs is helpful and solves problems at the boundaries. We look for scoring rules with a pair-wise form:
       $$S_x(Q) = \sum_{y \in C_x} L_{xy}\!\left( \frac{q_y}{q_x} \right),$$
   where $C_x$ is the connection set of the point $x$ and we call $L_{xy}(\xi)$ a link function.

   Figure: The score at $x$ in terms of the link functions $L_{xy}$ (nodes $x$ and $y$ joined by a link carrying $L_{xy}$ and $L_{yx}$).


Scoring rule on an undirected graph (2)

       Emphasis is now on links not nodes.

       $C_x \to \mathcal{X}$ if we put $L_{xy} = L_{yx} = 0$ when points $x$ and $y$ are not connected. In particular, $L_{xx} = 0$. This means we can handle missing data.

       Generalizes nearest neighbour results. The allowed link functions are given in terms of a generating function $\phi_{xy}(\xi)$:
           $$L_{xy}(\xi) = -\phi_{xy}(\xi) + \xi\, \phi_{xy}'(\xi), \qquad L_{yx}(\xi^{-1}) = -\phi_{xy}'(\xi).$$

       The resulting scoring rule is (strictly) proper if and only if $\phi_{xy}(\xi)$ is (strictly) convex in $\xi$ for all $x, y$.

The metric tensor

   In terms of the probability ratios $\eta_{xy} = q_y / q_x$ and $\zeta_{xy} = p_y / p_x$, the divergence is
       $$d(P, Q) = \frac{1}{2} \sum_{x, y \in \mathcal{X}} p_x \left[ \phi_{xy}(\zeta_{xy}) - \phi_{xy}(\eta_{xy}) - (\zeta_{xy} - \eta_{xy})\, \phi_{xy}'(\eta_{xy}) \right].$$

   Infinitesimally separated distributions imply the metric tensor
       $$g_{ij} = \frac{1}{2} \sum_{x, y \in \mathcal{X}} p_x\, \phi_{xy}''(\zeta_{xy})\, \zeta_{xy,i}\, \zeta_{xy,j},$$
   where $\zeta_{xy,i} = \frac{\partial \zeta_{xy}}{\partial \theta_i}$.
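The pairwise construction can be tried out on a small path graph $0 - 1 - 2 - 3$. The sketch below reads the link functions as $L_{xy}(\xi) = -\phi(\xi) + \xi\phi'(\xi)$ for the link from $x$ to its right neighbour and $L_{yx}(\xi^{-1}) = -\phi'(\xi)$ for the reverse direction. The generating function $\phi(\xi) = \xi^2/2$, the distributions and the function names are illustrative assumptions.

```python
def link_score(x, q, phi, dphi):
    """Score at x on the path graph 0 - 1 - ... - (K-1); depends only on ratios of q."""
    s = 0.0
    if x + 1 < len(q):
        # Link to the right neighbour: L_{x,x+1}(q_{x+1} / q_x).
        xi = q[x + 1] / q[x]
        s += -phi(xi) + xi * dphi(xi)
    if x >= 1:
        # Link to the left neighbour: L_{x,x-1}(1 / xi) = -phi'(xi), xi = q_x / q_{x-1}.
        xi = q[x] / q[x - 1]
        s += -dphi(xi)
    return s

def expected_score(p, q, phi, dphi):
    """S(P, Q) = sum_x p_x S_x(Q)."""
    return sum(px * link_score(x, q, phi, dphi) for x, px in enumerate(p))

def phi(xi):
    return 0.5 * xi ** 2      # assumed strictly convex generating function

def dphi(xi):
    return xi

p = [0.1, 0.4, 0.3, 0.2]                 # illustrative true distribution
q = [0.25, 0.25, 0.25, 0.25]             # illustrative quoted distribution
q_unnormalised = [7.0 * qx for qx in q]  # same ratios, wrong normalisation

gap = expected_score(p, q, phi, dphi) - expected_score(p, p, phi, dphi)
same = link_score(1, q, phi, dphi) - link_score(1, q_unnormalised, phi, dphi)
print(gap, same)
```

The gap $S(P,Q) - S(P,P)$ is positive, and the score is unchanged when $Q$ is rescaled, because only ratios $q_y/q_x$ ever enter. End points of the path simply have one link instead of two, which is how the graph formulation avoids boundary terms.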
Inference for Bernoulli trials

   We have $\mathcal{X} = \{0, 1\}$ and observations $\{f_0, f_1 = n - f_0\}$.

       The only graph on the outcome space is $0 - 1$.

       Every scoring rule gives the same estimator for the success probability:
           $$\hat\theta = \frac{f_1}{n} = \hat\theta_{ML}.$$

Inference for the Poisson distribution

   We have $\mathcal{X} = \{0, 1, 2, \ldots\}$ and observations $\{f_x\}$. The Poisson model is
       $$q_x(\lambda) = \frac{e^{-\lambda} \lambda^x}{x!}.$$

       Choose the undirected graph $0 - 1 - 2 - \ldots$

       Choose the generating function
           $$\phi_{x,x+1}(\xi) = \frac{(x+1)^a}{m(m+1)}\, \xi^{m+1}.$$

       Then the non-zero link functions are
           $$L_{x,x+1}(\xi) = \frac{(x+1)^a}{m+1}\, \xi^{m+1}, \qquad L_{x+1,x}(\xi) = -\frac{(x+1)^a}{m}\, \xi^{-m}.$$


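Inference for the Poisson distribution (2)

       The empirical scoring rule is
           $$S(P_n, Q) = f_0 L_{01}(\lambda) + \sum_{x=1}^{\infty} f_x \left[ L_{x,x-1}\!\left( \frac{x}{\lambda} \right) + L_{x,x+1}\!\left( \frac{\lambda}{x+1} \right) \right].$$

       The resulting estimator is
           $$\hat\lambda = \frac{\sum_{x=1}^{\infty} x^{-(m-a)} f_x}{\sum_{x=0}^{\infty} (x+1)^{-(m-a+1)} f_x} \;\overset{m = a-1}{=}\; \frac{1}{n} \sum_{x=0}^{\infty} x f_x = \bar{x} = \hat\lambda_{ML}.$$

4. Applications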
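The Poisson estimator above is a ratio of two weighted sums of the observed counts, so it is easy to evaluate. The following sketch uses made-up counts and a hypothetical function name; it assumes $m \neq 0$ and $m \neq -1$ so that the generating function is defined.

```python
def local_poisson_estimate(freqs, m, a):
    """Estimator of lambda from counts freqs[x] = number of times x was observed.

    Ratio of sum_{x>=1} x^{-(m-a)} f_x to sum_{x>=0} (x+1)^{-(m-a+1)} f_x.
    """
    num = sum(f * float(x) ** (-(m - a)) for x, f in enumerate(freqs) if x >= 1)
    den = sum(f * float(x + 1) ** (-(m - a + 1)) for x, f in enumerate(freqs))
    return num / den

# Made-up counts for x = 0, 1, 2, 3, 4 (n = 15 observations).
freqs = [3, 5, 4, 2, 1]

n = sum(freqs)
sample_mean = sum(x * f for x, f in enumerate(freqs)) / n

# m = a - 1 should reproduce the sample mean, i.e. the maximum likelihood estimate.
lam_ml_like = local_poisson_estimate(freqs, m=1, a=2)

# Another choice of (m, a) gives a different, normalisation-free estimator.
lam_other = local_poisson_estimate(freqs, m=2, a=2)

print(sample_mean, lam_ml_like, lam_other)
```

Choosing $m = a - 1$ makes the numerator $\sum_x x f_x$ and the denominator $n$, which recovers $\bar{x}$; other choices reweight the counts.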
Ising model

   Let $\mathcal{X}$ be the possible configurations of spins (taken to be $\pm 1$) on a lattice. Then
       $$q_x = \frac{1}{Z(\theta)}\, e^{\theta H_x},$$
   where $H_x$ is the Hamiltonian/energy of configuration $x$. The partition function $Z(\theta)$ contains all information about the macroscopic physics.

       Can we learn anything about the physics, e.g. phase transitions, without knowing $Z(\theta)$?

Ising model (2)

       Physics: want to know the expectation parameter $U \equiv \frac{\partial \ln Z}{\partial \theta}$ and
           $$V[H - U] = \frac{\partial^2 \ln Z}{\partial \theta^2} \equiv g_{\theta\theta}.$$
       The metric diverges, is discontinuous or has a kink at a phase transition.

       Statistics: want to estimate $\theta$. Suppose $e(x, \theta) = 0$ is an unbiased estimating equation; then
           $$\frac{V[e(x, \theta)]}{\left( E\left[ \frac{\partial e(x, \theta)}{\partial \theta} \right] \right)^2} \geq g^{\theta\theta}.$$
       Knowing the LHS may shed light on the RHS.
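The identity $V[H - U] = \partial^2 \ln Z / \partial\theta^2$ can be verified by brute force on a system small enough to enumerate. The sketch below assumes an open chain of four spins with $H_x = \sum_i x_i x_{i+1}$; the chain, its size and the value of $\theta$ are illustrative choices, not taken from the slides.

```python
import itertools
import math

def energy(x):
    """Assumed Hamiltonian: sum of nearest-neighbour products on an open chain."""
    return sum(a * b for a, b in zip(x, x[1:]))

configs = list(itertools.product([-1, 1], repeat=4))

def log_Z(theta):
    """ln Z(theta) by explicit enumeration of all configurations."""
    return math.log(sum(math.exp(theta * energy(x)) for x in configs))

def var_H(theta):
    """Variance of H under q_x = exp(theta * H_x) / Z(theta)."""
    Z = math.exp(log_Z(theta))
    probs = [math.exp(theta * energy(x)) / Z for x in configs]
    U = sum(p * energy(x) for p, x in zip(probs, configs))
    return sum(p * (energy(x) - U) ** 2 for p, x in zip(probs, configs))

theta, h = 0.4, 1e-4
# Second derivative of ln Z by a central second difference.
g_numeric = (log_Z(theta + h) - 2 * log_Z(theta) + log_Z(theta - h)) / h ** 2

print(g_numeric, var_H(theta))
```

For this open chain $\ln Z = \ln 2 + 3\ln(2\cosh\theta)$, so both numbers should be close to $3/\cosh^2\theta$. On a realistic lattice the enumeration is infeasible, which is the motivation for methods that avoid $Z(\theta)$.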


Ising model (3)

   Let $x = \{x^i\}$ be a configuration in $\mathcal{X}$. Need to choose an undirected graph on $\mathcal{X}$.

    1. Single site flips: $C_x = C_x[k]$, the set of configurations that differ from $x$ only at site $k$.
    2. Complete graph: $C_x = \mathcal{X}$.

   Next choose generating functions...

Pseudolikelihood

   The pseudolikelihood is defined via
       $$-\ln PL(X = x) = -\sum_k \ln \Pr(X^k = x^k \mid X^{\setminus k} = x^{\setminus k})$$
       $$= -\sum_k \ln \frac{q_x}{q_{x^{\setminus k}}}$$
       $$= \sum_k \ln \frac{\sum_{x^k} q_x}{q_x}$$
       $$= \sum_k \ln \left( 1 + \sum_{y \in C_x[k]} \frac{q_y}{q_x} \right).$$

       Does not depend on the normalization of the model.

       Is it a discrete local scoring rule?

       Yes!
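The last line of the pseudolikelihood derivation uses only ratios $q_y/q_x$, so it can be evaluated without the partition function. The sketch below compares that form with a brute-force computation from normalised probabilities, for single-site flips on an assumed open Ising chain of four spins with $H_x = \sum_i x_i x_{i+1}$; all names and values are illustrative.

```python
import itertools
import math

def energy(x):
    """Assumed Hamiltonian: sum of nearest-neighbour products on an open chain."""
    return sum(a * b for a, b in zip(x, x[1:]))

def flip(x, k):
    """Configuration that differs from x only at site k."""
    return x[:k] + (-x[k],) + x[k + 1:]

def neg_log_pl(x, theta):
    """-ln PL via sum_k ln(1 + q_y / q_x); the normalisation Z cancels in the ratio."""
    total = 0.0
    for k in range(len(x)):
        ratio = math.exp(theta * (energy(flip(x, k)) - energy(x)))
        total += math.log(1.0 + ratio)
    return total

def neg_log_pl_bruteforce(x, theta):
    """Same quantity from fully normalised probabilities and explicit conditionals."""
    configs = list(itertools.product([-1, 1], repeat=len(x)))
    Z = sum(math.exp(theta * energy(c)) for c in configs)
    q = {c: math.exp(theta * energy(c)) / Z for c in configs}
    total = 0.0
    for k in range(len(x)):
        y = flip(x, k)
        total -= math.log(q[x] / (q[x] + q[y]))  # Pr(X^k = x^k | rest)
    return total

x = (1, 1, -1, 1)
theta = 0.4
print(neg_log_pl(x, theta), neg_log_pl_bruteforce(x, theta))
```

Only `neg_log_pl_bruteforce` needs $Z(\theta)$; the ratio form scales linearly in the number of sites.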
Pseudolikelihood (2)

       Pseudolikelihood is a sum of (possibly different) $n$-wise discrete local scoring rules.

       Link functions are
           $$L_{x y_1 \cdots y_n} = \ln\left( 1 + \sum_{i=1}^{n} \eta_i \right),$$
       where $\eta_i = q_{y_i} / q_x$.

       The convex generating function that makes everything work is
           $$\phi_{x y_1 \cdots y_n} = -\left( 1 + \sum_{i=1}^{n} \eta_i \right) \ln\left( 1 + \sum_{i=1}^{n} \eta_i \right) + \sum_{i=1}^{n} \eta_i \ln \eta_i.$$

Sequential prediction

   We observe $\{x_1, \ldots, x_n\}$ iid outcomes and wish to make a prediction for $x_{n+1}$. Take the one-parameter case and suppose we are in the model:

       Let $\hat\theta_n$ be a consistent estimator for $\theta$ given the observations, i.e. $\hat\theta_n \to \theta$.

       Quote $\hat{Q}_n \equiv Q_{\hat\theta_n}$ as the distribution for $x_{n+1}$.

       To give a measure of how far $\hat{Q}_n$ is from $P$, we define
           $$d_n = E_\theta\left[ KL(P, \hat{Q}_n) \right].$$

       Desirable result is BIC compatible:
           $$d_n \longrightarrow \frac{1}{2}\, n^{-1}.$$
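That the pseudolikelihood generating function reproduces the link function $\ln(1 + \sum_i \eta_i)$ can be checked directly. The derivation below is a sketch under one assumption not stated on the slides: that the $n$-wise link function is obtained from $\phi$ by the natural generalisation $L = -\phi + \sum_i \eta_i\, \partial\phi/\partial\eta_i$ of the pairwise relation $L(\xi) = -\phi(\xi) + \xi\phi'(\xi)$. The shorthand $\sigma = \sum_i \eta_i$ is introduced here.

```latex
% With \sigma = \sum_{i=1}^{n} \eta_i and
% \phi = -(1 + \sigma)\ln(1 + \sigma) + \sum_i \eta_i \ln \eta_i :
\frac{\partial \phi}{\partial \eta_i}
  = -\ln(1 + \sigma) - 1 + \ln \eta_i + 1
  = \ln \frac{\eta_i}{1 + \sigma},

% so that
-\phi + \sum_i \eta_i \frac{\partial \phi}{\partial \eta_i}
  = (1 + \sigma)\ln(1 + \sigma) - \sum_i \eta_i \ln \eta_i
    + \sum_i \eta_i \ln \eta_i - \sigma \ln(1 + \sigma)
  = \ln(1 + \sigma).

% For n = 1 the second derivative is
\phi''(\eta) = \frac{1}{\eta} - \frac{1}{1 + \eta} = \frac{1}{\eta(1 + \eta)} > 0 .
```

The positive second derivative in the $n = 1$ case confirms convexity there, consistent with the propriety condition for pairwise rules.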


Sequential prediction (2)

   In the minimum description length approach, normalized maximum likelihood gives the optimal sequential prediction:
       $$q(x_{n+1}) = \frac{q(x_{n+1} \mid \hat\theta_{n+1}(x))}{\sum_{y_{n+1}} q(y_{n+1} \mid \hat\theta_{n+1}(y))},$$
   where $\hat\theta_{n+1}$ is the MLE.

       But the denominator is often infinite!

       Idea is to choose a discrete local scoring rule to give a different estimator $\hat\theta$ and a divergence which does not depend on normalization.

Summary

    1. Scoring rules have a rich geometric structure still to be explored.
    2. Discrete local scoring rules are the counterparts of local scoring rules on continuous outcome space. They do not depend on the normalisation of the model.
    3. Current work on Ising models, pseudolikelihood and sequential prediction.

						