# Discrete local scoring rules

Matthew Parry
(joint work with Philip Dawid and Steffen Lauritzen)

Statistical Laboratory, University of Cambridge
Department of Statistics, University of Oxford

## Question

Can we extend the idea of locality to discrete outcome spaces?

Yes. And the normalization remains irrelevant!

## Outline

1. Discrete scoring rules
   - Example: Log score and Bregman score
   - Applications: University exams and non-parametric tests
2. Some geometrical considerations
   - Metrics
   - Open questions
3. Discrete local scoring rules
   - Nearest neighbour scoring rules
   - General pairwise scoring rules
   - The metric tensor
   - Examples: Bernoulli trials and Poisson distribution
4. Applications and current work
   - Ising model
   - Pseudolikelihood
   - Sequential prediction

## 1. Discrete scoring rules
### Discrete scoring rules

Let $X$ be a random variable taking values in a discrete outcome space $\mathcal{X}$.

A scoring rule $S_x(Q)$ is a loss function associated with choosing the distribution $Q$ to represent our uncertainty about $X$.

If $P$ is the true distribution, the expected score is

$$S(P, Q) = E_P[S_x(Q)] = \sum_x p_x S_x(Q),$$

where $p_x$ is the probability that event $x$ is observed.

Note that only the outcome space is discrete; the $q_x$ will typically depend on continuous parameters.

### Two key requirements

The requirements for a discrete scoring rule are the same as for any scoring rule:

- $S(P, Q)$ is affine in $P$. (True by construction here.)
- $S(P, Q)$ is proper: for all $P$, it is minimised in $Q$ at $Q = P$. A scoring rule is said to be strictly proper if
  $$S(P, Q) > S(P, P) \quad \text{for } Q \neq P.$$

In other words, it does not pay to be dishonest: if you know $P$, you ought to quote it for $Q$.
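As a quick numerical sanity check (not from the slides), properness of the log score can be seen directly: the expected score $S(P, Q) = -\sum_x p_x \ln q_x$, scanned over a grid of Bernoulli candidates $Q$, is minimised at $Q = P$. A minimal sketch, with all names my own:

```python
import numpy as np

# Expected score S(P, Q) = sum_x p_x * S_x(Q) for the log score S_x(Q) = -ln q_x.
def expected_log_score(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

# True Bernoulli distribution P and a grid of candidate distributions Q.
p = np.array([0.3, 0.7])
grid = np.linspace(0.01, 0.99, 99)
scores = [expected_log_score(p, [1 - g, g]) for g in grid]

# Properness: the expected score is minimised on the grid at Q = P.
best = grid[int(np.argmin(scores))]
print(best)  # 0.7, i.e. the true success probability
```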

### Inference

In the case of a one-parameter model, i.e. $Q = Q_\theta$, the estimator is simply

$$\hat\theta = \arg\min_\theta S_x(Q_\theta).$$

It is a nice property of scoring rules that the estimating equation

$$e(x, \theta) \equiv \frac{\partial S_x(Q_\theta)}{\partial \theta} = 0$$

is unbiased.

### The logarithmic scoring rule

The simplest and best example of a discrete scoring rule is the logarithmic scoring rule, or log score:

$$S_x(Q) = -\ln q_x.$$

- It is the only scoring rule that depends on $Q$ solely at the observed outcome $x$.
- It is extensive: if $q_x = q^{(1)}_{x_1} q^{(2)}_{x_2}$ then
  $$S_x(Q) = S_{x_1}(Q^{(1)}) + S_{x_2}(Q^{(2)}).$$
### Separable Bregman scoring rules

In fact all separable Bregman scoring rules are also discrete scoring rules:

$$S_x(Q) = -\psi'_x(q_x) - \sum_y \left[ \psi_y(q_y) - q_y \psi'_y(q_y) \right],$$

where $\psi_x(\cdot)$ is convex for all $x$.

- Not extensive.

### Example of the log score in testing

The student is given a statement and asked to assign a degree of confidence $\gamma \in [0, 1]$ in the statement being true. In other words $\mathcal{X} = \{T, F\} = \{1, 0\}$ and

$$q_x = \begin{cases} 1 - \gamma, & x = 0 \\ \gamma, & x = 1. \end{cases}$$

The student's scoring rule for the question is $+\ln q_x$. The student's mark on the exam is the empirical score.

### Example of the log score in testing (2)

This is called the Certainty-Based Marking scheme for testing knowledge.

In practice, the student assigns "low", "medium" or "high" confidence to what they think is the correct answer. An approximate log score is

| answer \ confidence | low | medium | high | no response |
|---|---|---|---|---|
| correct | 1 | 2 | 3 | 0 |
| incorrect | 0 | -2 | -6 | 0 |

Levels of confidence correspond to less than 67%, between 67% and 80%, and greater than 80%.

### The divergence and non-parametric tests

The divergence or discrepancy is

$$d(P, Q) = S(P, Q) - S(P, P) \ge 0.$$

Let $p_x$ be the observed frequency and $q_x$ the expected frequency of $x \in \mathcal{X} = \{1, \ldots, K\}$ in $n$ independent trials. Then

$$\lambda^2 = 2n\, d(P, Q) = 2n \sum_x p_x \left( S_x(Q) - S_x(P) \right)$$

is a non-parametric test statistic for the hypothesis that $P = Q$.
### The divergence and non-parametric tests (2)

Analysis of the probability distribution for $\lambda^2$ typically requires expanding $d(P, Q)$ in $P$ about $P = Q$. Asymptotic-in-$n$ results follow...

- The log score gives the G-test, which is very close to the Pearson chi-squared test:
  $$\lambda^2 \sim \chi^2_{K-1}.$$
- The other separable scores are too difficult to analyze!

### But not the $\chi^2$-test

The Pearson chi-squared test statistic is

$$\chi^2 = n \sum_x \frac{(p_x - q_x)^2}{q_x}.$$

Since there are terms in $q_x$ not linear in $p_x$, it cannot be obtained from a scoring rule.
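The closeness of the two tests is easy to see numerically: for the log score, $d(P, Q)$ is the Kullback-Leibler divergence, so $\lambda^2$ is the G-statistic. A small sketch, with illustrative frequencies of my own choosing:

```python
import numpy as np

# Observed and hypothesised frequencies over K = 3 categories, n trials.
n = 500
p = np.array([0.32, 0.36, 0.32])   # observed relative frequencies
q = np.array([0.30, 0.40, 0.30])   # expected (hypothesised) frequencies

# Log-score divergence d(P, Q) = KL(P, Q), giving the G-statistic 2n * KL.
g_stat = 2 * n * np.sum(p * np.log(p / q))

# Pearson chi-squared statistic for comparison.
chi2 = n * np.sum((p - q) ** 2 / q)

print(g_stat, chi2)  # the two statistics are numerically close
```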

## 2. Geometry

### Some geometrical considerations

For discrete outcome spaces, the distributions $P$ and $P + \delta P$ are separated by the infinitesimal squared distance

$$ds^2 \equiv 2\, d(P, P + \delta P) = \sum_x \psi''_x(p_x)\, (\delta p_x)^2 \quad \text{(Bregman score)}$$

$$\phantom{ds^2} = \sum_x \frac{(\delta p_x)^2}{p_x} \quad \text{(Log score)}$$

### The metric tensor on the manifold of models

Typically, probability distributions depend on model parameters $\{\theta_i\}$. Then $\delta p_x(\theta) = \frac{\partial p_x(\theta)}{\partial \theta_i}\, d\theta_i$ and we obtain the metric tensor on the manifold of models:

$$g_{ij} = \sum_x \psi''_x(p_x)\, \frac{\partial p_x(\theta)}{\partial \theta_i} \frac{\partial p_x(\theta)}{\partial \theta_j} \quad \text{(Bregman score)}$$

$$g_{ij} = \sum_x p_x\, \frac{\partial \ln p_x(\theta)}{\partial \theta_i} \frac{\partial \ln p_x(\theta)}{\partial \theta_j} \quad \text{(Log score $\to$ Fisher metric)}$$

### Open questions

- What is the role of non-Fisherian metrics?
- Generalized Cramér-Rao inequality?
- Specialized Chentsov's theorem?
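The log-score (Fisher) form of the metric tensor is easy to check numerically for a one-parameter model. A sketch for the Bernoulli model, whose Fisher metric is known in closed form to be $1/(\theta(1-\theta))$; the finite-difference scheme and names are my own:

```python
import numpy as np

# Fisher metric g_{θθ} = Σ_x p_x(θ) (∂ ln p_x / ∂θ)^2 for a Bernoulli(θ) model,
# with the derivative taken by central finite differences.
def fisher_metric(theta, h=1e-5):
    p = np.array([1 - theta, theta])
    dlogp = (np.log([1 - (theta + h), theta + h]) -
             np.log([1 - (theta - h), theta - h])) / (2 * h)
    return float(np.sum(p * dlogp ** 2))

theta = 0.3
g = fisher_metric(theta)
print(g)  # close to 1/(θ(1-θ)) = 1/0.21 ≈ 4.7619
```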

### Efficiency

The Fisher metric plays two statistical roles. The first is in the Cramér-Rao inequality: the mean squared error of an unbiased estimator $\hat\theta_i$ is bounded below by the inverse metric tensor,

$$g_F^{ij} \le E\left[ \left( \hat\theta_i - \theta_i \right) \left( \hat\theta_j - \theta_j \right) \right].$$

- $\hat\theta_i$ is efficient when it saturates this bound.
- Less efficient $\iff$ different metric bound?

### Sufficiency

The second role of the Fisher metric is in Chentsov's theorem [Chentsov (1972)].

**Theorem.** Up to a constant multiplicative factor, the only metric that is invariant with respect to sufficient reduction of the data on all statistical manifolds is the Fisher metric $g^F_{ij}$.

- This justifies calling the Fisher metric the information; sufficient statistics retain the necessary information of the data.
- On a particular manifold, can there be more invariant metrics?
## 3. Discrete local scoring rules

### Discrete local scoring rules

To start with, we take $\mathcal{X}$ to be an ordered, nowhere dense set; without loss of generality, a subset of the integers. Consider the discrete version of an $m = 2$ local scoring rule:

$$S_x(Q) = s_x(q_{x-1}, q_x, q_{x+1}).$$

This is a nearest neighbour scoring rule.

*Figure: The score at the point $x$, which depends on $Q$ only through $q_{x-1}$, $q_x$ and $q_{x+1}$.*

### Nearest neighbour rule

We need to ensure that

$$S(P, Q) = \sum_x p_x\, s_x(q_{x-1}, q_x, q_{x+1})$$

is minimised in $Q$ at $Q = P$ for all $P$. The necessary form of the scoring rule is

$$s_x = -\lambda \ln q_x + L_{x,x+1}\!\left(\frac{q_{x+1}}{q_x}\right) + L_{x,x-1}\!\left(\frac{q_{x-1}}{q_x}\right),$$

with the constraint $L'_{x,x+1}(\xi) = \xi^{-1} L'_{x+1,x}(\xi^{-1})$.

- With $\lambda = 0$, the scoring rule depends only on probability ratios, i.e. it is independent of the normalization.
- There can be no "one-sided" rules; boundary terms are an issue.

### Nearest neighbour rule (2)

The constraint is satisfied if we use a generating function $\phi_{x,x+1}(\xi)$:

$$L_{x,x+1}(\xi) = -\phi_{x,x+1}(\xi) + \xi\, \phi'_{x,x+1}(\xi), \qquad L_{x+1,x}(\xi^{-1}) = -\phi'_{x,x+1}(\xi).$$

The resulting scoring rule is (strictly) proper if and only if $\phi_{x,x+1}(\xi)$ is (strictly) convex in $\xi$ for all $x$.
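To make the construction concrete (this toy example is mine, not from the slides), one can pick a particular convex generating function, derive the link functions, and confirm properness by brute force on a three-point chain. With $\phi(\xi) = \xi^2/2$, the links become $L_{x,x+1}(\xi) = \xi^2/2$ and $L_{x,x-1}(q_{x-1}/q_x) = -q_x/q_{x-1}$:

```python
import numpy as np

# Pairwise score on the chain 0-1-2 built from φ(ξ) = ξ²/2 (convex), giving
# link functions L_{x,x+1}(ξ) = -φ(ξ) + ξ φ'(ξ) = ξ²/2 and
# L_{x,x-1}(q_{x-1}/q_x) = -φ'(q_x/q_{x-1}) = -q_x/q_{x-1}.
def expected_score(p, q):
    s = p[0] * (q[1] / q[0]) ** 2 / 2
    s += p[1] * ((q[2] / q[1]) ** 2 / 2 - q[1] / q[0])
    s += p[2] * (-q[2] / q[1])
    return s

p = np.array([0.2, 0.5, 0.3])

# Brute-force search over distributions Q on {0, 1, 2}.
grid = np.arange(0.05, 0.95, 0.05)
best_q, best_s = None, np.inf
for q0 in grid:
    for q1 in grid:
        q2 = 1.0 - q0 - q1
        if q2 <= 0:
            continue
        s = expected_score(p, (q0, q1, q2))
        if s < best_s:
            best_q, best_s = (q0, q1, q2), s

print(best_q)  # recovers P up to grid resolution: properness
```

Note that the score only involves the ratios $q_{x+1}/q_x$, so the minimiser is determined up to normalization, exactly as the $\lambda = 0$ remark above says.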
### Continuum limit

Let

$$u_x = \ln \frac{q_{x+1}}{q_x} = \ln q_{x+1} - \ln q_x \approx \frac{d}{dx} \ln q(x),$$

and define the functions $F_x(\cdot) = \phi_{x,x+1}(\exp(\cdot))$. Then

$$S_x(Q) = \left[ F'_x(u_x) - F'_{x-1}(u_{x-1}) \right] + \left( 1 - e^{-u_{x-1}} \right) F'_{x-1}(u_{x-1}) - F_x(u_x),$$

whose three terms correspond in the continuum to $\frac{d}{dx}\frac{\partial F}{\partial u}$, $u \frac{\partial F}{\partial u}$ and $F(x, u(x))$ respectively. This is the same general form as the continuous $m = 2$ local scoring rule.

### Scoring rule on an undirected graph

The generalization to undirected graphs is helpful and solves problems at the boundaries. We look for scoring rules with a pairwise form:

$$S_x(Q) = \sum_{y \in C_x} L_{xy}\!\left(\frac{q_y}{q_x}\right),$$

where $C_x$ is the connection set of the point $x$ and we call $L_{xy}(\xi)$ a link function.

*Figure: The score at $x$ in terms of the link functions $L_{xy}$ and $L_{yx}$ on the link between $x$ and $y$.*

### Scoring rule on an undirected graph (2)

- Emphasis is now on links, not nodes.
- $C_x \to \mathcal{X}$ if we put $L_{xy} = L_{yx} = 0$ when points $x$ and $y$ are not connected. In particular, $L_{xx} = 0$. This means we can handle missing data.
- Generalizes the nearest neighbour results. The allowed link functions are given in terms of a generating function $\phi_{xy}(\xi)$:
  $$L_{xy}(\xi) = -\phi_{xy}(\xi) + \xi\, \phi'_{xy}(\xi), \qquad L_{yx}(\xi^{-1}) = -\phi'_{xy}(\xi).$$
- The resulting scoring rule is (strictly) proper if and only if $\phi_{xy}(\xi)$ is (strictly) convex in $\xi$ for all $x, y$.

### The metric tensor

In terms of the probability ratios $\eta_{xy} = q_y / q_x$ and $\zeta_{xy} = p_y / p_x$, the divergence is

$$d(P, Q) = \frac{1}{2} \sum_{x, y \in \mathcal{X}} p_x \left[ \phi_{xy}(\zeta_{xy}) - \phi_{xy}(\eta_{xy}) - (\zeta_{xy} - \eta_{xy})\, \phi'_{xy}(\eta_{xy}) \right].$$

Infinitesimally separated distributions imply the metric tensor

$$g_{ij} = \frac{1}{2} \sum_{x, y \in \mathcal{X}} p_x\, \phi''_{xy}(\zeta_{xy})\, \zeta_{xy,i}\, \zeta_{xy,j},$$

where $\zeta_{xy,i} = \frac{\partial \zeta_{xy}}{\partial \theta_i}$.
### Inference for Bernoulli trials

We have $\mathcal{X} = \{0, 1\}$ and observations $\{f_0, f_1 = n - f_0\}$.

- The only graph on the outcome space is $0 - 1$.
- Every scoring rule gives the same estimator for the success probability:
  $$\hat\theta = \frac{f_1}{n} = \hat\theta_{ML}.$$

### Inference for the Poisson distribution

We have $\mathcal{X} = \{0, 1, 2, \ldots\}$ and observations $\{f_x\}$. The Poisson model is

$$q_x(\lambda) = \frac{e^{-\lambda} \lambda^x}{x!}.$$

- Choose the undirected graph $0 - 1 - 2 - \cdots$
- Choose the generating function
  $$\phi_{x,x+1}(\xi) = \frac{(x+1)^a}{m(m+1)}\, \xi^{m+1}.$$

Then the non-zero link functions are

$$L_{x,x+1}(\xi) = \frac{(x+1)^a}{m+1}\, \xi^{m+1}, \qquad L_{x+1,x}(\xi) = -\frac{(x+1)^a}{m}\, \xi^{-m}.$$

### Inference for the Poisson distribution (2)

The empirical scoring rule is

$$S(P_n, Q) = f_0 L_{01}(\lambda) + \sum_{x=1}^{\infty} f_x \left[ L_{x,x-1}\!\left(\frac{x}{\lambda}\right) + L_{x,x+1}\!\left(\frac{\lambda}{x+1}\right) \right].$$

The resulting estimator is

$$\hat\lambda = \frac{\sum_{x=1}^{\infty} x^{-(m-a)} f_x}{\sum_{x=0}^{\infty} (x+1)^{-(m-a+1)} f_x} \;\overset{m = a - 1}{=}\; \frac{1}{n} \sum_x x f_x = \bar{x} = \hat\lambda_{ML}.$$
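The closed-form estimator is easy to check numerically. A sketch (the function name and example counts are mine), confirming that the choice $m = a - 1$ reproduces the sample mean:

```python
import numpy as np

# Estimator from the pairwise Poisson score, given counts f_x of outcomes
# x = 0, 1, 2, ...:
#   lam_hat = Σ_{x≥1} x^{-(m-a)} f_x / Σ_{x≥0} (x+1)^{-(m-a+1)} f_x
def poisson_score_estimator(f, m, a):
    f = np.asarray(f, float)
    x = np.arange(len(f), dtype=float)
    num = np.sum(x[1:] ** (-(m - a)) * f[1:])
    den = np.sum((x + 1.0) ** (-(m - a + 1)) * f)
    return num / den

# Counts for outcomes 0..4 in n = 100 trials.
f = [30, 35, 20, 10, 5]
n = sum(f)
mean = sum(x * fx for x, fx in enumerate(f)) / n

# With m = a - 1 the estimator reduces to the sample mean, i.e. the MLE.
lam_hat = poisson_score_estimator(f, m=1, a=2)
print(lam_hat, mean)
```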
## 4. Applications

### Ising model

Let $\mathcal{X}$ be the possible configurations of spins, taken to be $\pm 1$, on a lattice. Then

$$q_x = \frac{1}{Z(\theta)}\, e^{\theta H_x},$$

where $H_x$ is the Hamiltonian/energy of configuration $x$. The partition function $Z(\theta)$ contains all information about the macroscopic physics.

Can we learn anything about the physics, e.g. phase transitions, without knowing $Z(\theta)$?

### Ising model (2)

Physics: we want to know the expectation parameter $U \equiv \frac{\partial \ln Z}{\partial \theta}$ and

$$V[H - U] = \frac{\partial^2 \ln Z}{\partial \theta^2} \equiv g_{\theta\theta}.$$

The metric diverges, is discontinuous or has a kink at a phase transition.

Statistics: we want to estimate $\theta$. Suppose $e(x, \theta) = 0$ is an unbiased estimating equation; then

$$\frac{V[e(x, \theta)]}{E\left[ \frac{\partial e(x, \theta)}{\partial \theta} \right]^2} \ge g^{\theta\theta}.$$

Knowing the LHS may shed light on the RHS.

### Ising model (3)

Let $x = \{x^i\}$ be a configuration in $\mathcal{X}$. We need to choose an undirected graph on $\mathcal{X}$.

1. Single site flips: $C_x[k]$, the set of configurations that differ from $x$ only at site $k$.
2. Complete graph: $C_x = \mathcal{X}$.

Next choose generating functions...

### Pseudolikelihood

The pseudolikelihood is defined via

$$-\ln PL(X = x) = -\sum_k \ln \Pr\!\left(X^k = x^k \mid X^{\setminus k} = x^{\setminus k}\right) = -\sum_k \ln \frac{q_x}{\sum_{x^k} q_x} = \sum_k \ln \left( 1 + \sum_{y \in C_x[k]} \frac{q_y}{q_x} \right).$$

- It does not depend on the normalization of the model.
- Is it a discrete local scoring rule? Yes!
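As an illustration of the normalization-free pseudolikelihood score (this toy example is mine, not from the slides), consider a short 1-D Ising chain with $H_x = \sum_i x_i x_{i+1}$: only the unnormalized weights $e^{\theta H_x}$ enter the ratios $q_y/q_x$, so $Z(\theta)$ is never computed.

```python
import numpy as np

# Unnormalized Ising weight e^{θ H_x} for a 1-D chain with H_x = Σ_i x_i x_{i+1};
# the partition function Z(θ) cancels in every ratio below.
def weight(x, theta):
    x = np.asarray(x)
    return np.exp(theta * np.sum(x[:-1] * x[1:]))

# Negative log pseudolikelihood: Σ_k ln(1 + Σ_{y in C_x[k]} q_y / q_x),
# where C_x[k] holds the configuration with spin k flipped (spins are ±1,
# so each site has exactly one flip).
def neg_log_pl(x, theta):
    total = 0.0
    for k in range(len(x)):
        y = list(x)
        y[k] = -y[k]                      # single-site flip at site k
        total += np.log(1.0 + weight(y, theta) / weight(x, theta))
    return total

# For θ > 0, an aligned chain is more probable than an alternating one,
# so it incurs a smaller pseudolikelihood loss.
print(neg_log_pl([1, 1, 1, 1], 0.5), neg_log_pl([1, -1, 1, -1], 0.5))
```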
### Pseudolikelihood (2)

- Pseudolikelihood is a sum of (possibly different) $n$-wise discrete local scoring rules.
- The link functions are
  $$L_{x y_1 \cdots y_n} = \ln\left( 1 + \sum_{i=1}^{n} \eta_i \right),$$
  where $\eta_i = q_{y_i} / q_x$.
- The convex generating function that makes everything work is
  $$\phi_{x y_1 \cdots y_n} = -\left( 1 + \sum_{i=1}^{n} \eta_i \right) \ln\left( 1 + \sum_{i=1}^{n} \eta_i \right) + \sum_{i=1}^{n} \eta_i \ln \eta_i.$$

### Sequential prediction

We observe $\{x_1, \ldots, x_n\}$ iid outcomes and wish to make a prediction for $x_{n+1}$. Take the one-parameter case and suppose we are in the model:

- Let $\hat\theta_n$ be a consistent estimator for $\theta$ given the observations, i.e. $\hat\theta_n \to \theta$.
- Quote $\hat{Q}_n \equiv Q_{\hat\theta_n}$ as the distribution for $x_{n+1}$.
- To give a measure of how far $\hat{Q}_n$ is from $P$, we define
  $$d_n = E_\theta\left[ KL(P, \hat{Q}_n) \right].$$
- A desirable result is BIC compatibility:
  $$d_n \longrightarrow \frac{1}{2}\, n^{-1}.$$
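The BIC-compatible rate $d_n \to \tfrac{1}{2} n^{-1}$ can be checked by simulation in a simple case. A sketch of mine, using the Bernoulli model with the plug-in MLE:

```python
import numpy as np

# Monte-Carlo check that d_n = E[KL(P, Q_hat_n)] ≈ 1/(2n) for the Bernoulli
# model with the plug-in MLE, i.e. the "BIC compatible" rate.
rng = np.random.default_rng(0)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, n, reps = 0.5, 200, 20000
samples = rng.random((reps, n)) < theta       # reps independent size-n samples
theta_hat = samples.mean(axis=1)              # plug-in MLE per sample
theta_hat = np.clip(theta_hat, 1 / n, 1 - 1 / n)   # guard against log(0)
d_n = kl_bernoulli(theta, theta_hat).mean()

print(n * d_n)  # close to 0.5, i.e. d_n ≈ 1/(2n)
```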

### Sequential prediction (2)

In the minimum description length approach, normalized maximum likelihood gives the optimal sequential prediction:

$$q(x_{n+1}) = \frac{q(x_{n+1} \mid \hat\theta_{n+1}(x))}{\sum_{y_{n+1}} q(y_{n+1} \mid \hat\theta_{n+1}(y))},$$

where $\hat\theta_{n+1}$ is the MLE.

- But the normalizing sum is often infinite!
- The idea is to choose a discrete local scoring rule that gives a different estimator $\hat\theta$ and a divergence which does not depend on normalization.

### Summary

1. Scoring rules have a rich geometric structure still to be explored.
2. Discrete local scoring rules are the counterparts of local scoring rules on continuous outcome spaces. They do not depend on the normalisation of the model.
3. Current work on Ising models, pseudolikelihood and sequential prediction.
