

Discrete local scoring rules

Matthew Parry
Joint work with Philip Dawid and Steffen Lauritzen

Statistical Laboratory, University of Cambridge
Department of Statistics, University of Oxford

Question

Can we extend the idea of locality to discrete outcome spaces?
Yes.
And the normalization remains irrelevant!
Outline

1. Discrete scoring rules
   Example: Log score and Bregman score
   Applications: University exams and non-parametric tests
2. Some geometrical considerations
   Metrics
   Open questions
3. Discrete local scoring rules
   Nearest neighbour scoring rules
   General pairwise scoring rules
   The metric tensor
   Examples: Bernoulli trials and Poisson distribution
4. Applications and current work
   Ising model
   Pseudolikelihood
   Sequential prediction

1. Discrete scoring rules
Discrete scoring rules

Let $X$ be a random variable taking values in a discrete outcome space $\mathcal{X}$.

A scoring rule $S_x(Q)$ is a loss function associated with choosing the distribution $Q$ to represent our uncertainty about $X$.

If $P$ is the true distribution, the expected score is
$$S(P, Q) = E_P[S_x(Q)] = \sum_x p_x S_x(Q),$$
where $p_x$ is the probability that event $x$ is observed.

Note that only the outcome space is discrete; the $q_x$ will typically depend on continuous parameters.

Two key requirements

The requirements for a discrete scoring rule are the same as for any scoring rule:

$S(P, Q)$ is affine in $P$. (True by construction here.)

$S(P, Q)$ is proper: for all $P$, it is minimised in $Q$ at $Q = P$. A scoring rule is said to be strictly proper if
$$S(P, Q) > S(P, P) \quad \text{for } Q \neq P.$$

In other words, it does not pay to be dishonest. If you know $P$, you ought to quote it for $Q$.
Inference

In the case of a one parameter model, i.e. $Q = Q_\theta$, the estimator is simply
$$\hat\theta = \arg\min_\theta S_x(Q_\theta).$$

It is a nice property of scoring rules that the estimating equation
$$e(x, \theta) \equiv \frac{\partial S_x(Q_\theta)}{\partial \theta} = 0$$
is unbiased, i.e. $E_\theta[e(x, \theta)] = 0$.

The logarithmic scoring rule

The simplest and best example of a discrete scoring rule is the logarithmic scoring rule or log score:
$$S_x(Q) = -\ln q_x.$$

It is the only scoring rule that depends on $Q$ solely at the observed outcome $x$.

It is extensive: if $q_x = q^{(1)}_{x_1} q^{(2)}_{x_2}$ then
$$S_x(Q) = S_{x_1}(Q^{(1)}) + S_{x_2}(Q^{(2)}).$$

Leads to maximum likelihood estimation.
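The two properties claimed for the log score can be checked numerically. The following is a minimal sketch; the distributions `p`, `quotes`, `q1` and `q2` are hypothetical values chosen only for illustration.

```python
import math

def log_score(x, q):
    """Log score S_x(Q) = -ln q_x."""
    return -math.log(q[x])

def expected_score(p, q):
    """Expected score S(P, Q) = sum_x p_x S_x(Q)."""
    return sum(px * log_score(x, q) for x, px in enumerate(p))

# Hypothetical true distribution P and some quoted distributions Q != P.
p = [0.5, 0.3, 0.2]
quotes = [[0.4, 0.4, 0.2], [0.6, 0.2, 0.2], [1 / 3, 1 / 3, 1 / 3]]

# Strict propriety: quoting Q != P costs more on average than quoting P.
excess = [expected_score(p, q) - expected_score(p, p) for q in quotes]

# Extensivity: for a product distribution the score is a sum of scores.
q1, q2 = [0.7, 0.3], [0.2, 0.5, 0.3]
x1, x2 = 1, 2
joint_score = -math.log(q1[x1] * q2[x2])
sum_of_scores = log_score(x1, q1) + log_score(x2, q2)
```

Every entry of `excess` should be positive, and `joint_score` should agree with `sum_of_scores` up to rounding.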
Separable Bregman scoring rules

In fact all separable Bregman scoring rules are also discrete scoring rules:
$$S_x(Q) = -\psi'_x(q_x) - \sum_y \left[\psi_y(q_y) - q_y \psi'_y(q_y)\right],$$
where $\psi_x(\cdot)$ is convex for all $x$.

Not extensive.

Example of the log score in testing

The student is given a statement and asked to assign a degree of confidence $\gamma \in [0, 1]$ in the statement being true. In other words $\mathcal{X} = \{T, F\} = \{1, 0\}$ and
$$q_x = \begin{cases} 1 - \gamma, & x = 0 \\ \gamma, & x = 1. \end{cases}$$

The student's scoring rule for the question is $+\ln q_x$. The student's mark on the exam is the empirical score.
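The separable Bregman form above can be sketched in a few lines. This assumes the reading $S_x(Q) = -\psi'_x(q_x) - \sum_y [\psi_y(q_y) - q_y\psi'_y(q_y)]$ and, for simplicity, uses the same $\psi$ for every outcome. The choice $\psi(t) = t^2$ is an illustrative assumption, not one made in the talk.

```python
import math

def bregman_score(x, q, psi, dpsi):
    """S_x(Q) = -psi'(q_x) - sum_y [psi(q_y) - q_y psi'(q_y)]."""
    return -dpsi(q[x]) - sum(psi(qy) - qy * dpsi(qy) for qy in q)

def expected(p, q, psi, dpsi):
    """Expected score S(P, Q) = sum_x p_x S_x(Q)."""
    return sum(px * bregman_score(x, q, psi, dpsi) for x, px in enumerate(p))

# psi(t) = t ln t should recover the log score -ln q_x.
ent = lambda t: t * math.log(t)
dent = lambda t: math.log(t) + 1.0

# psi(t) = t^2 (illustrative assumption) gives a quadratic score.
sq = lambda t: t * t
dsq = lambda t: 2.0 * t

p = [0.5, 0.3, 0.2]   # hypothetical true distribution
q = [0.4, 0.4, 0.2]   # hypothetical quoted distribution

log_gap = bregman_score(0, q, ent, dent) - (-math.log(q[0]))
quad_excess = expected(p, q, sq, dsq) - expected(p, p, sq, dsq)
```

Here `log_gap` should vanish up to rounding, and for the quadratic choice `quad_excess` equals the squared distance $\sum_x (p_x - q_x)^2$, which is positive whenever $Q \neq P$.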
Example of log score in testing (2)

This is called the Certainty-based Marking scheme for testing knowledge.

In practice, the student assigns "low", "medium" or "high" confidence to what they think is the correct answer. An approximate log score is

answer \ confidence    low    medium    high    no response
correct                  1         2       3              0
incorrect                0        -2      -6              0

Levels of confidence correspond to less than 67%, between 67% and 80%, and greater than 80%.

The divergence and non-parametric tests

The divergence or discrepancy is
$$d(P, Q) = S(P, Q) - S(P, P) \geq 0.$$

Let $p_x$ be the observed relative frequency and $q_x$ the expected relative frequency of $x \in \mathcal{X} = \{1, \ldots, K\}$ in $n$ independent trials. Then
$$\lambda^2 = 2n\, d(P, Q) = 2n \sum_x p_x \left(S_x(Q) - S_x(P)\right)$$
is a non-parametric test statistic for the hypothesis that $P = Q$.
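The confidence thresholds quoted for the marking table can be recovered from the marks themselves: a student who believes the answer is correct with probability $p$ maximises the expected mark by picking the matching level. The sketch below uses only the marks in the table; the helper names are hypothetical.

```python
# Marks from the table: (mark if correct, mark if incorrect).
MARKS = {"low": (1, 0), "medium": (2, -2), "high": (3, -6)}

def expected_mark(level, p):
    """Expected mark when the answer is correct with probability p."""
    right, wrong = MARKS[level]
    return p * right + (1 - p) * wrong

def best_level(p):
    """Confidence level that maximises the expected mark."""
    return max(MARKS, key=lambda level: expected_mark(level, p))

# Probe a few beliefs on either side of the quoted thresholds.
choices = {p: best_level(p) for p in (0.55, 0.60, 0.70, 0.75, 0.85, 0.95)}
```

Equating expected marks gives the crossover points: $p = 4p - 2$ at $p = 2/3$ and $4p - 2 = 9p - 6$ at $p = 4/5$, matching the 67% and 80% levels.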
The divergence and non-parametric tests (2)

Analysis of the probability distribution for $\lambda^2$ typically requires expanding $d(P, Q)$ in $P$ about $P = Q$. Results asymptotic in $n$ follow...

The log score gives the G-test, which is very close to the Pearson chi-squared test:
$$\lambda^2 \sim \chi^2_{K-1}.$$

The other separable scores are too difficult to analyze!

But not the $\chi^2$-test

The Pearson chi-squared test statistic is
$$\chi^2 = n \sum_x \frac{(p_x - q_x)^2}{q_x}.$$

Since there are terms in $q_x$ not linear in $p_x$, it cannot be obtained from a scoring rule.
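To see how close the two statistics are in practice, the following sketch computes both for one data set. The counts and the hypothesised distribution are made up for illustration.

```python
import math

def g_statistic(counts, q):
    """lambda^2 = 2n d(P, Q) for the log score: 2 sum_x f_x ln(p_x / q_x)."""
    n = sum(counts)
    return 2.0 * sum(
        f * math.log((f / n) / qx) for f, qx in zip(counts, q) if f > 0
    )

def pearson_statistic(counts, q):
    """chi^2 = n sum_x (p_x - q_x)^2 / q_x."""
    n = sum(counts)
    return n * sum((f / n - qx) ** 2 / qx for f, qx in zip(counts, q))

counts = [18, 22, 20, 25, 15]   # hypothetical counts, K = 5, n = 100
q = [0.2] * 5                   # hypothesised distribution

g = g_statistic(counts, q)
chi2 = pearson_statistic(counts, q)
```

Both values would be referred to a chi-squared distribution with $K - 1 = 4$ degrees of freedom. They agree to leading order because the Pearson statistic is the quadratic term in the expansion of the log-score divergence about $P = Q$.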
2. Geometry

Some geometrical considerations

For discrete outcome spaces, the distributions $P$ and $P + \delta P$ are separated by the infinitesimal squared distance
$$\begin{aligned}
(ds)^2 \equiv ds^2 &= 2\, d(P, P + \delta P) \\
&= \sum_x \psi''_x(p_x)\, (\delta p_x)^2 && \text{(Bregman score)} \\
&= \sum_x \frac{(\delta p_x)^2}{p_x} && \text{(Log score)}
\end{aligned}$$
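Metrics

Typically, probability distributions depend on model parameters $\{\theta_i\}$. Then $\delta p_x(\theta) = \frac{\partial p_x(\theta)}{\partial \theta_i}\, d\theta_i$ (summation over $i$ implied) and we obtain the metric tensor on the manifold of models:
$$\begin{aligned}
g_{ij} &= \sum_x \psi''_x(p_x)\, \frac{\partial p_x(\theta)}{\partial \theta_i} \frac{\partial p_x(\theta)}{\partial \theta_j} && \text{(Bregman score)} \\
&= \sum_x p_x\, \frac{\partial \ln p_x(\theta)}{\partial \theta_i} \frac{\partial \ln p_x(\theta)}{\partial \theta_j} && \text{(Log score} \to \text{Fisher metric)}
\end{aligned}$$

Open questions about metrics

What is the role of non-Fisherian metrics?

Generalized Cramér-Rao inequality?

Specialized Chentsov's theorem?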
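As a concrete instance of the metric tensor, the sketch below evaluates $g = \sum_x \psi''(p_x)\,(\partial p_x/\partial\theta)^2$ for a one parameter Bernoulli model. It uses a central finite difference for the derivative. The Bernoulli model and the choice $\psi(t) = t^2$ are illustrative assumptions.

```python
def bernoulli(theta):
    """Probabilities (p_0, p_1) of a Bernoulli model with success probability theta."""
    return [1.0 - theta, theta]

def metric(model, theta, d2psi, h=1e-6):
    """g = sum_x psi''(p_x) (dp_x/dtheta)^2 for a one parameter model."""
    p = model(theta)
    up, down = model(theta + h), model(theta - h)
    dp = [(u - d) / (2 * h) for u, d in zip(up, down)]  # central difference
    return sum(d2psi(px) * dpx ** 2 for px, dpx in zip(p, dp))

theta = 0.3
# psi(t) = t ln t has psi''(t) = 1/t and yields the Fisher metric.
fisher = metric(bernoulli, theta, lambda t: 1.0 / t)
# psi(t) = t^2 (illustrative assumption) has psi''(t) = 2.
quadratic = metric(bernoulli, theta, lambda t: 2.0)
```

For the log score this gives the familiar Fisher information $1/(\theta(1-\theta))$, while the quadratic choice gives a constant metric. This is one simple example of a non-Fisherian metric on the same manifold.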
Efficiency

The Fisher metric plays two statistical roles. The first is in the Cramér-Rao inequality. The mean squared error of an unbiased estimator $\hat\theta_i$ is bounded below by the inverse metric tensor:
$$g_F^{ij} \leq E\left[(\hat\theta_i - \theta_i)(\hat\theta_j - \theta_j)\right].$$

$\hat\theta_i$ is efficient when it saturates this bound.

Less efficient $\iff$ different metric bound?

Sufficiency

The second role of the Fisher metric is in Chentsov's theorem [Chentsov (1972)].

Theorem
Up to a constant multiplicative factor, the only metric that is invariant with respect to sufficient reduction of the data on all statistical manifolds is the Fisher metric $g^F_{ij}$.

This justifies calling the Fisher metric the information; sufficient statistics retain the necessary information of the data.

On a particular manifold, can there be more invariant metrics?
3. Discrete local scoring rules

Discrete local scoring rules

To start with, we take $\mathcal{X}$ to be an ordered, nowhere dense set: without loss of generality, a subset of the integers. Consider the discrete version of an $m = 2$ local scoring rule:
$$S_x(Q) = s_x(q_{x-1}, q_x, q_{x+1}).$$

This is a nearest neighbour scoring rule.

Figure: The score $S_x(Q)$ at the point $x$, shown with its neighbours $x-1$ and $x+1$.
Nearest neighbour rule

We need to ensure
$$S(P, Q) = \sum_x p_x\, s_x(q_{x-1}, q_x, q_{x+1})$$
is minimised in $Q$ at $Q = P$ for all $P$.

The necessary form of the scoring rule is
$$s_x = -\lambda \ln q_x + L_{x,x+1}\!\left(\frac{q_{x+1}}{q_x}\right) + L_{x,x-1}\!\left(\frac{q_{x-1}}{q_x}\right),$$
with the constraint $L'_{x,x+1}(\xi) = \xi^{-1} L'_{x+1,x}(\xi^{-1})$.

With $\lambda = 0$, the scoring rule depends only on probability ratios, i.e. it is independent of the normalization.

There can be no "one-sided" rules; boundary terms are an issue.

Nearest neighbour rule (2)

The constraint is satisfied if we use a generating function $\phi_{x,x+1}(\xi)$:
$$L_{x,x+1}(\xi) = -\phi_{x,x+1}(\xi) + \xi\, \phi'_{x,x+1}(\xi), \qquad L_{x+1,x}(\xi^{-1}) = -\phi'_{x,x+1}(\xi).$$

The resulting scoring rule is (strictly) proper if and only if $\phi_{x,x+1}(\xi)$ is (strictly) convex in $\xi$ for all $x$.
Continuum limit

Let
$$u_x = \ln \frac{q_{x+1}}{q_x} = \ln q_{x+1} - \ln q_x \approx \frac{d}{dx} \ln q(x).$$

And define the functions $F_x(\cdot) = \phi_{x,x+1}(\exp(\cdot))$. Then
$$S_x(Q) = \underbrace{F'_x(u_x) - F'_{x-1}(u_{x-1})}_{\frac{d}{dx} \cdot \frac{\partial F}{\partial u}} + \underbrace{(1 - e^{-u_{x-1}})\, F'_{x-1}(u_{x-1})}_{u\, \frac{\partial F}{\partial u}} - \underbrace{F_x(u_x)}_{F(x, u(x))}.$$

Same general form as the continuous $m = 2$ local scoring rule.

Scoring rule on an undirected graph

The generalization to undirected graphs is helpful and solves problems at the boundaries. We look for scoring rules with a pair-wise form:
$$S_x(Q) = \sum_{y \in C_x} L_{xy}\!\left(\frac{q_y}{q_x}\right),$$
where $C_x$ is the connection set of the point $x$ and we call $L_{xy}(\xi)$ a link function.

Figure: The score at $x$ in terms of the link functions $L_{xy}$ (the link between $x$ and $y$ carries $L_{xy}$ and $L_{yx}$).
Scoring rule on an undirected graph (2)

Emphasis is now on links not nodes.

$C_x \to \mathcal{X}$ if we put $L_{xy} = L_{yx} = 0$ when points $x$ and $y$ are not connected. In particular, $L_{xx} = 0$. This means we can handle missing data.

Generalizes nearest neighbour results. The allowed link functions are given in terms of a generating function $\phi_{xy}(\xi)$:
$$L_{xy}(\xi) = -\phi_{xy}(\xi) + \xi\, \phi'_{xy}(\xi), \qquad L_{yx}(\xi^{-1}) = -\phi'_{xy}(\xi).$$

The resulting scoring rule is (strictly) proper if and only if $\phi_{xy}(\xi)$ is (strictly) convex in $\xi$ for all $x, y$.

The metric tensor

In terms of the probability ratios $\eta_{xy} = q_y / q_x$ and $\zeta_{xy} = p_y / p_x$, the divergence is
$$d(P, Q) = \frac{1}{2} \sum_{x,y \in \mathcal{X}} p_x \left[\phi_{xy}(\zeta_{xy}) - \phi_{xy}(\eta_{xy}) - (\zeta_{xy} - \eta_{xy})\, \phi'_{xy}(\eta_{xy})\right].$$

Infinitesimally separated distributions imply the metric tensor
$$g_{ij} = \frac{1}{2} \sum_{x,y \in \mathcal{X}} p_x\, \phi''_{xy}(\zeta_{xy})\, \zeta_{xy,i}\, \zeta_{xy,j},$$
where $\zeta_{xy,i} = \dfrac{\partial \zeta_{xy}}{\partial \theta_i}$.
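The link-function construction can be exercised on a small chain $0 - 1 - 2 - 3$. The sketch below builds the score from a generating function and then checks three things. The divergence should be non-negative. It should equal the Bregman-type sum over links, $\sum p_x[\phi(\zeta) - \phi(\eta) - (\zeta - \eta)\phi'(\eta)]$, with each link counted once. It should not change when $Q$ is left unnormalised. The choice $\phi(\xi) = \xi^2$ and the distributions are illustrative assumptions.

```python
def phi(xi):
    """Generating function; xi^2 is an illustrative strictly convex choice."""
    return xi * xi

def dphi(xi):
    return 2.0 * xi

def link_forward(xi):
    """L_{x,x+1}(xi) = -phi(xi) + xi phi'(xi)."""
    return -phi(xi) + xi * dphi(xi)

def link_backward(xi):
    """L_{x+1,x}(xi) = -phi'(1/xi)."""
    return -dphi(1.0 / xi)

def score(x, q):
    """Nearest neighbour score at x on the chain 0 - 1 - ... - (len(q) - 1)."""
    s = 0.0
    if x + 1 < len(q):
        s += link_forward(q[x + 1] / q[x])
    if x > 0:
        s += link_backward(q[x - 1] / q[x])
    return s

def expected_score(p, q):
    return sum(px * score(x, q) for x, px in enumerate(p))

def divergence_from_links(p, q):
    """Sum over links of p_x [phi(zeta) - phi(eta) - (zeta - eta) phi'(eta)]."""
    total = 0.0
    for x in range(len(p) - 1):
        zeta, eta = p[x + 1] / p[x], q[x + 1] / q[x]
        total += p[x] * (phi(zeta) - phi(eta) - (zeta - eta) * dphi(eta))
    return total

p = [0.1, 0.4, 0.3, 0.2]                  # hypothetical true distribution
q = [0.25, 0.25, 0.3, 0.2]                # hypothetical quoted distribution
q_unnormalised = [7.0 * qx for qx in q]   # same ratios, wrong normalisation

d_direct = expected_score(p, q) - expected_score(p, p)
d_links = divergence_from_links(p, q)
d_scaled = expected_score(p, q_unnormalised) - expected_score(p, p)
```

Because the score only ever sees ratios $q_y/q_x$, rescaling $Q$ leaves `d_scaled` equal to `d_direct`.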
Inference for Bernoulli trials

We have $\mathcal{X} = \{0, 1\}$ and observations $\{f_0, f_1 = n - f_0\}$.

The only graph on the outcome space is $0 - 1$.

Every scoring rule gives the same estimator for the success probability:
$$\hat\theta = \frac{f_1}{n} = \theta_{ML}.$$

Inference for the Poisson distribution

We have $\mathcal{X} = \{0, 1, 2, \ldots\}$ and observations $\{f_x\}$. The Poisson model is
$$q_x(\lambda) = \frac{e^{-\lambda} \lambda^x}{x!}.$$

Choose the undirected graph $0 - 1 - 2 - \ldots$

Choose the generating function
$$\phi_{x,x+1}(\xi) = \frac{(x+1)^a}{m(m+1)}\, \xi^{m+1}.$$

Then the non-zero link functions are
$$L_{x,x+1}(\xi) = \frac{(x+1)^a}{m+1}\, \xi^{m+1}, \qquad L_{x+1,x}(\xi) = -\frac{(x+1)^a}{m}\, \xi^{-m}.$$
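For the Poisson model the probability ratios are $q_{x+1}/q_x = \lambda/(x+1)$ and $q_{x-1}/q_x = x/\lambda$, so the empirical score built from these link functions can be minimised directly. The sketch below does this for hypothetical counts. It compares the result with the closed form $\hat\lambda = \sum_{x \ge 1} x^{-(m-a)} f_x \big/ \sum_{x \ge 0} (x+1)^{-(m-a+1)} f_x$, which comes from setting the derivative to zero. The function names and data are assumptions made for illustration.

```python
def empirical_score(lam, counts, m, a):
    """sum_x f_x [L_{x,x+1}(lam/(x+1)) + L_{x,x-1}(x/lam)] for the Poisson model."""
    total = 0.0
    for x, f in enumerate(counts):
        # L_{x,x+1}(xi) = (x+1)^a xi^(m+1) / (m+1) at xi = lam / (x+1)
        total += f * (x + 1) ** a * (lam / (x + 1)) ** (m + 1) / (m + 1)
        if x >= 1:
            # L_{x,x-1}(xi) = -x^a xi^(-m) / m at xi = x / lam
            total -= f * x ** a * (x / lam) ** (-m) / m
    return total

def closed_form(counts, m, a):
    """Stationary point of the empirical score in lam."""
    num = sum(f * x ** (a - m) for x, f in enumerate(counts) if x >= 1)
    den = sum(f * (x + 1) ** (a - m - 1) for x, f in enumerate(counts))
    return num / den

def is_local_minimum(lam, counts, m, a, step=1e-3):
    here = empirical_score(lam, counts, m, a)
    return (here < empirical_score(lam - step, counts, m, a)
            and here < empirical_score(lam + step, counts, m, a))

counts = [12, 25, 30, 18, 10, 5]   # hypothetical f_0, ..., f_5
n = sum(counts)
sample_mean = sum(x * f for x, f in enumerate(counts)) / n

lam_ml = closed_form(counts, m=1.0, a=2.0)     # m = a - 1
lam_other = closed_form(counts, m=1.0, a=1.0)  # another member of the family
```

With $m = a - 1$ the weights collapse and the estimate is the sample mean, i.e. the maximum likelihood estimate. Other choices of $(m, a)$ give different but still normalisation-free estimators.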
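Inference for the Poisson distribution (2)

The empirical scoring rule is
$$S(P_n, Q) = f_0\, L_{01}(\lambda) + \sum_{x=1}^{\infty} f_x \left[L_{x,x-1}\!\left(\frac{x}{\lambda}\right) + L_{x,x+1}\!\left(\frac{\lambda}{x+1}\right)\right].$$

The resulting estimator is
$$\hat\lambda = \frac{\sum_{x=1}^{\infty} x^{-(m-a)} f_x}{\sum_{x=0}^{\infty} (x+1)^{-(m-a+1)} f_x} \;\overset{m = a - 1}{=}\; \frac{1}{n} \sum_{x=0}^{\infty} x f_x = \bar{x} = \lambda_{ML}.$$

4. Applications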
Ising model

Let $\mathcal{X}$ be the possible configurations of spins, taken to be $\pm 1$, on a lattice. Then
$$q_x = \frac{1}{Z(\theta)}\, e^{\theta H_x},$$
where $H_x$ is the Hamiltonian/energy of configuration $x$. The partition function $Z(\theta)$ contains all information about the macroscopic physics.

Can we learn anything about the physics, e.g. phase transitions, without knowing $Z(\theta)$?

Ising model (2)

Physics: want to know the expectation parameter $U \equiv \dfrac{\partial \ln Z}{\partial \theta}$ and
$$V[H - U] = \frac{\partial^2 \ln Z}{\partial \theta^2} \equiv g_{\theta\theta}.$$
The metric diverges, is discontinuous or has a kink at a phase transition.

Statistics: want to estimate $\theta$. Suppose $e(x, \theta) = 0$ is an unbiased estimating equation; then
$$\frac{V[e(x, \theta)]}{\left(E\left[\dfrac{\partial e(x, \theta)}{\partial \theta}\right]\right)^2} \geq g^{\theta\theta}.$$
Knowing the LHS may shed light on the RHS.
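On a lattice small enough to enumerate, the identities $U = \partial \ln Z/\partial\theta$ and $V[H] = \partial^2 \ln Z/\partial\theta^2$ can be checked by brute force. The sketch below assumes an open chain of four spins with $H_x = \sum_i x_i x_{i+1}$. Both the lattice and this form of the Hamiltonian are illustrative assumptions.

```python
import itertools
import math

def hamiltonian(x):
    """H_x = sum of nearest neighbour products on an open chain (assumed form)."""
    return sum(a * b for a, b in zip(x, x[1:]))

CONFIGS = list(itertools.product([-1, 1], repeat=4))

def log_z(theta):
    """ln Z(theta) by explicit enumeration of all configurations."""
    return math.log(sum(math.exp(theta * hamiltonian(x)) for x in CONFIGS))

def moments(theta):
    """Mean and variance of H under q_x = exp(theta H_x) / Z(theta)."""
    z = math.exp(log_z(theta))
    probs = [math.exp(theta * hamiltonian(x)) / z for x in CONFIGS]
    mean = sum(p * hamiltonian(x) for p, x in zip(probs, CONFIGS))
    var = sum(p * (hamiltonian(x) - mean) ** 2 for p, x in zip(probs, CONFIGS))
    return mean, var

theta, h = 0.4, 1e-4
u, g = moments(theta)
# Finite-difference derivatives of ln Z for comparison.
u_fd = (log_z(theta + h) - log_z(theta - h)) / (2 * h)
g_fd = (log_z(theta + h) - 2 * log_z(theta) + log_z(theta - h)) / h ** 2
```

Enumeration costs $2^N$ terms, which is exactly why methods that avoid $Z(\theta)$ are of interest for realistic lattices.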
Ising model (3)

Let $x = \{x^i\}$ be a configuration in $\mathcal{X}$. Need to choose an undirected graph on $\mathcal{X}$.

1. Single site flips: $C_x = \bigcup_k C_x[k]$, where $C_x[k]$ is the set of configurations that differ from $x$ only at site $k$.
2. Complete graph: $C_x = \mathcal{X}$.

Next choose generating functions...

Pseudolikelihood

The pseudolikelihood is defined via
$$\begin{aligned}
-\ln PL(X = x) &= -\sum_k \ln \Pr\left(X^k = x^k \mid X^{\setminus k} = x^{\setminus k}\right) \\
&= -\sum_k \ln \frac{q_x}{q_{x^{\setminus k}}} \\
&= \sum_k \ln \frac{\sum_{x^k} q_x}{q_x} \\
&= \sum_k \ln \left(1 + \sum_{y \in C_x[k]} \frac{q_y}{q_x}\right).
\end{aligned}$$

Does not depend on the normalization of the model.

Is it a discrete local scoring rule?
Yes!
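The last line of the pseudolikelihood derivation involves only ratios $q_y/q_x$ over single site flips, so it can be evaluated without the partition function. The sketch below computes the negative log pseudolikelihood two ways for a $\pm 1$ spin chain: from the site-wise conditionals and from the ratio form. The Hamiltonian, the configuration and the `scale` argument, which stands in for an unknown $1/Z$, are illustrative assumptions.

```python
import math

def unnormalised_q(x, theta):
    """exp(theta H_x), with H_x the open-chain nearest neighbour sum (assumed)."""
    return math.exp(theta * sum(a * b for a, b in zip(x, x[1:])))

def flip(x, k):
    """The single configuration in C_x[k] for +/-1 spins: x with site k flipped."""
    return x[:k] + (-x[k],) + x[k + 1:]

def neg_log_pl_conditionals(x, theta):
    """-sum_k ln Pr(X^k = x^k | rest), from the site-wise conditionals."""
    total = 0.0
    for k in range(len(x)):
        qx, qy = unnormalised_q(x, theta), unnormalised_q(flip(x, k), theta)
        total -= math.log(qx / (qx + qy))
    return total

def neg_log_pl_ratios(x, theta, scale=1.0):
    """sum_k ln(1 + sum_{y in C_x[k]} q_y / q_x); scale mimics an unknown 1/Z."""
    total = 0.0
    for k in range(len(x)):
        qy = scale * unnormalised_q(flip(x, k), theta)
        qx = scale * unnormalised_q(x, theta)
        total += math.log(1.0 + qy / qx)
    return total

x = (1, 1, -1, 1, -1)   # hypothetical observed configuration
a = neg_log_pl_conditionals(x, 0.3)
b = neg_log_pl_ratios(x, 0.3)
c = neg_log_pl_ratios(x, 0.3, scale=1e-3)
```

All three values coincide. The cost is linear in the number of sites, rather than exponential as for the full likelihood.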
Pseudolikelihood (2)

Pseudolikelihood is a sum of (possibly different) $n$-wise discrete local scoring rules.

Link functions are
$$L_{x y_1 \cdots y_n} = \ln\left(1 + \sum_{i=1}^{n} \eta_i\right),$$
where $\eta_i = q_{y_i} / q_x$.

The convex generating function that makes everything work is
$$\phi_{x y_1 \cdots y_n} = -\left(1 + \sum_{i=1}^{n} \eta_i\right) \ln\left(1 + \sum_{i=1}^{n} \eta_i\right) + \sum_{i=1}^{n} \eta_i \ln \eta_i.$$

Sequential prediction

We observe $\{x_1, \ldots, x_n\}$ iid outcomes and wish to make a prediction for $x_{n+1}$. Take the one parameter case and suppose we are in the model:

Let $\hat\theta_n$ be a consistent estimator for $\theta$ given the observations, i.e. $\hat\theta_n \to \theta$.

Quote $Q_n \equiv Q_{\hat\theta_n}$ as the distribution for $x_{n+1}$.

To give a measure of how far $Q_n$ is from $P$, we define
$$d_n = E_\theta\left[KL(P, Q_n)\right].$$

Desirable result is BIC compatible:
$$d_n \longrightarrow \tfrac{1}{2}\, n^{-1}.$$
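The pseudolikelihood link function $\ln(1 + \sum_i \eta_i)$ and its generating function can be checked against each other numerically. For $n = 1$ the relation is the pairwise one, $L = -\phi + \eta\,\phi'$. For $n > 1$ the sketch assumes the natural multi-argument analogue $L = -\phi + \sum_i \eta_i\, \partial\phi/\partial\eta_i$, which the slides do not state explicitly. Derivatives are taken by central finite differences, and the ratio values are arbitrary.

```python
import math

def phi(etas):
    """phi = -(1 + sum eta) ln(1 + sum eta) + sum eta_i ln eta_i."""
    s = sum(etas)
    return -(1 + s) * math.log(1 + s) + sum(e * math.log(e) for e in etas)

def link(etas):
    """Pseudolikelihood link function ln(1 + sum eta)."""
    return math.log(1 + sum(etas))

def link_from_phi(etas, h=1e-6):
    """-phi + sum_i eta_i dphi/deta_i (assumed n-wise analogue), numerically."""
    total = -phi(etas)
    for i, e in enumerate(etas):
        up = etas[:i] + [e + h] + etas[i + 1:]
        down = etas[:i] + [e - h] + etas[i + 1:]
        total += e * (phi(up) - phi(down)) / (2 * h)
    return total

pair = [0.8]                # n = 1, the pairwise case
triple = [0.5, 1.5, 2.0]    # n = 3, arbitrary illustrative ratios

gap_pair = link_from_phi(pair) - link(pair)
gap_triple = link_from_phi(triple) - link(triple)
```

Both gaps should vanish up to finite-difference error. For $n = 1$ one can also check by hand that $\phi''(\eta) = 1/(\eta(1+\eta)) > 0$, so the generating function is strictly convex.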
Sequential prediction (2)

In the minimum description length approach, normalized maximum likelihood gives the optimal sequential prediction:
$$q(x_{n+1}) = \frac{q(x_{n+1} \mid \hat\theta_{n+1}(x))}{\sum_{y_{n+1}} q(y_{n+1} \mid \hat\theta_{n+1}(y))},$$
where $\hat\theta_{n+1}$ is the MLE.

But the denominator is often infinite!

Idea is to choose a discrete local scoring rule to give a different estimator $\hat\theta$ and a divergence which does not depend on normalization.

Summary

1. Scoring rules have a rich geometric structure still to be explored.
2. Discrete local scoring rules are the counterparts of local scoring rules on continuous outcome space. They do not depend on the normalisation of the model.
3. Current work on Ising models, pseudolikelihood and sequential prediction.