Outline
• Finite state morphology
  – Finite state automata
  – Morphological recognition
  – Two-level morphology
  – Finite state transducers
  – Morphological parsing
• Stochastic n-gram models
  – Stochastic language models
  – Frequency-based n-gram models
  – Sparse data and smoothing
Words
Joakim Nivre
Finite State Automata
A finite state automaton (FSA) is a quintuple ⟨Q, Σ, q0, F, δ⟩ where:
• Q is a finite set of states q0, q1, . . . , qn
• Σ is a finite input alphabet of symbols σ0, σ1, . . . , σm
• q0 ∈ Q is the start state
• F ⊆ Q is the set of final states
• δ ⊆ (Q × Σ) × Q is the transition relation
NB: If δ is a (partial) function from Q × Σ to Q, then the automaton is deterministic; otherwise it is nondeterministic.
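The definition above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the names `accepts`, `is_deterministic`, and `EXAMPLE_DELTA` are hypothetical, and the example automaton (binary strings ending in 1) is an assumed example. The transition relation δ is stored as a mapping from (state, symbol) pairs to sets of target states, so the same code covers the deterministic and the nondeterministic case.

```python
from typing import Dict, Iterable, Set, Tuple

State = str
Symbol = str
# Transition relation delta: (state, symbol) -> set of possible next states.
Delta = Dict[Tuple[State, Symbol], Set[State]]


def accepts(delta: Delta, start: State, finals: Set[State],
            string: Iterable[Symbol]) -> bool:
    """Return True if the automaton accepts the string."""
    current = {start}  # all states reachable after the input read so far
    for symbol in string:
        current = {q2 for q in current
                   for q2 in delta.get((q, symbol), set())}
        if not current:  # no transition is defined: reject early
            return False
    return bool(current & finals)


def is_deterministic(delta: Delta) -> bool:
    """delta is a (partial) function iff every pair has at most one target."""
    return all(len(targets) <= 1 for targets in delta.values())


# Assumed example (not from the slides): binary strings ending in 1.
# Nondeterministic, because (q0, 1) has two possible target states.
EXAMPLE_DELTA: Delta = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},
}

if __name__ == "__main__":
    print(accepts(EXAMPLE_DELTA, "q0", {"q1"}, "0101"))
    print(is_deterministic(EXAMPLE_DELTA))
```

Tracking a set of current states is one simple way to simulate nondeterminism without backtracking; for a deterministic automaton the set never holds more than one state.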
Example
[Figure: state diagrams of a deterministic and a nondeterministic automaton, each with states q0 and q1 and transitions labeled 0 and 1.]
A Bit of Formal Language Theory
• The set of strings accepted by an FSA is the language accepted by the FSA.
• The class of languages that can be recognized by FSAs is known as the class of regular languages.
• Regular languages can also be characterized by
  – Regular expressions
  – Regular grammars (Type 3)
  FSAs, regular expressions, and regular grammars are all equivalent.
• Every finite language is regular.
Morphological Analysis
• Recognition: Determine whether a string of symbols is in the language or not (FSA).
• Parsing: Assign a structural description to strings in the language (FST).
Two-level Morphology
Word forms are represented on a lexical level and a surface level. For example:

Surface    Lexical
cat        cat N SG
cats       cat N PL
cans       can N PL
           can V PRES SG3

Morphological parsing = Mapping from surface to lexical level.
Finite State Transducers
A finite state transducer (FST) is a quintuple ⟨Q, Σ, q0, F, δ⟩ where:
• Q is a finite set of states q0, q1, . . . , qn
• Σ is a finite alphabet of complex symbols of the form i : o, where i is a symbol from an input alphabet I (or ε) and o is a symbol from an output alphabet O (or ε)
• q0 ∈ Q is the start state
• F ⊆ Q is the set of final states
• δ ⊆ (Q × Σ) × Q is the transition relation
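As a rough illustration of morphological parsing with a transducer, the sketch below maps the surface forms "cat" and "cats" to lexical-level symbol sequences. It is not taken from the slides: the arc list `ARCS`, the state names, and the function `transduce` are hypothetical, the empty string stands in for ε, and the search assumes that the transducer has no cycles of ε-input arcs.

```python
from typing import List, Set, Tuple

EPS = ""  # epsilon: the empty input or output symbol

# Each arc is (source state, input symbol i, output symbol o, target state).
Arc = Tuple[str, str, str, str]

# Assumed toy transducer covering only "cat" and "cats".
ARCS: List[Arc] = [
    ("q0", "c", "c", "q1"),
    ("q1", "a", "a", "q2"),
    ("q2", "t", "t", "q3"),
    ("q3", EPS, "N", "q4"),   # no suffix: noun ...
    ("q4", EPS, "SG", "q6"),  # ... singular
    ("q3", "s", "N", "q5"),   # suffix -s: noun ...
    ("q5", EPS, "PL", "q6"),  # ... plural
]
START = "q0"
FINALS = {"q6"}


def transduce(surface: str) -> Set[Tuple[str, ...]]:
    """Return every output sequence paired with the surface string."""
    results: Set[Tuple[str, ...]] = set()
    # Each search item is (state, input position, output emitted so far).
    stack: List[Tuple[str, int, Tuple[str, ...]]] = [(START, 0, ())]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(surface) and state in FINALS:
            results.add(out)
        for src, i, o, tgt in ARCS:
            if src != state:
                continue
            new_out = out + ((o,) if o != EPS else ())
            if i == EPS:  # consume no input
                stack.append((tgt, pos, new_out))
            elif pos < len(surface) and surface[pos] == i:
                stack.append((tgt, pos + 1, new_out))
    return results


if __name__ == "__main__":
    print(transduce("cats"))
```

Returning a set of analyses rather than a single one reflects that a surface form such as "cans" can have several lexical-level descriptions.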
Finite State Transducers
FSTs define regular relations (sets of pairs of strings from I* × O*). Useful operations on FSTs:
• Composition (cf. cascaded transducers).
• Intersection (cf. parallel transducers).
Stochastic Language Models
A stochastic (or probabilistic) language model M is a model that assigns probabilities to strings in a language L. Formally:
• $0 \leq M(x) \leq 1$ (for all $x \in L$)
• $\sum_{x \in L} M(x) = 1$
Stochastic language models are used in many NLP applications:
• Speech recognition
• Machine translation
• Optical character recognition
• Spell checking
N-gram Models
In an n-gram model, each word is assumed to depend only on the n−1 preceding words:
• Bigram model (n = 2):
$$P(w_1 \cdots w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$
• Trigram model (n = 3):
$$P(w_1 \cdots w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2} w_{i-1})$$
Parameter Estimation
Given a corpus sampled from the language to be modeled, n-gram probabilities can be estimated as follows:
$$\hat{P}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$$
$$\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$
where C(x) is the number of times the string x occurs in the corpus.
NB: Estimation of probabilities by (relative) frequencies is a special case of maximum likelihood estimation (MLE).
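The relative-frequency estimate for bigrams can be sketched as below. This is an illustrative sketch under stated assumptions rather than the slides' own code: the function names `bigram_mle` and `sentence_prob`, the boundary markers `<s>` and `</s>`, and the toy corpus are all hypothetical choices.

```python
from collections import Counter
from typing import Callable, List


def bigram_mle(corpus: List[List[str]]) -> Callable[[str, str], float]:
    """Estimate P(word | prev) as C(prev word) / C(prev)."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        # Assumed convention: pad each sentence with boundary markers.
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(word: str, prev: str) -> float:
        if unigrams[prev] == 0:  # unseen context: no estimate available
            return 0.0
        return bigrams[(prev, word)] / unigrams[prev]

    return prob


def sentence_prob(prob: Callable[[str, str], float],
                  sentence: List[str]) -> float:
    """Bigram model: product of P(w_i | w_{i-1}) over the padded sentence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(word, prev)
    return p


if __name__ == "__main__":
    toy_corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
    p = bigram_mle(toy_corpus)
    print(p("cat", "the"))
    print(sentence_prob(p, ["the", "cat", "barks"]))
```

In the toy corpus, any sentence containing an unseen bigram receives probability zero, which is the sparse data problem that smoothing addresses.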
Sparse Data and Smoothing
A problem with the simple MLE approach is that many probabilities will be estimated as zero because of sparse data. Different ways of avoiding zero probabilities (and making the estimates for rare n-grams more reliable) are known as smoothing (or discounting) methods:
• Additive smoothing
• Good-Turing estimation
• Backoff smoothing
• Deleted interpolation
Smoothing
Maximum likelihood estimation:
$$\hat{P}(x) = \frac{f(x)}{N}$$
Additive smoothing:
$$\hat{P}(x) = \frac{f(x) + k}{N + k N_X}$$
Good-Turing:
$$\hat{P}(x) = \frac{f^*(x)}{N}, \qquad f^*(x) = (f(x) + 1)\,\frac{E(N_{f(x)+1})}{E(N_{f(x)})}$$
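The three estimators can be compared on a small count table, as in the sketch below. Several things here are assumptions not stated in the slides: N is taken to be the total number of observed tokens, N_X the number of possible event types (passed as `num_types`), and the expectations E(N_f) are crudely replaced by the observed frequencies of frequencies. The function names are hypothetical, and the fallback to the raw count when a frequency class is empty is an illustrative choice only.

```python
from collections import Counter


def mle(counts: Counter, x: str) -> float:
    """Relative frequency f(x) / N."""
    n = sum(counts.values())
    return counts[x] / n


def additive(counts: Counter, x: str, k: float, num_types: int) -> float:
    """Additive smoothing: (f(x) + k) / (N + k * N_X)."""
    n = sum(counts.values())
    return (counts[x] + k) / (n + k * num_types)


def good_turing_count(counts: Counter, x: str) -> float:
    """Adjusted count f*(x) = (f + 1) * N_{f+1} / N_f.

    Simplification: observed N_f stands in for E(N_f). If either
    frequency class is empty, the raw count is returned unchanged.
    """
    freq_of_freq = Counter(counts.values())  # N_f: number of types seen f times
    f = counts[x]
    if freq_of_freq[f] == 0 or freq_of_freq[f + 1] == 0:
        return float(f)
    return (f + 1) * freq_of_freq[f + 1] / freq_of_freq[f]


if __name__ == "__main__":
    toy_counts = Counter({"a": 3, "b": 2, "c": 1, "d": 1})
    print(mle(toy_counts, "e"))
    print(additive(toy_counts, "e", k=1, num_types=5))
    print(good_turing_count(toy_counts, "c"))
```

With k = 1 and five possible types, the unseen event "e" moves from probability zero under MLE to a small positive value, while the smoothed estimates over all five types still sum to one.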
Smoothing
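Linear interpolation:
$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \frac{f(w_i)}{N} + \lambda_2 \frac{f(w_{i-1}, w_i)}{f(w_{i-1})} + \lambda_3 \frac{f(w_{i-2}, w_{i-1}, w_i)}{f(w_{i-2}, w_{i-1})} \qquad (\lambda_1 + \lambda_2 + \lambda_3 = 1)$$
Back-off:
$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} (1 - \delta_{w_{i-2}, w_{i-1}})\,\dfrac{f(w_{i-2}, w_{i-1}, w_i)}{f(w_{i-2}, w_{i-1})} & \text{if } f(w_{i-2}, w_{i-1}, w_i) > 0 \\[2ex] \alpha_{w_{i-2}, w_{i-1}}\,\hat{P}(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$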
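Linear interpolation of unigram, bigram, and trigram relative frequencies can be sketched as follows. The function name `interpolated_trigram`, the toy token sequence, and the weights (0.2, 0.3, 0.5) are hypothetical; in practice the λ values would be tuned on held-out data, which this sketch does not attempt. A term whose denominator is zero is assumed to contribute nothing.

```python
from collections import Counter
from typing import Callable, List, Tuple


def interpolated_trigram(
    tokens: List[str], lambdas: Tuple[float, float, float]
) -> Callable[[str, str, str], float]:
    """Build P(w | u, v) by mixing unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "the weights must sum to 1"

    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

    def ratio(num: int, den: int) -> float:
        # Assumed convention: an undefined ratio contributes 0.
        return num / den if den else 0.0

    def prob(w: str, u: str, v: str) -> float:
        # u = w_{i-2}, v = w_{i-1}, w = w_i
        return (l1 * ratio(uni[w], n)
                + l2 * ratio(bi[(v, w)], uni[v])
                + l3 * ratio(tri[(u, v, w)], bi[(u, v)]))

    return prob


if __name__ == "__main__":
    toy_tokens = "a b a b a c".split()
    p = interpolated_trigram(toy_tokens, (0.2, 0.3, 0.5))
    print(p("c", "b", "a"))  # seen trigram "b a c"
    print(p("c", "a", "b"))  # unseen trigram "a b c"
```

The unseen trigram still gets a nonzero probability because the unigram term is positive, which is the point of mixing in lower-order estimates.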