Embed
Email

NLP

Document Sample

Shared by: linzhengnd
Categories
Tags
Stats
views:
5
posted:
11/8/2011
language:
English
pages:
85
Chapter 5: POS Tagging



Heshaam Faili

hfaili@ut.ac.ir

University of Tehran

Part of Speech tag

 POS, word classes, morphological classes,

lexical tags

 Verb Vs Noun

 noun, verb, pronoun, preposition, adverb,

Conjunction, participle, and article

 possessive pronouns Vs personal pronouns

 Useful for pronunciation: conTENT, CONtent

 Useful for stemming for information retrieval

 Useful for Word-sense-disambiguation

 Useful for shallow parsing to detect name, date,

place, …

2

POS tags

 Based on syntactic and morphological function

 Divided into

 closed class types: pronoun, preposition, determiner(articles),

conjunction, auxiliary verbs, particles, numerical

 open class types: nouns, verbs, adjectives, adverb

 Proper noun, common noun

 Count noun, mass noun

 Main verb, auxiliary verb,

 Directional(locative) adverbs, degree adverb, manner adverb,

temporal adverbs

 particle, phrasal verb







3

Statistics

 the: 1,071,676 a: 413,887 an: 59,359









4

Statistics









5

English language

 Pronoun: personal, possessive, WH-pronoun

 Auxiliary verb: modal verb (copula)









6

7

Example English Part-of-

Speech Tag sets

 Brown corpus - 87 tags

 Allows compound tags

 “I'm” tagged as PPSS+BEM

 PPSS for "non-3rd person nominative personal pronoun" and BEM for

"am, 'm“

 Others have derived their work from Brown Corpus

 LOB Corpus: 135 tags

 Lancaster UCREL Group: 165 tags

 London-Lund Corpus: 197 tags.

 BNC – 61 tags (C5)

 PTB – 45 tags

 Other languages have developed other tagsets





8

PTB Tagset (36 main tags + punctuation tags)









9

Some examples from Brown

 The/DT grand/JJ jury/NN commented/VBD on/IN a/DT

number/NN of/IN other/JJ topics/NNS ./.

 There/EX are/VBP 70/CD children/NNS there/RB

 Although/IN preliminary/JJ findings/NNS were/VBD

reported/VBN more/RBR than/IN a/DT year/NN ago/IN ,/,

the/DT latest/JJS results/NNS appear/VBP in/IN today/NN

’s/POS New/NNP England/NNP Journal/NNP of/IN

Medicine/NNP ,/,



 Difficulty in tagging „around‟‟: particle, preposition, adverb

 Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO

joining/VBG

 All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT

corner/NN

 Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD





10

POS Tagging Problem

 Given a sentence W1…Wn and a tag set of lexical categories,

find the most likely tag C1…Cn for each word in the sentence

 Note that many of the words may have unambiguous tags

 Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN

the/DT race/NN for/IN outer/JJ space/NN

 But enough words are either ambiguous or unknown that it‟s a

nontrivial task

 Tokenization: punctuation marks (period, comma, etc) be

separated off of the words.







11

More details of the problem

 How ambiguous?

 Most words in English have only one Brown Corpus tag

 Unambiguous (1 tag) 35,340 word types

 Ambiguous (2- 7 tags) 4,100 word types = 11.5%

 7 tags: 1 word type “still”

 But many of the most common words are ambiguous

 Over 40% of Brown corpus tokens are ambiguous

 Obvious strategies may be suggested based on intuition

 to/TO race/VB

 the/DT race/NN

 will/MD race/VB

 This leads to hand-crafted rule-based POS tagging (J&M, 8.4)

 Sentences can also contain unknown words for which tags have

to be guessed: Secretariat/NNP is/VBZ

12

POS tagging









13

POS tagging methods

 First, we‟ll look at how POS-annotated corpora are

constructed

 Then, we‟ll narrow in on various POS tagging methods,

which rely on POS-annotated corpora

 Supervised learning: use POS-annotated corpora as training

material

 HMMs, TBL, Maximum Entropy, MBL

 Unsupervised learning: use training material which is

unannotated

 HMMs, TBL





 Two main categories for tagging: rule-based,

stochastic

14

Constructing POS-annotated

corpora

 By examining how POS-annotated corpora

are created, we will understand the task even

better

 Basic steps

 1. Tokenize the corpus

 2. Design a tagset, along with guidelines

 3. Apply the tagset to the text

 Knowing what issues the corpus annotators

had to deal with in hand-tagging tells us the

kind of issues we‟ll have to deal with in

automatic tagging 15

1. Tokenize the corpus

 The corpus is first segmented into sentences and

then each sentence is tokenized into unique tokens

(words)

 Punctuation usually split off

 Figuring out sentence-ending periods vs. abbreviators

periods is not always easy, etc.

 Can become tricky:

 Proper nouns and other multi-word units: one token or

separate tokens? e.g., Jimmy James, in spite of

 Contractions often split into separate words: can‟t  can n‟t

(because these are two different POSs)

 Hyphenations sometimes split, sometimes kept together



16

Some other preprocessing

issues

 Should other word form information be included?

 Lemma: the base, or root, form of a word. So, cat and cats

have the same lemma, cat.

 Sometimes, words are lemmatized by stemming, other times

by morphological analysis, using a dictionary &

morphological rules

 Sometimes it makes a big difference – roughly vs. rough

 Fold case or not (usually folded)?

 The the THE Mark versus mark

 One may need, however, to regenerate the original case

when presenting it to the user



17

2. Design the Tagset

 It is not obvious what the parts of speech are in any

language, so a tagset has to be designed.

 Criteria:

 External: what do we need to capture the linguistic

properties of the text?

 Internal: what distinctions can be made reliably in a text?

 Some tagsets are compositional: each character in

the tag has a particular meaning: e.g., Vr3s-f = verb-

present-3rd-singular--female

 What do you do with multi-token words, e.g. in terms

of?

18

Tagging Guidelines

 The tagset should be clearly documented and explained,

with a “case law” of what to do in problematic cases

 If you don‟t do this, you will have problems in applying

the tagset to the text

 If you have good documentation, you can have high

interannotator agreement and a corpus of high quality.









19

3. Apply Tagset to Text: PTB

Tagging Process

 1. Tagset developed

 2. Automatic tagging by rule-based and statistical POS taggers,

error rates of 2-6%.

 3. Human correction using a text editor

 Takes under a month for humans to learn this (at 15 hours a

week), and annotation speeds after a month exceed 3,000

words/hour

 Inter-annotator disagreement (4 annotators, eight 2000-word

docs) was 7.2% for the tagging task and 4.1% for the correcting

task









20

POS Tagging Methods

 Two basic ideas to build from:

 Establishing a simple baseline with unigrams

 Hand-coded rules

 Machine learning techniques:

 Supervised learning techniques

 Unsupervised learning techniques









21

A Simple Strategy for POS

Tagging

 Choose the most likely tag for each ambiguous word,

independent of previous words

 i.e., assign each token the POS category it occurred as most

often in the training set

 E.g., race – which POS is more likely in a corpus?

 This strategy gives you 90% accuracy in controlled

tests

 So, this “unigram baseline” must always be compared

against









22

Example of the Simple

Strategy

 Which POS is more likely in a corpus (1,273,000 tokens)?

NN VB Total

race 400 600 1000





 P(NN|race) = P(race&NN) / P(race) by the definition of

conditional probability

 P(race)  1000/1,273,000 = .0008

 P(race&NN)  400/1,273,000 =.0003

 P(race&VB)  600/1,273,000 = .0005





 And so we obtain:

 P(NN|race) = P(race&NN)/P(race) = .0003/.0008 =.4

 P(VB|race) = P(race&VB)/P(race) = .0004/.0008 = .6 23

Hand-coded rules

 Two-stage system:

 Dictionary assigns all possible tags to a word

 Rules winnow down the list to a single tag (or sometimes,

multiple tags are left, if it cannot be determined)

 Can also use some probabilistic information

 These systems can be highly effective, but they of

course take time to write the rules.

 We‟ll see an example later of trying to automatically learn

the rules (transformation-based learning)









24

Hand-coded Rules: ENGTWOL

System

 Based on two-level morphology

 Uses 56,000-word lexicon which lists parts-of-speech

for each word

 Each entry is annotated with a set of morphological

and syntactic features









25

ENGTWOL Lexicon









26

ENGTWOL tagger

 First Stage return all possible POS

 Input: “Pavlov had shown that salivation”

 Output:

 Pavlov PAVLOV N NOM SG PROPER

 had HAVE V PAST VFIN SVO

HAVE PCP2 SVO

 shown SHOW PCP2 SVOO SVO SV

 that ADV

PRON DEM SG

DET CENTRAL DEM SG

CS

 salivation N NOM SG





27

ENGTWOL tagger

 Large set of constraints: 3744 constraints









28

Machine Learning

 Machines can learn from examples

 Learning can be supervised or unsupervised

 Given training data, machines analyze the data, and

learn rules which generalize to new examples

 Can be sub-symbolic (rule may be a mathematical function)

e.g., neural nets

 Or it can be symbolic (rules are in a representation that is

similar to representation used for hand-coded rules)

 In general, machine learning approaches allow for

more tuning to the needs of a corpus, and can be

reused across corpora



29

Ways of learning

 Supervised learning:

 A machine learner learns the patterns found in an annotated

corpus

 Unsupervised learning:

 A machine learner learns the patterns found in an

unannotated corpus

 Often uses another database of knowledge, e.g., a

dictionary of possible tags

 Techniques used in supervised learning can be

adapted to unsupervised learning, as we will see.





30

HMMs

 Using probabilities in tagging first used on 1965 but

completed o 1980s (Marshall, 1983; Garside, 1987;

Church, 1988; DeRose, 1988(

 HMM is a special case of Bayesian inference









31

Transition/word likelihood

probabilities

 P(NN|DT) and P(JJ|DT) is high while P(DT|JJ) is low









32

HMMs: A Probabilistic

Approach

 What you want to do is find  Example: He will race

the “best sequence” of POS  Possible sequences:

tags T=T1..Tn for a sentence

W=W1..Wn.  He/PP will/MD race/NN

 (Here T1 is pos_tag(W1)).  He/PP will/NN race/NN

find a sequence of POS tags T  He/PP will/MD race/VB

that maximizes P(T|W) likelihood

Word  He/PP will/NN race/VB

 Using Bayes‟ Rule, we can  W = W1 W2 W3

say

= He will race

P(T|W) = P(W|T)*P(T)/P(W)

 T = T1 T 2 T 3

 We want to find the value of

T which maximizes the RHS  Choices:

 denominator can be

transition

 T= PP MD NN

discarded (it‟s the same for

every T)  T= PP NN NN

 T = PP MD VB

 Find T which maximizes  T = PP NN VB

33

P(W|T) * P(T)

N-gram Models

 POS problem formulation

 Given a sequence of words, find a sequence of categories

that maximizes P(T1…Tn| W1…Wn)

 i.e., that maximizes P(W1…Wn | T1…Tn) * P(T1…Tn) (by

Bayes‟ Rule)

 Chain Rule of probability:

P(W|T) = i=1, n P(Wi|W1…Wi-1T1…Ti)

prob. of this word based on previous words & tags

P(T) = i=1, n P(Ti|W1…WiT1…Ti-1)

prob. of this tag based on previous words & tags

 But we don‟t have sufficient data for this, and we

would likely overfit the data, so we make some

34

assumptions to simplify the problem …

Independence Assumptions

 Assume that current event is based only on previous

n-1 events (i.e., for bigram model, based on previous

event)

 P(T1…Tn)  i=1, n P(Ti| Ti-1)

 assumes that the event of a POS tag occurring is

independent of the event of any other POS tag occurring,

except for the immediately previous POS tag

 From a linguistic standpoint, this seems an unreasonable

assumption, due to long-distance dependencies

 P(W1…Wn | T1…Tn)  i=1, n P(Wi| Ti)

 assumes that the event of a word appearing in a category is

independent of the event of any surrounding word or tag,

except for the tag at this position.

35

Independence Assumptions

 P(Ti| Ti-1) represent the probability of a tag given the

previous tag.

 determiners are very likely to precede adjectives and nouns,

as in sequences like that/DT flight/NN and the/DT yellow/JJ

hat/NN.

 P(NN|DT) and P(JJ|DT) are high but P(DT|JJ) is low









36

Hidden Markov Models

 Linguists know both these assumptions are incorrect!

 But, nevertheless, statistical approaches based on

these assumptions work pretty well for part-of-

speech tagging

 In particular, one called a Hidden Markov Model

(HMM)

 Very widely used in both POS-tagging and speech

recognition, among other problems

 A Markov model, or Markov chain, is just a weighted Finite

State Automaton







37

HMMs = Weighted FSAs

 Nodes (or states): tag options

 Edges: Each edge from a node has a (transition) probability to

another state

 The sum of transition probabilities of all edges going out from a

node is 1

 The probability of a path is the product of the probabilities of

the edges along the path.

 Why multiply? Because of the assumption of independence (Markov

assumption).

 The probability of a string is the sum of the probabilities of all

the paths for the string.

 Why add? Because we are considering the union of all

interpretations.

 A string that isn‟t accepted has probability zero.

38

example









39

HMM

 Weighted finite-state automaton

 Markov Chain: a special case of a weighted

automaton in which the input sequence uniquely

determines which states the automaton will go

through

 Use only observed events : in POS tagging problem: only

words

 It‟s not useful for POS-tagging: because of ambiguity

 Hidden Markov Model: use both

 observed events ( words that we see in sentence)

 Hidden events (part-of-speech tags)







40

Hidden Markov Models

 Why the word “hidden”?

 Called “hidden” because one cannot tell what state

the model is in for a given sequence of words.

 e.g., will could be generated from NN state, or from MD

state, with different probabilities.

 That is, looking at the words will not tell you directly

what tag state you‟re in.









41

POS Tagging Based on

Bigrams

 Problem: Find T which maximizes P(W | T) * P(T)

 Here W=W1…Wn and T=T1…Tn

 Using the bigram model, we get:

 Transition probabilities (prob. of transitioning from one

state/tag to another):

 P(T1…Tn)  i=1, n P(Ti|Ti-1)

 Emission probabilities (prob. of emitting a word at a given

state):

 P(W1…Wn | T1…Tn)  i=1, n P(Wi| Ti)

 So, we want to find the value of T1…Tn which

maximizes:

i=1, n P(Wi| Ti) * P(Ti| Ti-1)

42

Using POS bigram

probabilities: transitions

MD .4

 P(T1….Tn)  i=1, n P(Ti|Ti-1) .8

NN

 Example: He will race 1

 PRP

 Choices for T=T1..T3

 T= PRP MD NN .3 .6

.2

 T= PRP NN VB NN

.7

 T = PRP MD VB VB

 T = PRP NN NN

 POS bigram probs from

C|R MD NN VB PRP

training corpus can be used for

MD .4 .6

P(T) POS

bigram NN .3 .7

 P(PRP-MD-NN)=1*.8*.4 =.32 probs

PRP .8 .2

 1



43

Factoring in lexical generation

probabilities

 From the training corpus, we need to find the Ti which maximizes

i=1, n P(Wi| Ti) * P(Ti| Ti-1)

 So, we‟ll need to factor the lexical generation (emission)

probabilities, somehow:





C E

MD .4

NN

.8

MD NN VB PRP

1

 PRP he 0 0 0 1



.3

+ will .8 .2 0 0

A B .2

.6 F race 0 .4 .6 0

NN

.7 VB lexical generation probs

D

44

Adding emission probabilities

MD NN VB PRP

he 0 0 0 1

will|MD

.8 will .8 .2 0 0

.4 race|NN race 0 .4 .6 0

.8 .4

lexical generation probs

| 1 he|PRP

.3

.3 .6 C|R MD NN VB PRP

.2 will|NN MD .4 .6

.2 .7 race|VB

.6 NN .3 .7

PRP .8 .2

 1



pos bigram probs

Like a weighted FSA, except that there are output

probabilities associated with each node.

45

HMM formulation









46

HMM formulation









47

Emission &Transition probability









48

Dynamic Programming

 In order to find the most likely sequence of

categories for a sequence of words, we don’t need to

enumerate all possible sequences of categories.

 Because of the Markov assumption, if you keep track

of the most likely sequence found so far for each

possible ending category, you can ignore all the other

less likely sequences.

 i.e., multiple edges coming into a state, but only keep the

value of the most likely path

 This is another use of dynamic programming

 The algorithm to do this is called the Viterbi

algorithm.





49

The Viterbi algorithm

 Assume we‟re at state I in the HMM

 Obtain

 the probability of each previous state H1…Hm

 the transition probabilities of H1-I, …, Hm-I

 the emission probability for word w at I

 Multiple the probabilities for each new path:

 P(H1,I) = Score(H1)*P(I|H1)*P(w|I)

 …

 One of these states (H1…Hm) will give the highest

probability

 Only keep the highest probability when using I for the next

state

50

51

Finding the best path through an HMM

C

will|MD E

.8

.4 race|NN

.4

A .8 MD NN VB PP

1 he|PP he 0 0 0 .3

| .3

lex(B) will .8 .2 0 0

.3 .6 F race 0 .4 .6 0

B .2 will|NN

.2 .7 race|VB

.6

lexical generation probs

D

 Score(I) = Max HVB

 2. VBP VB PREV1OR20R3TAG MD

 might/MD vanish/VBP-> VB

 3. NN VB PREV1OR2TAG MD

 might/MD not/MD reply/NN -> VB

 4. VB NN PREV1OR2TAG DT

 the/DT great/JJ feast/VB->NN

 5. VBD VBN PREV1OR20R3TAG VBZ

 He/PP was/VBZ killed/VBD->VBN by/IN Chapman/NNP







63

Handling Unknown Words

Example Learned Rule Sequence

 Can also use the Brill method for Unknown Words

to learn how to tag unknown

words

 Instead of using surrounding

words and tags, use affix

info, capitalization, etc.

 Guess NNP if capitalized, NN

otherwise.

 Or use the tag most

common for words ending

in the last 3 letters.

 etc.

 TBL has also been applied to

some parsing tasks 64

Insights on TBL

 TBL takes a long time to train, but is relatively fast at

tagging once the rules are learned

 The rules in the sequence may be decomposed into

non-interacting subsets, i.e., only focus on VB

tagging (need to only look at rules which affect it)

 In cases where the data is sparse, the initial guess

needs to be weak enough to allow for learning

 Rules become increasingly specific as you go down

the sequence.

 However, the more specific rules don‟t overfit because they

cover just a few cases



65

Tag Indeterminacy

 1. Somehow replace the indeterminate tags with only

one tag.

 2. In testing, count a tagger as having correctly

tagged an indeterminate token if it gives either of the

correct tags. In training, somehow choose only one

of the tags for the word.

 3. Treat the indeterminate tag as a single complex

tag









66

Tokenization

 Word tokens

 Period detection

 Word splitting

 would/MD n‟t/RB

 multi-part words

 New York









67

Unknown words

 pretend that each unknown word is ambiguous

among all possible tags, with equal probability

 contextual POS-trigrams to suggest the proper tag

 the probability distribution of tags over unknown

words is very similar to the distribution of tags over

words that occurred only once in a training set

 hapax legomena: words occur once

 Good-Turing smoothing algorithm

 Unknown words and hapax legomena are similar in that they

are both most likely to be nouns followed by verbs, but are

very unlikely to be determiners or interjections







68

Unknown words

 Use more information: morphology

 Ended by –s  plural noun

 Ended by –ed  past or past participle verb

 Ended by –able adjective

 Use Orthographic information: word start with Capital letter are

proper nouns

 3 inflectional endings (-ed, -s, -ing), 32 derivational endings

(such as -ion, -al, -ive, and -ly), 4 values of capitalization

depending on whether a word is sentence-initial (+/-

capitalization, +/- initial) and whether the word was hyphenated

 P(wi|ti) = P(unknown-word|ti) ∗ P(capital|ti) (5.50) ∗

P(endings/hyph|ti)







69

Unknown words

 HMM-based approach: P(ti|ln−i+1 . . . ln)

 Train on open classes

 Only to word less occuring < 10

 Use Bayesian to calculate P(wi|ti )









70

3. Maximum Entropy

 Maximum entropy: model what you know as best you

can, remain uncertain about the rest

 Set up some features, similar to Brill‟s templates

 When tagging, the number of features which are true

determines the probability of a tag in a particular

context

 Make your probability model match the training data as well

as it can, i.e., if an example matches features we‟ve seen,

we know exactly what to do

 But the probability model also maximizes the entropy =

uncertainty … so, we are noncommittal to anything not seen

in the training data

 E.g. MXPOST 71

4. Decision Tree Tagging

 Each tagging decision can be viewed as a series of

questions

 Each question should make a maximal division in the data,

i.e., ask a question which gives you the most information at

that point

 Different metrics can be used here

 Each question becomes more specific as you go down the

tree

 Training consists of constructing a decision tree

 Tagging consists of using the decision tree to give us

transition probabilities … and the rest plays out like

an HMM

72

Decision Tree example

A tree for nouns:

Tag-1 = JJ?

YES NO

Tag-2 = DT? …

YES NO



NN = .85 …

JJ = .14





73

Other techniques

 Neural networks

 Conditional random fields

 A network of linear functions



 Basically, whatever techniques are being used in

machine learning are appropriate techniques to try

with POS tagging









74

Unsupervised learning

 Unsupervised learning:

 Use an unannotated corpus for training data

 Instead, will have to use another database of knowledge,

such as a dictionary of possible tags

 Unsupervised learning use the same general

techniques as supervised, but there are differences

 Advantage is that there is more unannotated data to

learn from

 And annotated data isn‟t always available









75

Unsupervised Learning #1:

HMMs

 We still want to use transition (P(ti|ti-1)) and emission

(P(w|t)) probabilities, but how do we obtain them?

 During training, the values of the states (i.e., the tags)

are unknown, or hidden

 Start by using a dictionary, to at least estimate the

emission probability

 Using the forward-backward algorithm, iteratively derive

better probability estimates

 The best tag may change with each iteration

 Stop when probability of the sequence of words has

been (locally) maximized

76

Forward-Backward Algorithm

 The Forward-Backward (or Baum-Welch) Algorithm

for POS tagging works roughly as follows:

 1. Initialize all parameters (forward and backward probability

estimates)

 2. Calculate new probabilities

 A. Re-estimate the probability for a given word in the sentence

from the forward and backward probabilities

 B. Use that to re-estimate the forward and backward

parameters

 3. Terminate when a local maximum for the string is

reached

 NB: The initial probabilities for words are based on

dictionaries … which you probably need tagged text

to obtain 77

Unsupervised Learning #2:

TBL

 With TBL, we want to learn rules of patterns, but

how can we learn the rules if there‟s no annotated

data?



 Main idea: look at the distribution of unambiguous

words to guide the disambiguation of ambiguous

words



 Example: the can, where can can be a noun, modal,

or verb

 Let‟s take unambiguous words from dictionary and

count their occurrences after the

 the elephant

 the guardian

 Conclusion: immediately after the, nouns are more

common than verbs or modals 78

Unsupervised TBL

 Initial state annotator

 Supervised: assign random tag to each word

 Unsupervised: for each word, list all tags in dictionary



 The templates change accordingly …

 Transformation template:

Change tag  of word to tag Y if the previous (next) tag

(word) is Z, where  is a set of 1 or more tags









 Don‟t change any other tags









79

Error Reduction in

Unsupervised Method

 Let a rule to change  to Y in context C be

represented as Rule(, Y, C).

 Rule1: {VB, MD, NN} NN PREVWORD the

 Rule2: {VB, MD, NN} VB PREVWORD the

 Idea:

 since annotated data isn‟t available, score rules so as to

prefer those where Y appears much more frequently in the

context C than all others in 

 frequency is measured by counting unambiguously tagged

words

 so, prefer {VB, MD, NN} NN PREVWORD the

to {VB, MD, NN} VB PREVWORD the

since dict-unambiguous nouns are more common in a corpus

after the than dict-unambiguous verbs 80

Using weakly supervised

methods

 Can mix supervised and unsupervised to improve

performance. For the TBL tagging:

 Use the unsupervised method for the initial state tagger

 Use the supervised method to adjust the tags

 Advantage of using weakly supervised over just

supervised:

 Lots of training data available for unsupervised part

 Less training data available for supervised part









81

Voting Taggers

 A more recent technique which seems to improve

performance is to set up a system of tagger voting.

 Train multiple taggers on a corpus and learn not only about

individual performance, but how they interact

 Can vote in different ways

 Each tagger gets a fraction of the vote

 For certain tagging decisions, one tagger is always right

 Works best when the taggers are looking at different,

or complementary, information

 Intuition: if taggers A, B, and C are all looking at different

info and two of them agree, that‟s better than the one that

disagrees

82

Ambiguity classes

 Another way that POS tagging methods have been

improved is by grouping words together in classes

 Words are grouped together if they have the same set of

possible tags

 Can calculate new probabilities based on these

classes

 This can help if a word occurs very rarely but other

members in its class are more frequent









83

Summary: POS tagging

 Part-of-speech tagging can use

 Hand-crafted rules based on inspecting a corpus

 Machine Learning-based approaches based on corpus statistics

 Machine Learning-based approaches using rules derived

automatically from a corpus

 And almost any other Machine Learning technique you can think of

 When Machine Learning is used, it can be

 Supervised (using an annotated corpus)

 Unsupervised (using an unannotated corpus)

 Combinations of different methods often improve performance









84

Practice

 5.1, 5.2

 Small project: 5.8 on Persian (delivery on 30 Azar)









85



Related docs
Other docs by linzhengnd
option strategy excel spreadsheet
Views: 3  |  Downloads: 0
Tips on Effective Listening
Views: 0  |  Downloads: 0
TO DOWNLOAD TEXT - Repairing The Breach
Views: 0  |  Downloads: 0
Power-Up Tested - Access Mobile
Views: 4  |  Downloads: 0
6502 Sell stone monuments and memorials
Views: 0  |  Downloads: 0
Sheet1 - Atlanta International School
Views: 2  |  Downloads: 0
AFRICAN UNION
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!