
Confidence Estimation for Information Extraction



Aron Culotta and Andrew McCallum
Department of Computer Science
University of Massachusetts
Amherst, MA 01002

                        Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.

1   Introduction

Information extraction usually consists of tagging a sequence of words (e.g. a Web document) with semantic labels (e.g. PERSONNAME, PHONENUMBER) and depositing these extracted fields into a database. Because automated information extraction will never be perfectly accurate, it is helpful to have an effective measure of the confidence that the proposed database entries are correct. There are at least three important applications of accurate confidence estimation.

First, accuracy-coverage trade-offs are a common way to improve data integrity in databases. Efficiently making these trade-offs requires an accurate prediction of correctness.

Second, confidence estimates are essential for interactive information extraction, in which users may correct incorrectly extracted fields. These corrections are then automatically propagated in order to correct other mistakes in the same record. Directing the user to the least confident field allows the system to improve its performance with a minimal amount of user effort. Kristjannson et al. (2004) show that using accurate confidence estimation reduces error rates by 46%.

Third, confidence estimates can improve the performance of data mining algorithms that depend upon databases created by information extraction systems (McCallum and Jensen, 2003). Confidence estimates provide data mining applications with a richer set of "bottom-up" hypotheses, resulting in more accurate inferences. An example of this occurs in the task of citation co-reference resolution. An information extraction system labels each field of a paper citation (e.g. AUTHOR, TITLE), and then co-reference resolution merges disparate references to the same paper. Attaching a confidence value to each field allows the system to examine alternate labelings for less confident fields to improve performance.

Sound probabilistic extraction models are most conducive to accurate confidence estimation because of their intelligent handling of uncertainty information. In this work we use conditional random fields (Lafferty et al., 2001), a type of undirected graphical model, to automatically label fields of contact records. Here, a record is an entire block of a person's contact information, and a field is one element of that record (e.g. COMPANYNAME). We implement several techniques to estimate both field confidence and record confidence, obtaining an average precision of 98% for fields and 87% for records.

2   Conditional Random Fields

Conditional random fields (Lafferty et al., 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes. In the special case in which the designated output nodes are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models, with additional flexibility to effectively take advantage of complex overlapping features.

Let o = \langle o_1, o_2, \ldots, o_T \rangle be some observed input data sequence, such as a sequence of words in a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label (such as COMPANYNAME). Let s = \langle s_1, s_2, \ldots, s_T \rangle be some sequence of states (the values on T output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

    p_\Lambda(s|o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right)    (1)

where Z_o is a normalization factor over all state sequences, f_k(s_{t-1}, s_t, o, t) is an arbitrary feature function over its arguments, and \lambda_k is a learned weight for each feature function. Z_o is efficiently calculated using dynamic programming. Inference (very much like the Viterbi algorithm in this case) is also a matter of dynamic programming. Maximum a posteriori training of these models is efficiently performed by hill-climbing methods such as conjugate gradient, or its improved second-order cousin, limited-memory BFGS.

3   Field Confidence Estimation

The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.

Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as s^* = \arg\max_s p_\Lambda(s|o). To avoid an exponential-time search over all possible settings of s, Viterbi stores the probability of the most likely path at time t that accounts for the first t observations and ends in state s_i. Following traditional notation, we define this probability to be \delta_t(s_i), where \delta_0(s_i) is the probability of starting in each state s_i, and the recursive formula is

    \delta_{t+1}(s_i) = \max_{s'} \left[ \delta_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) \right]    (2)

terminating in p^* = \max_{s_i} [\delta_T(s_i)]. We can backtrack through the dynamic programming table to recover s^*.

The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" \alpha_{t+1}(s_i) are defined recursively as in Eq. 2, except the max is replaced by a summation. Thus we have

    \alpha_{t+1}(s_i) = \sum_{s'} \alpha_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right)    (3)

terminating in Z_o = \sum_i \alpha_T(s_i) from Eq. 1.

To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints C = \langle s_t, s_{t+1}, \ldots \rangle. Here, s_t \in C can be either a positive constraint (the sequence must pass through s_t) or a negative constraint (the sequence must not pass through s_t).

In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use the state names B-JOBTITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence \langle o_2, \ldots, o_5 \rangle as a JOBTITLE field, then C = \langle s_2 = B-JOBTITLE, s_3 = \ldots = s_5 = I-JOBTITLE, s_6 \neq I-JOBTITLE \rangle.

The calculation of the forward values can be made to conform to C by the recursion

    \alpha'_{t+1}(s_i) =
      \begin{cases}
        \sum_{s'} \alpha'_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) & \text{if } s_i \simeq s_{t+1} \\
        0 & \text{otherwise}
      \end{cases}

for all s_{t+1} \in C, where the operator s_i \simeq s_{t+1} means that s_i conforms to constraint s_{t+1}. For time steps not constrained by C, Eq. 3 is used instead.

If \alpha'_{t+1}(s_i) is the constrained forward value, then Z'_o = \sum_i \alpha'_T(s_i) is the value of the constrained lattice, the set of all paths that conform to C. Our confidence estimate is obtained by normalizing Z'_o using Z_o, i.e. Z'_o / Z_o.

We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let \gamma_t(s_i) = p(s_t = s_i \mid o_1, \ldots, o_T) be the probability of being in state s_i at time t given the observation sequence. We define the confidence measure GAMMA to be \prod_{i=u}^{v} \gamma_i(s_i), where u and v are the start and end indices of the extracted field.
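As a concrete illustration, the constrained forward recursion can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the authors' implementation: the transition scores stand in for the already-exponentiated feature sums exp(\sum_k \lambda_k f_k(\cdot)), and the two-state FSM, the potentials, and the function names are invented for the example. Both positive and negative constraints are expressed here as the set of states allowed at a time step.

```python
def constrained_forward(trans, init, T, n_states, constraints=None):
    """Sum the scores of all length-T state paths through the lattice.

    init[j]     : score of starting in state j at t=0 (alpha_0).
    trans[t][i][j]: score of moving from state i at t-1 to state j at t,
                    standing in for exp(sum_k lambda_k f_k(s_i, s_j, o, t)).
    constraints : optional dict {t: set_of_allowed_states}; a negative
                  constraint is encoded by omitting the forbidden state
                  from the allowed set.  Time steps not listed are
                  unconstrained, as in Eq. 3.
    Returns Z_o (unconstrained) or Z'_o (constrained).
    """
    constraints = constraints or {}

    def ok(t, j):
        return j in constraints[t] if t in constraints else True

    # alpha[j] holds the forward value alpha_t(s_j); disallowed states get 0.
    alpha = [init[j] if ok(0, j) else 0.0 for j in range(n_states)]
    for t in range(1, T):
        alpha = [
            sum(alpha[i] * trans[t][i][j] for i in range(n_states))
            if ok(t, j) else 0.0
            for j in range(n_states)
        ]
    return sum(alpha)


# Toy 2-state example with made-up potentials.  The CFB confidence that
# time step 1 carries state 1 is the ratio Z'_o / Z_o.
init = [1.0, 2.0]
trans = {1: [[1.0, 3.0], [2.0, 1.0]], 2: [[2.0, 1.0], [1.0, 2.0]]}
Z = constrained_forward(trans, init, T=3, n_states=2)
Zc = constrained_forward(trans, init, T=3, n_states=2, constraints={1: {1}})
confidence = Zc / Z  # always in [0, 1], since constrained paths are a subset
```

Because every constrained path is also an unconstrained path, Z'_o never exceeds Z_o, so the ratio is a proper probability; a real implementation would work in log space to avoid underflow on long sequences.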
4   Record Confidence Estimation

We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that C now specifies the labels for all fields in the record.

We also implement three alternative record confidence estimates. FIELDPRODUCT calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FIELDMIN instead uses the minimum field confidence as the record confidence. VITERBIRATIO uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely s^* is than its closest alternative.

5   Reranking with Maximum Entropy

We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the "correct" label is used as the confidence measure. The approach is inspired by results from Collins (2000), which show that discriminative classifiers can improve the ranking of parses produced by a generative parser.

After initial experimentation, the most informative inputs for the field confidence classifier were field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.

6   Experiments

2187 contact records (27,560 words) were collected from Web pages and email, and 25 classes of data fields were hand-labeled.[1] The features for the CRF consist of the token text, capitalization features, 24 regular expressions over the token text (e.g. CONTAINSHYPHEN), and offsets of these features within a window of size 5. We also use 19 lexicons, including "US Last Names," "US First Names," and "State Names." Feature induction is not used in these experiments. The CRF is trained on 60% of the data, and the remaining 40% is split evenly into development and testing sets. The development set is used to train the maximum entropy classifiers, and the testing set is used to measure the accuracy of the confidence estimates. The CRF achieves an overall token accuracy of 87.32 on the testing data, with a field-level performance of F1 = 84.11, precision = 85.43, and recall = 82.83.

[1] The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage.

To evaluate confidence estimation, we use three methods. The first is Pearson's r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled. The second is average precision, used in the Information Retrieval community to evaluate ranked lists. It calculates the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document. WORSTCASE is the average precision obtained by ranking all incorrect instances above all correct instances. Tables 1 and 2 show that CFB and MAXENT are statistically similar, and that both outperform competing methods. Note that WORSTCASE achieves a high average precision simply because so many fields are correctly labeled. In all experiments, RANDOM assigns confidence values chosen uniformly at random between 0 and 1.

                  Pearson's r   Avg. Prec
    CFB              .573         .976
    MaxEnt           .571         .976
    Gamma            .418         .912
    Random           .012         .858
    WorstCase         --          .672

Table 1: Evaluation of confidence estimates for field confidence. CFB and MAXENT outperform competing methods.

                  Pearson's r   Avg. Prec
    CFB              .626         .863
    MaxEnt           .630         .867
    FieldProduct     .608         .858
    FieldMin         .588         .843
    ViterbiRatio     .313         .842
    Random           .043         .526
    WorstCase         --          .304

Table 2: Evaluation of confidence estimates for record confidence. CFB and MAXENT again perform best.

The third measure is an accuracy-coverage graph. Better confidence estimates push the curve to the upper-right of the graph. Figure 1 shows that CFB and MAXENT dramatically outperform GAMMA. Although omitted for space, similar results are also achieved on a noun-phrase chunking task (CFB r = .516, GAMMA r = .432) and a named-entity extraction task (CFB r = .508, GAMMA r = .480).

7   Related Work

While there has been previous work using probabilistic estimates for token confidence, and heuristic estimates for field confidence, to the best of our knowledge this work is the first to use a sound, probabilistic estimate for
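The average precision computation described above is straightforward to state in code. The following is an illustrative sketch (the function name, variable names, and toy confidence values are our own, not from the paper): fields are ranked by confidence score, precision is taken at each rank holding a correctly labeled field, and these precisions are averaged.

```python
def average_precision(scored_items):
    """Average precision of a confidence-ranked list.

    scored_items: (confidence, is_correct) pairs.  A correctly labeled
    field plays the role of a relevant document in the IR formulation.
    """
    ranked = sorted(scored_items, key=lambda pair: pair[0], reverse=True)
    correct_so_far = 0
    precisions = []
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
            precisions.append(correct_so_far / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example: three of four hypothetical fields are correctly labeled.
fields = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
ap = average_precision(fields)  # (1/1 + 2/2 + 3/4) / 3
```

Ranking every incorrect instance above every correct one, as in WORSTCASE, yields the lowest average precision achievable for a fixed number of correct labels, which is why WORSTCASE serves as the floor in Tables 1 and 2.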
field and record confidence in information extraction.

Much of the work in confidence estimation for IE has been in the active learning literature. Scheffer et al. (2001) derive confidence estimates using hidden Markov models in an information extraction system. However, they do not estimate the confidence of entire fields, only singleton tokens. They estimate the confidence of a token by the difference between the probabilities of its first and second most likely labels, whereas CFB considers the full distribution of all suboptimal paths. Scheffer et al. (2001) also explore an idea similar to CFB to perform Baum-Welch training with partially labeled data, where the provided labels are constraints. However, these constraints are again for singleton tokens only.

Rule-based extraction methods (Thompson et al., 1999) estimate confidence based on a rule's coverage in the training data. Other areas where confidence estimation is used include document classification (Bennett et al., 2002), where classifiers are built using meta-features of the document; speech recognition (Gunawardana et al., 1998), where the confidence of a recognized word is estimated by considering a list of commonly confused words; and machine translation (Gandrabur and Foster, 2003), where neural networks are used to learn the probability of a correct word translation using text features and knowledge of alternate translations.

[Figure 1 plot not recoverable from the extracted text; the axis residue indicates accuracy (y-axis, values near 0.94) plotted against coverage (x-axis, 0 to 1).]

Figure 1: The precision-recall curve for fields shows that CFB and MAXENT outperform GAMMA.

8   Conclusion

We have shown that CFB is a mathematically and empirically sound confidence estimator for finite state information extraction systems, providing strong correlation with correctness and obtaining an average precision of 97.6% for estimating field correctness. Interestingly, discriminative reranking by MAXENT does not seem to improve performance, despite the benefit Collins (2000) has shown discriminative reranking to provide to generative parsers. We hypothesize this is because CRFs are already discriminative (not generative) models; furthermore, this may suggest that future discriminative parsing techniques will also have the benefits of discriminative reranking built in directly.

Acknowledgments

We thank the reviewers for helpful suggestions and references. This work was supported in part by the Center for Intelligent Information Retrieval; by the Advanced Research and Development Activity under contract number MDA904-01-C-0984; by the Central Intelligence Agency, the National Security Agency, and the National Science Foundation under NSF grant #IIS-0326249; and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. 2002. Probabilistic combination of text classifiers using reliability indicators: models and results. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 207-214. ACM Press.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th International Conf. on Machine Learning, pages 175-182. Morgan Kaufmann, San Francisco, CA.

Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003), Edmonton, Canada.

A. Gunawardana, H. Hon, and L. Jiang. 1998. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proc. ICSLP-98, pages 791-794, Sydney, Australia.

Trausti Kristjannson, Aron Culotta, Paul Viola, and Andrew McCallum. 2004. Interactive information extraction with conditional random fields. To appear in Nineteenth National Conference on Artificial Intelligence (AAAI 2004).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.

Andrew McCallum and David Jensen. 2003. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI-03 Workshop on Learning Statistical Models from Relational Data.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001.

Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proc. 16th International Conf. on Machine Learning, pages 406-414. Morgan Kaufmann, San Francisco, CA.
