Confidence Estimation for Information Extraction


                    Aron Culotta                                     Andrew McCallum
            Department of Computer Science                     Department of Computer Science
              University of Massachusetts                        University of Massachusetts
                 Amherst, MA 01003                                  Amherst, MA 01003

Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.

1 Introduction

Information extraction usually consists of tagging a sequence of words (e.g. a Web document) with semantic labels (e.g. PERSONNAME, PHONENUMBER) and depositing these extracted fields into a database. Because automated information extraction will never be perfectly accurate, it is helpful to have an effective measure of the confidence that the proposed database entries are correct. There are at least three important applications of accurate confidence estimation. First, accuracy-coverage trade-offs are a common way to improve data integrity in databases. Efficiently making these trade-offs requires an accurate prediction of correctness.

Second, confidence estimates are essential for interactive information extraction, in which users may correct incorrectly extracted fields. These corrections are then automatically propagated in order to correct other mistakes in the same record. Directing the user to the least confident field allows the system to improve its performance with a minimal amount of user effort. Kristjannson et al. (2004) show that using accurate confidence estimation reduces error rates by 46%.

Third, confidence estimates can improve performance of data mining algorithms that depend upon databases created by information extraction systems (McCallum and Jensen, 2003). Confidence estimates provide data mining applications with a richer set of "bottom-up" hypotheses, resulting in more accurate inferences. An example of this occurs in the task of citation co-reference resolution. An information extraction system labels each field of a paper citation (e.g. AUTHOR, TITLE), and then co-reference resolution merges disparate references to the same paper. Attaching a confidence value to each field allows the system to examine alternate labelings for less confident fields to improve performance.

Sound probabilistic extraction models are most conducive to accurate confidence estimation because of their intelligent handling of uncertainty information. In this work we use conditional random fields (Lafferty et al., 2001), a type of undirected graphical model, to automatically label fields of contact records. Here, a record is an entire block of a person's contact information, and a field is one element of that record (e.g. COMPANYNAME). We implement several techniques to estimate both field confidence and record confidence, obtaining an average precision of 98% for fields and 87% for records.

2 Conditional Random Fields

Conditional random fields (Lafferty et al., 2001) are undirected graphical models to calculate the conditional probability of values on designated output nodes given values on designated input nodes. In the special case in which the designated output nodes are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models, with additional flexibility to effectively take advantage of complex overlapping features.

Let $\mathbf{o} = o_1, o_2, \ldots, o_T$ be some observed input data sequence, such as a sequence of words in a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label (such as COMPANYNAME). Let $\mathbf{s} = s_1, s_2, \ldots, s_T$ be some sequence of states (the values on T output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

$$p_\Lambda(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z_{\mathbf{o}}} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \right), \tag{1}$$

where $Z_{\mathbf{o}}$ is a normalization factor over all state sequences, $f_k(s_{t-1}, s_t, \mathbf{o}, t)$ is an arbitrary feature function over its arguments, and $\lambda_k$ is a learned weight for each feature function. $Z_{\mathbf{o}}$ is efficiently calculated using dynamic programming. Inference (very much like the Viterbi algorithm in this case) is also a matter of dynamic programming. Maximum a posteriori training of these models is efficiently performed by hill-climbing methods such as conjugate gradient, or its improved second-order cousin, limited-memory BFGS.

3 Field Confidence Estimation

The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.

Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as $\mathbf{s}^* = \operatorname{argmax}_{\mathbf{s}} p_\Lambda(\mathbf{s} \mid \mathbf{o})$. To avoid an exponential-time search over all possible settings of $\mathbf{s}$, Viterbi stores the probability of the most likely path at time t that accounts for the first t observations and ends in state $s_i$. Following traditional notation, we define this probability to be $\delta_t(s_i)$, where $\delta_0(s_i)$ is the probability of starting in each state $s_i$, and the recursive formula is:

$$\delta_{t+1}(s_i) = \max_{s'} \left[ \delta_t(s') \exp\left( \sum_{k} \lambda_k f_k(s', s_i, \mathbf{o}, t) \right) \right] \tag{2}$$

terminating in $\mathbf{s}^* = \operatorname{argmax}_{s_1 \le s_i \le s_N} [\delta_T(s_i)]$.

The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" $\alpha_{t+1}(s_i)$ are recursively defined similarly as in Eq. 2, except the max is replaced by a summation. Thus we have

$$\alpha_{t+1}(s_i) = \sum_{s'} \left[ \alpha_t(s') \exp\left( \sum_{k} \lambda_k f_k(s', s_i, \mathbf{o}, t) \right) \right], \tag{3}$$

terminating in $Z_{\mathbf{o}} = \sum_i \alpha_T(s_i)$ from Eq. 1.

To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints $C = \langle s_q \ldots s_r \rangle$ from time step q to r. Here, $s_q \in C$ can be either a positive constraint (the sequence must pass through $s_q$) or a negative constraint (the sequence must not pass through $s_q$).

In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use state names B-JOBTITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence $o_2, \ldots, o_5$ as a JOBTITLE field, then $C = \langle s_2 = \text{B-JOBTITLE},\ s_3 = \ldots = s_5 = \text{I-JOBTITLE},\ s_6 \ne \text{I-JOBTITLE} \rangle$.

The calculations of the forward values can be made to conform to C by the recursion

$$\alpha'_q(s_i) = \begin{cases} \sum_{s'} \left[ \alpha'_{q-1}(s') \exp\left( \sum_{k} \lambda_k f_k(s', s_i, \mathbf{o}, q-1) \right) \right] & \text{if } s_i \simeq s_q \\ 0 & \text{otherwise} \end{cases}$$

for all $s_q \in C$, where the operator $s_i \simeq s_q$ means $s_i$ conforms to constraint $s_q$. For time steps not constrained by C, Eq. 3 is used instead.

If $\alpha'_{t+1}(s_i)$ is the constrained forward value, then $Z'_{\mathbf{o}} = \sum_i \alpha'_T(s_i)$ is the value of the constrained lattice, the set of all paths that conform to C. Our confidence estimate is obtained by normalizing $Z'_{\mathbf{o}}$ using $Z_{\mathbf{o}}$, i.e. $Z'_{\mathbf{o}} / Z_{\mathbf{o}}$.

We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let $\gamma_t(s_i) = p(s_i \mid o_1, \ldots, o_T)$ be the probability of being in state $s_i$ at time t given the observation sequence. We define the confidence measure GAMMA to be $\prod_{i=u}^{v} \gamma_i(s_i)$, where u and v are the start and end indices of the extracted field.

4 Record Confidence Estimation

We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that C now specifies the labels for all fields in the record.
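
As an illustration of the constrained recursion from Section 3, which the record-level estimate reuses with C covering every field, the following minimal sketch computes a CFB confidence and the GAMMA alternative on a toy three-state chain. It is not the authors' implementation: the feature sum is collapsed into two hypothetical tables of log-potentials (`unary` and `trans`), the state indices and all numeric values are invented for the example, and a constraint is represented as a set of permitted states per time step (a one-element set for a positive constraint, the complement of a state for a negative one).

```python
import math


def forward(unary, trans, allowed=None):
    """Sum of exponentiated scores over all state paths, optionally constrained.

    unary[t][i] and trans[j][i] are hypothetical log-potentials standing in for
    sum_k lambda_k * f_k(s', s_i, o, t). `allowed` maps a time step to the set
    of states a path may occupy there; unlisted time steps are unconstrained.
    """
    n_states = len(unary[0])

    def permitted(t, i):
        return allowed is None or t not in allowed or i in allowed[t]

    alpha = [math.exp(unary[0][i]) if permitted(0, i) else 0.0
             for i in range(n_states)]
    for t in range(1, len(unary)):
        alpha = [
            sum(alpha[j] * math.exp(trans[j][i] + unary[t][i])
                for j in range(n_states)) if permitted(t, i) else 0.0
            for i in range(n_states)
        ]
    return sum(alpha)


def cfb_confidence(unary, trans, start, end, begin_state, inside_state):
    """Constrained lattice value divided by the unconstrained one."""
    n_states = len(unary[0])
    allowed = {start: {begin_state}}                  # positive constraints
    for t in range(start + 1, end + 1):
        allowed[t] = {inside_state}
    if end + 1 < len(unary):                          # negative constraint at the boundary
        allowed[end + 1] = set(range(n_states)) - {inside_state}
    return forward(unary, trans, allowed) / forward(unary, trans)


def gamma_confidence(unary, trans, start, end, begin_state, inside_state):
    """Product of per-token state marginals over the field."""
    z = forward(unary, trans)
    conf = 1.0
    for t in range(start, end + 1):
        state = begin_state if t == start else inside_state
        conf *= forward(unary, trans, {t: {state}}) / z
    return conf


# Toy example (invented numbers): states 0 = OTHER, 1 = B-JOBTITLE, 2 = I-JOBTITLE.
unary = [[0.2, 1.0, -0.5], [0.1, -0.3, 1.2], [0.0, -0.2, 0.9], [1.1, 0.0, -0.4]]
trans = [[0.5, 0.3, -2.0], [-0.5, -1.0, 1.0], [0.4, 0.0, 0.6]]

print("CFB  :", cfb_confidence(unary, trans, 0, 2, 1, 2))
print("GAMMA:", gamma_confidence(unary, trans, 0, 2, 1, 2))
```

The sketch works in probability space for readability; on realistic sequence lengths a log-space or rescaled forward pass would be needed to avoid underflow. For a whole record, the same `allowed` map would simply carry the constraints of every field at once.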
We also implement three alternative record confidence estimates. FIELDPRODUCT calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FIELDMIN instead uses the minimum field confidence as the record confidence. VITERBIRATIO uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely $\mathbf{s}^*$ is than its closest alternative.

5 Reranking with Maximum Entropy

We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the "correct" label is used as the confidence measure. The approach is inspired by results from Collins (2000), which show discriminative classifiers can improve the ranking of parses produced by a generative parser.

After initial experimentation, the most informative inputs for the field confidence classifier were field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.

6 Experiments

2187 contact records (27,560 words) were collected from Web pages and email and 25 classes of data fields were hand-labeled.¹ The features for the CRF consist of the token text, capitalization features, 24 regular expressions over the token text (e.g. CONTAINSHYPHEN), and offsets of these features within a window of size 5. We also use 19 lexicons, including "US Last Names," "US First Names," and "State Names." Feature induction is not used in these experiments. The CRF is trained on 60% of the data, and the remaining 40% is split evenly into development and testing sets. The development set is used to train the maximum entropy classifiers, and the testing set is used to measure the accuracy of the confidence estimates. The CRF achieves an overall token accuracy of 87.32 on the testing data, with a field-level performance of F1 = 84.11, precision = 85.43, and recall = 82.83.

¹ The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage

To evaluate confidence estimation, we use three methods. The first is Pearson's r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled. The second is average precision, used in the Information Retrieval community to evaluate ranked lists. It calculates the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document. WORSTCASE is the average precision obtained by ranking all incorrect instances above all correct instances. Tables 1 and 2 show that CFB and MAXENT are statistically similar, and that both outperform competing methods. Note that WORSTCASE achieves a high average precision simply because so many fields are correctly labeled. In all experiments, RANDOM assigns confidence values chosen uniformly at random between 0 and 1.

                Pearson's r   Avg. Prec
CFB                .573         .976
MaxEnt             .571         .976
Gamma              .418         .912
Random             .012         .858
WorstCase           –           .672

Table 1: Evaluation of confidence estimates for field confidence. CFB and MAXENT outperform competing methods.

                Pearson's r   Avg. Prec
CFB                .626         .863
MaxEnt             .630         .867
FieldProduct       .608         .858
FieldMin           .588         .843
ViterbiRatio       .313         .842
Random             .043         .526
WorstCase           –           .304

Table 2: Evaluation of confidence estimates for record confidence. CFB and MAXENT again perform best.

The third measure is an accuracy-coverage graph. Better confidence estimates push the curve to the upper-right. Figure 1 shows that CFB and MAXENT dramatically outperform GAMMA. Although omitted for space, similar results are also achieved on a noun-phrase chunking task (CFB r = .516, GAMMA r = .432) and a named-entity extraction task (CFB r = .508, GAMMA r = .480).

Figure 1: The precision-recall curve for fields shows that CFB and MAXENT outperform GAMMA.

7 Related Work

While there has been previous work using probabilistic estimates for token confidence, and heuristic estimates for field confidence, to the best of our knowledge this paper is the first to use a sound, probabilistic estimate for confidence of multi-word fields and records in information extraction.

Much of the work in confidence estimation for IE has been in the active learning literature. Scheffer et al. (2001) derive confidence estimates using hidden Markov models in an information extraction system. However, they do not estimate the confidence of entire fields, only singleton tokens. They estimate the confidence of a token by the difference between the probabilities of its first and second most likely labels, whereas CFB considers the full distribution of all suboptimal paths. Scheffer et al. (2001) also explore an idea similar to CFB to perform Baum-Welch training with partially labeled data, where the provided labels are constraints. However, these constraints are again for singleton tokens only.

Rule-based extraction methods (Thompson et al., 1999) estimate confidence based on a rule's coverage in the training data. Other areas where confidence estimation is used include document classification (Bennett et al., 2002), where classifiers are built using meta-features of the document; speech recognition (Gunawardana et al., 1998), where the confidence of a recognized word is estimated by considering a list of commonly confused words; and machine translation (Gandrabur and Foster, 2003), where neural networks are used to learn the probability of a correct word translation using text features and knowledge of alternate translations.

8 Conclusion

We have shown that CFB is a mathematically and empirically sound confidence estimator for finite state information extraction systems, providing strong correlation with correctness and obtaining an average precision of 97.6% for estimating field correctness. Unlike margin maximization methods such as SVMs and M3Ns (Taskar et al., 2003), CRFs are trained to maximize conditional probability and are thus more naturally appropriate for confidence estimation. Interestingly, reranking by MAXENT does not seem to improve performance, despite the benefit Collins (2000) has shown discriminative reranking to provide generative parsers. We hypothesize this is because CRFs are already discriminative (not joint, generative) models; furthermore, this may suggest that future discriminative parsing methods will also have the benefits of discriminative reranking built-in directly.

Acknowledgments

We thank the reviewers for helpful suggestions and references. This work was supported in part by the Center for Intelligent Information Retrieval, by the Advanced Research and Development Activity under contract number MDA904-01-C-0984, by The Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0326249, and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. 2002. Probabilistic combination of text classifiers using reliability indicators: models and results. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 207–214. ACM Press.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th International Conf. on Machine Learning, pages 175–182. Morgan Kaufmann, San Francisco, CA.

Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003), Edmonton, Canada.

A. Gunawardana, H. Hon, and L. Jiang. 1998. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proc. ICSLP-98, pages 791–794, Sydney, Australia.

Trausti Kristjannson, Aron Culotta, Paul Viola, and Andrew McCallum. 2004. Interactive information extraction with conditional random fields. To appear in Nineteenth National Conference on Artificial Intelligence (AAAI 2004).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA.

Andrew McCallum and David Jensen. 2003. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI03 Workshop on Learning Statistical Models from Relational Data.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin markov networks. In Proceedings of Neural Information Processing Systems Conference.

Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proc. 16th International Conf. on Machine Learning, pages 406–414. Morgan Kaufmann, San Francisco, CA.