Confidence Estimation for Information Extraction

Aron Culotta and Andrew McCallum
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
firstname.lastname@example.org  email@example.com

Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.

1 Introduction

Information extraction usually consists of tagging a sequence of words (e.g. a Web document) with semantic labels (e.g. PERSONNAME, PHONENUMBER) and depositing these extracted fields into a database. Because automated information extraction will never be perfectly accurate, it is helpful to have an effective measure of the confidence that the proposed database entries are correct. There are at least three important applications of accurate confidence estimation. First, accuracy-coverage trade-offs are a common way to improve data integrity in databases. Efficiently making these trade-offs requires an accurate prediction of correctness.

Second, confidence estimates are essential for interactive information extraction, in which users may correct incorrectly extracted fields. These corrections are then automatically propagated in order to correct other mistakes in the same record. Directing the user to the least confident field allows the system to improve its performance with a minimal amount of user effort. Kristjannson et al. (2004) show that using accurate confidence estimation reduces error rates by 46%.

Third, confidence estimates can improve the performance of data mining algorithms that depend upon databases created by information extraction systems (McCallum and Jensen, 2003). Confidence estimates provide data mining applications with a richer set of "bottom-up" hypotheses, resulting in more accurate inferences. An example of this occurs in the task of citation co-reference resolution. An information extraction system labels each field of a paper citation (e.g. AUTHOR, TITLE), and then co-reference resolution merges disparate references to the same paper. Attaching a confidence value to each field allows the system to examine alternate labelings for less confident fields to improve performance.

Sound probabilistic extraction models are most conducive to accurate confidence estimation because of their intelligent handling of uncertainty information. In this work we use conditional random fields (Lafferty et al., 2001), a type of undirected graphical model, to automatically label fields of contact records. Here, a record is an entire block of a person's contact information, and a field is one element of that record (e.g. COMPANYNAME). We implement several techniques to estimate both field confidence and record confidence, obtaining an average precision of 98% for fields and 87% for records.
2 Conditional Random Fields

Conditional random fields (Lafferty et al., 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes. In the special case in which the designated output nodes are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models, with additional flexibility to effectively take advantage of complex overlapping features.

Let o = \langle o_1, o_2, \dots, o_T \rangle be some observed input data sequence, such as a sequence of words in a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label (such as COMPANYNAME). Let s = \langle s_1, s_2, \dots, s_T \rangle be some sequence of states (the values on T output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

    p_\Lambda(s \mid o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right),    (1)

where Z_o is a normalization factor over all state sequences, f_k(s_{t-1}, s_t, o, t) is an arbitrary feature function over its arguments, and \lambda_k is a learned weight for each feature function. Z_o is efficiently calculated using dynamic programming. Inference (very much like the Viterbi algorithm in this case) is also a matter of dynamic programming. Maximum a posteriori training of these models is efficiently performed by hill-climbing methods such as conjugate gradient, or its improved second-order cousin, limited-memory BFGS.

3 Field Confidence Estimation

The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.

Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as s^* = \arg\max_s p_\Lambda(s \mid o). To avoid an exponential-time search over all possible settings of s, Viterbi stores the probability of the most likely path at time t that accounts for the first t observations and ends in state s_i. Following traditional notation, we define this probability to be \delta_t(s_i), where \delta_0(s_i) is the probability of starting in each state s_i, and the recursive formula is

    \delta_{t+1}(s_i) = \max_{s'} \left[ \delta_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) \right],    (2)

terminating in s^* = \arg\max_{s_1 \le s_i \le s_N} [\delta_T(s_i)].

The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" \alpha_{t+1}(s_i) are defined recursively as in Eq. 2, except the max is replaced by a summation. Thus we have

    \alpha_{t+1}(s_i) = \sum_{s'} \alpha_t(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right),    (3)

terminating in Z_o = \sum_i \alpha_T(s_i) from Eq. 1.

To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints C = \langle s_q \dots s_r \rangle from time step q to r. Here, s_q \in C can be either a positive constraint (the sequence must pass through s_q) or a negative constraint (the sequence must not pass through s_q).

In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use the state names B-JOBTITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence \langle o_2, \dots, o_5 \rangle as a JOBTITLE field, then C = \langle s_2 = B-JOBTITLE, s_3 = \dots = s_5 = I-JOBTITLE, s_6 \neq I-JOBTITLE \rangle.

The calculations of the forward values can be made to conform to C by the recursion

    \alpha'_q(s_i) =
      \begin{cases}
        \sum_{s'} \alpha'_{q-1}(s') \exp\left( \sum_k \lambda_k f_k(s', s_i, o, t) \right) & \text{if } s_i \simeq s_q \\
        0 & \text{otherwise}
      \end{cases}

for all s_q \in C, where the operator s_i \simeq s_q means s_i conforms to constraint s_q. For time steps not constrained by C, Eq. 3 is used instead. If \alpha'_{t+1}(s_i) is the constrained forward value, then Z'_o = \sum_i \alpha'_T(s_i) is the value of the constrained lattice, the set of all paths that conform to C. Our confidence estimate is obtained by normalizing Z'_o by Z_o, i.e. Z'_o / Z_o.

We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let \gamma_t(s_i) = p(s_t = s_i \mid o_1, \dots, o_T) be the probability of being in state s_i at time t given the observation sequence. We define the confidence measure GAMMA to be \prod_{i=u}^{v} \gamma_i(s_i), where u and v are the start and end indices of the extracted field.

4 Record Confidence Estimation

We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that C now specifies the labels for all fields in the record.

We also implement three alternative record confidence estimates. FIELDPRODUCT calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FIELDMIN instead uses the minimum field confidence as the record confidence. VITERBIRATIO uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely s^* is than its closest alternative.

5 Reranking with Maximum Entropy

We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the "correct" label is used as the confidence measure. The approach is inspired by results from Collins (2000), which show that discriminative classifiers can improve the ranking of parses produced by a generative parser.

After initial experimentation, the most informative inputs for the field confidence classifier were field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.

               Pearson's r   Avg. Prec
  CFB          .573          .976
  MaxEnt       .571          .976
  Gamma        .418          .912
  Random       .012          .858
  WorstCase    --            .672

Table 1: Evaluation of confidence estimates for field confidence. CFB and MAXENT outperform competing methods.

                Pearson's r   Avg. Prec
  CFB           .626          .863
  MaxEnt        .630          .867
  FieldProduct  .608          .858
  FieldMin      .588          .843
  ViterbiRatio  .313          .842
  Random        .043          .526
  WorstCase     --            .304

Table 2: Evaluation of confidence estimates for record confidence. CFB and MAXENT again perform best.
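As a concrete illustration of the constrained-lattice computation in Section 3, the following is a minimal sketch, not the authors' implementation: it assumes the per-step log potentials trans[t][i][j] = \sum_k \lambda_k f_k(s_i, s_j, o, t+1) have already been computed from the features, and the names forward_log_z and field_confidence are our own.

```python
import math

NEG_INF = float("-inf")

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_z(init, trans, constraints=None):
    """Forward pass over a linear-chain lattice, in log space.

    init[j]        -- log potential of starting in state j at t = 0
    trans[t][i][j] -- log potential of moving from state i at step t
                      to state j at step t + 1
    constraints    -- optional dict {t: set of states allowed at t}; a
                      positive constraint is a singleton set, and a
                      negative constraint omits the forbidden state

    Returns log Z'_o under the constraints, or log Z_o if none given.
    """
    n = len(init)

    def mask(alpha, t):
        # Zero out (in log space) any state that violates a constraint.
        if constraints is None or t not in constraints:
            return alpha
        return [a if j in constraints[t] else NEG_INF
                for j, a in enumerate(alpha)]

    alpha = mask(list(init), 0)
    for t, m_t in enumerate(trans):
        alpha = mask([logsumexp([alpha[i] + m_t[i][j] for i in range(n)])
                      for j in range(n)], t + 1)
    return logsumexp(alpha)

def field_confidence(init, trans, constraints):
    """CFB confidence estimate Z'_o / Z_o, computed in log space."""
    return math.exp(forward_log_z(init, trans, constraints)
                    - forward_log_z(init, trans))
```

Because every state sequence satisfies exactly one complete labeling, the CFB confidences of mutually exclusive, exhaustive constraint sets sum to one, which gives a quick sanity check on an implementation.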
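The GAMMA estimator of Section 3 and the simpler record-level combiners of Section 4 can be sketched in the same setting. Again this is our own illustrative code with hypothetical names, assuming the same precomputed log potentials; the \gamma_t marginals come from a standard forward-backward pass.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_gammas(init, trans):
    """Per-position marginals log gamma_t(s_i) = log p(s_t = s_i | o),
    via forward-backward over the log potentials of Eq. 1."""
    n, T = len(init), len(trans) + 1
    # Forward values alpha_t.
    alphas = [list(init)]
    for m_t in trans:
        prev = alphas[-1]
        alphas.append([logsumexp([prev[i] + m_t[i][j] for i in range(n)])
                       for j in range(n)])
    # Backward values beta_t (beta_{T-1} = 1 in probability space).
    betas = [[0.0] * n]
    for m_t in reversed(trans):
        nxt = betas[0]
        betas.insert(0, [logsumexp([m_t[i][j] + nxt[j] for j in range(n)])
                         for i in range(n)])
    log_z = logsumexp(alphas[-1])
    return [[alphas[t][i] + betas[t][i] - log_z for i in range(n)]
            for t in range(T)]

def gamma_confidence(init, trans, states, u, v):
    """GAMMA: product of the labeled states' marginals over the
    extracted field's span [u, v] (inclusive)."""
    lg = log_gammas(init, trans)
    return math.exp(sum(lg[t][states[t]] for t in range(u, v + 1)))

def field_product(confidences):
    """FIELDPRODUCT: multiply the per-field CFB confidences."""
    out = 1.0
    for c in confidences:
        out *= c
    return out

def field_min(confidences):
    """FIELDMIN: the record is only as confident as its weakest field."""
    return min(confidences)
```

Note that GAMMA multiplies per-position marginals, so unlike CFB it ignores the dependence between adjacent labels inside the field, which is one plausible reason for its weaker correlation with correctness in Tables 1 and 2.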
6 Experiments

2187 contact records (27,560 words) were collected from Web pages and email, and 25 classes of data fields were hand-labeled.[1] The features for the CRF consist of the token text, capitalization features, 24 regular expressions over the token text (e.g. CONTAINSHYPHEN), and offsets of these features within a window of size 5. We also use 19 lexicons, including "US Last Names," "US First Names," and "State Names." Feature induction is not used in these experiments. The CRF is trained on 60% of the data, and the remaining 40% is split evenly into development and testing sets. The development set is used to train the maximum entropy classifiers, and the testing set is used to measure the accuracy of the confidence estimates. The CRF achieves an overall token accuracy of 87.32 on the testing data, with a field-level performance of F1 = 84.11, precision = 85.43, and recall = 82.83.

[1] The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage.

To evaluate confidence estimation, we use three methods. The first is Pearson's r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled. The second is average precision, used in the Information Retrieval community to evaluate ranked lists: it calculates the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document. WORSTCASE is the average precision obtained by ranking all incorrect instances above all correct instances. In all experiments, RANDOM assigns confidence values chosen uniformly at random between 0 and 1. Tables 1 and 2 show that CFB and MAXENT are statistically similar, and that both outperform competing methods. Note that WORSTCASE achieves a high average precision simply because so many fields are correctly labeled.

The third measure is an accuracy-coverage graph; better confidence estimates push the curve to the upper right. Figure 1 shows that CFB and MAXENT dramatically outperform GAMMA. Although omitted for space, similar results are also achieved on a noun-phrase chunking task (CFB r = .516, GAMMA r = .432) and a named-entity extraction task (CFB r = .508, GAMMA r = .480).

[Figure 1 (plot omitted): accuracy plotted against coverage for Optimal, CFB, MaxEnt, Gamma, and Random.]
Figure 1: The accuracy-coverage curve for fields shows that CFB and MAXENT outperform GAMMA.

7 Related Work

While there has been previous work using probabilistic estimates for token confidence, and heuristic estimates for field confidence, to the best of our knowledge this paper is the first to use a sound, probabilistic estimate for the confidence of multi-word fields and records in information extraction.

Much of the work in confidence estimation for IE has been in the active learning literature. Scheffer et al. (2001) derive confidence estimates using hidden Markov models in an information extraction system. However, they do not estimate the confidence of entire fields, only singleton tokens. They estimate the confidence of a token by the difference between the probabilities of its first and second most likely labels, whereas CFB considers the full distribution of all suboptimal paths. Scheffer et al. (2001) also explore an idea similar to CFB to perform Baum-Welch training with partially labeled data, where the provided labels are constraints. However, these constraints are again for singleton tokens only.

Rule-based extraction methods (Thompson et al., 1999) estimate confidence based on a rule's coverage in the training data. Other areas where confidence estimation is used include document classification (Bennett et al., 2002), where classifiers are built using meta-features of the document; speech recognition (Gunawardana et al., 1998), where the confidence of a recognized word is estimated by considering a list of commonly confused words; and machine translation (Gandrabur and Foster, 2003), where neural networks are used to learn the probability of a correct word translation using text features and knowledge of alternate translations.

8 Conclusion

We have shown that CFB is a mathematically and empirically sound confidence estimator for finite-state information extraction systems, providing strong correlation with correctness and obtaining an average precision of 97.6% for estimating field correctness. Unlike margin-maximization methods such as SVMs and M3Ns (Taskar et al., 2003), CRFs are trained to maximize conditional probability and are thus more naturally appropriate for confidence estimation. Interestingly, reranking by MAXENT does not seem to improve performance, despite the benefit Collins (2000) has shown discriminative reranking to provide for generative parsers. We hypothesize this is because CRFs are already discriminative (not joint, generative) models; furthermore, this may suggest that future discriminative parsing methods will also have the benefits of discriminative reranking built in directly.

Acknowledgments

We thank the reviewers for helpful suggestions and references. This work was supported in part by the Center for Intelligent Information Retrieval, by the Advanced Research and Development Activity under contract number MDA904-01-C-0984, by the Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0326249, and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. 2002. Probabilistic combination of text classifiers using reliability indicators: models and results. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 207-214. ACM Press.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th International Conf. on Machine Learning, pages 175-182. Morgan Kaufmann, San Francisco, CA.

Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003), Edmonton, Canada.

A. Gunawardana, H. Hon, and L. Jiang. 1998. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proc. ICSLP-98, pages 791-794, Sydney, Australia.

Trausti Kristjannson, Aron Culotta, Paul Viola, and Andrew McCallum. 2004. Interactive information extraction with conditional random fields. To appear in Nineteenth National Conference on Artificial Intelligence (AAAI 2004).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.

Andrew McCallum and David Jensen. 2003. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI03 Workshop on Learning Statistical Models from Relational Data.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin Markov networks. In Proceedings of Neural Information Processing Systems Conference.

Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proc. 16th International Conf. on Machine Learning, pages 406-414. Morgan Kaufmann, San Francisco, CA.