Confidence Estimation for Information Extraction

Aron Culotta and Andrew McCallum
Department of Computer Science, University of Massachusetts, Amherst, MA 01002
culotta@cs.umass.edu  mccallum@cs.umass.edu

Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.

1 Introduction

Information extraction usually consists of tagging a sequence of words (e.g. a Web document) with semantic labels (e.g. PersonName, PhoneNumber) and depositing these extracted fields into a database. Because automated information extraction will never be perfectly accurate, it is helpful to have an effective measure of the confidence that the proposed database entries are correct. There are at least three important applications of accurate confidence estimation.

First, accuracy-coverage trade-offs are a common way to improve data integrity in databases. Efficiently making these trade-offs requires an accurate prediction of correctness.

Second, confidence estimates are essential for interactive information extraction, in which users may correct incorrectly extracted fields. These corrections are then automatically propagated in order to correct other mistakes in the same record. Directing the user to the least confident field allows the system to improve its performance with a minimal amount of user effort. Kristjannson et al. (2004) show that using accurate confidence estimation reduces error rates by 46%.

Third, confidence estimates can improve the performance of data mining algorithms that depend upon databases created by information extraction systems (McCallum and Jensen, 2003). Confidence estimates provide data mining applications with a richer set of "bottom-up" hypotheses, resulting in more accurate inferences. An example of this occurs in the task of citation co-reference resolution. An information extraction system labels each field of a paper citation (e.g. Author, Title), and then co-reference resolution merges disparate references to the same paper. Attaching a confidence value to each field allows the system to examine alternate labelings for less confident fields to improve performance.

Sound probabilistic extraction models are most conducive to accurate confidence estimation because of their intelligent handling of uncertainty information. In this work we use conditional random fields (Lafferty et al., 2001), a type of undirected graphical model, to automatically label fields of contact records. Here, a record is an entire block of a person's contact information, and a field is one element of that record (e.g. CompanyName). We implement several techniques to estimate both field confidence and record confidence, obtaining an average precision of 98% for fields and 87% for records.

2 Conditional Random Fields

Conditional random fields (Lafferty et al., 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes. In the special case in which the designated output nodes are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models, with additional flexibility to effectively take advantage of complex overlapping features.

Let o = ⟨o_1, o_2, ..., o_T⟩ be some observed input data sequence, such as a sequence of words in a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label (such as CompanyName). Let s = ⟨s_1, s_2, ..., s_T⟩ be some sequence of states (the values on T output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

  p_Λ(s|o) = (1/Z_o) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(s_{t−1}, s_t, o, t) ),    (1)

where Z_o is a normalization factor over all state sequences, f_k(s_{t−1}, s_t, o, t) is an arbitrary feature function over its arguments, and λ_k is a learned weight for each feature function. Z_o is efficiently calculated using dynamic programming. Inference (very much like the Viterbi algorithm in this case) is also a matter of dynamic programming. Maximum a posteriori training of these models is efficiently performed by hill-climbing methods such as conjugate gradient, or its improved second-order cousin, limited-memory BFGS.

3 Field Confidence Estimation

The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.

Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as s* = argmax_s p_Λ(s|o). To avoid an exponential-time search over all possible settings of s, Viterbi stores the probability of the most likely path at time t that accounts for the first t observations and ends in state s_i. Following traditional notation, we define this probability to be δ_t(s_i), where δ_0(s_i) is the probability of starting in each state s_i, and the recursive formula is

  δ_{t+1}(s_i) = max_{s′} δ_t(s′) exp( Σ_k λ_k f_k(s′, s_i, o, t) ),    (2)

terminating in p* = max_{s_i} δ_T(s_i). We can backtrack through the dynamic programming table to recover s*.

The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" α_{t+1}(s_i) are defined recursively as in Eq. 2, except that the max is replaced by a summation. Thus we have

  α_{t+1}(s_i) = Σ_{s′} α_t(s′) exp( Σ_k λ_k f_k(s′, s_i, o, t) ),    (3)

terminating in Z_o = Σ_i α_T(s_i) from Eq. 1.

To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints C = ⟨s_t, s_{t+1}, ...⟩. Here, s_t ∈ C can be either a positive constraint (the sequence must pass through s_t) or a negative constraint (the sequence must not pass through s_t).

In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use states named B-JobTitle and I-JobTitle to label tokens that begin and continue a JobTitle field, and the system labels the observation sequence ⟨o_2, ..., o_5⟩ as a JobTitle field, then C = ⟨s_2 = B-JobTitle, s_3 = ... = s_5 = I-JobTitle, s_6 ≠ I-JobTitle⟩.

The calculations of the forward values can be made to conform to C by the recursion

  α′_{t+1}(s_i) = Σ_{s′} α′_t(s′) exp( Σ_k λ_k f_k(s′, s_i, o, t) )   if s_i ≃ s_{t+1},
  α′_{t+1}(s_i) = 0                                                   otherwise,

for all s_{t+1} ∈ C, where the operator s_i ≃ s_{t+1} means that s_i conforms to constraint s_{t+1}. For time steps not constrained by C, Eq. 3 is used instead.

If α′_{t+1}(s_i) is the constrained forward value, then Z′_o = Σ_i α′_T(s_i) is the value of the constrained lattice, the set of all paths that conform to C. Our confidence estimate is obtained by normalizing Z′_o using Z_o, i.e. Z′_o / Z_o.

We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let γ_t(s_i) = p(s_t = s_i | o_1, ..., o_T) be the probability of being in state s_i at time t given the observation sequence. We define the confidence measure Gamma to be Π_{i=u}^{v} γ_i(s_i), where u and v are the start and end indices of the extracted field.

4 Record Confidence Estimation

We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that C now specifies the labels for all fields in the record.

We also implement three alternative record confidence estimates. FieldProduct calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FieldMin instead uses the minimum field confidence as the record confidence. ViterbiRatio uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely s* is than its closest alternative.

5 Reranking with Maximum Entropy

We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the "correct" label is used as the confidence measure. The approach is inspired by results from Collins (2000), which show that discriminative classifiers can improve the ranking of parses produced by a generative parser.

After initial experimentation, the most informative inputs for the field confidence classifier were field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.

6 Experiments

2187 contact records (27,560 words) were collected from Web pages and email, and 25 classes of data fields were hand-labeled.[1] The features for the CRF consist of the token text, capitalization features, 24 regular expressions over the token text (e.g. ContainsHyphen), and offsets of these features within a window of size 5. We also use 19 lexicons, including "US Last Names," "US First Names," and "State Names." Feature induction is not used in these experiments. The CRF is trained on 60% of the data, and the remaining 40% is split evenly into development and testing sets. The development set is used to train the maximum entropy classifiers, and the testing set is used to measure the accuracy of the confidence estimates. The CRF achieves an overall token accuracy of 87.32 on the testing data, with a field-level performance of F1 = 84.11, precision = 85.43, and recall = 82.83.

To evaluate confidence estimation, we use three methods. The first is Pearson's r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled. The second is average precision, used in the Information Retrieval community to evaluate ranked lists. It calculates the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document. WorstCase is the average precision obtained by ranking all incorrect instances above all correct instances. In all experiments, Random assigns confidence values chosen uniformly at random between 0 and 1.

Tables 1 and 2 show that CFB and MaxEnt are statistically similar, and that both outperform competing methods. Note that WorstCase achieves a high average precision simply because so many fields are correctly labeled.

Table 1: Evaluation of confidence estimates for field confidence. CFB and MaxEnt outperform competing methods.

  Method      Pearson's r   Avg. Prec
  CFB         .573          .976
  MaxEnt      .571          .976
  Gamma       .418          .912
  Random      .012          .858
  WorstCase   –             .672

Table 2: Evaluation of confidence estimates for record confidence. CFB and MaxEnt again perform best.

  Method        Pearson's r   Avg. Prec
  CFB           .626          .863
  MaxEnt        .630          .867
  FieldProduct  .608          .858
  FieldMin      .588          .843
  ViterbiRatio  .313          .842
  Random        .043          .526
  WorstCase     –             .304

The third measure is an accuracy-coverage graph. Better confidence estimates push the curve toward the upper-right of the graph. Figure 1 shows that CFB and MaxEnt dramatically outperform Gamma. Although omitted for space, similar results are also achieved on a noun-phrase chunking task (CFB r = .516, Gamma r = .432) and a named-entity extraction task (CFB r = .508, Gamma r = .480).
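The average precision measure described above can be sketched in a few lines: rank fields by confidence, take the precision at each rank where a correctly labeled field appears, and average those values. The confidence scores and correctness labels below are invented for illustration; they are not the paper's data.

```python
def average_precision(scored):
    """scored: list of (confidence, is_correct) pairs for extracted fields."""
    # Rank fields by confidence, highest first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            precisions.append(hits / rank)  # precision at each "relevant" rank
    return sum(precisions) / len(precisions)

# Hypothetical confidence scores: a good estimator ranks correct fields first.
fields = [(0.95, True), (0.90, True), (0.40, False), (0.30, True), (0.10, False)]
ap = average_precision(fields)

# The WorstCase baseline ranks all incorrect instances above all correct ones.
worst = average_precision([(1.0, False), (0.9, False),
                           (0.5, True), (0.4, True), (0.3, True)])
```

Note that `worst` is still well above zero here: with mostly-correct instances, even the worst ranking yields substantial average precision, which is why the WorstCase rows in Tables 1 and 2 are nontrivial.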
[1] The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage.

Figure 1: The precision-recall curve for fields shows that CFB and MaxEnt outperform Gamma. [Plot of accuracy vs. coverage for Optimal, CFB, MaxEnt, Gamma, and Random.]

7 Related Work

While there has been previous work using probabilistic estimates for token confidence, and heuristic estimates for field confidence, to the best of our knowledge this work is the first to use a sound, probabilistic estimate for field and record confidence in information extraction.

Much of the work in confidence estimation for IE has been in the active learning literature. Scheffer et al. (2001) derive confidence estimates using hidden Markov models in an information extraction system. However, they do not estimate the confidence of entire fields, only singleton tokens. They estimate the confidence of a token by the difference between the probabilities of its first and second most likely labels, whereas CFB considers the full distribution of all suboptimal paths. Scheffer et al. (2001) also explore an idea similar to CFB to perform Baum-Welch training with partially labeled data, where the provided labels are constraints. However, these constraints are again for singleton tokens only.

Rule-based extraction methods (Thompson et al., 1999) estimate confidence based on a rule's coverage in the training data. Other areas where confidence estimation is used include document classification (Bennett et al., 2002), where classifiers are built using meta-features of the document; speech recognition (Gunawardana et al., 1998), where the confidence of a recognized word is estimated by considering a list of commonly confused words; and machine translation (Gandrabur and Foster, 2003), where neural networks are used to learn the probability of a correct word translation using text features and knowledge of alternate translations.

8 Conclusion

We have shown that CFB is a mathematically and empirically sound confidence estimator for finite-state information extraction systems, providing strong correlation with correctness and obtaining an average precision of 97.6% for estimating field correctness. Interestingly, discriminative reranking by MaxEnt does not seem to improve performance, despite the benefit Collins (2000) has shown discriminative reranking to provide for generative parsers. We hypothesize this is because CRFs are already discriminative (not generative) models; furthermore, this may suggest that future discriminative parsing techniques will also have the benefits of discriminative reranking built in directly.

Acknowledgments

We thank the reviewers for helpful suggestions and references. This work was supported in part by the Center for Intelligent Information Retrieval, by the Advanced Research and Development Activity under contract number MDA904-01-C-0984, by the Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0326249, and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. 2002. Probabilistic combination of text classifiers using reliability indicators: models and results. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 207–214. ACM Press.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th International Conf. on Machine Learning, pages 175–182. Morgan Kaufmann, San Francisco, CA.

Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003), Edmonton, Canada.

A. Gunawardana, H. Hon, and L. Jiang. 1998. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proc. ICSLP-98, pages 791–794, Sydney, Australia.

Trausti Kristjannson, Aron Culotta, Paul Viola, and Andrew McCallum. 2004. Interactive information extraction with conditional random fields. To appear in Nineteenth National Conference on Artificial Intelligence (AAAI 2004).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA.

Andrew McCallum and David Jensen. 2003. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI'03 Workshop on Learning Statistical Models from Relational Data.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001.

Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proc. 16th International Conf. on Machine Learning, pages 406–414. Morgan Kaufmann, San Francisco, CA.
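To make the constrained-forward computation of Section 3 concrete, the following toy sketch builds an unconstrained and a constrained forward lattice and reports the ratio Z′_o/Z_o. The states and log-potentials are invented stand-ins for a trained CRF's feature sums Σ_k λ_k f_k(s′, s, o, t); this is an illustrative sketch, not the paper's model.

```python
import math

# Toy 3-state lattice over T = 4 tokens.
STATES = ["B-JobTitle", "I-JobTitle", "O"]
T = 4

def log_potential(s_prev, s, t):
    # Hypothetical fixed scores standing in for the learned feature sums.
    score = 0.0
    if s_prev == "B-JobTitle" and s == "I-JobTitle":
        score += 1.5          # reward continuing a field after it begins
    if s_prev == "O" and s == "I-JobTitle":
        score -= 2.0          # penalize entering a field without a B- state
    if s == "O":
        score += 0.5          # mild preference for the background state
    return score

def forward_mass(allowed):
    """Total exp-score of all state sequences with s_t restricted to allowed[t]."""
    # Forward recursion: alpha_{t+1}(s) = sum_{s'} alpha_t(s') * exp(log_potential(s', s, t)),
    # with disallowed states zeroed out, as in the constrained recursion above.
    alpha = {s: (1.0 if s in allowed[0] else 0.0) for s in STATES}
    for t in range(1, T):
        new_alpha = {}
        for s in STATES:
            if s in allowed[t]:
                new_alpha[s] = sum(alpha[sp] * math.exp(log_potential(sp, s, t))
                                   for sp in STATES)
            else:
                new_alpha[s] = 0.0
        alpha = new_alpha
    return sum(alpha.values())

# Unconstrained lattice value Z_o: every state allowed at every position.
Z = forward_mass([set(STATES)] * T)

# Constraint set C for a hypothetical JobTitle field at positions 1-2:
# positive constraints inside the field, a negative constraint at its boundary.
allowed_C = [set(STATES),                      # position 0: unconstrained
             {"B-JobTitle"},                   # position 1: must begin the field
             {"I-JobTitle"},                   # position 2: must continue the field
             set(STATES) - {"I-JobTitle"}]     # position 3: must NOT continue it
Z_C = forward_mass(allowed_C)

confidence = Z_C / Z   # the CFB estimate Z'_o / Z_o
```

Because every path through the constrained lattice also appears in the unconstrained one, the ratio lies strictly between 0 and 1 and equals the posterior probability mass of all labelings consistent with the extracted field.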