
Evaluation of machine learning-based information extraction algorithms: criticisms and recommendations


Abstract

We survey the evaluation methodology adopted in information extraction (IE), as defined in a few different efforts applying machine learning (ML) to IE. We identify a number of critical issues that hamper comparison of the results obtained by different researchers. Some of these issues are common to other NLP-related tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Some issues are specific to IE: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an IE task, these issues should be explicitly addressed and a number of methodological characteristics should be clearly defined. To empirically verify the practical impact of these issues, we survey the results of different algorithms when applied to a few standard datasets. The survey shows a serious lack of consensus on these issues, which makes it difficult to draw firm conclusions from a comparative evaluation of the algorithms. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. Widespread agreement on this proposal should lead to future IE comparative evaluations that are fair and reliable. To demonstrate how the methodology is to be applied, we organized and ran a comparative evaluation of ML-based IE systems (the Pascal Challenge on ML-based IE) in which the principles described in this article are put into practice. In this article we describe the proposed methodology and its motivations; we then describe the Pascal evaluation and present its results.


Notes

  1. The corpora for MUC-3 and MUC-4 are freely available on the MUC web site (http://www-nlpir.nist.gov/related_projects/muc), while those of MUC-6 and MUC-7 can be purchased via the Linguistic Data Consortium (http://ldc.upenn.edu).

  2. http://www.nist.gov/speech/tests/ace.

  3. http://biocreative.sourceforge.net.

  4. Note that the occurrences considered here are only those that can be interpreted without resorting to any kind of contextual reasoning. Hence, phenomena related to coreference resolution are not considered at all.

  5. Roth and Yih (2002) do, however, also include results for Job Postings, and Chieu and Ng (2002) also report results on Management Succession.

  6. Note that here we are not taking into account the corpora made available during the MUC conferences which, because of the complexity of the IE tasks, have rarely been used in IE experiments after the MUC conferences. Hirschman (1998) provides an overview of such corpora and of the related IE tasks.

  7. See footnote 14.

  8. Downloadable from the RISE repository: http://www.isi.edu/info-agents/RISE/repository.html.

  9. Califf (1998), Freitag and Kushmerick (2000), and Finn and Kushmerick (2004a, b) use exactly the same partitions as Freitag (1997).

  10. Their paper is not completely clear on this point, but they have confirmed to us that they adopted the five-run setup (personal communication).

  11. Available from the RISE repository: http://www.isi.edu/info-agents/RISE/repository.html. The collection we refer to in the article is the following: http://www.isi.edu/info-agents/RISE/Jobs/SecondSetOfDocuments.tar.Z.

  12. Available from ftp://ftp.cs.utexas.edu/pub/mooney/job-data/job600.tar.gz.

  13. http://www.daviddlewis.com/resources/testcollections/reuters21578.

  14. The “all slots” figures are obtained by aggregating the confusion matrices over all fields, rather than by averaging results from field-specific confusion matrices. This approach is called “microaveraging” in the text classification literature; a small numeric sketch is given at the end of these notes.

  15. PASCAL was a Network of Excellence on “Pattern Analysis, Statistical Modelling and Computational Learning” funded by the European Commission as part of FP6. In March 2008 the follow-up Network of Excellence PASCAL2 was started as part of FP7.

  16. http://tcc.itc.it/research/textec/tools-resources/ties.html.

  17. Weka is a collection of open-source software implementing ML algorithms for data mining tasks: http://www.cs.waikato.ac.nz/ml/weka

  18. Note that the swapping of outcomes is performed at the document level, not at the level of individual markups.
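
To make the microaveraging distinction in footnote 14 concrete, here is a minimal sketch in Python; the per-field confusion counts are invented for illustration and do not come from any dataset discussed in the paper.

```python
# Invented per-field confusion counts: (true positives, false positives, false negatives).
fields = {
    "speaker":  (40, 10, 20),
    "location": (70,  5, 10),
}

def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Microaveraging: aggregate the confusion matrices over all fields first.
tp, fp, fn = (sum(c[i] for c in fields.values()) for i in range(3))
micro = f1(tp, fp, fn)

# Macroaveraging: average the field-specific F1 scores instead.
macro = sum(f1(*c) for c in fields.values()) / len(fields)

print(round(micro, 3), round(macro, 3))  # 0.83 0.815
```

Even on these two invented fields the "all slots" figures differ (about 0.830 microaveraged versus 0.815 macroaveraged), which is why the aggregation method must be reported.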

References

  • Califf, M. E. (1998). Relational learning techniques for natural language information extraction. Ph.D. thesis, University of Texas at Austin.

  • Califf, M., & Mooney, R. (2003). Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4, 177–210.


  • Chieu, H. L., & Ng, H. T. (2002). Probabilistic reasoning for entity and relation recognition. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI 2002).

  • Chinchor, N., Hirschman, L., & Lewis, D. D. (1993). Evaluating message understanding systems: An analysis of the third Message Understanding Conference (MUC-3). Computational Linguistics, 19(3), 409–449.


  • Ciravegna, F. (2001a). Adaptive information extraction from text by rule induction and generalisation. In Proceedings of 17th International Joint Conference on Artificial Intelligence (IJCAI-01). Seattle, WA.

  • Ciravegna, F. (2001b). (LP)2, an adaptive algorithm for information extraction from web-related texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining. Seattle, WA.

  • Ciravegna, F., Dingli, A., Petrelli, D., & Wilks, Y. (2002). User-system cooperation in document annotation based on information extraction. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02).

  • Ciravegna, F., & Lavelli, A. (2004). LearningPinocchio: Adaptive information extraction for real world applications. Journal of Natural Language Engineering, 10(2), 145–165.


  • Daelemans, W., & Hoste, V. (2002). Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain.

  • Daelemans, W., Hoste, V., Meulder, F. D., & Naudts, B. (2003). Combined optimization of feature selection and algorithm parameters in machine learning of language. In Proceedings of the 14th European Conference on Machine Learning (ECML 2003). Cavtat-Dubrovnik, Croatia.

  • De Sitter, A., & Daelemans, W. (2003). Information extraction via double classification. In Proceedings of the ECML/PKDD 2003 Workshop on Adaptive Text Extraction and Mining (ATEM 2003). Cavtat-Dubrovnik, Croatia.

  • Douthat, A. (1998). The Message Understanding Conference scoring software user’s manual. In Proceedings of the 7th Message Understanding Conference (MUC-7). http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_sw/muc_sw_manual.html.

  • Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.


  • Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL 2005).

  • Finn, A., & Kushmerick, N. (2004a). Information extraction by convergent boundary classification. In Proceedings of the AAAI 2004 Workshop on Adaptive Text Extraction and Mining (ATEM 2004). San Jose, California.

  • Finn, A., & Kushmerick, N. (2004b). Multi-level boundary classification for information extraction. In Proceedings of the 15th European Conference on Machine Learning. Pisa, Italy.

  • Freitag, D. (1997). Using grammatical inference to improve precision in information extraction. In Proceedings of the ICML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition. Nashville, Tennessee.

  • Freitag, D. (1998). Machine learning for information extraction in informal domains. Ph.D. thesis, Carnegie Mellon University.

  • Freitag, D., & Kushmerick, N. (2000). Boosted wrapper induction. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI 2000). Austin, Texas.

  • Habert, B., Adda, G., Adda-Decker, M., de Mareuil, P. B., Ferrari, S., Ferret, O., Illouz, G., & Paroubek, P. (1998). Towards tokenization evaluation. In Proceedings of 1st International Conference on Language Resources and Evaluation (LREC-98). Granada, Spain.

  • Hirschman, L. (1998). The evolution of evaluation: Lessons from the Message Understanding Conferences. Computer Speech and Language, 12(4), 281–305.


  • Hoste, V., Hendrickx, I., Daelemans, W., & van den Bosch, A. (2002). Parameter optimization for machine-learning of word sense disambiguation. Natural Language Engineering, 8(4), 311–325.


  • Ireson, N., Ciravegna, F., Califf, M. E., Freitag, D., Kushmerick, N., & Lavelli, A. (2005). Evaluating machine learning for information extraction. In Proceedings of 22nd International Conference on Machine Learning (ICML 2005). Bonn, Germany.

  • Iria, J., & Ciravegna, F. (2006). A methodology and tool for representing language resources for information extraction. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy.

  • Kosala, R., & Blockeel, H. (2000). Instance-based wrapper induction. In Proceedings of the Tenth Belgian-Dutch Conference on Machine Learning (Benelearn 2000). Tilburg, The Netherlands, pp. 61–68.

  • Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.


  • Li, Y., Bontcheva, K., & Cunningham, H. (2005a). SVM based learning system for information extraction. In J. Winkler, M. Niranjan, & N. Lawrence (Eds.), Deterministic and statistical methods in machine learning, Vol. 3635 of LNAI (pp. 319–339). Springer Verlag.

  • Li, Y., Bontcheva, K., & Cunningham, H. (2005b). Using uneven margins SVM and perceptron for information extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CONLL 2005).

  • Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999). Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop. http://www.nist.gov/speech/publications/darpa99/pdf/dir10.pdf.

  • Noreen, E. W. (1989). Computer-intensive methods for testing hypotheses: An introduction. New York: Wiley.


  • Peshkin, L., & Pfeffer, A. (2003). Bayesian information extraction network. In Proceedings of 18th International Joint Conference on Artificial Intelligence (IJCAI 2003). Acapulco, Mexico.

  • RISE. (1998). A repository of online information sources used in information extraction tasks. Information Sciences Institute/USC. http://www.isi.edu/info-agents/RISE/index.html.

  • Roth, D., & Yih, W. (2001). Relational learning via propositional algorithms: An information extraction case study. In Proceedings of 17th International Joint Conference on Artificial Intelligence (IJCAI-01). Seattle, WA.

  • Roth, D., & Yih, W. (2002). Relational learning via propositional algorithms: An information extraction case study. Technical Report UIUCDCS-R-2002-2300, Department of Computer Science, University of Illinois at Urbana-Champaign.

  • Sigletos, G., Paliouras, G., Spyropoulos, C., & Hatzopoulos, M. (2005). Combining information extraction systems using voting and stacked generalization. Journal of Machine Learning Research, 6, 1751–1782.


  • Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1–3), 233–272.


  • Sutton, C., & McCallum, A. (2004). Collective segmentation and labeling of distant entities. In Proceedings of the ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.

Download references

Acknowledgements

F. Ciravegna, C. Giuliano, N. Ireson, A. Lavelli and L. Romano were supported by the IST-Dot.Kom project (http://www.dot-kom.org), sponsored by the European Commission as part of Framework V (grant IST-2001-34038). N. Kushmerick was supported by grant 101/F.01/C015 from Science Foundation Ireland and grant N00014-03-1-0274 from the US Office of Naval Research. We would like to thank Leon Peshkin for kindly providing us with his corrected version of the Seminar Announcement collection, and Scott Wen-Tau Yih for his tagged version of the Job Posting collection. We would also like to thank Hai Leong Chieu, Leon Peshkin, and Scott Wen-Tau Yih for answering our questions concerning the settings of their experiments. Finally, we are indebted to the anonymous reviewers of this article for their valuable comments.

Author information

Correspondence to Alberto Lavelli.

Appendix

1.1 Statistical significance testing

The objective in many papers on IE is to show that some innovation leads to better performance than a reasonable baseline. Often this involves the comparison of two or more system variants, at least one of which constitutes the baseline, and one of which embodies the innovation. Typically, the preferred variant achieves the highest scores, if only by small margins, and often this is taken as sufficient evidence of general improvement, even though the test sets in many IE domains are relatively small.

Approximate randomization is a computer-intensive procedure for estimating the statistical significance of a score difference in cases where the predictions of the two systems under comparison are aligned at the unit level (Noreen 1989). For example, Chinchor et al. (1993) used this procedure to assess the pairwise separation among the participants in MUC-3.

Table 5 presents pseudocode for the approximate randomization procedure. The procedure involves a large number (M) of passes through the test set. Each pass involves swapping the baseline and preferred outcomes on approximately half of the test documents, yielding two new “swapped” scores (see footnote 18). The fraction of passes for which the swapped gap between systems is at least as large as the observed gap is an estimate of the p value associated with the observed score difference. If this fraction is less than or equal to the desired significance level (typically 0.05), we are justified in concluding that the observed difference in scores between baseline and preferred is significant.

Table 5 The approximate randomization procedure
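
As a concrete counterpart to the pseudocode of Table 5, here is a minimal sketch in Python. It assumes, purely for illustration, that each system's output on a test document has been reduced to a (true positive, false positive, false negative) triple and that the score of interest is microaveraged F1; the function and variable names are ours, not those of the MUC scorer or the Pascal Challenge software.

```python
import random

def f1(counts):
    """Microaveraged F1 over per-document (tp, fp, fn) triples."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def approximate_randomization(baseline, preferred, trials=9999, seed=0):
    """Estimate the p value of the observed F1 gap between two systems.

    baseline and preferred are lists of per-document (tp, fp, fn)
    triples, aligned so that index i refers to the same test document.
    Each pass swaps the two systems' outcomes on a random subset of
    documents (at the document level, cf. footnote 18) and checks
    whether the swapped gap is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(f1(preferred) - f1(baseline))
    hits = 0
    for _ in range(trials):
        a_swapped, b_swapped = [], []
        for a, b in zip(baseline, preferred):
            if rng.random() < 0.5:  # swap outcomes on roughly half the documents
                a, b = b, a
            a_swapped.append(a)
            b_swapped.append(b)
        if abs(f1(b_swapped) - f1(a_swapped)) >= observed:
            hits += 1
    # (hits + 1) / (trials + 1) is the usual add-one estimate of the p value
    return (hits + 1) / (trials + 1)
```

If the returned value is at most the chosen significance level (e.g., 0.05), the observed difference is unlikely to be an artifact of the particular test documents.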

In many cases, a relevant baseline is difficult to establish or acquire for the purpose of a paired comparison; often the most salient comparison is with numbers reported in the literature. In such cases, confidence bounds are critical for ascertaining the significance of a result. However, calculating confidence bounds on a score such as the F-measure is cumbersome and possibly dubious, since it is unclear what parametric assumptions to make. Fortunately, we can apply the bootstrap, another computer-intensive procedure, to model the distribution of possible F-measures and assess confidence bounds (Efron and Tibshirani 1993).

Table 6 sketches this procedure. As in approximate randomization, we iterate a large number (M, typically at least 1000) of times. In each iteration, we calculate the statistic of interest (e.g., the F-measure) on a set of documents formed by sampling the test set with replacement. The resulting sample of scores may then be used to assess confidence bounds. In an approach called the percentile bootstrap, these scores are sorted and binned by quantile, and the upper and lower values of the confidence interval are read off directly. For example, the lower bound of the 90% confidence interval lies between the maximum score among the lowest 5% and the next score in an ordering from least to greatest. Obviously, for this computation to be valid, M must be sufficiently large. Additional caveats apply, and interested readers are referred to Efron and Tibshirani (1993).

Table 6 The bootstrap procedure
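
The following is a minimal Python sketch of the percentile bootstrap, under the same illustrative assumptions as the previous sketch (per-document (tp, fp, fn) triples, microaveraged F1, invented function names); the defaults m = 1000 and alpha = 0.10 correspond to the 90% interval in the example above.

```python
import random

def f1(counts):
    """Microaveraged F1 over per-document (tp, fp, fn) triples."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def percentile_bootstrap(counts, statistic=f1, m=1000, alpha=0.10, seed=0):
    """Percentile-bootstrap confidence interval for a test-set statistic.

    counts is a list of per-document (tp, fp, fn) triples. Each of the
    m iterations resamples the test documents with replacement and
    recomputes the statistic; the interval bounds are then read off
    the sorted scores by quantile.
    """
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        statistic([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(m)
    )
    lower = scores[int(m * alpha / 2)]                  # just above the lowest 5% when alpha = 0.10
    upper = scores[min(int(m * (1 - alpha / 2)), m - 1)]
    return lower, upper
```

With m = 1000 and alpha = 0.10, the lower bound is scores[50], i.e. the score immediately above the lowest 5% of the sample, matching the description above. Quantile conventions vary slightly across bootstrap variants, so this indexing is one reasonable choice rather than the only one.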

1.2 Glossary

In the table below we list the names/acronyms of the systems mentioned in the paper, together with their full names and bibliographical references.

BIEN: Bayesian Information Extraction Network (Peshkin and Pfeffer 2003)
BWI: Boosted Wrapper Induction (Freitag and Kushmerick 2000)
CProb: Bayesian Prediction Combination (Freitag 1998)
Elie: Adaptive Information Extraction Algorithm (Finn and Kushmerick 2004a, b)
(LP)2: Adaptive Information Extraction Algorithm (Ciravegna 2001a)
ME2: Maximum Entropy Classifier (Chieu and Ng 2002)
PAUM: Perceptron Algorithm with Uneven Margins (Li et al. 2005b)
RAPIER: Robust Automated Production of Information Extraction Rules (Califf 1998)
SNoW: Sparse Network of Winnows (Roth and Yih 2001, 2002)
SRV: Symbolic Relational Learner (Freitag 1998)
SVMUM: Support Vector Machine with Uneven Margins (Li et al. 2005a)
TIES: Trainable Information Extraction System (see footnote 16)
T-Rex: Trainable Relation Extraction (Iria and Ciravegna 2006)
WHISK: (Soderland 1999)


Cite this article

Lavelli, A., Califf, M.E., Ciravegna, F. et al. Evaluation of machine learning-based information extraction algorithms: criticisms and recommendations. Lang Resources & Evaluation 42, 361–393 (2008). https://doi.org/10.1007/s10579-008-9079-3
