Abstract
Most of the information stored in digital form is hidden in natural language texts. Extracting it and storing it in a formal representation (e.g., as relations in a database) allows efficient querying, easy administration, and further automatic processing of the extracted data. The field of information extraction (IE) comprises techniques, algorithms, and methods that perform two important tasks: finding (identifying) the desired, relevant data and storing it in an appropriate form for future use.
The rapidly increasing number and diversity of IE systems are evidence of continuous activity and growing attention to this field. At the same time, it is becoming increasingly difficult to gain an overview of the scope of IE and to see the advantages of certain approaches and how they differ from others. In this paper we identify and describe promising approaches to IE. Our focus is on adaptive systems that can be customized for new domains through training or the use of external knowledge sources. Based on the observed origins and requirements of the examined IE techniques, we establish a classification of different types of adaptive IE systems.
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Siefkes, C., Siniakov, P. (2005). An Overview and Classification of Adaptive Approaches to Information Extraction. In: Spaccapietra, S. (eds) Journal on Data Semantics IV. Lecture Notes in Computer Science, vol 3730. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603412_6
DOI: https://doi.org/10.1007/11603412_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31001-3
Online ISBN: 978-3-540-31447-9
eBook Packages: Computer Science (R0)