Abstract
Linguistic knowledge in natural language understanding systems is commonly stratified across several levels, and Information Extraction is no exception. Typical state-of-the-art Information Extraction systems require, among other kinds of knowledge, syntactic-semantic patterns for locating facts or events in text; domain-specific word or concept classes for semantic generalization; and a specialized lexicon of terms that may not be found in general-purpose dictionaries.
We describe an approach to unsupervised, or minimally supervised, knowledge acquisition. The approach bootstraps a comprehensive knowledge base from a small set of seed elements. It is embodied in algorithms for discovering patterns, concept classes, and lexicon from raw, unannotated text.
We present the results of knowledge acquisition, and examine them in the context of prior work. We discuss problems in evaluating the quality of the acquired knowledge, and methodologies for evaluation.
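The bootstrapping idea in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's algorithm: the function `bootstrap_patterns`, the scoring formula (precision weighted by log support), the `min_precision` threshold, and the toy documents and pattern strings are all assumptions made for illustration. Each document is represented as the set of candidate patterns found in it, and a candidate is accepted when it co-occurs mostly with documents already matched by accepted patterns.

```python
import math


def bootstrap_patterns(docs, seeds, rounds=5, min_precision=0.6):
    """Grow a pattern set from seeds (illustrative scoring, not the paper's).

    docs  -- list of sets; each set holds the candidate patterns of one document
    seeds -- initial set of trusted patterns
    """
    accepted = set(seeds)
    for _ in range(rounds):
        best, best_score = None, 0.0
        candidates = set().union(*docs) - accepted
        for p in sorted(candidates):  # sorted for deterministic tie-breaking
            containing = [d for d in docs if p in d]
            # A document counts as relevant if any accepted pattern matches it.
            hits = sum(1 for d in containing if d & accepted)
            precision = hits / len(containing)
            if precision < min_precision:
                continue
            score = precision * math.log(1 + hits)
            if score > best_score:
                best, best_score = p, score
        if best is None:  # no candidate is reliable enough; stop early
            break
        accepted.add(best)  # accept one pattern per round
    return accepted


# Hypothetical toy corpus: pattern strings stand in for parsed clauses.
docs = [
    {"company appoint person", "person resign"},
    {"company appoint person", "person succeed person"},
    {"company appoint person", "person succeed person", "person resign"},
    {"person resign", "person succeed person"},
    {"company report profit", "stock rise"},
    {"company report profit", "person resign"},
]
seeds = {"company appoint person"}

result = bootstrap_patterns(docs, seeds)
print(sorted(result))
```

Accepting only the single best candidate per round keeps the pattern set conservative. The precision threshold is the guard against drifting toward off-topic patterns such as the profit report in the toy corpus.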
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Yangarber, R. (2003). Acquisition of Domain Knowledge. In: Pazienza, M.T. (ed.) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science, vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_1
DOI: https://doi.org/10.1007/978-3-540-45092-4_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40579-5
Online ISBN: 978-3-540-45092-4
eBook Packages: Springer Book Archive