Acquisition of Domain Knowledge

Yangarber, Roman

doi:10.1007/978-3-540-45092-4_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2700)

Included in the following conference series:

International Summer School on Information Extraction

Abstract

Linguistic knowledge in Natural Language understanding systems is commonly stratified across several levels. This is true of Information Extraction as well. Typical state-of-the-art Information Extraction systems require syntactic-semantic patterns for locating facts or events in text; domain-specific word or concept classes for semantic generalization; and a specialized lexicon of terms that may not be found in general-purpose dictionaries, among other kinds of knowledge.

We describe an approach to unsupervised, or minimally supervised, knowledge acquisition. The approach is based on bootstrapping a comprehensive knowledge base from a small set of seed elements. It is embodied in algorithms for discovering patterns, concept classes, and lexicon from raw, unannotated text.
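
The general idea of bootstrapping from seeds can be illustrated with a minimal sketch. This is not the chapter's algorithm: the function name `bootstrap_patterns`, the scoring formula, the representation of each document as a set of pattern strings, and the toy data are all assumptions made for illustration only.

```python
import math


def bootstrap_patterns(docs, seeds, rounds=3, min_support=1):
    """Grow a pattern set from seeds (hypothetical illustration).

    docs  : list of sets; each set holds the pattern strings found in one document
    seeds : set of patterns trusted at the start
    """
    accepted = set(seeds)
    for _ in range(rounds):
        # A document counts as relevant if it matches any accepted pattern.
        relevant = [d for d in docs if d & accepted]
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for p in sorted(candidates):  # sorted for deterministic tie-breaking
            total = sum(1 for d in docs if p in d)
            hits = sum(1 for d in relevant if p in d)
            if hits < min_support:
                continue
            # Assumed score: precision on relevant documents, weighted by support.
            score = (hits / total) * math.log(hits + 1)
            if score > best_score:
                best, best_score = p, score
        if best is None:
            break  # no candidate co-occurs with the relevant documents
        accepted.add(best)
    return accepted


# Toy corpus: three management-succession documents and two unrelated ones.
docs = [
    {"company appoint person", "person resign"},
    {"company appoint person", "person succeed person"},
    {"person resign", "person succeed person"},
    {"company report profit"},
    {"company report profit", "company buy company"},
]
seeds = {"company appoint person"}
learned = bootstrap_patterns(docs, seeds)
print(sorted(learned))
```

In this toy run, each accepted pattern enlarges the set of relevant documents, which in turn lets further patterns score well; patterns that never co-occur with relevant documents are never accepted.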

We present the results of knowledge acquisition, and examine them in the context of prior work. We discuss problems in evaluating the quality of the acquired knowledge, and methodologies for evaluation.

Author information

Authors and Affiliations

Courant Institute of Mathematical Sciences, New York University, New York, USA
Roman Yangarber

Editor information

Editors and Affiliations

DISP, University of Tor Vergata, Via del Politecnico 1, Rome, Italy
Maria Teresa Pazienza

About this paper

Cite this paper

Yangarber, R. (2003). Acquisition of Domain Knowledge. In: Pazienza, M.T. (ed.) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science (LNAI), vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_1

DOI: https://doi.org/10.1007/978-3-540-45092-4_1
Published: 28 August 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40579-5
Online ISBN: 978-3-540-45092-4
eBook Packages: Springer Book Archive
