A Semi-supervised Algorithm for Pattern Discovery in Information Extraction from Textual Data

Wu, Tianhao; Pottenger, William M.

doi:10.1007/3-540-36175-8_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

In this article we present a semi-supervised algorithm for pattern discovery in information extraction from textual data. The patterns that are discovered take the form of regular expressions that generate regular languages. We term our approach ‘semi-supervised’ because it requires significantly less effort to develop a training set than other approaches. From the training data our algorithm automatically generates regular expressions that can be used on previously unseen data for information extraction. Our experiments show that the algorithm has good testing performance on many features that are important in the fight against terrorism.
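
To make concrete what using a regular expression "on previously unseen data for information extraction" can look like, here is a minimal Python sketch. It does not reproduce the discovery algorithm, which the abstract does not detail. The feature (a person's age), the pattern `AGE_PATTERN`, the helper `extract_ages`, and the sample sentences are all hypothetical and written by hand for illustration.

```python
import re
from typing import List

# Hypothetical pattern for an "age" feature. It is hand-written for this
# illustration and is NOT a pattern produced by the algorithm in the paper.
AGE_PATTERN = re.compile(r"\b(\d{1,3})[- ]years?[- ]old\b", re.IGNORECASE)


def extract_ages(text: str) -> List[int]:
    """Return every age matched by AGE_PATTERN in `text`, as integers."""
    return [int(match.group(1)) for match in AGE_PATTERN.finditer(text)]


# Made-up sentences standing in for previously unseen text.
unseen = [
    "The witness described a 35-year-old man leaving the store.",
    "A woman, 42 years old, reported the incident.",
    "No description of the driver was available.",
]

results = [extract_ages(sentence) for sentence in unseen]
print(results)
```

Under this sketch, a sentence with no match simply yields an empty list, so a pattern can be run over every sentence of a document without special-casing.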

Author information

Authors and Affiliations

Computer Science and Engineering, Lehigh University, USA
Tianhao Wu & William M. Pottenger

Editor information

Editors and Affiliations

Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
Kyu-Young Whang
Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
Jongwoo Jeon
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim
Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
Jaideep Srivastava

About this paper

Cite this paper

Wu, T., Pottenger, W.M. (2003). A Semi-supervised Algorithm for Pattern Discovery in Information Extraction from Textual Data. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_12

DOI: https://doi.org/10.1007/3-540-36175-8_12
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive
