Abstract
In this article we present a semi-supervised algorithm for pattern discovery in information extraction from textual data. The patterns that are discovered take the form of regular expressions that generate regular languages. We term our approach ‘semi-supervised’ because it requires significantly less effort to develop a training set than other approaches. From the training data our algorithm automatically generates regular expressions that can be used on previously unseen data for information extraction. Our experiments show that the algorithm has good testing performance on many features that are important in the fight against terrorism.
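To make the idea of learning a regular expression from examples and applying it to unseen text concrete, here is a minimal sketch. It is not the algorithm from the paper, which the abstract does not describe in detail. It is a toy generalizer that assumes each labeled training example can be abstracted character by character into digit, letter, or literal classes. The function names (`generalize`, `learn_pattern`, `extract`) and the "age" feature are hypothetical and chosen only for illustration.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a coarse regex class (an assumed abstraction)."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and spaces as literals


def generalize(example):
    """Turn one labeled example into a regex by collapsing runs of a class."""
    classes = [char_class(ch) for ch in example]
    parts = []
    for cls, run in groupby(classes):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


def learn_pattern(examples):
    """Combine the generalized examples into one alternation."""
    alternatives = sorted({generalize(e) for e in examples})
    return re.compile("|".join(f"(?:{a})" for a in alternatives))


def extract(pattern, text):
    """Apply a learned pattern to previously unseen text."""
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical training segments labeled with an "age" feature.
training = ["35 years old", "19 years old"]
pattern = learn_pattern(training)
print(extract(pattern, "The suspect, 42 years old, fled on foot."))
```

A generalizer this coarse over-matches (any two digits followed by a five-letter and a three-letter word would qualify), which is one reason a practical system would need to score candidate patterns against training data rather than accept them directly.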
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wu, T., Pottenger, W.M. (2003). A Semi-supervised Algorithm for Pattern Discovery in Information Extraction from Textual Data. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science(), vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_12
DOI: https://doi.org/10.1007/3-540-36175-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive