Abstract
We examine an automated mechanism that gives users structured access to the information in unformatted text records by segmenting the records into structured elements, annotating the resulting documents with XML tags, and applying specific query processing techniques. This research is a first step toward an automatic ontology generation system. We therefore focus on explaining how structure can be extracted automatically when the system is seeded with a small number of training examples. We propose an approach based on Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. We introduce two HMMs for information extraction from different sources, using bibliography entries and Call for Papers documents as training data. The proposed HMMs learn to distinguish the fields and then extract the title, authors, conference or journal names, and other fields from the text.
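To make the idea of HMM-based field segmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single flat HMM with one state per field, add-one smoothed counts estimated from a handful of labelled token sequences, and Viterbi decoding. The names `train`, `viterbi`, `to_xml`, and `TRAIN`, and the tiny training set, are hypothetical and chosen only for illustration. The sketch omits the length distribution and the external data dictionary mentioned above.

```python
import math
from collections import Counter, defaultdict
from itertools import groupby
from xml.sax.saxutils import escape


def train(examples, alpha=1.0):
    """Estimate add-alpha smoothed log-probabilities from labelled sequences."""
    states = sorted({s for ex in examples for _, s in ex})
    vocab = {w.lower() for ex in examples for w, _ in ex} | {"<unk>"}
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for ex in examples:
        start[ex[0][1]] += 1
        for (_, a), (_, b) in zip(ex, ex[1:]):
            trans[a][b] += 1
        for w, s in ex:
            emit[s][w.lower()] += 1

    def logp(counter, key, size):
        total = sum(counter.values())
        return math.log((counter[key] + alpha) / (total + alpha * size))

    n = len(states)
    return {
        "states": states,
        "vocab": vocab,
        "start": {s: logp(start, s, n) for s in states},
        "trans": {a: {b: logp(trans[a], b, n) for b in states} for a in states},
        "emit": {s: {w: logp(emit[s], w, len(vocab)) for w in vocab} for s in states},
    }


def viterbi(model, tokens):
    """Return the most probable field label for each token."""
    if not tokens:
        return []
    states = model["states"]
    # Words never seen in training are mapped to a shared <unk> symbol.
    obs = [t.lower() if t.lower() in model["vocab"] else "<unk>" for t in tokens]
    score = {s: model["start"][s] + model["emit"][s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + model["trans"][p][s])
            score[s] = prev[best] + model["trans"][best][s] + model["emit"][s][o]
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def to_xml(tokens, labels):
    """Wrap each run of identically labelled tokens in an XML element."""
    parts = []
    for label, group in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        text = " ".join(tok for _, tok in group)
        parts.append(f"<{label}>{escape(text)}</{label}>")
    return "".join(parts)


# Hypothetical seed data: two hand-labelled bibliography entries.
TRAIN = [
    [("Rabiner", "author"), ("L", "author"), ("A", "title"), ("tutorial", "title"),
     ("on", "title"), ("Hidden", "title"), ("Markov", "title"), ("Models", "title"),
     ("Proceedings", "venue"), ("of", "venue"), ("the", "venue"), ("IEEE", "venue")],
    [("Sebastiani", "author"), ("F", "author"), ("Machine", "title"),
     ("Learning", "title"), ("in", "title"), ("Automated", "title"),
     ("Text", "title"), ("Categorization", "title"), ("ACM", "venue"),
     ("Computing", "venue"), ("Surveys", "venue")],
]

model = train(TRAIN)
tokens = [w for w, _ in TRAIN[0]]
print(to_xml(tokens, viterbi(model, tokens)))
```

With only two seed entries the model mostly memorises the training vocabulary, so this illustrates the mechanics of training, decoding, and XML annotation rather than realistic accuracy.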
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yeom, KW., Park, JH. (2005). An Approach of Information Extraction from Web Documents for Automatic Ontology Generation. In: Hao, Y., et al. Computational Intelligence and Security. CIS 2005. Lecture Notes in Computer Science, vol 3801. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11596448_66
DOI: https://doi.org/10.1007/11596448_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30818-8
Online ISBN: 978-3-540-31599-5
eBook Packages: Computer Science, Computer Science (R0)