Abstract
Numerous data sources, such as classified ads in online newspapers, electronic product catalogs, and postal addresses, are rife with unstructured text content. Such content typically consists of attribute-value sequences that share a common schema, where each sequence is unstructured free text with no separators between the attribute values. Hidden Markov Models (HMMs) have been used to create structured content from such text sequences by identifying and extracting the attribute values occurring in them. Extant HMM-based approaches to creating structured content from text sequences use either completely labeled or completely unlabeled training data. The HMMs resulting from these two dominant approaches present contrasting trade-offs between labeling effort and the recall/precision of the extracted attribute values. In this paper we propose an HMM-based algorithm that uses partially labeled training data to create structured content from text sequences. By exploiting the observation that partially labeled sequences give rise to independent subsequences, we compose the HMMs corresponding to these subsequences to create structured content from the complete sequence. An interesting aspect of our approach is that it gives rise to a family of HMMs spanning the trade-off spectrum. We present experimental evidence of the effectiveness of our algorithm on real-life data sets and demonstrate that it is indeed possible to bootstrap structured content creation from schematic text data sources using HMMs that require limited labeling effort, without compromising the recall/precision performance metrics.
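To make the core idea concrete, the following is a minimal, illustrative sketch of how an HMM assigns attribute labels (hidden states) to the tokens of an unstructured text sequence via Viterbi decoding. The states, transition table, and emission table below are toy assumptions for a postal-address example, not the model, parameters, or composition algorithm from the paper.

```python
# Illustrative sketch only: a minimal Viterbi decoder that labels each token
# of a free-text sequence with an attribute (hidden state). All states and
# probabilities below are hypothetical toy values.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed tokens."""
    # V[t][s] = probability of the best path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(tokens[t], 1e-6), p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best path back from the final position.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy postal-address model (hypothetical states and probabilities).
states = ["HouseNo", "Street", "City"]
start_p = {"HouseNo": 0.8, "Street": 0.1, "City": 0.1}
trans_p = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street":  {"HouseNo": 0.05, "Street": 0.6, "City": 0.35},
    "City":    {"HouseNo": 0.1, "Street": 0.1, "City": 0.8},
}
emit_p = {
    "HouseNo": {"42": 0.9},
    "Street":  {"Main": 0.5, "Street": 0.5},
    "City":    {"Springfield": 0.9},
}

labels = viterbi(["42", "Main", "Street", "Springfield"],
                 states, start_p, trans_p, emit_p)
print(labels)  # -> ['HouseNo', 'Street', 'Street', 'City']
```

In the partially labeled setting the paper describes, known attribute values pin down some states, splitting the sequence into independent unlabeled subsequences; each subsequence can then be decoded by a sub-HMM of this form and the results composed.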
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Mukherjee, S., Ramakrishnan, I.V. (2004). Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30469-2_6
Print ISBN: 978-3-540-23662-7
Online ISBN: 978-3-540-30469-2
eBook Packages: Springer Book Archive