Abstract
Numerous data sources, such as classified ads in online newspapers, electronic product catalogs, and postal addresses, are rife with unstructured text content. Such content typically consists of attribute-value sequences that share a common schema, where each sequence is unstructured free text with no separators between the attribute values. Hidden Markov Models (HMMs) have been used to create structured content from such text sequences by identifying and extracting the attribute values occurring in them. Extant HMM-based approaches to creating structured content from text sequences use either completely labeled or completely unlabeled training data. The HMMs resulting from these two dominant approaches present contrasting trade-offs between labeling effort and the recall/precision of the extracted attribute values. In this paper we propose an HMM-based algorithm that uses partially labeled training data to create structured content from text sequences. By exploiting the observation that partially labeled sequences give rise to independent subsequences, we compose the HMMs corresponding to these subsequences to create structured content from the complete sequence. An interesting aspect of our approach is that it gives rise to a family of HMMs spanning the trade-off spectrum. We present experimental evidence of the effectiveness of our algorithm on real-life data sets and demonstrate that it is indeed possible to bootstrap structured content creation from schematic text data sources using HMMs that require limited labeling effort, without compromising the recall/precision performance metrics.
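To make the core idea concrete, the following is a minimal, illustrative sketch of how an HMM assigns attribute labels (hidden states) to the tokens of an unstructured text sequence via Viterbi decoding. The states, transition table, and emission table below are toy assumptions for a postal-address example, not the model, parameters, or composition algorithm from the paper.

```python
# Illustrative sketch only: a minimal Viterbi decoder that labels each token
# of a free-text sequence with an attribute (hidden state). All states and
# probabilities below are hypothetical toy values.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed tokens."""
    # V[t][s] = probability of the best path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(tokens[t], 1e-6), p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best path back from the final position.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy postal-address model (hypothetical states and probabilities).
states = ["HouseNo", "Street", "City"]
start_p = {"HouseNo": 0.8, "Street": 0.1, "City": 0.1}
trans_p = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street":  {"HouseNo": 0.05, "Street": 0.6, "City": 0.35},
    "City":    {"HouseNo": 0.1, "Street": 0.1, "City": 0.8},
}
emit_p = {
    "HouseNo": {"42": 0.9},
    "Street":  {"Main": 0.5, "Street": 0.5},
    "City":    {"Springfield": 0.9},
}

labels = viterbi(["42", "Main", "Street", "Springfield"],
                 states, start_p, trans_p, emit_p)
print(labels)  # -> ['HouseNo', 'Street', 'Street', 'City']
```

In the partially labeled setting the paper describes, known attribute values pin down some states, splitting the sequence into independent unlabeled subsequences; each subsequence can then be decoded by a sub-HMM of this form and the results composed.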
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Mukherjee, S., Ramakrishnan, I.V. (2004). Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30469-2_6
Print ISBN: 978-3-540-23662-7
Online ISBN: 978-3-540-30469-2
eBook Packages: Springer Book Archive