Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences

Conference paper
In: On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE (OTM 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3291)

Abstract

Numerous data sources, such as classified ads in online newspapers, electronic product catalogs, and postal addresses, are rife with unstructured text content. Typically such content is characterized by attribute value sequences having a common schema. In addition, each sequence is unstructured free text without any separators between the attribute values. Hidden Markov Models (HMMs) have been used for creating structured content from such text sequences by identifying and extracting the attribute values occurring in them. Extant approaches to creating "structured content from text sequences" based on HMMs use either completely labeled or completely unlabeled training data. The HMMs resulting from these two dominant approaches present contrasting trade-offs with respect to labeling effort and the recall/precision of the extracted attribute values. In this paper, we propose an HMM-based algorithm that uses partially labeled training data for creating structured content from text sequences. By exploiting the observation that partially labeled sequences give rise to independent subsequences, we compose the HMMs corresponding to these subsequences to create structured content from the complete sequence. An interesting aspect of our approach is that it gives rise to a family of HMMs spanning the trade-off spectrum. We present experimental evidence of the effectiveness of our algorithm on real-life data sets and demonstrate that it is indeed possible to bootstrap structured content creation from schematic text data sources using HMMs that require limited labeling effort, without compromising on the recall/precision performance metrics.
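To make the extraction task concrete: the core mechanism the abstract describes is labeling each token of a separator-free sequence with an attribute name via Viterbi decoding over an HMM whose states correspond to attributes. The following is a minimal illustrative sketch, not the authors' algorithm: the state set (`name`, `city`, `zip`), the transition/start probabilities, and the surface-feature emission model are all invented for the example.

```python
import math

# Toy HMM for segmenting an address-like token sequence into attribute
# values. States are attribute names; all probabilities are invented.
STATES = ["name", "city", "zip"]
START = {"name": 0.8, "city": 0.1, "zip": 0.1}
TRANS = {
    "name": {"name": 0.5, "city": 0.4, "zip": 0.1},
    "city": {"name": 0.1, "city": 0.4, "zip": 0.5},
    "zip":  {"name": 0.1, "city": 0.1, "zip": 0.8},
}

def emit(state, token):
    """Toy emission probability based on surface features of the token."""
    if token.isdigit():
        return 0.9 if state == "zip" else 0.05
    if token[0].isupper():
        return 0.45 if state in ("name", "city") else 0.1
    return 1.0 / 3

def viterbi(tokens):
    """Most likely attribute label sequence, via dynamic programming
    in log space (standard Viterbi recursion)."""
    best = [{s: (math.log(START[s]) + math.log(emit(s, tokens[0])), [s])
             for s in STATES}]
    for tok in tokens[1:]:
        layer = {}
        for s in STATES:
            score, path = max(
                (best[-1][p][0] + math.log(TRANS[p][s]), best[-1][p][1])
                for p in STATES)
            layer[s] = (score + math.log(emit(s, tok)), path + [s])
        best.append(layer)
    return max(best[-1].values())[1]

print(viterbi(["John", "Smith", "Seattle", "98101"]))
# → ['name', 'name', 'city', 'zip']
```

In the paper's setting, a partial label (say, the user marks "Seattle" as `city`) pins one state and splits the sequence into independent subsequences on either side of it, each of which can be decoded and trained separately before the sub-HMMs are composed.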



Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mukherjee, S., Ramakrishnan, I.V. (2004). Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30469-2_6

  • DOI: https://doi.org/10.1007/978-3-540-30469-2_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23662-7

  • Online ISBN: 978-3-540-30469-2