Probabilistic Model for Structured Document Mapping

Application to Automatic HTML to XML Conversion

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4571)

Abstract

We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that links input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML to XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.
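
The abstract does not spell out the dynamic programming inference. As a rough illustration only, and not the authors' algorithm, the following Python sketch shows Viterbi-style dynamic programming assigning target XML labels to a sequence of HTML leaf tags. The function name `viterbi`, the label set, and all probabilities are hypothetical toy values chosen for this example; the paper's actual model operates on document trees and may differ substantially.

```python
import math


def logs(table):
    """Convert a dict of probabilities into a dict of log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


def viterbi(observations, labels, log_start, log_trans, log_emit):
    """Return the highest-scoring label sequence for the observations."""
    if not observations:
        return []
    # best[i][y]: best log-score of any labelling of observations[:i+1] ending in y
    first = observations[0]
    best = [{y: log_start[y] + log_emit[y].get(first, -math.inf) for y in labels}]
    back = []  # back[i][y]: best predecessor of label y at position i+1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log_trans[p][y])
            scores[y] = (best[-1][prev] + log_trans[prev][y]
                         + log_emit[y].get(obs, -math.inf))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: two target XML labels, two HTML tags.
LABELS = ["title", "para"]
LOG_START = logs({"title": 0.8, "para": 0.2})
LOG_TRANS = {
    "title": logs({"title": 0.1, "para": 0.9}),
    "para": logs({"title": 0.2, "para": 0.8}),
}
LOG_EMIT = {
    "title": logs({"h1": 0.9, "p": 0.1}),
    "para": logs({"h1": 0.1, "p": 0.9}),
}

if __name__ == "__main__":
    print(viterbi(["h1", "p", "p"], LABELS, LOG_START, LOG_TRANS, LOG_EMIT))
```

Working in log space avoids numerical underflow on long sequences. An approximate search method such as LaSO would instead explore a pruned set of partial labellings rather than filling the full table.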

Editor information

Petra Perner

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wisniewski, G., Maes, F., Denoyer, L., Gallinari, P. (2007). Probabilistic Model for Structured Document Mapping. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2007. Lecture Notes in Computer Science, vol 4571. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73499-4_64

  • DOI: https://doi.org/10.1007/978-3-540-73499-4_64

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73498-7

  • Online ISBN: 978-3-540-73499-4

  • eBook Packages: Computer Science, Computer Science (R0)