Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model

  • Chapter
Modeling, Learning, and Processing of Text Technological Data Structures

Part of the book series: Studies in Computational Intelligence (SCI, volume 370)

Abstract

We address the problem of learning to automatically map flat and semi-structured documents onto a mediated target XML schema. The problem is motivated by the recent development of applications for searching and mining semi-structured document sources and corpora. Academic research has mainly dealt with homogeneous collections. In practical applications, however, data come from multiple heterogeneous sources, and mining such collections requires defining a mapping, or correspondence, between the different document formats. Automating the design of such mappings has rapidly become a key issue for these applications. We propose a machine learning approach in which the mapping is learned from pairs of input documents and corresponding target documents provided by a user. The mapping process is formalized as a Markov Decision Process, and training is performed with Reinforcement Learning, a classical machine learning framework. The resulting model can cope with complex mappings while keeping linear complexity. We describe a set of experiments on several corpora representative of different mapping tasks and show that the method learns mappings with high accuracy across them.
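
To make the formulation in the abstract concrete, the following is a minimal sketch of how a document mapping could be cast as a Markov Decision Process and trained by reinforcement learning. It is not the authors' model: the tag set, the training pairs, the `observe` state description, the reward (1 when the chosen tag matches the target document) and the use of tabular Q-learning are all assumptions made for illustration. Each leaf of a flat input document is visited once and assigned a target tag, which is one way such a mapper can stay linear in the size of the input.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical target tags; the chapter's real schemas are not shown here.
TAGS = ["title", "author", "para"]

# Hypothetical training pairs: (flat input leaves, target tag per leaf).
EXAMPLE_PAIRS = [
    (["XML MAPPING", "by A. Writer", "Some body text."],
     ["title", "author", "para"]),
    (["ANOTHER DOC", "by B. Writer", "More text.", "Even more text."],
     ["title", "author", "para", "para"]),
]


def observe(leaf, prev_tag):
    """Toy state: a crude shape of the current leaf plus the previous tag."""
    if leaf.isupper():
        shape = "upper"
    elif leaf.startswith("by "):
        shape = "byline"
    else:
        shape = "text"
    return (shape, prev_tag)


def greedy_tag(q, state):
    """Return the tag with the highest estimated value in this state."""
    return max(TAGS, key=lambda tag: q[(state, tag)])


def train(pairs, episodes=2000, alpha=0.5, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning; reward is 1 when the chosen tag matches the target."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        leaves, gold = rng.choice(pairs)
        prev = "START"
        for i, leaf in enumerate(leaves):
            state = observe(leaf, prev)
            # Epsilon-greedy exploration over the target tags.
            if rng.random() < epsilon:
                action = rng.choice(TAGS)
            else:
                action = greedy_tag(q, state)
            reward = 1.0 if action == gold[i] else 0.0
            # Bootstrap from the best value of the next state, if there is one.
            future = 0.0
            if i + 1 < len(leaves):
                nxt = observe(leaves[i + 1], action)
                future = max(q[(nxt, tag)] for tag in TAGS)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            prev = action
    return q


def map_document(q, leaves):
    """Greedy decoding: one decision per leaf, so a single linear pass."""
    prev, mapped = "START", []
    for leaf in leaves:
        tag = greedy_tag(q, observe(leaf, prev))
        mapped.append((tag, leaf))
        prev = tag
    return mapped


def to_xml(mapped):
    """Serialize the (tag, text) decisions as a flat XML document."""
    root = ET.Element("doc")
    for tag, text in mapped:
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    q = train(EXAMPLE_PAIRS)
    unseen = ["NEW REPORT", "by C. Writer", "First paragraph.", "Second paragraph."]
    print(to_xml(map_document(q, unseen)))
```

The learned table is only consulted greedily at mapping time, so the cost of mapping a document grows with the number of leaves times the (fixed) number of tags. A realistic mapper would need richer state features and actions that build nested structure, which this sketch deliberately leaves out.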

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Maes, F., Denoyer, L., Gallinari, P. (2011). Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_13

  • DOI: https://doi.org/10.1007/978-3-642-22613-7_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22612-0

  • Online ISBN: 978-3-642-22613-7

  • eBook Packages: Engineering, Engineering (R0)
