Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model

Maes, Francis; Denoyer, Ludovic; Gallinari, Patrick

doi:10.1007/978-3-642-22613-7_13

Part of the book series: Studies in Computational Intelligence (SCI, volume 370)

Abstract

We address the problem of learning to automatically map flat and semi-structured documents onto a mediated target XML schema. This problem is motivated by the recent development of applications for searching and mining semi-structured document sources and corpora. Academic research has mainly dealt with homogeneous collections. In practical applications, data come from multiple heterogeneous sources, and mining such collections requires defining a mapping or correspondence between the different document formats. Automating the design of such mappings has rapidly become a key issue for these applications. We propose a machine learning approach to this problem in which the mapping is learned from pairs of input documents and corresponding target documents provided by a user. The mapping process is formalized as a Markov Decision Process, and training is performed with Reinforcement Learning, a classical machine learning framework. The resulting model is able to cope with complex mappings while keeping linear complexity. We describe a set of experiments on several corpora representative of different mapping tasks and show that the method learns mappings with high accuracy on each of them.
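As a rough illustration of the idea in the abstract, the following minimal sketch casts a toy flat-to-XML mapping as a Markov Decision Process and trains it with tabular Q-learning. It is not the authors' model: the tag set `LABELS`, the helper `leaf_feature`, the per-leaf 0/1 reward, the choice of Q-learning and all hyperparameters are assumptions made for this example only. A state is the pair (feature of the current leaf, previously emitted tag), an action is the tag assigned to the current leaf, and greedy decoding visits each leaf once, so its cost grows linearly with the number of leaves.

```python
import random
from collections import defaultdict
from xml.sax.saxutils import escape

LABELS = ["title", "author", "para"]  # hypothetical target schema tags


def leaf_feature(text):
    """Crude hand-written description of one input leaf (an assumption)."""
    if text.isupper():
        return "upper"
    if text.lower().startswith("by "):
        return "byline"
    return "text"


def train(pairs, episodes=500, alpha=0.5, gamma=0.3, epsilon=0.3, seed=0):
    """Tabular Q-learning over (input leaves, target tags) training pairs."""
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(state, action)] -> estimated return
    for _ in range(episodes):
        leaves, targets = rng.choice(pairs)
        prev = None
        for i, text in enumerate(leaves):
            state = (leaf_feature(text), prev)
            if rng.random() < epsilon:  # explore
                action = rng.choice(LABELS)
            else:  # exploit the current estimate
                action = max(LABELS, key=lambda a: q[(state, a)])
            # Assumed reward: 1 when the emitted tag matches the target tag.
            reward = 1.0 if action == targets[i] else 0.0
            if i + 1 < len(leaves):
                nxt = (leaf_feature(leaves[i + 1]), action)
                future = max(q[(nxt, a)] for a in LABELS)
            else:
                future = 0.0  # end of document: terminal state
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)]
            )
            prev = action
    return q


def map_document(q, leaves):
    """Greedy decoding: one pass over the leaves, one tag per leaf."""
    prev, labels = None, []
    for text in leaves:
        state = (leaf_feature(text), prev)
        prev = max(LABELS, key=lambda a: q[(state, a)])
        labels.append(prev)
    return labels


def to_xml(leaves, labels):
    """Serialize the tagged leaves under a hypothetical <doc> root."""
    body = "".join(
        f"<{tag}>{escape(text)}</{tag}>" for tag, text in zip(labels, leaves)
    )
    return f"<doc>{body}</doc>"
```

A possible usage: call `train` on a list containing one pair such as `(["XML MAPPING", "by F. Maes", "We address the problem."], ["title", "author", "para"])`, then pass the resulting table and the leaves of an unseen flat document to `map_document`, and serialize the result with `to_xml`. The tabular state is only practical for a toy tag set; a realistic system would need richer features and a reward based on similarity between the predicted and target XML trees.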

Author information

Authors and Affiliations

LIP6, 104 Avenue du président Kennedy, 75016, Paris, France
Francis Maes, Ludovic Denoyer & Patrick Gallinari

Editor information

Editors and Affiliations

Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, Germany
Alexander Mehler
Institute of Cognitive Science, University of Osnabrück, Albrechtstr. 28, 49076, Osnabrück, Germany
Kai-Uwe Kühnberger
Angewandte Sprachwissenschaft und Computerlinguistik, Justus-Liebig-Universität Gießen, Otto-Behaghel-Straße 10D, 35394, Gießen, Germany
Henning Lobin & Harald Lüngen
Institut für deutsche Sprache und Literatur, Technical University Dortmund, Emil-Figge-Straße 50, 44227, Dortmund, Germany
Angelika Storrer
SFB 441 Linguistic Data Structures, Eberhard Karls Universität Tübingen, Nauklerstraße 35, 72074, Tübingen, Germany
Andreas Witt

Cite this chapter

Maes, F., Denoyer, L., Gallinari, P. (2011). Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_13

DOI: https://doi.org/10.1007/978-3-642-22613-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)
