Abstract
We address the problem of automatically learning to map flat and semi-structured documents onto a mediated target XML schema. This problem is motivated by the recent development of applications for searching and mining semi-structured document sources and corpora. Academic research has mainly dealt with homogeneous collections. In practical applications, however, data come from multiple heterogeneous sources, and mining such collections requires defining a mapping, or correspondence, between the different document formats. Automating the design of such mappings has rapidly become a key issue for these applications. We propose a machine learning approach in which the mapping is learned from user-provided pairs of input documents and their corresponding target documents. The mapping process is formalized as a Markov Decision Process, and training is performed with reinforcement learning, a classical machine learning framework. The resulting model copes with complex mappings while retaining linear complexity. We describe experiments on several corpora representative of different mapping tasks and show that the method learns these mappings with high accuracy.
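To make the formulation concrete, the following is a minimal sketch of a mapping cast as a sequential decision process. It is not the chapter's model: the tag set, the `feature` function, the toy training pair, the step size and discount, and the deterministic "try each action once, then act greedily" exploration rule are all assumptions chosen for illustration. Each input line is one decision step, the action is the target tag assigned to that line, and the reward is 1 when the tag agrees with the user-provided target document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ACTIONS = ["title", "author", "section", "para"]  # hypothetical target tags
START = "<start>"                                 # pseudo-label before line 0
ALPHA, GAMMA = 0.5, 0.4                           # illustrative step size, discount


def feature(line):
    """Crude hand-picked description of one input line (an assumption)."""
    if line.lower().startswith("by "):
        return "by"
    return "short" if len(line.split()) <= 6 else "long"


def train(pairs, epochs=50):
    """Tabular Q-learning over (line feature, previous label) states."""
    q = defaultdict(float)    # q[(state, action)] -> estimated return
    tried = defaultdict(set)  # actions already explored in each state
    for _ in range(epochs):
        for lines, targets in pairs:
            prev = START
            for i, line in enumerate(lines):
                state = (feature(line), prev)
                # Simplified exploration: untried actions first, then greedy.
                untried = [a for a in ACTIONS if a not in tried[state]]
                if untried:
                    action = untried[0]
                else:
                    action = max(ACTIONS, key=lambda a: q[state, a])
                tried[state].add(action)
                reward = 1.0 if action == targets[i] else 0.0
                if i + 1 < len(lines):
                    nxt = (feature(lines[i + 1]), action)
                    future = max(q[nxt, a] for a in ACTIONS)
                else:
                    future = 0.0  # end of document: no further reward
                q[state, action] += ALPHA * (reward + GAMMA * future - q[state, action])
                prev = action
    return q


def map_document(q, lines):
    """Greedy inference: one decision per line, a single left-to-right pass."""
    prev, labels = START, []
    for line in lines:
        state = (feature(line), prev)
        prev = max(ACTIONS, key=lambda a: q[state, a])
        labels.append(prev)
    return labels


def to_xml(lines, labels):
    """Serialize the labelled lines as a flat target XML document."""
    root = ET.Element("article")
    for label, line in zip(labels, lines):
        ET.SubElement(root, label).text = line
    return ET.tostring(root, encoding="unicode")


# One toy (input document, target labels) pair standing in for user-provided data.
PAIRS = [(
    ["Learning structure mappings",
     "By A. Writer",
     "Documents arrive from many heterogeneous sources and must be converted.",
     "Method",
     "Each line is labelled in turn, so inference takes one pass."],
    ["title", "author", "para", "section", "para"],
)]

q = train(PAIRS)
new_doc = ["Another toy report",
           "by B. Editor",
           "This first paragraph is deliberately longer than six words.",
           "Results",
           "This second paragraph is also longer than six words overall."]
labels = map_document(q, new_doc)
print(to_xml(new_doc, labels))
```

Because the state includes the previously emitted label, the same surface feature can map to different tags: a short line at the start becomes a title, while a short line after a paragraph becomes a section heading. Inference makes one greedy decision per line, which is one simple way a model of this kind can stay linear in the length of the input.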
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Maes, F., Denoyer, L., Gallinari, P. (2011). Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model. In: Mehler, A., Kühnberger, K.-U., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_13
DOI: https://doi.org/10.1007/978-3-642-22613-7_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)